easierscrape package

easierscrape module

class easierscrape.Scraper(url, download_path='easierscrape_downloads')

Bases: object

Class for a scraper that targets a specific url and downloads all files to a download_path relative to the current working directory. A Scraper object acts as a “one-stop shop” for all scraping functions.
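
For illustration, a minimal construction sketch (the URL and download path below are assumptions chosen for the example, not library defaults):

>>> from easierscrape import Scraper
>>> scraper = Scraper("https://example.com", download_path="my_downloads")

The method examples below reuse this scraper object.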

clear_downloads()

Deletes the Scraper download directory.

Returns:

True if the Scraper download directory exists and is deleted. False otherwise.

Return type:

bool
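
A usage sketch (reusing the scraper from above; the result is captured rather than shown, since it depends on whether the directory exists):

>>> deleted = scraper.clear_downloads()  # True if "my_downloads" existed and was removed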

get_screenshot()

Downloads a screenshot of the page at the Scraper url to the Scraper download directory.

Returns:

True

Return type:

bool
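
For example (a sketch; the screenshot filename is not specified by this entry):

>>> scraper.get_screenshot()  # saves a screenshot of the page under "my_downloads"
True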

parse_anchors()

Parses a list of anchor tags from the Scraper url.

Returns:

List of anchor tags found at the url.

Return type:

List[str]
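
For example (the returned links are illustrative):

>>> links = scraper.parse_anchors()  # e.g. ["https://example.com/about", ...]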

parse_files(filetypes=[])

Downloads provided filetypes from the Scraper url to the Scraper download directory.

Parameters:

filetypes (List[str]) – List of filetypes (“pdf”, “txt”, etc.) to scrape.

Returns:

List giving the number of files downloaded per filetype (e.g., if filetypes=[“pdf”, “txt”] and the return value is [1, 30], then 1 pdf file and 30 txt files were downloaded).

Return type:

List[int]
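
A usage sketch matching the example above:

>>> counts = scraper.parse_files(["pdf", "txt"])  # e.g. [1, 30]: 1 pdf and 30 txt files downloaded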

parse_images()

Downloads all images from the Scraper url to the Scraper download directory.

Returns:

Number of images downloaded from url.

Return type:

int
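
For example:

>>> n_images = scraper.parse_images()  # count of images saved to the download directory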

parse_lists()

Parses a list of lists from the Scraper url.

Returns:

List of the lists found at the url, each stored as a List of strings.

Return type:

List[List[str]]
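
For example (the list contents are illustrative):

>>> page_lists = scraper.parse_lists()  # e.g. [["first item", "second item"], ["a", "b"]]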

parse_tables(output_type='csv')

Downloads all tables from the Scraper url to the Scraper download directory.

Supported output types are csv and xlsx (defaults to csv).

  • If downloaded as a csv file, each table will be stored in a separate csv.

  • If downloaded as an xlsx file, all tables will be stored as separate sheets in a “tables.xlsx” file.

Parameters:

output_type (str) – The filetype to output to (defaults to csv).

Returns:

Number of tables downloaded from url.

Return type:

int
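
A usage sketch covering both output types:

>>> n_tables = scraper.parse_tables()                    # one csv file per table
>>> n_tables = scraper.parse_tables(output_type="xlsx")  # all tables as sheets in "tables.xlsx"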

parse_text()

Parses a list of text fragments from the Scraper url.

Returns:

List of text fragments found at the url.

Return type:

List[str]
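
For example (the fragments shown are illustrative):

>>> fragments = scraper.parse_text()  # e.g. ["Example Domain", "More information..."]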

print_tree(maxdepth, blacklist=[], whitelist=[])

Prints a tree of depth=maxdepth starting at the Scraper url. If the blacklist argument is used, none of the blacklisted domains will appear. If the whitelist argument is used, only the whitelisted domains will appear.

Parameters:
  • maxdepth (int) – The depth you want to print the tree to.

  • blacklist (List[str]) – A list of all domains to ignore in the tree generation.

  • whitelist (List[str]) – A list of the only domains to include in the tree generation.
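
For example (the depth and whitelist values are illustrative):

>>> scraper.print_tree(2, whitelist=["example.com"])  # depth-2 tree, example.com links only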

tree_gen(maxdepth, blacklist=[], whitelist=[])

Generates a tree of depth=maxdepth starting at the Scraper url. If the blacklist argument is used, none of the blacklisted domains will appear. If the whitelist argument is used, only the whitelisted domains will appear.

Parameters:
  • maxdepth (int) – The depth you want to generate the tree to.

  • blacklist (List[str]) – A list of all domains to ignore in the tree generation.

  • whitelist (List[str]) – A list of the only domains to include in the tree generation.

Returns:

Head node of an anytree hyperlink tree.

Return type:

Node
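
A traversal sketch using anytree's standard RenderTree helper (the depth and blacklist values are illustrative):

>>> from anytree import RenderTree
>>> head = scraper.tree_gen(2, blacklist=["twitter.com"])
>>> for pre, fill, node in RenderTree(head):
...     print(pre + node.name)  # one line per hyperlink in the tree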