Life hacks

Small, useful tricks

10 Open Source web scraping tools. The best open source web automation tools for 2022

05.09.2023 at 12:03

Bonus: if you like our content and this “Open Source Web Scraping Tools” guide, you can join our web browser automation Slack community.

The rise of Open Source Software (OSS) in recent years, especially after the establishment of GitHub as the de-facto platform for open source projects, brought many great development tools and libraries to a broad audience of developers who now benefit from them daily. With so many existing options, however, how can we decide what best suits our needs?

There are so many repositories that simply finding the best one for your project can be a large task in itself. Trying different alternatives and then deciding which to use can work, but it is time-consuming. In this article, we have taken care of the research for you! We will share some of the best Open Source libraries of 2022 for web automation and testing, based on specific criteria that guarantee a robust and productive development experience.

The methodology used to construct this list of open source web automation tools

Before we present the list of our top picks, let’s take some time to discuss the methodology that shaped our choices. As mentioned, GitHub is the most extensive host of open-source projects. It also provides excellent statistics regarding a project’s overall quality and social engagement that can support our conclusions. To make the list, a project should meet as many of the following requirements as possible (a sketch of how some of these signals can be checked through the GitHub API follows the list):

  • The project should be well maintained; the project’s maintainer(s) is/are responding to issues and integrating code contributions. In the best scenario, the project is actively developed as well, with maintainers regularly introducing new releases. 
  • Many active maintainers and collaborators work on the project.
  • The public API should be stable, so that future versions do not introduce breaking changes. 
  • The repository should be well structured, with a clear branch hierarchy.
  • The git commits should be atomic, with descriptive messages and references to specific issues.
  • JavaScript projects should be published on NPM and have a large number of monthly downloads. This signifies that people trust and use the project in production environments.
  • The project should provide clear documentation on how to operate the corresponding library. 
  • Institutions and individuals back the project, which signifies the importance of the project to the overall ecosystem. 
  • All the features and mechanics are thoroughly tested. 
  • A Continuous Integration pipeline is established to automate the integration of code changes.
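
Many of these signals can be gathered automatically. The sketch below queries the public GitHub REST API through the requests library; the repository used as an example and any thresholds you might apply to the returned numbers are arbitrary choices, not part of any official scoring scheme.

import requests


def repo_health(owner: str, repo: str) -> dict:
    """Collect a few health signals for a GitHub repository."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    # Unauthenticated requests are rate-limited by GitHub; pass a token for heavy use.
    info = requests.get(base, timeout=10).json()
    releases = requests.get(f"{base}/releases", params={"per_page": 5}, timeout=10).json()
    contributors = requests.get(f"{base}/contributors", params={"per_page": 100}, timeout=10).json()
    return {
        "stars": info.get("stargazers_count"),         # social engagement
        "open_issues": info.get("open_issues_count"),  # proxy for responsiveness to issues
        "last_push": info.get("pushed_at"),            # recent development activity
        "recent_releases": len(releases),              # regular releases
        "contributors_sampled": len(contributors),     # breadth of collaborators
    }


print(repo_health("scrapy", "scrapy"))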

Heritrix. Configuring Crawl Jobs

Basic Job Settings

Each crawl job has a crawler-beans.cxml file that contains the Spring configuration for the job.

Crawl Limits

In addition to limits imposed on the scope of the crawl, it is possible to enforce arbitrary limits on the duration and extent of the crawl with the following settings:

maxBytesDownload
Stop the crawl after a fixed number of bytes have been downloaded. Zero means unlimited.
maxDocumentDownload
Stop the crawl after downloading a fixed number of documents. Zero means unlimited.
maxTimeSeconds
Stop the crawl after a certain number of seconds have elapsed. Zero means unlimited. For reference there are 3600 seconds in an hour and 86400 seconds in a day.

To set these values, modify the CrawlLimitEnforcer bean.
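
A minimal sketch of how these limits might be set, assuming the stock profile’s bean id crawlLimiter and the simple property-override syntax shown later for the metadata settings; the property names follow the list above and the values are arbitrary examples (roughly 1 GB, 100,000 documents and one day):

crawlLimiter.maxBytesDownload=1000000000
crawlLimiter.maxDocumentDownload=100000
crawlLimiter.maxTimeSeconds=86400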

Note

These are not hard limits. Once one of these limits is hit, it will trigger a graceful termination of the crawl job. URIs already being crawled will be completed, so the set limit will be exceeded by some amount.

maxToeThreads

The maximum number of toe threads to run.

If running a domain crawl smaller than 100 hosts, a value approximately twice the number of hosts should be enough. Values larger than 150-200 are rarely worthwhile unless running on machines with exceptional resources.
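
Assuming the stock profile, where this property lives on the crawlController bean, it can be raised with the same simple property-override syntax (the value is only an example):

crawlController.maxToeThreads=50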

metadata.operatorContactUrl

The URI of the crawl initiator. This setting gives the administrator of a crawled host a URI to refer to in case of problems.

metadata.operatorContactUrl=http://www.archive.org
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults

Robots.txt Honoring Policy

The valid values of “robotsPolicyName” are:

obey
Obey robots.txt directives and nofollow robots meta tags
classic
Same as “obey”
robotsTxtOnly
Obey robots.txt directives but ignore robots meta tags
ignore
Ignore robots.txt directives and robots meta tags
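
The policy is selected through the robotsPolicyName setting; assuming the stock profiles, where it is a property of the metadata bean, it can be set with the same override syntax:

metadata.robotsPolicyName=obey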

Note

Heritrix currently only supports wildcards (*) at the end of paths in robots.txt rules.

The only supported value for robots meta tags is “nofollow” which will cause the HTML extractor to stop processing and ignore all links (including embeds like images and stylesheets). Heritrix does not support “rel=nofollow” on individual links.

Scrapy documentation. Scrapy at a glance

Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.

Walk-through of an example spider

In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes famous quotes from the website https://quotes.toscrape.com, following the pagination:
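
A minimal version of such a spider, modeled on the quotes example in the official Scrapy documentation (the CSS and XPath selectors assume the markup of quotes.toscrape.com):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Start from the "humor" tag page referenced in the walk-through below.
    start_urls = ["https://quotes.toscrape.com/tag/humor/"]

    def parse(self, response):
        # Each quote sits in a <div class="quote"> element.
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        # Follow the pagination link and reuse this callback for the next page.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)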

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -O quotes.jsonl

When this finishes you will have in the quotes.jsonl file a list of the quotes in JSON Lines format, containing text and author, looking like this:

{ "author" : "Jane Austen" , "text" : " \u201c The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid. \u201d " } { "author" : "Steve Martin" , "text" : " \u201c A day without sunshine is like, you know, night. \u201d " } { "author" : "Garrison Keillor" , "text" : " \u201c Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.

Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
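
These politeness controls map to ordinary Scrapy settings that could go in a project’s settings.py; a small sketch with arbitrary values:

# Politeness-related settings (values are examples, not recommendations)
DOWNLOAD_DELAY = 1.0                 # wait about a second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # let the AutoThrottle extension adjust delays automatically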

Note

This is using feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
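
As an illustration of the item pipeline route, here is a minimal sketch (the class name, file name and database schema are hypothetical) that stores the scraped quotes in a local SQLite database; it would be enabled through the project’s ITEM_PIPELINES setting:

import sqlite3


class SQLiteQuotesPipeline:
    # Hypothetical pipeline: Scrapy calls open_spider/process_item/close_spider for us.

    def open_spider(self, spider):
        # Open (or create) the database when the crawl starts.
        self.connection = sqlite3.connect("quotes.db")
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS quotes (author TEXT, text TEXT)"
        )

    def process_item(self, item, spider):
        # Called once for every item the spider yields; must return the item.
        self.connection.execute(
            "INSERT INTO quotes (author, text) VALUES (?, ?)",
            (item.get("author"), item.get("text")),
        )
        return item

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()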

Scrapy. Introducing Scrapy

A framework is a reusable, “semi-complete” application that can be specialized to produce custom applications. (Source: Johnson & Foote, 1988)

In other words, the Scrapy framework provides a set of Python scripts that contain most of the code required to use Python for web scraping. We only need to add the last bit of code that tells Python what pages to visit, what information to extract from those pages, and what to do with it. Scrapy also comes with a set of scripts to set up a new project and to control the scrapers that we will create.

It also means that Scrapy doesn’t work on its own. It requires a working Python installation (current Scrapy releases require Python 3) and a series of supporting libraries. If you haven’t installed Python or Scrapy on your machine, you can refer to the setup instructions. If you install Scrapy as suggested there, it should take care of installing all required libraries as well. You can check that the installation worked by typing

scrapy version

in a shell. If all is good, you should get back the installed Scrapy version, for example:

Scrapy 2.1.0

If you have a newer version, you should be fine as well.

To introduce the use of Scrapy, we will reuse the same example we used in the previous section. We will start by scraping a list of URLs from the list of faculty of the Psychological & Brain Sciences and then visit those URLs to scrape detailed information about those faculty members.