Introduction to Web Scraping Techniques and Tools. Web Scraping Techniques Explained

10.09.2023 at 14:53

Manual scraping is not a viable option: it is extremely time-consuming and inefficient. Instead of wasting a whole day copying and pasting in front of a screen, there are several ways to collect data at scale effectively.

Using Web Scraping Tools

Automatic scrapers offer a simpler and more accessible way for anyone to scrape websites, and here is why:

  • No coding required: Most web scraping tools are designed for users of any skill level, regardless of programming experience. To pull data from a website, you only need to point and click.
  • High efficiency: Collecting data with web scraping tools saves money, time, and resources. For example, you can collect 100,000 data points for less than $100.
  • Scalable data collection: You can scrape millions of pages as needed without worrying about infrastructure or network bandwidth.
  • Works on most websites: Many websites deploy anti-bot mechanisms to discourage scrapers from collecting data. Good scraping tools have built-in features to handle these anti-scraping techniques and deliver a seamless scraping experience.
  • Flexible and accessible: With a web scraping tool's cloud infrastructure, you can scrape data at any time, from anywhere.

Scrapy at a glance

Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.

Walk-through of an example spider

In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes famous quotes from the website https://quotes.toscrape.com, following the pagination:
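A minimal version of such a spider, written against https://quotes.toscrape.com, might look like the sketch below (the CSS class names div.quote, span.text, small.author and li.next are assumptions about that site's markup):

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # Start from the humor tag, as the walkthrough below mentions.
        start_urls = ["https://quotes.toscrape.com/tag/humor/"]

        def parse(self, response):
            # Each quote on the page sits in a div with class "quote".
            for quote in response.css("div.quote"):
                yield {
                    "author": quote.css("small.author::text").get(),
                    "text": quote.css("span.text::text").get(),
                }

            # Follow the pagination link and reuse parse() as the callback.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)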

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:
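Assuming Scrapy is installed (for example with pip install scrapy), that command would be:

    scrapy runspider quotes_spider.py -o quotes.jsonl

The -o option writes the scraped items to a feed file; with a recent Scrapy version the .jsonl extension selects the JSON Lines format described next.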

When this finishes, the quotes.jsonl file will contain a list of the quotes in JSON Lines format, with text and author, looking like this:

{ "author" : "Jane Austen" , "text" : " \u201c The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid. \u201d " } { "author" : "Steve Martin" , "text" : " \u201c A day without sunshine is like, you know, night. \u201d " } { "author" : "Garrison Keillor" , "text" : " \u201c Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.
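To get a feel for those selectors before running the whole spider, you can poke at a single page in Scrapy's interactive shell; the snippet below is only a sketch and assumes the same quotes.toscrape.com markup as the example spider above:

    # Launch an interactive session against one page:
    #   scrapy shell "https://quotes.toscrape.com/tag/humor/"
    # Then, at the prompt:
    for quote in response.css("div.quote"):               # one selector match per quote block
        text = quote.css("span.text::text").get()         # the quote text
        author = quote.css("small.author::text").get()    # the author name

    response.css("li.next a::attr(href)").get()           # relative URL of the next page, or None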

Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
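As a rough sketch, the politeness controls mentioned above correspond to Scrapy settings like the following (the values are arbitrary illustrations, not recommendations):

    # settings.py (excerpt)
    DOWNLOAD_DELAY = 1.0                   # pause about a second between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap parallel requests per domain
    CONCURRENT_REQUESTS_PER_IP = 0         # set non-zero to cap per IP instead of per domain
    AUTOTHROTTLE_ENABLED = True            # the AutoThrottle extension adjusts delays from observed latencies
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average number of parallel requests to aim for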

Note

This is using feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
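For instance, switching the export format is usually just a matter of changing the output file's extension (e.g. scrapy runspider quotes_spider.py -o quotes.csv), and a database-backed item pipeline can be sketched roughly like this (SQLite and the class name are illustrative assumptions, not part of the original example):

    import sqlite3


    class SQLitePipeline:
        def open_spider(self, spider):
            # Open (or create) a local database and make sure the table exists.
            self.conn = sqlite3.connect("quotes.db")
            self.conn.execute("CREATE TABLE IF NOT EXISTS quotes (author TEXT, text TEXT)")

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            # Store each scraped item as one row.
            self.conn.execute(
                "INSERT INTO quotes (author, text) VALUES (?, ?)",
                (item["author"], item["text"]),
            )
            return item

Such a pipeline is then enabled through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}, where the module path is a placeholder for wherever you put the class.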