Lifehacks

Small, useful tricks

The State of Web Scraping 2023. Challenges in Web Scraping

09.05.2023 at 15:25


Anyone with internet access can create a site, which has made the web a chaotic environment of ever-changing technologies and styles. Because of this, you'll have to deal with a few challenges while scraping:

  • Variety: The diversity of layouts, styles, content, and data structures online makes it impossible to write a single spider that scrapes it all. Each website is unique, so each web crawling script must be custom-built for its specific target.
  • Longevity: Scraping involves extracting data from the HTML elements of a website, so its logic depends on the site's structure. But web pages can change their structure and content without notice! That breaks your scrapers and forces you to adapt the data retrieval logic accordingly.
  • Scalability: As the amount of data to collect increases, the performance of your spider becomes a concern. However, several solutions can make your Python scraping process more scalable: you can use distributed systems, adopt parallel scraping, or optimize code performance (see the sketch after this list).
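To make the parallel-scraping option concrete, here is a minimal sketch that downloads several pages concurrently with a thread pool. The target URL pattern and the page count are assumptions chosen for the example, not part of the original article.

# Minimal parallel-scraping sketch; the URL pattern and page count are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(page_number):
    # Download one paginated page and return its HTML.
    url = f"https://quotes.toscrape.com/page/{page_number}/"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

# Fetch five pages in parallel instead of one after another.
with ThreadPoolExecutor(max_workers=5) as executor:
    html_pages = list(executor.map(fetch, range(1, 6)))

print(f"Downloaded {len(html_pages)} pages")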

Many sites also use anti-bot measures such as IP blocking, JavaScript challenges, and CAPTCHAs, which make data extraction less straightforward. Yet, you can bypass them using rotating proxies and headless browsers, for example. Or you can just use ZenRows to save you the hassle and easily get around them.
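As a rough illustration of the rotating-proxy idea, here is a hedged sketch that picks a random proxy from a pool for each request. The proxy addresses are placeholders and the helper function is invented for the example.

# Hedged sketch of rotating proxies with Requests; the proxy pool entries are placeholders.
import random

import requests

PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
]

def get_with_rotating_proxy(url):
    # Route each request through a different, randomly chosen proxy.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_rotating_proxy("https://quotes.toscrape.com/")
print(response.status_code)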

Apify blog. Apify Marketplace is now Apify Freelancers

Apify Marketplace, and the introduction of the offers process a few years ago, was an essential step in letting the smaller team we had at the time leverage its connections to developers all around the world. It gave those external developers a chance to take on smaller projects and deliver solutions, with Apify acting as the conduit between customer and developer, making sure that the whole process ran smoothly and that the solution was created to our high standards.

But from now on, Apify Marketplace is Apify Freelancers . We will continue to rely on Apify-approved external developers to deliver some solutions, but now Apify staff will be more involved in guiding customers in their choice. The customer no longer needs to make that decision before contacting us.

Our growing network of freelance developers is part of the thriving Apify Partners ecosystem, which collaborates with us to provide efficient and complex web scraping and automation solutions to our customers.

Apify is, and always has been, dedicated to making sure that developers and the open-source community can make a living from their work. Apify Freelancers gives talented coders the chance to get involved in bigger projects, while Apify Store gives them the opportunity to create actors that can be rented out on a monthly basis. If you’re a dev and you’re interested, please contact us.

Ai Web Scraping Python. Building a Scraper in Python

Now let's learn how to build a scraper in Python. The goal of this tutorial is to extract all the quote data from the Quotes to Scrape website. You'll learn how to extract the text, the author, and the list of tags for each quote.

But first, let's take a look at the target site. This is what the Quotes to Scrape web page looks like:

What Quotes to Scrape looks like

As you can see, Quotes to Scrape is nothing more than a sandbox for web scraping. The site contains a paginated list of quotes. The Python scraper you are going to build will extract all the quotes from each page and return them as CSV data.

Now it's time to figure out which Python scraping libraries are best suited for the task. As you can see in the Network tab of the Chrome DevTools window shown below, the target site makes no Fetch/XHR requests.

Note that the Fetch/XHR section is empty.

In other words, Quotes to Scrape does not use JavaScript to retrieve data for its web pages. This is the usual situation for most server-rendered sites. Since the target site doesn't use JavaScript to render the page or fetch data, you don't need Selenium to scrape it. You can use it, but it's not required.

As you've already learned, Selenium opens pages in a browser. Since that takes time and resources, Selenium introduces a performance overhead. You can avoid it by using Beautiful Soup together with Requests. Now let's learn how to build a simple Python web scraping script that extracts data from a site with Beautiful Soup.

Getting started

Before writing the first lines of code, you need to set up your Python scraping project. Technically, a single .py file is enough. However, using a full-featured IDE (integrated development environment) will make writing code easier. Here you'll learn how to set up a Python project in PyCharm, but any other IDE will do.

Open PyCharm and select "File > New Project…". In the "New Project" popup window, select "Pure Python" and create your project.

The PyCharm "New Project" popup

For example, you can name your project python-web-scraper. Click "Create", and you'll get access to your blank Python project. By default, PyCharm initializes a main.py file. You can rename it to scraper.py. This is what your project will look like now:

The blank python-web-scraper Python project in PyCharm

As you can see, PyCharm automatically initializes a Python file for you. Ignore its contents and delete every line of code so you can start from scratch.
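As a hedged preview of where the tutorial is heading, here is a minimal sketch of what scraper.py could look like: it walks the paginated Quotes to Scrape site with Requests, parses each page with Beautiful Soup, and writes the text, author, and tags of every quote to a CSV file. The CSS selectors are assumptions based on the sandbox's markup, not code taken from the original article.

# scraper.py - a minimal sketch, not the article's final code.
import csv

import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"
next_page = "/page/1/"
quotes = []

while next_page:
    # Download the current page and parse its HTML.
    html = requests.get(base_url + next_page, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for quote in soup.select(".quote"):
        quotes.append({
            "text": quote.select_one(".text").get_text(),
            "author": quote.select_one(".author").get_text(),
            "tags": ", ".join(tag.get_text() for tag in quote.select(".tag")),
        })
    # Follow the "Next" link until the last page is reached.
    next_link = soup.select_one("li.next > a")
    next_page = next_link["href"] if next_link else None

# Export the collected quotes as CSV data.
with open("quotes.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(quotes)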

Web Scraping blog. Scrape Blog Posts Fast with a Web Scraper

Speaking of building a blog fast, a web scraper for content curation comes to mind. Put simply, content curation is the act of scraping blog posts across the Internet, sorting through large numbers of them, and presenting the best posts in a meaningful, organized way.

A newly launched blog can grow very fast with the right strategy. One of the best strategies is content curation, because instead of creating content it shares existing content, which saves a lot of your time while still attracting audiences to your blog. However, finding the right content for your blog is not easy, and reading through everything on the Internet is not realistic. There is a better way I want to share with you.

With two steps, you will be able to find the best content for your blog.

 

Step 1. Find websites relevant to your blog.

Almost every website has a theme. Once you've settled on your own blog's theme, you can go looking for websites that are relevant to your blog and do well in the market. Mark these websites down on your memo list.

 

Step 2. Use Web Scraper Octoparse to scrape blogs for you

It’s time to discover the right content for your blog. For a newly developing blog, the content should be popular first and relevant second. In other words, weigh the content's popularity more heavily than its relevance to your blog; a connection through a few keywords is enough.

Therefore, when using Octoparse to do the extraction, the only thing you need to focus on is each article's views, ratings, and so on. Here is a set of data that I scraped from www.scoop.it with Octoparse; let's see what we can do with it. (Find out how to use Octoparse in Tutorials.)

The data shown above is what I exported from Octoparse. It shows the articles' total views, today's views, and titles. The first two pieces of information relate to the popularity of these articles.
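Once the export is on disk, ranking the candidates by popularity takes only a few lines. The sketch below is a hedged example: the file name and the column names ("Total views", "Title") are hypothetical and should be adjusted to whatever Octoparse actually exported.

# Hedged sketch: ranking exported articles by popularity.
# "scoopit_articles.csv" and the column names are hypothetical placeholders.
import csv

with open("scoopit_articles.csv", newline="", encoding="utf-8") as csv_file:
    articles = list(csv.DictReader(csv_file))

# Sort by total views, highest first, and show the ten most popular candidates.
articles.sort(key=lambda row: int(row["Total views"]), reverse=True)
for article in articles[:10]:
    print(article["Total views"], article["Title"])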

 

Playwright Scraping. FAQ

To wrap this article up, let's take a look at some frequently asked questions about web scraping using headless browsers that we couldn't quite fit into this article:

How can I tell whether it's a dynamic website?

The easiest way to determine whether any dynamic content is present on a web page is to disable javascript in your browser and see if data goes missing. Sometimes data might not be visible in the browser but is still present in the page source code - we can click "view page source" and look for it there. Often, dynamic data is located in javascript variables under HTML <script> tags. For more on that, see How to Scrape Hidden Web Data.
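As a hedged illustration of digging such hidden data out of the page source, the sketch below looks for a <script> tag and parses the JSON it contains. The URL, the "__NEXT_DATA__" id, and the data layout are assumptions made for the example.

# Hedged sketch: extracting hidden data embedded in a <script> tag.
# The URL and the "__NEXT_DATA__" id are illustrative assumptions.
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product/123", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

script_tag = soup.find("script", id="__NEXT_DATA__")
if script_tag:
    hidden_data = json.loads(script_tag.string)
    print(list(hidden_data.keys()))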

Should I parse HTML in the browser or do it in my scraper code?

While the browser has a very capable javascript environment, using HTML parsing libraries (such as beautifulsoup in Python) will generally result in faster and easier-to-maintain scraper code.
A popular scraping idiom is to wait for the dynamic data to load, then pull the whole rendered page source (HTML code) into the scraper code and parse the data there.
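Here is a hedged sketch of that idiom using Playwright's synchronous API: wait for the content to appear, grab the rendered HTML with page.content(), and hand it to Beautiful Soup. The target URL (the JavaScript-rendered variant of Quotes to Scrape) and the ".quote" selector are assumptions chosen for the example.

# Hedged sketch: render with Playwright, parse with Beautiful Soup.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")   # assumed JS-rendered target
    page.wait_for_selector(".quote")               # wait until the dynamic data is rendered
    rendered_html = page.content()                 # pull the full rendered page source
    browser.close()

# Parsing happens outside the browser, in plain scraper code.
soup = BeautifulSoup(rendered_html, "html.parser")
print(len(soup.select(".quote")), "quotes found")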

Can I scrape web applications or SPAs using browser automation?

Yes, web applications or Single Page Apps (SPA) function the same as any other dynamic website. Using browser automation toolkits we can click around, scroll and replicate all the user interactions a normal browser could do!

What are static page websites?

Static websites are essentially the opposite of dynamic websites - all the content is always present in the page source (HTML source code). However, static page websites can still use javascript to unpack or transform this data on page load, so browser automation can still be beneficial.

Can I scrape a javascript website with python without using browser automation?

When it comes to using Python to scrape dynamic content, we have two solutions: reverse engineer the website's behavior or use browser automation.

That being said, there's a lot of space in the middle for niche, creative solutions. For example, a common tool used in web scraping is Js2Py which can be used to execute javascript in python. Using this tool we can quickly replicate some key javascript functionality without the need to recreate it in Python.
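For a flavor of that middle ground, here is a hedged sketch that uses Js2Py to run a small piece of javascript from Python. The make_token function is an invented stand-in for the kind of client-side logic you might lift from a page's scripts, not code from any real website.

# Hedged sketch: executing javascript from Python with Js2Py.
import js2py

js_source = """
function make_token(user_id) {
    // toy stand-in for logic a site might compute client-side
    return "token-" + (user_id * 31 + 7);
}
"""

# eval_js compiles the javascript and returns the declared function as a Python callable.
make_token = js2py.eval_js(js_source)
print(make_token(42))  # -> token-1309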

What is a headless browser?

A headless browser is a browser instance without visible GUI elements. This means headless browsers can run on servers that have no displays. Headless chrome and headless firefox also run much faster compared to their headful counterparts, making them ideal for web scraping.

Playwright extract text. Web Scraping With Playwright: Tutorial for 2022

You most probably won’t get surprised if we tell you that in recent years, the internet and its impact have grown tremendously. This can be attributed to the growth of the technologies that help create more user-friendly applications. Moreover, there is more and more automation at every step – from the development to the testing of web applications.

Having good tools to test web applications is crucial. Libraries such as Playwright help speed up the process by opening the web application in a browser and automating user interactions such as clicking elements, typing text, and, of course, extracting public data from the web.

In this post, we’ll explain everything you need to know about Playwright and how it can be used for automation and even web scraping.

What is Playwright?

Playwright is a testing and automation framework that can automate web browser interactions. Simply put, you can write code that can open a browser. This means that all the web browser capabilities are available for use. The automation scripts can navigate to URLs, enter text, click buttons, extract text, etc. The most exciting feature of Playwright is that it can work with multiple pages at the same time, without getting blocked or having to wait for operations to complete in any of them.
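To illustrate the multi-page capability, here is a hedged sketch that uses Playwright's async API to load several pages in one browser at the same time; the URLs are placeholders picked for the example.

# Hedged sketch: driving several pages concurrently with Playwright's async API.
import asyncio

from playwright.async_api import async_playwright

async def grab_title(browser, url):
    # Each call gets its own page (tab) inside the same browser instance.
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [
            "https://example.com",
            "https://quotes.toscrape.com",
            "https://books.toscrape.com",
        ]
        # The pages load concurrently rather than one after another.
        titles = await asyncio.gather(*(grab_title(browser, url) for url in urls))
        print(titles)
        await browser.close()

asyncio.run(main())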

It supports most browsers, such as Google Chrome and Microsoft Edge through Chromium, as well as Firefox; Safari is supported via WebKit. In fact, cross-browser web automation is Playwright's strength: the same code can be efficiently executed against all of these browsers. Moreover, Playwright supports various programming languages such as Node.js, Python, Java, and .NET. You can write code that opens websites and interacts with them using any of these languages.

Playwright’s documentation is extensive. It covers everything from getting started to a detailed explanation of all the classes and methods.

Support for proxies in Playwright

Playwright supports the use of proxies. Before we explore this subject further, here is a quick code snippet showing how to start using a proxy with Chromium:

Node.js:

const { chromium } = require('playwright');

const browser = await chromium.launch();

Python:

from playwright.async_api import async_playwright

import asyncio

async with async_playwright() as p:
    browser = await p.chromium.launch()

This code needs only slight modifications to fully utilize proxies.

In the case of Node.js, the launch function can accept an optional parameter of the LaunchOptions type. This LaunchOptions object can, in turn, carry several other parameters, e.g., headless. The other parameter needed is proxy. This proxy is another object with properties such as server, username, password, etc. The first step is to create an object where these parameters can be specified.

// Node.js

const launchOptions = {
    proxy: {
        server: '123.123.123.123:80'
    },
    headless: false
}

The next step is to pass this object to the launch function:

const browser = await chromium.launch(launchOptions);

In the case of Python, it’s slightly different. There’s no need to create an object of LaunchOptions . Instead, all the values can be sent as separate parameters. Here’s how the proxy dictionary will be sent:

# Python

proxy_to_use = {
    'server': '123.123.123.123:80'
}

browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)

When deciding which proxy to use, it's best to use residential proxies, as they don't leave a footprint and won't trigger any security alarms. For example, our own Oxylabs’ Residential Proxies can help you with an extensive and stable proxy network. You can access proxies in a specific country, state, or even city. Most importantly, you can integrate them easily with Playwright as well.

Basic scraping with Playwright

Let’s move to another topic where we’ll cover how to get started with Playwright using Node.js and Python.