Can you imagine having an automatic tool that opens a Chrome browser for you, goes to whatever URL you decide, and fetches any data you can find on that page? Welcome to the future of front-end testing and web-scraping.
In this article we will review Puppeteer, one of the best open-source web scraping tools in use today. But before we start, let's clarify some basic terms.
Website scraping definition
In short, web scraping is the act of automatically surfing the web and fetching interesting pieces of information from it. Website scraping is becoming more and more popular these days, allowing companies in every industry to gain data in large quantities.
Web scraping allows you to gain many insights: Competitive analysis, price comparison, review analysis, customer sentiment analysis, keyword research, and more. It can also allow you to build a real-time solution for combining search results from multiple websites, such as Kayak or Skyscanner.
A web scraper can be built with a simple HTTP request library or with a headless browser. The simplest approach is sending an HTTP GET request to the website you are trying to get data from and parsing the data out of the response. The drawback is that such requests look very suspicious: they are missing a lot of the data a real user's browser would send, so any website that uses an anti-scraping system can easily block your scraping attempts.
This is where headless browsers save the day.
What are headless browsers?
Headless browsers are software-operated browsers that don't actually render the data they receive from the websites they visit, allowing them to crawl web pages quickly and efficiently. Headless browsers have many different configurations that need to be set and changed from request to request; otherwise they are often detected while scraping websites, due to their irregular behavior.
This is why you should choose your headless browser carefully. This article focuses on Puppeteer, which is widely used by the web scraping community.
What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API for controlling a headless Chrome or Chromium browser. There are endless tweaks and configurations you can play with when using it. The most common configuration, which is really a family of related settings, is its ability to change the browser fingerprint, allowing the bot to disguise itself as a particular kind of user. The most important note about this library is that it is maintained to the highest standards by Google, letting you enjoy a seamless, battle-tested library that is just fun to work with.
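As a minimal sketch (assuming the `puppeteer` package has been installed via npm), opening a page and grabbing its rendered HTML looks like this:

```javascript
// Minimal sketch: open a page in headless Chrome and fetch its HTML.
async function fetchPageHtml(url) {
  // Lazy require, so the helper can be defined even where puppeteer isn't installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' }); // wait for the page to settle
    return await page.content(); // full rendered HTML, after JavaScript has run
  } finally {
    await browser.close();
  }
}
```

Unlike a raw HTTP GET, `page.content()` returns the DOM after client-side JavaScript has executed, which is exactly what makes a real browser worth its overhead.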
What are browser fingerprints?
Just like we leave fingerprints when entering a room, web browsers leave them when they visit a website. Any website a browser visits (headless or not) can extract data from that visit and examine the browser's fingerprints. The problem is that these fingerprints are checked and identified at the 'door', which becomes an issue when your web crawler bot visits a specific website too often with the same fingerprint.
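Rotating a couple of fingerprint attributes per visit can be sketched with Puppeteer's `page.setUserAgent` and `page.setViewport`. The profile list and helper names below are illustrative assumptions, not a complete fingerprint:

```javascript
// Illustrative profiles only; a real fingerprint has many more attributes.
const PROFILES = [
  {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
  },
  {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
  },
];

// Pure helper: pick a profile for a given request number (round-robin).
function pickProfile(requestIndex) {
  return PROFILES[requestIndex % PROFILES.length];
}

// Apply a profile to a Puppeteer page before navigating.
async function applyProfile(page, profile) {
  await page.setUserAgent(profile.userAgent);
  await page.setViewport(profile.viewport);
}
```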
Choose the right IP address
The main way a site blocks a crawler is by blacklisting its IP address. As a human, you would not visit 1,000 Amazon pages an hour, would you? Fetching so many pages at once means you are either a troubled shopping addict or a web crawler.
This means that to build a successful website scraper, we need to choose the right pool of IP addresses and rotate through them constantly.
There are numerous companies that offer proxy services. We will not go into them in this blog post (come back for a future blog post about proxies and IP addresses). But choosing the right supplier for your IP addresses is crucial, and more than that, so is constantly changing the IP that requests to a specific website are sent from.
In our Web Scraping Platform, we use several different proxy providers and alternate between them, and so should you. It is also very important to test each IP address, since many residential addresses turn out to be invalid or too slow when you try to use them.
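A minimal round-robin rotation over a proxy pool might look like the following. The proxy addresses are placeholders, and `--proxy-server` is the standard Chromium command-line flag that Puppeteer passes through to the browser:

```javascript
// Placeholder proxy pool; in practice these come from your proxy provider(s).
const PROXIES = [
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080',
  'http://proxy-c.example:8080',
];

let nextProxy = 0;

// Pure helper: hand out proxies round-robin, wrapping back to the start.
function rotateProxy() {
  const proxy = PROXIES[nextProxy];
  nextProxy = (nextProxy + 1) % PROXIES.length;
  return proxy;
}

// Launch a browser that routes all its traffic through the chosen proxy.
async function launchWithProxy() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch({
    args: [`--proxy-server=${rotateProxy()}`], // Chromium flag, applies browser-wide
  });
}
```

Because the flag is set at launch, switching proxies means launching a fresh browser instance per batch of requests.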
So, How is Puppeteer used for web scraping?
Using Puppeteer for web scraping involves a few steps.
- Downloading and installing Puppeteer.
- Configuring your code to randomly change the browser fingerprint and IP address.
- Writing the crawler – the piece of software that sends Puppeteer to the website you want to scrape and collects the links that contain data that is valuable for you.
- Writing the scraper part – after you have the HTML of the page containing the data you need, a parsing operation is required to extract the data off the page.
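The crawler and scraper steps above can be sketched end to end. The `/product/` link pattern and the `h1` selector are assumptions for illustration; real sites need their own patterns and selectors:

```javascript
// Pure helper (crawler logic): keep only links that look like product pages.
// The /product/ pattern is an assumption for this sketch.
function filterProductLinks(links) {
  return links.filter((href) => /\/product\//.test(href));
}

// Crawl a listing page, collect interesting links, then scrape each one.
async function crawlAndScrape(startUrl) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(startUrl, { waitUntil: 'networkidle2' });

    // Crawler part: extract every link on the page.
    const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
    const productLinks = filterProductLinks(links);

    // Scraper part: parse the data out of each collected page.
    const results = [];
    for (const url of productLinks) {
      await page.goto(url, { waitUntil: 'networkidle2' });
      // The h1 selector is an assumed example of the data you might extract.
      results.push(await page.$eval('h1', (el) => el.textContent.trim()));
    }
    return results;
  } finally {
    await browser.close();
  }
}
```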
Like many simpler HTTP request libraries, Puppeteer can send its requests through a web proxy. This means we can use that ability to change this part of the browser's fingerprint from run to run.
Why not just use a simple library like requests, then? Or even Selenium? Well, this is where web scraping can get a little tricky.
Since some websites check a browser's fingerprint when serving data, they can easily identify these libraries and conclude that an automated crawler is being used as an online scraping tool.
But if I use a full-blown web browser, won't it be a heavy load on the machine running it? The short answer is yes, but it can be configured to run much lighter. This is where headless mode works in our favor, allowing us to run multiple instances of our web scraping bot, or even to create a cloud web scraper, with a minimum of resources.
Please note that Puppeteer cannot run in full (headful) mode inside Docker without extra setup. There is a lot of discussion about this issue, but it works in our favor, forcing us to use the more lightweight headless mode.
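As a sketch, a typical headless launch configuration for a container looks like the following. The exact flags you need depend on your Docker image, and `--no-sandbox` in particular trades away a security layer, so weigh it carefully:

```javascript
// Common launch options for running Puppeteer headless inside Docker.
const dockerLaunchOptions = {
  headless: true,
  args: [
    '--no-sandbox',              // often required in containers; security trade-off
    '--disable-setuid-sandbox',  // companion flag to --no-sandbox
    '--disable-dev-shm-usage',   // avoid Docker's small /dev/shm default
  ],
};

async function launchInDocker() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch(dockerLaunchOptions);
}
```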
If this sounds too complicated, there are many off-the-shelf tools and companies that offer web scraping tools and solutions.
Web scraping tools
Companies like Diffbot and Crawlera have been developing web scraping tools for a range of purposes, and many small scraper tools and pieces of scraping software (free or paid) are out there for you to use.
For most developers, we would recommend a more end-to-end solution such as a cloud web scraper, since many of the simpler web scraping programs are constantly blocked (or require expensive proxies) and are not properly maintained by their developers.
In conclusion, creating a scraper bot isn't an easy task, but it is definitely possible. There are various resources about how to crawl a website, or even how to create a web crawler tool yourself.
In upcoming posts I will talk about the differences between crawling and parsing, and about some of the challenges and solutions in web parsing.