Data scraping has become one of the most widely used techniques for collecting publicly available data, analyzing it, and using it to improve your business's performance. It is now a common way for businesses to extract SEO data and make informed decisions that help their business grow.
Since today's internet is dominated by search engines, which represent a huge source of new information, the easiest way to get data is to extract it from search engines such as Google, Bing, and Yahoo. This is where you might start running into some problems.
What is web scraping?
To put it in simple words, web scraping (also known as web harvesting, web data extraction, or data scraping) is the process of extracting data from online sources. In most cases, this technique involves a computer program that browses through different websites, or different pages of the same website, and extracts the data you are after. In this case, it's SEO data.
As you can imagine, this data can be very important to companies that need insight into various aspects of consumers' interests, demands, and so on. However, it can also raise legal and ethical concerns, and scraping Google SERPs (Search Engine Results Pages) is one of the most common cases where these concerns come up.
Wait, is it OK to scrape Google search results?
So, let's start with whether or not it is OK to scrape Google search results. The short answer is that it is neither clearly legal nor clearly illegal. It may go against Google's terms of service, and the company has definitely discouraged it, but there have been cases where Google, as well as other search engines, was caught doing it itself.
The thing about scraping is that it depends on what data you are extracting. Of course, there are laws that protect copyrighted content, but these laws apply to original works and cannot protect aggregated data. Data that search engines merely aggregate is not considered intellectual property. In other words, when you scrape search engines you are not breaking the law, but you might be violating the website's terms of service, which is another topic for discussion.
What results can you get from web scraping?
This process lets you extract quite a lot of information, actually. It can provide you with data such as statistics, consumer interests, search engine results, product reviews, prices, and more. Of course, consumers can benefit from this too, by using web scraping for price comparison (comparing prices of the same products across different online stores). And this is only scratching the surface of what web scraping and the collected data can do. There is much more to it, such as providing SEO keyword ranking data and showing how your website ranks on Google and other search engines, along with many more SEO-specific use cases, like:
Organic Keyword Ranking
As you can imagine, keywords have become the way searches are run on the modern internet. After all, how else are you going to find what you need (voice search aside, where the technology is still maturing)?
An entire science was developed around keywords, and by extracting data from a certain website and analyzing it, you can determine which keywords and title tags the website might be targeting. This also lets you deduce what is bringing traffic to the website or websites.
This is not only useful for companies that are analyzing markets, but also for bloggers, SEO professionals, and market research companies. By checking your SEO keyword ranking on targeted keywords, you can see whether or not your web page is visible to internet users who search for those keywords.
Check SERP Ranking
Obviously, website keyword ranking can be crucial for getting traffic and scoring a good rank on Google's SERPs (Search Engine Results Pages). There are multiple SEO keyword research tools and keyword checkers that can help you determine which keywords to use, such as KWfinder or SEMRUSH. Doing this properly will allow you to boost your Google SERP rank and bring more traffic to your site, making it busier and more profitable.
However, with all that said, it is also important to remember that some people and companies misuse web scraping. In some cases, businesses use it to gain an unfair advantage over the competition, and sometimes they even directly copy their competitors' content. Bing did this in 2011, when it was discovered to be presenting Google's search results as its own.
As a result, web scraping has drawn criticism and developed a bad reputation, which is why there are precautions you must take before you start your web scraping operation.
Keyword Search Volume
Keyword search volume, in layman's terms, is the number of searches for a specific keyword within a certain timeframe. It gives marketers a general sense of which keywords are being searched, how often, and when.
To understand how this works, think of Google Trends, a free online tool from Google that can be used to view and analyze any keyword that comes to mind. The tool creates a chart displaying keyword data, which you can then compare against other keywords to see which one gets more searches over time.
Choose the best proxy
It is very important to use a proxy service and to choose it correctly. When browsing a website, your browser sends many pieces of information to the site along with its requests for HTML pages. This information includes your IP address, system and browser language, screen resolution, operating system, and many more details that make up your browser fingerprint.
When you scrape a website, you send multiple requests to very similar resources, and it looks strange if the same IP address or user makes that many requests to the same website or resource. In a perfect scraping world, we could simply refrain from sending these pieces of information, but websites aren't so forgiving: a request without the usual user information will simply be blocked or treated as a bot.
Therefore, the only option is to send a broad range of combinations of IP addresses and browser configurations. Browser configurations can be set using Puppeteer or Selenium, but your IP address can't be manipulated so easily. The solution for changing your IP address is to use a proxy service.
The way it works is that your web browser or headless browser sends the web request to the proxy service. The proxy service then forwards the request to the resource you are trying to scrape, but since the traffic originates at the proxy service, the target sees the proxy service's IP address instead of yours. The response is then forwarded back to your computer, so to the browser it looks like a regular request.
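This flow can be sketched with Python's standard library; the proxy address below is a placeholder from the TEST-NET range, not a real endpoint, so substitute whatever your proxy provider gives you:

```python
import urllib.request

# Placeholder proxy address (TEST-NET range); substitute your provider's endpoint.
PROXY = "203.0.113.10:8080"

def build_opener_via_proxy(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes both HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

opener = build_opener_via_proxy(PROXY)
# The site being scraped sees the proxy's IP address, not yours:
# html = opener.open("https://example.com").read()
```

The actual fetch is left commented out since the address is a placeholder; with a real proxy, every request made through `opener` leaves from the proxy's IP.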
The things to consider are the geolocation of your proxy service and its speed and responsiveness. If you try to scrape with slow proxies, the requests will keep timing out and failing, or they will be blocked very quickly.
In other words, websites provide you with content, but they also record your browser fingerprint, and particularly your IP address. This allows them to determine your identity, your location, whether you should have access to their content (think geo-restrictions), and more.
They can also use your browser fingerprint to determine whether you are browsing as a human user or using web scraping software. This is why you need to mix things up a bit and use the best proxies for scraping, so that your browser fingerprint doesn't look suspicious and potentially get your IP address blocked.
This leads us to our next point: when consistently scraping a website like Google Search, you will start getting blocked after some number of attempts. This is because websites use anti-bot defenses, the most common of which rely on IP address comparison and analytics.
Basically, you cannot rely on a single IP address, even if it is not your own. This means that even when you use a proxy, you need multiple IP addresses to avoid getting blocked. If you don't use a proxy while web scraping, your IP will be blocked very quickly. If you do use a proxy but only a single IP address, that address will get blocked as well, and you will have to move on to the next one after only a few minutes.
The solution is simple: set up a pool of IP addresses and let the proxy rotate through them freely for each request. The best proxies you can use are premium ones, since they often provide advanced features, good speeds, and other advantages. That way, your scraping can go unnoticed, and you can collect the data you need.
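A minimal round-robin rotation over such a pool might look like this in Python (the addresses are placeholders from the TEST-NET range):

```python
from itertools import cycle

# Placeholder proxy addresses (TEST-NET range); replace with your provider's pool.
PROXY_POOL = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]

_rotation = cycle(PROXY_POOL)

def next_proxy() -> str:
    """Return the next proxy in the pool, wrapping around when it is exhausted."""
    return next(_rotation)
```

Each request then calls `next_proxy()`, so consecutive requests leave from different addresses; many premium providers handle this rotation for you on their side.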
Residential vs. Datacenter Proxies
There are two types of proxies to keep in mind: residential proxies and datacenter proxies. A residential proxy uses a real IP address that comes from an internet service provider, while datacenter proxies are provided by third-party server farms. When most people think of a proxy, they are really thinking of a datacenter proxy, and that is the type you generally do not want to use for web scraping, since such addresses are easier for websites to detect and block.
Set the right user agent
Another thing to keep in mind is setting a real user agent, as websites do tend to examine User-Agent headers and block requests made with values that don't belong to any major browser. This is a common oversight, and often what gets data scrapers blocked. Advanced users may want to consider setting their User-Agent to the Googlebot User-Agent; since pretty much every website wants to show up on Google, many will let Googlebot in without issues.
Whichever you end up using, make sure to keep your User-Agent strings up to date, and you should be fine.
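One simple way to do this in Python is to keep a small pool of realistic User-Agent strings and pick one per request. The version strings below are illustrative and will age, so refresh them periodically with current browser releases:

```python
import random

# Illustrative desktop User-Agent strings; keep these current with real browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def request_headers() -> dict:
    """Build request headers with a randomly chosen, realistic User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Passing `request_headers()` with each outgoing request makes successive requests look like they come from different browsers rather than one script.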
Use a headless browser
Now, when it comes to advanced web scraping, one of the best things to do is to use a headless browser, which is a web browser without a graphical user interface.
In other words, to trick these defenses successfully, you will have to use a headless browser, or simply use a Scraper API to do all the work for you. There are two tools we can recommend for writing a program that controls the browser the way a real person would:
Puppeteer is a Node.js library that provides a high-level API for controlling a headless browser (Chrome or Chromium). It is capable of most of the things you can do manually, such as taking screenshots, generating PDFs, automating form submissions, testing extensions, and much more. It is an impressive tool and should definitely do the trick.
As for Selenium, it is an open-source automated testing tool that works across many different browsers and platforms. Its focus is on automating web-based applications, and it operates not as a single tool but as a suite of software.
Modern SEO Scraping
How would it feel to have all of the troubles of web scraping gone and to receive all the data you need through a simple API that fits your exact needs? We have developed a SaaS solution for SEO scraping that gives you full control over what data you get and when, while saving you the time of writing code. The data is returned in JSON, CSV, or XML format, and you don't need to think about configurations, proxy servers, or hiring extra software engineers to build scrapers.
Don’t forget to check out our Blog for more articles like this one!