According to the article Search Engine Scraping, Google uses a variety of defensive methods that make web scraping a difficult task:
- Google checks the "User-Agent" header of HTTP requests. Some User-Agent strings are blocked by default: if you request a URL with one of them, Google rejects the request and shows only a blank page (see the sketch after this list).
- Google uses a complex request-limitation system that depends on language, country, user agent and search terms.
- Search engines are not easily fooled by simply changing the IP address.
- Google uses sophisticated user-behavior analysis: JavaScript code tracks how the mouse moves across the screen and where clicks happen, and "deep learning" techniques are used to detect suspicious behavior.
- Google changes the HTML code it returns from one request to the next, to hinder scraping.
- According to Matthew Lee, if you try to access Google through Tor you will hit CAPTCHAs on practically every Google page, which makes Tor unviable for scraping.
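To illustrate the User-Agent check from the first point, here is a rough sketch using the requests library. The URL, query and header values are only illustrative, and what Google actually returns to a blocked client (blank page, HTTP 429, consent page or CAPTCHA) varies by region and over time.

# Sketch: compare Google's response to the default client User-Agent
# versus a browser-like one (the header string is only an example).
import requests

URL = "https://www.google.com/search"
PARAMS = {"q": "web scraping"}

# Default requests User-Agent ("python-requests/x.y") -- the kind of
# client string that tends to be rejected or answered with a blank page.
plain = requests.get(URL, params=PARAMS, timeout=10)

# Browser-like headers.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
browser = requests.get(URL, params=PARAMS, headers=browser_headers, timeout=10)

print("default UA:", plain.status_code, len(plain.text), "bytes")
print("browser UA:", browser.status_code, len(browser.text), "bytes")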
I also verified, by running some tests, that the HTML code of a Google search result differs depending on the country of origin of the request: when the request comes from Brazil, Google returns one HTML structure; when it comes from another country, it returns a different one, even though the browser renders a very similar page. In other words, a scraping program that works when sending requests to Google from a Brazilian IP will not work if you use a proxy from another country.
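A rough way to observe this yourself is to request the same query with different hl (interface language) and gl (country) parameters, which are standard Google search parameters, and compare the HTML. This only approximates true country-of-origin differences, which would require proxies located in those countries; everything else in the snippet is an illustrative assumption.

# Sketch: same query, different locale parameters, compare the HTML.
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder browser-like UA

def fetch(hl, gl):
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": "web scraping", "hl": hl, "gl": gl},
        headers=headers,
        timeout=10,
    )
    return resp.text

html_br = fetch("pt-BR", "br")   # Brazilian Portuguese, Brazil
html_us = fetch("en", "us")      # English, United States

print(len(html_br), len(html_us))        # sizes usually differ
print(html_br[:500] == html_us[:500])    # markup differs as well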
When strange behavior is detected, there are several possible reactions:
- The first defense is a 'CAPTCHA' page, where the user must prove they are not a bot. Solving the CAPTCHA sets a cookie in your browser that allows access again. After about a day the 'CAPTCHA' requirement is removed and you can browse normally.
- The second defense is an error page, shown when your IP address lands on the blacklist. To resolve this, you either wait for your IP to be unblocked or change your IP address.
- The third defense is a permanent block of the network segment. Google has already blocked entire network segments for months; this only happens when a scraping tool sends a very large number of requests from that network.
Techniques for Performing Google Scraping
The more terms the user needs to search for in a shorter time, the harder the scraping work becomes. Scraping scripts must overcome the following challenges (a combined sketch follows the list):
- Rotate IP addresses using proxies (addresses that are not on blacklists; unfortunately, most of the proxies I tested just returned the CAPTCHA page).
- Manage the timing of requests: requests sent at a fixed interval are easily detected, and many requests in a short period put your IP on the blacklist. Ideally, generate a random interval of a few seconds between requests.
- Send the URL, cookies and HTTP headers correctly, emulating a typical browser.
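Here is a minimal sketch combining the three points above: proxy rotation, randomized delays and browser-like headers. The proxy addresses, queries and header strings are placeholders, not working values.

# Sketch: proxy rotation + random delay + browser-like headers.
import random
import time
import requests

PROXIES = [  # hypothetical proxy pool; replace with real, non-blacklisted proxies
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
}

def scrape(queries):
    for query in queries:
        proxy = random.choice(PROXIES)        # rotate the exit IP per request
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": query},
            headers=HEADERS,
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        yield query, resp.status_code, len(resp.text)
        time.sleep(random.uniform(3, 10))     # random pause, never a fixed interval

for result in scrape(["web scraping", "search engine scraping"]):
    print(result)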
Solutions
Use a scraping tool that applies all of the techniques mentioned above, such as GoogleScraper (a Python library). This framework controls a real browser, which makes it harder for Google to detect that the browsing is automated.
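To give an idea of the browser-controlling approach (this is a generic Selenium sketch, not GoogleScraper's own API), something like the snippet below drives a real Chrome window. It assumes Chrome and a matching driver are installed, and the search-box name "q" and the result selector are simply the values Google uses today, which can change.

# Sketch of driving a real browser with Selenium (illustrative only).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()                  # needs Chrome + driver available
driver.get("https://www.google.com")

box = driver.find_element(By.NAME, "q")      # Google's search input is named "q"
box.send_keys("web scraping")
box.send_keys(Keys.RETURN)
time.sleep(3)                                # crude wait for results to render

# Result titles; the CSS selector is an assumption and may break at any time.
for title in driver.find_elements(By.CSS_SELECTOR, "a h3"):
    print(title.text)

driver.quit()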
I also ran some tests with a Python library called googlesearch. It works well for performing Google searches and inserts a pause between queries and page changes, but it does not perform image or news searches.
Code example:
# Get the first 20 results for the query '"Breaking Code" WordPress blog'
from googlesearch import search

# pause sets the delay (in seconds) between HTTP requests, to avoid blocking
for url in search('"Breaking Code" WordPress blog', stop=20, pause=2.0):
    print(url)
Sorry for such a long reply, but I thought the explanation was necessary. I hope this helps.
Thank you so much for the answer. I had been studying this and needed a second opinion, and it really is difficult to deal with Google, but I found a private API called "app.zenserp" that solved this impasse. Thanks for the kind words, friend!
– stack.cardoso