According to the article Search Engine Scraping, Google uses a variety of defensive methods that make web scraping a difficult task:
- Google checks the "User-Agent" header of HTTP requests. Some User-Agent strings are blocked by default: if you request a URL with one of them, Google rejects the request and shows only a blank page (see the sketch after this list).
- Google uses a complex request-limitation system that depends on language, country, user agent and search terms.
- Search engines are not easily fooled by simply changing the IP address.
- Google uses sophisticated user-behavior analysis: JavaScript code tracks how the mouse moves across the screen and where clicks happen, and "deep learning" techniques are used to detect suspicious behavior.
- Google changes the HTML code it returns from one request to the next, to hinder scraping.
- According to Matthew Lee, if you try to access Google through Tor you will hit CAPTCHAs on practically every Google page, which makes Tor unviable for scraping.
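To illustrate the User-Agent check from the first point, here is a rough sketch using the requests library. The URL, query and header values are only illustrative, and what Google actually returns to a blocked client (blank page, HTTP 429, consent page or CAPTCHA) varies by region and over time.

# Sketch: compare Google's response to the default client User-Agent
# versus a browser-like one (the header string is only an example).
import requests

URL = "https://www.google.com/search"
PARAMS = {"q": "web scraping"}

# Default requests User-Agent ("python-requests/x.y") -- the kind of
# client string that tends to be rejected or answered with a blank page.
plain = requests.get(URL, params=PARAMS, timeout=10)

# Browser-like headers.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
browser = requests.get(URL, params=PARAMS, headers=browser_headers, timeout=10)

print("default UA:", plain.status_code, len(plain.text), "bytes")
print("browser UA:", browser.status_code, len(browser.text), "bytes")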
I also verified, by running some tests, that the HTML code of a Google search result differs depending on the country of origin of the request: when the request comes from Brazil, Google returns one HTML structure; when it comes from another country, it returns a different one, even though the browser renders a very similar page. In other words, a scraping program that works when sending requests to Google from a Brazilian IP will not work if you use a proxy from another country.
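A rough way to observe this yourself is to request the same query with different hl (interface language) and gl (country) parameters, which are standard Google search parameters, and compare the HTML. This only approximates true country-of-origin differences, which would require proxies located in those countries; everything else in the snippet is an illustrative assumption.

# Sketch: same query, different locale parameters, compare the HTML.
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder browser-like UA

def fetch(hl, gl):
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": "web scraping", "hl": hl, "gl": gl},
        headers=headers,
        timeout=10,
    )
    return resp.text

html_br = fetch("pt-BR", "br")   # Brazilian Portuguese, Brazil
html_us = fetch("en", "us")      # English, United States

print(len(html_br), len(html_us))        # sizes usually differ
print(html_br[:500] == html_us[:500])    # markup differs as well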
When strange behavior is detected, there are several possible reactions:
- The first defense is a 'CAPTCHA' page, where the user must prove they are not a bot. Solving the CAPTCHA sets a cookie in your browser that allows access again. After about a day the 'CAPTCHA' requirement is removed and you can browse normally.
- The second defense is an error page, shown when your IP address lands on the blacklist. To resolve this, you either wait for your IP to be unblocked or change your IP address.
- The third defense is a permanent block of the network segment. Google has already blocked entire network segments for months; this only happens when a scraping tool sends a very large number of requests from that network.
Techniques for Performing Google Scraping
The more terms the user needs to search for in a shorter time, the harder the scraping work becomes. Scraping scripts must overcome the following challenges (a combined sketch follows the list):
- Rotate IP addresses using proxies (addresses that are not on blacklists; unfortunately, most of the proxies I tested just returned the CAPTCHA page).
- Manage the timing of requests: requests sent at a fixed interval are easily detected, and many requests in a short period put your IP on the blacklist. Ideally, generate a random interval of a few seconds between requests.
- Send the URL, cookies and HTTP headers correctly, emulating a typical browser.
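Here is a minimal sketch combining the three points above: proxy rotation, randomized delays and browser-like headers. The proxy addresses, queries and header strings are placeholders, not working values.

# Sketch: proxy rotation + random delay + browser-like headers.
import random
import time
import requests

PROXIES = [  # hypothetical proxy pool; replace with real, non-blacklisted proxies
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
}

def scrape(queries):
    for query in queries:
        proxy = random.choice(PROXIES)        # rotate the exit IP per request
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": query},
            headers=HEADERS,
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        yield query, resp.status_code, len(resp.text)
        time.sleep(random.uniform(3, 10))     # random pause, never a fixed interval

for result in scrape(["web scraping", "search engine scraping"]):
    print(result)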
Solutions
Use a scraping tool that applies all of the techniques mentioned above, such as GoogleScraper (a Python library). This framework controls a real browser, which makes it harder for Google to detect that the browsing is automated.
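To give an idea of the browser-controlling approach (this is a generic Selenium sketch, not GoogleScraper's own API), something like the snippet below drives a real Chrome window. It assumes Chrome and a matching driver are installed, and the search-box name "q" and the result selector are simply the values Google uses today, which can change.

# Sketch of driving a real browser with Selenium (illustrative only).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()                  # needs Chrome + driver available
driver.get("https://www.google.com")

box = driver.find_element(By.NAME, "q")      # Google's search input is named "q"
box.send_keys("web scraping")
box.send_keys(Keys.RETURN)
time.sleep(3)                                # crude wait for results to render

# Result titles; the CSS selector is an assumption and may break at any time.
for title in driver.find_elements(By.CSS_SELECTOR, "a h3"):
    print(title.text)

driver.quit()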
I also ran some tests with a Python library called googlesearch. It works well for performing Google searches and inserts a pause between queries and page changes, but it does not perform image or news searches.
Code example:
# Get the first 20 results for the query '"Breaking Code" WordPress blog'
from googlesearch import search

# pause sets the delay (in seconds) between HTTP requests, to avoid blocking
for url in search('"Breaking Code" WordPress blog', stop=20, pause=2.0):
    print(url)
Sorry for such a long reply, but I thought the explanation was necessary. I hope this helps.
Thank you so much for the answer. I had been studying this and needed a second opinion, and it really is difficult to deal with Google, but I found a private API called "app.zenserp" that solved this impasse. Thanks for the kind words, friend!
– stack.cardoso