Data not retrieved with Selenium webdriver


I want to get some data by city (there are over 5,000 Brazilian municipalities) from the IBGE site.

Example URLs: https://cidades.ibge.gov.br/brasil/ac/rio-branco/pesquisa/23/27652 or https://cidades.ibge.gov.br/brasil/ac/rio-branco/pesquisa/23/27652?detalhes=true

driver = webdriver.Chrome()
driver.get('https://cidades.ibge.gov.br/brasil/es/atilio-vivacqua/pesquisa/23/27652')
soup = BeautifulSoup(driver.page_source, 'html.parser')
div_conteudo = soup.find('div', class_='conteudo')

I want to extract the values by sex, among others, and the table containing those values is precisely what is not returned. This part has an onEmpty event, which is the only thing I can see that might be getting in the way.

I’ll include part of what is returned.

(screenshot of the returned HTML omitted)

In the page’s source code it appears like this: (screenshot of the page source omitted)

After the closing tag </pesquisa-header> and before the closing tag </pesquisa>, a <pesquisa-tabela> tag should appear, as in the second image, but it does not show up in what is returned.

1 answer



Well, as you may already know, when accessing a URL the browser loads the HTML code, which is text, and generates a tree of objects in memory, the famous DOM.

The problem is that on the IBGE page you are trying to access, after the DOM is created in memory, JavaScript code runs and modifies objects directly in that tree.

That means the elements you’re trying to read are not in the page’s source code; only after the page loads and the JavaScript executes are those elements created, dynamically.

BeautifulSoup tries to do the same thing as a browser: it reads the HTML source text and generates an object tree in Python memory, but with one difference: since BeautifulSoup does not execute JavaScript, those dynamic elements will not be present. You will only have access to the data that came in the page’s source code.

So instead of retrieving the page’s source code, parsing it with BeautifulSoup, and generating a DOM in Python, the correct way to use Selenium is to access the DOM that is already in the browser’s memory, because the browser runs the JavaScript and has the elements you want.

To do this, Selenium has functions that access the browser’s DOM directly. The code below should return all pesquisa-tabela tags:

tags = driver.find_elements_by_tag_name('pesquisa-tabela')

This also avoids parsing the HTML twice, since you won’t need to run BeautifulSoup at all; the browser has already done that job.

Another way to get the content after the JavaScript runs is to inject JavaScript code into the browser:

codigo_html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

Note: both of these approaches may fail if your script runs too fast and the JavaScript has not executed yet. You may have to wait for the JavaScript to finish, using one of Selenium’s explicit wait functions (WebDriverWait) or even time.sleep().
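To illustrate the waiting idea, here is a minimal polling sketch. It is not Selenium’s own WebDriverWait, just the same pattern written by hand; the wait_for helper name and the commented driver usage are hypothetical:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    This mimics what an explicit wait does: repeatedly evaluate a
    condition, sleeping between attempts, and give up after a deadline.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError('condition not met within %.1f s' % timeout)

# With a real driver, usage would be something like:
# tags = wait_for(lambda: driver.find_elements_by_tag_name('pesquisa-tabela'))
```

In practice, prefer Selenium’s WebDriverWait with an expected condition, which does this polling for you.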

However, in the case of the IBGE site, you are in luck. All of this is unnecessary, because IBGE has an API that is open to the public! https://servicodados.ibge.gov.br/api/docs/pesquisas?versao=1

Using the API is much better, as it is a stable, reliable way of obtaining the data, supported by IBGE itself. Some advantages:

  • No need to open a browser
  • No need to extract data from between tags, since the API already returns everything as JSON
  • No worrying about page layout changes, which would break your script and require maintenance
  • Less memory use
  • Less network traffic, because you don’t download unnecessary images and files; you go straight to the information

For example, to get the data from the page you listed above (Rio Branco, 2010 Census), it would be:

import requests

# Base URLs of the IBGE surveys API
api_pesquisas = 'https://servicodados.ibge.gov.br/api/v1/pesquisas/{pesquisa}/'
url_titulos = api_pesquisas + 'periodos/all/indicadores/{posicao}?scope=sub&lang=pt'
url_dados = api_pesquisas + 'indicadores/{posicao}/resultados/{localidades}?scope=sub&lang=pt'

json_titulos = requests.get(url_titulos.format(pesquisa=23, posicao=1)).json()

# Flatten the indicator tree into a dict: id -> indicator title
titulo = {}
while json_titulos:
    l = json_titulos.pop()
    titulo[l['id']] = l['indicador']
    json_titulos[:0] = l['children']  # queue the children for processing

dados = requests.get(url_dados.format(
    pesquisa=23, posicao=1, localidades=120040)).json()

for info in dados:
    print(titulo[info['id']], '-->', info['res'][0]['res']['2010'])

The result:

Coletivos --> 79
Com morador --> 39
Sem morador --> 40
Ocupados --> 94397
Com entrevista realizada --> 90598
Sem entrevista realizada --> 3799
Não ocupados --> 12699
Uso ocasional --> 2322
Vagos --> 10377
Recenseados --> 107175
População residente --> 336038
Masculino --> 163592
Menos de 1 ano de idade --> 2889
1 a 4 anos de idade --> 12547
5 a 9 anos --> 16319
10 a 14 anos de idade --> 17935
...
Urbana --> 308545
Rural --> 27493
Média de moradores em domicílios particulares ocupados --> 3.54
  • I’m trying to use this API, but it’s hard. Most of the time I get a 500 response. Anyway, I managed to retrieve what I wanted with Selenium itself. I think I was just using time.sleep() in the wrong place. I’ve made so many changes I can’t remember anymore.

  • @Samrds since the IBGE page also uses this API, by calling the API directly you are only skipping one step; if you are receiving error 500, you would get the same error through Selenium as well. Maybe because the API is much faster to use, you are sending too many requests and the server is throttling you. Try putting a time.sleep() in the API version as well.
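A minimal sketch of that suggestion: pause between attempts and retry on failure. The fetch_with_retry helper and its parameters are hypothetical, not part of the IBGE API or of requests:

```python
import time

def fetch_with_retry(fetch, retries=3, pause=1.0):
    """Call `fetch` until it succeeds, pausing between attempts.

    `fetch` should raise an exception on failure (e.g. on an HTTP 500).
    Returns the first successful result; re-raises after `retries` failures.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(pause)  # give the server a breather between requests

# With requests it could be used like:
# dados = fetch_with_retry(
#     lambda: requests.get(url).json(), retries=5, pause=2.0)
```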
