What is the best way to scrape the Datasus website in Python?

The link is this: http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def

I’m trying to send a POST through requests with a dictionary containing the categories I want, but the URL remains static, so I can't tell what was actually requested.

Do you think Selenium would be better suited for this? Has anyone done something similar?

  • Dude, I've already built a data-scraping implementation for this; I used Scrapy (Python). That's my suggestion.

1 answer


I do not recommend using Selenium. As Sasa Buklijas argues in "Do not use Selenium for Web Scraping", Selenium is not a specialized web-scraping tool (web scraping being a data-extraction technique used to collect data from websites) but a tool for automated testing of web applications. Tools such as Scrapy or Beautiful Soup + Requests are recommended instead.

For the Datasus site in particular, I find Selenium a poor fit: the site has many checkboxes, which produces a huge number of combinations to cover in order to download all of the content. Doing that through Selenium would be very laborious, and there are better tools for the purpose.

I've done something similar to fetch all the ENEM results using a Bash script and curl, following these steps:

  1. Use Google Chrome.
  2. Open the Datasus website.
  3. Right-click and select "Inspect"; the developer tools will open on the right side of the browser.
  4. Click the "Network" tab in the developer tools.
  5. On the Datasus page on the left, select your options and click the "Show" button that appears on the site.
  6. In the developer tools, right-click the request sent to the server, named "tabcgi.exe?sih/cnv/nrbr.def", and select Copy -> Copy as...

Pick the variant for wherever you will run the script: on Linux select "Copy as cURL (bash)"; on Windows, "Copy as cURL (cmd)".

With the copied curl command, just paste it into a Linux Bash shell and it will make the same request the browser did. You can then modify the parameters of the curl request to fetch other information from the site.
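If you would rather reproduce the captured request from Python (as the question asks) than from Bash, the sketch below builds the payload with the standard library and posts it with requests. Only a small subset of the form fields from the --data string is shown, and the Latin-1 percent-encoding is inferred from the %E3/%FA escapes in the captured request, so treat both as assumptions to verify against the page's form.

```python
from urllib.parse import quote

# Endpoint from the question
URL = "http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def"


def encode_form(fields):
    """Percent-encode form fields as Latin-1 (ISO-8859-1), matching the
    %E3/%FA style escapes seen in the browser's captured --data string."""
    return "&".join(
        quote(k, safe="", encoding="latin-1")
        + "="
        + quote(v, safe="", encoding="latin-1")
        for k, v in fields.items()
    )


# A subset of the fields from the captured request (verify the full
# list against the form on the page itself)
fields = {
    "Linha": "Macrorregião_de_Saúde",
    "Coluna": "--Não-Ativa--",
    "Incremento": "Internações",
    "Arquivos": "nrbr2003.dbf",
    "zeradas": "exibirlz",
    "formato": "prn",
    "mostre": "Mostra",
}

payload = encode_form(fields)

if __name__ == "__main__":
    import requests  # third-party; imported lazily so the helper works without it

    resp = requests.post(
        URL,
        data=payload,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        timeout=60,
    )
    print(resp.status_code)
```

The payload is pre-encoded by hand and passed as a string because requests would otherwise re-encode a plain dict's values as UTF-8, which does not match what the site's form sends.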

An example of the copied curl request:

curl 'http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: http://tabnet.datasus.gov.br' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Referer: http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def' \
  -H 'Accept-Language: pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7' \
  -H 'Cookie: TS014879da=01e046ca4c72569773aca201f18700eeeba156dca36d80d2164402d50541a167b0fe28c1eed5e284f878cbb97def8098f34d4600bd' \
  --data 'Linha=Macrorregi%E3o_de_Sa%FAde&Coluna=--N%E3o-Ativa--&Incremento=Interna%E7%F5es&Arquivos=nrbr2003.dbf&pesqmes1=Digite+o+texto+e+ache+f%E1cil&SMunic%EDpio=1&pesqmes2=Digite+o+texto+e+ache+f%E1cil&SCapital=1&pesqmes3=Digite+o+texto+e+ache+f%E1cil&SRegi%E3o_de_Sa%FAde_%28CIR%29=1&pesqmes4=Digite+o+texto+e+ache+f%E1cil&SMacrorregi%E3o_de_Sa%FAde=TODAS_AS_CATEGORIAS__&pesqmes5=Digite+o+texto+e+ache+f%E1cil&SMicrorregi%E3o_IBGE=TODAS_AS_CATEGORIAS__&pesqmes6=Digite+o+texto+e+ache+f%E1cil&SRegi%E3o_Metropolitana_-_RIDE=TODAS_AS_CATEGORIAS__&pesqmes7=Digite+o+texto+e+ache+f%E1cil&STerrit%F3rio_da_Cidadania=TODAS_AS_CATEGORIAS__&pesqmes8=Digite+o+texto+e+ache+f%E1cil&SMesorregi%E3o_PNDR=TODAS_AS_CATEGORIAS__&SAmaz%F4nia_Legal=TODAS_AS_CATEGORIAS__&SSemi%E1rido=TODAS_AS_CATEGORIAS__&SFaixa_de_Fronteira=TODAS_AS_CATEGORIAS__&SZona_de_Fronteira=TODAS_AS_CATEGORIAS__&SMunic%EDpio_de_extrema_pobreza=TODAS_AS_CATEGORIAS__&SCar%E1ter_atendimento=TODAS_AS_CATEGORIAS__&SRegime=TODAS_AS_CATEGORIAS__&pesqmes16=Digite+o+texto+e+ache+f%E1cil&SCap%EDtulo_CID-10=TODAS_AS_CATEGORIAS__&pesqmes17=Digite+o+texto+e+ache+f%E1cil&SLista_Morb__CID-10=TODAS_AS_CATEGORIAS__&pesqmes18=Digite+o+texto+e+ache+f%E1cil&SFaixa_Et%E1ria_1=3&pesqmes19=Digite+o+texto+e+ache+f%E1cil&SFaixa_Et%E1ria_2=TODAS_AS_CATEGORIAS__&SSexo=TODAS_AS_CATEGORIAS__&SCor%2Fra%E7a=TODAS_AS_CATEGORIAS__&zeradas=exibirlz&formato=prn&mostre=Mostra' \
  --compressed \
  --insecure 

In the --data field you then change the values across all the possibilities, issuing new curl requests until you have downloaded all the information from the site.
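That parameter sweep can be scripted rather than done by hand. As one sketch, the monthly values for the Arquivos field could be generated and looped over. The nrbrYYMM.dbf pattern is only an assumption inferred from the single captured value "nrbr2003.dbf"; confirm the real option values in the page's Arquivos select box before relying on it.

```python
def monthly_files(start_year, start_month, end_year, end_month):
    """Yield hypothetical TabNet monthly file names (nrbrYYMM.dbf).
    The naming pattern is inferred from "nrbr2003.dbf" in the captured
    request and must be checked against the site's own option list."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        yield f"nrbr{year % 100:02d}{month:02d}.dbf"
        month += 1
        if month > 12:
            month, year = 1, year + 1


# One curl (or requests) call would then be issued per file name,
# substituting it into the Arquivos= parameter of --data.
files = list(monthly_files(2019, 1, 2020, 3))
```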

The curl requests return HTML; if you prefer to strip all the HTML tags, you can pipe the output through Lynx:

$ curl <command> | lynx -dump -stdin > resultado1.txt
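As a Python alternative to the Lynx pipe, the tabular text can be pulled out of the returned HTML with the standard library's html.parser. This sketch assumes the data of interest sits inside a <pre> tag; confirm that against an actual response before using it.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text inside <pre> tags -- a stdlib stand-in for
    piping the curl output through lynx to strip the HTML."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


# Hypothetical response fragment; the real page layout may differ
html_doc = "<html><body><h1>Title</h1><pre>Total;123\nBrasil;456</pre></body></html>"
parser = PreExtractor()
parser.feed(html_doc)
text = "".join(parser.chunks)
```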
