Python-based web scraping does not return the complete HTML page information

Viewed 323 times
Hi everyone. I’m trying to use Python to get the information from the page http://www.nfce.se.gov.br/portal/painelMonitor.jsp . It is a Sefaz page that shows the ping status of the NFC-e authorizers, but when I use Python to pull the information from the page, it does not bring the information that is there in real time; it does not return everything I can see with the browser’s F12 Inspect. Would anyone have an idea what it might be? Below is the code I’m using:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.nfce.se.gov.br/portal/painelMonitor.jsp')
soup = BeautifulSoup(page.text, 'html.parser')

autorizador = soup.find(id='tabDados')

print(autorizador.prettify())

2 answers


The page’s information is populated by JavaScript code that runs after the page is loaded. This code makes the requests in real time to the various servers, and so on.

With plain web-scraping techniques it is not possible to get this data. When that is the case, you have two options. The first is to look at the page’s code (as downloaded by Python, whether with requests or with the scraping framework you are using), reverse engineer the URLs referenced by the JavaScript, and try to make the same queries from your Python code. In this case it would even be feasible to do something like that, since the JavaScript that performs the operations is not obfuscated. Apart from a little code on the page itself, most of the JavaScript code is in the URL http://www.nfce.se.gov.br/portal/framework/js/nfce/nfc-e.js .

This option, depending on which data you need and on how complex the code is, can be very complicated (if it were an obfuscated JavaScript page, don’t even try to start there).
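As a starting point for that first route, here is a hedged sketch: it downloads the page’s JavaScript file and lists the quoted strings that look like querystring URLs. The regex heuristic and the function names (`candidate_urls`, `urls_from_site`) are my own assumptions, not anything the site provides.

```python
import re
import requests

def candidate_urls(js_source):
    """Return the distinct quoted strings in the JS source that look like
    querystring URLs (a crude heuristic; the regex is an assumption)."""
    return sorted(set(re.findall(r'["\']([^"\']*\?[^"\']*)["\']', js_source)))

def urls_from_site():
    # Download the page's main JavaScript and scan it for candidate endpoints
    js = requests.get(
        'http://www.nfce.se.gov.br/portal/framework/js/nfce/nfc-e.js').text
    return candidate_urls(js)
```

From the endpoints `urls_from_site()` surfaces, it becomes a matter of replaying the same requests with `requests` and inspecting what each one returns.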

The other option is to use Selenium instead of just BeautifulSoup. Selenium drives a "real" browser (although, in the usual configuration, without a visible window) and runs all of the page’s JavaScript just as in normal navigation, including performing the additional HTTP requests. It then exposes an API that lets you see the HTML generated by the executed JavaScript.

Note that the big difference is that Selenium actually includes a full browser with a JavaScript engine (by default it uses Firefox, but this is configurable), and it allows your Python program to see the page as it ends up after the JavaScript has run.

This second option is definitely the more appropriate one for you to follow, since you will not risk spending hours analyzing the page’s JavaScript code and trying to replicate its behavior, only for the authors to change the code a few weeks or months later and force you to redo all the work. With Selenium you take the already-rendered HTML of the page and can proceed to isolate the data you need with BeautifulSoup as usual.
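A minimal sketch of this Selenium route, assuming the selenium package and a Firefox driver (geckodriver) are installed; the helper names are mine:

```python
from bs4 import BeautifulSoup

def extract_tab_dados(html):
    """Isolate the #tabDados element from already-rendered HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find(id='tabDados')
    return tag.prettify() if tag else None

def rendered_html(url):
    # Selenium drives a real (headless) browser, so the page's JavaScript
    # runs before we read the resulting HTML.
    from selenium import webdriver
    options = webdriver.FirefoxOptions()
    options.add_argument('--headless')
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML *after* JavaScript execution
    finally:
        driver.quit()

# html = rendered_html('http://www.nfce.se.gov.br/portal/painelMonitor.jsp')
# print(extract_tab_dados(html))
```

The `page_source` read in `rendered_html` is the key difference from the question’s code: it is the document after the scripts ran, so `extract_tab_dados` sees the populated table.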

The data shown in the panel actually comes from a separate URL; the code below uses lxml to parse that page directly:

import requests
import lxml.html

# This endpoint returns the status table already rendered as HTML
resp = requests.get('http://www.nfce.se.gov.br/portal/ConStatusAuto?Origem=1')
doc = lxml.html.fromstring(resp.text)
for tr in doc.xpath('//tr'):
    nome = tr[0].text_content().strip()  # first cell: authorizer name
    print(nome.ljust(25), '|'.join('{: >7}'.format(td.text_content().strip())
                                   for td in tr.xpath('.//td')[1:]))

Output:

SEFAZ Amazonas                   |  783ms|  900ms|  361ms|    0ms|  900ms|  661ms|    0ms|  806ms|   50ms|  783ms|  783ms|  196ms|    0ms|    0ms|    0ms
SEFAZ São Paulo                  |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
SEFAZ Paraná                     |  429ms|1s808ms|  559ms|  433ms|1s808ms|  620ms|  429ms|  799ms|  495ms|  512ms|  521ms|  518ms|    0ms|    0ms|    0ms
SEFAZ Goias                      |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
SEFAZ Mato Grosso                |    0ms|  896ms|  310ms|  609ms|  861ms|  301ms|    0ms|  896ms|  325ms|    0ms|  476ms|  159ms|    0ms|    0ms|    0ms
SEFAZ Rio Grande do Sul          |  852ms|1s886ms|  974ms|  852ms|1s166ms|  964ms|  871ms|1s886ms|  986ms|  867ms|  986ms|  940ms|    0ms|    0ms|    0ms
SEFAZ Virtual RS                 |  761ms|5s854ms| 1s11ms|  830ms|1s245ms|  964ms|  761ms|5s854ms| 1s65ms|  899ms|  969ms|  925ms|    0ms|    0ms|    0ms
SEFAZ Mato Grosso do Sul         |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
SEFAZ Ceará                      |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
SEFAZ Minas Gerais               |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
SEFAZ Pernambuco                 |    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms|    0ms
  • Saved my life, man, thanks, that’s what I wanted. Now, would you know how to bring only the SEFAZ Virtual RS row?

  • One more thing: how did you arrive at this link http://www.nfce.se.gov.br/portal/ConStatusAuto?Origem=1 ?

  • @Rafaelxaviersuarez when you request the URL, the SE server will return all the data. It is up to you to filter the rows you want

  • @Rafaelxaviersuarez I arrived at the link by inspecting the page in the browser, through the "Network" tab
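To sketch the filtering asked about in the first comment: the loop from the answer can be wrapped in a function that keeps only the authorizer you name (the function name `linhas_status` and the filtering logic are my own, not part of the original answer):

```python
import lxml.html

def linhas_status(html, nome_alvo=None):
    """Parse the ConStatusAuto table; return (name, times) pairs,
    optionally keeping only the authorizer named nome_alvo."""
    doc = lxml.html.fromstring(html)
    linhas = []
    for tr in doc.xpath('//tr'):
        nome = tr[0].text_content().strip()
        if nome_alvo is None or nome == nome_alvo:
            tempos = [td.text_content().strip()
                      for td in tr.xpath('.//td')[1:]]
            linhas.append((nome, tempos))
    return linhas

# import requests
# resp = requests.get('http://www.nfce.se.gov.br/portal/ConStatusAuto?Origem=1')
# for nome, tempos in linhas_status(resp.text, 'SEFAZ Virtual RS'):
#     print(nome.ljust(25), '|'.join(tempos))
```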
