Web scraping with Python on authenticated websites

I am trying to automate a web data collection process using Python. In my case, I need to pull the information from the page https://app.ixml.com.br/documentos/nfe. However, before going to this page, you need to log in to https://app.ixml.com.br/login. The code below should theoretically log into the site:

from robobrowser import RoboBrowser


username = 'meu email'
password = 'minha senha'

br = RoboBrowser()

br.open('https://app.ixml.com.br/login')

form = br.get_form()

form['email'] = username
form['senha'] = password

br.submit_form(form)

# parsed is a property on RoboBrowser, not a method
src = str(br.parsed)

However, when I print the src variable, I get the source code of the page https://app.ixml.com.br/login, that is, the page shown before logging in. If I add the following lines to the end of the previous code:

br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed)

the variable src2 contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but I got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?

  • You are trying to authenticate as a robot, and the site does not allow it.

  • Ah, got it, thanks. Is there any way I can access it then?

1 answer

The requests library also lets you log in. It would look like this:

from bs4 import BeautifulSoup
import requests

session = requests.Session()

# 'email' and 'senha' are the "name" attributes of the login fields in the page's HTML.
payload = {
    'email': '[SEU_EMAIL]',
    'senha': '[SUA_SENHA]',
}

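# Log in. The POST must go to the URL the login form submits to; the question's
# login page, https://app.ixml.com.br/login, is assumed here.
s = session.post('https://app.ixml.com.br/login', data=payload)

# Access the page that requires a login
s = session.get('https://app.ixml.com.br/documentos/nfe')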

soup = BeautifulSoup(s.text, 'html.parser')
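
If the login form also contains hidden inputs (an anti-forgery token, for example), posting only email and senha may not be enough. The sketch below is one possible way to handle that, not a confirmed description of this site: it reads the first form on the login page, keeps every named input, and then fills in the credentials. LoginFormParser, build_login_request and login are hypothetical helper names, and whether app.ixml.com.br actually uses hidden fields is an assumption. The parsing uses the standard library's html.parser; BeautifulSoup could do the same job.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the action URL and named inputs of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep pre-filled values (hidden tokens); empty inputs become "".
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def build_login_request(login_page_html, login_url, email, password):
    """Return (post_url, payload) for the first form found in the HTML."""
    parser = LoginFormParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)      # hidden fields, if the page has any
    payload["email"] = email           # field names taken from the answer above
    payload["senha"] = password
    # A missing or relative action is resolved against the login page URL.
    post_url = urljoin(login_url, parser.action or "")
    return post_url, payload


def login(session, login_url, email, password):
    """Fetch the login page, then submit its form with the same session."""
    page = session.get(login_url)
    post_url, payload = build_login_request(page.text, login_url, email, password)
    return session.post(post_url, data=payload)
```

With a real session this would be called as login(requests.Session(), 'https://app.ixml.com.br/login', '[SEU_EMAIL]', '[SUA_SENHA]'), after which the same session can request https://app.ixml.com.br/documentos/nfe.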

Reference: Scraping Data Behind Site Logins with Python
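
Since the symptom in the question is getting the login page back, it can help to check whether the login actually worked before parsing anything. The following is a rough heuristic under stated assumptions, not a reliable test for every site: it treats a response as "not logged in" if it ends up at the login URL or still contains a password field. looks_logged_in and redirect_chain are hypothetical helper names; they rely only on the url, text, status_code and history attributes that requests responses provide.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def looks_logged_in(response, login_url):
    """Heuristic: False if the response appears to be the login page again."""
    if response.url.rstrip("/") == login_url.rstrip("/"):
        return False
    # A password input in the HTML suggests the login form was served again.
    return 'type="password"' not in response.text.lower()
```

For example, after s = session.get('https://app.ixml.com.br/documentos/nfe'), calling looks_logged_in(s, 'https://app.ixml.com.br/login') and printing redirect_chain(s) shows whether the request was bounced back to the login page.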
