Web scraping with Python (Selenium and requests)

Hello,

I am trying to scrape a page that is protected by a login. I have already managed to log in both with requests and with Selenium; the problem starts after login.

The login page is https://eduardocavalcanti.com/login. After logging in, it automatically redirects to https://eduardocavalcanti.com/dashboard.

When I log in via the browser and then navigate to https://eduardocavalcanti.com/an_fundamentalista/petr/, it loads without problems, since the session is already authenticated.

But this does not work with requests. Even though I request https://eduardocavalcanti.com/an_fundamentalista/petr/, I end up on some other page.

I'm new to this area; I've done some research but can't find a good reference.

requests code:

import requests
from bs4 import BeautifulSoup

loginPage = 'https://eduardocavalcanti.com/login/'
protectedPage = 'https://eduardocavalcanti.com/dashboard'
petrUrl = 'https://eduardocavalcanti.com/an_fundamentalista/petr/'
payload = {
    'user_login': '[email protected]',
    'password': 'minhasenha'
}

sess = requests.Session()
sess.post(loginPage, data=payload)
#petr = sess.get(protectedPage)
petr = sess.get(petrUrl)
soup = BeautifulSoup(petr.content, 'html.parser')
print(soup)
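A likely cause is that the login form posts extra hidden fields (a CSRF token or nonce) and may submit to a different action URL than the login page itself; posting only `user_login` and `password` then fails silently and the site redirects you elsewhere. A sketch of picking up those hidden fields with BeautifulSoup before posting (the selector and the field names in the usage comment are assumptions; inspect the real form in your browser's dev tools):

```python
import requests
from bs4 import BeautifulSoup

def extract_form(html, form_selector='form'):
    """Return (action, payload) for the first matching form,
    with payload pre-filled from its hidden inputs."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.select_one(form_selector)
    payload = {
        inp['name']: inp.get('value', '')
        for inp in form.find_all('input', attrs={'name': True})
        if inp.get('type') == 'hidden'
    }
    return form.get('action'), payload

# Usage against the real site (field names are assumptions -- check
# the actual form in dev tools before relying on them):
# sess = requests.Session()
# action, payload = extract_form(sess.get(loginPage).text)
# payload.update({'user_login': '[email protected]', 'password': 'minhasenha'})
# sess.post(action or loginPage, data=payload)
```

Because the same `Session` is reused, any cookies set by the login response are sent automatically on later `get` calls.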

Selenium code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
browser.get("https://eduardocavalcanti.com/an_fundamentalista/itsa/") 
time.sleep(10)
username = browser.find_element_by_name("user_login")
password = browser.find_element_by_name("user_pass")
username.send_keys("[email protected]")
password.send_keys("minha_senha")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(5)
browser.get("https://eduardocavalcanti.com/an_fundamentalista/petr/")
DadosEmpresa = browser.find_element_by_xpath("/html/body").text
#DadosEmpresa = browser.find_elements_by_xpath("/html/body")
#for item in DadosEmpresa:
    #print(item.text)

The problem I run into is that the structure Selenium returns would take real work to put into a Python dictionary. Is there any way to get the page's tables back in a more structured format? Then I could use BeautifulSoup.

As for requests, is there some block on the site that prevents it from accessing the page? I tried using cookies and time.sleep, and nothing worked.

  • Pedro, can't you use the two modules together? Something like: browser = webdriver.Firefox(); browser.get("..."); soup = BeautifulSoup(browser.page_source, 'html.parser')?

  • Man, I hadn't even thought of that. Like I said, I'm new, but it worked. Thanks!

  • @Pedrocosta, include a User-Agent header in your GET request; it helps your script look like browser traffic and avoids the site blocking you.
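The comment above suggests sending a User-Agent header. A minimal sketch of setting one on the requests session (the UA string below is just an example value, not something the site specifically requires):

```python
import requests

# A typical desktop-browser User-Agent string (example value; any
# current browser's UA string works the same way)
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/96.0.4664.110 Safari/537.36')
}

sess = requests.Session()
sess.headers.update(headers)   # every request in this session now sends the UA
# resp = sess.get('https://eduardocavalcanti.com/an_fundamentalista/petr/')
```

Setting it on the `Session` (rather than per call) means the login POST and all subsequent GETs send the same header, which is what a real browser does.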

1 answer


The best way to turn an HTML table into a more structured format is with the pandas library. Since I don't have access to the logged-in area, here is example code for you to adapt to your table:

import pandas as pd 
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
browser.get("https://eduardocavalcanti.com/an_fundamentalista/itsa/") 
time.sleep(10)
username = browser.find_element_by_name("user_login")
password = browser.find_element_by_name("user_pass")
username.send_keys("[email protected]")
password.send_keys("minha_senha")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(5)
browser.get("https://eduardocavalcanti.com/an_fundamentalista/petr/")
# Grab the full page HTML so pandas can parse the <table> elements in it
html = browser.page_source

df_tabela = pd.read_html(html)                   # returns a list of DataFrames, one per <table>
df = df_tabela[0]                                # pick the table you want from the list
df = df[['id da empresa', 'nome da empresa', 'descricao']]  # change to match your table's column headers
df.columns = ['id', 'empresa', 'descricao']      # rename the columns however you like
print(df)
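Since `pd.read_html` parses HTML source and always returns a *list* of DataFrames (one per `<table>` it finds), here is a self-contained illustration with a made-up table (the column names mirror the placeholder headers above, not the real site):

```python
from io import StringIO
import pandas as pd

# A small HTML table standing in for the site's real table
html = """
<table>
  <tr><th>id da empresa</th><th>nome da empresa</th><th>descricao</th></tr>
  <tr><td>1</td><td>PETR</td><td>Petrobras</td></tr>
  <tr><td>2</td><td>ITSA</td><td>Itausa</td></tr>
</table>
"""

tabelas = pd.read_html(StringIO(html))   # one DataFrame per <table> in the HTML
df = tabelas[0]                          # select the first (and only) table
df.columns = ['id', 'empresa', 'descricao']
print(df)
```

Wrapping the string in `StringIO` matters in recent pandas versions, which deprecate passing a literal HTML string directly.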

Here is a tutorial to help you use Pandas: How to use Pandas read_html to Scrape Data from HTML Tables

Web Scraping with Python, Selenium and Pandas - TV Source Code
