Hello,
I am trying to scrape a page that sits behind a login. I can already log in both via requests and via Selenium; the problem starts after the login.
The login page is https://eduardocavalcanti.com/login and, after logging in, it redirects automatically to https://eduardocavalcanti.com/dashboard.
In the browser, once I am logged in, I can open https://eduardocavalcanti.com/an_fundamentalista/petr/ without any problem, because the session is already authenticated.
With requests, however, this does not work: even though I ask for https://eduardocavalcanti.com/an_fundamentalista/petr/, it lands on some other page.
I am new to this area; I have done some research, but I could not find a good reference.
requests code:
import requests
from bs4 import BeautifulSoup

loginPage = 'https://eduardocavalcanti.com/login/'
protectedPage = 'https://eduardocavalcanti.com/dashboard'
petrUrl = 'https://eduardocavalcanti.com/an_fundamentalista/petr/'

payload = {
    'user_login': '[email protected]',
    'password': 'minhasenha'
}

sess = requests.Session()           # keeps cookies between requests
sess.post(loginPage, data=payload)  # log in
#petr = sess.get(protectedPage)
petr = sess.get(petrUrl)            # fetch the protected page
soup = BeautifulSoup(petr.content, 'html.parser')
print(soup)
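One thing worth checking: the Selenium code fills a field named user_pass, while the requests payload sends password, and the POST must use the form's real field names. Also, login forms with user_login/user_pass fields are WordPress-style, and those often include hidden inputs (a nonce/CSRF token, a redirect_to field) that must be posted along with the credentials, otherwise the login silently fails. A sketch, assuming the login page contains a standard HTML form (the helper name form_payload is mine, not from any library):

```python
import requests
from bs4 import BeautifulSoup

def form_payload(html, overrides):
    """Collect every <input> name/value pair from the first <form>
    (this captures hidden fields like nonces), then merge in our
    own values such as the credentials."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find('form')
    payload = {}
    for inp in form.find_all('input'):
        name = inp.get('name')
        if name:
            payload[name] = inp.get('value', '')
    payload.update(overrides)
    return payload
```

Usage would then be: GET the login page with the session first, build the payload from the returned HTML, and only then POST it, so the hidden fields match the cookies the server set.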
Selenium code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Firefox()
browser.get("https://eduardocavalcanti.com/an_fundamentalista/itsa/")
time.sleep(10)  # wait for the login form to load

username = browser.find_element_by_name("user_login")
password = browser.find_element_by_name("user_pass")
username.send_keys("[email protected]")
password.send_keys("minha_senha")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(5)  # wait for the login to complete

browser.get("https://eduardocavalcanti.com/an_fundamentalista/petr/")
DadosEmpresa = browser.find_element_by_xpath("/html/body").text
#DadosEmpresa = browser.find_elements_by_xpath("/html/body")
#for item in DadosEmpresa:
#    print(item.text)
The problem is that the structure Selenium returns would take a lot of work to turn into a Python dictionary. Is there any way to make Selenium return the page's tables in a more structured format? Then I could use BeautifulSoup.
As for requests, is there some block on the site that prevents it from accessing the page? I tried using cookies and adding a time.sleep, but nothing worked.
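On the structured-tables question: if the page renders ordinary HTML table elements, you can hand browser.page_source (or the requests response body) to BeautifulSoup and flatten each table into a list of dictionaries. A sketch (table_to_dicts is my own helper name; it assumes the table's first row uses th header cells):

```python
from bs4 import BeautifulSoup

def table_to_dicts(html):
    """Turn the first <table> in the HTML into a list of row dicts,
    using the <th> header cells as the keys."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # skips the header row, which has no <td> cells
            rows.append(dict(zip(headers, cells)))
    return rows
```

With Selenium you would call it as table_to_dicts(browser.page_source) after the page has loaded.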
Pedro, can't you use the two modules together? Something like:

browser = webdriver.Firefox()
browser.get("...")
soup = BeautifulSoup(browser.page_source, 'html.parser')

– NoobSaibot
Man, I hadn't even thought of that. Like I said, I'm new, but it worked. Thanks!
– Pedro Costa
@Pedrocosta, include a User-Agent header in your GET requests; this helps your robot look like browser access and prevents the site from blocking it.
– leogregianin
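Following leogregianin's suggestion, a minimal way to attach a browser-like User-Agent to every request made through the session (the exact UA string below is just an example; any current browser string works):

```python
import requests

# A typical desktop-browser User-Agent string; sending it makes the
# session's requests look like normal browser traffic.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/96.0 Safari/537.36')
}

sess = requests.Session()
sess.headers.update(HEADERS)  # every request in this session now carries it
```

After this, sess.post(...) and sess.get(...) calls need no extra headers argument.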