I’m trying to scrape a page that uses ASPX. When I inspect the element, the data I need is there, but when I request the page I get all of the HTML except the part containing that data. The data sits in the middle of the HTML, inside an "OC Grid", and only that part is missing from the response. I believe the problem is something in the POST parameters, but I’m just starting out with scraping and I’m stuck on it. Any help will be very welcome. Thanks in advance!
import os
import io
import re
import sys
import json
import boto3
import os.path
import requests
import zipfile
import urllib3
urllib3.disable_warnings()
from bs4 import BeautifulSoup
from python3_anticaptcha import ImageToTextTask
with requests.Session() as session:
    url = 'https://www.bec.sp.gov.br/bec_pregao_UI/_Imagens/imagem_aleatoria.aspx'
    url_acesso = 'https://www.bec.sp.gov.br/bec_pregao_UI/OC/pregao_oc_pesquisa.aspx?chave='
    response = session.get(url_acesso, allow_redirects=True)
    cookie_1 = response.cookies.get_dict()
    response_1 = session.get(url)
    with open('img.jpg', 'wb') as imagem:
        imagem.write(response_1.content)
    ANTICAPTCHA_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxx"  # API key for the captcha-solving service
    user_answer = ImageToTextTask.ImageToTextTask(anticaptcha_key=ANTICAPTCHA_KEY).captcha_handler(captcha_file='img.jpg')
    texto = user_answer['solution']['text']
    payload = {
        "__EVENTTARGET": "",
        "__EVENTARGUMENT": "",
        "__LASTFOCUS": "",
        "__VIEWSTATE": "viewstate",
        "__VIEWSTATEGENERATOR": "A4827007",
        "__EVENTVALIDATION": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$c_ddlListaAtividadeGrupo": "1",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$c_ddlListaGrupoSituacao": "0",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cSecretaria": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cUgeCodigo": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cUgeDenominacao": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cMunicipio": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cTipoEdital": "0",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cEnteFederativo": "0",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cPartExclusiva": "0",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cAgrupamento": "0",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cNumeroOc": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cItemCodigo": "",
        "ctl00$conteudo$Wuc_OC1$Wuc_filtroPesquisaOc1$cItemDescricao": "",
        "ctl00$conteudo$Wuc_OC1$noRobot": texto,
        "ctl00$conteudo$Wuc_OC1$c_btnPesquisa": "Pesquisar"
    }
    response_2 = session.post(url_acesso, data=payload, cookies=cookie_1, headers={'Content-type': 'application/json'}, allow_redirects=True)
The technology used in the backend doesn’t matter much (well, it matters in the case of the accepted answer of the question I marked this as a duplicate of, but see my answer there). As you noticed, the data is loaded dynamically after the static HTML loads, so you will have to use requests on the Python side to send a POST copying the data that the page’s JavaScript sends, or use Selenium instead of requests. Selenium runs a real browser, so all of the page’s JavaScript executes and you have access to the full page. – jsbueno
Got it, man, I appreciate it. The problem I’m having with requests is that I can’t get the "__EVENTVALIDATION" and "__VIEWSTATE" parameters working for this POST; they are extremely large. Do you have any idea of another way, besides Selenium? – Samp
For the backend to respond, you need to send the data it expects. Instead of trying to copy and paste this data into your program, you can read it straight from the page after making a first GET. Or save this data to a file, and read from the file to fill in the POST data in Python. – jsbueno