Problem extracting web page data with Beautiful Soup in Python


I wrote a script in Python to access the Inmetro records portal and search through the existing certificates.

In this case, my script accesses this link and collects all the records on the page. After that, it opens each record and extracts the certificate data.

For example, for the certificate at this link, it should take the data from "Date", "Change", "Brand", "Model", "Description" and "Barcode" and store it.

When I do this outside the main loop, for a single specific record, it apparently works, but when I insert it into the main loop it does not. Instead of getting the data from the individual certificates, the script gets the data from the home page, where all the certificates are listed.
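One pattern that can produce exactly this symptom is a helper function that reads a module-level variable instead of the argument it was given, so every call sees whatever page was parsed first. This is only an illustrative sketch with made-up names, not the actual script:

```python
# Illustrative sketch (hypothetical names): a helper that ignores its
# parameter and reads a module-level variable keeps returning data from
# whatever was parsed first.

home_text = "home page listing"  # parsed once, at module level

def get_values_buggy(row_text):
    # Bug: row_text is ignored; the module-level home_text is used instead.
    return home_text.split()

def get_values(row_text):
    # Correct: use the argument that was passed in.
    return row_text.split()

print(get_values_buggy("certificate detail"))  # ['home', 'page', 'listing']
print(get_values("certificate detail"))        # ['certificate', 'detail']
```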

Code to obtain the links of the records:

import requests
from bs4 import BeautifulSoup as bs  # imports were missing from the snippet

webPage = requests.get('http://registro.inmetro.gov.br/consulta/Default.aspx?pag=1&acao=pesquisar&NumeroRegistro=&ctl00%24MainContent%24ControlPesquisa1%24Situacao=&dataConcessaoInicio=&dataConcessaoFinal=2021-08-25&ObjetoProduto=Sistemas+e+equipamentos+para+energia+fotovoltaica+%28m%C3%B3dulo%2C+controlador+de+carga%2C+inversor+e+bateria%29&MarcaModelo=&CodigodeBarra=&Atestado=&Fornecedor=&CNPJ=&ctl00%24MainContent%24ControlPesquisa1%24SelectUF=&Municipio=')
soup = bs(webPage.content)
infoBox = soup.find(class_='corpo')
registerList = infoBox.find_all('a', href=True)
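One thing worth checking with a listing page like this is whether the hrefs in registerList are absolute or relative; if they are relative, they need to be resolved against the portal's base URL before being passed to requests.get. A sketch with a hypothetical relative href (the path shown is an assumption, not taken from the site):

```python
from urllib.parse import urljoin

base = 'http://registro.inmetro.gov.br/consulta/'
href = 'Detalhe.aspx?id=123'  # hypothetical relative href for illustration

# urljoin resolves a relative href against the base URL; absolute hrefs
# pass through unchanged.
full_url = urljoin(base, href)
print(full_url)  # http://registro.inmetro.gov.br/consulta/Detalhe.aspx?id=123
```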

Code to obtain the data of each record from the previous list:

def get_content_value(rowData):
    return [li.get_text() for li in soup.select('.print')]

def get_content_key(rowData):
    return [th.get_text() for th in rowData]

def get_info_box(url):  
    webPage = requests.get(url)
    soup = bs(webPage.content)
    infoBox = soup.find(class_='table table-striped')
    infoRows = infoBox.find_all('tr')
    
    for index, row in enumerate(infoRows):
        if index == 0:
            key = row.find_all('th')
            contentKey = get_content_key(key)
        else:
            contentValue = get_content_value(row.find('tr'))

    certData = {k: [] for k in contentKey}
    count = 0        
    
    for row,item in enumerate(contentValue):
        certData[contentKey[count]].append(contentValue[row])
        count += 1
        if count > 5:
            count = 0
    return certData

registerDataset = []
errors = []

for index,item in enumerate(registerList):
    if index % 5 == 0:
        print('Actual Index: ', index)
        print('Until Finish: ', len(registerList)-index)
    if index == 4:
        break  # just so I don't flood the site with requests; I was using this while debugging the code
    try:
        url = registerList[index]['href']
        registerDataset.append(get_info_box(url))
    except Exception as e:
        print(index)
        print(e)
        errors.append(index)
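As a side note, the header/value pairing done with the count variable in get_info_box can also be written with zip, assuming contentKey and contentValue line up one-to-one. A sketch with made-up values in the shape the question describes:

```python
# Hypothetical header and value lists (values are made up for illustration).
content_key = ["Date", "Change", "Brand", "Model", "Description", "Barcode"]
content_value = ["2021-08-25", "-", "BrandX", "ModelY", "PV module", "0000000000000"]

# One list per key, mirroring the certData = {k: [] for k in contentKey} shape.
cert_data = {k: [v] for k, v in zip(content_key, content_value)}
print(cert_data["Brand"])  # ['BrandX']
```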
No answers
