Empty Dictionary Print - Web scraping/python/xpath

Viewed 101 times

Guys, I can't understand why the result of this scrape comes out as an empty dictionary. Could someone help me understand what my mistake is?

import requests 
from lxml import html

quimicos = []

resp = requests.get(url="https://www.chemicalbook.com/ProductCASList_12_0_EN.htm", headers ={ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36' })

tree = html.fromstring(html=resp.content)

Linhas = tree.xpath("//table[@id='ContentPlaceHolder1_ProductClassDetail']/tbody/tr") 


for linha in Linhas:
    l = { 
    'Agente' : linha.xpath(".//td[2]/a/text()"), 
    'CAS' : linha.xpath(".//td[3]/a/text()") 
    }
    quimicos.append(l)

print(len(quimicos))

1 answer

Take a look at this code and see if you understand it. If you have any questions, just ask.

import requests
from lxml import html

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'}


def get_data(url_total):
    resp = requests.get(url=url_total, headers=HEADERS)
    tree = html.fromstring(html=resp.content)
    tr_elements = tree.xpath('//tr')  # every table row on the page
    total = 0
    col = []
    for t in tr_elements:
        total += 1
        name = t.text_content().strip()
        print('%d:"%s"' % (total, name))
        col.append((name, []))
    return col


def main():
    url = "https://www.chemicalbook.com"
    resto_url = "/ProductCASList_12_0_EN.htm"
    resp = requests.get(url=url + resto_url, headers=HEADERS)
    tree = html.fromstring(html=resp.content)
    # links to the remaining pages of the listing (pagination block)
    proximos = tree.xpath('//*[@id="form1"]/div[2]/div[9]//a/@href')
    get_data(url + resto_url)
    for p in proximos:
        url_total = url + p
        get_data(url_total)


if __name__ == '__main__':
    main()

Using the Beautifulsoup library:

from bs4 import BeautifulSoup
import requests
import json


def get_data(cell, contador, resposta_parcial):
    # Column 0 (the row index) is skipped; columns 1-3 hold the
    # chemical name, the CAS number, and the molecular formula.
    if contador == 0:
        pass
    elif contador == 1:
        resposta_parcial.append(json.dumps("Chemical Name:" + cell.text.strip()))
        print("Chemical Name", str(cell.text))
    elif contador == 2:
        resposta_parcial.append(json.dumps("CAS:" + cell.text.strip()))
        print("CAS", str(cell.text))
    elif contador == 3:
        resposta_parcial.append(json.dumps("MF:" + cell.text.strip()))
        print("MF", str(cell.text))


def main():
    resposta_total = []
    page_url = 'https://www.chemicalbook.com/ProductCASList_12_0_EN.htm'
    req = requests.get(page_url)
    soup = BeautifulSoup(req.text, 'html.parser')

    tables = soup.find_all('table')

    for t in tables:
        rows = t.find_all('tr', recursive=False)
        for row in rows:
            cells = row.find_all(['td'], recursive=False)
            contador = 0
            resposta_parcial = []  # one record per table row
            for cell in cells:
                get_data(cell, contador, resposta_parcial)
                contador += 1
                if contador == 4:
                    contador = 0  # restart the column counter every four cells
            resposta_total.append(resposta_parcial)

    for r in resposta_total:
        print(r)


if __name__ == '__main__':
    main()
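The comments below report a UnicodeEncodeError ('charmap' codec) when running this on Windows. That error usually comes from print() sending a non-ASCII character (such as '\x8c', which is Œ in cp1252) to a console using a legacy code page. A minimal sketch of two common workarounds, assuming Python 3.7+ for reconfigure and a hypothetical output filename:

```python
import sys

# Workaround 1 (Python 3.7+): force UTF-8 on standard output, so print()
# no longer fails on characters outside the console's code page.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

nome = "\u0152sophage"  # starts with Œ, the character behind the reported error
print(nome)

# Workaround 2: write the results to a file opened explicitly as UTF-8
# instead of printing them, which sidesteps the console encoding entirely.
with open("saida.txt", "w", encoding="utf-8") as f:
    f.write(nome + "\n")
```

Either approach keeps the scraping logic unchanged; only the output path moves off the default console encoding.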
  • Good morning, Vinicius. Thank you very much for your attention. The code you reworked seems more logical and structured; I found it very good! But when running it I got some errors and tried to share them here, but they exceed the character limit: Traceback (most recent call last): (lines 33, 29, 15 and 19 - I shortened it to fit) return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\x8c' in position 16: character maps to <undefined>

  • What version of Python are you using? This error is probably about special characters..

  • I'm using 3.7.4

  • I'm using 3.6... so I believe that's not it.. I tested this code on Linux; the error could be something Windows-specific.. I couldn't replicate it on my machine.. If I discover anything I'll let you know.. Take a look at this answer too: https://stackoverflow.com/questions/27092833/unicodeencorror-charmap-codec-cant-encode-characters

  • I will look. Thank you for your attention, Vinicius!

  • Run this code and see which encoding your request is using: print("Encoding: " + str(resp.encoding)). The result should be utf-8. See: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8

  • I ran it and got exactly that: Encoding: utf-8

  • I found the mistake! It was just this part: it was like this ('//*[@id='Form1']/div[2]/div[9]//a/@href') and I changed it to look like this ("//*[@id='Form1']/div[2]/div[9]//a/@href"). After I changed it, it worked!!! Man, what happiness hahahahaha :DDD I am now trying to save the file in JSON format

  • The outer single quotes were conflicting with the inner ones.. You can use either of these two forms: ('//*[@id="form1"]/div[2]/div[9]//a/@href') or ("//*[@id='form1']/div[2]/div[9]//a/@href")

  • Got it!! Thank you very much, Vinicius!! I merged my old code with this new one to try to generate the JSON file, but I'm not managing it. I'll post more if I can

  • Please, Vinicius, could you help me generate the file in JSON or CSV as shown on the site? I intend to do data processing on these files in Power BI

  • I updated the answer, Marcelo. Take a look at the version with the BeautifulSoup library

  • Awesome!!! I'll take a look, yes!! Thank you very much!!!
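On the JSON/CSV question from the comments: once the scraped rows are collected as a list of dicts (like the question's quimicos list), the standard json and csv modules can write files that Power BI imports directly. A minimal sketch, with hypothetical sample data and filenames standing in for the real scraped rows:

```python
import csv
import json

# Hypothetical sample of scraped rows; in practice this would be the
# `quimicos` list built while iterating the table rows.
quimicos = [
    {'Agente': 'Acetone', 'CAS': '67-64-1'},
    {'Agente': 'Benzene', 'CAS': '71-43-2'},
]

# JSON: dump the whole list; ensure_ascii=False keeps accented names readable.
with open('quimicos.json', 'w', encoding='utf-8') as f:
    json.dump(quimicos, f, ensure_ascii=False, indent=2)

# CSV: header row from the dict keys, then one line per chemical.
with open('quimicos.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Agente', 'CAS'])
    writer.writeheader()
    writer.writerows(quimicos)
```

Both files load in Power BI via Get Data (JSON or Text/CSV); CSV is usually the simpler of the two for flat tables like this one.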
