Take a look at this code and see if it makes sense. If you have any questions, just ask.
import requests
from lxml import html

# Browser-like User-Agent sent with every request.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'}


def get_data(url_total):
    # Download one listing page and print the text of every table row.
    resp = requests.get(url=url_total, headers=HEADERS)
    tree = html.fromstring(html=resp.content)
    tr_elements = tree.xpath('//tr')
    total = 0
    col = []
    for t in tr_elements:
        total += 1
        name = t.text_content().strip()
        print('%d:"%s"' % (total, name))
        col.append((name, []))


def main():
    url = "https://www.chemicalbook.com"
    resto_url = "/ProductCASList_12_0_EN.htm"
    resp = requests.get(url=url + resto_url, headers=HEADERS)
    tree = html.fromstring(html=resp.content)
    # Relative links to the other pages of the listing.
    proximos = tree.xpath('//*[@id="form1"]/div[2]/div[9]//a/@href')
    get_data(url + resto_url)
    for p in proximos:
        url_total = url + p
        # print(url_total)
        get_data(url_total)


if __name__ == '__main__':
    main()
Using the BeautifulSoup library:
from bs4 import BeautifulSoup
import requests
import json


def get_data(cell, contador, resposta_parcial):
    # contador is the column index within the row; column 0 is skipped.
    if contador == 0:
        pass
    elif contador == 1:
        resposta_parcial.append(json.dumps("Chemical Name:" + cell.text.strip()))
        print("Chemical Name", str(cell.text))
    elif contador == 2:
        resposta_parcial.append(json.dumps("CAS:" + cell.text.strip()))
        print("CAS", str(cell.text))
    elif contador == 3:
        resposta_parcial.append(json.dumps("MF:" + cell.text.strip()))
        print("MF", str(cell.text))


def main():
    resposta_total = []
    resposta_parcial = []
    page_url = 'https://www.chemicalbook.com/ProductCASList_12_0_EN.htm'
    req = requests.get(page_url)
    soup = BeautifulSoup(req.text, 'html.parser')
    tables = soup.find_all('table')
    for t in tables:
        rows = t.find_all('tr', recursive=False)
        for row in rows:
            cells = row.find_all(['td'], recursive=False)
            contador = 0
            resposta_parcial = []
            for cell in cells:
                get_data(cell, contador, resposta_parcial)
                contador += 1
                # After the fourth cell the row is complete: store it.
                if contador == 4:
                    contador = 0
                    resposta_total.append(resposta_parcial)
    for r in resposta_total:
        print(r)


if __name__ == '__main__':
    main()
Good morning, Vinicius. Thank you very much for your attention. The code you rewrote seems more logical and structured; I found it very good! But when I ran it I got some errors. I tried to share them here, but the traceback exceeds the character limit, so I shortened it to fit: Traceback (most recent call last): (lines 33, 29, 15 and 19) return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\x8c' in position 16: character maps to <undefined>
– Marcelo Augusto
Which version of Python are you using? This error is probably related to special characters.
– Vinicius Bussola
I’m using 3.7.4.
– Marcelo Augusto
I’m using 3.6, so I don’t think that’s it. I tested this code on Linux, so the error could be something specific to Windows. I couldn’t reproduce it on my machine. If I find anything I’ll let you know. Take a look at this answer too: https://stackoverflow.com/questions/27092833/unicodeencorror-charmap-codec-cant-encode-characters
– Vinicius Bussola
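To illustrate the kind of failure described above, here is a minimal sketch, assuming the Windows console is using the cp1252 code page (an assumption; the thread does not say which code page was active). The helper name `safe_text` and the sample string are hypothetical and only serve the example. It shows that the character from the traceback, '\x8c', cannot be encoded in cp1252, and one possible way to make text printable anyway.

```python
def safe_text(text, encoding="cp1252"):
    """Replace characters the target encoding cannot represent with '?'.

    `safe_text` is a hypothetical helper, not part of the original answer.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample containing U+008C, the character from the traceback.
sample = "Chemical\x8cName"

# Strict encoding to cp1252 fails, which is what print() runs into
# when the console encoding is cp1252.
try:
    sample.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

cleaned = safe_text(sample)
print(strict_failed, cleaned)
```

On Python 3.7 and later, calling `sys.stdout.reconfigure(encoding="utf-8")` at the start of the script is another option, since it changes the encoding `print` uses instead of altering the text.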
I will look. Thank you for your attention, Vinicius!
– Marcelo Augusto
Add this line and see which encoding your request is using: print("Encoding: " + str(resp.encoding)). The result should be utf-8. See: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8
– Vinicius Bussola
I ran it and it printed exactly that: Encoding: utf-8
– Marcelo Augusto
I found the mistake! It was just this part. It was like this: ('//*[@id='form1']/div[2]/div[9]//a/@href') and I changed it to this: ("//*[@id='form1']/div[2]/div[9]//a/@href"). After that change it worked!!! Daaaaamn, I'm so happy hahahahahaha :DDD I am now trying to save the output in JSON format.
– Marcelo Augusto
The outer single quotes were conflicting with the inner ones. You can use either of these two forms: ('//*[@id="form1"]/div[2]/div[9]//a/@href')
or
 ("//*[@id='form1']/div[2]/div[9]//a/@href")
– Vinicius Bussola
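The quoting point can be checked without any network access. The sketch below is only an illustration: the variable names are hypothetical, and it uses the built-in `compile` to test whether a line of source parses. It shows that the two accepted forms describe the same XPath, while reusing single quotes both outside and inside does not parse as Python.

```python
# Outer single quotes, inner double quotes.
xpath_a = '//*[@id="form1"]/div[2]/div[9]//a/@href'
# Outer double quotes, inner single quotes.
xpath_b = "//*[@id='form1']/div[2]/div[9]//a/@href"

# The two literals differ only in the quote character used inside them.
same_after_normalizing = xpath_a.replace('"', "'") == xpath_b

# Single quotes outside AND inside: the first inner quote ends the string early.
broken_source = """x = ('//*[@id='form1']/div[2]/div[9]//a/@href')"""
try:
    compile(broken_source, "<example>", "exec")
    broken_is_syntax_error = False
except SyntaxError:
    broken_is_syntax_error = True

print(same_after_normalizing, broken_is_syntax_error)
```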
Got it!! Thank you very much, Vinicius!! I merged my old code with this new one to try to generate the JSON file, but I can't get it to work. I’ll post more if I can.
– Marcelo Augusto
Please, Vinicius, could you help me generate the file in JSON or CSV, with the data as shown on the site? I intend to process these files in Power BI.
– Marcelo Augusto
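One possible way to produce both file types with the standard library is sketched below. It assumes each scraped row has been collected as a dictionary with the keys "Chemical Name", "CAS" and "MF" (the labels used in the answer's code). The function names, file names and sample rows are hypothetical: the rows are hand-written for illustration and are not output scraped from the site.

```python
import csv
import json

FIELDS = ["Chemical Name", "CAS", "MF"]


def save_json(rows, path):
    # An explicit UTF-8 encoding avoids depending on the Windows default code page.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def save_csv(rows, path):
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Hand-written sample rows, standing in for the scraped data.
sample_rows = [
    {"Chemical Name": "Water", "CAS": "7732-18-5", "MF": "H2O"},
    {"Chemical Name": "Ethanol", "CAS": "64-17-5", "MF": "C2H6O"},
]

save_json(sample_rows, "chemicals.json")
save_csv(sample_rows, "chemicals.csv")
```

Storing each row as a dictionary rather than as pre-formatted strings keeps the columns separate, which tends to make the files easier to load into tools that expect tabular data.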
I updated the answer, Marcelo. Take a look at the version with the BeautifulSoup library.
– Vinicius Bussola
Awesome, top-notch!!! I’ll definitely take a look!! Thank you very much!!!
– Marcelo Augusto