Why does the comma disappear when converting HTML to str in Python?

I’m scraping some data from a web page, following a tutorial I saw on YouTube. Here is the code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = 'https://www.ccee.org.br/portal/faces/pages_publico/o-que-fazemos/como_ccee_atua/precos/preco_horario?_afrLoop=600752069821480&_adf.ctrl-state=hyxt02uk_54#!%40%40%3F_afrLoop%3D600752069821480%26_adf.ctrl-state%3Dhyxt02uk_58'
driver = webdriver.Safari()
driver.get(url)
sleep(15)  # wait for the page to finish loading
# Click the link inside the 'SUL' list item
driver.find_element(By.XPATH, "//li[@id='SUL']//a").click()
sleep(1)
# Grab the price table's HTML and parse it
element = driver.find_element(By.XPATH, "//table[@id='listaValoresPrecoHorario']")
html_content = element.get_attribute('outerHTML')
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find(name='table')
df = pd.read_html(str(table), encoding='UTF-8')[0]
print(df)
driver.quit()

However, when the data is converted to a string, most values lose their decimal comma:

>>      Hora PLD HORÁRIO (R$/MWh)
>> 0   00:00                18254
>> 1   01:00                17454
>> 2   02:00           MIN 173,01
>> 3   03:00                17318
>> 4   04:00                17330
>> 5   05:00                17380
>> 6   06:00                17619
>> 7   07:00                18226
>> 8   08:00                19117
>> 9   09:00                19476
>> 10  10:00                19777
>> 11  11:00                19425
>> 12  12:00                19244
>> 13  13:00                19426
>> 14  14:00           MAX 200,27
>> 15  15:00                19983
>> 16  16:00                19752
>> 17  17:00                19272
>> 18  18:00                19477
>> 19  19:00                19967
>> 20  20:00                19938
>> 21  21:00                19979
>> 22  22:00                19114
>> 23  23:00                18522

I’m also unable to convert the values to float, probably because some rows contain text as well as a number. At the very least I’d like to recover the commas in the numbers. How can I do that? Thanks in advance.

Note: Strangely, the rows that contain text alongside the number are the ones that keep their commas.

  • What is the result of print(html_content)?

  • And print(table)?

  • I have the impression that the problem is in pd.read_html. According to the documentation, its thousands parameter defaults to ','. Apparently pd.read_html tries to convert the price column to numbers and strips the comma as a thousands separator; cells that also contain the text MIN or MAX are not converted, so they keep their comma. Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

  • html_content holds the HTML excerpt of the table from the page; table holds the same excerpt, now converted to type bs4.element.Tag. Eduardo Bissi, thank you very much, that solved it. How do I highlight your comment?
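
For anyone landing here later, a minimal sketch of the fix suggested in the comment about the thousands parameter, assuming the prices use Brazilian formatting ('.' for thousands, ',' for decimals). Passing thousands=None to pd.read_html should leave the commas untouched; the text label (MIN/MAX) can then be split from the number by hand. The helper names parse_price, read_price_table and add_numeric_price, and the output column names "label" and "price", are made up for this illustration. pd.read_html also documents decimal and thousands parameters that could be set to ',' and '.', but the rows with MIN/MAX text would likely still keep the column as text.

```python
import re
from io import StringIO

import pandas as pd

# Optional text label (e.g. "MIN", "MAX") followed by a number written
# Brazilian-style: "." for thousands, "," for decimals.
PRICE_RE = re.compile(
    r"^\s*(?P<label>[^\d-]*?)\s*(?P<number>-?\d[\d.]*(?:,\d+)?)\s*$"
)


def parse_price(text):
    """Return (label, value) for strings such as '182,54' or 'MIN 173,01'."""
    match = PRICE_RE.match(str(text))
    if match is None:
        return ("", float("nan"))
    # Drop thousands separators, then turn the decimal comma into a point.
    number = match.group("number").replace(".", "").replace(",", ".")
    return (match.group("label"), float(number))


def read_price_table(html):
    """Parse the first HTML table, leaving '.' and ',' in the cells as-is.

    thousands=None disables read_html's default thousands=',' handling.
    """
    return pd.read_html(StringIO(html), thousands=None)[0]


def add_numeric_price(df, column):
    """Return a copy of df with hypothetical 'label' and 'price' columns."""
    parsed = [parse_price(value) for value in df[column]]
    out = df.copy()
    out["label"] = [label for label, _ in parsed]
    out["price"] = [value for _, value in parsed]
    return out
```

With the code from the question, this would be used roughly as df = add_numeric_price(read_price_table(str(table)), 'PLD HORÁRIO (R$/MWh)'). The "price" column is then float, and the MIN/MAX markers are kept in "label" instead of being lost.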

No answers
