Why does the comma disappear when converting HTML to str in Python?

Asked 4 years, 4 months ago

Viewed 44 times

I’m scraping some data from a web page, following a tutorial I saw on YouTube. Here is the code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep


url = 'https://www.ccee.org.br/portal/faces/pages_publico/o-que-fazemos/como_ccee_atua/precos/preco_horario?_afrLoop=600752069821480&_adf.ctrl-state=hyxt02uk_54#!%40%40%3F_afrLoop%3D600752069821480%26_adf.ctrl-state%3Dhyxt02uk_58'
driver = webdriver.Safari()
driver.get(url)
sleep(15)  # wait for the page to finish loading
# Select the "SUL" tab
driver.find_element(By.XPATH, "//li[@id='SUL']//a").click()
sleep(1)
# Grab the rendered price table and hand its HTML to pandas
element = driver.find_element(By.XPATH, "//table[@id='listaValoresPrecoHorario']")
html_content = element.get_attribute('outerHTML')
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find(name='table')
df = pd.read_html(str(table), encoding='UTF-8')[0]
print(df)
driver.quit()

However, when the data is converted to a string, some values lose their comma:

>> Hora PLD HORÁRIO (R$/MWh)
>> 0   00:00                18254
>> 1   01:00                17454
>> 2   02:00           MIN 173,01
>> 3   03:00                17318
>> 4   04:00                17330
>> 5   05:00                17380
>> 6   06:00                17619
>> 7   07:00                18226
>> 8   08:00                19117
>> 9   09:00                19476
>> 10  10:00                19777
>> 11  11:00                19425
>> 12  12:00                19244
>> 13  13:00                19426
>> 14  14:00           MAX 200,27
>> 15  15:00                19983
>> 16  16:00                19752
>> 17  17:00                19272
>> 18  18:00                19477
>> 19  19:00                19967
>> 20  20:00                19938
>> 21  21:00                19979
>> 22  22:00                19114
>> 23  23:00                18522

I also can’t convert the values to float, probably because some rows contain text rather than just a number. I would at least like to recover the commas in the numbers. How can I do that? Thanks in advance.

Note: Strangely, the rows that contain text in addition to the number are the ones that keep their commas.

What is the result of print(html_content)?

– Eduardo Bissi

2021/02/26 at 12:52

And print(table)?

– Eduardo Bissi

2021/02/26 at 12:57

I have the impression that the problem is in pd.read_html. According to the documentation, the thousands parameter defaults to ','. Apparently pd.read_html tries to convert the price column to numbers, so the comma is treated as a thousands separator and removed. Cells that contain the text MIN or MAX are not numeric, so they are left unconverted. Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

– Eduardo Bissi

2021/02/26 at 13:06
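
To illustrate the comment above, here is a minimal sketch of the effect. The two-row SAMPLE_HTML table and the helper names read_prices and compare_parsers are made up for this example (they are not from the original page), and it assumes an HTML parser that pandas can use, such as lxml, is installed. It reads the same table once with the defaults and once with thousands='.' and decimal=',', which matches the Brazilian number format.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same number format as the question's table.
SAMPLE_HTML = """
<table>
  <tr><th>Hora</th><th>PLD</th></tr>
  <tr><td>00:00</td><td>182,54</td></tr>
  <tr><td>01:00</td><td>174,54</td></tr>
</table>
"""


def read_prices(html: str, **kwargs) -> pd.DataFrame:
    """Parse the first <table> in html, forwarding kwargs to pd.read_html."""
    # StringIO avoids the deprecation warning that newer pandas versions
    # emit when a literal HTML string is passed.
    return pd.read_html(StringIO(html), **kwargs)[0]


def compare_parsers(html: str = SAMPLE_HTML):
    """Return (default_df, fixed_df) for the same HTML table."""
    # Default thousands=',' treats the comma as a grouping character and drops it.
    default_df = read_prices(html)
    # Brazilian format: '.' groups thousands and ',' marks the decimals.
    fixed_df = read_prices(html, thousands=".", decimal=",")
    return default_df, fixed_df
```

With this sample, compare_parsers() should give 18254 and 17454 in the PLD column for the default call, and 182.54 and 174.54 for the call with explicit separators.
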
html_content holds the table’s HTML excerpt from the page; table is the same excerpt, now converted to type bs4.element.Tag. Eduardo Bissi, thank you very much. That solved it. How do I highlight your comment?

– Rodrigo Junior

2021/02/26 at 13:41
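
For the float conversion the question also asks about, one possible approach is to keep the raw text (for example by passing thousands=None to pd.read_html, so nothing is stripped) and clean the column afterwards. The sketch below is only an illustration: the helper name to_float_br is hypothetical, and it assumes every cell holds at most one number in Brazilian format, optionally preceded by a label such as MIN or MAX.

```python
import pandas as pd


def to_float_br(series: pd.Series) -> pd.Series:
    """Convert strings like '182,54' or 'MIN 173,01' to floats."""
    text = series.astype(str)
    # Keep only the numeric part, e.g. "MIN 173,01" -> "173,01".
    number = text.str.extract(r"(\d[\d.]*,\d+|\d+)", expand=False)
    # Drop '.' thousands separators, then turn the decimal comma into a dot.
    number = number.str.replace(".", "", regex=False)
    number = number.str.replace(",", ".", regex=False)
    # Cells without any digits become NaN instead of raising an error.
    return pd.to_numeric(number, errors="coerce")


# Hypothetical raw values in the same format as the question's price column.
raw = pd.Series(["182,54", "MIN 173,01", "MAX 200,27", "1.234,56"])
prices = to_float_br(raw)
```

Here prices should be a float column in which the MIN and MAX rows carry just their numeric value. If the labels matter, they could be extracted into a separate column before the conversion.
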

No answers
