I’m scraping some data from a web page, following a tutorial I saw on YouTube. Here is the code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
url = 'https://www.ccee.org.br/portal/faces/pages_publico/o-que-fazemos/como_ccee_atua/precos/preco_horario?_afrLoop=600752069821480&_adf.ctrl-state=hyxt02uk_54#!%40%40%3F_afrLoop%3D600752069821480%26_adf.ctrl-state%3Dhyxt02uk_58'
driver = webdriver.Safari()
driver.get(url)
sleep(15)
driver.find_element_by_xpath("//li[@id='SUL']//a").click()
sleep(1)
element = driver.find_element_by_xpath("//table[@id='listaValoresPrecoHorario']")
html_content = element.get_attribute('outerHTML')
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find(name='table')
df = pd.read_html(str(table),encoding='UTF-8')[0]
print(df)
driver.quit()
However, as the data is being converted to string, some values are losing the comma:
>> Hora PLD HORÁRIO (R$/MWh)
>> 0 00:00 18254
>> 1 01:00 17454
>> 2 02:00 MIN 173,01
>> 3 03:00 17318
>> 4 04:00 17330
>> 5 05:00 17380
>> 6 06:00 17619
>> 7 07:00 18226
>> 8 08:00 19117
>> 9 09:00 19476
>> 10 10:00 19777
>> 11 11:00 19425
>> 12 12:00 19244
>> 13 13:00 19426
>> 14 14:00 MAX 200,27
>> 15 15:00 19983
>> 16 16:00 19752
>> 17 17:00 19272
>> 18 18:00 19477
>> 19 19:00 19967
>> 20 20:00 19938
>> 21 21:00 19979
>> 22 22:00 19114
>> 23 23:00 18522
Also, I’m not able to convert the values to float, probably because some rows contain text as well as a number. But I would like to at least recover the commas in the numbers. How can I do that? Thanks in advance.
Note: Strangely, the rows that contain text besides the number are the ones that keep their commas.
What is the result of print(html_content)?
– Eduardo Bissi
And print(table)?
– Eduardo Bissi
I have the impression that the problem is in pd.read_html. According to the documentation, the thousands parameter defaults to ','. Apparently pd.read_html is treating the comma as a thousands separator while trying to convert the price column to numbers, so the rows that contain the text MIN and MAX are not converted and keep their commas. Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
– Eduardo Bissi
html_content holds the HTML excerpt of the table from the page; table is the same excerpt, now converted to type bs4.element.Tag. Eduardo Bissi, thank you very much. That solved it here. How do I highlight your comment?
– Rodrigo Junior
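For anyone landing here later, a minimal sketch of the fix suggested in the comments. It assumes that passing thousands=None to pd.read_html leaves the cell text untouched, so the commas survive, and that the MIN/MAX labels can simply be discarded. The helper names read_price_table and prices_to_float are made up for this illustration; the column name is the one shown in the question's output.

```python
from io import StringIO

import pandas as pd

PRICE_COLUMN = "PLD HORÁRIO (R$/MWh)"  # column name taken from the printed output


def read_price_table(html: str) -> pd.DataFrame:
    """Parse the table without treating ',' as a thousands separator."""
    # thousands=None is meant to stop pandas from stripping the commas
    # (assumption: the cells then come back as plain strings).
    return pd.read_html(StringIO(html), thousands=None)[0]


def prices_to_float(column: pd.Series) -> pd.Series:
    """Convert strings with a decimal comma, optionally prefixed by a label, to floats."""
    # Keep only the numeric part, dropping labels such as "MIN" or "MAX".
    numbers = column.astype(str).str.extract(r"(\d+(?:,\d+)?)", expand=False)
    # Swap the decimal comma for a point so the value parses as a float.
    return numbers.str.replace(",", ".", regex=False).astype(float)
```

In the script from the question this would replace the pd.read_html line: df = read_price_table(str(table)), followed by df[PRICE_COLUMN] = prices_to_float(df[PRICE_COLUMN]). Note that this throws away the MIN/MAX markers; if they matter, copy them into a separate column before converting.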