How to extract information from table(html) and move to a Dataframe - Using Selenium

Question

How to extract information from table(html) and move to a Dataframe - Using Selenium

Asked 5 years, 10 months ago

Viewed 1,450 times

1

I am using Selenium to access the site http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-ajustes-do-pregao-ptBR.asp and manipulate the date box and ok button. So far I managed to do the task successfully.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import urllib 


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

import datetime
import shutil
from time import sleep

import os

options = webdriver.ChromeOptions()

options.add_experimental_option("prefs", {
  "download": {"prompt_for_download": False} })
options.add_experimental_option('useAutomationExtension', False)


#Realizando a chamada do Driver do Chrome e abertura do site
g = webdriver.Chrome()
#g.get('https://www.google.com/')
gg = g.get('http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-ajustes-do-pregao-ptBR.asp')
sleep(10)

t_dt = g.find_element_by_name('dData1') #g.find_element_by_xpath('//*[@id="dData1"]')
t_dt.clear() 
t_dt.send_keys('24/09/2019')
sleep(5)

t_bt = g.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div[1]/div/form/div/div[2]/button')

t_bt.click()

#Data de atualização 
g.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div[1]/div/form/div/div[3]/p').text

html = g.page_source.encode('utf-8')

soup = BeautifulSoup(html, 'lxml')

results = []

for row in soup.find_all('tr')[1:]:
    data = row.find_all('td')
    merc = data[0]
    venc = data[1]
    prec_ant = data[2]
    prec_atu = data[3]
    vari = data[4]
    results.append({'Mercadoria':merc.text, 
                    'Vencimento':venc.text, 
                    'Preço de ajuste anterior':prec_ant.text,
                    'Preço de ajuste atual':prec_atu.text,
                    'Variação': vari.text,
                   })

df = pd.DataFrame(results)

df.head()

The result is close to expected, the problem occurs in the merchandise column, where there is a mixture and with that the first line of each merchandise loses the formatting.

You can list table data?

– Tmilitino

2019/09/25 at 12:55
with you through this passage #Data extracted from table cols = g.find_elements_by_xpath('//*[@id="tblDadosAjustes"]/tbody/tr/td') for col in cols: print(col.text.split(' n'))

– LePy

2019/09/25 at 13:03
puts this snippet of code and says that you can print the data, and its difficulty is to take this data to a datafreme, because it was not clear in the question. Thank you!

– Tmilitino

2019/09/25 at 13:28
1

I made an edition in the complete code I believe it is clearer the doubt I have

– LePy

2019/09/25 at 14:38
If one of the answers below solved your problem and there was no doubt left, choose the one you liked the most and mark it as correct/accepted by clicking on the " " that is next to it, which also marks your question as solved. If you still have any questions or would like further clarification, feel free to comment.

– Lucas

2021/03/22 at 22:07

2 answers

Browser other questions tagged python selenium selenium-webdriver beautifulsoup

You are not signed in. Login or sign up in order to post.

by fabiogdy • 11 points · Answer 1 · 2019-09-25T03:29:56+00:00

1

Opa!

Pandas has the read_html method (pandas.read_html)! If you can assign the html content of the page to some variable, you only need to pass this variable as parameter to the read_html method, which it already returns a Dataframe for you.

I think this is the simplest and quickest way!

with the requests package, it looks like this:

import requests as re

import pandas as pd

url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-ajustes-do-pregao-ptBR.asp"
params = {
    'type': 'js',
    'session': '6$6E079F1ED270FE09AAAF72AD1C802103|5286dfffe4e737f8|1',
    'svrid': 6,
    'flavor': 'post',
    'visitID': 'HIGOKLAIJAFBOMFPQKMPCHLOALJFEIDG',
    'modifiedSince': 1569350877570,
    'app': '5286dfffe4e737f8',
    'dData1': '16/09/2019'
}

page = re.post(url, data=params, verify=False)
df = pd.read_html(page.content)

The 'dData1' field you can change the date you want!

As far as I’ve tried, it worked...

In this case I will not be able to use pandas for page access restrictions, I will have to do this action initially via Selenium and beautifulsoup

– LePy

2019/09/25 at 11:24
I’m even getting the data. The main issue is that I am unable to reproduce to a Dataframe the same format as the site table.

– LePy

2019/09/25 at 12:03
Below the codes I’m using:

– LePy

2019/09/25 at 12:03
#Update date g.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div[1]/div/form/div/div[3]/p'). text

– LePy

2019/09/25 at 12:04
#Table columns Elements = g.find_elements_by_xpath('//*[@id="tblDadosAjustes"]/thead/tr/th') for element in Elements: print(element.text.split(" n"))

– LePy

2019/09/25 at 12:04
#Data extracted from table Tables = g.find_elements_by_xpath('//*[@id="tblDadosAjustes"]/tbody/tr/td') for table in Tables: print(table.text.split(" n"))

– LePy

2019/09/25 at 12:04
At this point, I believe I can use the pandas, however, I’m not getting

– LePy

2019/09/25 at 12:05
I used the requests package... I think it’s better than using Selenium in this case... I’ll post the code in another answer here in that same topic

– fabiogdy

2019/09/25 at 12:36
thank you very much, I’m waiting

– LePy

2019/09/25 at 12:39
@Leandrob tries this: pd.read_html(g.find_elements_by_id('tblDadosAjustes')[0]. get_attribute('outerHTML'))

– fabiogdy

2019/09/25 at 13:39
I made an edition in the code inserted in the question. I had an evolution, I need a support in a detail. I believe it is easy for you. Thank you

– LePy

2019/09/25 at 14:38
@Leandrob I edited the code that you posted (only the loop for)... I added a "gambiarra" so he gets the elements td correctly... I tested it here and it worked

– fabiogdy

2019/09/25 at 15:03
Thank you very much!

– LePy

2019/09/25 at 18:34

Show 8 more comments

by Lucas • **3,858** points · Answer 2 · 2020-12-02T13:44:07+00:00

A little late, but maybe the answer interests someone else.

The problem is that the "Merchandise" line is only defined in html in the first line of each merchandise. Outside the first row, the first element in the html table is the expiration.

In the solution I present below I use a regex to check if the first element of the line has the maturity format (r'[A-Z][0-9]{2}'). If yes, I use the tickets normally and save the name of the merchandise. Otherwise, I use the last name saved from merchandise and start inserting new data from maturity. See:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import urllib 
import re #regex module add for solution

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

import datetime
import shutil
from time import sleep

import os

options = webdriver.ChromeOptions()
# options.headless = True #rodar sem abrir o browser
options.add_experimental_option("prefs", {
  "download": {"prompt_for_download": False} })
options.add_experimental_option('useAutomationExtension', False)


#Realizando a chamada do Driver do Chrome e abertura do site
g = webdriver.Chrome(options=options)
#g.get('https://www.google.com/')
gg = g.get('http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-ajustes-do-pregao-ptBR.asp')
sleep(10)

t_dt = g.find_element_by_name('dData1') #g.find_element_by_xpath('//*[@id="dData1"]')
t_dt.clear() 
t_dt.send_keys('24/09/2019')
sleep(5)

t_bt = g.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div[1]/div/form/div/div[2]/button')

t_bt.click()

#Data de atualização 
g.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div[1]/div/form/div/div[3]/p').text

html = g.page_source.encode('utf-8')

soup = BeautifulSoup(html, 'lxml')

results = []

for row in soup.find_all('tr')[1:]:
    data = row.find_all('td')
    if re.match(r'[A-Z][0-9]{2}',data[1].text.strip()):
        results.append({'Mercadoria':data[0].text.strip(), 
                        'Vencimento':data[1].text.strip(), 
                        'Preço de ajuste anterior':data[2].text.strip(),
                        'Preço de ajuste atual':data[3].text.strip(),
                        'Variação': data[4].text.strip(),
                    })
        #guarda o nome da mercadoria para usar abaixo
        mercadoria=data[0].text.strip()
    else:
        results.append({'Mercadoria':mercadoria, 
                       'Vencimento':data[0].text.strip(), 
                        'Preço de ajuste anterior':data[1].text.strip(),
                        'Preço de ajuste atual':data[2].text.strip(),
                        'Variação': data[3].text.strip(),
                    })


df = pd.DataFrame(results)

print(df)

g.quit()

Output:

                                Mercadoria Vencimento Preço de ajuste anterior Preço de ajuste atual Variação
0  AFS   - RANDE DA AFRICA DO SUL (TIPO A)        V19               14.895,900            14.886,600   -9,300
1  AFS   - RANDE DA AFRICA DO SUL (TIPO A)        X19               14.971,500            14.952,500  -19,000
2  AFS   - RANDE DA AFRICA DO SUL (TIPO A)        Z19               15.024,500            15.006,300  -18,200
3                   ARB   - Peso Argertino        V19                  71,3070               71,7160   0,4090
4                   ARB   - Peso Argertino        X19                  63,6910               63,6380  -0,0530