The way you are reading the HTML table is misguided: it is verbose, complicated, and error-prone, and it ties you to the specifics of Selenium, a heavy and very specialized browser-automation module. In fact, the only thing you need from a web browser is a request header that impersonates its user agent. In this case I copied the string from my own Chrome browser, at the URL chrome://version/.
The pandas module offers a tool for reading HTML tables, the function pandas.read_html(), which accepts a path, a URL, or a string containing HTML text.
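As a minimal sketch of that function on its own, the snippet below parses a small made-up HTML table. The table contents and the names html_text, tables, and first_table are assumptions for illustration only, not part of the site being scraped. The text is wrapped in io.StringIO because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML snippet, used only for illustration.
html_text = """
<table>
  <thead>
    <tr><th>Name</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>Stock A</td><td>10</td></tr>
    <tr><td>Stock B</td><td>20</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html_text))
first_table = tables[0]
print(first_table)
```

Because the return value is a list, you normally index it (tables[0]) to get the DataFrame you want.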
Some websites only allow access from certain user agents. The site you want to access imposes this restriction, which prevents automation tools such as pandas.read_html() from reading the data directly.
To get around this limitation, you can use the Requests module to access the site while simulating a known browser and obtain the page's HTML.
To extract the table from the HTML, use a lightweight parser with XPath support. In this case I used the lxml module:
import pandas as pd
import requests
from lxml import html, etree
url = 'https://br.investing.com/equities/trending-stocks'
# User-Agent impersonates Chrome, but it could be any other browser. X-Requested-With signals that the request was made with XMLHttpRequest.
header = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
# Load the page using the prepared header.
page = requests.get(url, headers=header)
# Parse the HTML and look up one specific table via XPath.
table = html.fromstring(page.content).xpath(r'//*[@id="trendingInnerContent"]/table')[0]
# Serialize only that table back to HTML text and build the DataFrame (read_html returns a list of DataFrames).
df = pd.read_html(etree.tostring(table))
print(df)
Result:
[    Unnamed: 0                 Nome  Último  Máxima  Mínima  Variação   Var. %     Vol.   Hora  Unnamed: 9
 0         NaN             Ambev ON    1621    1633    1600     +0,03   +0,19%   22,16M  07/05         NaN
 1         NaN               Weg ON    3342    3399    3306      -031   -0,92%   10,12M  07/05         NaN
 2         NaN          Petrorio ON    1921    1922    1867     +0,01   +0,05%   13,96M  07/05         NaN
 3         NaN         Petrobras PN    2438    2445    2346     +0,88   +3,74%   73,69M  07/05         NaN
 4         NaN   Banco do Brasil ON    2994    3048    2976     +0,73   +2,50%   26,70M  07/05         NaN
 5         NaN              Rumo ON    2134    2139    2070     +0,64   +3,09%    6,47M  07/05         NaN
 6         NaN            Gafisa ON     451     451     436     +0,15   +3,44%    4,08M  07/05         NaN
 7         NaN           Neogrid ON     707     720     705     +0,02   +0,28%    1,67M  07/05         NaN
 8         NaN       Met. Gerdau PN    1620    1642    1587      -005   -0,31%    9,15M  07/05         NaN
 9         NaN     Itau Unibanco PN    2763    2770    2724     +0,33   +1,21%   26,59M  07/05         NaN
10         NaN         JHSF Part ON     731     745     717     +0,19   +2,67%   11,26M  07/05         NaN
11         NaN           Triunfo ON     465     475     410     +0,56  +13,69%   11,62M  07/05         NaN
12         NaN             CTEEP PN    2556    2563    2535     +0,22   +0,87%    1,08M  07/05         NaN
13         NaN                Oi ON     177     180     169     +0,06   +3,51%   79,78M  07/05         NaN
14         NaN  Lojas Americanas ON    1924    1954    1864      -019   -0,98%    7,72M  07/05         NaN
15         NaN              Vale ON   11545   11655   11419     +0,40   +0,35%   21,60M  07/05         NaN
16         NaN          Klabin Unit    2790    2801    2745      -005   -0,18%    4,49M  07/05         NaN
17         NaN        J B Duarte PN     330     336     303      -002   -0,60%  102,40K  07/05         NaN
18         NaN         C&A Modas ON    1247    1253    1221     +0,20   +1,63%  856,70K  07/05         NaN
19         NaN            Itausa PN    1034    1034    1016     +0,14   +1,37%   23,88M  07/05         NaN
20         NaN            TAEE UNIT    3885    3924    3819      -033   -0,84%    3,11M  07/05         NaN
21         NaN          Hercules PN    1056    1079    1030     +0,19   +1,83%    2,60K  07/05         NaN
22         NaN        Via Varejo ON    1216    1231    1200     +0,20   +1,67%   25,89M  07/05         NaN
23         NaN      BR Malls Par ON    1061    1083    1034     +0,29   +2,81%   21,78M  07/05         NaN
24         NaN               JBS ON    3118    3118    2980     +0,54   +1,76%    7,37M  07/05         NaN
25         NaN      Lojas Renner ON    4356    4357    4176     +1,74   +4,16%    9,84M  07/05         NaN
26         NaN    Magazine Luiza ON    1989    1990    1933     +0,43   +2,21%   23,14M  07/05         NaN
27         NaN        Portobello ON    1201    1212    1152     +0,26   +2,21%    2,52M  07/05         NaN
28         NaN         Sanepar Unit    2064    2066    2037     +0,34   +1,67%  905,60K  07/05         NaN
29         NaN        Neoenergia ON    1675    1702    1651     +0,49   +3,01%    4,24M  07/05         NaN]
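Note that in the output above the prices lost their decimal separator (1621 where the site shows 16,21), because pandas.read_html() treats the comma as a thousands separator by default. A possible fix, sketched here on a made-up one-row table rather than on the live page, is to pass the decimal and thousands arguments. The names html_text, default_df, and localized_df are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical one-row table using Brazilian number formatting
# (comma as the decimal separator); not fetched from the real site.
html_text = """
<table>
  <thead>
    <tr><th>Nome</th><th>Último</th></tr>
  </thead>
  <tbody>
    <tr><td>Ambev ON</td><td>16,21</td></tr>
  </tbody>
</table>
"""

# Default settings: the comma is stripped as a thousands separator.
default_df = pd.read_html(StringIO(html_text))[0]

# Tell pandas that "," is the decimal mark and "." the thousands separator.
localized_df = pd.read_html(StringIO(html_text), decimal=",", thousands=".")[0]

print(default_df)
print(localized_df)
```

With these arguments the price should come back as 16.21 instead of 1621. In the answer's code the same two arguments could be added to the existing pd.read_html call.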
Test the example on Google Colab
Thanks a lot man! I’m back to python now, some concepts are lost.
– André Leite