Join cells with Python

I am scraping a website to collect data on the day's top stocks and put it into a table in an Excel file. This is the code I am trying:

from selenium import webdriver
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from selenium.webdriver.common.keys import Keys

browser = webdriver.Edge(EdgeChromiumDriverManager().install())

# Access the website with the most popular stocks
link = 'https://br.investing.com/equities/trending-stocks'
browser.get(link)




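import pandas as pd

# Grab the visible text of the table and split it into one string per row
table = (browser.find_element_by_xpath('//*[@id="trendingInnerContent"]/table').text)
table = table.split(sep='\n')
empty = []
for item in table:
    # Splitting on a single space also breaks multi-word names apart
    item = item.split(' ')
    # Drop the empty strings left over from runs of spaces
    filtered = [x for x in item if x.strip()]
    empty.append(filtered)

# One spreadsheet row per scraped line
tb = pd.DataFrame(empty)
tb.to_excel('Atualizado.xlsx', encoding='utf-8', header=False, index=False)
del empty[0]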

But it splits the names across several cells. I want to join them so that the table comes out right.

2 answers

1

The way you are reading the HTML table is verbose, complicated and error-prone. It also ties you to the specifics of Selenium, a heavyweight and very specialized module for web browser automation, when all you actually need from a web browser is a header that impersonates a user agent. In this case I copied the string from my own Chrome browser at the URL chrome://version/.

The pandas module offers a tool for reading HTML tables: the function pandas.read_html(), which can read from a path, a URL or a string containing HTML text.

Some websites only allow access from certain user agents. The site you want to access imposes this restriction, which prevents automation tools such as pandas.read_html() from reading the data directly.
To get around this limitation, you can use the Requests module to access the site while simulating a known browser and obtain the HTML of the page.

To extract the table from the HTML, use a lightweight parser with XPath support. Here I used the lxml module:

from io import StringIO

import pandas as pd
import requests
from lxml import html, etree

url = 'https://br.investing.com/equities/trending-stocks'

# The User-Agent impersonates Chrome, but it could be another browser.
# X-Requested-With indicates that the request was made with XMLHttpRequest.
header = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

# Load the page with the prepared header.
page = requests.get(url, headers=header)
# Parse the HTML and look up one specific table via XPath.
table = html.fromstring(page.content).xpath(r'//*[@id="trendingInnerContent"]/table')[0]
# Serialize only that table back to HTML text and build the DataFrame.
# read_html() returns a list of DataFrames; wrapping the text in StringIO
# avoids passing a literal HTML string, which newer pandas versions deprecate.
df = pd.read_html(StringIO(etree.tostring(table, encoding='unicode')))
print(df)

Result:

[    Unnamed: 0                 Nome  Último  Máxima  Mínima Variação   Var. %     Vol.   Hora  Unnamed: 9
0          NaN             Ambev ON    1621    1633    1600    +0,03   +0,19%   22,16M  07/05         NaN
1          NaN               Weg ON    3342    3399    3306     -031   -0,92%   10,12M  07/05         NaN
2          NaN          Petrorio ON    1921    1922    1867    +0,01   +0,05%   13,96M  07/05         NaN
3          NaN         Petrobras PN    2438    2445    2346    +0,88   +3,74%   73,69M  07/05         NaN
4          NaN   Banco do Brasil ON    2994    3048    2976    +0,73   +2,50%   26,70M  07/05         NaN
5          NaN              Rumo ON    2134    2139    2070    +0,64   +3,09%    6,47M  07/05         NaN
6          NaN            Gafisa ON     451     451     436    +0,15   +3,44%    4,08M  07/05         NaN
7          NaN           Neogrid ON     707     720     705    +0,02   +0,28%    1,67M  07/05         NaN
8          NaN       Met. Gerdau PN    1620    1642    1587     -005   -0,31%    9,15M  07/05         NaN
9          NaN     Itau Unibanco PN    2763    2770    2724    +0,33   +1,21%   26,59M  07/05         NaN
10         NaN         JHSF Part ON     731     745     717    +0,19   +2,67%   11,26M  07/05         NaN
11         NaN           Triunfo ON     465     475     410    +0,56  +13,69%   11,62M  07/05         NaN
12         NaN             CTEEP PN    2556    2563    2535    +0,22   +0,87%    1,08M  07/05         NaN
13         NaN                Oi ON     177     180     169    +0,06   +3,51%   79,78M  07/05         NaN
14         NaN  Lojas Americanas ON    1924    1954    1864     -019   -0,98%    7,72M  07/05         NaN
15         NaN              Vale ON   11545   11655   11419    +0,40   +0,35%   21,60M  07/05         NaN
16         NaN          Klabin Unit    2790    2801    2745     -005   -0,18%    4,49M  07/05         NaN
17         NaN        J B Duarte PN     330     336     303     -002   -0,60%  102,40K  07/05         NaN
18         NaN         C&A Modas ON    1247    1253    1221    +0,20   +1,63%  856,70K  07/05         NaN
19         NaN            Itausa PN    1034    1034    1016    +0,14   +1,37%   23,88M  07/05         NaN
20         NaN            TAEE UNIT    3885    3924    3819     -033   -0,84%    3,11M  07/05         NaN
21         NaN          Hercules PN    1056    1079    1030    +0,19   +1,83%    2,60K  07/05         NaN
22         NaN        Via Varejo ON    1216    1231    1200    +0,20   +1,67%   25,89M  07/05         NaN
23         NaN      BR Malls Par ON    1061    1083    1034    +0,29   +2,81%   21,78M  07/05         NaN
24         NaN               JBS ON    3118    3118    2980    +0,54   +1,76%    7,37M  07/05         NaN
25         NaN      Lojas Renner ON    4356    4357    4176    +1,74   +4,16%    9,84M  07/05         NaN
26         NaN    Magazine Luiza ON    1989    1990    1933    +0,43   +2,21%   23,14M  07/05         NaN
27         NaN        Portobello ON    1201    1212    1152    +0,26   +2,21%    2,52M  07/05         NaN
28         NaN         Sanepar Unit    2064    2066    2037    +0,34   +1,67%  905,60K  07/05         NaN
29         NaN        Neoenergia ON    1675    1702    1651    +0,49   +3,01%    4,24M  07/05         NaN]

Test the example on Google Colab
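
One detail in the output above: values such as "16,21" show up as 1621, because pandas.read_html() treats "," as a thousands separator by default. The sketch below shows one way to parse Brazilian-style numbers using the decimal and thousands parameters of read_html(). The small HTML table is made up for illustration (it is not fetched from the site), and the example assumes lxml is installed as the parser.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample table that mimics the number format used by the site.
sample_html = """
<table>
  <thead>
    <tr><th>Nome</th><th>Último</th><th>Vol.</th></tr>
  </thead>
  <tbody>
    <tr><td>Ambev ON</td><td>16,21</td><td>22,16M</td></tr>
    <tr><td>Vale ON</td><td>115,45</td><td>21,60M</td></tr>
  </tbody>
</table>
"""

# decimal="," and thousands="." tell pandas to read "16,21" as 16.21
# instead of dropping the comma and producing 1621.
tables = pd.read_html(StringIO(sample_html), decimal=",", thousands=".")
df = tables[0]  # read_html() returns a list; take the first (only) table
print(df)
```

Columns that mix digits and letters, such as "22,16M", stay as text and would need their own conversion if you want numbers there.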

0


Hi, I believe the problem is that each item in your script is being split with split(' ').

So the result for:

    Petrobras PN    23,67   23,81   23,47   +0,05   +0,21%  15,74M  

would be:

['', '', '', '', 'Petrobras', 'PN', '', '', '', '23,67', '', '', '', '23,81', '', '', '', '23,47', '', '', '', '+0,05', '', '', '', '+0,21%', '', '', '', '15,74M', '', '', '', '']

Right after that, you filter out the empty strings and store the rest in the variable filtered.

Your filtered variable then looks like this:

['Petrobras', 'PN', '23,67', '23,81', '23,47', '+0,05', '+0,21%', '15,74M']

And that is where the names get split.
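
To see the effect in isolation, here is a small sketch using a shortened version of the row above. It also shows that str.split() with no argument splits on any run of whitespace, so it gives the same list as split(' ') followed by the filter. Either way the name is still broken into two pieces.

```python
# Shortened version of the row from the example above.
line = "    Petrobras PN    23,67   23,81"

# Splitting on a single space yields an empty string for every extra space.
with_space = line.split(" ")

# Filtering removes the empty strings, as in the original script.
filtered = [x for x in with_space if x.strip()]

# split() with no argument collapses runs of whitespace in one step.
no_arg = line.split()

print(with_space)
print(filtered)
print(no_arg)
```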

Assuming there are always 6 values after the name, I think the solution would be:

indices = filtered[-6:]
nome = " ".join(filtered[:-6])

For the example above, we would have:

>>> indices
['23,67', '23,81', '23,47', '+0,05', '+0,21%', '15,74M']

>>> nome
'Petrobras PN'

If you want to put it all together:

tudo_junto = [nome] + indices

Result:

>>> tudo_junto
['Petrobras PN', '23,67', '23,81', '23,47', '+0,05', '+0,21%', '15,74M']
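
Applied to every row, the same idea could be wrapped in a small helper. This is only a sketch: split_row and n_values are names made up here, the second sample row is invented for illustration, and it still relies on the assumption that each row ends with a fixed number of value columns.

```python
def split_row(line, n_values=6):
    """Split one text row into [name, value_1, ..., value_n].

    Assumes the last n_values tokens are the value columns and that
    everything before them belongs to the name.
    """
    parts = line.split()  # split on any whitespace, no empty strings
    if len(parts) <= n_values:
        # Header or malformed row: nothing to join, return as-is.
        return parts
    nome = " ".join(parts[:-n_values])
    indices = parts[-n_values:]
    return [nome] + indices


# First row is the example above; the second one is made up.
rows = [
    "    Petrobras PN    23,67   23,81   23,47   +0,05   +0,21%  15,74M  ",
    "    Banco do Brasil ON    29,94   30,48   29,76   +0,73   +2,50%  26,70M",
]
table = [split_row(row) for row in rows]
for row in table:
    print(row)
```

The resulting list of lists can be passed to pd.DataFrame() in place of the empty list from the question. If the real table has a different number of value columns, n_values has to be adjusted.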
  • Thanks a lot, man! I'm getting back into Python now, and I have forgotten some concepts.
