BeautifulSoup: Get text inside a nested table

1

I'm trying to get a specific value from a table. I already use similar code on another page whose HTML has a single table, but with this structure, where one table is nested inside another, I can't get the text of the cell.

Below are the table structure, the value I want to extract, and what I've been trying:

<table id="principal">
   <tr>
       <td id="TB01">
           <table class="secondary"></table>
       </td>
       <td id="TB02">
           <table class="secondary"></table>
       </td>
       <td id="TB03">
           <table class="secondary">
               <tr></tr>
               <tr>
                   <td></td>
                   <td></td>
                   <td>THIS VALUE -> R$5.388,50</td>
               </tr>
           </table>
       </td>
       <td id="TB04">
           <table class="secondary"></table>
       </td>
   </tr>
</table>

Code I’ve been trying to use:

import requests
from bs4 import BeautifulSoup

url = "http://www.bmf.com.br/bmfbovespa/pages/lumis/lum-boletim-online-new-ptBR.asp?Acao=BUSCA&cboMercadoria=DOL"
resp = requests.get(url)

bs = BeautifulSoup(resp.text, "html.parser")

trs = (
    bs.find("td", {"id": "TB03"})
    .find("table", {"class": "secondary"})
    .findAll("tr")
)

for tr in trs:
    if trs.index(tr) == 2:
        tds = tr.findAll("td")

        for td in tds:
            if tds.index(td) == 3:
                valor = td.get_text()

print(valor)

Can anyone help me return that specific value? Printing always gives None, or I get this error: AttributeError: 'NoneType' object has no attribute 'find'.


2 answers

2


Kleyton, this happens because the request does not return the full HTML right away. Let's walk through what your code does:

import requests
from bs4 import BeautifulSoup

url = "http://www.bmf.com.br/bmfbovespa/pages/lumis/lum-boletim-online-new-ptBR.asp?Acao=BUSCA&cboMercadoria=DOL"
resp = requests.get(url)

bs = BeautifulSoup(resp.text, "html.parser")

So far, so good. We have the HTML stored in bs. The next step is to navigate the HTML to the element we want. To be clear, I'm going to do this a little differently from the way you did, but in essence we're doing the same thing, okay?

trs = bs.body.div.div.div.form.div.div.table.tbody.tr.find('td', id='TB03')

Above, I took the raw HTML, selected the first body, within it the first div, within that the first div, and so on, until reaching the <tr> tag, inside which we specifically want the <td> that has id="TB03".

At this point, if you run print(trs), you will notice that it returns:

>>> <td id="TB03"></td>

Note that the tag is empty. This is where it all goes wrong. The part marked with >>> <<<:

    trs = (
        bs.find("td", {"id": "TB03"})
        >>>.find("table", {"class": "secondary"})<<<
        .findAll("tr")
    )

will return None, because there is nothing inside the tag, and the next call in the chain raises an error. As the message AttributeError: 'NoneType' object has no attribute 'find' itself says, you cannot call find() on None, because None has no such method. In fact, it has nothing at all.
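One way to avoid the AttributeError while debugging is to check each lookup for None before chaining the next one. The following is only a minimal sketch: the helper name find_rows and the two inline HTML strings are illustrative assumptions, modeled on the table structure from the question rather than on the real page.

```python
from bs4 import BeautifulSoup


def find_rows(html):
    """Return the <tr> tags of the 'secondary' table inside td#TB03, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")

    cell = soup.find("td", id="TB03")
    if cell is None:
        # The cell itself is missing from the downloaded HTML.
        return []

    table = cell.find("table", class_="secondary")
    if table is None:
        # The cell exists but is still an empty template slot.
        return []

    return table.find_all("tr")


# Hypothetical inputs: an unfilled template and a filled-in page.
empty_html = '<table id="principal"><tr><td id="TB03"></td></tr></table>'
filled_html = (
    '<table id="principal"><tr><td id="TB03">'
    '<table class="secondary"><tr></tr>'
    "<tr><td></td><td></td><td>R$5.388,50</td></tr>"
    "</table></td></tr></table>"
)

print(find_rows(empty_html))        # no rows: the template has no inner table
rows = find_rows(filled_html)
print(rows[1].find_all("td")[2].get_text())  # third cell of the second row
```

With the empty template the function returns an empty list instead of raising, which makes it easier to see that the data is simply not in the HTML yet.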

Why does this happen?

If you open the page you are trying to scrape with the browser's developer tools on the Network tab, you will see the page making several requests after the initial one (screenshot not available).

What happens is that the first GET request downloads an HTML template; then further requests fetch data from the server to fill in the empty slots with the information the user wants to see. That is, the HTML you get in your code does not yet contain the data you want to scrape.

How then?

Still looking at the Network tab, you can see that a GET request is made to http://cotacao.b3.com.br/mds/api/v1/DerivativeQuotation/DOL, which returns a large JSON document. I believe this is the information you are looking for. To get it from Python, just do:

url = "http://cotacao.b3.com.br/mds/api/v1/DerivativeQuotation/DOL"
resp = requests.get(url)

and the JSON will be available as raw bytes in resp.content (or already parsed through resp.json()). From there, process the data however you need; that part is up to you.
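Since the layout of that JSON is not described here, a reasonable first step is to parse it and look at its top-level keys before writing any extraction logic. This is a sketch under assumptions: fetch_json and top_level_shape are hypothetical helper names, the timeout value is arbitrary, and nothing is assumed about which keys the endpoint actually returns.

```python
import requests

# URL taken from the answer above; its response layout is not documented here.
URL = "http://cotacao.b3.com.br/mds/api/v1/DerivativeQuotation/DOL"


def fetch_json(url, timeout=10):
    """Download a URL and return its body parsed as JSON."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return resp.json()


def top_level_shape(payload):
    """Map each top-level key of a JSON object to the type name of its value."""
    if isinstance(payload, dict):
        return {key: type(value).__name__ for key, value in payload.items()}
    # Non-object payloads (for example a list) are reported by their own type.
    return {"<root>": type(payload).__name__}
```

Calling top_level_shape(fetch_json(URL)) should then show which keys are available, assuming the endpoint is still reachable, and from there you can drill down to the value you need.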

  • Thanks Yoyo, it worked perfectly. I had suspected that the problem was a delay in loading the page, but I had not thought of it this way. Your explanation helped a lot, and I have already managed to extract the data directly from the JSON. Thank you!

0

As stated in the other answer, the page first loads a template that is filled in afterwards, so requests alone cannot get the correct values.

Using requests_html

Importing the lib

from requests_html import HTMLSession

Creating the session

session = HTMLSession()

Making the request

r = session.get('http://www.bmf.com.br/bmfbovespa/pages/lumis/lum-boletim-online-new-ptBR.asp?Acao=BUSCA&cboMercadoria=DOL')

Rendering the page so that its JavaScript runs (on the first call, the library downloads Chromium, which it needs for rendering; this happens only once)

r.html.render()

Picking up the element on the page

tb03 = r.html.find('#TB03',first=True)

Printing

print(tb03.text)

  • Thanks for the help, friend. I tried it the way you described, which is not very different from what I had been doing, but it keeps returning the same error: AttributeError: 'NoneType' object has no attribute 'findAll'.

  • Send me the site's URL so I can check directly at the source.

  • I got confused while writing the question and swapped class and id in the HTML above, but I noticed and adapted the example code you gave me. OK, I will update the question with the correct URL from my code!

  • @kleytonsolinho I updated the answer, hug!
