Difficulty with web scraping

Viewed 533 times

4

<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>

Using the BeautifulSoup library, I can read the following page:

http://siteempresas.bovespa.com.br/consbov/ExibeFatosRelevantesCvm.asp?site=C

I am having trouble extracting the protocol number '430489' from the HTML above using Python. This number will be used to download a PDF. I want to write a function that takes this number as an argument and automatically downloads the PDF to my Mac.

  • Rodrigues, this is an old question, but did any of the answers solve the problem?

3 answers

3

There are many ways to do this. As mgibsonbr mentioned, you can manipulate the string directly to extract a piece of it; regular expressions are also commonly used for this purpose.

Assume the variable conteudo stores the following HTML:

conteudo = '''
<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>
'''

String manipulation

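from bs4 import BeautifulSoup

def getProtocol(html):
   # Parse the HTML passed as argument
   soup = BeautifulSoup(html, 'html.parser')
   # Keep everything after 'AbreArquivo', e.g. ('430489');
   href = soup.a['href'].partition('AbreArquivo')[2]

   # Collect the digit characters, then join them into a single number
   numero = [i for i in href if i.isdigit()]
   return int(''.join(numero))

protocolo = getProtocol(conteudo)
# Request the PDF here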

Above, we use the partition method to split the string at the first occurrence of the separator (in this case, AbreArquivo).

The string we want to split looks like this: Javascript:AbreArquivo('430489');. Using partition('AbreArquivo')[2] gives us: ('430489');

A list named numero is then created to hold only the digits: we traverse the string character by character, check whether each character is a digit, and add it to the list if so.
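To make each step concrete, here is a small sketch using only the standard library. The href value is the one from the question; the variable names are chosen just for this illustration.

```python
# The href value taken from the question's HTML.
href = "Javascript:AbreArquivo('430489');"

# partition returns a 3-tuple: (before, separator, after).
antes, separador, depois = href.partition('AbreArquivo')
print(antes)      # Javascript:
print(separador)  # AbreArquivo
print(depois)     # ('430489');

# Keep only the digit characters, then join them into one number.
digitos = [c for c in depois if c.isdigit()]
protocolo = int(''.join(digitos))
print(protocolo)  # 430489
```

Note that filtering digits this way would also pick up any other digit that happened to appear after AbreArquivo, so it relies on the href having exactly this format.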

Regular Expressions

To extract a number, you can use the expression \d+ or [0-9]+, which matches one or more digits.

from bs4 import BeautifulSoup
import re

def getProtocol(html):
   soup = BeautifulSoup(html, 'html.parser')
   href = soup.a['href']

   # Take the first run of digits found in the href
   numero = re.findall(r'\d+', href)[0]
   return int(numero)

protocolo = getProtocol(conteudo)
# Request the PDF here

Note that if the content comes in a different format, you will probably need to adapt the string handling or the regular expression.
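As one possible adaptation, the sketch below ties the expression to the AbreArquivo('...') call, so that unrelated digits elsewhere in the href are not picked up. The names PADRAO and extrair_protocolo are made up for this illustration, and returning None on a mismatch is just one reasonable choice.

```python
import re

# Hypothetical pattern: capture digits only inside AbreArquivo('...').
PADRAO = re.compile(r"AbreArquivo\('(\d+)'\)")

def extrair_protocolo(href):
    """Return the protocol number as an int, or None if the format differs."""
    m = PADRAO.search(href)
    return int(m.group(1)) if m else None

print(extrair_protocolo("Javascript:AbreArquivo('430489');"))  # 430489
print(extrair_protocolo("Javascript:OutraFuncao(123);"))       # None
```

Checking for None before requesting the PDF makes a change in the page format show up as an explicit case rather than as a wrong number.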

2

I'm assuming you already have a reference to the desired <a> element and can extract the content of its href attribute (if I'm wrong about these assumptions, please add more details to the question). The problem then boils down to extracting the number 430489 from the string Javascript:AbreArquivo('430489');, right?

There is no general solution to this, since the href could in principle contain any valid JavaScript. However, if you know that your HTML will always come in this format, a simple substring operation is enough to extract the desired part:

href = soup.tr.a['href']
arq_str = href[len("Javascript:AbreArquivo('") : -len("');")]
arq_int = int(arq_str)

If you are not familiar with the substring (slice) operation, x[inicio:fim] creates a new string/list starting at position inicio and ending just before position fim. If fim is negative, it counts from the end of the string (i.e. it is equivalent to len(x) + fim).

Setting inicio = len(prefixo) and fim = -len(sufixo) ensures that only the "middle" is selected, without relying on "magic numbers". Then just convert it to a number, if applicable.
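A small sketch of the slicing described above. The startswith/endswith check and the extrair helper are additions for illustration only, assuming the href always follows the format shown in the question.

```python
href = "Javascript:AbreArquivo('430489');"
prefixo = "Javascript:AbreArquivo('"
sufixo = "');"

# A negative end index counts from the end of the string:
# -len(sufixo) selects the same position as len(href) - len(sufixo).
assert href[len(prefixo):-len(sufixo)] == href[len(prefixo):len(href) - len(sufixo)]

def extrair(href):
    # Hypothetical guard: only slice when the expected prefix and suffix are present.
    if href.startswith(prefixo) and href.endswith(sufixo):
        return int(href[len(prefixo):-len(sufixo)])
    raise ValueError("unexpected href format: %r" % href)

print(extrair(href))  # 430489
```

Without the guard, an href in a different format would be sliced silently and int() would either fail or return an unrelated number.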

0

Using only BeautifulSoup

from bs4 import BeautifulSoup

conteudo = '''
<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>
'''
soup = BeautifulSoup(conteudo, 'html.parser')
# Select the first <a> whose href contains "AbreArquivo", then take the text between the single quotes
print(soup.select('a[href*="AbreArquivo"]')[0]['href'].split("'")[1])

# 430489
