You can do this in several ways. As mgibsonbr mentioned, regular expressions are commonly used to extract a piece of a string; plain string manipulation is another option.
Assume the variable conteudo stores the following HTML:
conteudo = '''
<tr bgcolor="FFF8DC">
<td valign="top">25/06/2014 20:37</td>
<td valign="top">25/06/2014</td>
<td>
<a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
<br>
Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
ações de emissão da Companhia em circulação no mercado
</td>
</tr>
'''
String manipulation
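from BeautifulSoup import BeautifulSoup

def getProtocol(html):
    soup = BeautifulSoup(html)  # parse the argument, not the global conteudo
    # Text after 'AbreArquivo', e.g. ('430489');
    href = unicode(soup.a['href'].partition('AbreArquivo')[2])
    # Keep only the digit characters
    numero = [i for i in href if i.isnumeric()]
    # Join the digits before converting; int() does not accept a list
    return int(''.join(numero))

protocolo = getProtocol(conteudo)
# Make the PDF request here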
Above we use the partition method to split the string at the first occurrence of the separator (in this case AbreArquivo).
The string we want to split looks like this: Javascript:AbreArquivo('430489'); and partition('AbreArquivo')[2] returns the part after the separator: ('430489');
A list named numero is then created to hold only the digits: we go through the string character by character and, whenever a character is a digit, add it to the list.
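The partition step can be tried on its own, without parsing the HTML. The following is a minimal sketch in Python 3 using only the standard library; the names before, sep, after and digits are illustrative and not part of the original answer.

```python
# The href value taken from the <a> tag in the sample HTML above.
href = "Javascript:AbreArquivo('430489');"

# partition splits at the first occurrence of the separator and returns
# a 3-tuple: (text before, the separator itself, text after).
before, sep, after = href.partition('AbreArquivo')
print(before)  # Javascript:
print(after)   # ('430489');

# Keep only the digit characters, then join them into a single number.
digits = [c for c in after if c.isdigit()]
protocolo = int(''.join(digits))
print(protocolo)  # 430489
```

If the separator is not found, partition returns the whole string as the first element and two empty strings, so the digit list would be empty and int('') would raise ValueError.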
Regular Expressions
To extract a number you can use the expression \d+ or [0-9]+ to capture one or more digits.
from BeautifulSoup import BeautifulSoup
import re

def getProtocol(html):
    soup = BeautifulSoup(html)
    href = soup.a['href']
    # First run of digits found in the href
    numero = re.findall(r'\d+', href)[0]
    return int(numero)

protocolo = getProtocol(conteudo)
# Make the PDF request here
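To see what the two patterns return, they can be applied directly to the href string. This is a small Python 3 sketch using only the standard library; the second input string is made up for illustration and does not come from the original page.

```python
import re

href = "Javascript:AbreArquivo('430489');"

# For this input, \d+ and [0-9]+ match the same run of digits.
with_d = re.findall(r'\d+', href)
with_class = re.findall(r'[0-9]+', href)
print(with_d)      # ['430489']
print(with_class)  # ['430489']

# findall returns every run of digits, which is why [0] is used
# to pick the first one. Hypothetical input with two numbers:
several = re.findall(r'\d+', "AbreArquivo('430489', '12');")
print(several)     # ['430489', '12']
```

In Python 3, \d on a str pattern also matches non-ASCII Unicode digits, whereas [0-9] matches only ASCII digits; for this href the difference does not matter.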
Note that if the content to be processed comes in a different format, you will probably have to adapt the string handling or the expression.
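One way to make the extraction less sensitive to format changes is to anchor the expression on the function name and handle the no-match case explicitly. The sketch below is Python 3 with the standard library only; the helper name extract_protocol, the constant PROTOCOL_RE, and the choice to return None on failure are assumptions for illustration, not part of the original answer.

```python
import re

# Assumed pattern: digits inside AbreArquivo('...'). Adjust it if the
# page uses different quoting or another function name.
PROTOCOL_RE = re.compile(r"AbreArquivo\('(\d+)'\)")

def extract_protocol(href):
    """Return the protocol number as an int, or None if href has another shape."""
    match = PROTOCOL_RE.search(href)
    if match is None:
        return None
    return int(match.group(1))

print(extract_protocol("Javascript:AbreArquivo('430489');"))  # 430489
print(extract_protocol("Javascript:OutraFuncao(2014);"))      # None
```

Returning None lets the caller decide whether to skip the row or report an error, instead of failing with an IndexError when no digits are found.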
Rodrigues, this is an old question, but did any of the answers solve the problem?
– stderr