Increment URL via concatenation and using urllib2.urlopen

I’m using the code below to scrape the email addresses that, as far as I can tell, are spread across three URLs that are identical except for the Numconsulta_cadastro= parameter. When I run the code below, it only gets the emails from the first page, repeated three times.

from bs4 import BeautifulSoup
import urllib2
import re

num = 0
while num < 3:
    strnum = str(num)
    html_page = urllib2.urlopen("http://www.fiepb.com.br/industria/pesquisa.php?page=Numconsulta_cadastro=" + strnum + "&totalRows_consulta_cadastro=3372&empresa=&cidade=&atividade=&produto=&materiaprima=&classificador=RAZAOSOCIAL&dados=on&Submit=Enviar+Consulta")
    num += 1
    soup = BeautifulSoup(html_page)
    for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
        print link.get('href')
  • Have you tried hard-coding the repeated links instead of using a loop with varying values?

  • Hello bigown, thank you for your attention. There are 68 links. How would I do that?

  • Did you manage to solve your problem?

1 answer

I read your "question" as "how can I generalize the code below?" The version that follows simplifies your code using str.format(), a str method that fills numbered placeholders in a string, such as "{0}", with the arguments you pass to it (the number inside the braces is the position of the argument).
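For instance, a minimal illustration (the URL and value here are made up, just to show the substitution):

pagina = 2
url = "http://example.com/pesquisa.php?pagina={0}".format(pagina)
print url  # prints http://example.com/pesquisa.php?pagina=2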

You should also avoid the while-plus-counter structure, where a while loop runs as long as a manually incremented variable stays below a limit. It is like using a Ferrari for a pizza delivery service: theoretically possible, but a beautiful (and expensive) waste of resources. Instead, use the "for x in range(limit):" structure, where Python generates every integer from 0 up to (but not including) limit, binds the current value to x, and runs the indented block once for each value, in order.
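As a quick side-by-side (a hypothetical snippet just to show that the two forms do the same thing):

# while with a manual counter
num = 0
while num < 3:
    print num
    num += 1

# the idiomatic for/range equivalent
for num in range(3):
    print num

Both print 0, 1 and 2.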

Finally, define your constants at the beginning of the code so they are easy to change later; this improves readability and greatly simplifies maintenance.

from bs4 import BeautifulSoup
import urllib2
import re
WEBSERVICE = "http://www.fiepb.com.br/industria/pesquisa.php"
QUERY_VARIAVEL = "page=Numconsulta_cadastro="
QUERY_CONSTANTE = "&totalRows_consulta_cadastro=3372&empresa=&cidade=&atividade=&produto=&materiaprima=&classificador=RAZAOSOCIAL&dados=on&Submit=Enviar+Consulta"

for email_num in range(3):
    html_page = urllib2.urlopen("{0}?{1}{2}{3}".format(WEBSERVICE,
                                                       QUERY_VARIAVEL,
                                                       email_num,
                                                       QUERY_CONSTANTE))

    soup = BeautifulSoup(html_page)

    for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
        print link.get('href')

PRO tip: take a look at the requests module :)
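For example, a minimal sketch of the same request with it (the parameter names mirror the original URL, except that the odd "page=Numconsulta_cadastro=" pair is simplified to a plain Numconsulta_cadastro parameter and the empty fields are omitted; whether the site accepts the query in this form is an assumption):

import requests

# requests builds and URL-encodes the query string from a dict
params = {
    "Numconsulta_cadastro": 0,
    "totalRows_consulta_cadastro": 3372,
    "classificador": "RAZAOSOCIAL",
    "dados": "on",
    "Submit": "Enviar Consulta",
}
resposta = requests.get("http://www.fiepb.com.br/industria/pesquisa.php", params=params)
print resposta.text  # raw HTML, ready to feed into BeautifulSoup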
