How to collect data with web scraping in Python?


2

This URL contains several links. I need to take the links for June 2017, download those files, and combine them all into a single dataframe. But I'm stuck at this point — how can I do that? I'm trying to use the urllib library, but without success.

import urllib
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve

url = 'https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html'

# create the page variable from the URL with requests.get
page = requests.get(url)

# fetch, parse and set up a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.find_all('a')

# count all the June 2017 links on the page

totalArquivos = 0
for link in links:
    href = link.get('href')
    if href is not None and '1706' in href:
        totalArquivos += 1

print(totalArquivos)

2 answers

2

Answer (updated 21/05/19)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlretrieve

url = 'https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html'

# requests.get returns a Response object
page = requests.get(url)
if page.status_code == 200:
    print('Request successful!')
else:
    print("WARNING, request failed:", page)

# parse the response with BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

links = soup.find_all('a')
totalArquivos = 0
for link in links:
    href = link.get('href')
    if href is not None and '1706' in href:
        totalArquivos += 1

        # download the file: resolve the href against the page URL
        # (handles relative links) and keep just the file name
        filename = href.rsplit('/', 1)[-1]
        print("Starting download of %s" % filename)
        urlretrieve(urljoin(url, href), filename)
        print("%s file(s) downloaded" % totalArquivos)
  • Thank you very much for the feedback! If my answer helped with your question, mark it as correct. I recommend reading our Help Center article on how to ask a good question. After that, please post this as a new question and delete this answer. Our philosophy is to keep each question to a single scope.

  • In this case, you should print the href variable, or append the links (hrefs) to a list.
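Following that suggestion, here is a minimal sketch that collects the matching hrefs into a list instead of only counting them (using a small inline HTML snippet as a hypothetical stand-in for the real page):

```python
from bs4 import BeautifulSoup

# hypothetical snippet standing in for turnstile.html
html = '''
<a href="data/nyct/turnstile/turnstile_170603.txt">Saturday, June 03, 2017</a>
<a href="data/nyct/turnstile/turnstile_170527.txt">Saturday, May 27, 2017</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# keep every June 2017 (1706) href rather than just counting them
june_links = [a.get('href') for a in soup.find_all('a')
              if a.get('href') is not None and '1706' in a.get('href')]
print(june_links)
```

Only the `1706` link survives the filter, and the list can then drive the download loop directly.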

1

You are not actually using urllib: you only imported it, and it is never used anywhere in the code. What I do see is

requests.get(url)

which means you are trying to use the requests module. With a small change to the code, I got it running smoothly.

I simply removed the urllib imports and imported the requests module:

from bs4 import BeautifulSoup
import requests

url = 'https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html'

# create the page variable from the URL with requests.get
page = requests.get(url)

# fetch, parse and set up a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.find_all('a')

# count all the June 2017 links on the page

totalArquivos = 0
for link in links:
    href = link.get('href')
    if href is not None and '1706' in href:
        totalArquivos += 1

print(totalArquivos)


With the urllib:

To use the module you were already trying to use, you can rely on the urlopen() function (which you are already importing) and read the response with the .read() method. Starting from your original code, only a few lines change:

This

page = requests.get(url)

becomes this

page = urlopen(url)

And in place of page.text (which is also requests-specific), you must use page.read(). That is:

soup = BeautifulSoup(page.read(),'html.parser')
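Putting those two changes together, the complete urllib-only version of the counting code would look like this (the same logic as above, just without requests; it fetches the page over the network):

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html'

# urlopen returns a file-like response; .read() gives the raw HTML bytes
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')

# count the June 2017 (1706) links, exactly as in the requests version
totalArquivos = 0
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None and '1706' in href:
        totalArquivos += 1

print(totalArquivos)
```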
  • Did you copy this from somewhere?
