This URL contains several links. I need to take the links for June 2017, download those files, and combine them all into a single dataframe. I'm stuck at this part — how can I do that? I'm trying to use the urllib library, but without success.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

url = 'https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html'
# Fetch the page with requests.get
page = requests.get(url)
# Parse the response into a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
links = soup.find_all('a')
# Count all the June 2017 links on the page (filenames contain '1706')
totalArquivos = 0
for link in links:
    href = link.get('href')
    if href is not None and '1706' in href:
        totalArquivos += 1
print(totalArquivos)
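Building on the loop above, here is a minimal sketch of the missing step: collect the June 2017 hrefs into a list, read each file with pandas, and combine them with pd.concat. The inline HTML and CSV samples below are stand-ins for the real page and the downloaded turnstile files, which are assumed to be comma-separated:

```python
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

# Stand-in for page.text; the real page links to many turnstile files.
html = """
<a href="turnstile_170603.txt">June 03</a>
<a href="turnstile_170610.txt">June 10</a>
<a href="turnstile_170506.txt">May 06</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep only the June 2017 links ('1706' in the filename).
june_links = [a.get('href') for a in soup.find_all('a')
              if a.get('href') is not None and '1706' in a.get('href')]

# In the real script each link would be fetched over the network, e.g.:
#   frames = [pd.read_csv(base_url + href) for href in june_links]
# Here two small CSV samples stand in for the downloaded files.
sample_files = {
    'turnstile_170603.txt': StringIO("station,entries\nA,100\nB,200"),
    'turnstile_170610.txt': StringIO("station,entries\nA,150\nB,250"),
}
frames = [pd.read_csv(sample_files[href]) for href in june_links]

# One dataframe with all the files combined.
df = pd.concat(frames, ignore_index=True)
print(len(june_links), len(df))
```

With the real page you would prepend the S3 base URL to each href before calling pd.read_csv, since the anchors hold relative filenames.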
Thank you very much for the feedback! If my answer helped you with your question, mark it as correct. I recommend reading our Help Center article on how to ask a good question. After that, please post this as a new question and delete this answer — our philosophy is to limit each question to a single scope. – Breno
In this case, you should print the variable href, or add the links (hrefs) to a list. – Breno