Request Wikipedia dump URLs by date


Hello, a good friend wrote this code for me; I am very new to Python:

from bs4 import BeautifulSoup
import requests

url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-04/"

page_html = requests.get(url).content

soup = BeautifulSoup(page_html, "html5lib")

links = [ (url+link['href'])
          for link in soup.find_all('a')
          if "pageviews-" in link['href'] ]

for link in links: print(link)

It lists the Wikipedia dumps perfectly, but I need to filter them by file date: I want to pass a date range and have it list only the files within that range.

Follow the wikipedia link https://dumps.wikimedia.org/other/pageviews/2018/2018-04/

Can someone help me?


  • From what I see, the listings are organized by month. Do you want a range of months, i.e. to get every listing for every month?

  • In fact they are by date and time. I need to fetch files for a date and time range that I pass in, for example between 2018-09-04 12:00 and 2018-09-04 12:59. @miguel, I put examples in the question.

  • True, I’ll try to help. You are using Python 3, right? But you put python-2.7 in the tags...

  • Wow, thanks a lot! Yes, I’m using Python 3, but this code will go into an AWS Lambda function, which I believe only supports 2.7. I’m pretty new at this.

  • I just checked: Lambda functions accept Python 3.6 :)

2 answers


In the example below I use datetime to list only today’s files.

from bs4 import BeautifulSoup
import requests
import datetime

url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-04/"
proxies = {
    'http': 'http://10.1.101.101:8080',
    'https': 'http://10.1.101.101:8080',
    'ftp': 'http://10.1.101.101:8080',

}

page_html = requests.get(url, proxies=proxies).content

soup = BeautifulSoup(page_html, "html5lib")

links = [ (url+link['href'])
          for link in soup.find_all('a')
          if "pageviews-" in link['href']
          and datetime.datetime.now().strftime('%Y%m%d') in link['href']]

for link in links: print(link)
  • What is the reason for listing these proxies? A VPN? I guess this won’t work for someone who doesn’t have the same proxy configured.

  • It’s only because my network needs a proxy to access the internet. It can be removed without problems by anyone who doesn’t have this requirement.

  • Nice!!! I’ll try to restrict it to a date and time range here :)

  • Thank you very much, man. I’ll try to pass a start date/time and an end date/time and bring back only the files within that range.

  • A question, @britodfbr: this script will run hourly. If I generate an array with all the times and minutes of the last hour and match it against link['href'] where you put datetime.now(), would that be the best solution? Or would it even work? Sorry to impose; I’ve only been reading and taking courses for 3 days :)

  • Haha, I went through this too, and the people here helped me a lot. That’s why I dedicate some time to helping those in need, to pay my debt to the community. I’ll see what I can do...

  • @Laertejunior replace the third condition of the comprehension with: and datetime.datetime.now().strftime('%Y%m%d-%H0000') in link['href']. Since you will run it hourly, and the pattern appears to be HHMMSS, this change will work.

  • Dude, it worked, but it’s bringing the current hour. For example, running now at 2:15 it brought the latest file, which was 2:00; I would need the files from 1:00 to 1:59.

  • Won’t you run it every hour on the hour? 13:15 brings 13:00, 14:15 brings 14:00, 15:15 brings 15:00, and so on.

  • Yes, it will run hourly, but it needs to list the previous hour. Otherwise a new file may appear on Wikipedia, say at 14:20, that the 14:00 run doesn’t catch, and the run at 15:00 (say 15:05 by the time the script executes) won’t catch it either, you see?
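  • To illustrate what you describe, here is a minimal sketch (the function name `previous_hour_prefix` is my own, not from the answer): instead of matching datetime.now() against the filename, subtract one hour first and build the prefix from that, assuming the dump filenames follow the pageviews-YYYYMMDD-HH0000 pattern shown in the listing.

```python
from datetime import datetime, timedelta

def previous_hour_prefix(now=None):
    """Build the filename prefix for the previous hour's dump,
    assuming names follow the pattern pageviews-YYYYMMDD-HH0000."""
    now = now or datetime.now()
    # Truncate to the top of the hour, then step back one hour
    prev = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return prev.strftime('pageviews-%Y%m%d-%H0000')

# Running at 14:15 should select the 13:00 file:
print(previous_hour_prefix(datetime(2018, 4, 2, 14, 15)))
# pageviews-20180402-130000
```

You would then keep only the scraped hrefs that contain this prefix instead of the datetime.now() string.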




I took the liberty of using requests_html instead of BeautifulSoup, but the logic can easily be adapted:

from requests_html import HTMLSession
from datetime import datetime
import re

url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-04/"
session = HTMLSession()
r = session.get(url)

pre = r.html.find('pre', first=True)

data_inicial = datetime(2018, 4, 2)
data_final = datetime(2018, 4, 5)

for link in sorted(pre.links):

    # Try to isolate the number between two hyphens. If we can't, it's not one of the links we want
    try:
        date_str = re.search(r"-(\d+)-", link).group(1)
    except AttributeError:
        continue  # Skip to the next iteration of the for loop

    # Turn the number between the two hyphens into a datetime object
    data_link = datetime.strptime(date_str, "%Y%m%d")

    # Print the full link only if it falls between the start and end dates
    if data_inicial <= data_link <= data_final:
        print(url + link)

If we want to include the time and list only the pageview files, we would have:

from requests_html import HTMLSession
from datetime import datetime
import re

url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-04/"
session = HTMLSession()
r = session.get(url)

pre = r.html.find('pre', first=True)

data_inicial = datetime(2018, 4, 2, 13, 0) # 02/04/2018, 13:00
data_final = datetime(2018, 4, 5, 9, 0) # 05/04/2018, 9:00

for link in sorted(pre.links):

    # `link` does not include the base of the URL; it is something like `projectviews-20180402-150000`

    # Try to isolate the number between two hyphens. If we can't, it's not one of the links we want
    try:
        date_str = re.search(r"-(\d+-\d+)", link).group(1)
        # Regex: find and capture (parentheses) one or more digits (\d+) followed by a hyphen and one or
        # more digits. Example: in `pageviews-20180405-220000` the substring `20180405-220000` (the pattern
        # inside the capture group) is captured
    except AttributeError:
        continue  # Skip to the next iteration of the for loop

    # Turn the number between the two hyphens into a datetime object with a time component
    data_link = datetime.strptime(date_str, "%Y%m%d-%H%M%S")

    # Print the full link only if it falls between the start and end dates and contains the string "pageviews"
    if data_inicial <= data_link <= data_final and "pageviews" in link:
        print(url + link)
  • Thank you very much, Pedro.

  • Your code worked fine too. How do I keep only the pageviews files, and add hour and minute to the start and end dates?

  • You killed me with the regex lol...

  • @Laertejunior heheh, in fact the regex became more complicated than it needed to be; it could just be "-(\d+)-", that is: find and capture (parentheses) one or more digits (\d+) surrounded by hyphens. I am updating the answer to fix this and to add the hour-and-minute example.

  • Thank you very much !!!!!!!!!

  • You’re welcome! Don’t forget to select an answer as correct to mark the question as resolved.
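  • To make the two regexes discussed above concrete, here is a small sketch (the sample filename is taken from the answer; everything else is illustrative) showing what each pattern captures:

```python
import re

# A dump filename as it appears in the directory listing
link = "pageviews-20180405-220000"

# "-(\d+)-" captures only the digits enclosed by two hyphens: the date part
date_only = re.search(r"-(\d+)-", link).group(1)

# "-(\d+-\d+)" captures digits, a hyphen, and more digits: date and time
date_time = re.search(r"-(\d+-\d+)", link).group(1)

print(date_only)  # 20180405
print(date_time)  # 20180405-220000
```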

