Extract data from a calendar with Python and BeautifulSoup (on an Ubuntu-like Linux)

Friends,

I’d like to take data from a calendar:

http://www.purebhakti.com/component/panjika

The first step would be to make the program choose the time zone ( -3:00 Buenos Aires) and click on Submit Time Zone.

After clicking on Submit Time Zone, select the city (Rio de Janeiro) and click on Get Calendar.

Only after these steps do I actually get access to the calendar and can start thinking about extracting the information.

I'd like to grab the event of the day:

For example, today is the 22nd, so print:

22 Apr 2017 : Ekādaśī, K, 06:09, Śatabhiṣā

+ŚUDDHA EKĀDAŚĪ VRATA: FASTING FOR Varūthinī EKADASI

I thought about using Python and BeautifulSoup, but I'm open to suggestions.

Question: How do I get the program to reach the calendar (selecting the time zone and city automatically)?

I couldn't get any further than this:

from bs4 import BeautifulSoup
import requests

url = 'http://www.purebhakti.com/component/panjika'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}

req = requests.get(url, headers=header)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

  • Do you have some initial code?

  • Only this. I had forgotten to post it.

  • It's already a start. Otherwise it can sound like "do it for me", you know? A Python fan will show up shortly...

  • Sorry, it was my mistake.

  • I don’t think it’s complex, but I really don’t know! Thanks for the contribution!

1 answer

Try this:

import requests, time
from bs4 import BeautifulSoup as bs

url_post = 'http://www.purebhakti.com/component/panjika'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
payload = {'action': 2, 'timezone': 23, 'location': 'Rio de Janeiro, Brazil        043W15 22S54     -3.00'}

req = requests.post(url_post, headers=headers, data=payload)
soup = bs(req.text, 'html.parser')
eles = soup.select('tr td')
dates = (' '.join(d.select('b')[0].text.strip().split()) for d in eles if d.has_attr('class'))
events = (' '.join(d.text.split()) for d in eles if not d.has_attr('class'))
calendar = dict(zip(dates, events))

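# Build today's key in the same "DD Mon YYYY" format the calendar uses
data_hoje = time.strftime("%d %b %Y", time.gmtime())
# setdefault returns the stored event, or stores and returns the fallback text
print(calendar.setdefault(data_hoje, 'no event for today'))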

Output from last print (today, 22 Feb 2017):

Ekādaśī, K, 05:46, Purvāṣāḍhā +ŚUDDHA EKĀDAŚĪ VRATA: FASTING FOR Vijaya EKADASI

We have to pay close attention to the HTML elements we want. In this case we want the <td> elements: if a <td> has a class attribute it is a date (a dictionary key); otherwise it is the event (the corresponding value).

The keys of our dictionary are therefore the text inside the <b>, which in turn is inside a <td> that has the class attribute.
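The pairing of dates and events described above can be tried offline. This is only a sketch: the sample markup, the `date` class name, and the helper `build_calendar` are assumptions made for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described in the answer;
# the actual page may use different tags, classes, or nesting.
SAMPLE_HTML = """
<table>
  <tr><td class="date"><b>21 Feb 2017</b></td><td>Daśamī, K, 05:47</td></tr>
  <tr><td class="date"><b>22  Feb 2017</b></td><td>Ekādaśī, K,
      05:46</td></tr>
</table>
"""


def build_calendar(html):
    """Map each date cell (a td with a class) to the event cell that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.select("tr td")
    # split() + join() collapses runs of whitespace and line breaks into one space
    dates = (" ".join(td.b.get_text().split()) for td in cells if td.has_attr("class"))
    events = (" ".join(td.get_text().split()) for td in cells if not td.has_attr("class"))
    return dict(zip(dates, events))


calendar = build_calendar(SAMPLE_HTML)
print(calendar.get("22 Feb 2017", "no event for today"))
```

Note that `zip` pairs the two sequences purely by position, so this only works if every date cell is followed by exactly one event cell.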

  • Thank you very much. How do I print it, as in print(calendar['today's date'])? I tried with from datetime import datetime but it did not work!

  • Ah, I'll look at that @Eds, just a sec

  • Thank you very much. Sorry for the trouble.

  • No problem, I should have read that detail in the question @Eds, but it is wrong, because today is 22 Feb, not 22 Apr :P

  • Very good. Thank you. I will study the code and then mark as solved.

  • OK @Eds, no problem. I'm glad you solved it. The important thing is to really understand the HTML elements you want and how to extract the information from them

  • @Eds, I edited the last part because I think it looks better... with setdefault: https://www.tutorialspoint.com/python/dictionary_setdefault.htm

  • On 10 Apr 2017 several events take place. How can I print each one on a separate line?

  • @Eds instead of ' '.join(d.text.split()), use just d.text

  • I am trying to modify the following: imagine that I want to print the event that will take place in X days, for example, 4 days. future = date.fromordinal(hj.toordinal()+4). The problem is that this code generates a date like this: 2017-02-26. What should I do?

  • @Eds that’s a separate question. Now I can’t help because I’m not on the computer. But try, time.strftime("%d %b %Y", time.gmtime(time.time() + (3600 * 24 * 4))), for 4 more days from today
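
The date question from the last comments can also be handled with the datetime module, which keeps the day arithmetic explicit. This is a minimal sketch; the helper names `calendar_key` and `key_in_days` are made up for illustration, and it assumes the calendar keys look like "26 Feb 2017".

```python
from datetime import date, timedelta


def calendar_key(day):
    """Format a date like the calendar's keys, e.g. '26 Feb 2017'.

    %b uses the current locale's month abbreviations, so this assumes the
    default (English) locale is in effect.
    """
    return day.strftime("%d %b %Y")


def key_in_days(days, today=None):
    """Return the calendar key for `days` days after `today` (default: the local date)."""
    if today is None:
        today = date.today()
    return calendar_key(today + timedelta(days=days))


# A fixed starting date keeps the example reproducible.
print(key_in_days(4, today=date(2017, 2, 22)))
```

A date object prints as 2017-02-26 by default; calling strftime on it is what produces the string that can be used as a dictionary key.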
