Web scraping with Python


Good evening. I want to write a simple script to pull data from a website (http://www.riooilgas.com.br/?_page=programacao&_menu=programacao). I've written some code so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.riooilgas.com.br/?_page=programacao&_menu=programacao")
res = BeautifulSoup(html.read(), "html5lib")
tags = res.findAll(text="Evento Paralelo - O&G Techweek")
print(tags)

I just want the day, the time, and the event name for each "Evento Paralelo - O&G Techweek" entry. There are 72 rows; I'd like to print them to an Excel file, something simple.

Can someone help me?

Thank you

1 answer

Unfortunately, the elements you are looking for are not in the HTML source of the site; they are generated dynamically by JavaScript after the page loads in a browser. Since BeautifulSoup does not execute JavaScript, you cannot extract this data directly the way your code attempts to.

One option for this type of site is to analyze the page's JavaScript, figure out what it does, and "simulate" it with hand-written Python code. This solution is usually more efficient, but much more complex to implement.

In the specific case of the site you mentioned, the data appears to be embedded in the JavaScript itself, as you can see here:

>>> import requests
>>> r = requests.get('http://assets.tuut.com.br/rog-pages/public/script-programacao-main.js?v=23')
>>> data = r.text
>>> data[30:100]
'po de Evento":"Congresso",Bloco:"",Categoria:"Credenciamento","Hor\xe1rio'

As you can see, it's a format similar to JSON, but not exactly JSON: it's JavaScript variable definitions containing the data. Python's built-in json module would not work here because the text is not valid JSON. Fortunately, there is the demjson module, which can decode JSON-like formats such as this one. Using demjson:

>>> d1 = data[data.find('['):data.find(']')+1]
>>> import demjson
>>> eventos = demjson.decode(d1)

Now we have a Python object (a list) containing the events, one per element:

>>> for evento in eventos:
...     print(evento['Nome do evento'], 'as', evento['Horário'], 'em', evento['Lugar'])
Credenciamento as 8:00 às 17:00 em Pavilhão 1
Cerimônia de Abertura as 9:30 às 11:00 em Pavilhão 5
SP 1: A nova geopolítica do petróleo e gás as 11:10 às 12:10 em Pavilhão 5
Os desafios e oportunidades do setor de Upstream num mundo em Transição Energética as 12:25 às 13:40 em Pavilhão 5
SE 01: 40 anos da Bacia de Campos: o que vem pela frente as 14:00 às 16:00 em Pavilhão 5
SE 02: Comércio irregular de combustíveis e seus impactos – programa Combustível Legal as 14:00 às 16:00 em Pavilhão 5
...
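Since the original question asked for the results in Excel, the decoded list can be filtered and saved as a CSV file, which Excel opens directly. This is a sketch under assumptions: the `"Tipo de Evento"` key is inferred from the snippet shown above, and the small `eventos` list here is sample data standing in for the real demjson output.

```python
import csv

# Sample rows in the same shape as the demjson-decoded list above.
# The "Tipo de Evento" key is an assumption based on the earlier
# snippet, not a confirmed field name.
eventos = [
    {"Nome do evento": "Credenciamento", "Horário": "8:00 às 17:00",
     "Lugar": "Pavilhão 1", "Tipo de Evento": "Congresso"},
    {"Nome do evento": "Painel X", "Horário": "14:00 às 16:00",
     "Lugar": "Pavilhão 5", "Tipo de Evento": "Evento Paralelo - O&G Techweek"},
]

# Keep only the parallel events and write them to a CSV file.
with open("eventos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Nome do evento", "Horário", "Lugar"])
    for evento in eventos:
        if evento.get("Tipo de Evento") == "Evento Paralelo - O&G Techweek":
            writer.writerow([evento["Nome do evento"],
                             evento["Horário"],
                             evento["Lugar"]])
```

If you need a real `.xlsx` file rather than CSV, a library such as openpyxl would do the same job, but CSV is usually enough for "something simple".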

As you can see, it was easy to extract the data from this site, since it already came structured in an organized way inside the JavaScript code. But it's not always that easy: nowadays, dynamic websites with increasingly confusing and obscure JavaScript are more and more common. That is where another alternative for scraping this type of site comes in: Selenium. Selenium is a library that lets you control a real browser, such as Chrome or Firefox, from Python, so the page's JavaScript actually runs. It is much less efficient, though, because you are running an entire browser.
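For completeness, the Selenium approach could be sketched roughly as below. This is an untested illustration, not the answerer's code: it assumes Selenium is installed and a chromedriver is available on the PATH, and the helper name `fetch_rendered_html` is made up for this example.

```python
def fetch_rendered_html(url):
    """Load `url` in a real browser and return the HTML after the
    page's JavaScript has run. Requires selenium plus a chromedriver
    on the PATH; sketch only, field names and setup may vary."""
    # Imported inside the function since selenium is an optional,
    # heavyweight dependency for this approach.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source is the HTML as rendered after JS execution,
        # so BeautifulSoup can now find the dynamic elements.
        return driver.page_source
    finally:
        driver.quit()

# Usage (opens a real Chrome window):
# html = fetch_rendered_html(
#     "http://www.riooilgas.com.br/?_page=programacao&_menu=programacao")
# res = BeautifulSoup(html, "html5lib")
```

For this particular site, though, the demjson approach above is far lighter, since no browser is needed.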

  • Thank you so much! I had seen Selenium, but I thought bs4 would be better. I will test it here! Thanks!
