Scrape list on website with beautifulsoup

Question

Asked 7 years, 11 months ago

Viewed 275 times

1

I need to scrape a list from a website in Python, but only the first list. My code looks like this:

import requests

from bs4 import BeautifulSoup

page = requests.get("http://www25.senado.leg.br/web/atividade/materias/-/materia/votacao/2363507")

soup = BeautifulSoup(page.content, 'html.parser')

lista = soup.find_all('ul' , class_='unstyled')

This scrapes all the lists. I only want to scrape the voting list with the description "Roll call vote, first round, PEC nº 55/2016, amending the Act of Transitional Constitutional Provisions to institute the New Tax Regime, and making other provisions (Public Spending Ceiling)."

But all the lists have the tag ul and the class unstyled. Does anyone know how to tell the lists apart?

I searched a bit more and read the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

I believe this is it: lists = soup.find_all('ul', class_='unstyled', limit=2)
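
As a minimal sketch of what limit=2 does, the snippet below runs the same find_all call against a small hand-written HTML string. The markup, headings, and senator names are hypothetical stand-ins, not the real page; the only thing taken from the question is that every list is a ul with the class unstyled.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: every list shares
# the same tag and class, as described in the question.
SAMPLE_HTML = """
<h3>Roll call vote, first round, PEC 55/2016</h3>
<ul class="unstyled"><li>Senator A - Yes</li><li>Senator B - No</li></ul>
<ul class="unstyled"><li>Senator C - Yes</li></ul>
<h3>Another vote</h3>
<ul class="unstyled"><li>Senator A - No</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# limit=2 stops the search after the first two matching <ul> elements.
first_lists = soup.find_all("ul", class_="unstyled", limit=2)

# Flatten the <li> text of both columns into one list of strings.
votes = [li.get_text(strip=True) for ul in first_lists for li in ul.find_all("li")]
print(votes)
```

This only works while the wanted lists are the first two matches in document order; if the page layout changes, the limit picks up the wrong lists.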

What data need to be extracted from this page?

– Miguel

2017/08/16 at 10:28
The first voting list (Roll call vote, first round, PEC nº 55/2016), with the names of the senators and how they voted. I got it with the command lists = soup.find_all('ul', class_='unstyled', limit=2). Now I will write the code to save this to a CSV file.

– Reinaldo Chaves

2017/08/17 at 11:10
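
For the CSV step mentioned above, one possible sketch uses the standard csv module. It assumes the scraped text has already been split into (senator, vote) pairs; the helper name votes_to_csv, the column names, and the sample rows are hypothetical, since the thread does not show how names and votes are laid out on the page.

```python
import csv
import io


def votes_to_csv(rows, out):
    """Write (senator, vote) pairs to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["senator", "vote"])  # header row (assumed column names)
    writer.writerows(rows)


# Hypothetical rows; in practice these would come from the scraped <li> text.
rows = [("Senator A", "Yes"), ("Senator B", "No")]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
votes_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function can be given a file opened with open(path, "w", newline="", encoding="utf-8"), where path is whatever file name you choose.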

1 answer

by Guilherme IA • **1,414** points · Answer 1 · 2017-08-23T19:41:55+00:00

0

Have you already tried using Selenium?

pip install selenium
brew install phantomjs

Here is code that does what you need.

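from selenium import webdriver

# Start a headless PhantomJS browser and load the voting page.
# Note: this uses the Selenium API of the time; newer Selenium releases
# dropped PhantomJS support and the find_elements_by_* helpers.
browser = webdriver.PhantomJS()
browser.get("http://www25.senado.leg.br/web/atividade/materias/-/materia/votacao/2363507")

# Every list on the page is a <ul class="unstyled">; collect all of them.
list_senadores = browser.find_elements_by_xpath(".//ul[@class='unstyled']")

# The first two lists are the two columns of the first vote.
print("First column")
for lis in list_senadores[0].find_elements_by_css_selector("li"):
    print(lis.text)

print("Second column")
for lis in list_senadores[1].find_elements_by_css_selector("li"):
    print(lis.text)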

Thanks, I didn't know about it. Is the "brew install phantomjs" command for installing PhantomJS? Is there something I need to install first? The prompt shows: 'brew' is not recognized as an internal or external command, operable program or batch file.

– Reinaldo Chaves

2017/08/23 at 22:43
I got it working, thanks. But first:
I installed PhantomJS (http://phantomjs.org/)
And set the path:
path_to_phantomjs = '/Users/Reinaldo/Documents/phantomjs-2.1.1-windows/bin/phantomjs'
browser = webdriver.PhantomJS(executable_path = path_to_phantomjs)

– Reinaldo Chaves

2017/08/23 at 23:26
1

If you install it with npm, you don't need to pass the phantomjs path, but I believe this way works too! Cheers

– Guilherme IA

2017/08/24 at 13:54
Thank you. One more thing, please. I am looking at the documentation (http://selenium-python.readthedocs.io/locating-elements.html). I need to scrape the first voting list, for PEC nº 55/2016, which has two columns of senators (from 0 to 40 and from 41 to 80). The command list_senators = browser.find_element_by_xpath(".//ul[@class='unstyled']") only scrapes from 0 to 40. Do you know how to scrape the second column?

– Reinaldo Chaves

2017/08/24 at 15:07
1

@Reinaldochaves if you change find_element_by_xpath to find_elements_by_xpath, it takes all the uls with the class unstyled. Selenium is very simple here: either you search for one element (in this case it took the first ul) or you search for all the existing ones. Don't forget to upvote if it's useful! Thanks :D

– Guilherme IA

2017/08/24 at 15:23
Thank you. The first command looks like this: list_senators = browser.find_elements_by_xpath(".//ul[@class='unstyled']") But then the for loop raised an error:

AttributeError Traceback (most recent call last)
<ipython-input-22-b9c5cbb095f2> in <module>()
----> 1 for s in list_senators.find_elements_by_css_selector("li"):
      2     print(s.text)
AttributeError: 'list' object has no attribute 'find_elements_by_css_selector'

– Reinaldo Chaves

2017/08/24 at 15:32
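
The traceback above comes from calling an element method on the list that find_elements returns. A hedged sketch of the usual fix is to loop over that list and query each ul separately. The helper name collect_list_items and its max_lists parameter are hypothetical, and the sketch uses the find_elements(by, value) form from newer Selenium releases rather than the find_elements_by_* helpers used in this thread. The two locator strings are the values of By.XPATH and By.CSS_SELECTOR, written out so the snippet has no import of its own.

```python
# Locator strategy strings; in Selenium these are the values of
# By.XPATH and By.CSS_SELECTOR (selenium.webdriver.common.by).
XPATH = "xpath"
CSS_SELECTOR = "css selector"


def collect_list_items(driver, max_lists=2):
    """Return the <li> texts of the first `max_lists` matching <ul> elements."""
    # find_elements returns a plain Python list of elements, so it has no
    # find_elements method of its own; each <ul> must be queried in turn.
    uls = driver.find_elements(XPATH, ".//ul[@class='unstyled']")
    texts = []
    for ul in uls[:max_lists]:
        for li in ul.find_elements(CSS_SELECTOR, "li"):
            texts.append(li.text)
    return texts
```

With a real browser session this would be called as collect_list_items(browser), assuming browser is an already started WebDriver that has loaded the page.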
I think the fix was removing the CSS selector, thanks: for s in list_senators: print(s.text) But if there is a way to get only the first two columns, that would be more productive. If not, I'll clean up the data afterwards.

– Reinaldo Chaves

2017/08/24 at 15:41
1

So in this case, when you use the plural find_elements call to catch all the uls, you get a list of uls. Then you have to loop over the uls and loop once more over each one to get the lis. The problem here is that the government website is poorly built: it created all the uls on the page with the same attributes. You must identify which element or attribute differs for each list. I will update the code so that you get only the 2 columns of the first block of senators.

– Guilherme IA

2017/08/24 at 16:28
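
One way to tell identical lists apart, sketched here with BeautifulSoup, is to anchor on the description text of the vote and take the lists that follow it. The HTML below is a hypothetical stand-in, and the sketch assumes the description appears before its lists in document order, which has not been checked against the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the wanted vote is NOT the first one on the page,
# and every list carries the same tag and class.
SAMPLE_HTML = """
<h3>Some other vote</h3>
<ul class="unstyled"><li>Senator A - No</li></ul>
<h3>Roll call vote, first round, PEC 55/2016</h3>
<ul class="unstyled"><li>Senator A - Yes</li><li>Senator B - No</li></ul>
<ul class="unstyled"><li>Senator C - Yes</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the text node that holds the description of the wanted vote.
label = soup.find(string=re.compile(r"PEC 55/2016"))

# Take the next two matching lists after that text: the two columns.
columns = label.find_all_next("ul", class_="unstyled", limit=2)

votes = [li.get_text(strip=True) for ul in columns for li in ul.find_all("li")]
print(votes)
```

Anchoring on the description keeps working if another vote is added above the wanted one, whereas taking the first two lists on the page would not.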
Thank you! I put it here: https://github.com/reichaves/plenario_teste/blob/master/teste1.ipynb

– Reinaldo Chaves

2017/08/24 at 20:41
It's an honor! Haha, cheers

– Guilherme IA

2017/08/24 at 21:37
