Automate web scraping in Python

Question

Asked 5 years, 3 months ago

I’m trying to collect the deputies’ speeches, which can be found here. The site has several pages (roughly 1 to 300), and each page has a table with a "summary" of the information, 50 rows per page. Each row has a link that opens the full text of Deputy X’s speech. What I’m trying to do: save the "summary" table -> click on the full text of speech X -> save the full text of speech X -> go back to the previous page with the "summaries" -> click on the full text of speech Y -> save -> go back ... -> go to the next page and repeat the whole process until the last page.

For this I tried to use the following loop:

tabela=[]   
html_element=[]
item=list(range(51))  # 0..49 index the rows of the table, 50 triggers the next-page step

while True:
    try:
        for i in item:
            if i < 50:
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located((By.XPATH, "//*[@id='content']/div/table/tbody")))
                driver.find_elements_by_class_name("glyphicon.glyphicon-file")[i].click()
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located((By.ID, "content")))
                html_element.append(driver.find_element_by_xpath("//*[@id='content']").get_attribute('outerHTML'))
                driver.execute_script("window.history.go(-1)")
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located((By.XPATH, "//*[@id='content']/div/table/tbody")))
            elif i == 50:
                tabela.append(driver.find_element_by_xpath("//*[@id='content']/div/table"))
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located((By.XPATH, "//*[@id='content']/div/table/tbody")))
                driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@title='Próxima Página']"))))
                driver.find_element_by_xpath("//*[@title='Próxima Página']").click()
                print("Próxima página")
    except (TimeoutException, WebDriverException) as e:
        print("Última página")
        break

It works partially: I can get the full text of a deputy’s speech. However, sometimes it does not advance to the next page, and at other times it advances two pages at once, which ends up producing wrong data.

1 answer

by Rodrigo Eggea • 79 points · Answer 1 · 2020-05-20T00:53:38+00:00

According to Sasa Buklijas’s article "Do not use Selenium for web scraping", Selenium is a tool for automated testing of web applications, not a web scraping tool (that is, a tool for extracting data from websites). It can be used for scraping in some situations, but it is usually slower and makes data extraction harder. It is recommended to use dedicated web scraping tools, such as the Python libraries "Scrapy" and "Beautiful Soup + Requests", which make it easier to extract data from pages.

Here is an example of a Python program that uses the "Beautiful Soup + Requests" libraries to extract the deputies’ speeches:

# Web scraper that downloads speeches from the Câmara dos Deputados site
# Author: Rodrigo Eggea 19/05/2020
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
HEADER = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def save_page(url, filename):
    """Download url and save its HTML to filename."""
    request = requests.get(url, headers=HEADER)
    request.encoding = 'utf-8'
    page = request.text
    # Write explicitly as UTF-8 so accented characters are saved correctly on any platform
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(page)
    print('File saved: ' + filename)

url_template = '''https://www.camara.leg.br/internet/sitaqweb/resultadoPesquisaDiscursos.asp?CurrentPage=1&PageSize=1000&BasePesq=plenario&txIndexacao=&txOrador=&txPartido=&dtInicio=01/01/2019&dtFim=31/12/2019&txUF=&txSessao=&listaTipoSessao=&listaTipoInterv=&inFalaPres=&listaTipoFala=&listaFaseSessao=&txAparteante=&listaEtapa=&CampoOrdenacao=dtSessao&TipoOrdenacao=DESC&txTexto=&txSumario='''

for page_number in range(1, 22):
    url = url_template.replace('CurrentPage=1', 'CurrentPage=' + str(page_number))  # switch to the next page
    print('-------------------------------------------------')
    print('PAGE=', page_number)
    request = requests.get(url, headers=HEADER)
    request.encoding = 'utf-8'     # Force the encoding to UTF-8 (otherwise requests assumes ISO-8859-1)
    page = request.text
    print('ENCODING=', request.encoding)
    soup = BeautifulSoup(page, 'html.parser')

    table = soup.find('table', attrs={'class': 'variasColunas'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        # A row with 17 child nodes holds the data columns of one speech
        if len(row) == 17:
            tag_a  = row.find('a', href=True)
            cols   = row.find_all('td')
            DATA   = cols[0].text.strip()
            SESSAO = cols[1].text.strip()
            FASE   = cols[2].text.strip()
            ORADOR = cols[5].text.strip()
            HORA   = cols[6].text.strip()
            PUBLICACAO = cols[7].text.strip()
            print(f'DATA={DATA} SESSAO={SESSAO} FASE={FASE} ORADOR={ORADOR} HORA={HORA} PUBLICACAO={PUBLICACAO}')
            # ------ Save the page with the full speech -----------
            if tag_a:
                raw_path = tag_a['href']
                # The href contains line breaks, tabs and spaces that must be removed or encoded
                fixed_path = raw_path.replace('\r\n', '').replace('\t', '').replace(' ', '%20')
                discurso_url = 'https://www.camara.leg.br/internet/sitaqweb/' + fixed_path
                print('LINK DISCURSO=', discurso_url)
                save_page(discurso_url, PUBLICACAO.replace('/', '') + ' ' + HORA + '.html')
            # -----------------------------------------
        # A row with 3 child nodes holds the summary text
        if len(row) == 3:
            sumario = row.find('td').text.strip()
            print('SUMARIO=', sumario)
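
The link clean-up step in the program above can also be isolated into a small helper, which makes it easier to check on its own. This is only a sketch: the name build_speech_url and the sample href below are made up for illustration, and it assumes the hrefs on the site are relative paths polluted with line breaks, tabs and spaces, as the replace calls above suggest. Unlike the original one-liner, it also removes lone '\r' or '\n' characters and drops leading and trailing spaces instead of encoding them.

```python
from urllib.parse import urljoin

# Base address of the speech pages, taken from the program above
BASE_URL = 'https://www.camara.leg.br/internet/sitaqweb/'

def build_speech_url(raw_path, base_url=BASE_URL):
    """Turn a messy relative href into an absolute URL (hypothetical helper)."""
    # Drop line breaks and tabs that appear inside the href
    cleaned = raw_path.replace('\r', '').replace('\n', '').replace('\t', '')
    # Trim the ends, then percent-encode the spaces that remain inside
    cleaned = cleaned.strip().replace(' ', '%20')
    return urljoin(base_url, cleaned)

# Made-up href, only to show the effect of the clean-up
print(build_speech_url(' Example.asp?a=1\r\n\t&b=x y '))
```

In the loop above, the call would replace the two lines that compute fixed_path and discurso_url.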

Remarks:

In the page call URL:

https://www.camara.leg.br/internet/sitaqweb/resultadoPesquisaDiscursos.asp?CurrentPage=1&PageSize=1000&BasePesq=plenario&txIndexacao=&txOrador=&txPartido=&dtInicio=01/01/2019&dtFim=31/12/2019&txUF=&txSessao=&listaTipoSessao=&listaTipoInterv=&inFalaPres=&listaTipoFala=&listaFaseSessao=&txAparteante=&listaEtapa=&CampoOrdenacao=dtSessao&TipoOrdenacao=DESC&txTexto=&txSumario=

CurrentPage= is the page you want to visit, and PageSize= is the number of items per page that the site returns in the table; if you request more than 1000 items, the page does not load. In the example above, PageSize is set to 1000 items per page and 21 pages are visited.
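
As an alternative to the string replace used in the program, the two parameters can be set by parsing the query string and rebuilding it. The sketch below uses only the standard library; the function name with_page and the shortened example.org URL are made up for illustration, and the 1000-item limit is just the behaviour described above, not something documented by the site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

MAX_PAGE_SIZE = 1000  # above this the page reportedly does not load

def with_page(url, page_number, page_size=None):
    """Return url with CurrentPage (and optionally PageSize) replaced (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values keeps empty filters such as txOrador= in the query
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params['CurrentPage'] = str(page_number)
    if page_size is not None:
        if page_size > MAX_PAGE_SIZE:
            raise ValueError('PageSize must not exceed 1000')
        params['PageSize'] = str(page_size)
    # safe='/' leaves the slashes in dates like 01/01/2019 unescaped
    return urlunsplit(parts._replace(query=urlencode(params, safe='/')))

# Shortened, made-up URL with the same parameter names as the real one
sample = 'https://example.org/search.asp?CurrentPage=1&PageSize=1000&txOrador=&dtInicio=01/01/2019'
print(with_page(sample, 3))
```

With this helper, the loop in the program could call with_page(url_template, page_number) instead of url_template.replace(...), which keeps working even if the template does not start at page 1.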