Scraping using Selenium and Beautifulsoup

I’m trying to scrape a book blog; I need to get the titles and categories of all the books posted. In my first attempt I got an AttributeError, which is bound to happen several times because the site is poorly built and the elements I grab will not always have the same markup. To try to deal with that, I added an except so the loop continues. Here is my code so far:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--test-type')
options.binary_location = '/usr/bin/chromium'
# Single driver instance, pointed at the chromedriver binary
driver = webdriver.Chrome(executable_path='/home/porco/Downloads/chromedriver',
                          options=options)

url = 'http://amoraosromances.blogspot.com/'
driver.get(url)

A = []
B = []

while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')

    for div in soup.findAll('div', class_='post hentry'):
        try:
            titulo = div.find('h3', class_='post-title entry-title')
            A.append(titulo.text.strip().title())
            temas = div.find('span', class_='post-labels')
            B.append(temas.text.strip().replace('\n', ' ').replace('Marcadores:', '').title())

            print(titulo.text)
            print('...rodando...')
        except AttributeError:
            # Post is missing the title or labels markup; skip just this post
            continue

    try:
        nextButton = driver.find_element_by_xpath('//*[@id="Blog1_blog-pager-older-link"]')
        nextButton.click()
    except:
        break

print('...fazendo .csv e json...') 
df = pd.DataFrame(A, columns=['Título'])
df['Tema'] = B

df.to_csv('autor-tema2.csv')
df.to_json('autor-tema2.json', orient='records')

driver.quit()

I have two questions. The first relates to the except I mentioned above: I’m not sure it is working, because the script has been running for a while here and the URL has stopped changing. Is there any way I can add some kind of output so I can see what stage the process is at? I added some print() calls, but they were not very helpful.

The other question is whether there is a way to prevent the browser from opening new windows/tabs. This site has a lot of advertising, and I believe my computer may freeze in the middle of the process if those windows keep opening; there are about 3,057 posts. If there is no way to stop them, a way to close them as soon as they open would be good too.

Hugs.

1 answer


For logging you can use the logging module; it is better than print for this case.

import logging

logger = logging.getLogger(__name__)
logger.warning('Here you put the log text')
