I'm trying to build a scraper for a book blog: I need to get the titles and categories of all the books posted. On my first attempt I got an AttributeError, which I expect to happen many times, because the site is poorly built and the elements I'm looking for don't always have the same markup. To deal with that, I added an except so the loop can continue. Here is my code so far:

    from selenium import webdriver
    import time
    from bs4 import BeautifulSoup
    import pandas as pd

    webdriver.Chrome(executable_path = '/home/porco/Downloads/chromedriver')
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--test-type')
    options.binary_location='/usr/bin/chromium'
    driver = webdriver.Chrome(chrome_options=options)

    url = 'http://amoraosromances.blogspot.com/'
    driver.get(url)

    A = []
    B = []

    while True:
        soup = BeautifulSoup(driver.page_source, 'lxml')
        try:
            for div in soup.findAll('div', class_='post hentry'):
                titulo = div.find('h3', class_='post-title entry-title')
                A.append(titulo.text.strip().title())
                temas = div.find('span', class_='post-labels')
                B.append(temas.text.strip().replace('\n', ' ').replace('Marcadores:', '').title())
                print(titulo.txt)
                print('...rodando...')
        except AttributeError:
            continue
        try:
            nextButton = driver.find_element_by_xpath('//*[@id="Blog1_blog-pager-older-link"]')
            nextButton.click()
        except:
            break

    print('...fazendo .csv e json...')
    df=pd.DataFrame(A, columns=['Título'])
    df['Tema'] = B
    df
    df.to_csv('autor-tema2.csv')
    df.to_json('autor-tema2.json', orient='records')
    driver.quit()

I have two questions. The first is about the except I mentioned above: I'm not sure it's working, because the script has been running for a while and the URL has stopped changing. Is there a way to add some output so I can see which stage the process is at? I added some print() calls, but they were not very helpful.
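
To make clearer what kind of progress output I mean, here is a minimal sketch that uses only the standard library. The helper names `text_or_default` and `collect_page` are hypothetical, and the `SimpleNamespace` objects are stand-ins for the tags that `div.find(...)` would return. It assumes a missing element should be handled per post, so that one bad post does not skip the rest of the page, and that both lists should stay the same length.

```python
import logging
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('scraper')


def text_or_default(node, default=''):
    """Return the stripped text of a tag-like object, or `default` if it is None."""
    if node is None:
        return default
    return node.text.strip()


def collect_page(posts, page_number, titles, labels):
    """Append one title and one label per post, then log a summary line.

    `posts` is an iterable of (title_node, label_node) pairs; either may be None.
    Returns the number of posts that were missing a title or a label.
    """
    missing = 0
    for title_node, label_node in posts:
        if title_node is None or label_node is None:
            missing += 1
        titles.append(text_or_default(title_node).title())
        label = text_or_default(label_node).replace('Marcadores:', '').strip()
        labels.append(label.title())
    log.info('page %d: %d posts collected so far, %d with missing fields',
             page_number, len(titles), missing)
    return missing


# Stand-in data (assumption): the second post has no label element.
page = [
    (SimpleNamespace(text=' um livro \n'), SimpleNamespace(text='Marcadores: romance')),
    (SimpleNamespace(text='outro livro'), None),
]
A, B = [], []
n_missing = collect_page(page, 1, A, B)
```

With something like this, one log line per page would show whether the page counter is still advancing or whether the same page is being parsed again and again.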
The other question is whether there is a way to prevent the browser from opening new windows/tabs. The site has a lot of advertising, and I'm afraid my computer will freeze partway through if these windows keep opening; there are about 3,057 posts. If they can't be blocked, a way to close them as soon as they open would also help.
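
For the fallback of closing the extra windows, this is roughly what I imagine, as a sketch only. It relies on Selenium's `driver.window_handles`, `driver.switch_to.window()` and `driver.close()`; the helper `close_extra_windows` is a hypothetical name, and `FakeDriver` is just a stand-in so the snippet runs without a browser. It has not been tried against the real site.

```python
def close_extra_windows(driver, main_handle):
    """Close every window/tab except `main_handle`; return how many were closed."""
    closed = 0
    # Copy the list first, because closing windows changes it.
    for handle in list(driver.window_handles):
        if handle != main_handle:
            driver.switch_to.window(handle)
            driver.close()
            closed += 1
    # Always return focus to the main window so scraping can continue.
    driver.switch_to.window(main_handle)
    return closed


class _FakeSwitchTo:
    """Stand-in for driver.switch_to (assumption, for illustration only)."""

    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Stand-in that mimics only the driver attributes used above."""

    def __init__(self, handles):
        self.window_handles = list(handles)
        self.current_window_handle = self.window_handles[0]
        self.switch_to = _FakeSwitchTo(self)

    def close(self):
        self.window_handles.remove(self.current_window_handle)


fake = FakeDriver(['main', 'ad-1', 'ad-2'])
closed_count = close_extra_windows(fake, 'main')
```

In the real script, `main_handle` would be `driver.current_window_handle` saved right after `driver.get(url)`, and the helper would be called after each click on the pager link.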
Hugs.