I am scraping articles from a newspaper in Pernambuco (Diário de PE), based on a search I ran with some keywords on my subject of interest. The search returns 10 results on a dynamic HTML page that contains, at the end, a list of all result pages, numbered 1 to 10.
For the first page, I was able to scrape the excerpt I want from the HTML using Selenium in Python without major problems, but I am having trouble accessing the other pages from the generated list.
References to the next pages are in tags like <div class="gsc-cursor-page" aria-label="Página 2" role="link" tabindex="0">2</div> (taking the reference to the 2nd page as an example). There is no anchor tag <a> with an href generating a hyperlink to the next page, and there is no "next" button.
My strategy: I created a function that picks up the page references from these divs by their XPath. The references are stored in a list, and a for loop iterates over them, one by one, using Selenium's click() to access each page. I based this strategy on several tips given on Stack Overflow itself (the English version), but unfortunately it did not work: only the 2nd page is returned (besides the 1st).
Below is the code of the two functions I created: the first gets the HTML of the first page, and the second does the same for the other pages, using the list I mentioned.
Imported libraries:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import time
First function, to scrape the HTML of the desired area on the search results home page:
def get_target_html(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    # Finding the target HTML
    site_html = wait.until(lambda driver: driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML'))
    driver.close()
    with open('target_page_1.html', 'wt', encoding='utf-8') as file:
        file.write(site_html)
Second function, to navigate to the other pages and scrape their HTML:
def get_next_pages(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    # Creating a list that receives all elements matching the given XPath
    page_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
    # Iterating over the list to access each related page and get the target HTML
    count = 1
    for page in page_list:
        count += 1
        number = str(count)
        page.click()
        time.sleep(10)
        target_html = driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML')
        # Writing and saving an HTML file with the target code
        with open(f'target_page_{number}.html', 'wt', encoding='utf-8') as file:
            file.write(target_html)
    driver.close()
    print('Finished')
Then I called the two functions, providing the initial URL (https://www.diariodepernambuco.com.br/capa_busca.html?q=seguran%C3%A7a&q=seguran%C3%A7a%20%2B%20p%C3%BAblica%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20crime*%20%2B%20pernambuco).
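For reference, the calls look like this (the variable name url is just illustrative):
url = 'https://www.diariodepernambuco.com.br/capa_busca.html?q=seguran%C3%A7a&q=seguran%C3%A7a%20%2B%20p%C3%BAblica%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20crime*%20%2B%20pernambuco'
get_target_html(url)
get_next_pages(url)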
The exception that was returned:
File "C:\Users\Victor\OneDrive\Scrapy Projects\Corpus_Jornais\teste_selenium_diariodepernambuco5.py", line 57, in <module>
get_next_pages(url)
File "C:\Users\Victor\OneDrive\Scrapy Projects\Corpus_Jornais\teste_selenium_diariodepernambuco5.py", line 40, in get_next_pages
page.click()
File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: stale element reference: element is not attached to the page document
(Session info: headless chrome=84.0.4147.89)
I understand from this StaleElementReferenceException that the object targeted by click() is no longer available, but I believe it is still in the list waiting to be accessed. I ran the code of the second function in isolation and checked the list size (len(page_list)), and it returned exactly the number of references that were collected (9).
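In case it helps make the question concrete: my reading of the exception is that the results widget re-renders the pagination after each click(), which would invalidate the references I collected before the first click. Below is a minimal sketch of the kind of workaround I imagine, re-locating the page div on every iteration instead of reusing the old list. The class names and XPaths are the same ones from my code above; everything else (the function name get_next_pages_refetch, matching the div by its visible text, and the assumption that 9 divs means pages 2 to 10) is untested guesswork on my part:
def get_next_pages_refetch(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # assumed wait for the results widget to render
    # Count the pagination divs once (9 in my case, i.e. pages 2 to 10)
    num_refs = len(driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]'))
    for i in range(2, num_refs + 2):
        # Re-locate the div for page i in the freshly rendered DOM,
        # matching it by its visible page number (untested assumption)
        page = driver.find_element_by_xpath(f'//div[@class="gsc-cursor-page" and text()="{i}"]')
        page.click()
        time.sleep(10)
        target_html = driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML')
        with open(f'target_page_{i}.html', 'wt', encoding='utf-8') as file:
            file.write(target_html)
    driver.close()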
Would anyone have any tips on how to deal with this problem?