Problem navigating with Selenium (in Python) through search results presented in dynamic HTML


I am scraping articles from a newspaper in Pernambuco (Diário de PE), following a search I ran with some keywords on the subject of interest. The site's search returns 10 results at a time on a dynamic HTML page, which contains, at the bottom, a list of all result pages, numbered 1 to 10.

For the first page, I was able to scrape the excerpt I want from the HTML using Selenium in Python without major problems, but I'm having trouble accessing the other pages from the generated list.

References to the next pages are in tags like <div class="gsc-cursor-page" aria-label="Página 2" role="link" tabindex="0">2</div> (taking the reference to the 2nd page as an example). There is no anchor tag <a> with an href generating a hyperlink to the next page, nor a "next" button. See an example of this in the image below.

[screenshot of the pagination markup in the page source]

My strategy: I created a function that picks up the references to the pages in these divs, locating them through XPath. These references were stored in a list, and I wrote a 'for' loop to iterate reference by reference, using Selenium's click() to access each one. I based this strategy on several tips given on Stack Overflow itself (English version), but unfortunately it did not work: only the 2nd page is returned (besides the 1st).

Below is the code of the two functions I created: the first gets the HTML of the first page, and the second does the same for the other pages, using the list I mentioned.

Imported libraries:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import time

First function, to scrape the HTML of the desired area on the first page of search results:

def get_target_html(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    #Finding target HTML
    site_html = wait.until(lambda driver: driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML'))
    driver.close() 
    with open('target_page_1.html', 'wt', encoding='utf-8') as file:
        file.write(site_html)

Second function, to navigate to the other pages and scrape their HTML:

def get_next_pages(url): 
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    #Creating a list that receives all elements according to the given XPATH
    page_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
    #Reading each element in the list to access the related page and get the target HTML code
    count = 1
    for page in page_list:
        count += 1
        number = str(count)
        page.click()
        time.sleep(10)
        target_html = driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML')
        #Writing and saving a html file with the target code
        with open(f'target_page_{number}.html', 'wt', encoding='utf-8') as file:
            file.write(target_html)
    driver.close()
    print('Finished')

I then called the two functions, providing the initial URL (https://www.diariodepernambuco.com.br/capa_busca.html?q=seguran%C3%A7a&q=seguran%C3%A7a%20%2B%20p%C3%BAblica%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20crime*%20%2B%20pernambuco).
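For reference, the calls look roughly like this (a minimal sketch; the url value is simply the search URL above):

url = 'https://www.diariodepernambuco.com.br/capa_busca.html?q=seguran%C3%A7a&q=seguran%C3%A7a%20%2B%20p%C3%BAblica%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20pernambuco&q=%22seguran%C3%A7a%20p%C3%BAblica%22%20%2B%20crime*%20%2B%20pernambuco'

get_target_html(url)   # saves target_page_1.html with the first page of results
get_next_pages(url)    # was expected to save target_page_2.html onwards, but raises the exception below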

The exception that was returned:

 File "C:\Users\Victor\OneDrive\Scrapy Projects\Corpus_Jornais\teste_selenium_diariodepernambuco5.py", line 57, in <module>
    get_next_pages(url)

  File "C:\Users\Victor\OneDrive\Scrapy Projects\Corpus_Jornais\teste_selenium_diariodepernambuco5.py", line 40, in get_next_pages
    page.click()

  File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)

  File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)

  File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)

  File "C:\Users\Victor\anaconda3\envs\py36\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)

StaleElementReferenceException: stale element reference: element is not attached to the page document
  (Session info: headless chrome=84.0.4147.89)

I understand that this StaleElementReferenceException means the object that click() is trying to use is no longer available, however I believe it is still in the list to be accessed. I ran the second function's code in isolation and checked the list size (len(page_list)), which returned exactly the number of references collected (=9).
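That isolated check was essentially this (a minimal sketch, using the same driver setup as in the second function):

driver.get(url)
page_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
print(len(page_list))  # prints 9: one div for each of the remaining result pages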

Would anyone have any tips on how to deal with this problem?

1 answer



I was able to find a solution by researching further into what was happening. I found an answer here on Stack Overflow that helped me structure what I needed. For those facing this kind of problem, here it is:

First, I recommend reading the answer given to this question: Selenium StaleElementReferenceException Problems with Python

In summary, it explains that if this exception appears, it is because the element that click() is trying to access has been lost, i.e. the reference is no longer attached to the current page document.
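In other words (a minimal, hypothetical illustration, not code taken from the site): after a click that triggers the JavaScript re-render, any WebElement found before that click is detached from the new DOM, so touching it raises the exception:

page_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
page_list[0].click()   # this click makes the search widget re-render the results
page_list[1].click()   # stale reference: the old div no longer belongs to the current DOM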

I built the solution around the idea that the list must be refreshed every time a new page is accessed. The first function I created was kept. The second one became this:

def get_next_pages(url): 
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    #Creating a list that receives all elements according to the given XPATH
    page_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
    pages_amount = len(page_list)
    #Creating a loop to update the pages list and get their target HTML
    for index in range(pages_amount):
        number = str(index+2)
        updated_pages_list = driver.find_elements_by_xpath('//div[@class="gsc-cursor-page"]')
        print(f'Accessing page {number}')
        updated_pages_list[index].click()
        time.sleep(5)
        target_html = driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML')
        #Writing and saving a HTML file with the target code
        print(f'Saving page {number} html')
        with open(f'target_page_{number}.html', 'wt', encoding='utf-8') as file:
            file.write(target_html)
    driver.close()
    print('Finished')

Explaining what I’ve changed:

At the beginning of the function a list of pages (page_list) is still created, but this time only so that the number of elements in it can be counted (pages_amount).

With this size known, a 'for' loop runs over the indices; in each iteration a fresh list of pages is fetched (updated_pages_list), and the element at the position given by the current index is clicked.

I added a few prints to keep track of what was being done, using number = str(index + 2) so that the page number shown on screen matches how the pages are numbered on the site (2 to 10) rather than the list indices (0 to 8).

Finally, the target HTML of each listed page is scraped and saved to a file numbered according to the same "number" variable.
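A possibly more robust variant (just a sketch, not tested against the site, and assuming the aria-label="Página N" pattern shown in the question exists on every results page) would let Selenium's explicit waits locate each numbered div right before clicking it, instead of relying on index bookkeeping alone:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def get_next_pages_with_waits(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    # Count the numbered divs once, only to know how many extra pages exist
    pages_amount = len(wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, '//div[@class="gsc-cursor-page"]'))))
    for index in range(pages_amount):
        number = str(index + 2)
        # Re-locate the div for this specific page by its aria-label
        # ("Página 2", "Página 3", ...) and wait until it can be clicked
        page_div = wait.until(EC.element_to_be_clickable(
            (By.XPATH, f'//div[@class="gsc-cursor-page" and @aria-label="Página {number}"]')))
        page_div.click()
        time.sleep(5)  # give the widget time to re-render the results after the click
        target_html = driver.find_element_by_xpath('//div[@class="gsc-expansionArea"]').get_attribute('innerHTML')
        with open(f'target_page_{number}.html', 'wt', encoding='utf-8') as file:
            file.write(target_html)
    driver.quit()
    print('Finished')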

If anyone has doubts about this answer, I will be happy to elaborate here. And if someone has a better, more elegant solution, the space is open to present it.

Best regards!
