Web Scraping Selenium + Python on JS-generated website = difficulty mapping elements


Good afternoon. I am developing a script that:

  1. accesses a system;
  2. finds certain information within that environment;
  3. generates a kind of report;
  4. creates a spreadsheet with the data.

My problem comes even before the parsing. I can reach the environment that contains the information, but I can't get the Selenium webdriver to locate the elements it needs to click to reach the data that will go into the report.

I have the impression that JavaScript is causing the confusion: the contents of the frame that triggers the JavaScript are accessible, but the page with the result, which is visible to me, does not seem to be visible to the script.

How do I get around the JavaScript?

How do I make the webdriver "see" the final page the same way I see it?

(EDITED. Code below:)

import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchFrameException
from selenium.webdriver.common.by import By

# Create the output folder if it does not exist yet
if not os.path.exists('c:\\projudi'):
    os.makedirs('c:\\projudi')

# Open the report for reading and writing, creating it on the first run
try:
    planilha = open('c:\\projudi\\relatorio.csv', 'r+')
except FileNotFoundError:
    planilha = open('c:\\projudi\\relatorio.csv', 'w+')

browser = webdriver.Chrome()
browser.get('https://projudi.tjpr.jus.br/projudi')
time.sleep(20)  # fixed wait for the page to finish loading
browser.switch_to.frame('mainFrame')
browser.switch_to.frame('userMainFrame')
links = browser.find_elements(By.CLASS_NAME, 'link')
n = len(links)

for x in range(0, n, 2):
    if links[x].text != '0':
        links[x].click()
        time.sleep(2)
        try:
            browser.switch_to.frame('mainFrame')
            browser.switch_to.frame('userMainFrame')
            a = browser.find_elements(By.CLASS_NAME, 'link')
        except NoSuchFrameException:
            # The frames were not found from here: search in the current context
            a = browser.find_elements(By.CLASS_NAME, 'link')
        if a:
            q = browser.find_elements(By.CLASS_NAME, 'resultTable')
            w = q[0].text
            dados = w.split('\n')
            # Write one table line per line of the report file
            for linha in dados:
                planilha.write(linha + '\n')
            for y in range(len(a)):
                a[y].click()
                time.sleep(2)
                browser.back()
                time.sleep(2)
                browser.switch_to.frame('mainFrame')
                browser.switch_to.frame('userMainFrame')
                # Locate the links again after navigating back
                a = browser.find_elements(By.CLASS_NAME, 'link')
            browser.back()
            time.sleep(2)
        else:
            browser.back()
            time.sleep(2)
        browser.switch_to.frame('mainFrame')
        browser.switch_to.frame('userMainFrame')
        links = browser.find_elements(By.CLASS_NAME, 'link')

planilha.close()
browser.close()

My question: when I reach the screen that contains the information I need (resultTable), I capture the whole table and generate a variable holding a string with all the data. I split it and get a list of strings. So far so good: I write everything to the report file for further processing. Now, how do I control the FLOW? I already know I will have to use a regex on the list item that contains the DATE, because I only need the entries from today back to 2 days ago. But how do I use that information as a REFERENCE for Python? Example: the script captures the table and puts it into a list like this:

dados = ['0004434-48.2010', 'UNION', '(30 working days) 03/07/2017', '13/07/2017', '0008767-77.2013', '2017', '(10 working days) 03/07/2017', '13/07/2017']

The first item in the list is the first cell of the table: row 1, column 1. It contains the link. The control date is in the THIRD item (row 1, column 3), and item 5 is already the next row (row 2, column 1). I don't know if I explained that well! =/

I need to check the date. If it is today or yesterday, click the first item in that row. If it is not, move on to the next row.

  • I think it would be best to post the code you have so far, along with the error, so we can take a look.

  • If possible, also include the link to the site in question. We can only work out how to capture the information if we can inspect the field.

  • I edited the question following everyone's suggestions. It now contains the link and the code. Actually, I have already managed to get past that step by calling switch_to_frame twice (without really understanding why the two steps were needed, but since it worked, I left it). Now I am at the point of actually capturing the information. Unfortunately, the system requires a login and password...

  • @Can Bergodealmeida could you export the HTML after login, from the page where the data is actually being extracted?
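
On the two switch_to_frame calls mentioned in the comments: they are presumably needed because userMainFrame is nested inside mainFrame, and Selenium looks for a frame only inside the document it is currently focused on. A minimal sketch of a helper that always starts again from the top-level document is shown here. The name entrar_nos_frames is hypothetical, and the default frame names are simply the ones from the question. It uses driver.switch_to.default_content() and driver.switch_to.frame(), the current spellings of the older switch_to_* methods.

```python
def entrar_nos_frames(driver, nomes=("mainFrame", "userMainFrame")):
    """Focus the innermost frame, walking down from the top-level document.

    `driver` is a Selenium WebDriver. The frame names default to the two
    used in the question; adjust them for other pages.
    """
    # Reset first: frame lookups are relative to the document that
    # currently has focus, which may no longer be the top-level one.
    driver.switch_to.default_content()
    for nome in nomes:
        driver.switch_to.frame(nome)
```

Calling a helper like this after every click or browser.back() might also make the try/except around NoSuchFrameException unnecessary, though that would need to be checked against the real site.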

1 answer


From what I read, I am not sure I understood exactly what you want to do, but Selenium has several modules that can help. The catch is that you need to inspect the page's HTML and work out which element is which before you can capture it with Selenium.

from selenium.webdriver.common.keys import Keys         # special keys for keyboard input (e.g. Keys.ENTER)
from selenium.webdriver.support.ui import Select        # helper for working with <select> drop-downs
from selenium.webdriver.common.by import By             # locator strategies (By.ID, By.NAME, ...)
from selenium.webdriver.support.ui import WebDriverWait # lets you set how long the browser waits for a condition
from selenium.webdriver.support import expected_conditions as EC # library of ready-made expected conditions

Those are some useful Selenium modules. To check whether the date is today or yesterday, I would recommend looking up the element's id or name in the HTML and using the command

variavel = driver.find_element(By.NAME, 'elemento')

Now, if you have already captured the information and saved it to a file or a variable, I suggest using Pandas to organize it as dataframes.
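
Before (or instead of) loading the data into Pandas, the flat list from the question can be grouped into table rows with plain Python. This is only a sketch: it assumes every row has exactly four cells, as the sample in the question suggests, and agrupar_em_linhas and COLUNAS are made-up names.

```python
COLUNAS = 4  # assumption: each table row yields four strings


def agrupar_em_linhas(dados, colunas=COLUNAS):
    """Split a flat list of cell texts into rows of `colunas` cells each."""
    if len(dados) % colunas != 0:
        raise ValueError("number of cells is not a multiple of the row width")
    return [dados[i:i + colunas] for i in range(0, len(dados), colunas)]


dados = [
    '0004434-48.2010', 'UNION', '(30 working days) 03/07/2017', '13/07/2017',
    '0008767-77.2013', '2017', '(10 working days) 03/07/2017', '13/07/2017',
]
linhas = agrupar_em_linhas(dados)
# linhas[0][0] is the text of the link in row 1; linhas[0][2] holds its control date
```

A list of rows like this can also be passed to pandas.DataFrame if a dataframe is preferred.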

To check the date of a link, I would get the link with find_element, take its text, and work out at which character position the date starts and at which it ends (link[n:m]). Then I would use datetime to compare that date with the current date.
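
Rather than counting character positions by hand, the date can also be pulled out with a regular expression, which the question already mentions. A sketch, assuming the date always appears as dd/mm/yyyy somewhere in the cell text; extrair_data is a hypothetical helper name.

```python
import re
from datetime import datetime

PADRAO_DATA = re.compile(r'\d{2}/\d{2}/\d{4}')


def extrair_data(texto):
    """Return the first dd/mm/yyyy date found in `texto`, or None."""
    achado = PADRAO_DATA.search(texto)
    if achado is None:
        return None
    return datetime.strptime(achado.group(), '%d/%m/%Y').date()


print(extrair_data('(30 working days) 03/07/2017'))  # 2017-07-03
```

Returning a date object instead of a string means the value can be compared directly with other dates.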

To get the current date:

import datetime
from datetime import timedelta

n = 2  # example value: how many days back to go
# "%d/%m/%Y" matches the dd/mm/yyyy dates shown in the table
data_hoje = datetime.datetime.now().strftime("%d/%m/%Y")
data_ontem = (datetime.datetime.now() - timedelta(days=1)).strftime("%d/%m/%Y")
data_um_dia_n_dias_atras = (datetime.datetime.now() - timedelta(days=n)).strftime("%d/%m/%Y")
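
For the flow-control part of the question, one possible approach is to compare date objects rather than formatted strings and collect the positions of the rows that are recent enough. The sketch below assumes the control date of each row has already been converted to a datetime.date (or None when no date was found); indices_recentes and its parameters are illustrative names, not part of any library.

```python
from datetime import date, timedelta


def indices_recentes(datas, hoje, dias=1):
    """Return the positions in `datas` that fall between `hoje - dias` and `hoje`.

    Entries that are None (rows where no date was found) are skipped.
    """
    limite = hoje - timedelta(days=dias)
    return [i for i, d in enumerate(datas)
            if d is not None and limite <= d <= hoje]


# One control date per table row, already converted to date objects
datas = [date(2017, 7, 3), date(2017, 7, 1), None]
# A fixed "today" keeps the example reproducible; use date.today() in a real run
print(indices_recentes(datas, hoje=date(2017, 7, 4)))  # [0]
```

Each returned index could then be used to pick the matching link element to click, provided the links on the page line up one per table row, which should be checked against the real page.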
