Extracting data with Beautiful Soup in Python


0

I wrote a Python script that accesses the TJ-SP website, runs a search, and scrapes the search result.

This is the HTML: [image]

I want to get the text inside this tag: <span class>1009238445480</span>, but the element's class attribute has no value.

Other parts of the HTML contain this: [image]

There is the tag <span class="labelClass">Área:</span> Cível, and from it I need to get the value Cível.

There is also the tag <span id class>Perdas e danos</span>

Neither id nor class has a value.

I also have another tag:

<td valign>
    <span id class>R$ 245,00</span>
</td>

I need to capture the value R$ 245,00.

Here is my script:

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

URL = 'https://esaj.tjsp.jus.br/cpopg/open.do'
user = 'x0x0x0x0x0x0x'
password ='x0x0x0x0x0x0x'

browser = webdriver.Chrome()
browser.get(URL)

browser.find_element_by_id('login').click()
browser.find_element_by_name('user').send_keys(user)
browser.find_element_by_name('password').send_keys(password)
browser.find_element_by_id('Enviar').click()

browser.find_element_by_id('NUMB').send_keys('1009238445480')
browser.find_element_by_id('Enviar_').click()

scrap = BeautifulSoup(browser.page_source, "html.parser")

processo = scrap.find('span', {'class':' '' '})
print(processo)

Here I am trying to get the value of the tag <span class>1009238445480</span>:

processo = scrap.find('span', {'class':' '' '})
print(processo)

But it returns:

None

Can someone help me?

2 answers

0


Your web crawler is heading in the right direction. The problem is in the way you search for the elements, which I believe comes from a gap in theory. So let's go through a little theory!

Tools that work with markup languages such as HTML and XML offer several ways to search for elements within the document tree. Let's mention some that are relevant to you.

Best known:

  • CSS selector
  • Tag selector
  • Attribute selector
  • XPath
  • DOM elements

Lesser known:

  • Hierarchy map
  • Semantic selector
  • CSS path

In your case you use scrap.find('span', {'class':' '' '})

Let's detail how find works, in order of execution:

  1. Grab all the span tags
  2. Among them, keep the span tags whose class matches the specified value

As you can see, the class filter requires a value to compare against. What you pass, ' '' ', is two adjacent string literals that Python joins into a string of two spaces, so find looks for a span whose class is exactly that string. None exists, so you won't find anything!
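The behaviour described above can be checked with a short sketch. The HTML string is a stand-in assumed from the question, not the real TJ-SP page.

```python
from bs4 import BeautifulSoup

# Stand-in HTML assumed from the question, not the real TJ-SP page.
html = '<div><span class>1009238445480</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Adjacent string literals are concatenated, so this is a two-space string.
needle = ' '' '
print(repr(needle))

# No span has a class equal to that string, so find() gives back None.
processo = soup.find('span', {'class': needle})
print(processo)
```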

So how can we get this specific span? You can try other approaches.


XPath

An XPath expression describes a path to a specific element. For example:

/html/body/div/span corresponds to

<html> <head></head> <body> <div> <span> Elemento Requisitado </span> </div> </body> </html>

You can copy an element's XPath using the browser's inspect tool.

Why is XPath not recommended? Many sites are dynamic, and tags do not always sit at a fixed path.
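As a rough illustration of following such a path: Beautiful Soup itself does not evaluate XPath, so this sketch uses the limited XPath subset in Python's standard xml.etree.ElementTree, applied to the small document above. It assumes well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# The small document from the example above, parsed as XML.
doc = ('<html><head></head><body><div>'
       '<span> Elemento Requisitado </span>'
       '</div></body></html>')
root = ET.fromstring(doc)  # root is the <html> element

# ElementTree paths are relative to the element they are called on,
# so /html/body/div/span becomes body/div/span starting from the root.
span = root.find('body/div/span')
xpath_text = span.text.strip()
print(xpath_text)
```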


CSS selector

You already use this one without realizing it. As the name suggests, it filters tags by CSS criteria, most commonly a specific class; some scraping libraries can also filter by style.
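A minimal sketch of a class-based CSS selector in Beautiful Soup, applied to the "Área:" label from the question. The surrounding td is an assumption; the real page may wrap the label differently.

```python
from bs4 import BeautifulSoup

# Stand-in HTML assumed from the question; the enclosing <td> is a guess.
html = '<td><span class="labelClass">Área:</span> Cível</td>'
soup = BeautifulSoup(html, 'html.parser')

# "span.labelClass" selects <span> tags whose class list contains labelClass.
label = soup.select_one('span.labelClass')

# The wanted value is not inside the span: it is the text node right after it.
area = label.next_sibling.strip()
print(area)
```

Because the value sits next to the label rather than inside it, the selector only locates the label; next_sibling does the rest.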


Tag selector

You already use this one too. It works like the CSS selector, except that it filters by tag name only, regardless of attributes, classes, or styles.
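For instance, selecting by tag name alone might look like the sketch below. The HTML is a stand-in assembled from the tags quoted in the question.

```python
from bs4 import BeautifulSoup

# Stand-in HTML assembled from the tags quoted in the question.
html = ('<div><span class>1009238445480</span>'
        '<span class="labelClass">Área:</span>'
        '<span id class>Perdas e danos</span></div>')
soup = BeautifulSoup(html, 'html.parser')

# find_all with only a tag name ignores attributes entirely.
spans = soup.find_all('span')
texts = [s.get_text() for s in spans]
print(texts)
```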


Attribute selector

This will probably solve your problem. The attribute selector is similar to the tag selector or the CSS selector; the difference is that it filters any tag that carries a specific attribute. How so?

Imagine we have markup like this:

<person feminine>Mary</person>

<person masculine>Peter</person>

Note that here the sex is expressed by the attribute name itself, which is different from

<person sex="male">Peter</person>

You can use this type of selection; after all, you want span tags that have the class attribute, regardless of whether class has a value.

If I'm not mistaken, in Python we can implement it like this: loop over all the span tags found, then filter with:

`tag.has_attr('class')`
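Put together, the loop could look like the sketch below. The helper name has_empty_class is hypothetical, and the HTML is a stand-in built from the tags in the question; the extra check on the value separates <span class> from <span class="labelClass">.

```python
from bs4 import BeautifulSoup

# Stand-in HTML built from the tags quoted in the question.
html = ('<div><span class>1009238445480</span>'
        '<span class="labelClass">Área:</span>'
        '<span id class>Perdas e danos</span>'
        '<span>no class attribute</span></div>')
soup = BeautifulSoup(html, 'html.parser')


def has_empty_class(tag):
    """Hypothetical helper: True if the tag has a class attribute with no value."""
    if not tag.has_attr('class'):
        return False
    # class is normally parsed into a list of names; joining copes with
    # an empty list, a list holding an empty string, or a plain string.
    value = tag.get('class') or []
    return ''.join(value).strip() == ''


matches = [s for s in soup.find_all('span') if has_empty_class(s)]
values = [s.get_text() for s in matches]
print(values)
```

On the real page, other spans may also have an empty class, so the result will likely need narrowing by position or by parent element.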

DOM elements

DOM elements incorporate everything we've seen so far and add other useful features, such as manipulating the page's HTML. With DOM elements we can also get the tags nested inside a specific tag.
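A small sketch of that kind of navigation, using the td from the question. The table and tr wrappers are assumptions, and on the real page find('td') would return the first cell, so a more specific starting point would be needed.

```python
from bs4 import BeautifulSoup

# Stand-in HTML assumed from the question; <table> and <tr> are guesses.
html = """<table><tr><td valign>
    <span id class>R$ 245,00</span>
</td></tr></table>"""
soup = BeautifulSoup(html, 'html.parser')

# Start from the parent cell, then look only inside it.
cell = soup.find('td')
span = cell.find('span')
valor = span.get_text(strip=True)
print(valor)

# Navigation also works upwards, from the span back to its cell.
parent_name = span.parent.name
print(parent_name)
```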



-1

Hello,

I'm in the same situation. Your initial script is heading in the right direction, but there are two URLs: one for login and one for search. You may need to use both.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# URLA is the search page, URLB is the login page
URLA = 'https://esaj.tjsp.jus.br/cpopg/open.do'
URLB = 'https://esaj.tjsp.jus.br/sajcas/login?service=https%3A%2F%2Fesaj.tjsp.jus.br%2Fesaj%2Fj_spring_cas_security_check'

user = 'x0x0x0x0x0x0x'
password = 'x0x0x0x0x0x0x'

browser = webdriver.Chrome()
browser.get(URLB)

# Log in first
browser.find_element(By.ID, 'login').click()
browser.find_element(By.NAME, 'user').send_keys(user)
browser.find_element(By.NAME, 'password').send_keys(password)
browser.find_element(By.ID, 'Enviar').click()

# Reuse the same browser so the login session is kept
browser.get(URLA)

browser.find_element(By.ID, 'NUMB').send_keys('1009238445480')
browser.find_element(By.ID, 'Enviar_').click()

scrap = BeautifulSoup(browser.page_source, "html.parser")

# Same search as in the question
processo = scrap.find('span', {'class':' '' '})

print(processo)

I haven’t tested it yet.
