BeautifulSoup - href search by text

Good afternoon everyone. I have a problem that I haven't been able to solve, and I couldn't find anything related to it.

If I have:

from bs4 import BeautifulSoup

codigo_pagina = '''<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">
                Something1</a></span></span></li>
            <li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></li>
            <li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">Something3</a></span><span><span style="font-family: Courier New">
                (<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

soup = BeautifulSoup(codigo_pagina, "lxml")

path = soup.findAll('a', href=True, text="Something3")
print(path)

I get:

>>> [<a href="page1/somethingC.aspx">Something3</a>]

which is what I want.

But if Something3 moves to a new line (as if I had pressed Enter), the href is no longer found and I get nothing:

codigo_pagina = '''<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">
                Something1</a></span></span></li>
            <li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></li>
            <li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">
Something3</a></span><span><span style="font-family: Courier New">
                (<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

So I get nothing...

>>>

  • I tried to eliminate the newlines (\n) with soup.replace('\n', ' ').replace('\r', '') before calling findAll, but I get the error TypeError: 'NoneType' object is not callable because the soup variable is not a string. I could do it on the path variable, but that is no use, because by then the search has already failed to "see" that the link text I want is on the next line.

1 answer

For some reason, the HTML structure in PyCharm needs to follow certain criteria.

I tested your code here and got it to work by adding a backslash (\) at the end of every line that breaks in the HTML. This backslash tells PyCharm that the instruction is not over and that it continues on the next line.

Full code follows:

from bs4 import BeautifulSoup

codigo_pagina = '''
<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">\
Something1</a></span></span></li>\
\
<li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></span></li>\
\
<li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">\
\
Something3</a></li></span><span><span style="font-family: Courier New">\

(<a href="page1/anotherthing.aspx">anothertext</a>)</span></span></li>
'''



soup = BeautifulSoup(codigo_pagina, 'lxml')
path = soup.findAll('a', href=True, text="Something3")
print(path)
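
As a side note on why the backslashes change the result, here is a minimal sketch. Inside a Python string literal, a backslash directly before a line break removes that line break from the string, so the parser never sees a newline in the link text. The variable names and the shortened HTML below are made up for illustration.

```python
# Two string literals that look almost the same in the editor.
with_newline = '''<a href="x">
Something3</a>'''

# The trailing backslash joins the two source lines into one.
joined = '''<a href="x">\
Something3</a>'''

# repr() makes the difference visible: only the first one contains \n.
print(repr(with_newline))
print(repr(joined))
```

This is ordinary Python string-literal behaviour, so it would apply when the HTML is pasted into source code, not when it is downloaded from a page.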
  • Thanks for your help. But how does this work if the HTML is on a web page that was written that way (with the link text on the next line)?

  • It works smoothly! I uploaded a test page and the code returned the correct value for both versions of the HTML. The problem, in this case, is the structure of PyCharm itself!

  • I cannot change the "codigo_pagina" variable and add the "\" because the code is part of a page programmed by another person, who wrote it with this quirk of putting the link text on the next line. So I have to find a solution that works without touching the page code.

  • But isn't it easier, then, to extract it from the web page itself? Or is there no way? If not, try another IDE that might work without the backslashes.
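
For the case raised in the comments, where the page code cannot be edited, one possible approach is to compare the link text after stripping surrounding whitespace instead of requiring an exact match. The sketch below is only an illustration under a few assumptions: link_text_is is a hypothetical helper name, the HTML is a shortened version of the snippet from the question, "html.parser" is used instead of "lxml" to avoid an extra dependency, and string= is the newer spelling of the text= argument (older Beautiful Soup versions may need text=).

```python
from bs4 import BeautifulSoup

# Shortened version of the question's HTML, with the link text on its own line.
codigo_pagina = '''<li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">
Something3</a></span>
(<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

soup = BeautifulSoup(codigo_pagina, "html.parser")


def link_text_is(wanted):
    """Build a filter that compares a tag's text after stripping whitespace."""
    def matcher(text):
        # text can be None when a tag has no single string child.
        return text is not None and text.strip() == wanted
    return matcher


# The filter function replaces the exact-match string "Something3".
path = soup.find_all("a", href=True, string=link_text_is("Something3"))
print(path)
print([a["href"] for a in path])
```

The same filter should also match the original one-line form of the link, since stripping whitespace leaves that text unchanged.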
