BeautifulSoup - href search by text

Question

Asked 6 years, 4 months ago

Good afternoon everyone. I have a problem that I haven't been able to solve, and I couldn't find any related question.

If I have:

from bs4 import BeautifulSoup

codigo_pagina = '''<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">
                Something1</a></span></span></li>
            <li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></li>
            <li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">Something3</a></span><span><span style="font-family: Courier New">
                (<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

soup = BeautifulSoup(codigo_pagina, "lxml")

path = soup.findAll('a', href=True, text="Something3")
print(path)

I get:

>>> [<a href="page1/somethingC.aspx">Something3</a>]

which is what I want.

But if Something3 moves to a new line (as if I had pressed Enter after the opening tag), the href is no longer found and I get nothing:

codigo_pagina = '''<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">
                Something1</a></span></span></li>
            <li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></li>
            <li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">
Something3</a></span><span><span style="font-family: Courier New">
                (<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

So I get nothing...

>>>

I tried to remove the newlines (\n) with soup.replace('\n', ' ').replace('\r', '') and then call findAll, but that gives me the error TypeError: 'NoneType' object is not callable, because the soup variable is not a string. I could do it on the path variable instead, but that is useless, because by then the search has already failed to "see" that the link text I want is on the next line.

– Bgreat

2019/04/07 at 15:49

1 answer

by Guilherme Carvalho Lithg • **646** points · Answer 1 · 2019-04-07T16:16:44+00:00

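The search fails because text="Something3" requires an exact match, and once the link text starts on a new line it becomes "\nSomething3".

I tested your code here and got it to work by adding a backslash (\) at the end of every line that breaks inside the HTML string. In a Python string literal, a backslash at the end of a line means the line is continued on the next one, so the line break is left out of the string.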
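To illustrate the backslash behaviour mentioned above, here is a minimal sketch using only plain Python strings. The variable names are made up for this illustration. It assumes a regular (non-raw) triple-quoted string, where a backslash directly before a line break removes that line break.

```python
# A plain line break inside a triple-quoted string is kept as "\n".
with_newline = '''<a href="page1/somethingC.aspx">
Something3</a>'''

# A backslash immediately before the line break makes Python drop that line break.
with_backslash = '''<a href="page1/somethingC.aspx">\
Something3</a>'''

print(repr(with_newline))
print(repr(with_backslash))
```

Only the second string has "Something3" directly after the opening tag, which is why an exact-text search can match it.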

The full code follows:

from bs4 import BeautifulSoup

codigo_pagina = '''
<li><span><span style="font-family: Courier New"><a href="page1/somethingA.aspx">\
Something1</a></span></span></li>\
\
<li><span><span style="font-family: Courier New"><a href="page1/somethingB.aspx">Something2</a></span></span></li>\
\
<li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">\
\
Something3</a></li></span><span><span style="font-family: Courier New">\

(<a href="page1/anotherthing.aspx">anothertext</a>)</span></span></li>
'''

soup = BeautifulSoup(codigo_pagina, 'lxml')
path = soup.findAll('a', href=True, text="Something3")
print(path)
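
When the HTML comes from a real page and cannot be edited by hand, another possible approach is to make the search itself tolerate whitespace around the link text. The sketch below is not part of the tested code above. It assumes a shortened version of the question's HTML, uses "html.parser" instead of "lxml" only to avoid an extra dependency, and uses find_all with string=, the newer spellings of findAll and text=.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the question's second snippet:
# the link text starts on a new line.
codigo_pagina = '''<li><span style="font-family: Courier New"><a href="page1/somethingC.aspx">
Something3</a></span></li>
<li>(<a href="page1/anotherthing.aspx">anothertext</a>)</li>'''

# Parser choice is an assumption made for this sketch.
soup = BeautifulSoup(codigo_pagina, "html.parser")

# Option 1: a regular expression that allows whitespace on both sides of the text.
by_regex = soup.find_all("a", href=True, string=re.compile(r"^\s*Something3\s*$"))

# Option 2: compare the stripped text of every link that has an href.
by_strip = [a for a in soup.find_all("a", href=True)
            if a.get_text(strip=True) == "Something3"]

print([a["href"] for a in by_regex])
print([a["href"] for a in by_strip])
```

Both options should select only the link whose visible text is Something3, regardless of line breaks or indentation around it.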