How to find the "<li>" tag containing the "dorm" text in a list of tags in Python 3?

I’m learning to code, and Python is my first language. I’m having trouble capturing this tag:

<li> 4 dormitórios

html:

<div class="crop">
<ul style="width: 500px;">
<li><strong>587 m²</strong> de área útil</li>
<li><strong>1089 m²</strong> Total</li>
<li>
<strong>4</strong>
           dormitórios                                            </li>
<li>
<strong>4</strong>
       suítes                                            </li>
<li>
<strong>8</strong>
        vagas                                            </li>
</ul>
</div>

I am using find with a regex, as in the expression below:

bsObj.find("div",{"class":"crop"}).find("ul",li=re.compile("^\d*[0-9](dormitórios)*$"))

but it returns None. What is wrong with the code?

1 answer

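In find("ul", li=...), the li keyword is treated as an attribute filter on the <ul> tag, not as a search for a child <li>, so nothing matches and the result is None. The <strong> tag inside the <li> also gets in the way of matching on the text the way you are doing it. You can approach the problem like this instead: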

from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('li')
# Keep the <li> tags whose text starts with '4' and ends with 'dormitórios'
dorms = [i for i in data if i.text.startswith('4') and i.text.endswith('dormitórios')]
print(dorms)

This gives you a list of the <li> tags whose text starts with "4" and ends with "dormitórios".
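
The markup in the question has newlines and padding inside each <li>, so a plain startswith('4') would not match it as posted. The following is a minimal sketch, assuming the question's HTML unchanged, that first reproduces the None result and then compares against whitespace-normalized text. The names QUESTION_HTML and dorm_pattern are illustrative, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Markup as posted in the question, with its newlines and padding kept.
QUESTION_HTML = """<div class="crop">
<ul style="width: 500px;">
<li><strong>587 m²</strong> de área útil</li>
<li><strong>1089 m²</strong> Total</li>
<li>
<strong>4</strong>
           dormitórios                                            </li>
<li>
<strong>4</strong>
       suítes                                            </li>
<li>
<strong>8</strong>
        vagas                                            </li>
</ul>
</div>"""

soup = BeautifulSoup(QUESTION_HTML, "html.parser")
crop = soup.find("div", {"class": "crop"})

# The call from the question: li=... is read as "a <ul> with an attribute
# named li", so nothing matches.
original_attempt = crop.find("ul", li=re.compile(r"^\d*[0-9](dormitórios)*$"))
print(original_attempt)  # None

# Hypothetical pattern: a number, one space, then "dormitório" or "dormitórios".
dorm_pattern = re.compile(r"^\d+ dormitórios?$")

# get_text(" ", strip=True) strips each text piece and joins the pieces with
# one space, so the padded <li> becomes "4 dormitórios".
dorm_items = [
    li for li in crop.find_all("li")
    if dorm_pattern.match(li.get_text(" ", strip=True))
]
print([li.get_text(" ", strip=True) for li in dorm_items])  # ['4 dormitórios']
```

Normalizing the text before matching keeps the check independent of how the page happens to be indented.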

If you want only the text inside the <li>, without the nested tags, you can do:

from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('li')
# .text joins the strings of the <li> and its children, dropping the tags
dorms = [i.text for i in data if i.text == '4dormitórios']
print(dorms)
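
The exact comparison above only works when the number is known in advance. As a rough sketch, assuming every feature is laid out as a <strong> number followed by a label, the number and the label can be read separately from each <li>. The function name parse_features is hypothetical.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="crop"><ul style="width: 500px;">'
    '<li><strong>587 m²</strong> de área útil</li>'
    '<li><strong>1089 m²</strong>Total</li>'
    '<li><strong>4</strong>dormitórios</li>'
    '<li><strong>4</strong>suítes</li>'
    '<li><strong>8</strong>vagas</li>'
    '</ul></div>'
)

def parse_features(html):
    """Map each label to its count for <li> items shaped like '<number> <label>'."""
    soup = BeautifulSoup(html, "html.parser")
    features = {}
    for li in soup.find_all("li"):
        # stripped_strings yields each text piece with surrounding whitespace removed
        parts = list(li.stripped_strings)
        # isdecimal() rejects values such as "587 m²", which int() could not parse
        if len(parts) == 2 and parts[0].isdecimal():
            features[parts[1]] = int(parts[0])
    return features

print(parse_features(html))  # {'dormitórios': 4, 'suítes': 4, 'vagas': 8}
```

The area entries are skipped because their first piece is not a plain integer.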

To get only the number of bedrooms, a regex alone is enough:

import re

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

# Capture whatever sits inside the <strong> that precedes "dormitório"
dorms = re.findall(r'Total</li><li><strong>(.*?)</strong>dormitório', html)
print(dorms) # ['4']
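
The pattern above depends on the preceding "Total" item and on markup with no whitespace. A possible variant, sketched here under the assumption that the number always sits in a <strong> right before the label, accepts any number, optional whitespace, and both "dormitório" and "dormitórios". DORM_PATTERN, count_dorms and the sample strings are illustrative names and data.

```python
import re

# Digits inside <strong>, optional whitespace, then the singular or plural label.
DORM_PATTERN = re.compile(r"<strong>\s*(\d+)\s*</strong>\s*dormitórios?")

def count_dorms(html):
    """Return the bedroom count as an int, or None when no match is found."""
    match = DORM_PATTERN.search(html)
    return int(match.group(1)) if match else None

samples = [
    '<li><strong>4</strong>dormitórios</li>',
    '<li>\n<strong>1</strong>\n   dormitório   </li>',
    '<li><strong>8</strong>vagas</li>',
]
print([count_dorms(s) for s in samples])  # [4, 1, None]
```

Returning None for pages without a bedroom entry lets the caller tell a missing value apart from a count.
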
  • Bro, but how do I get only the number now? The number may vary because I’m reading several URLs with the same structure, and the text can be "dormitório" or "dormitórios", so I want to capture only the number.

  • Even easier, @Jeudomingos. I’ll add it to the answer above.
