The <strong> tag inside each <li> gets in the way of the search as you are doing it. You can approach the problem this way instead:
from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
# Collect every <li> in the document
data = soup.find_all('li')
# Keep the items whose text starts with "4" and ends with "dormitórios"
dorms = [i for i in data if i.text.startswith('4') and i.text.endswith('dormitórios')]
print(dorms)  # [<li><strong>4</strong>dormitórios</li>]
This returns a list of the <li> elements whose text starts with "4" and ends with "dormitórios".
If you want the text inside the <li> without the nested tags, you can do:
from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('li')
# .text drops the <strong> tag and keeps only the characters inside the <li>
dorms = [i.text for i in data if i.text == '4dormitórios']
print(dorms)  # ['4dormitórios']
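The two snippets above hard-code the "4". If the number changes from page to page, one option is to match on the label and read the number from the <strong> tag. This is only a sketch under the assumption that every page keeps the same <li><strong>value</strong>label</li> layout; `value_for_label` is a helper name made up for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="crop"><ul style="width: 500px;">'
    '<li><strong>587 m²</strong> de área útil</li>'
    '<li><strong>1089 m²</strong>Total</li>'
    '<li><strong>4</strong>dormitórios</li>'
    '<li><strong>4</strong>suítes</li>'
    '<li><strong>8</strong>vagas</li>'
    '</ul></div>'
)


def value_for_label(html, label_prefix):
    """Hypothetical helper: return the <strong> text of the first <li>
    whose remaining text starts with label_prefix, or None if no item matches."""
    soup = BeautifulSoup(html, 'html.parser')
    for li in soup.find_all('li'):
        strong = li.find('strong')
        if strong is None:
            continue  # this <li> does not follow the assumed layout
        value = strong.get_text(strip=True)
        # Text of the <li> with the leading <strong> part cut off, e.g. "dormitórios"
        label = li.get_text(strip=True)[len(value):].strip()
        if label.startswith(label_prefix):
            return value
    return None


# The prefix "dormitório" covers both the singular and the plural form
print(value_for_label(html, 'dormitório'))
```

The same helper should work for the other items, for example `value_for_label(html, 'vaga')`.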
To get only the number of bedrooms, a regex alone is enough:
import re

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'
# Capture whatever is inside the <strong> that comes right after the "Total" item
dorms = re.findall(r'Total</li><li><strong>(.*?)</strong>dormitório', html)
print(dorms)  # ['4']
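The pattern above relies on the "Total" item coming immediately before the bedrooms item. A looser variant, sketched here on the assumption that the count is always a plain integer inside <strong>, anchors only on the label and accepts both "dormitório" and "dormitórios". `BEDROOMS_RE` and `count_bedrooms` are names invented for this example.

```python
import re

html = (
    '<div class="crop"><ul style="width: 500px;">'
    '<li><strong>587 m²</strong> de área útil</li>'
    '<li><strong>1089 m²</strong>Total</li>'
    '<li><strong>4</strong>dormitórios</li>'
    '<li><strong>4</strong>suítes</li>'
    '<li><strong>8</strong>vagas</li>'
    '</ul></div>'
)

# (\d+) captures a number of any length; the trailing "s?" makes the plural optional
BEDROOMS_RE = re.compile(r'<strong>\s*(\d+)\s*</strong>\s*dormitórios?', re.IGNORECASE)


def count_bedrooms(html):
    """Return the bedroom count as an int, or None when the label is not found."""
    match = BEDROOMS_RE.search(html)
    return int(match.group(1)) if match else None


print(count_bedrooms(html))  # 4
```

Returning None instead of raising makes it easier to loop over many URLs and skip the pages that have no bedrooms entry.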
But how do I get only the number when it can vary? I'm reading several URLs with the same structure, so the text can be "dorm" or "dorms" (singular or plural), and I want to capture only the number.
– Jeu Domingos
Even easier, @Jeudomingos. I'll add it at the top of the answer.
– Miguel