How can I replace this regular expression with Beautiful Soup?

Currently I use this expression to extract everything from one <b> tag up to the next <b> tag:

blocks = re.findall(r'<b>.+?<b>', str(element))

How can I do the same thing using Beautiful Soup?

NOTE: the HTML is unstructured and arrives messy in several different ways, so I need something that works in all cases.

1 answer

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Get the content of the b tag</title></head>
<body>

<b>Teste 1</b>
<b>Teste 2</b>
<b>Teste 3</b>

</body>
</html>

"""

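# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')
# Calling the soup object is shorthand for soup.find_all('b')
lista = soup('b')
print(lista)

# .string gives the tag's text when the tag has exactly one child
for item in lista:
    print(item.string)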

Output:

python3 conteudo_b.py 
[<b>Teste 1</b>, <b>Teste 2</b>, <b>Teste 3</b>]
Teste 1
Teste 2
Teste 3
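
The listing above returns each <b> tag on its own, while the question asks for everything from one <b> up to the next. A minimal sketch of that, assuming the <b> tags are siblings under the same parent (messier nesting would need extra handling): walk each tag's next_siblings and stop at the next <b>. The helper name blocks_between_b and the sample HTML are made up for illustration. Unlike the regex, each block stops before the next <b> instead of including its opening tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample: loose text and tags sitting between sibling <b> tags
html_doc = """
<body>
<b>Teste 1</b> text after one <i>more</i>
<b>Teste 2</b> text after two
<b>Teste 3</b>
</body>
"""

def blocks_between_b(soup):
    """Return one string per <b>: the tag plus everything up to the next <b>."""
    blocks = []
    for b in soup.find_all('b'):
        parts = [str(b)]
        for sibling in b.next_siblings:
            # Stop as soon as the next <b> sibling is reached
            if isinstance(sibling, Tag) and sibling.name == 'b':
                break
            parts.append(str(sibling))
        blocks.append(''.join(parts).strip())
    return blocks

soup = BeautifulSoup(html_doc, 'html.parser')
blocks = blocks_between_b(soup)
for block in blocks:
    print(block)
```

With this sample, the last block contains only its own <b> tag, because nothing follows it inside the parent.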
