How to read <br/> in HTML files and print as line breaks?

I made a web scraper with the BeautifulSoup and requests modules that fetches the definition and example of a term from Urban Dictionary. This is the code, using the word "reparation" as an example:

import requests
from bs4 import BeautifulSoup

word = 'reparation'
# fetch the page for the term and parse it
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content, features='html.parser')
# .text returns only the text, dropping every tag
definition = soup.find("div", attrs={"class": "meaning"}).text
example = soup.find("div", attrs={"class": "example"}).text

The program returns the site example as:

"Bob: Here is $50 for me Hitting you. Charles: Thanks for the reparation."

However, the site has two line breaks between the two sentences, so the example should read:

"Bob: Here is $50 for me Hitting you.

Charles: Thanks for the reparation."

How do I keep these line breaks in the string example?

1 answer

Since text returns the text without the tags, the <br> is removed as well.

So one way is to replace the <br> tags before getting the text:

# get the tag (not its text)
example = soup.find("div", attrs={"class": "example"})
# replace every br inside the tag
for br in example.find_all("br"):
    br.replace_with("\n")

# only now get the text
print(example.text)

This replaces each br with a line break (\n), and the output will be:

Bob: Here is $50 for me hitting you.

Charles: Thanks for the reparation.
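
To try this without hitting the network, here is a minimal sketch that runs the same replacement on a hard-coded snippet. The html string and the helper name text_with_line_breaks are assumptions made for illustration; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page (an assumption,
# not the site's real HTML).
html = (
    '<div class="example">Bob: Here is $50 for me hitting you.'
    '<br/><br/>Charles: Thanks for the reparation.</div>'
)


def text_with_line_breaks(tag):
    """Return the tag's text with every <br> turned into a newline.

    Note that this modifies the parsed tree in place.
    """
    for br in tag.find_all("br"):
        br.replace_with("\n")
    return tag.text


soup = BeautifulSoup(html, "html.parser")
example = soup.find("div", attrs={"class": "example"})
result = text_with_line_breaks(example)
print(result)
```

Because find_all returns a list of the matching tags, replacing them inside the loop is safe.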

Or, instead of replacing the tag, we can add the line break to it:

for br in example.find_all("br"):
    br.append("\n") # use append instead of replace_with

print(example.text)

The difference is that the br tags now stay in the tree with the line break (\n) as their content, and when we call text, it removes the tags but keeps the \n.
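
The two approaches can be compared side by side with a small sketch like the one below. The sample markup and the variable names replaced and appended are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the real site.
html = "<div>first line<br/>second line</div>"

# Approach 1: swap each br for a newline string.
replaced = BeautifulSoup(html, "html.parser").div
for br in replaced.find_all("br"):
    br.replace_with("\n")

# Approach 2: keep each br and give it a newline as its content.
appended = BeautifulSoup(html, "html.parser").div
for br in appended.find_all("br"):
    br.append("\n")

print(len(replaced.find_all("br")))    # how many br tags remain after replace_with
print(len(appended.find_all("br")))    # how many br tags remain after append
print(replaced.text == appended.text)  # whether both yield the same text
```

If the tree is going to be serialized back to HTML afterwards, append keeps the br tags while replace_with does not, which may matter when choosing between them.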

It is worth noting that both MDN and the WHATWG specification state that the br tag has no content. More precisely, WHATWG gives br the "Nothing" content model, which still allows inter-element whitespace (text made up only of ASCII whitespace, such as spaces, tabs and line breaks). So \n is valid content for the br tag, and both append and replace_with are valid solutions.
