Decode HTML entities in a Python string

4

I'm using Python 3 to access a web API. The responses come back as JSON, and my problem is that one of the strings arrives encoded with HTML entities (specifically for the accented characters).

For example:

"orienta&ccedil;&atilde;o-a-objetos"

Is there a parser that returns strings with the HTML entities resolved?

2 answers

4

I found this for Python 3.4+:

>>> import html
>>> html.unescape('orienta&ccedil;&atilde;o-a-objetos')
'orientação-a-objetos'

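For Python 3 versions prior to 3.4 (note that HTMLParser.unescape was deprecated in 3.4 and removed in 3.9, so this does not work on current versions):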

>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> h.unescape('orienta&ccedil;&atilde;o-a-objetos')
'orientação-a-objetos'
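
Since the question is about JSON coming from a web API, here is a minimal sketch of applying html.unescape to every string in the decoded response. The helper name unescape_strings and the sample payload are made up for illustration; the real response will have its own shape. Only values are unescaped here, and dictionary keys are left alone.

```python
import html
import json


def unescape_strings(value):
    """Recursively apply html.unescape to every string in a decoded JSON value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_strings(item) for item in value]
    if isinstance(value, dict):
        # Keys are kept as-is; only the values are unescaped.
        return {key: unescape_strings(item) for key, item in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value


# Hypothetical payload standing in for the API response body.
raw = '{"tag": "orienta&ccedil;&atilde;o-a-objetos", "count": 3}'
data = unescape_strings(json.loads(raw))
print(data["tag"])
```

If only one field is affected, calling html.unescape on that field directly is simpler than walking the whole structure.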

0

It is also possible to use Beautiful Soup (bs4 on Python 3, or the older BeautifulSoup 3 package on Python 2). Besides decoding the HTML entities into Unicode characters, it also lets you work with the HTML elements individually, if the input string contains any.

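from bs4 import BeautifulSoup

# Input string containing HTML entities
s = 'orienta&ccedil;&atilde;o-a-objetos'
t = BeautifulSoup(s, 'html.parser')
print(t.get_text())  # entities are decoded in the extracted text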
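
To illustrate the "working with the elements individually" part, here is a small sketch, assuming the bs4 package is installed and that the string contains markup. The sample input with two <p> elements is invented for the example and is not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical input: entity-encoded text wrapped in HTML elements.
s = '<p>orienta&ccedil;&atilde;o-a-objetos</p><p>heran&ccedil;a</p>'
soup = BeautifulSoup(s, 'html.parser')

# Extract the decoded text of each <p> element separately.
texts = [p.get_text() for p in soup.find_all('p')]
print(texts)
```

For a plain string with no tags, html.unescape from the standard library is enough and avoids the extra dependency.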
