Why does the HTTP request response not recognize special characters?

Why, when I make a request in Python and parse it with BeautifulSoup, does the response not handle accented Latin characters correctly?

Code:

import requests
from bs4 import BeautifulSoup

req = requests.post(url="https://www.linkcorreios.com.br/?id=LE132696585SE")
soup = BeautifulSoup(req.text, 'html.parser')
texto = soup.find('ul', {'class': 'linha_status m-0'}).text
print(req)
print(texto)

Output:

<Response [200]>
Status: Objeto em trÃ¢nsito - por favor aguarde
Data  : 29/04/2021 | Hora: 10:29
Origem: Unidade de LogÃ­stica Integrada - Curitiba / PR
Destino: Unidade de Tratamento - Cajamar / SP

I’m using VS Code with Python 3.9.4.

1 answer

In short, it is an encoding problem (if you want to understand in depth what an encoding is, read here).

But basically, all the traffic goes back and forth as bytes, which are converted to text and vice versa. There are several ways to convert bytes to and from text (several different encodings), and this kind of problem happens when you try to decode with one encoding while another was actually used to encode (for example, the response was converted to bytes with one encoding, but you try to convert those bytes back to text with a different one).
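The mismatch is easy to reproduce in a few lines (a minimal sketch; the sample string is just an illustration modeled on the tracking output above):

```python
# Text encoded to bytes with UTF-8, then decoded back with ISO-8859-1,
# garbles every accented character.
original = "Objeto em trânsito"
raw = original.encode("utf-8")        # bytes as they travel over the wire
garbled = raw.decode("iso-8859-1")    # wrong encoding on the way back
restored = raw.decode("utf-8")        # right encoding

print(garbled)   # Objeto em trÃ¢nsito
print(restored)  # Objeto em trânsito
```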

In the requests documentation we can see here the following:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers.

And here:

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute.

That is, when a request is made, requests tries to guess the response's encoding based on the HTTP headers. Then, when the text attribute of the response is accessed, that encoding is used to convert the bytes to text. But the same link above also states:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1.

That is, if no charset is specified in the HTTP headers and the Content-Type contains text, the encoding is set to ISO-8859-1. We can see that this is exactly the case by printing the response headers:

response = requests.post('https://www.linkcorreios.com.br/?id=LE132696585SE')
print(response.headers)

The output was:

{'Date': 'Wed, 23 Jun 2021 11:55:38 GMT', 'Server': 'Apache/2.4.7 (Ubuntu)', 'X-Powered-By': 'PHP/5.5.9-1ubuntu4.21', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '5090', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html'}

We can see that the Content-Type is text/html (that is, it contains text) and no header carries an explicit charset. Hence the encoding used will be ISO-8859-1, which can be confirmed like this:

response = requests.post('https://www.linkcorreios.com.br/?id=LE132696585SE')
print(response.encoding) # ISO-8859-1

Therefore, when the text attribute is accessed, the bytes are converted to text using ISO-8859-1.

But as we have seen, the page is not actually in ISO-8859-1, otherwise the accents would be shown correctly.
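One way around this is to override the guessed encoding before reading text. The sketch below simulates it offline with a hand-built Response object (the private _content attribute is used only so the example runs without the network); with a live request you would simply set response.encoding right after the requests.post call:

```python
import requests

# Simulate a response whose body is UTF-8 but whose headers carry no charset.
resp = requests.models.Response()
resp._content = "Objeto em trânsito".encode("utf-8")  # private attr, simulation only
resp.encoding = "ISO-8859-1"          # what requests falls back to
print(resp.text)                      # Objeto em trÃ¢nsito (garbled)

resp.encoding = "utf-8"               # override before reading .text again
print(resp.text)                      # Objeto em trânsito
```

With a real response you could also use `response.encoding = response.apparent_encoding`, which runs charset detection on the raw bytes.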


The solution is simple: instead of passing the response's text to Beautiful Soup, pass its content (the raw bytes):

import requests
from bs4 import BeautifulSoup

response = requests.post('https://www.linkcorreios.com.br/?id=LE132696585SE')

# *** Instead of response.text, use response.content ***
soup = BeautifulSoup(response.content, 'html.parser')
#                             ^^^^^^^

texto = soup.find('ul', {'class': 'linha_status m-0'}).get_text()
print(response)
print(texto)

This way, Beautiful Soup reads the raw bytes (instead of text that was decoded with the wrong encoding), and it knows how to convert those bytes to text on its own (more details here). With that, the accents are shown correctly.
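Beautiful Soup's detection can be seen with a self-contained snippet (the HTML is hypothetical, modeled on the page, with the charset declared in a meta tag):

```python
from bs4 import BeautifulSoup

# Hypothetical page declaring its real encoding in a <meta> tag.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><ul class="linha_status m-0">'
    '<li>Status: Objeto em trânsito</li>'
    '</ul></body></html>'
).encode("utf-8")

# Pre-decoded with the wrong encoding, the accent is already lost:
wrong = BeautifulSoup(html_bytes.decode("iso-8859-1"), "html.parser")
print(wrong.li.get_text())  # Status: Objeto em trÃ¢nsito

# Given the raw bytes, Beautiful Soup detects the charset itself:
right = BeautifulSoup(html_bytes, "html.parser")
print(right.li.get_text())  # Status: Objeto em trânsito
```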
