In requests, how to correctly read the ISO-8859-1 encoding?

Question

Asked 6 years, 4 months ago

Viewed 816 times

1

In Python 3, with beautifulsoup4 and requests, I want to extract some information from a site whose encoding is reported as 'ISO-8859-1'. I tried this approach to display the text correctly:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding

encoding = req.encoding
text = req.content

decoded_text = text.decode(encoding)

sopa = BeautifulSoup(decoded_text, "lxml")

sopa.find("h1")

And the result that appears is:

<h1>
                        CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>

When I copy and paste it here, the text appears correct, but on my computer every accented character is wrong.

I'm on a machine running Ubuntu.

Does anyone know the correct way to read this encoding?

Edited June 2, 2019

I had the help of @snakecharmerb here

In the answer, he explained that when no explicit charset is present in the HTTP headers and the Content-Type header contains text, RFC 2616 specifies that the default character set is ISO-8859-1, which is the case with this site.

But the words are clearly UTF-8, so I set the encoding manually. My code ended up like this, and it worked:
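
The fallback rule described above can be sketched with the standard library alone. This is a simplified stand-in for illustration, not the code that requests itself runs; the helper name `guess_charset` is hypothetical.

```python
from email.message import Message


def guess_charset(content_type):
    """Hypothetical helper: choose a charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter in the header always wins.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset

    # No charset given: RFC 2616 names ISO-8859-1 as the default for text/* types.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Non-text content with no charset: nothing to assume.
    return None


print(guess_charset("text/html"))
print(guess_charset("text/html; charset=UTF-8"))
print(guess_charset("application/octet-stream"))
```

For the bare `text/html` header this site sends, the helper returns ISO-8859-1, even though the bytes in the body are actually UTF-8.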

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding

'ISO-8859-1'

req.headers['content-type']
'text/html'

req.encoding = 'UTF-8'

sopa = BeautifulSoup(req.text,'lxml')

sopa.find('h1').text
'\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'
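
A minimal offline sketch of why the accents broke, assuming (as the fix above suggests) that the server sends UTF-8 bytes. The heading text is taken from the output above; no network access is needed.

```python
original = "CÂMARA MUNICIPAL DE SÃO PAULO"

# Bytes as the server is assumed to send them (UTF-8).
raw = original.encode("utf-8")

# Decoding with the header-derived default turns each accented letter
# into two characters: "Ã" followed by an invisible control character.
wrong = raw.decode("iso-8859-1")

# Decoding with the real encoding restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)

# ISO-8859-1 maps every byte to exactly one code point, so a wrongly
# decoded string can be turned back into bytes and decoded again.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

Setting `req.encoding = 'UTF-8'` before reading `req.text` has the same effect as the `raw.decode("utf-8")` step here.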

2 answers


by jsbueno • **30,668** points · Answer 1 · 2019-05-31T16:22:06+00:00

In fact, requests already does the decoding for you, using the encoding it determined for the response.

Instead of accessing the .content attribute, just access the .text attribute of the response object:

In [382]: import requests

In [383]: data = requests.get("https://slashdot.org")

In [384]: type(data.content)
Out[384]: bytes

In [385]: type(data.text)
Out[385]: str

But understanding encodings, rather than guessing at what is happening, is vital in this industry. I never tire of recommending the following article, originally written in 2003 by Joel Spolsky, co-founder of Stack Overflow:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

by Guilherme Nascimento • **98,651** points · Answer 2 · 2019-05-31T15:56:44+00:00

0

If you mean how the text is displayed in a terminal or the Windows command prompt, then to resolve it you should declare the script's charset in Python, like this:

# -*- coding: latin-1 -*-

import requests
from bs4 import BeautifulSoup

...

That is for a page in iso-8859-1 or windows-1252; if it is in utf-8, use this:

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

...

It seems to me that the page you linked is in utf-16, so I think it would be like this:

# -*- coding: utf-16 -*-

import requests
from bs4 import BeautifulSoup

...

1

Actually, this is 100% incorrect. The charset declaration in the source code applies to accented characters in the .py file itself, not to data read from an external source such as a file or, in this case, an HTTP request.

– jsbueno

2019/05/31 at 17:21
@jsbueno but I did not say that the problem was in HTTP; I said something else entirely. It is requests that downloads the data and BeautifulSoup that decodes it, and it probably already handles this. The problem is that the site comes in utf-16 (or so it seemed to me) and the script is in another encoding. Fine, part of the answer may be wrong, but what you said in the comment is certainly not what I said, and I will review the answer.

– Guilherme Nascimento

2019/05/31 at 19:58
1

100% incorrect, and maybe more, since you insist on the error. The declaration is used by the Python compiler when reading the .py file and generating the bytecode. It does not interfere with how strings are shown in any terminal; it would only change the appearance of a string typed directly into the code.

– jsbueno

2019/05/31 at 22:35
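
The point made in the comments above can be checked with a small sketch: the coding declaration changes how a string literal in the source is read, while bytes decoded at run time are unaffected. The `run` helper and the in-memory "source files" below are illustrative assumptions, not part of any answer.

```python
# The same source bytes (a literal written in UTF-8) under two declarations.
body = 'text = "SÃO"\n'.encode("utf-8")
utf8_src = b"# -*- coding: utf-8 -*-\n" + body
latin1_src = b"# -*- coding: latin-1 -*-\n" + body


def run(source_bytes):
    """Compile and execute source bytes, returning the value of `text`."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["text"]


# The declaration only affects how the compiler reads the literal.
print(repr(run(utf8_src)))
print(repr(run(latin1_src)))

# External data is decoded by whatever codec is passed at run time,
# regardless of any declaration at the top of the script.
external = b"S\xc3\x83O".decode("utf-8")
print(external)
```

Only the literal compiled under the latin-1 declaration comes out garbled; the externally supplied bytes decode the same way in either case.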