How to correctly read the ISO-8859-1 encoding with requests?

In Python 3, with beautifulsoup4 and requests, I want to extract some information from a site whose encoding is 'ISO-8859-1'. I tried this strategy to display the text correctly:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

# encoding reported by requests, taken from the HTTP headers
encoding = req.encoding

# decode the raw bytes with that encoding
text = req.content
decoded_text = text.decode(encoding)

sopa = BeautifulSoup(decoded_text, "lxml")

print(sopa.find("h1"))

And the result that appears is:

<h1>
                        CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>

When I copy and paste it here it looks correct, but on my computer all the accents come out wrong.

I’m on a machine with Ubuntu

Does anyone know the correct way to read the encoding?


Edited June 2, 2019

I got help from @snakecharmerb here.

In his answer he explained that when no explicit charset is present in the HTTP headers and the Content-Type header contains text, RFC 2616 specifies that the default character set should be ISO-8859-1, which is the case with this site.
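For reference, a minimal sketch (my addition, not from the original answer) of where that default comes from; get_encoding_from_headers is the helper requests uses internally, at least in current versions, to pick an encoding from the headers:

from requests.utils import get_encoding_from_headers

# No charset in the Content-Type: requests falls back to ISO-8859-1
# for any text/* response, following RFC 2616.
print(get_encoding_from_headers({'content-type': 'text/html'}))
# 'ISO-8859-1'

# With an explicit charset, that value is used instead.
print(get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'}))
# 'utf-8'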

But the content is clearly UTF-8, so I set the encoding manually and it worked. My code ended up like this:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

req.encoding
# 'ISO-8859-1'

req.headers['content-type']
# 'text/html'

# the headers give no charset, but the body is really UTF-8, so override it
req.encoding = 'UTF-8'

sopa = BeautifulSoup(req.text, 'lxml')

sopa.find('h1').text
# '\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'
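A variation of the same fix (my sketch, not part of the original edit): instead of hard-coding 'UTF-8', requests can guess the encoding from the body bytes through the apparent_encoding property, which avoids relying on the headers at all:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

# apparent_encoding is detected from the response bytes themselves
# (via chardet / charset_normalizer), not taken from the HTTP headers
req.encoding = req.apparent_encoding

sopa = BeautifulSoup(req.text, 'lxml')
print(sopa.find('h1').text)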

2 answers

5

In fact, requests already does the decoding for you, using the correct encoding.

You just need to access the .text attribute of the response object instead of the .content attribute:

In [382]: import requests                                                                                              

In [383]: data = requests.get("https://slashdot.org")                                                                  

In [384]: type(data.content)                                                                                           
Out[384]: bytes

In [385]: type(data.text)                                                                                              
Out[385]: str
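Another option (a sketch of mine, not part of the original answer): pass the raw bytes from .content to BeautifulSoup and let its own detection (Unicode, Dammit) work out the encoding, for example from a <meta charset=...> tag if the page has one:

import requests
from bs4 import BeautifulSoup

data = requests.get("https://slashdot.org")

# Passing bytes lets BeautifulSoup sniff the encoding itself instead of
# trusting whatever the HTTP headers said.
soup = BeautifulSoup(data.content, "lxml")

print(soup.original_encoding)      # the encoding BeautifulSoup settled on
print(soup.find("title").text)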

But understanding encoding, instead of just guessing at what is happening, is pretty much vital in this industry. I never tire of recommending the following article, originally written in 2003 by one of the creators of Stack Overflow:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • Thank you @jsbueno. I made the request -> req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

  • Then BeautifulSoup with text -> soup = BeautifulSoup(req.text, "lxml")

  • But the result still came out with the accents wrong on screen, the same thing -> soup.find("h1") - CÂMARA MUNICIPAL DE SÃO PAULO

  • Although when I copy it here the error does not appear. That's what I pointed out above.

  • Are you using Python 2? Because in Python 3 it should take care of the terminal encoding automatically for you. If it is Python 2 you have two problems: encoding, and the problem of still using Python 2.

  • Ah, now I see that you are on Ubuntu, not Windows - check in your terminal settings whether it is using UTF-8 (requests deals with the legacy Latin-1 encoding (iso-8859-1); from there on you have Unicode). If you tried to change the terminal encoding, you may be confusing Python. There is a quick check sketched after these comments.

  • Thank you very much @jsbueno. Another colleague helped me here: https://stackoverflow.com/questions/56385353/how-to-find-out-the-correct-encoding-when-using-beautifulsoup/56404834#56404834

  • He explained to me that when no explicit charset is present in the HTTP headers and the Content-Type header contains text, RFC 2616 specifies that the default character set should be ISO-8859-1, which is the case with this site.

  • But the content is clearly UTF-8, so I set it manually and it works. I will update the question with the result that worked for me.
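Regarding the terminal check suggested in the comments above, a quick sketch (my addition, assuming a Linux terminal like the Ubuntu one mentioned):

import sys
import locale

# Encoding Python will use when printing to this terminal
print(sys.stdout.encoding)

# Encoding the locale machinery reports as preferred
print(locale.getpreferredencoding())

If either of these is not UTF-8 on a modern Ubuntu, the terminal or locale configuration is probably what is mangling the accents.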


0

If you are referring to the display in a terminal or the Windows command prompt, then to fix it you should set the script's charset declaration in Python, like this:

# -*- coding: latin-1 -*-

import requests
from bs4 import BeautifulSoup

...

That is if the page is in iso-8859-1 or windows-1252; if it is in utf-8, use this instead:

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

...

It seems to me that the page you linked is in utf-16, so I think it would be like this:

# -*- coding: utf-16 -*-

import requests
from bs4 import BeautifulSoup

...
  • Actually this is 100% incorrect. The charset declaration in the source code is for accented characters in the .py file itself, not for data read from an external source, such as a file or, in this case, an HTTP request.

  • @jsbueno but I did not say that the problem was in HTTP, I said something else entirely: it is requests that downloads the data and BeautifulSoup that decodes it and probably already solves this; the problem is that the site is coming in utf-16 (that is what it seemed to me) and the script is in another encoding. Fine, part of the answer may be wrong, but what you said in the comment is certainly not what I said, and I will review the answer here.

  • 100% incorrect, and maybe more so, since you insist on the error. The coding declaration is used by the Python compiler when reading the .py file and generating the bytecode. It does not interfere with how strings are shown in any terminal; it would only change the look of a string typed directly into the code. There is a small demonstration sketched below.
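To illustrate that last point, a small sketch (my addition, not from the thread): the same bytes in a .py file produce a different string literal depending on the coding declaration, and that is all the declaration controls.

import os
import subprocess
import sys
import tempfile

# b'S\xc3\xa3o' is "São" when read as utf-8 and "SÃ£o" when read as latin-1
template = b"# -*- coding: %s -*-\nprint(ascii('S\xc3\xa3o'))\n"

for declaration in (b"utf-8", b"latin-1"):
    with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
        f.write(template % declaration)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True)
    print(declaration.decode(), "->", result.stdout.strip())
    os.remove(path)

# utf-8 -> 'S\xe3o'
# latin-1 -> 'S\xc3\xa3o'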
