In Python 3, with beautifulsoup4 and requests, I want to extract some information from a site whose encoding is reported as 'ISO-8859-1'. I tried this approach to display the text correctly:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding
encoding = req.encoding
text = req.content
decoded_text = text.decode(encoding)
sopa = BeautifulSoup(decoded_text, "lxml")
sopa.find("h1")
And the result that appears is:
<h1>
CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>
When I copy and paste the output here it appears correct, but on my computer all the accents are wrong.
I'm on a machine running Ubuntu.
Does anyone know the correct way to read the encoding?
Edited June 2, 2019
I got help from @snakecharmerb here.
In his answer he explained that when no explicit charset is present in the HTTP headers and the Content-Type header contains text, RFC 2616 specifies that the default character set is ISO-8859-1. That is the case with this site.
But the text is clearly UTF-8, so I set the encoding manually and it worked. My code ended up like this:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding
'ISO-8859-1'
req.headers['content-type']
'text/html'
req.encoding = 'UTF-8'
sopa = BeautifulSoup(req.text,'lxml')
sopa.find('h1').text
'\r\n CÂMARA MUNICIPAL DE SÃO PAULO'
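For anyone hitting the same thing, the manual fix can be generalised a little. This is only a sketch, not what requests does internally: `charset_from_content_type` and `decode_body` are hypothetical helper names, and the rule "trust a declared charset, otherwise try UTF-8 and fall back to ISO-8859-1" is my own assumption about a reasonable default.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset declared in a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset (e.g. plain 'text/html').
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: Optional[str]) -> str:
    """Decode raw response bytes without blindly assuming ISO-8859-1."""
    declared = charset_from_content_type(content_type)
    if declared:
        # The server stated a charset explicitly, so use it.
        return body.decode(declared, errors="replace")
    try:
        # Assumption: most pages without a declared charset are UTF-8.
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the RFC 2616 default.
        return body.decode("iso-8859-1")
```

With requests this would be called as `decode_body(req.content, req.headers.get("content-type"))`, and the resulting string passed to BeautifulSoup in place of `req.text`.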
Thank you @jsbueno. I made the request -> req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
– Reinaldo Chaves
Then BeautifulSoup with the text -> soup = BeautifulSoup(req.text, "lxml")
– Reinaldo Chaves
But the result on screen had the same accent errors -> soup.find("h1") - CÂMARA MUNICIPAL DE SÃO PAULO
– Reinaldo Chaves
Although when I paste it here the error does not appear. That's what I pointed out above.
– Reinaldo Chaves
Are you using Python 2? I ask because Python 3 should take care of the terminal encoding automatically for you. If it is Python 2 you have two problems: the encoding, and the fact that you are using Python 2.
– jsbueno
Ah, now I see that you are on Ubuntu, not Windows. Check in your terminal settings whether it is using UTF-8. requests handles the legacy Latin-1 encoding (ISO-8859-1), and from there on you have Unicode. If you tried to change the terminal encoding, you may be confusing Python.
– jsbueno
Thank you very much @jsbueno. Another colleague helped me here: https://stackoverflow.com/questions/56385353/how-to-find-out-the-correct-encoding-when-using-beautifulsoup/56404834#56404834
– Reinaldo Chaves
He explained to me that when no explicit charset is present in the HTTP headers and the Content-Type header contains text, RFC 2616 specifies that the default character set is ISO-8859-1. That is the case with this site.
– Reinaldo Chaves
But the text is clearly UTF-8, so I set the encoding manually and it works. I will update the question above with the result that worked for me.
– Reinaldo Chaves
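To tell the two suspects in this thread apart (the terminal encoding versus the decoding step), something like the following may help. It is a minimal sketch that needs no network access: the title string is typed in by hand as an assumption about what the page contains, and the variable names are only illustrative. It reproduces what happens when UTF-8 bytes are decoded as ISO-8859-1, then prints the encodings Python is using for output.

```python
import locale
import sys

# Assumed page title, typed in by hand for the demonstration.
original = "CÂMARA MUNICIPAL DE SÃO PAULO"

# UTF-8 bytes wrongly decoded as ISO-8859-1: each accented letter
# becomes 'Ã' followed by an invisible C1 control character.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reversing the wrong step recovers the text, which shows the bytes
# themselves were fine and only the decoding was wrong.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

# ascii() escapes the non-ASCII characters so they can be inspected
# on any terminal, whatever its encoding.
print(ascii(garbled))
print(repaired == original)

# What Python believes the terminal and the locale are using.
print(sys.stdout.encoding)
print(locale.getpreferredencoding(False))
```

If the last two lines report UTF-8, the terminal is probably not the culprit and the decoding step is the place to look. The invisible control characters would also explain why the broken output looks almost right, with only a stray 'Ã', when it is pasted elsewhere.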