Error with "utf-8" in Python 3

Viewed 767 times

3

I have a problem in Python 3 with the following code:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import urllib.request

page = urllib.request.urlopen("http://beans-r-us.biz/prices.html")

text = page.read().decode('utf8')

print(text)

It gives the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 1265: invalid continuation byte

and I don't know how to fix it.

Note: I am still a beginner in programming. This code is from the book "Head First Programming", and its goal is to "show" the site.

2 answers

1


The error happens in the following line:

text = page.read().decode('utf8')

It attempts to decode the page from that site as UTF-8, but fails when it reaches a byte sequence that is not valid UTF-8. The content of the page is as follows:

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=shift_jis"><meta http-equiv="Content-Language" content="ja,en"><script type="text/javascript">\r\n\r\n  var _gaq = _gaq || [];\r\n  _gaq.push([\'_setAccount\', \'UA-20569835-2\']);\r\n  _gaq.push([\'_trackPageview\']);\r\n\r\n  (function() {\r\n    var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\n    ga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\n    var s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n  })();\r\n\r\n</script><title>404 Not Found</title></head><body oncontextmenu="return false;" style="width: 100% !important; height: 2600px !important;">\r\n<center><a href="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=1"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=1&asz=0&atp=2&lnk=6666ff&bg=&txt=000000&pbb=1"></a></center>\r\n<center><a href="http://cgi.i-mobile.co.jp/ad_link.aspx?guid=on&asid=32341&pnm=0&asn=2"><img border="0" src="http://cgi.i-mobile.co.jp/ad_img.aspx?guid=on&asid=32341&pnm=0&asn=2&asz=0&atp=2&lnk=6666ff&bg=&txt=000000"></a></center>\r\n\r\n\r\n<center><FONT SIZE="2">ミンナ�ホが選んだ�ゥ11/07のランキング�ソ</FONT></center>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<br>\r\n<center><FONT SIZE="2">�ソ ��位 �ソ</FONT></center>\r\n\r\n<a name="madop"></a>\r\n<br>\r\n<center><font size="2">他のキーワードで探してみる</FONT></center><center>\r\n<form method="get" action="/genre23.php">\r\n<font size="2"><input type="text" name="query2" value="" size="8"><font size="4">\r\n<SELECT name="genre">\r\n<OPTION value="3">��</OPTION>\r\n\r\n</SELECT>\r\n</FONT><input type="submit" value=" 探す�マ "></FONT>\r\n<input type="hidden" name="cache" value=""><input type="hidden" name="fname" value="">\r\n</form>\r\n</center><br>\r\n<center><font size="2" color="red"><b><a 
href="/inq/disclaimer.php?ngdom=beans-r-us.biz&ngk=retire%20your%20vehicle">利用規約・削除依頼</a></b></FONT></center>\r\n<br></body></html>'

As you can see, the page contains Japanese text, and its meta tag declares charset=shift_jis rather than UTF-8. The decoder most likely failed on some of these characters.
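To illustrate the mismatch, here is a minimal sketch that does not touch the network. It assumes a short sample string (taken from the page text above) encoded as Shift_JIS, and shows that the same bytes are rejected by the UTF-8 codec but decode cleanly with the matching one. The exact error message will differ from the one in the question, since the bytes are not the same.

```python
# Sample text encoded the way the page declares itself (Shift_JIS).
# The string is only an illustration, not the real page content.
raw = "ランキング".encode("shift_jis")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # Shift_JIS byte sequences are generally not valid UTF-8.
    utf8_ok = False
    print("UTF-8 decoding failed:", exc)

# Decoding with the codec that matches the bytes round-trips cleanly.
text = raw.decode("shift_jis")
print(text)
```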

  • So how do I make it work?

  • If you just want the script to run, you can try removing the .decode('utf8') call.

  • 1

    It worked, thank you
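Removing .decode('utf8') avoids the error because print() then shows the raw bytes object (the b'...' form with \r\n escapes) and never decodes anything. If readable text is wanted instead, one possible approach is to decode with the charset the server declares and replace any bytes that still cannot be decoded. The sketch below assumes two hypothetical helper names, decode_page and fetch_text, which are not from the book. Note that get_content_charset() only reads the HTTP Content-Type header; it returns None when the charset is given only in an HTML meta tag, as it appears to be for this page.

```python
import urllib.request


def decode_page(raw, declared_charset=None, fallback="utf-8"):
    """Decode raw bytes, substituting U+FFFD for undecodable bytes.

    Hypothetical helper for illustration only.
    """
    charset = declared_charset or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python: use the fallback.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Fetch a URL and decode it with the charset from the HTTP header, if any."""
    with urllib.request.urlopen(url) as page:
        raw = page.read()
        declared = page.headers.get_content_charset()  # None if not declared
    return decode_page(raw, declared)
```

For example, fetch_text("http://beans-r-us.biz/prices.html") would no longer raise UnicodeDecodeError, although characters may appear as replacement marks if the guessed charset is wrong.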

-3

I ran the same example without having to remove .decode("utf-8"), and apart from that it gave the same result. The book's example puts .html at the end of the site address; remove it and you will get the same result.

PS: I am using Python 3.
