What happened to Unicode in Python 3?

3

I'm gradually starting to use Python 3. I ran some code that I used to run under Python 2.7 and got the following error:

NameError: name 'unicode' is not defined

So I understand that unicode does not exist in Python 3. What should I use instead?

  • Can you post the specific code snippet? In the meantime, see https://docs.python.org/3/howto/unicode.html

  • No need for code. Any call to unicode(minha_string) produces the error described.

2 answers

4


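According to Martijn Pieters's answer on Stack Overflow (SO), the unicode type was renamed to str, which is more intuitive, and the old str was renamed to bytes. He also gives code for handling a value when you don't know whether it is already text or still encoded bytes (encoding here is the name of the encoding you expect, such as "utf-8"):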

if isinstance(unicode_or_str, str):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True
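
The snippet above leaves encoding undefined, so it does not run on its own. A minimal self-contained sketch follows, assuming UTF-8 as the default; the helper name to_text is hypothetical and does not come from the original answer:

```python
def to_text(unicode_or_str, encoding="utf-8"):
    """Return (text, decoded) for a value that may be str or bytes.

    The name `to_text` and the UTF-8 default are illustrative assumptions.
    """
    if isinstance(unicode_or_str, str):
        # Already text in Python 3: nothing to decode.
        return unicode_or_str, False
    # Otherwise assume a bytes-like object and decode it explicitly.
    return unicode_or_str.decode(encoding), True


print(to_text("maçã"))
print(to_text("maçã".encode("utf-8")))
```

If the bytes are not valid in the given encoding, decode raises UnicodeDecodeError, so the caller still has to know (or guess) the right encoding.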

I put it on GitHub for future reference.

1

This is actually the main change from Python 2 to Python 3, and it is basically the reason they opted for a transition that breaks compatibility. Every string in Python 3 is now "text", with no automatic 1:1 mapping to byte values. In practice, the str class now behaves exactly as the unicode class did in Python 2.
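
As a rough illustration of "no automatic mapping", the Python 3 sketch below shows that text and bytes are never mixed implicitly; the variable names are only for this example:

```python
text = "maçã"
data = text.encode("utf-8")  # explicit conversion: str -> bytes

# Text and bytes never compare equal, even for pure ASCII content.
ascii_equal = ("abc" == b"abc")

# Concatenating them raises TypeError instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(ascii_equal, mixed_ok)
```

In Python 2 the same concatenation would have attempted an implicit ASCII decode, which is where many UnicodeDecodeError surprises came from.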

Starting with Python 3.3, the u prefix for strings was reintroduced to make it easier to write programs that run under both Python 2 and Python 3. In Python 2, this prefix meant the string was unicode rather than str. In Python 3 it does absolutely nothing: the string is Unicode text either way. Ex.: u"maçã", b"nao pode ter acentos" (a bytes literal cannot contain accented characters).
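
A small Python 3 check of both claims, written as a sketch: the u prefix changes nothing, and a bytes literal with accented characters is rejected. compile() is used here only so the SyntaxError can be observed without stopping the script:

```python
plain = "maçã"
prefixed = u"maçã"

# Same type (str) and same value, with or without the prefix.
same_type = type(plain) is type(prefixed)
same_value = (plain == prefixed)

# A bytes literal may only contain ASCII characters.
try:
    compile('b"maçã"', "<example>", "eval")
    non_ascii_bytes_ok = True
except SyntaxError:
    non_ascii_bytes_ok = False

print(same_type, same_value, non_ascii_bytes_ok)
```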

The b prefix, on the other hand, did nothing in Python 2, but in Python 3 it marks a literal of type bytes. Bytes objects are meant for APIs that really expect byte values, that is, text already encoded under some convention for accented characters (the so-called "encodings", e.g. latin1, utf-8, cp852). The most visible difference is indexing: in Python 2, retrieving a single element of a str/bytes object gives a str/bytes object of length 1. In Python 3, you get a number between 0 and 255, as with char pointer strings in C:

Python 2:


Python 2.7.17 (default, Nov  7 2019, 10:07:09) 
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b"apple"
>>> a[0]
'a'
>>> b = "maçã"
>>> len(b)
6
>>> b = u"maçã"
>>> len(b)
4

Python 3:

Python 3.8.0+ (heads/3.8:d04661f, Oct 24 2019, 09:19:45)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b"apple"
>>> a[0]
97
>>> b = "maçã"
>>> len(b)
4
>>> b = u"maçã"
>>> len(b)
4
>>> b[3]
'ã'
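
The length of 6 seen in the Python 2 session is the number of UTF-8 bytes, which assumes a UTF-8 terminal. A sketch that reproduces the same numbers in Python 3 by encoding explicitly:

```python
text = "maçã"
data = text.encode("utf-8")  # b'ma\xc3\xa7\xc3\xa3'

n_chars = len(text)  # code points: m, a, ç, ã
n_bytes = len(data)  # ç and ã take two bytes each in UTF-8

first_item = data[0]     # indexing bytes gives an int
first_slice = data[0:1]  # slicing gives a bytes object of length 1

print(n_chars, n_bytes, first_item, first_slice)
```

When Python 2 style single-byte values are needed, slicing (data[0:1]) is a common way to get them.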


For those who are not yet familiar with "Unicode" and "text as bytes", I suggest reading an article written in 2003 by Joel Spolsky (co-founder of Stack Overflow): https://www.scribd.com/document/3181016/Programacao-Joel-on-Unicode
