Does the utf-8 encoding declaration allow accents?

13

If we put

# encoding: utf-8

on the first line of a Python program, can we use accented characters throughout the code?

2 answers

20


In fact, it all depends on the configuration of your text editor. Most text editors save files in UTF-8 by default (and, at least for Portuguese speakers, the editors that do not usually save them in ISO-8859-1).

Why does it matter?

To summarize a very complicated story, which began back in the days of the first telegraphs: the character codes of the Latin alphabet were standardized quite early (ASCII was standardized in the 1960s), but the "special" characters (cedillas, accents) were "standardized" by each country (or group of countries) separately. Western Europe (and therefore the Portuguese-speaking, Spanish-speaking, ... countries) converged on the ISO 8859-1 standard.

The problem is that this standard contains no characters from the Greek, Cyrillic, ... alphabets, so it is impossible to have a document in this encoding that, for example, mixes Portuguese with Greek (and the situation gets even worse when you include Japanese, Chinese, ...).

The invention of Unicode

To unify these encodings and allow multilingual texts (and to avoid ambiguities when exchanging texts between computers with different encodings), Unicode was invented. Its purpose is to assign a distinct code to every character in every language of the world.

Unicode text can be encoded in many different ways. Internally, .NET and Java use UTF-16; Python 3 (CPython 3.3 and later) picks a representation with 1, 2 or 4 bytes per character (Latin-1, UCS-2 or UCS-4), depending on the characters in the text you are processing.

Still, UTF-8 is the most popular encoding for text files (e.g. Python source files).

Why this line is needed

Since a byte can only hold 256 distinct values, and the world's languages together have far more than 256 characters, UTF-8 needs more than one byte to represent some characters. In general, accented characters, as in the Portuguese word "bênção" ("blessing"), take 2 bytes in UTF-8 (as opposed to only 1 in ISO 8859-1):

             b |    ê   |  n |    ç   |    ã   |  o
ISO 8859-1: 62 |   EA   | 6E |   E7   |   E3   | 6F
     UTF-8: 62 | C3  AA | 6E | C3  A7 | C3  A3 | 6F
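The byte values in the table above can be checked directly. This is a minimal sketch, assuming Python 3 (where string literals are Unicode text); the helper name hex_bytes is made up for this illustration:

```python
def hex_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex pairs."""
    return " ".join("{:02X}".format(b) for b in text.encode(encoding))

word = "bênção"

latin1_hex = hex_bytes(word, "iso-8859-1")  # one byte per character
utf8_hex = hex_bytes(word, "utf-8")         # two bytes for each accented character

print(latin1_hex)  # 62 EA 6E E7 E3 6F
print(utf8_hex)    # 62 C3 AA 6E C3 A7 C3 A3 6F
```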

This is a problem when you try to read text written in one encoding as if it were in another: if the text was written as UTF-8 but read as ISO 8859-1, it appears as "bÃªnÃ§Ã£o"; in the opposite case it appears as "b�n��o" (or, in the case of Python, it causes a UnicodeDecodeError).
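Both failure modes are easy to reproduce. The sketch below assumes Python 3 and uses the same word as the table; the variable names are only illustrative:

```python
word = "bênção"

# Written as UTF-8, read as ISO 8859-1: every byte is accepted, but each
# two-byte sequence turns into two unrelated characters.
utf8_read_as_latin1 = word.encode("utf-8").decode("iso-8859-1")

# Written as ISO 8859-1, read as UTF-8: the bytes are not valid UTF-8,
# so a strict decode raises UnicodeDecodeError.
latin1_bytes = word.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# With errors="replace", each invalid byte becomes U+FFFD instead of an error.
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# ascii() keeps the output printable on any terminal.
print(ascii(utf8_read_as_latin1))
print(strict_decode_failed)
print(ascii(latin1_read_as_utf8))
```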

Python 2 treats this line as a special case: it detects its presence and uses it to determine the file's encoding. In the absence of this line, Python 2 falls back to a more conservative mode and accepts only ASCII characters (no accents), raising an error if it finds any "weird" character (the details of this mechanism are described in PEP 263, which proposed the change).
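On Python 3, the standard library function tokenize.detect_encoding reads this declaration, which makes it possible to see what would be inferred from a given first line. This is only a small sketch: it runs on Python 3, where a file without the declaration defaults to UTF-8 rather than ASCII as in Python 2, and the helper name declared_encoding is made up for this example:

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

with_declaration = declared_encoding(b"# encoding: iso-8859-1\nx = 1\n")
without_declaration = declared_encoding(b"x = 1\n")

print(with_declaration)     # iso-8859-1
print(without_declaration)  # utf-8 (the Python 3 default)
```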

The bottom line

If you want to use accents in your Python 2 file, put one of the following three lines at the top of your files:

# encoding: utf-8
# encoding: iso-8859-1
# encoding: cp1252

These are listed in roughly descending order of how likely your editor is to use them.

You can also migrate to Python 3, where the code below is perfectly legal...

fmoreira@saucer tmp $ cat encoding.py 
π = 3.14159265359
半径 = 2.5
área = π * 半径 ** 2
print('مساحة = {}'.format(área))

fmoreira@saucer tmp $ python3 encoding.py 
مساحة = 19.6349540849375

...but I obviously do not recommend this technique.

  • Excellent answer, congratulations.

  • This answer is pretty good, but it skips a few issues essential to the question: in particular, many people think that putting the "encoding" declaration at the beginning of the file is magic that will make all of the program's encoding problems go away.

  • To see the history of Unicode in more detail and understand what it is, it is essential to read this article (the title is not a joke: it covers the minimum one has to know for any program that uses accented text): http://local.joelonsoftware.com/wiki/O_M%C3%ADnimo_Absoluto_Que_Todos_os_Programadores_de_Software_Precisam,_Absolutamente,_Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres_%28Sem_Desculpas!%29

  • About the answer, I have a caveat: rather than trying other encodings until you find one "compatible with your editor", as suggested in the last part of the answer, it is important to use utf-8 and configure your editor to use utf-8 (and not the other way around). utf-8 allows special characters from multiple languages in the same file and is the default on Mac OS and Linux (and therefore on almost every server where your web application will run, and on Android); it is just not the default on Windows.

  • I considered putting this in the answer, but I tried to keep the scope small: a program that contains only ASCII characters can still choke on Unicode, for example. My intention is only to deal with the case where, for example, you want to put a comment in Portuguese in code that, without the comments, would be pure ASCII.

  • Do you know how to do this in the most popular editors for Windows? I mostly work on Mac and Linux; that is why I refrained from bringing it up. If you do, post a separate answer.

  • I think "configure the editor" has to stay as homework for anyone who wants to program. :-) Each editor will have this setting somewhere reasonably obvious, such as "edit->preferences->editor->encoding" or "tools->encoding". At best, it could be a question for superuser.com. In any case, I have posted an answer below that talks a little about the Unicode concepts you did not address there.

11

The encoding declaration line

#encoding: utf-8

allows the Python parser to understand the accents in the source code; that is, an accented character is no longer a "syntax error" in Python 2. Other encodings, used by default on Windows, are more limited than utf-8 in that they allow only 256 distinct characters, so it is important to put this line in and configure your editor to use utf-8.

But this is not enough to use accents freely in a Python 2.x program. A major change that was implemented in the mid-2000s, and that many people have still not taken in, is that TEXT data in Python 2 has to be of type "unicode", not of type "str". In Python 3, the "str" type already has an internal Unicode representation.

The biggest difference between the two is that in a byte string (the plain str of Python 2) each element of the sequence corresponds to a byte, whereas in text (the unicode type) each element always corresponds to a character.

Try the following experiment (it can be done in the terminal, if it is set to utf-8):

>>> a = "maçã"
>>> for letra in a: print letra,
... 
m a � � � �
>>> a = u"maçã"
>>> for letra in a: print letra,
... 
m a ç ã
>>> 
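For readers on Python 3, the same contrast shows up between bytes and str. A rough equivalent of the experiment, assuming the text is encoded as UTF-8:

```python
text = "maçã"                  # str: a sequence of characters
data = text.encode("utf-8")    # bytes: a sequence of bytes

chars = list(text)             # one element per character
byte_values = list(data)       # one integer (0-255) per byte

print(len(chars), len(byte_values))  # 4 6
```

The two accented letters account for the difference: each one is a single character but two bytes.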

What happens is that the default encoding on English-language Windows ("latin1") always uses one byte per character, so you do not notice this there. But even with that kind of encoding you will have a problem if you try to capitalize an accented string. For example:

>>> a=  "maçã".upper()
>>> print a
MAçã
>>> a= u"maçã".upper()
>>> print a
MAÇÃ
>>> 
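The same effect can be reproduced on Python 3 with an explicit byte string. This is a sketch assuming the bytes are latin1-encoded: bytes.upper() only touches ASCII letters, while str.upper() also handles accented characters.

```python
text = "maçã"
latin1_bytes = text.encode("latin-1")  # b"ma\xe7\xe3", one byte per character

upper_bytes = latin1_bytes.upper()     # only the ASCII letters "m" and "a" change
upper_text = text.upper()              # all four letters change

print(upper_bytes)        # b'MA\xe7\xe3'
print(ascii(upper_text))  # 'MA\xc7\xc3', i.e. the text "MAÇÃ"
```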

The recommendation is to get a good understanding of what Unicode is and what encodings are by reading the article http://local.joelonsoftware.com/wiki/O_M%C3%ADnimo_Absoluto_Que_Todos_os_Programadores_de_Software_Precisam,_Absolutamente,_Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres_%28Sem_Desculpas!%29 , to always use utf-8 in your programs, and to always use the technique we call "the Unicode sandwich":

When text reaches your program from some external source (a file, user input, a database, a sensor), it arrives as bytes in some encoding:

  • you decode this text to Unicode (with the "decode" method)
  • you work with the text in your Python program
  • you encode it back into the encoding used by the data output (terminal, file, database, printer, etc.) with the "encode" method.
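The three steps of the sandwich can be sketched as follows. This assumes Python 3, and both the function name shout and the choice of encodings are hypothetical, picked only to make the steps visible:

```python
def shout(raw_input, input_encoding, output_encoding):
    """Decode on the way in, work on text, encode on the way out."""
    text = raw_input.decode(input_encoding)    # bytes -> text
    processed = text.upper()                   # work only with text in the middle
    return processed.encode(output_encoding)   # text -> bytes

# Pretend these bytes came from a latin1 file and must go to a utf-8 terminal.
incoming = "maçã".encode("iso-8859-1")
outgoing = shout(incoming, "iso-8859-1", "utf-8")

print(outgoing)  # b'MA\xc3\x87\xc3\x83'
```

Keeping the decode and encode calls at the edges means the processing in the middle never has to know which encoding the outside world uses.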

Python 3 and some libraries, even ones used with Python 2, already do the encoding/decoding step transparently for you. But it is still vital to understand what is happening.
