Encoding problem in Python

Question

Asked 6 years, 7 months ago

Viewed 1,486 times

At one point in my code I receive a variable var of type str containing SENTEN\u00c7A.

When I do:

var2 = var.encode()
print(var2)

it prints b'SENTEN\\u00c7A'

The original word would be 'SENTENÇA'

Running print('SENTENÇA'.encode()) gives me b'SENTEN\xc3\x87A'

Running print(var == 'SENTENÇA') gives me False

How can I convert my variable var so that it equals 'SENTENÇA'? This variable comes from another program, and other words can come too, so how do I do this conversion generically?

1 answer

by jsbueno • **30,668** points · Answer 1 · 2018-12-18T18:45:11+00:00

In short:

Your string was "escaped twice". It has to be read as if it were bytes and, from there, decoded with the codec "unicode_escape". Just do:

var2 = var.encode("latin1").decode("unicode_escape")

Explanation

At some point your original string var went through a "double encoding" process. In that process the Unicode character "Ç", which has code point 199 (hexadecimal 0xC7), was replaced by its escape sequence "\u00c7", and those six characters were "transplanted" literally into the string. Normally this representation, "\u00c7", is used only as a way to display more complicated characters when you see the repr form of a string, or to place a special character by its code directly in a string literal. The clue that this happened is that when you print the bytes of the string, the backslash "\" is printed twice. Python does this to indicate the presence of a "physical" backslash character, meaning the backslash is not just a marker that modifies the next character of the printed sequence.

For example, rei_preto = "\u265a" is the character for a black chess king. Written this way, the string contains only that one special character, not the 6-character string "\u265a". See at the IPython prompt:

In [107]: rei = "\u265a"                                                                                 

In [108]: print(rei)                                                                                     
♚

In [109]: len(rei)                                                                                       
Out[109]: 1
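To contrast that with the situation in the question, here is a small sketch comparing a string where the escape was resolved with one where the backslash is a real character. The variable names real and escaped are only illustrative and do not appear in the question.

```python
# Illustrative values mirroring the question; the names are hypothetical.
real = "SENTEN\u00c7A"      # escape resolved by the parser: 8 characters
escaped = "SENTEN\\u00c7A"  # a literal backslash followed by "u00c7A": 13 characters

print(len(real), len(escaped))   # 8 13
print(repr(escaped))             # 'SENTEN\\u00c7A' (repr doubles the real backslash)
print(ord("Ç"), hex(ord("Ç")))   # 199 0xc7
```

The doubled backslash in the repr output is the same clue visible in the b'SENTEN\\u00c7A' printed in the question.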

So, as I explained above, something in your process applied the "unicode_escape" procedure twice to your text before it arrived in the variable "var". The remedy is to transform your text transparently into a sequence of bytes, i.e. each character of the string "SENTEN\u00c7A" is passed without any transformation into a Python 3 bytes object. This is done with the codec "latin1": all code points from 0 to 255 map one-to-one to their byte values in the Latin-1 charset (this includes the entire ASCII table and the most common accented characters, including those used in Portuguese). The second step is to decode this byte sequence using the special codec "unicode_escape", which finds occurrences of escape markup such as \xFF and \uAAAA (and others) used by Python and translates them into the corresponding characters.

That is to say:

In [128]: b = "SENTEN\\u00c7A"                                                                           

In [129]: c = b.encode("latin1")                                                                         

In [130]: c.decode("unicode_escape")                                                                     
Out[130]: 'SENTENÇA'
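One caveat: encode("latin1") raises UnicodeEncodeError if the text already contains a character above U+00FF. The sketch below is one possible variation, not part of the original recipe: it encodes to ASCII with the "backslashreplace" error handler, so such characters are turned into escape sequences and restored by the same decode step. The helper name unescape is hypothetical.

```python
def unescape(text: str) -> str:
    """Resolve literal backslash escapes left inside a str (hypothetical helper)."""
    # "backslashreplace" converts every non-ASCII character into an escape
    # sequence instead of raising, so the decode step restores it afterwards.
    return text.encode("ascii", "backslashreplace").decode("unicode_escape")


var = "SENTEN\\u00c7A"          # literal backslash, as received in the question
print(unescape(var))            # SENTENÇA
print(unescape(var) == "SENTENÇA")  # True
```

Like the latin1 version, this also interprets any other backslash sequence in the text (for example a literal backslash followed by n), so it is only appropriate when every backslash in the input is known to be an escape marker.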

Update: while I was answering, you updated the question and described how you are reading this data, with the line:

arquivo = json.loads(sys.argv[2].replace("\\", '\\\\'))

As you can see, this is what causes the error: replacing each "\" in the input string with two makes two backslashes exist, which Python interprets as a "physical backslash" and not as an "escape indicator". If you just take that replace out, the code snippet will probably work.
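A minimal sketch of the difference, using a hypothetical JSON text in place of sys.argv[2] (the key "tipo" is made up for the example):

```python
import json

# Stand-in for sys.argv[2]: the text holds a single backslash before "u00c7".
raw = '{"tipo": "SENTEN\\u00c7A"}'

with_replace = json.loads(raw.replace("\\", "\\\\"))  # what the question does
without_replace = json.loads(raw)                      # replace removed

print(with_replace["tipo"])     # SENTEN\u00c7A  (escape left unresolved)
print(without_replace["tipo"])  # SENTENÇA
```

Without the replace, the JSON parser itself resolves the \u00c7 escape, so no further decoding is needed.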

The way you are passing data to the Python script, however, is by no means reliable, and you should use another mechanism. You are passing a JSON object through the shell, and the shell treats ALL of JSON's delimiter characters ([, {, ") and whitespace itself in a special way. The chance of something going wrong is about 300% (as it already has). A person with solid knowledge of shell quoting and escaping could write code that does this; I consider myself a person with solid knowledge of Unicode, but the transformations the shell applies to these characters are beyond my reach.

The best approach is to write your data to a temporary file from within PHP and pass only the file name to the Python script; then "json.load" can read the entire file at once.
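The Python side of that approach could look roughly like the sketch below. It is only an illustration: the helper load_payload and the key "tipo" are hypothetical, and the temporary file is written from Python here merely to stand in for what the PHP side would do.

```python
import json
import os
import tempfile


def load_payload(path: str) -> dict:
    """Read a whole JSON document from a file (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Stand-in for the PHP side: write the JSON text to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"tipo": "SENTEN\\u00c7A"}')
    path = tmp.name

try:
    # In the real script the path would arrive as a command-line argument.
    arquivo = load_payload(path)
    print(arquivo["tipo"])  # SENTENÇA
finally:
    os.remove(path)  # clean up the temporary file
```

Since only a file name crosses the shell, none of the JSON delimiters or backslashes are exposed to shell quoting rules.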

A better architecture might be to use a local Redis server: you write your data there from PHP and read it from Redis in the Python process. This would allow whatever you are doing in Python to run as a continuous service, rather than as a new process started via the shell on every page view (which is typically when PHP will need the Python service).