Encoding problem in Python

At one point in my code I receive a variable var of type str containing SENTEN\u00c7A.

When I run

var2 = var.encode()
print(var2)

it prints b'SENTEN\\u00c7A'

The original word would be 'SENTENÇA'

Running print('SENTENÇA'.encode()) gives me b'SENTEN\xc3\x87A'

Running print(var == 'SENTENÇA') gives me False

How can I convert my variable var so that it equals 'SENTENÇA'? This variable comes from another program, and other words can arrive too, so how do I do this conversion generically?

1 answer

In short:

Your string was "escaped twice". It has to be read as if it were bytes and, from there, decoded with the codec "unicode_escape". Just do:

var2 = var.encode("latin1").decode("unicode_escape")

Explanation

At some point your original string var went through a "double encoding" process. In that process the Unicode character "Ç", which has code point 199 (hexadecimal 0xC7), was replaced by its escape sequence "\u00c7", and those six characters were "transplanted" literally into the string. Normally the representation "\u00c7" is used only to display more complicated characters when you look at the repr form of a string, or to place special characters in a string literal by their code. The clue that this happened is that when you print the bytes of the string, the backslash "\" is printed twice. Python does this to indicate that a "physical" backslash character is present, and that the backslash is not merely a marker modifying the next character of the printed sequence.

For example, rei_preto = "\u265a" is the character for a black chess king. Written normally like this, the string contains only that one special character, not the six separate characters \, u, 2, 6, 5, a. See at the IPython prompt:

In [107]: rei = "\u265a"                                                                                 

In [108]: print(rei)                                                                                     
♚

In [109]: len(rei)                                                                                       
Out[109]: 1
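For contrast, here is a minimal sketch of the two cases side by side. The variable names real and literal are only illustrative; the second string is what a doubly escaped value looks like from inside Python.

```python
# A real escape in a literal yields one character;
# an escaped backslash yields six separate characters.
real = "\u00c7"       # the single character Ç
literal = "\\u00c7"   # backslash, u, 0, 0, c, 7

print(len(real), len(literal))   # 1 6
print(repr(literal))             # the repr shows the backslash doubled
print(literal.encode())          # so does the bytes form: b'\\u00c7'
```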

So, as I explained above, something in your process applied the "unicode_escape" procedure twice to your text before it arrived in the variable var. The remedy is to first transform your text transparently into a sequence of bytes, that is, each character of the string "SENTEN\u00c7A" is passed without any transformation into a Python 3 bytes object. This is done with the codec "latin1": every code from 0 to 255 maps one-to-one between its representation in text and its representation in the Latin-1 charset (this includes the entire ASCII table and the most common accented characters, including those used in Portuguese). The second step is to decode this byte sequence using the special codec "unicode_escape": this codec finds occurrences of markup such as \xFF and \uAAAA (and others) used by Python, and translates them to the corresponding characters.

That is to say:

In [128]: b = "SENTEN\\u00c7A"                                                                           

In [129]: c = b.encode("latin1")                                                                         

In [130]: c.decode("unicode_escape")                                                                     
Out[130]: 'SENTENÇA'
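Since other words can arrive too, the two steps can be wrapped in a small helper. This is only a sketch: the name unescape_unicode is hypothetical, and it assumes the input contains nothing above U+00FF, because the latin1 step raises UnicodeEncodeError for such characters.

```python
def unescape_unicode(text):
    # latin1 maps code points 0-255 to the same byte values, so ASCII
    # text and the backslash sequences pass through unchanged; the
    # unicode_escape codec then resolves \uXXXX, \xXX and similar escapes.
    return text.encode("latin1").decode("unicode_escape")


for word in ["SENTEN\\u00c7A", "A\\u00c7\\u00c3O", "plain"]:
    print(unescape_unicode(word))
```

Text without any escape sequence comes back unchanged, so the helper can be applied to every incoming word.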

Update: While I was answering, you updated the question and described how you are reading this data, with the line:

arquivo = json.loads(sys.argv[2].replace("\\", '\\\\'))

As you can see, this is what causes the error: replacing each "\" in the input string with two produces a doubled backslash, which Python interprets as a "physical backslash" and not as an "escape indicator". If you just take that replace out, the code snippet will probably work.
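The effect of that replace can be reproduced in isolation. In this sketch the raw string and the key "palavra" are made-up stand-ins for whatever actually arrives in sys.argv[2].

```python
import json

# Hypothetical payload standing in for sys.argv[2].
raw = r'{"palavra": "SENTEN\u00c7A"}'

# Without the replace, the JSON parser resolves the \u00c7 escape itself.
good = json.loads(raw)["palavra"]

# With the replace, the parser sees an escaped backslash followed by the
# plain text u00c7, so the six characters stay in the string.
bad = json.loads(raw.replace("\\", "\\\\"))["palavra"]

print(good, len(good))   # SENTENÇA 8
print(bad, len(bad))     # SENTEN\u00c7A 13
```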

However, the way you are passing data to the Python script is by no means reliable, and you should use another mechanism for this. You are passing a JSON object through the shell, and the shell treats ALL the JSON delimiter characters [, {, " (in addition to white space itself) in a special way. The chance of something going wrong is about 300% (as it already did). A person with solid knowledge of the shell and of escaping could write code that does this. I consider myself a person with solid knowledge of Unicode, but the transformations the shell applies to these characters are beyond my reach.

The best approach is to write your data to a temporary file from within PHP and pass only the file name to the Python script. Then "json.load" can read the entire file at once.
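A minimal sketch of the Python side of that approach follows. The helper name load_payload and the key "palavra" are hypothetical, and the temporary file written here only stands in for the one PHP would create; in the real script the path would come from sys.argv.

```python
import json
import os
import tempfile


def load_payload(path):
    # Read the whole JSON document from a UTF-8 encoded file.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in for the PHP side: write the JSON text to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", encoding="utf-8", delete=False
) as tmp:
    tmp.write('{"palavra": "SENTEN\\u00c7A"}')
    temp_path = tmp.name

arquivo = load_payload(temp_path)  # real script: load_payload(sys.argv[1])
os.remove(temp_path)
print(arquivo["palavra"])
```

Because only a file name crosses the shell, none of the JSON delimiters or backslashes are exposed to shell quoting.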

A better architecture might be to use a local Redis server: you write your data there from PHP and read it from Redis in the Python process. This would allow whatever you are doing in Python to run as a continuous service, rather than as a new process started via the shell on every page view (which is typically when PHP will need the Python services).

  • The string is right as I put it. If it is printing like this, it is because Python encodes to "utf-8" at the time of printing, but your terminal is in Latin-1. (The hint in this case: when a program interprets a UTF-8 byte string as if it were Latin-1, the "Ã" character always appears, followed by some other accented or non-printable character.) If you are reading the Python output directly into PHP instead of printing to the terminal, then the output of your PHP page is not sending encoding headers, and the default codec, going back to the 1990s, is latin1.

  • ... have your PHP program send the HTTP header Content-Type: text/html; charset=utf-8 with the generated page, so that the UTF-8 characters produced by Python display correctly on it.
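The "Ã" symptom described in the first comment can be reproduced directly. This sketch decodes UTF-8 bytes with the wrong codec on purpose; the variable names are only illustrative.

```python
word = "SENTENÇA"

# Python encodes the text as UTF-8: Ç becomes the two bytes C3 87.
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)

# A reader that assumes Latin-1 turns C3 into "Ã" and 87 into a
# non-printable control character.
misread = utf8_bytes.decode("latin1")
print(repr(misread))

# Reversing the wrong step recovers the original text.
print(misread.encode("latin1").decode("utf-8"))
```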
