What is "\x" in Python strings?


3

I answered a question here on the site that contained the following Python string:

"\xf7\x1a\xa6\xde\x8f\x17v\xa8\x03\x9d2\xb8\xa1V\xb2\xa9>\xddC\x9d\xc5\xdd\xceV\xd3\xb7\xa4\x05J\r\x08\xb0"

I imagine that \x has something to do with some kind of encoding, but I’m not sure what it is.

What exactly is this \x in the string? Does it have to do with hexadecimal?

2 answers

5


It is an escape sequence, meaning that the following two characters should be interpreted as the hexadecimal digits of a character code.

Try it in the terminal:

>>> 0x65
101

>>> "\x65"
'e'

0xHH on its own is a numeric literal, i.e. an integer written in hexadecimal; inside a string, "\xHH" represents the character with that code.
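The difference between the numeric literal and the string escape can be checked directly. This is a minimal sketch using only built-ins; `ord` and `chr` are not part of the answer above and are added here just to show how the two forms relate:

```python
# 0x65 is an integer literal written in hexadecimal.
number = 0x65

# "\x65" is a one-character string whose character code is 0x65.
text = "\x65"

print(number)               # 101
print(text)                 # e
print(ord(text) == number)  # True: ord() gives the character's code
print(chr(number) == text)  # True: chr() builds the character from a code
```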

2

The "\xHH" inside a string indicates that the next two characters (represented here by "HH") will be interpreted as hexadecimal digits, and it is therefore a way to represent any arbitrary byte inside a Python string.

Thus, b"\xff" is a byte string with a single byte of value 255 (ff in hexadecimal).
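A short sketch of that byte string in Python 3, where indexing a `bytes` object yields an integer (the variable name `data` is only illustrative):

```python
data = b"\xff"

print(len(data))              # 1: a single byte
print(data[0])                # 255: indexing bytes gives an int in Python 3
print(data == bytes([255]))   # True: same byte built from its numeric value
print(b"\x41\x42" == b"AB")   # True: escapes and ASCII letters can be mixed
```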

It is important to keep in mind that in Python 3 text, as in the Unicode strings of Python 2, "\xHH" denotes a character, which is not necessarily stored as a single byte. In a text string, "\xHH" is the character with code point 0xHH, and the code points from 0 to 255 coincide with the character encoding known as "latin1" - closely related to the one used in many versions of Windows for Brazilian Portuguese. This means that any arbitrary byte specified with the prefix "\xHH" corresponds to exactly one Python 3 text character, although not always a printable one (codes 0 to 31 and 127 to 159 are control characters).
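The correspondence between byte values and "latin1" characters can be verified with a small sketch. The names `all_bytes`, `text` and `same_codes` are illustrative, and `str.isprintable` is used only to show that some of these characters are control codes rather than visible glyphs:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as latin1 maps byte N to the character with code point N.
text = all_bytes.decode("latin1")
same_codes = all(ord(ch) == b for ch, b in zip(text, all_bytes))

print(len(text))                       # 256
print(same_codes)                      # True
print(text.encode("latin1") == all_bytes)  # True: the round trip is lossless

print("\xe7")                          # ç
print("\xe7".isprintable())            # True
print("\x80".isprintable())            # False: U+0080 is a control character
```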

An interesting experiment is to write numeric data to a binary file, read it back as text, and see how it is represented:

In [23]: f = open("teste.bin", "wb")

In [24]: f.write(bytearray((0, 0, 255, 255, 128, 128)))
Out[24]: 6

In [25]: f.close()

In [26]: open("teste.bin", encoding="latin1").read()
Out[26]: '\x00\x00ÿÿ\x80\x80'

(In this case, the character ÿ has the code 255 (0xff):)

In [30]: print("\xff")
ÿ
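The same experiment can also be written as a standalone script. This sketch assumes a temporary directory is acceptable instead of writing "teste.bin" into the current folder, and uses `with` blocks so that the files are always closed:

```python
import os
import tempfile

payload = bytes([0, 0, 255, 255, 128, 128])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "teste.bin")

    # Write the raw bytes.
    with open(path, "wb") as f:
        written = f.write(payload)

    # Read them back as text, one latin1 character per byte.
    with open(path, encoding="latin1") as f:
        text = f.read()

print(written)     # 6
print(repr(text))  # '\x00\x00ÿÿ\x80\x80'
```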

Similarly, in Python 3 (and in the Unicode strings of Python 2), the \u prefix designates a Unicode character directly by its code point value - for code points of up to 16 bits (four hexadecimal digits).

So, for example, the character with code point 0x263A, which is the smiley-face emoji, can be placed directly in Python source code:

In [42]: a = "\u263a"

In [43]: print(a)
☺

And for characters "further out", the \U prefix (uppercase "U") takes 8 hex digits - to express characters with code points greater than 65535 (0xffff). The semantics of "\xHH", "\uHHHH" and "\UHHHHHHHH" are the same.
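A brief sketch comparing the three escape forms in Python 3. The code point U+1F600 is chosen here only as an example of a character above 0xffff; it does not appear in the answer:

```python
smiley = "\u263a"        # four hex digits
grinning = "\U0001F600"  # eight hex digits, code point above 0xffff

print(len(smiley), hex(ord(smiley)))      # 1 0x263a
print(len(grinning), hex(ord(grinning)))  # 1 0x1f600

# The three forms are interchangeable when the code point fits.
print("\x41" == "\u0041" == "\U00000041" == "A")  # True
```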

Now, what might be interesting is that sometimes we get a string "encoded twice" - that is, one in which the sequence \xHH really is four separate characters (for example, if we save a .txt file containing the sequence \x41, it is a 4-byte file). If we want to read the single character represented by the byte 0x41 (capital "A"), we have to do some maneuvering. To reproduce the situation, we can simply escape the "\" by typing "\\" in a Python string (always Python 3):

In [36]: a = "\x41"

In [37]: a
Out[37]: 'A'

In [38]: a = "\\x41"

In [39]: len(a)
Out[39]: 4

In [40]: a
Out[40]: '\\x41'

That is - in this case, we have the "\" as a separate character - and not as a character that Python combines with the "x" and the next two digits at compile time. In order to "compile" this into a single character, we have to decode this text using the special "unicode_escape" codec. Only, it’s not so simple - you can’t call decode on text in Python 3, because it is already considered "decoded" - you need a byte string in order to call the decode method. Since our variable "a" is a string, the solution is to convert it to bytes first, using the encode method - we use the encoding "latin1", which maps each character to the byte with the same value, as long as the character's code is no greater than 255:

In [41]: a.encode("latin1").decode("unicode_escape")
Out[41]: 'A'
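The same maneuver can be wrapped in a small function. This is only a sketch: `unescape` is a hypothetical helper name, and it inherits the limitation mentioned above, namely that the input may only contain characters with codes up to 255:

```python
def unescape(text):
    """Hypothetical helper: turn literal escapes such as \\x41 into characters.

    Assumes every character in `text` has a code point of at most 255,
    because the intermediate step encodes to latin1.
    """
    return text.encode("latin1").decode("unicode_escape")


raw = "\\x41"              # four characters: backslash, x, 4, 1
print(len(raw))            # 4
print(unescape(raw))       # A
print(unescape("\\u263a")) # ☺ (the \u form is processed as well)

# A character above 255 cannot be encoded to latin1.
try:
    unescape("\u263a \\x41")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)              # True
```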
