The method encode
returns an instance of bytes
. And an object of the kind bytes
, according to the documentation is a sequence of numbers, whose values are in the range between 0 and 255.
And a number is not "in" a specific format, or on a given basis. Of course, at the end of the day, everything turns into a bunch of bytes, but the way these bytes are interpreted and displayed varies depending on the situation.
The number 97, for example, can be interpreted as the letter "a" (if we consider the ASCII table), or as the numerical value 97 itself (which in turn can be written as 61
in hexadecimal, or 01100001
in binary, or 141
octal, or 97.0
, 00097
, 97,00
, etc). Or it could still be a specific code that varies according to the context (for example, could represent the code of a color in RGB). That is, the bits would be the same, but the way they are displayed may vary.
In the specific case of an object bytes
, when printed, the values corresponding to printable ASCII characters are shown as the characters themselves, and other values are shown in hexadecimal, with the prefix \x
. That was the choice of language to display values when they are part of an object bytes
.
If you want another format, you will have to format it yourself. Two options are to use f-string (from Python 3.6) or bin
. Example:
for b in "café".encode("latin-1"):
print(f'{b:08b} {bin(b)}')
The difference is that bin
prefix 0b
and does not fill with zeros on the left. The output for the above code will be:
01100011 0b1100011
01100001 0b1100001
01100110 0b1100110
11101001 0b11101001
Of course once you have chosen the way to format, you can build the string however you want. For example:
# 01100011 01100001 01100110 11101001
print(' '.join(f'{b:08b}' for b in "café".encode("latin-1")))
For versions prior to 3.6 you can use '{:08b}'.format(b)
in place of f'{b:08b}'
.
Another option, if you want everything together, is to convert the object bytes
for hexadecimal (using the method hex
), then convert to int
and finally pass on this number to bin
:
# 1100011011000010110011011101001
print(bin(int("café".encode("latin-1").hex(), 16))[2:])
And I even used the Slice [2:]
to remove the prefix 0b
. But I believe that uniting everything with join
it seems simpler to me than convert to hexadecimal, then convert to int
, then convert to binary.
To convert to hexadecimal, simply change the formatting, or use hex
directly:
encoded = "café".encode("latin-1")
print(encoded.hex()) # 636166e9
# a partir do Python 3.8, você pode escolher o separador
print(encoded.hex(' ')) # 63 61 66 e9
print(encoded.hex('-')) # 63-61-66-e9
# para versões anteriores a 3.8, você pode usar join para ter o separador
print('-'.join(f'{b:02x}' for b in encoded)) # 63-61-66-e9
print('-'.join(f'{b:02X}' for b in encoded)) # 63-61-66-E9
The difference is that hex
always put the digits of a
to f
as lower case letters, while using f-string you can choose both lower case and upper case (using the format x
or X
). Read the documentation to learn more about formatting options.
Finally, it is worth remembering that it is always possible to take the value of each byte individually, obtaining its respective numerical value:
encoded = "café".encode("latin-1")
print(encoded[1]) # 97
print(type(encoded[1])) # <class 'int'>
And once having this number, you can format it as you like (using f-string, for example).
And notice how the number displayed individually is shown as the numerical value 97
, no longer as the ASCII character a
, which shows that the type actually changes the way the byte is interpreted and displayed. The fact that this value is inside an object bytes
does not cause it to "stay at base 2", so it did not satisfy your expectation that it would be displayed in binary.
any suggestions of how I can improve my question?
– Lucas
The concept that Unicode is a key-value structure where character is represented by a number. Unicode is a database divided into 16 code Planes separated into a total of 163 blocks of code points cataloguing information as schematic name, category , Joining type,.... In python this database is accessed by the module
unicodedata
standardized by the standard Unicode.– Augusto Vasques
Just to be pedantic, Unicode goes far beyond the "number character" mapping, since it also defines collation (alphabetical order according to the locale), syllabic separation and other forms of word breaking, and much more. But of course, for the context of the question - and even for didactic simplification - it is not "wrong" to say that it defines a large "notary" that maps each character to a number (and the plans and blocks would be "only" sub-divisions of this "notary") :-) See here for more details. cc @Augustovasques
– hkotsubo