How to get binary code from a string without ASCII characters in Python?

I’m studying Unicode and encodings. So far I understand that Unicode is a key-value structure in which each character is represented by a number. Example:

import string

alphabet=list(string.ascii_lowercase)

for letter in alphabet:
    print(letter,":",ord(letter))

Returns:

a : 97
b : 98
c : 99
d : 100
e : 101
f : 102
g : 103
h : 104
i : 105
j : 106
k : 107
l : 108
m : 109
n : 110
o : 111
p : 112
q : 113
r : 114
s : 115
t : 116
u : 117
v : 118
w : 119
x : 120
y : 121
z : 122

Encoding, on the other hand, is something completely different: it is the transformation of a string into bytes. This transformation can be based on several key-value structures, which may or may not contain the key for the desired value. The character ä can be transformed into bytes using the latin-1 codec, but it raises an error if the codec used is ascii.
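To make this concrete, here is a minimal sketch that tries to encode the same character with both codecs. The helper name try_encode is made up for this illustration; the only real API used is str.encode, which raises UnicodeEncodeError when the codec has no entry for a character.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

print(try_encode("ä", "latin-1"))  # b'\xe4'
print(try_encode("ä", "ascii"))    # None
```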

What I would like is to get the binary value of a character in a given codec, but I can’t, because Python always prints the character instead of the byte value when the character exists in ASCII. Example:

"a".encode("latin-1")

Returns:

b'a'

I actually expected to see 01100001 which, from what I read, is the binary code of a in the latin-1 codec.

Note that when the character does not exist in ASCII, Python prints the hexadecimal (which I can then convert into binary):

"café".encode("latin-1")

Returns:

b'caf\xe9'

How do I make Python print binary (or hexadecimal) code corresponding to the character instead of the ASCII character?

  • any suggestions of how I can improve my question?

  • The idea that Unicode is a key-value structure in which each character is represented by a number is mistaken. Unicode is a database divided into 17 planes, subdivided into a total of 163 blocks of code points, cataloguing information such as name, category, joining type, and so on. In Python this database is accessed through the unicodedata module, which follows the Unicode standard.

  • Just to be pedantic, Unicode goes far beyond the "character to number" mapping, since it also defines collation (alphabetical ordering according to the locale), syllable separation and other forms of word breaking, and much more. But of course, in the context of the question, and even as a didactic simplification, it is not "wrong" to say that it defines a big "table" that maps each character to a number (and the planes and blocks would be "only" subdivisions of this table) :-) See here for more details. cc @Augustovasques
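As a small illustration of the comments above, this sketch looks up a few properties of one character with the standard unicodedata module. The character is chosen arbitrarily; unicodedata.name and unicodedata.category are the only lookups used.

```python
import unicodedata

ch = "\u00e9"  # é

code_point = ord(ch)                 # 233
name = unicodedata.name(ch)          # 'LATIN SMALL LETTER E WITH ACUTE'
category = unicodedata.category(ch)  # 'Ll' (letter, lowercase)

print(f"U+{code_point:04X} {name} [{category}]")
# U+00E9 LATIN SMALL LETTER E WITH ACUTE [Ll]
```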

1 answer


The encode method returns an instance of bytes. And a bytes object, according to the documentation, is a sequence of integers whose values are in the range 0 to 255.

And a number is not "in" a specific format or in a given base. Of course, at the end of the day, everything turns into a bunch of bits, but the way those bits are interpreted and displayed varies depending on the situation.

The number 97, for example, can be interpreted as the letter "a" (if we consider the ASCII table), or as the numerical value 97 itself (which in turn can be written as 61 in hexadecimal, 01100001 in binary, 141 in octal, or as 97.0, 00097, 97,00, etc). Or it could be a specific code that varies according to the context (for example, it could represent a component of an RGB color). That is, the bits would be the same, but the way they are displayed may vary.
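A quick sketch of the same idea in code: one integer, several textual representations, all of which convert back to the same value. The variable names are only for illustration.

```python
n = 97

as_char = chr(n)            # 'a'
as_hex = format(n, "x")     # '61'
as_bin = format(n, "08b")   # '01100001'
as_oct = format(n, "o")     # '141'
as_float = float(n)         # 97.0

# Every representation converts back to the same integer.
assert int(as_hex, 16) == int(as_bin, 2) == int(as_oct, 8) == ord(as_char) == n

print(as_char, as_hex, as_bin, as_oct, as_float)  # a 61 01100001 141 97.0
```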

In the specific case of a bytes object, when it is printed, the values that correspond to printable ASCII characters are shown as the characters themselves, and other values are shown in hexadecimal with the prefix \x. That was the language's design choice for displaying values that are part of a bytes object.
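To see that this is only a display choice, the sketch below builds a bytes object from three arbitrary integers and compares its printed form with the integers it actually stores.

```python
data = bytes([97, 233, 0])  # three arbitrary byte values

shown = repr(data)   # the text that print(data) displays
values = list(data)  # the integers stored in the object

print(shown)   # b'a\xe9\x00'
print(values)  # [97, 233, 0]
```

Only 97 falls in the printable ASCII range, so it is the only value displayed as a character.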

If you want another format, you will have to format it yourself. Two options are to use an f-string (available since Python 3.6) or bin. Example:

for b in "café".encode("latin-1"):
    print(f'{b:08b}  {bin(b)}')

The difference is that bin adds the prefix 0b and does not pad with zeros on the left. The output of the code above is:

01100011  0b1100011
01100001  0b1100001
01100110  0b1100110
11101001  0b11101001

Of course once you have chosen the way to format, you can build the string however you want. For example:

# 01100011 01100001 01100110 11101001
print(' '.join(f'{b:08b}' for b in "café".encode("latin-1")))

For versions prior to 3.6 you can use '{:08b}'.format(b) in place of f'{b:08b}'.
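If this is needed in more than one place, the formatting can be wrapped in a small function. This is just a sketch: the name to_bits and its parameters are made up here, and it uses str.format so it also runs on versions before 3.6.

```python
def to_bits(text, encoding="latin-1", sep=" "):
    """Encode text and return each byte as an 8-digit binary string."""
    return sep.join("{:08b}".format(b) for b in text.encode(encoding))

print(to_bits("a"))     # 01100001
print(to_bits("café"))  # 01100011 01100001 01100110 11101001
```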


Another option, if you want everything together, is to convert the bytes object to hexadecimal (using the hex method), then convert that to int, and finally pass this number to bin:

# 1100011011000010110011011101001
print(bin(int("café".encode("latin-1").hex(), 16))[2:])

I also used the slice [2:] to remove the prefix 0b. Note that the result has 31 digits rather than 32, because bin drops the leading zero of the first byte. Personally, I find joining everything with join simpler than converting to hexadecimal, then to int, then to binary.
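A variation on the single-number approach, as a sketch: int.from_bytes skips the hexadecimal step, and padding to 8 digits per byte keeps the leading zeros that bin would discard. The byte order "big" is chosen so the bits appear in the same order as the bytes.

```python
encoded = "café".encode("latin-1")

# Interpret the whole byte sequence as one big-endian integer...
number = int.from_bytes(encoded, "big")

# ...and pad to 8 digits per byte so leading zeros are not lost.
bits = format(number, "0{}b".format(8 * len(encoded)))

print(bits)  # 01100011011000010110011011101001
```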


To convert to hexadecimal, simply change the formatting, or use hex directly:

encoded = "café".encode("latin-1")
print(encoded.hex()) # 636166e9

# since Python 3.8, you can choose the separator
print(encoded.hex(' ')) # 63 61 66 e9
print(encoded.hex('-')) # 63-61-66-e9

# for versions before 3.8, you can use join to get the separator
print('-'.join(f'{b:02x}' for b in encoded)) # 63-61-66-e9
print('-'.join(f'{b:02X}' for b in encoded)) # 63-61-66-E9

The difference is that hex always writes the digits a to f in lower case, while with an f-string you can choose lower case or upper case (using the format x or X). Read the documentation to learn more about the formatting options.


Finally, it is worth remembering that it is always possible to take the value of each byte individually, obtaining its respective numerical value:

encoded = "café".encode("latin-1")
print(encoded[1]) # 97
print(type(encoded[1])) # <class 'int'>

Once you have this number, you can format it however you like (using an f-string, for example).

Notice also that the number, when displayed individually, is shown as the numerical value 97 and no longer as the ASCII character a. This shows that the type really does change the way the byte is interpreted and displayed. The fact that this value is inside a bytes object does not make it "stay in base 2", which is why your expectation that it would be displayed in binary was not met.
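Putting the pieces together, here is a rough sketch that shows the bits of one character under more than one codec. The function name describe and the default codec list are assumptions made for this example; the point is that the code point (ord) stays the same while the bytes depend on the codec.

```python
def describe(ch, codecs=("latin-1", "utf-8")):
    """Map each codec name to the list of 8-bit binary strings for ch."""
    return {codec: [f"{b:08b}" for b in ch.encode(codec)] for codec in codecs}

print(ord("é"))       # 233 (the Unicode code point)
print(describe("é"))
# {'latin-1': ['11101001'], 'utf-8': ['11000011', '10101001']}
```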
