This has little to do with C, C++, or any other programming language.
Internally, the computer only knows numbers. To represent letters, it uses an encoding, and anyone can invent their own custom encoding.
Several encodings are in current use.
Many of these encodings use only the numbers 0 to 255 (or -128 to 127), i.e., they are 8-bit encodings.
In the ASCII encoding (only 7 bits) there is no representation for, for example, ã.
As the use of computers spread, it became necessary to extend the encodings in use to represent more than 128 characters.
One of the new encodings created was named ISO-8859-1. In this encoding, ã has code 227. However, in the ISO-8859-8 encoding, that same code 227 represents the character ד (Dalet).
So far so good. All encoded numbers fit in 8 bits.
Obviously there is the problem of always having to know which encoding was originally used to turn the numbers back into characters. This problem was common in the early days of the Internet, when people from different countries exchanged emails, each using a different encoding.
To solve this problem of different encodings, a scheme was invented to encode more than 256 characters in a single encoding that works for all countries: Unicode.
But Unicode code points are too big to fit in 8 bits. Regardless of how these codes are translated into the computer's internal representation (direct representation, UTF-8, UTF-16, ..., little-endian, big-endian, ...), 8 bits are not enough... so the type char is not suitable for Unicode.
That’s a good answer. Whenever I comment on Unicode, I usually recommend reading an article that has become the reference explanation, and I recommend it again to @Augusto and anyone with a similar doubt: http://local.joelonsoftware.com/wiki/O_M%C3%Adnimo_absoluto_que_todos_programmeres_de_software_absolutely need,_,Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres%28Sem_Desculpas!%29
– jsbueno
I read the article; very good and well explained. Thank you for posting it.
– Augusto
Just out of curiosity, to complement: in Rust the type char occupies 4 bytes, precisely because of Unicode. https://doc.rust-lang.org/std/primitive.char.html To store only one byte, there are the i8 (signed) and u8 (unsigned) types, respectively. – Bacco