The char type
In C, and consequently in C++, char is not a good name. I actually think it should have been called byte, because that's what it is. Its use as a character is just a detail.
Contrary to popular belief, C is a weakly typed language. It is statically typed, but weak. People don't understand these terms very well. C can interpret the same data as if it were of a different type or shape than the one originally intended. This can be observed in this code:
char a = 'a';
printf("%c\n", a); /* prints: a  */
printf("%d\n", a); /* prints: 97 */
See it working on ideone. And on repl.it. I also put it on GitHub for future reference.
The same data can be displayed as a number or as a character.
Some C functions allow you to interpret the data as a character; in general you need to say explicitly that it should be so. That's why %c exists: it indicates that the data should be treated as a character. By default, a char is treated as a number.
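To make this concrete, here is a minimal sketch of my own (the variable names are mine, not from the original): because a char is just a small integer, plain arithmetic works on it.

#include <stdio.h>

int main(void) {
    char letter = 'a';             /* stored as the number 97 (in ASCII) */
    char next = letter + 1;        /* plain integer arithmetic: 97 + 1 = 98 */
    printf("%c %d\n", next, next); /* prints: b 98 */
    return 0;
}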
Any character encoding that fits in 1 byte can be stored in a char. When C was created, only ASCII existed (at least in any relevant way).
Then more complete encodings appeared, using the whole byte to represent more characters. It was getting complicated; code pages (charsets) were created. To "simplify" and allow even more characters, the multi-byte character was created. At that point it was no longer possible to use char as the type to store a character, since it is guaranteed to have only 1 byte.
Nothing prevents you from using a sequence of chars and saying it is a single character, but that will be your own solution, and only your own functions will know what to do with it. Third-party C libraries, including the operating systems', don't know how to handle it. So nobody does this. A lot of people don't understand that C is a language for working with things in a raw way; you can do whatever you want. Breaking the convention is your problem.
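For example, nothing stops you from putting a multi-byte character into a sequence of chars yourself. A small sketch of mine (assuming a UTF-8 source file and terminal) shows that the standard functions then see only bytes:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char s[] = "ã";       /* in UTF-8 this takes 2 bytes plus the terminating '\0' */
    printf("%zu\n", strlen(s)); /* prints 2: strlen counts bytes, not characters */
    printf("%s\n", s);          /* shows ã only if the terminal decodes UTF-8 */
    return 0;
}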
When we need multi-byte characters we usually use the type wchar_t. Its size can vary according to the implementation; the specification leaves it open. In some cases we use char16_t and char32_t, which have their sizes guaranteed by the specification. This is standardized.
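A quick sketch of my own (it requires C++11) contrasting the guaranteed sizes with the implementation-defined one:

#include <iostream>
using namespace std;

int main() {
    char16_t c16 = u'ã'; // UTF-16 code unit: at least 16 bits, 2 bytes in practice
    char32_t c32 = U'ã'; // UTF-32 code unit: at least 32 bits, 4 bytes in practice
    wchar_t  wc  = L'ã'; // size is left to the implementation
    cout << sizeof(c16) << " " << sizeof(c32) << " " << sizeof(wc) << "\n"; // e.g.: 2 4 4
    return 0;
}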
Let’s run this code to better understand:
#include <iostream>
using namespace std;

int main() {
    char a = 'a';
    char b = 'ã';    // doesn't fit in 1 byte; it gets narrowed to one
    wchar_t c = 'a';
    wchar_t d = 'ã';
    cout << sizeof(char) << "\n"; // 1 - guaranteed by the standard
    cout << sizeof(a) << "\n";    // 1
    cout << sizeof(b) << "\n";    // 1 - the type of the variable decides
    cout << sizeof('a') << "\n";  // 1 in C++ (in C this literal would be an int)
    cout << sizeof(c) << "\n";    // 4 on this compiler
    cout << sizeof(d) << "\n";    // 4 on this compiler
    cout << sizeof('ã') << "\n";  // 4 - the literal doesn't fit in a char
}
See it working on ideone. And on repl.it. I also put it on GitHub for future reference.
Do you see that the accent does not make the character occupy more bytes? I declared b as a char and it has only one byte, even with the accent. And c has 4 bytes even though it holds a character that fits in ASCII. The size is determined by the type of the data or variable. Where I explicitly said it is a char, the compiler used 1 byte. Where it could infer that a char is enough, it used 1 byte; where I explicitly said it is a wchar_t, it occupied 4 bytes. Where it inferred that more than one byte was needed to represent the character, it adopted 4 bytes. That's why sizeof('ã') gave 4 bytes: the compiler inferred a wider type for that literal (the same size as wchar_t here).
It's clear that in this compiler wchar_t has 4 bytes.
Every C and C++ library understands wchar_t as a type for storing characters rather than numbers, even though what is stored is always a number; computers don't know what characters are, they just use a trick to show them to the people who want to see them.
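As an illustration, a small sketch of mine (it assumes the environment provides a UTF-8 locale): the standard library renders a wchar_t as a character when you ask it to.

#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "");               /* adopt the environment's locale so wide output converts */
    wchar_t ch = L'ã';
    wprintf(L"%lc = %d\n", ch, (int)ch); /* the same number, shown as a character and as a value */
    return 0;
}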
Again, in C you do as you wish. If you want to make every character take one byte, you can, even the accented ones. Of course, there are only 256 possible values in one byte, so you cannot represent every possible character in this situation.
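A tiny sketch of mine makes that limit explicit:

#include <stdio.h>
#include <limits.h>

int main(void) {
    /* CHAR_BIT is normally 8, so a char can hold 1 << 8 = 256 distinct values */
    printf("%d bits per char, %d possible values\n", CHAR_BIT, 1 << CHAR_BIT);
    return 0;
}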
Thank you. The fact that you declared b as char makes it not print correctly: http://ideone.com/JshA87 . On my computer, b is printed as what we know so well: a little square with a "?" inside it. – Miguel
@Miguel yes, cout doesn't know how to handle it. That's what I said: you have to create your own functions to deal with it. That's why I didn't even demonstrate this use; I demonstrated that it can use 1 byte. All of this is in the answer. How to deal with it is an implementation detail. It's possible to configure this, but that's another problem :) – Maniero
Exactly, bigown, thank you. In fact I thought the size of a character was universal (I don't know why), maybe because I follow http://codegolf.stackexchange.com/ a lot and there they always count one byte per character (ASCII). – Miguel
Just an aside: printf("%d\n", a);, if I am not mistaken, prints the ASCII code for the character 'a' (97): http://img.over-blog-kiwi.com/1/24/59/98/20140923/ob_806ac6_codigo-ascii.jpg – Miguel
Treating it as 1 byte is indeed the normal thing. A character has no universal size, and neither does the type; it depends on the compiler, architecture, and platform. Actually, your second observation is the other way around :P printf("%c\n", a); prints the character that the ASCII table defines for 97, because for C the normal thing is the number; it's just that everyone is more used to the character. char is a numeric type, not a text type. It can be used as if it were text. – Maniero