Why does a char array support a character like ç but a char variable does not?


If the char type only supports ASCII characters, why does the following code produce normal output when the input contains characters that are not part of ASCII, such as accented characters?

#include <iostream>
using namespace std;

int main(void)
{
    char test2[10];

    // reads at most 9 characters into the array and appends a '\0'
    cin.get(test2, 10);

    cout << test2 << endl;

    return 0;
}

Also, since an array of char accepts input containing this kind of character, why does a single char not accept it? How can I represent this character in C++? Should I use wchar_t? I have read a little about the type in books, but since all of them were in English or translated from English, it seems the authors did not pay much attention to wchar_t.

1 answer

This has little to do with C, C++ or any other programming language.

Internally, the computer only knows numbers. To represent letters it uses an encoding. In principle, anyone can define their own encoding.

There are several current encodings.

Many of these encodings use only the numbers from 0 to 255 (or -128 to 127), i.e. they are 8-bit encodings.
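
For instance, in C++ you can inspect this range directly. A minimal sketch (whether plain char is signed or unsigned is implementation-defined, so the printed range may be 0 to 255 or -128 to 127):

#include <iostream>
#include <limits>

int main(void)
{
    // On virtually every platform char has 8 bits, so it can hold
    // exactly 256 distinct values.
    std::cout << static_cast<int>(std::numeric_limits<char>::min()) << '\n';
    std::cout << static_cast<int>(std::numeric_limits<char>::max()) << '\n';
    return 0;
}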

In the ASCII encoding (which uses only 7 bits) there is no representation for, for example, ã.

As the use of computers spread, it became necessary to extend these encodings to represent more than 128 characters.

One of the new encodings created was ISO-8859-1. In this encoding, ã has code 227. In the ISO-8859-8 encoding, on the other hand, the same code 227 represents the character ד (Dalet).
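
A small sketch of what this means in practice (illustrative only: what actually appears depends entirely on which encoding your terminal is configured to use):

#include <iostream>

int main(void)
{
    // The byte value 227 (0xE3) is just a number.  A terminal configured
    // for ISO-8859-1 displays it as ã; one configured for ISO-8859-8
    // displays the same byte as ד (Dalet).
    char c = static_cast<char>(227);
    std::cout << c << std::endl;
    return 0;
}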

So far so good. All encoded numbers fit in 8 bits.

Obviously this creates the problem of always having to know which encoding was used to convert the numbers back into characters. This problem occurred frequently in the early days of the Internet, when people from different countries exchanged e-mails, each using a different encoding.

To solve this problem of multiple encodings, a scheme was invented to encode more than 256 characters in a single encoding that works for every country: Unicode.

But Unicode code points are too big to fit in 8 bits. Regardless of how these codes are translated for internal representation on the computer (direct representation, UTF-8, UTF-16, ..., little-endian, big-endian, ...), 8 bits are not enough, so the type char is not suitable for Unicode.
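
That is presumably also why the char array in the question appears to work: with a UTF-8 terminal, cin.get simply stores the bytes that make up the accented character in the array, and cout writes those same bytes back for the terminal to reassemble; a single char could not hold all of them. A minimal sketch of the sizes involved (assuming the source file is saved as UTF-8; char32_t and the U'' literal require C++11):

#include <iostream>
#include <cstring>

int main(void)
{
    // In UTF-8 the single character ç occupies two bytes,
    // so it fits in a char array but not in one char.
    const char utf8[] = "ç";
    std::cout << std::strlen(utf8) << " bytes\n";         // typically prints 2

    // A type wide enough for any Unicode code point is char32_t (4 bytes);
    // wchar_t also exists, but its size varies between platforms
    // (commonly 2 bytes on Windows, 4 on Linux).
    char32_t cedilla = U'ç';                              // code point U+00E7
    std::cout << static_cast<unsigned>(cedilla) << '\n';  // prints 231
    return 0;
}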

  • That’s a good answer. Whenever I comment on Unicode, I usually recommend reading an article that has become a reference on the subject - and I recommend it again, to @Ugusto and to anyone with a similar doubt: http://local.joelonsoftware.com/wiki/O_M%C3%Adnimo_absoluto_que_todos_programmeres_de_software_absolutely need,_,Positivamente_de_Saber_Sobre_Unicode_e_Conjuntos_de_Caracteres%28Sem_Desculpas! %29

  • I saw the article, very good and well explained. Thank you for posting.

  • Just curious, to complement: in Rust the type char occupies 4 bytes, precisely because of Unicode. https://doc.rust-lang.org/std/primitive.char.html - To store just one byte, there are i8 (signed) and u8 (unsigned), respectively.
