Character size (ASCII vs other encodings) in bytes


Seeing this question, a doubt arose. Coming from PHP, and having had "problems" with character encoding in the past (e.g. strpos vs mb_strpos), I knew that all ASCII characters take 1 byte, but I thought that special characters would take more; I associated a character being "special" with it also being multi-byte.

That is, if I save a simple .txt with an "a" character, for example, it is 1 byte long, but if I save it with an "ã" character, it is 2 bytes long. Yet the example below indicates that the special character has 4 bytes.

#include <iostream>
using namespace std;

int main() {
    char a = 'a';
    cout << sizeof(char) << "\n"; // 1
    cout << sizeof(a) << "\n"; // 1
    cout << sizeof('ã')  << "\n"; // 4
}

Where do we stand?

2 answers

The type char in C, and consequently in C++, does not have a good name. I actually think it should be called byte, because that is what it is. Its use as a character is just a detail.

Contrary to popular belief, C is a weakly typed language. It is statically typed, but weak. People don't understand these terms very well. C can interpret the same data as if it were of a different type or shape than the one originally intended. This can be observed in this code:

char a = 'a';
printf("%c\n", a); /* interpreted as a character: prints "a"  */
printf("%d\n", a); /* interpreted as a number:    prints "97" */

See it working on ideone, and on repl.it. I also put it on GitHub for future reference.

The same data can be displayed as a number or as a character.

Some C functions let you make this interpretation as a character; in general you need to say explicitly that it should be so. That is what %c is for: it indicates that the data should be treated as a character. By default a char is treated as a number.

Any character encoding that fits in 1 byte can be stored in a char. When C was created, only ASCII existed (at least in any relevant way).

Then more complete encodings appeared that use the whole byte to represent more characters. It was getting complicated; code pages (charsets) were created. To "simplify" and allow even more characters, the multi-byte character was created. At that point it was no longer possible to use char as the type to store a character, since it is guaranteed to have only 1 byte.

Nothing prevents you from using a sequence of chars and saying that it is a single character, but that will be your own solution, and only your functions will know what to do with it. Third-party C libraries, including operating systems, do not know how to handle it. So nobody does that. A lot of people don't understand that C is a language for working with things at a low level: you can do whatever you want to do. Breaking the convention is your problem.
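
As a sketch of that, assuming the source and execution character sets are UTF-8 (which is what ideone and most Linux compilers use by default), an "ã" kept in a sequence of chars is just two bytes, and the standard functions see bytes, not one character:

#include <iostream>
#include <cstring>
using namespace std;

int main() {
    const char text[] = "ã";       // two bytes in UTF-8 (0xC3 0xA3) plus the '\0'
    cout << strlen(text) << "\n";  // 2: strlen counts bytes, not characters
    cout << sizeof(text) << "\n";  // 3: the two bytes plus the terminator
}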

When we need multi-byte characters we usually use the type wchar_t. Its size can vary according to the implementation; the specification leaves it open. In some cases we use char16_t and char32_t, whose sizes are guaranteed by the specification. This is standardized.
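
A sketch that also covers the fixed-size types (char16_t and char32_t are C++11; the wchar_t result is whatever the platform chose, typically 4 bytes with GCC on Linux and 2 bytes on Windows):

#include <iostream>
using namespace std;

int main() {
    cout << sizeof(char)     << "\n"; // 1, by definition
    cout << sizeof(wchar_t)  << "\n"; // implementation-defined: 4 on Linux/GCC, 2 on Windows
    cout << sizeof(char16_t) << "\n"; // at least 16 bits; 2 bytes on common platforms
    cout << sizeof(char32_t) << "\n"; // at least 32 bits; 4 bytes on common platforms
}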

Let’s run this code to better understand:

#include <iostream>
using namespace std;

int main() {
    char a = 'a';
    char b = 'ã';     // narrowed to a single byte (the compiler may warn)
    wchar_t c = 'a';
    wchar_t d = 'ã';
    cout << sizeof(char) << "\n"; // 1
    cout << sizeof(a) << "\n";    // 1
    cout << sizeof('a') << "\n";  // 1 (in C++ a char literal is a char)
    cout << sizeof(b) << "\n";    // 1
    cout << sizeof(c) << "\n";    // 4 on this compiler
    cout << sizeof(d) << "\n";    // 4 on this compiler
    cout << sizeof('ã') << "\n";  // 4: the literal does not fit in a char
}

See it working on ideone, and on repl.it. I also put it on GitHub for future reference.

Do you see that the accent does not make it occupy more bytes? b was declared as char and has only one byte, even with the accent. And c has 4 bytes even though it holds a character that fits in ASCII. The size is determined by the type of the data or variable. Where I explicitly said it is a char, the compiler used 1 byte. Where it could infer that a char was enough, it used 1 byte. Where I explicitly said it is a wchar_t, it occupied 4 bytes. Where it inferred that more than one byte was needed to represent the character, it adopted 4 bytes. That is why your sizeof('ã') gave 4 bytes: the literal does not fit in a single char, so the compiler used a wider 4-byte type for it.

It is clear that in this compiler wchar_t has 4 bytes.

Every C and C++ library understands wchar_t as a type for storing characters rather than numbers, although what is actually stored is always numbers; computers do not know what characters are, they just use a trick to show them to the people who want to see them.

Again, in C you do as you wish. If you want to make every character take one byte you can, even the accented ones. Of course there are only 256 possible values in one byte, so you cannot have every possible character in that situation.
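
A sketch of that single-byte approach: in ISO-8859-1 (Latin-1), "ã" is the single byte 0xE3, so it fits in one char; whether it shows up correctly depends on the terminal expecting that encoding, not on the language:

#include <iostream>
using namespace std;

int main() {
    unsigned char a_tilde = 0xE3;               // 'ã' in ISO-8859-1: a single byte
    cout << sizeof(a_tilde) << "\n";            // 1
    cout << static_cast<int>(a_tilde) << "\n";  // 227, the number actually stored
    cout << a_tilde << "\n";                    // only renders as "ã" on a Latin-1 terminal
}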

  • Thank you. The fact that you declared b as char makes it not print correctly: http://ideone.com/JshA87 . On my computer b prints as what we know so well, a little square with a "?" inside it.

  • @Miguel Yes, cout doesn't know how to handle it. That is what I said: you have to create your own functions to deal with it. So I didn't even demonstrate that use; I demonstrated that it can use 1 byte. All of this is in the answer. How to handle it is an implementation detail. It is possible to configure this, but that is another problem :)

  • Exactly, bigown, thank you. In fact I thought the size of a character was universal (I don't know why), maybe because I follow http://codegolf.stackexchange.com/ a lot and there they always count one byte per character (ASCII).

  • Just an aside: printf("%d\n", a);, if I am not mistaken, prints the ASCII code of the character 'a' (97): http://img.over-blog-kiwi.com/1/24/59/98/20140923/ob_806ac6_codigo-ascii.jpg

  • The normal thing is to treat it as 1 byte, yes. A character has no universal size, and neither does the type; it depends on the compiler, architecture and platform. Actually your second observation is the other way around :P printf("%c\n", a); prints the character that the ASCII table defines for 97, because for C the normal thing is the number; it is people who are more used to the character. char is a numeric type, not a text type. It can be used as if it were text.

TL;DR: It depends on the encoding and on some language/platform details.

Each UTF-8 character occupies 1 to 6 bytes

Each UTF-16 character occupies 16 bits

Each UTF-32 character occupies 32 bits

Each character of an ASCII string occupies 1 byte

Source
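
Those counts can be checked in C++11 by writing the same character with different literal prefixes. This is only a sketch for a character such as "ã" (code point U+00E3), which needs 2 bytes in UTF-8 and a single code unit in UTF-16 and UTF-32; sizeof includes the terminating code unit, hence the subtraction:

#include <iostream>
using namespace std;

int main() {
    cout << sizeof(u8"ã") - 1                << "\n"; // 2: two UTF-8 bytes
    cout << sizeof(u"ã")  - sizeof(char16_t) << "\n"; // 2: one 16-bit UTF-16 code unit
    cout << sizeof(U"ã")  - sizeof(char32_t) << "\n"; // 4: one 32-bit UTF-32 code unit
}

A character outside the Basic Multilingual Plane (an emoji, say) would print 4, 4 and 4 here, which is exactly the variable-length behaviour the comments below point out.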


Well, I think it is good to remember that each language/platform you work with is free to decide how it will allocate memory for each of the supported types.

C

In the case of C, it does minimal work: it allocates just enough space for the type and may add a few extra bytes of padding, so that the data is friendlier to the cache and to memory reads/writes.

See this question for more information on C
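
As a minimal sketch of that padding (the exact numbers are compiler- and target-dependent, but this is the typical result on 64-bit platforms):

#include <iostream>
using namespace std;

struct Example {
    char c;   // 1 byte
    // the compiler typically inserts 3 bytes of padding here
    int  n;   // 4 bytes, kept on a 4-byte boundary
};

int main() {
    cout << sizeof(char) + sizeof(int) << "\n"; // 5: the members alone
    cout << sizeof(Example)            << "\n"; // usually 8: padding included
}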

C#

In the case of C#, for example, all non-primitive objects have an overhead of 8 or 16 bytes; this question also clarifies why.

Python

Objects

Python uses a technique similar to C#'s. The answer to this question on SOEN indicates that every Python object occupies an extra 16 bytes (on 64-bit builds). It seems that every object stores a reference count and a reference to the object's type. The official Python documentation explains how an object is structured.

I found a very detailed article on this subject

It seems that Python also pads objects, up to 256 bytes: if you allocate a 10-byte object it will actually occupy 16.

Strings

The article also gives more details about the size a string occupies.

An empty string takes 37 bytes, and each additional character adds one byte to the size. Unicode strings are similar, but they have an overhead of 50 bytes and each additional character occupies 4 bytes (I believe the author made a mistake there). In Python 3 the overhead is 49 bytes.

The information seems somewhat contradictory to what is given in a question on SOEN, but this depends on the version of Python you are using, so it stays here for reference.

This other question on SOEN also has a table that explains how much space each object occupies.

Bytes  type        empty + scaling notes
24     int         NA
28     long        NA
37     str         + 1 byte per additional character
52     unicode     + 4 bytes per additional character
56     tuple       + 8 bytes per additional item
72     list        + 32 for first, 8 for each additional
232    set         sixth item increases to 744; 22nd, 2280; 86th, 8424
280    dict        sixth item increases to 1048; 22nd, 3352; 86th, 12568 *
64     class inst  has a __dict__ attr, same scaling as dict above
16     __slots__   class with slots has no dict, seems to store in 
                   mutable tuple-like structure.
120    func def    doesn't include default args and other attrs
904    class def   has a proxy __dict__ structure for class attrs
104    old class   makes sense, less stuff, has real dict though.
  • Small fix: UTF-16 is a variable-length encoding, as code points are encoded with one or two 16-bit code units.

  • Another correction: since the creation of UTF-16, UTF-8 can no longer have 5 or 6 bytes, since 4 are sufficient to encode up to 0x10FFFF.
