All information on a computer is stored as bytes, and characters/text are no different. One of the most basic ideas for working with text, still used today, is to map characters to numerical values and, so that the computer can manipulate them, turn those numerical values into bytes.
ASCII (American Standard Code for Information Interchange), created in the 1960s, does exactly this. The letter a has the numeric value 97, the comma has the value 44, and so on. Probably because it was created by an Anglophone entity, it only covers the characters used in the English language: the letters A to Z (without accents), the digits 0 to 9, the space and some punctuation marks (plus a few others, such as DEL, the line break and other control characters).
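To make the mapping concrete, here is a minimal, purely illustrative Python sketch showing the numeric values of the letter "a" and the comma, and the bytes they become:

```python
print(ord('a'))              # 97  -> the numeric value of the letter 'a'
print(ord(','))              # 44  -> the numeric value of the comma
print(chr(97))               # 'a' -> back from the number to the character
print('a,'.encode('ascii'))  # b'a,' -> the bytes 0x61 0x2C actually stored
```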
Thus, ASCII is a Character Set, a term that Unicode defines as "a set of elements used to represent textual information". ASCII also has control characters, which are not exactly text (there are characters for "non-printable things", such as the Bell, whose only purpose is to make a sound, among others). But most ASCII characters do in fact serve to "represent textual information".
Unicode also defines the term Coded Character Set: "a set of characters in which each character is associated with a numerical value. Usually abbreviated as character set, charset or code set; or as CCS". As such, ASCII is also a Coded Character Set. It is interesting to note that the terms charset, character set and code set are all considered synonyms of Coded Character Set.
In total there are 128 ASCII characters, with values from 0 to 127. Any of these values can be stored in 7 bits, so every ASCII value fits in one byte (8 bits). That left one bit to spare, which meant there were 128 more values that could be used while still fitting in a single byte. And they were not only used, but also abused: several different people had the same idea and ended up creating different mappings for the values between 128 and 255 (different character sets, in which the values from 0 to 127 remained the same as ASCII).
The result was that any value between 128 and 255 could represent completely different characters, depending on the mapping used. Each of these mappings was called a Code Page. Many used these values for characters that existed in their native languages but not in ASCII, for example. The fact that so many people created such mappings for their specific needs, without worrying about whether others were doing the same thing, led to the creation of many different Code Pages.
Since each value between 128 and 255 could represent a completely different character depending on the Code Page used, it became very difficult - not to say impossible - to use these characters simultaneously, among other problems.
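A small Python sketch of the problem: the same byte, 0xE1, decoded with three different Code Pages (all of these codecs ship with Python's standard library) yields three completely different characters:

```python
b = bytes([0xE1])
print(b.decode('cp1252'))  # 'á' - Western European (Windows-1252)
print(b.decode('cp1253'))  # 'α' - Greek (Windows-1253)
print(b.decode('cp866'))   # 'с' - Cyrillic (DOS code page 866)
```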
Unicode
Unicode, in the end, is "just" another mapping (much larger and more comprehensive than ASCII and the Code Pages, but still a mapping: a big "registry", or a big Coded Character Set). Each character has an associated numerical value, which is called its Code Point.
Code point values are written in hexadecimal and prefixed with U+ (indicating that this is not just any numerical value, but a code point defined by Unicode). So the letter a, code point 97 in decimal, is represented as U+0061 (since 61 in base 16 equals 97 in base 10). It is interesting to note that the value must be written with a minimum of 4 digits, so you could not write it as "U+61" or "U+061".
A code point can have values between zero (U+0000) and 1,114,111 (U+10FFFF), which gives a total of 1,114,112 possible code points. This range of values between zero and 1,114,111 is called the Unicode Codespace. Currently, only part of it is actually used: the most current version of Unicode (13.0) says it "has 143,859 characters".
On the origin of the U+xxxx notation, read here.
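As a quick illustration, here is a Python sketch that formats code points in the U+ notation (with at least 4 hex digits) and touches the upper limit of the codespace; u_plus is just a hypothetical helper name:

```python
def u_plus(ch):
    # format a character's code point in the U+ notation (minimum of 4 hex digits)
    return f"U+{ord(ch):04X}"

print(u_plus('a'))             # U+0061
print(u_plus('💩'))            # U+1F4A9 (more digits are used when needed)
print(u_plus(chr(0x10FFFF)))   # U+10FFFF - the last code point of the codespace
# chr(0x110000) would raise ValueError: it is outside the Unicode codespace
```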
The code points are divided into 17 "planes". The first is called the Basic Multilingual Plane (BMP) and contains the code points between U+0000 and U+FFFF. This plane contains most of the code points used in everyday life, since it covers the characters of practically every language in the world, in addition to numerous symbols (including punctuation marks, line breaks, spaces, etc.).
In addition to the BMP, there are 16 more planes covering the code points between U+10000 and U+10FFFF (the Supplementary Multilingual Plane (SMP), which contains the code points between U+10000 and U+1FFFF; the Supplementary Ideographic Plane (SIP), which contains the code points between U+20000 and U+2FFFF; etc.). They are called supplementary planes or astral planes - and the code points contained in them are called astral code points. They still contain several well-known characters, such as the PILE OF POO (U+1F4A9) and many other emojis (yes, the emojis on your phone also have code points associated with them).
In addition to the planes, there are also blocks (groupings of consecutive code points). Each block is usually composed of characters that have some common feature, whether it is being used by one or more specific languages, or belonging to some particular area (such as mathematical symbols, although there is more than one block for those).
But this separation is not that "perfect" either. For example, the yen symbol (¥) is in the Latin-1 Supplement block (which contains the code points between U+0080 and U+00FF), while the euro symbol (€) - and many other currency symbols - is in the Currency Symbols block (which contains the code points between U+20A0 and U+20CF). And there are several other cases of "similar" symbols that are in different blocks (not to mention the various blocks with "Miscellaneous" in the name, such as Miscellaneous Symbols and Miscellaneous Symbols and Pictographs, which contain numerous symbols, many with no relation to each other).
In addition to belonging to a block, each code point also has several properties indicating certain characteristics, such as whether it is a letter (and whether that letter is uppercase, lowercase, titlecase, etc.), whether it is a digit, a whitespace character, and so on.
One of these properties is the General_Category (informally called just "category"). The yen and euro symbols, for example, have the category Symbol, Currency (also represented by the abbreviation "Sc"). Every code point has a category, including those that do not yet have a defined mapping (see here the full list of Unicode categories).
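A short Python sketch using the standard unicodedata module to inspect the name and General_Category of a few code points:

```python
import unicodedata

# Inspect the Unicode name and the General_Category of a few code points.
for ch in ('¥', '€', 'a', '5'):
    print(ch, unicodedata.name(ch), unicodedata.category(ch))

# ¥ YEN SIGN Sc
# € EURO SIGN Sc
# a LATIN SMALL LETTER A Ll
# 5 DIGIT FIVE Nd
```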
It is worth noting that Unicode is backward compatible with ASCII, since the code points between U+0000 and U+007F (zero and 127) correspond to the same characters as the ASCII table.
An important point is that Unicode only defines the mapping between characters and their respective numerical values (the code points). Unicode, by itself, does not say how these numerical values should be represented as bytes. That is where encodings come in.
Encodings
In how many different ways can I represent, in bytes, the code point U+0020?
One way is to take the numerical value (20 in hexadecimal, or 32 in decimal) and write it as one byte (in binary, 00100000). This works great for code points between U+0000 and U+00FF, since all these values fit in one byte (it is what ASCII does, which is why ASCII can be considered both a character set and an encoding - unlike Unicode, which is a character set but not an encoding). But since Unicode allows code points up to U+10FFFF, many values need more than one byte to be represented.
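An illustrative Python sketch of this "one code point, one byte" approach, using the latin-1 codec (which maps U+0000 to U+00FF directly to single bytes):

```python
# Code points up to U+00FF written directly as one byte each (what latin-1 does).
print(' '.encode('latin-1'))    # b' '     -> the single byte 0x20
print('õ'.encode('latin-1'))    # b'\xf5'  -> U+00F5 as the single byte 0xF5
# 'ā'.encode('latin-1') would raise UnicodeEncodeError: U+0101 does not fit in one byte
```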
So we could use two bytes, which would be 00 20 (allowing values up to U+FFFF). But it is also possible to write them as 20 00. The first form is known as "big endian", and the second as "little endian". To know which one is being used, there is the convention of putting a mark at the beginning: the bytes FE FF, if placed in that order at the beginning (before the code points themselves), mean that all subsequent bytes are in big endian. If they are in little endian, the mark is FF FE. This mark (FF FE or FE FF) is called the Byte Order Mark (BOM). And this encoding format, which uses two bytes for each code point, is called UTF-16 ("UTF" stands for "Unicode Transformation Format", and 16 is the number of bits used for each value).
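A Python sketch of the same code point in UTF-16, big endian and little endian, and the BOM; note that the byte order chosen by the plain "utf-16" codec depends on the machine, so the last comment assumes a little-endian machine:

```python
# The code point U+0020 in UTF-16, with and without an explicit byte order.
print(' '.encode('utf-16-be'))  # b'\x00 '  -> bytes 00 20 (big endian)
print(' '.encode('utf-16-le'))  # b' \x00'  -> bytes 20 00 (little endian)
print(' '.encode('utf-16'))     # b'\xff\xfe \x00' on a little-endian machine:
                                # the BOM (FF FE) followed by 20 00
```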
For code points with values above U+FFFF (which need more than 2 bytes to be represented), UTF-16 "breaks" the value into two parts, creating the so-called surrogate pair. The algorithm is described in detail on Wikipedia, but basically a code point like U+1F4A9 is "broken" (or "decomposed") into two parts (each with 2 bytes): D83D and DCA9.
An interesting detail is that all code points between U+D800 and U+DFFF are reserved for surrogate pairs: the first value (also called the high surrogate) is always in the range 0xD800-0xDBFF, and the second value (the low surrogate) is always in the range 0xDC00-0xDFFF.
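Here is a sketch, in Python, of that surrogate pair calculation applied to U+1F4A9 (just the arithmetic described above, followed by a check against the utf-16-be codec):

```python
# Surrogate pair calculation for U+1F4A9, as described above.
cp = 0x1F4A9
v = cp - 0x10000                # the remaining 20-bit value to be split
high = 0xD800 + (v >> 10)       # high surrogate, always in 0xD800-0xDBFF
low = 0xDC00 + (v & 0x3FF)      # low surrogate, always in 0xDC00-0xDFFF

print(hex(high), hex(low))                      # 0xd83d 0xdca9
print('\U0001F4A9'.encode('utf-16-be').hex())   # d83ddca9
```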
Another widely used encoding is UTF-8, which takes a different approach. Depending on the value of the code point, it can be written with 1 to 4 bytes, as shown in the table below:
| Bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|-------|---------------------|------------------|-----------------|----------|----------|----------|----------|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
In this case, the letters x represent the bits of the numerical value of the code point. For example, take the code point U+00F5, which in binary is 11110101. According to the table above, it falls into the second case (it needs 2 bytes, and 11 bits are used for the code point). Since the value of the code point only uses 8 bits, 3 zeros are added to the left to complete the 11 bits, giving 00011110101. The first byte is then 110 followed by the first 5 bits of the code point (00011), and the second byte is 10 followed by the remaining 6 bits (110101). The result is the bytes 11000011 and 10110101 (or, in hexadecimal, 0xC3 and 0xB5).
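A quick Python check of the result calculated above:

```python
# U+00F5 ("õ") encoded in UTF-8 really is 0xC3 0xB5.
print('õ'.encode('utf-8').hex())   # c3b5
print(bin(0xC3), bin(0xB5))        # 0b11000011 0b10110101
```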
Note that from zero to 127 (U+007F) the values are encoded in a single byte, which makes UTF-8 compatible with ASCII (because all characters in this range are encoded in the same way). Another important point is that with UTF-8 you do not need to use a BOM - in fact, Unicode recommends that it not be used: see Section 3.10, item D95, which says that the BOM "is neither required nor recommended".
There are still other encodings, such as UTF-32, which uses 4 bytes for every code point, and many more: the list of existing encodings is quite extensive. Keep in mind that not all encodings are able to represent all possible Unicode code points, and that the number of bytes required to represent a code point can vary depending on the encoding used.
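An illustrative Python sketch: the same string encoded with three different encodings produces different byte counts, and an encoding that cannot represent one of the code points simply fails:

```python
s = 'Olá 💩'   # 5 code points

# The same code points, very different bytes depending on the encoding.
for enc in ('utf-8', 'utf-16-be', 'utf-32-be'):
    data = s.encode(enc)
    print(enc, len(data), data.hex())   # 9, 12 and 20 bytes, respectively

# And not every encoding can represent every code point:
try:
    s.encode('ascii')
except UnicodeEncodeError as e:
    print('ascii cannot represent:', e.object[e.start])   # 'á'
```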
What is a character?
The term "character" is widely used to designate a number of things: from symbols (letters, numbers, punctuation marks) to special control codes (such as the CARRIAGE RETURN, which returns the cursor to the beginning of the line), among others.
Unicode defines the term "character" as follows:
(1) The smallest component of written language that has semantic value; it refers to the abstract meaning and/or shape, rather than a specific shape, although in code tables some form of visual representation is essential for the readers' understanding. (2) Synonym for abstract character ("abstract character" is defined as a "unit of information used for the organization, control or representation of textual data"). (3) The basic unit of encoding for Unicode. (4) The English name for ideograms of Chinese origin.
Definition 3 implies that each Unicode code point represents a character, and in many cases this is even true: the letters A to Z, the digits, the space, the punctuation marks - all of these characters have their own code point, in a 1-to-1 ratio. But if we consider definition 1 ("the smallest component of a written language that has semantic value"), this rule no longer always holds.
Take, for example, the letter "a" with an acute accent: "á". In Portuguese, it fits the definition of "the smallest component that has semantic value", since removing the accent can change the meaning of a word (as happens with "sábia", "sabia" and "sabiá"). So the "á", according to this definition, could be considered a different character from the "a" (and so could the "â", the "ã" and the "à").
But then there must be only one code point to represent the "á", right? Well, actually, in Unicode there are two ways to represent this character:
- as the code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE): á
- as a combination of two code points: U+0061 (LATIN SMALL LETTER A) followed by U+0301 (COMBINING ACUTE ACCENT)
The first form is called Normalization Form C (NFC), and the second, Normalization Form D (NFD) - both are described here (read more about this here). In the NFD form, the code point for the accent is combined with the previous one (in this case, the letter "a"), and when they are printed, they are shown as if they were a single thing: the "á" (a single symbol/character/"drawing on the screen").
This is possible because there are code points that have this characteristic of being combinable with others (the so-called Combining Characters). Many of them, including the acute accent, are in the Combining Diacritical Marks block.
So if you see an "á" on the computer screen, there is no way to know (just by looking) whether it is in NFC or NFD. You will only know if you "dig into the bits" and see how many code points there are. Most languages have some mechanism - simple or not, it varies a lot - to perform such a check.
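In Python, for example, the standard unicodedata module does this (is_normalized assumes Python 3.8 or later); a minimal sketch:

```python
import unicodedata

nfc = '\u00E1'     # "á" as a single code point (NFC)
nfd = 'a\u0301'    # "á" as "a" + COMBINING ACUTE ACCENT (NFD)

print(nfc == nfd)                                  # False: different code points
print(len(nfc), len(nfd))                          # 1 2
print(unicodedata.normalize('NFC', nfd) == nfc)    # True: converting NFD to NFC
print(unicodedata.is_normalized('NFC', nfd))       # False (requires Python 3.8+)
```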
Anyway, if we consider that the "á" is a character (because it is the "smallest component of a written language that has semantic value"), then we cannot assume that a code point always corresponds to a single character.
There are also other ways to combine code points. Flag emojis, for example, are actually a combination of two code points. Basically, ISO 3166 defines 2-letter codes for each country, and for each letter there is a corresponding Regional Indicator Symbol (such as the "REGIONAL INDICATOR SYMBOL LETTER A", which corresponds to the letter "A").
For example, the Australian country code, according to ISO 3166, is "AU", so we use the corresponding Regional Indicator Symbols (that of the letter "A" (U+1F1E6) and that of the letter "U" (U+1F1FA)). When these two code points are placed one after the other, the system (if it supports these emojis) shows them as if they were a single thing - in this case, the Australian flag. Although there is only one symbol/drawing/"character" on the screen, two code points were required to represent it. Note: if the system/application/editor/font being used does not support flag emojis, it may simply display the two Regional Indicator Symbols side by side.
In fact, the combination of two Regional Indicator Symbols represents a specific country/region and is usually shown as that place's flag (but it is not the flag itself).
Anyway, the flag emoji is the "smallest component that has semantic value", since a single, isolated Regional Indicator Symbol does not mean anything, but two of them together represent a specific country/region.
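A minimal Python sketch building the "Australian flag" from the two Regional Indicator Symbols:

```python
# Two Regional Indicator Symbols that, together, are shown as the Australian flag.
flag = '\U0001F1E6\U0001F1FA'   # REGIONAL INDICATOR SYMBOL LETTER A + LETTER U
print(flag)                      # 🇦🇺 where flag emojis are supported
print(len(flag))                 # 2 -> one "symbol" on screen, two code points
```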
Family emojis are even more complicated. There are several emojis that correspond to a single code point, such as MAN (U+1F468), WOMAN (U+1F469) and GIRL (U+1F467).
But many systems also have family emojis, with numerous variations of son(s) and daughter(s). For example, there is a "family with father, mother and two daughters", but there is no code point that corresponds to this "character". In fact, this emoji consists of seven code points, in this order: U+1F468 (MAN), U+200D (ZWJ), U+1F469 (WOMAN), U+200D (ZWJ), U+1F467 (GIRL), U+200D (ZWJ) and U+1F467 (GIRL).
The man, woman and girl emojis are "joined" by the special/invisible character ZERO WIDTH JOINER (ZWJ). This character is used to join emojis (although it is not limited to that), and these emoji sequences joined by a ZWJ are called Emoji ZWJ Sequences. This sequence of code points can be displayed in different ways: if the system/program in use recognizes the sequence, a single family image is shown; if the sequence is not supported, the emojis are shown next to each other instead.
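A Python sketch of this Emoji ZWJ Sequence, showing that a single "family" on the screen is made of seven code points:

```python
# The "family: man, woman, girl, girl" Emoji ZWJ Sequence.
ZWJ = '\u200D'   # ZERO WIDTH JOINER
family = ZWJ.join(['\U0001F468', '\U0001F469', '\U0001F467', '\U0001F467'])

print(family)        # a single family emoji where the sequence is supported
print(len(family))   # 7 -> 4 emoji code points + 3 ZWJs
```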
So, can an Emoji ZWJ Sequence be considered a single character? After all, in systems that support these sequences, it is shown as one thing (a single symbol/"drawing" on the screen). Even on the keyboards of many systems, mostly cell phones, you only need to tap once to send it - although some might argue that this is just a shortcut to improve usability, and does not necessarily mean it is a single character.
Anyway, can we consider that family and flag emojis fit the definition of "smallest component of a written language that has semantic value"?
One could argue that they are "drawings" and therefore are not part of "written language". But in a way, all letters are, deep down, just "drawings" (with well-defined rules about their shapes and meanings, but drawings nonetheless, right?), so why can't an emoji be considered part of a "written" text?
But regardless of whether it is "writing" or not, we can consider that they are the "smallest components that have semantic value" (an isolated emoji of a man, woman or child means one thing, while all of them together in a single symbol/drawing mean another; the flag symbolizes a country, while each of the Regional Indicator Symbols, separately, does not).
In Unicode, a sequence of code points that can be interpreted as "one thing" is called a Grapheme Cluster. So it does not really matter whether they are considered "characters". What matters is that we can have groups of code points that, when combined according to certain rules, have a meaning of their own, different from what each would have separately. This can even influence algorithms that determine the size of a text or reverse a string, because the end result may differ depending on the chosen definition: if you consider each code point to be a character, counting the size of the string or reversing it gives different results from those you would get if you considered each grapheme cluster to be a character (and if you count bytes instead of code points, the result is yet another, since it varies with the encoding used - reversing the bytes "blindly" may even generate an invalid byte sequence).
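As an illustration, here is a Python sketch counting the "size" of the same string in three different ways; the grapheme cluster count uses the third-party regex module (assumed to be installed via pip install regex), since the standard library has no direct equivalent:

```python
import unicodedata
import regex   # third-party module, assumed installed (pip install regex)

s = unicodedata.normalize('NFD', 'sabiá')   # 'sabia' + COMBINING ACUTE ACCENT

print(len(s.encode('utf-8')))         # 7 bytes (in UTF-8; other encodings differ)
print(len(s))                         # 6 code points
print(len(regex.findall(r'\X', s)))   # 5 grapheme clusters
print(s[::-1])                        # naive reversal detaches the combining accent
```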
The same goes for regular expressions. The dot (.) means "any character" (except line breaks), but in the vast majority of implementations it actually corresponds to one code point. That is why the regex ^.{5}$ (exactly 5 characters) matches the string sabiá if it is in NFC, but not if it is in NFD (because in that case the "á" is decomposed into 2 code points and the string has a total of 6 code points). Some languages/engines support the shorthand \X, which corresponds to a grapheme cluster, but depending on the implementation it may not recognize all cases (such as the flag emojis and the ZWJ Sequences).
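A Python sketch of exactly this case, using the standard re module for the dot and the third-party regex module (assumed installed) for \X:

```python
import re
import unicodedata
import regex   # third-party module, assumed installed (pip install regex)

nfc = unicodedata.normalize('NFC', 'sabiá')   # 5 code points
nfd = unicodedata.normalize('NFD', 'sabiá')   # 6 code points

print(bool(re.fullmatch(r'.{5}', nfc)))       # True:  . matches code points
print(bool(re.fullmatch(r'.{5}', nfd)))       # False: the NFD "á" is 2 code points
print(bool(regex.fullmatch(r'\X{5}', nfd)))   # True:  \X matches grapheme clusters
```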
So, is a Grapheme Cluster the same as a character or not? And what about the ZERO WIDTH JOINER - is it a character? It has a code point (U+200D), but does it have any meaning by itself? Alone, is it a component that has semantic value, or does it only make sense together with other code points? And the Right-to-Left Mark, which affects the directionality of a text but alone has no effect (and is not even printed/shown)?
What, in fact, is a character? In the Unicode glossary itself (already quoted above) there are 4 definitions, and the first and third are, in my interpretation, contradictory (one speaks of the "smallest component that has semantic value", the other of the "basic unit of encoding for Unicode", which I understand to be a code point, and we have seen that these two definitions can be incompatible with each other).
In short, Unicode itself uses and defines the term "character" in many different ways - and in my understanding, some are even contradictory. In the end, my conclusion so far is that there is no single, exact and canonical definition of what a character is, and that in fact it varies greatly depending on the context.
Collation
A collation is a set of rules used to order/classify characters (informally, we could call them "rules for defining alphabetical order").
These rules usually vary according to the language being used. For example, in Sweden the character z is placed before the ö, while in Germany it is the opposite (see here an example).
Another example is the Slovak language, in which the digraph "ch" is considered a separate "letter" that comes after "h" in alphabetical order (so the word "cha" would come after "ha" if both existed in Slovak - see here an example). Because it is treated as a basic unit for determining order, the digraph "ch" - when the Slovak locale is being used - is considered a Collation Grapheme ("a sequence of characters treated as a basic unit of sorting").
The same rules about what comes before or after can also affect accented characters ("á" may come before or after "a"), uppercase and lowercase letters, characters that do not exist in the language in question, etc. All of this is defined by the Unicode Collation Algorithm.
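As a hedged illustration, here is a Python sketch using the standard locale module; it assumes the sv_SE.UTF-8 and de_DE.UTF-8 locales are installed on the system (otherwise setlocale raises an error), and a full implementation of the Unicode Collation Algorithm would require an external library such as PyICU:

```python
import locale

words = ['ö', 'z']

# Swedish: ö comes after z (assuming the sv_SE.UTF-8 locale is installed).
locale.setlocale(locale.LC_COLLATE, 'sv_SE.UTF-8')
print(sorted(words, key=locale.strxfrm))   # ['z', 'ö']

# German: ö comes before z (assuming the de_DE.UTF-8 locale is installed).
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
print(sorted(words, key=locale.strxfrm))   # ['ö', 'z']
```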
Other terms
Finally, there are other related terms:
- Code Value or Code Unit: according to Unicode, it is the "smallest combination of bits that can represent a unit of encoded text for processing or interchange". For example, UTF-8 uses 8-bit code units (and each code point can use 1 to 4 code units, as seen above), while UTF-16 uses 16-bit code units (a code point can use 1 code unit, or 2 in the case of surrogate pairs) - see the sketch after this list. Although the terms are synonymous, code value is considered obsolete.
- Octet: an ordered sequence of 8 bits, considered as a unit (in the sense of being "one thing") - for practical purposes, Unicode considers it the same as a byte.
- Character Map: Wikipedia mentions that it "historically has been synonymous with character set", but that no longer seems to be the case, as there is no mention of this term in the Unicode Glossary. The only current references I have found are about the Windows utility of the same name and similar tools.
- Actually, the term Code Point can be used generically to denote the value corresponding to a character as defined by any Coded Character Set (not necessarily Unicode). That is why many people use the term "Unicode Code Point" to make it clear that they are referring to the values defined by Unicode. There is also the term Code Position, which is a synonym for Code Point.
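The sketch referenced in the Code Unit item above: an illustrative way of counting code units in Python, by encoding the string and dividing the byte count by the size of each encoding's code unit:

```python
s = 'a💩'   # U+0061 and U+1F4A9

print(len(s.encode('utf-8')))           # 5 -> five 8-bit UTF-8 code units
print(len(s.encode('utf-16-le')) // 2)  # 3 -> three 16-bit UTF-16 code units
                                        #      (U+1F4A9 needs a surrogate pair)
print(len(s.encode('utf-32-le')) // 4)  # 2 -> two 32-bit UTF-32 code units
```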