What is "UTF-8"?

UTF-8 is an ASCII-compatible way to encode Unicode text. That is, all valid ASCII text is also valid UTF-8 text (although not all valid UTF-8 text is valid ASCII). It is the most commonly used encoding on the Internet.

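This compatibility is possible because ASCII only represents 128 characters, which fit in 7 bits. Since most systems store data in 8-bit bytes, the most significant bit of every ASCII character is always 0, which leaves the bytes with that bit set to 1 free for UTF-8 to use for everything else.

UTF-8 is, however, incompatible with the ISO 8859 encodings, which use those same bytes for single characters.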

Representation

Characters that exist in both UTF-8 and ASCII have the same representation in both. For example, "a" is represented by the byte 97 (61 in hexadecimal) in both ASCII and UTF-8.
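
This can be checked directly. The following is a minimal Python sketch; the variable names are only illustrative:

```python
# Encode the same ASCII character with both encodings and compare the bytes.
text = "a"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(list(utf8_bytes))            # [97]
print(utf8_bytes.hex())            # 61
print(ascii_bytes == utf8_bytes)   # True
```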

Code points that cannot be represented in ASCII occupy more than one byte in UTF-8, and every one of those bytes has its most significant bit set to 1. One consequence is that the byte 0 appears only when it represents the NUL character (U+0000).
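
As an illustration, the sketch below inspects the bits of an encoded non-ASCII character. The helper name `high_bits_set` is hypothetical, and U+00E9 (é) is just a sample character:

```python
def high_bits_set(data: bytes) -> bool:
    """Return True if every byte has its most significant bit set to 1."""
    return all(byte & 0x80 for byte in data)

encoded = "\u00e9".encode("utf-8")            # é, a non-ASCII character
print([f"{b:08b}" for b in encoded])          # ['11000011', '10101001']
print(high_bits_set(encoded))                 # True
print(high_bits_set("a".encode("utf-8")))     # False: ASCII bytes start with 0

# A zero byte shows up only when the text itself contains U+0000.
print(0 in "a\u00e9\u20ac".encode("utf-8"))   # False
print(0 in "a\u0000b".encode("utf-8"))        # True
```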

The first byte of a code point indicates how many bytes that code point occupies. The remaining bytes always have 10 as their two most significant bits.

Bytes occupied      | Bit pattern
--------------------+------------------------------------
1 byte              | 0xxxxxxx
2 bytes             | 110xxxxx 10xxxxxx
3 bytes             | 1110xxxx 10xxxxxx 10xxxxxx
4 bytes             | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
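
The table can be read as a lookup on the leading bits of the first byte. A rough Python sketch follows; the function name `sequence_length` is an assumption, and it only looks at the bit pattern without applying the other validity rules of UTF-8:

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a sequence occupies, judging only by its first byte."""
    if first_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    # Anything else (for example 10xxxxxx) cannot start a sequence.
    raise ValueError(f"{first_byte:#04x} cannot start a UTF-8 sequence")

# Compare the prediction with the real encoded length for a few sample characters.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", sequence_length(encoded[0]), len(encoded))
```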

No code point occupies more than 4 bytes in UTF-8.
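
The 4-byte pattern carries 21 payload bits (3 + 6 + 6 + 6), which is enough for the largest Unicode code point, U+10FFFF. The sketch below fills in the bit patterns by hand. `encode_code_point` is a hypothetical helper written for illustration; in practice `str.encode("utf-8")` does this work:

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by filling in the x bits of each pattern."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

# The hand-written encoder agrees with Python's built-in one.
for cp in (0x61, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
print(encode_code_point(0x20AC).hex(" "))   # e2 82 ac
```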

However, each character (grapheme) can still occupy more than 4 bytes (if it consists of more than one code point).
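
For instance, the following Python sketch counts code points and bytes for two sample graphemes. The samples are arbitrary, and whether the second one renders as a single glyph depends on the platform's fonts:

```python
# "e" followed by a combining acute accent: one visible character, two code points.
accented = "e\u0301"
# Man, zero-width joiner, woman, zero-width joiner, girl: a single family emoji
# where supported, built from five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for text in (accented, family):
    print(len(text), "code points,", len(text.encode("utf-8")), "bytes")
# 2 code points, 3 bytes
# 5 code points, 18 bytes
```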

Not all byte strings are valid. For example, the byte sequence 11000000 01000000 is invalid because a byte starting with 110 must always be followed by a byte starting with 10. Moreover, no code point can begin with a byte starting with 10.
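
Python's built-in decoder rejects such sequences, which gives a simple way to test validity. A minimal sketch, with `is_valid_utf8` as a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")   # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(bytes([0b11000000, 0b01000000])))  # False: 110... not followed by 10...
print(is_valid_utf8(bytes([0b10000000])))              # False: starts with 10...
print(is_valid_utf8(bytes([0b11000011, 0b10101001])))  # True: this is U+00E9
```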

Optionally, a UTF-8 string can start with the bytes 239 187 191 (EF BB BF in hexadecimal, ï»¿ in ISO-8859-1), known as the byte order mark (BOM).
This can be used to explicitly indicate that a document is in UTF-8 rather than some other encoding. However, there are also heuristics for detecting whether a document is in UTF-8. One is to check for byte sequences that are invalid in UTF-8: if any are found, the document very probably uses another encoding.
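
Both checks can be combined into a rough detector. This is only a sketch of the heuristic described above, not a complete encoding detector; `guess_utf8` and its return strings are made up for illustration, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of Python's standard library:

```python
import codecs

def guess_utf8(data: bytes) -> str:
    """Classify data using the leading marker first, then a validity check."""
    if data.startswith(codecs.BOM_UTF8):      # the bytes EF BB BF
        return "utf-8 (marker present)"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "probably not utf-8"
    return "possibly utf-8"

print(guess_utf8(codecs.BOM_UTF8 + b"hello"))         # utf-8 (marker present)
print(guess_utf8("ol\u00e1".encode("utf-8")))         # possibly utf-8
print(guess_utf8("ol\u00e1".encode("iso-8859-1")))    # probably not utf-8

# The "utf-8-sig" codec strips the marker while decoding, if it is there.
print((codecs.BOM_UTF8 + b"hello").decode("utf-8-sig"))   # hello
```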

Performance characteristics

Each code point in UTF-8 occupies 1 to 4 bytes. In texts using mostly characters present in ASCII, UTF-8 occupies less space than UTF-16 and UTF-32.

However, in texts with many East Asian (CJK) characters, most of which take 3 bytes in UTF-8 but only 2 in UTF-16, UTF-8 takes up more space than UTF-16.

No code point takes up more space in UTF-8 than in UTF-32.
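
These size differences are easy to measure. The sketch below uses two arbitrary sample strings and the little-endian codec variants so that Python does not prepend a byte order mark to the UTF-16 and UTF-32 output:

```python
def encoded_sizes(text: str) -> dict:
    """Return the size in bytes of text under each of the three encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

english = "Hello, world"                          # 12 ASCII characters
japanese = "\u3053\u3093\u306b\u3061\u306f"       # 5 hiragana characters

print(encoded_sizes(english))    # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
print(encoded_sizes(japanese))   # {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
```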

In terms of time complexity, assuming that the bytes are stored in an array, random access to a code point takes linear time in UTF-8, as in UTF-16 but unlike UTF-32 (where it takes constant time).
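
The difference comes from having to scan for code point boundaries. A sketch of both lookups follows, assuming the input is valid UTF-8; `code_point_at` is a hypothetical helper, not a standard function:

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the code point at position index, scanning from the start."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1
    for start, byte in enumerate(data):
        if byte & 0b11000000 != 0b10000000:    # not 10xxxxxx: a new code point starts
            seen += 1
            if seen == index:
                end = start + 1
                # Extend over the continuation bytes that follow.
                while end < len(data) and data[end] & 0b11000000 == 0b10000000:
                    end += 1
                return data[start:end].decode("utf-8")
    raise IndexError("code point index out of range")

text = "a\u00e9\u20ac\U0001F600"               # code points of 1, 2, 3 and 4 bytes
utf8 = text.encode("utf-8")
print(code_point_at(utf8, 2) == "\u20ac")      # True, found after a linear scan

# In UTF-32 every code point is 4 bytes wide, so the offset is computed directly.
utf32 = text.encode("utf-32-le")
print(utf32[2 * 4:3 * 4].decode("utf-32-le") == "\u20ac")   # True
```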

Random access to a character (grapheme) is linear in the three encodings.

Sequential access to code points and graphemes takes constant time per step in the three encodings.