What is "unicode"

Unicode is a standard for encoding, rendering, and text manipulation with the intention of supporting all characters required for written text that incorporates all writing systems, technical symbols, and punctuation.

Unicode assigns to each character a code point as a single reference:

  • U+0041 To
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ

Unicode Transformation Format

Utfs describe how to encode code points, Code points in Portuguese, such as byte representations. The most common forms are UTF-8 (which encodes the code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C

Specification

Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization, and other locality-sensitive operations.

Characters

For general information, see Unicode article on Wikipedia.

For a more detailed explanation, see this question.