Unicode and UTF-8

What is the difference between Unicode and UTF-8? Are they the same encoding, or is one derived from the other?

  • Unicode can be implemented with several character encodings (https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings?wprov=sfsi1); UTF-8 is one of these implementations and is ASCII-compatible

1 answer


Unicode may refer either:

  • to the universal standard character set - the UCS (Universal Character Set) - defined and maintained by the Unicode Consortium,
  • or to the codepoints identifying the characters in the UCS.

UTF means Unicode (or UCS) Transformation Format, that is, the way Unicode characters are represented in computer memory or transmitted.
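
As a rough illustration of that distinction, here is a small Python 3 sketch (my own example, not from the standard): the codepoint is a single abstract number, while each UTF produces its own byte sequence for it.

    codepoint = ord("Ω")            # Unicode assigns Ω the codepoint U+03A9
    print(hex(codepoint))           # 0x3a9, an abstract number, not bytes

    # UTF-8 and UTF-16 are two different ways of turning that number into bytes
    print("Ω".encode("utf-8"))      # b'\xce\xa9'
    print("Ω".encode("utf-16-be"))  # b'\x03\xa9'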

To really understand the difference between Unicode and UTF-8, it is necessary to understand the following concepts.

  1. Abstract character

    An abstract character is a platonic ideal of a fundamental element of text.

    The platonic ideal refers, for example, to the notion that R is the same letter as R in any other typeface, or any other representation of the letter R.

    The question of what counts as a fundamental element of text is more complicated: for example, ç can be understood as a single character or as the composition of two, c and ¸ (a combining cedilla), as the short sketch further below in this item illustrates.

    Part of the work of the Unicode Consortium consists of identifying these fundamental elements. Below are examples of the relationship between (non-fundamental) text elements and characters.

    [image: relationship between text elements and characters]
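
    For concreteness, here is a minimal Python 3 sketch (my own illustration, not part of the cited specification) showing the two forms of ç: one precomposed codepoint versus c plus a combining cedilla, which normalization treats as the same abstract character.

        import unicodedata

        single = "\u00e7"          # ç as one precomposed codepoint (U+00E7)
        composed = "c\u0327"       # c (U+0063) followed by a combining cedilla (U+0327)

        print(single, composed)    # both render as ç: the same abstract character
        print(single == composed)  # False: the codepoint sequences differ
        print(unicodedata.normalize("NFC", composed) == single)  # True after normalization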

  2. Codepoint

    Once the characters are defined, the Unicode Consortium defines codepoints for them. These codepoints are identifiers and DO NOT necessarily relate to how the characters are stored in memory; they are an abstraction that makes it easy to identify every mapped character.

    Those codepoints are also commonly referred to as Unicodes and are represented as U+0041 or \u0041 (the letter A).

    As can be seen below, an abstract character can be represented by one or more Unicode codepoints.

    [image: relationship between abstract characters and codepoints]
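
    As a quick illustration (a sketch of my own, assuming Python 3): codepoints are just numbers identifying characters, independent of how they are stored, and one abstract character may need more than one of them.

        print(ord("A"))      # 65, i.e. 0x41, written U+0041 in Unicode notation
        print(chr(0x41))     # 'A'
        print("\u0041")      # 'A', the escape form mentioned above

        e_acute = "e\u0301"  # 'e' followed by a combining acute accent
        print(e_acute, len(e_acute))  # é 2: one abstract character, two codepoints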

  3. Code unit

    Refers to how the codepoints are stored in memory or transmitted. UTF-8 uses 8-bit code units to store Unicode characters. Depending on the character, UTF-8 uses 1 to 4 code units: for example, A uses one code unit, Ω two, and a character outside the Basic Multilingual Plane (such as an emoji) four.

    The image below shows how the different formats encode the same characters. In UTF-32 the codepoint is equal to the code unit; the same does not apply to the other formats, and the conversion from codepoint to code units is not obvious.

    [image: the same characters encoded in the different UTF formats]
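
    To make the code-unit counts concrete, here is a small Python 3 sketch of my own (the emoji is just an illustrative choice of a character outside the BMP) that counts code units per encoding and checks that the UTF-32 code unit equals the codepoint.

        # A code unit is 8 bits in UTF-8, 16 bits in UTF-16 and 32 bits in UTF-32
        for ch in ["A", "Ω", "\U0001F600"]:
            utf8 = ch.encode("utf-8")
            utf16 = ch.encode("utf-16-be")
            utf32 = ch.encode("utf-32-be")
            print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} code unit(s), "
                  f"UTF-16 {len(utf16) // 2}, UTF-32 {len(utf32) // 4}")

        # In UTF-32 the single code unit is numerically equal to the codepoint
        print(int.from_bytes("Ω".encode("utf-32-be"), "big") == ord("Ω"))  # True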


Information and images taken from the Unicode Standard, Version 10.0 - Core Specification
