What are the main differences between Unicode, UTF, ASCII, ANSI?

What are the main differences between the "encodings" Unicode, UTF, ASCII, and ANSI?

Are they all really encodings, or are some just "sub-categories" of others?

I don't want to know all the details of each one, just a brief description of each and, if possible, how they differ from one another.

  • I would translate this from here, but I was too lazy to check where Google Translate "missed", so I'll leave it to other answers.

  • As soon as I wrote the question I searched on Stack Overflow (as usual) and came across this. If no one answers, I'll translate it myself. I think it's worth having this kind of content here, so I'll keep the question up for now =D

  • Your question is a great fit here. It's just that after seeing Jon Skeet's answers I get too lazy to write something like that. Hahaha

  • I feel you, I feel you, hahaha

  • Related: https://answall.com/q/394834/112052

  • Read this: https://medium.com/@Sestrem/o-m%C3%Adnimo-que-todo-desenvolvedor-saber-sobre-Unicode-e-character-sets-789a4229ecf5 - it's from 2003, by one of Stack Overflow's founders.

  • @jsbueno somewhere in those three-plus years I did end up reading (and rereading) it. Thanks for the tip.


2 answers



ASCII

American Standard Code for Information Interchange. As the name says, it is a standard that suits Americans well. It covers the numbers 0 to 127; the first 32 and the last one are control codes, and the rest represent "printable characters", that is, characters recognizable by humans. It is fairly universal. It can be represented with only 7 bits, although normally a whole byte is used.

ASCII table

Obviously it has no accented characters, which Americans don't even use.
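
To make this concrete, here is a minimal sketch in Python (my choice of language, not something from the original answer) showing that the whole ASCII range fits in 7 bits and that accented characters fall outside it; the sample word "ação" is just an illustration:

    # The highest ASCII code is 127, which fits in 7 bits.
    print((127).bit_length())          # 7

    # Codes 0-31 and 127 are control characters; the rest are "printable".
    for code in (9, 65, 97, 126):
        print(code, repr(chr(code)), "printable" if 32 <= code < 127 else "control")

    # Accented characters are simply not in ASCII:
    try:
        "ação".encode("ascii")
    except UnicodeEncodeError as error:
        print(error)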

ANSI

There is no such encoding.

The term stands for American National Standards Institute, the American equivalent of Brazil's ABNT.

Since this institute established some standards for character usage to meet various demands, many encodings (actually code pages) end up being generically called ANSI, partly as a counterpoint to Unicode, which is another body with another type of encoding. Usually these code pages are considered extensions of ASCII, but nothing prevents a specific encoding from not being 100% compatible.

Again, it was an American solution for dealing with international characters, since ASCII did not serve them well.

Depending on the context, and even the era, it means something different. Today the term is mostly used for Windows-1252, since much of Microsoft's documentation refers to that encoding as ANSI. ISO 8859-1, also known as Latin-1, is also widely used.

All the encodings called ANSI that I know of can be represented with 1 byte per character.

So it depends on what you’re talking about.
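
Here is a hedged sketch of what "ANSI" usually means in practice, assuming Python's names for these code pages (cp1252 for Windows-1252, latin-1 for ISO 8859-1, cp866 for the old DOS Russian code page); the point is that the text fits in 1 byte per character, but the same byte value can mean different characters in different code pages:

    text = "ação"                        # sample Portuguese word with accents

    print(text.encode("cp1252"))         # Windows-1252: 4 bytes, 1 per character
    print(text.encode("latin-1"))        # ISO 8859-1 (Latin-1): also 4 bytes

    # The same byte value means different things in different code pages:
    # 0xE7 is 'ç' in Windows-1252, but a Cyrillic letter in DOS code page 866.
    print(bytes([0xE7]).decode("cp1252"))   # ç
    print(bytes([0xE7]).decode("cp866"))    # ч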

UTF

On its own it doesn't mean much. It stands for Unicode Transformation Format. There are a few encodings that use this acronym; UTF-8, UTF-16 and UTF-32 are the best known.

The Wikipedia articles have plenty of details. These encodings are quite complex, and almost nobody, myself included, knows how to use them correctly in their full extent. Most implementations are wrong and/or do not follow the standard, especially for UTF-8.

UTF-8 is compatible with ASCII (any valid ASCII text is also valid UTF-8), but not with any other character encoding system. It is the most complete and complex encoding there is. Some people are passionate about it (and that is the best term I have found) and others hate it, even while recognizing its usefulness. It is complex for humans (programmers) to understand and for computers to handle.

UTF-8 and UTF-16 are variable-length: the first uses 1 to 4 bytes per character (depending on the version it could go up to 6 bytes, but in practice that does not happen) and the second uses 2 or 4 bytes. UTF-32 always uses 4 bytes.

There is a comparison of them. I don't know how good it is; it's certainly not complete.
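
If you want to see the variable sizes for yourself, here is a small Python comparison (the characters chosen are just illustrative examples, not from the original answer); the -le suffix only keeps the BOM from being counted:

    # ASCII letter, Latin-1 letter, a BMP symbol, and a character outside the BMP.
    for ch in ("A", "ç", "€", "𝄞"):
        print(f"U+{ord(ch):04X} {ch!r}:",
              len(ch.encode("utf-8")), "bytes in UTF-8,",
              len(ch.encode("utf-16-le")), "in UTF-16,",
              len(ch.encode("utf-32-le")), "in UTF-32")
    # 'A' gives 1/2/4, 'ç' gives 2/2/4, '€' gives 3/2/4 and the musical
    # symbol U+1D11E gives 4/4/4 (a surrogate pair in UTF-16).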

Unicode

It is a standard for representing text, established by a consortium. Among the standards it defines are some encodings, but it actually covers much more than that. It originated from the Universal Coded Character Set, or UCS, which was much simpler and solved almost everything that was needed.

An article that everyone should read, even if you don't agree with everything in it.

The supported character sets are separated into planes. You can get an overview of them in the Wikipedia article. Plane 0, the BMP, is by far the most widely used.

All these standards are made official by ISO, the international body that regulates technical standards.

It has everything to do with UTF.
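
A quick sketch of the idea of planes (my example, not from the original answer): a code point is just a number, and integer-dividing it by 0x10000 tells you which plane it lives in, plane 0 being the BMP.

    for ch in ("A", "ã", "中", "🙂"):
        code_point = ord(ch)             # the Unicode code point as a number
        plane = code_point // 0x10000    # plane 0 is the BMP
        print(f"{ch!r} -> U+{code_point:04X}, plane {plane}")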

  • Jeez, yours is much better :/ ... if I had known you were around I would have waited XD +1

  • It's just that I've been putting this together in my head for a while, gathering links to make it better. I found that answer on SO, thought it was only so-so, but it's correct, so it also earned a +1 (my last vote for today)


Based on what @randrade linked, I did a quick translation, removed some things about specific programming and about opinionated points, and also tried not to translate it too literally (my English is so-so, I will review it).

  • "Unicode" is not a specific encoding, it refers to any encoding that uses the union of codes to form a character.

  • UTF-16: 2 bytes per "code unit".

  • UTF-8: in this format each character takes between 1 and 4 bytes; ASCII values use 1 byte each.

  • UTF-32: this format uses 4 bytes per "code point" (which is probably what forms a character).

  • ASCII: uses a single byte for each character, and only 7 bits to represent all of its characters (the Unicode code points 0-127); it does not include accented characters or many special characters.

  • ANSI: there is no single fixed standard for this encoding; there are actually several variants. A very common example is Windows-1252. (A byte-count sketch comparing all of these follows below.)

For other types you can find information on Unicode.org, and this Code Charts link may also be useful.

Detail:

1 byte equals 8 bits
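
Putting the list above together, a minimal sketch in Python (the sample word is just an illustration) counting bytes and bits for the same text in each encoding:

    text = "maçã"                        # 4 characters, 2 of them accented
    for encoding in ("ascii", "cp1252", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            data = text.encode(encoding)
            print(f"{encoding:10} -> {len(data):2d} bytes = {len(data) * 8} bits")
        except UnicodeEncodeError:
            print(f"{encoding:10} -> cannot represent {text!r}")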
