When to use ANSI and when to use UTF-8?


Viewed 31,112 times

35

Is it more advantageous to use an ANSI encoding instead of UTF-8, or vice versa? Is there any gain in performance or storage between the two?

  • Excellent question, much more objective than the semi-duplicate that asks why other encodings are still used, as if UTF-8 were the solution to everything. And an excellent answer from @bigown, based on facts rather than on blindly followed "good practices".

  • I don’t even think my answer is that good; I just avoided being biased, unlike the answer you gave to the question you cited, which in the end didn’t even address the real question. http://answall.com/a/28404/101

  • Just for the record: I converted an Excel table to a TXT file with the fields separated by tabs. In the TXT file the accented characters appeared correctly. But when I loaded the data into a MySQL table (UTF-8), the accented characters were all garbled. I found that the TXT file was saved as UTF-8. I then saved it as ANSI, loaded it into MySQL again, and the accented characters appeared correctly. Here’s an example of how ANSI solved the problem.

2 answers

28

When to use ANSI and when to use UTF-8?

Strictly speaking, when you use UTF-8 you are also adopting the ANSI character set. But I think you are using the term ANSI incorrectly. It’s not your fault; the term has been misused for many years. You probably want to compare UTF-8 with ISO-8859-1/Latin-1 (which is often confused with CP1252/Windows-1252, another encoding that serves the same purpose and has nearly identical characteristics).
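To make the ISO-8859-1 versus Windows-1252 distinction concrete, here is a minimal Python sketch. It relies only on the standard codecs `latin-1` and `cp1252`; the sample byte and word are arbitrary choices for illustration.

```python
# Bytes 0x80-0x9F are where ISO-8859-1 and Windows-1252 disagree.
raw = b"\x80"  # a single byte in that range

as_cp1252 = raw.decode("cp1252")   # Windows-1252 maps 0x80 to the euro sign
as_latin1 = raw.decode("latin-1")  # ISO-8859-1 maps 0x80 to a C1 control character

print(repr(as_cp1252))
print(repr(as_latin1))

# For ordinary accented letters the two encodings produce the same bytes.
same_bytes = "ação".encode("cp1252") == "ação".encode("latin-1")
print(same_bytes)

# The euro sign has no representation in ISO-8859-1.
try:
    "€".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
print(euro_fits_latin1)
```

For typical Portuguese text the two encodings are interchangeable, which is probably why they are so often confused.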

The first criterion for choosing one or the other is to check with whom you will exchange this data. There is no point in deciding which is best if it is incompatible with the task at hand. Determine the compatibility requirements of whatever you will exchange data with and do not worry about anything else. If you have no compatibility constraints at all, use the simplest option.

If you have full control over how the exchange is done, or have freedom of choice, then to avoid conversions choose whatever encoding the technologies you use prefer. If the choice is still open after that, let’s look at other points.

Is it more advantageous to use an ANSI type instead of a UTF-8 type, or vice versa?

I’m going to start by disagreeing with Bruno’s answer that "ANSI" is practically obsolete. There are still many cases where it not only can be used but is mandatory. Of course there is a wave of preference for UTF-8. This is undeniable. But both are useful tools and will be used for a long time.

There is no doubt that UTF-8 is more modern, more flexible, more complete, more reliable in most cases, and more popular, but the question remains whether all of this is necessary. The biggest advantage of "ANSI" is its simplicity in most cases.

"ANSI" still has the performance and storage advantages discussed below.

The biggest advantage of UTF-8 is universality (which is relative: it does not mean that it really serves every purpose, only that it covers all characters), and this matters even more as it becomes more popular. This universality comes from the fact that it allows several bytes to represent one character, which also ends up being its main disadvantage and the cause of most of the encoding’s difficulties. With more bytes, one can represent a very large number of characters without resorting to tricks. And its flexibility keeps the space used from growing too much, at the cost of more processing in most algorithms.

But in addition to having a heavy implementation, often with defects because of how complex it is, by definition it allows ambiguous ways of representing the same character, which causes problems in comparisons (what you see is not what is represented in the string), and a string may become invalid if some information is lost through truncation. Not to mention that using UTF-8 is extremely complex and very few programmers know how to use it correctly. Granted, you don’t need to understand everything if you can guarantee that you will use only the simple parts, but then UTF-8 is not as advantageous as it seems. Curiously, the people who benefit most from it are the ones who complain most about it.
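The two problems named above, equivalent representations and invalid strings after truncation, can be reproduced in a few lines of Python. This is only a sketch: it uses the standard `unicodedata` module, and the sample strings are arbitrary. Note that the ambiguity comes from Unicode itself and shows up in any Unicode encoding, UTF-8 included.

```python
import unicodedata

composed = "\u00e9"     # 'é' stored as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Both render the same on screen, but a plain comparison says they differ.
naive_equal = composed == decomposed
print(naive_equal)

# Normalizing both sides to the same form (NFC here) makes them comparable.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(normalized_equal)

# Truncating UTF-8 at an arbitrary byte can cut a character in half.
data = "ação".encode("utf-8")  # 'ç' and 'ã' take two bytes each
truncated = data[:2]           # ends in the middle of 'ç'
try:
    truncated.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)
```

With a single-byte encoding, any byte boundary is also a character boundary, so the truncation problem does not arise.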

One of the problems cited is the confusion between the length of a string in bytes and its length in characters (or code points, as they are actually called). Most programmers do not really know what a string’s Length function/method returns. In fact, this may vary according to the technology used.
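As an illustration of how "length" depends on what is being counted, the Python sketch below measures the same text in three ways. The sample text is arbitrary; the UTF-16 count is included because some runtimes report string length in 16-bit units rather than code points.

```python
text = "ação \U0001F600"  # four letters, a space, and one emoji

code_points = len(text)                           # Python's len() counts code points
utf8_bytes = len(text.encode("utf-8"))            # size when stored as UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit units, as UTF-16 based runtimes count

print(code_points, utf8_bytes, utf16_units)

# In a single-byte encoding the byte count and the character count coincide.
word = "ação"
latin1_bytes = len(word.encode("latin-1"))
print(len(word), latin1_bytes)
```

The same string therefore has three different "lengths" here, none of which is wrong; they simply answer different questions.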

There are other problems cited in "Why are other encodings used besides UTF-8?".

It is possible to abstract the handling of "ANSI" code pages if necessary. In the vast majority of cases this is not needed, but if it is, you can create a data structure that encapsulates and abstracts this handling transparently for the user of the string. Of course it’s not perfect and doesn’t solve all the problems, but it does solve one of the problems mentioned. But UTF-8 doesn’t solve all the problems either. Why has no one done this (at least nothing public and well known)? Because it is not a real necessity in most cases.
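One way such an abstraction might look is sketched below in Python. The class `CodepageString` and its methods are hypothetical names invented for this illustration, not an existing library; the sketch assumes single-byte code pages only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodepageString:
    """Hypothetical wrapper: raw single-byte data plus the code page that interprets it."""

    data: bytes
    codepage: str = "latin-1"

    def __len__(self) -> int:
        # One byte per character in a single-byte code page.
        return len(self.data)

    def __getitem__(self, index: int) -> str:
        # Direct indexing: no scanning is needed to find the n-th character.
        return bytes([self.data[index]]).decode(self.codepage)

    def __str__(self) -> str:
        return self.data.decode(self.codepage)

    def converted(self, codepage: str) -> "CodepageString":
        # Re-encode into another code page; raises UnicodeEncodeError
        # if some character does not exist in the target page.
        return CodepageString(str(self).encode(codepage), codepage)


western = CodepageString("ação".encode("latin-1"))
greek = CodepageString("αβγ".encode("cp1253"), "cp1253")
print(len(western), western[1], str(greek))
```

The user of the string never deals with the code page directly, but mixing characters from two pages in one string remains impossible, which is the limitation the answer acknowledges.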

Finally, what version of UTF-8 are we talking about? Yes, it has versions. They are backward compatible, but if you take something generated with a newer version that uses a new feature and manipulate it with an older implementation, you will have difficulties.

Is there any gain in performance or storage between types?

Certainly "ANSI" is faster and takes up less space than UTF-8. In the specific case where only ASCII characters (codes up to 127) are used, UTF-8 occupies the same space.

With "ANSI" a character is guaranteed to occupy 1 byte; with UTF-8 it is not. Clearly UTF-8 takes not only more space but more processing time as well. When the size in bytes is guaranteed, processing can be more efficient. Some algorithms can be made more or less efficient with UTF-8, but the need to handle different sizes adds a cost. There is no miracle. Roughly speaking, "ANSI" is an array and UTF-8 is a linked list laid out in sequence. There is no way to reach a given character in UTF-8 without going through the characters before it. Even an individual character requires a check to see whether it has continuation bytes, possibly through a branch in the processor, which can be expensive.
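The array versus linked-list analogy can be sketched in Python as follows. The function names are hypothetical, the UTF-8 version assumes well-formed input and an index within range, and real libraries may use faster techniques; the point is only that the UTF-8 lookup has to walk over every preceding character.

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def nth_char_latin1(data: bytes, n: int) -> str:
    # Fixed width: the n-th character is simply the n-th byte.
    return bytes([data[n]]).decode("latin-1")


def nth_char_utf8(data: bytes, n: int) -> str:
    # Variable width: skip over the n characters that come first.
    pos = 0
    for _ in range(n):
        pos += utf8_sequence_length(data[pos])
    end = pos + utf8_sequence_length(data[pos])
    return data[pos:end].decode("utf-8")


text = "coração"
print(nth_char_latin1(text.encode("latin-1"), 4))
print(nth_char_utf8(text.encode("utf-8"), 5))
```

The first lookup costs the same regardless of `n`; the second grows with `n` and needs a branch per character to decide how far to jump.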

It is indisputable which one wins on performance. What can be debated is whether that matters. The difference usually isn’t that big, and other factors can make it even more negligible. But there are also problems that require extreme performance.

The space used is larger with UTF-8, but the difference usually isn’t that big (and if it is, you probably don’t have much choice, which is highly unlikely in the Western world). But if there is any real reason why size matters, choose "ANSI".

Basically, the difference shows up in accented characters. In "ANSI" such a character always occupies 1 byte, and in UTF-8 it occupies 2 bytes. I won’t compare anything beyond accented characters because "ANSI" cannot represent the rest. We’re talking about situations where you have a choice. For the table of accented characters allowed, see Wikipedia. Note that for unaccented characters, that is, those in the ASCII table, UTF-8 occupies only 1 byte. Most characters used have no accent.
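A quick way to check this on your own text is sketched below in Python. The sample sentence is arbitrary; the relationship shown holds for any text that fits in ISO-8859-1.

```python
text = "A codificação não é uma questão simples."

latin1_size = len(text.encode("latin-1"))
utf8_size = len(text.encode("utf-8"))

# Characters outside ASCII (here, the accented letters).
accented = sum(1 for ch in text if ord(ch) > 127)

print(latin1_size, utf8_size, accented)

# Each accented character costs exactly one extra byte in UTF-8.
extra_bytes = utf8_size - latin1_size
print(extra_bytes == accented)
```

For mostly unaccented Western text the overhead is therefore a small percentage of the total size.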

What might complicate the decision is the variable size. There are file formats that require a fixed field/line size. But in that case the format probably also dictates the encoding and/or charset.

Conclusion

Remember that the subject is extremely complex; covering everything that needs to be said would take a whole book (no exaggeration).

Personally, I try to use ISO-8859-1 in everything I do where I have total freedom. It is simpler, easier, more efficient, and solves all the problems I have in the software I write. Unfortunately, for one reason or another I end up being forced to use UTF-8 or even UTF-16 (the latter not for files unless really necessary) in some situations. No big deal; as already shown, it has advantages too.

26


TL;DR

  1. UTF-8 is a widely used scheme whereas ANSI is virtually obsolete.
  2. ANSI uses a single byte per character, whereas UTF-8 is a multibyte encoding.
  3. UTF-8 can represent a much larger character range than the rather limited ANSI.
  4. UTF-8 code points are uniformly standardized while ANSI has many different versions.

Difference between ANSI and UTF-8

ANSI and UTF-8 are two character encoding schemes that have each been widely used at one time or another.

The main difference between them is that UTF-8 was created to be more or less equivalent to ANSI but without its many drawbacks. Both schemes extend the basic ASCII character set, which means that for the first 128 characters (codes 0 to 127) they are equivalent.

The first drawback of ANSI is that it uses a fixed single byte to represent each character. In comparison, UTF-8 is more flexible because it is a multibyte encoding scheme.

Depending on the character, a code point is encoded in 1 to 4 bytes (the original design allowed up to 6, but the current standard limits it to 4). Because ANSI uses only one byte (8 bits), it can represent at most 256 characters, nowhere near the 1,112,064 code points, covering characters, control codes, and reserved slots, that Unicode can represent through UTF-8.

Using a multibyte encoding makes it possible to accommodate all these code points while still consuming minimal memory. The one-byte range of UTF-8 matches ASCII exactly and, because of this, the most common characters require only one byte.
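The variable width is easy to observe in Python. In this small sketch the four sample characters are arbitrary picks, one for each encoded length.

```python
samples = ["A", "é", "€", "\U0001F600"]  # ASCII letter, accented letter, euro sign, emoji

# Number of UTF-8 bytes needed for each sample character.
sizes = [len(ch.encode("utf-8")) for ch in samples]
print(sizes)

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
plain = "plain ASCII text"
ascii_identical = plain.encode("ascii") == plain.encode("utf-8")
print(ascii_identical)
```

Text that stays within ASCII therefore pays no size penalty for being stored as UTF-8.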

To accommodate more characters, multiple ANSI code pages were created for different languages. As a result, you could not use certain characters together, since they did not belong to the same code page.

This also required the program to know beforehand which code page was being used; otherwise incorrect characters would appear.
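The effect of guessing the wrong code page can be sketched in Python as follows. The byte value and the code pages (`cp1252` for Western European, `cp1251` for Cyrillic) are arbitrary examples; the last part shows the same kind of mismatch when UTF-8 bytes are read as a single-byte code page.

```python
byte = b"\xe3"  # one byte whose meaning depends on the code page

as_western = byte.decode("cp1252")   # Western European code page
as_cyrillic = byte.decode("cp1251")  # Cyrillic code page
print(as_western, as_cyrillic)

# The same mismatch happens when UTF-8 bytes are read as a single-byte page:
# each byte of the two-byte sequence becomes its own (wrong) character.
mojibake = "ã".encode("utf-8").decode("cp1252")
print(mojibake)
```

Nothing in the bytes themselves says which interpretation is right, so the encoding has to be known from outside the data.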

UTF-8 has none of these problems, since every character has its own code point.

UTF-8 is superior to ANSI in every way. There is no reason to prefer ANSI over UTF-8 when creating applications whose output should be decodable on any computer. The only plausible reason would be to run an old application for which you have no viable replacement.

Source: Difference Between

Translation: Me, myself, and nothing from Irene :p
