Why are other encodings used besides UTF-8?

23

If UTF-8 encoding can represent all Unicode characters, why are there still applications that adopt other encoding standards such as ANSI?

Wouldn't it be easier to abandon all the encodings that do not offer this compatibility and that generally become a hindrance for programmers, as in the case of accented characters?

  • 1

    I choose to use Win-1252 in some of my software without having any accentuation problems, including software that interacts with UTF-8 and UTF-16. I save a lot of trouble by avoiding unnecessary normalization, and I know exactly how much storage I need for the strings. I only see encoding cause problems when the programmer does not understand the subject. UTF-8 is not 100% trouble-free either; that is a legend. The same character can be represented in composed and in decomposed (combining) form, and whoever does not know what they are doing will, every now and then, get into trouble with pure UTF-8 as well.

  • Win-1252 is not a standard: neither the Brazilian government, nor the government of Portugal, nor the W3C... no national or international democratic body recommends the use of Win-1252. What we call a "standard" here is precisely the "consensual recommendation" that has already been decided. Unfortunately you do not have the option of using Win-1252 as you say, only the option of complying or not complying with the recommendation to use UTF-8. Perhaps there is some general confusion in this regard...

  • 1

    @PK As I posted in a comment on your answer, UTF-8 is not a universal solution. I agree that it is good for the vast majority of cases, but the question talks about abandoning everything in favor of UTF-8. The "disadvantages" sections illustrate some points showing that UTF-8 also has problems: https://en.wikipedia.org/wiki/UTF-8 . And I don't usually see accent problems in applications that use other standards, which are not few. I see problems with people who have trouble understanding the limits of each encoding, which is understandable (besides the poor documentation of many languages in this regard).

  • In this other question there was an answer based on facts: http://answall.com/a/30220/70

  • 3

    Relevant: http://xkcd.com/927/

  • UTF-8 does not represent all Unicode chars... if you saw that statement somewhere, it is wrong...

  • @Danielomine, give me a hand: can you give me an example of non-Unicode UTF-8?

  • open a new question... @Jjoao


5 answers

19


(for a more cultural focus see this other question)

Question-1. "(...) why are there still applications that adopt standards such as ANSI, among other encodings?"

Answer. I would say "there are very few". Some of these applications are technically justified because they do not use an accented alphabet; the others, which impose on Portuguese speakers the absence of accents and/or of interchangeability, are doomed to be scrapped.

Question-2. "Wouldn't it be easier to abandon (...)?"

Answer. Yes; when I say "doomed to be scrapped", that is roughly it. The problem, perhaps, is that one cannot wait years for this: we actually need applications that respect UTF-8 today, now...

Many people, even if they do not put it in writing, openly say that it is the pressure of international companies, which drag their feet in Brazil and force the use of Windows-1252, or of government bodies, which stopped updating their software back in the 1980s... I do not agree, if only as a justification... I think we cannot blame them alone (!): we ourselves, professionals in the field, dragged our feet for years by not demanding UTF-8 in our work environment and in our relationships with customers and suppliers.

Conclusion. We must agree with @utluiz, who reminds us that we must partly strive every day to keep the whole environment in UTF-8, and partly resign ourselves to the facts and circumstances... and forget the subject until the world changes 100%.


PS: web pages and the storage of text in databases are emblematic cases. Why did so many web designers take so long (and some still do) to bother preparing their HTML pages and templates in UTF-8? How many programmers take part in the "localization" and improvement of open-source projects such as MySQL or PostgreSQL? The Brazilian distribution of PostgreSQL does not offer as a standard template (default DATABASE) something with ENCODING = 'UTF8' LC_COLLATE = 'pt_BR.UTF-8' LC_CTYPE = 'pt_BR.UTF-8'... And, since it is not the default, how many hosting companies took the trouble to change the default to the Brazilian standard? How many programmers, when they could, bothered to set up their databases this way? I myself was once a victim of this... until I changed my posture.
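Just to illustrate these two cases, a minimal sketch (the file and database names are mere placeholders; the createdb options are standard PostgreSQL flags):

# 1) declare the encoding explicitly in an HTML template:
cat > template.html <<'EOF'
<!DOCTYPE html>
<html lang="pt-BR">
<head><meta charset="utf-8"><title>Exemplo</title></head>
<body>Acentuação sem surpresas</body>
</html>
EOF

# 2) create a PostgreSQL database with UTF-8 encoding and Brazilian collation
#    (template0 is needed when overriding the locale of the default template):
createdb --encoding=UTF8 \
         --lc-collate=pt_BR.UTF-8 --lc-ctype=pt_BR.UTF-8 \
         --template=template0 minha_base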


How about changing a minor detail of the question, turning it into "Why do we still allow the use of encodings other than UTF-8?"

An opposite stance, in which we see ourselves as part of the environment

In general we position ourselves as "victims" of our environment: the environment as a given, driven by decisions that we are not part of.

But the "environment" in this case, is something where, for example, the Stackoverflow-Portuguese community, can act, can have some effect, even if small. If we choose to conform to this "small change", the questions we should ask ourselves totally change (!).

Why can't we, analysts and programmers, demand from our work environment that the UTF-8 standard be adopted? Why can't software and IT companies require their customers and suppliers to exchange data in UTF-8?

Of course you cannot demand it from those who cannot provide it, but we know that in some 90% of cases the default configuration of a national, nationalized or "localized" product could adopt UTF-8. Moreover, when it comes to data exchange, that is, formats such as XML and HTML, fully open and in a totally standardized environment (e.g. IETF and W3C recommendations), we can guess that 99% could be UTF-8.

Of course, the second-best demand, in exceptional contexts where UTF-8 cannot be offered, is ISO 8859-1. There are still a number of contexts, wisely described in @utluiz's answer, where the difficulty of using UTF-8 is "explained" a little better, and is justified by our weak culture of adopting good practices, as well as by our culture and history of not demanding our rights as a Portuguese-speaking public and consumers.

This answer is partly a reminder that standards are useful and necessary, and that, especially in Brazil, we waste a lot of our lives doing conversions, fixing data, adjusting settings and adapting libraries. Analysts and programmers waste time; users are stuck with products and "services" without cedillas.

Contextualizing

When it comes to computer science, computing and digital media, even Portugal counts as a "colonized country". A foreign condition has always been imposed on speakers of the Portuguese language (e.g. putting up with text without accents).

Gradually the European standards were adopted, and the minimum requirements for expressing the Portuguese alphabet in a standardized way came to be accepted by manufacturers of machines, software and other resources. The consolidation of the ISO 8859-1 standard (known as ISO Latin-1) was of great importance.

In Brazil, however, there was a potpourri of encodings... And with the emergence of Unicode, and of the "recommendations to use UTF-8", the diversity of this potpourri only increased.

Reminiscing

As already stated in this and other answers, UTF-8 has for years been a standard both de facto and de jure.

The W3C has been recommending the use of UTF-8 (see RFC 3629) in all its recommendations. Likewise the Brazilian government, with its e-PING recommendation.

All operating systems in use, in fixed or mobile computing, support UTF-8. Even QR-Code offers UTF-8...

On the Web, UTF-8 has been the most widely used encoding (the de facto default) since 2007:

  • article "Moving to Unicode 5.1", 2008, shows in graphs and with Google data, that in December 2007 UTF-8 encoding became the most frequent encoding on web pages, passing ASCII and ISO 8859-1.

  • An article on the blogosphere (2012) re-evaluates the data and shows that UTF-8 remains predominant even on "technologically uncommitted" pages such as blogs, where only 6% of pages with an explicitly declared encoding used something other than UTF-8.


Examples of questions related to problems with UTF-8:

It is clearly, even today, a "headache" for Brazilian analysts and programmers: in installations, in configurations and, above all, in the exchange of data.



EDIT (ref. comments @Bacco)

On the question of "freedom of choice". Two examples:

  • We are free to choose between Java, PHP, Python, etc. They are all "standard languages"; it is a matter of taste, context, etc., and each programmer "adopts their standard". There is no need for "one language for all" because there are no relevant coordination problems: the benefits of "one for all" do not exceed the benefits of diversity, and the existence of a number of large communities is enough to curb excess diversity.

  • We are not free to choose the (street) number of our house; it must follow a standard, which is the metering of the street or of your block. If we pick a number by numerology or by taste, we create confusion in the street and make it hard to deliver letters even to our own house. In this case the general benefit of "adopting the standard" exceeds the personal benefits of diversity.

In the case of encoding we are not free to choose: the W3C, the Brazilian government, etc. have already chosen for us. It is UTF-8. The benefits of adopting the standard (rather than diversity) are much greater: interoperability emerges, simplicity... and as programmers we waste much less of our lives (getting rid of encoding checks, conversions and risks of error).

NOTE: these "benefits" (global vs. individual) can be measured; the game of possibilities, of having scenarios with more or less diversity, is known as a coordination game. A standard is the only thing that resolves a coordination dilemma.

  • 2

    Worked well as a response :)

  • 1

    Let's all use only Java then, to standardize the language too. "As already stated in this and other answers, UTF-8 has for years been a standard both de facto and de jure." I think this has become a religious site. /Rant

  • @Bacco, but that's just it :-) In a company that develops in JAVA, it is natural to program in JAVA (!)... Why is it that in Brazil, where the alphabet has accents and needs to be interchangeable, there is no effort to comply with the standard? I've lost count of the times I asked customers or suppliers for "material" (XML, HTML or TXT content) in UTF-8 and they brushed me off (!).

  • 1

    @Peterkrauss right, but the whole thread here is proposing UTF-8 for everyone, not for your specific environment. UTF-8 is good for a lot of things, but it is far from being absolutely perfect and suitable for every situation. The one you mentioned is a specific problem, and I agree that when you have a standard in your company it should be followed. Wanting YOUR standard to be used by everyone is religion.

  • @Bacco, I don't know if I've grasped your point... it seems to make sense. Anyway, perhaps, to be more objective, the discussion should be separated into "work environment" (including databases, IDEs, etc.) and "data exchange", which I understand is what you called "UTF-8 for everyone".

  • 3

    @Peterkrauss simplifying: Win-1252 is as much a standard as UTF-8. Both are properly specified. UTF-8 is newer and more comprehensive, but that doesn't mean it's trouble-free. I think everyone should use whatever is convenient for their specific use, as long as they know what they are doing and what the limits are. If I'm going to build an international application, of course I'll go with Unicode (not necessarily UTF). If I'm building an NF-e application, either is fine, since in the OUTPUT of the data I use the standard that the external API asks for. The day an Arab client wants the NF-e program, I can rethink ;)

  • @Bacco (readers, sorry for the chat), I think there is a conflict of positions. Win-1252 is not a standard in the sense I laid out in the text. Simple explanation: who recommends Win-1252?? The Brazilian government? The government of Portugal?? The W3C?? Some democratic body of international scope??

  • 2

    @Peterkrauss note that I'm talking about the whole thread, not just your answer. There are many things in the answer that I agree with; I just do not agree with the points that reinforce the thesis of the question, as if everyone adopting UTF-8 would solve all problems. Maybe I wasn't clear, and I should have just commented on the question. UTF-8 as a blind default (as a good practice) is a great option, probably suitable for the vast majority of situations. It's just that UTF-8 is not absolute and has its own problems, such as the need for normalization, unpredictable size, and the fact that it has versions, among others.

  • @Bacco, ok, it certainly does not solve "everything", but everyone using UTF-8 solves a lot. The "thesis" is right and there is broad consensus about it, judging by the survey I did... We could imagine the opposite: "everything in UTF-8 solves nothing". I don't think you defend that thesis either... So you are advocating an intermediate "thesis": please post an answer so that we can know your arguments and vote on them.

  • 1

    @Peterkrauss in fact I posted it as a comment intending to make it clear that there are opposing positions, not intending to defend a thesis. I had no idea the chat would get this long, and I would say that the three main arguments are in the previous comment. Perhaps my mistake was to post on your answer instead of on the question (and, recognizing this, I continued in "chat" mode). I just don't know how much it is worth keeping all these comments here. I don't promise to post an answer, but if I hit a moment of inspiration, I'll write something. Here is a more neutral starting point: https://en.wikipedia.org/wiki/UTF-8

  • I'm a bit of a layman, but I don't understand the Java-encoding comparison. From what I know, it would be ideal to have a universal encoding for applications that require compatibility with other systems. Programming languages do not need to follow a single standard, since each depends on the purpose for which it was created. I think this serves as an example, because compatibility between different languages can be achieved through standardized and universal forms, like XML for example.

  • 1

    @Weslleycxsardinha that is exactly the point. Each encoding has a purpose too, just as XML is full of problems and does not suit every case. There is an immense confusion here between "majority" and "best". Adopting UTF-8 for everything would be as absurd as XML for everything, or Java for everything. None of these things is so good that it serves every situation. And a lot of the arguments are "fashion of the moment". Granted, I've only been programming for 25 years, but in that short time I've seen plenty of "definitive solutions" that nobody remembers any more. Almost everything != everything.

  • @Bacco I agree; looked at from this point of view, it makes sense. I think UTF-8 should be the default standard for system integration and data sharing, avoiding unnecessary problems, but not for everything. My personal verdict is that UTF-8 should be the standard and the most widely used one (who has never suffered when structuring a database?), but it should not replace the others.

  • 1

    Dear friends, the text of the question perhaps lacks an introduction, so as not to seem biased, as @Bacco feels, perhaps because of the stance "... wouldn't it be easier to abandon all...". As for the point Bacco raises here, I still see a distortion: it is as if one defended that each municipality in Brazil has the right to color and draw its road signs however it wishes (!)... And of course, a poor municipality may not even have money to put up signs, and we don't need to exclude it from the map. But at some point even the poor adopt the conventions, and life gets better for everyone.

  • 1

    It's not that, but I won't insist. In short, you want to turn it into the standard because most people use it. My final suggestion, and my exit from this conversation, for those who are not too lazy to think and study, is: learn to work with encodings and the suffering ends. I know several programmers who have no problem using various encodings; yet the discussion leads us to believe that the very people who do not understand how it works are the ones entitled to choose a standard. Whoever knows what they are doing will choose the best encoding for each specific case, as has always been done. Just document it.

  • 1

    About the -1 I have just given: fallacies have been added to the answer. I am only making this comment because I was mentioned in the body of the answer. I will not touch the fallacies, so as not to start an edit war, but I feel obliged to make it clear that the statements about the W3C and the others are recommendations, and were used as rhetoric, since they do not apply to what was asked and much less invalidate what I have stated insistently.

  • Dear @Bacco, the answer is open as a wiki for you (or anyone else) to edit, and I have never created an "edit war"; I commit to keeping the same posture I have on my wikis here on SO-pt and on SO-en... As for what was asked, note that it is indeed common on Stack Overflow to drift from the strict focus in order to offer the answer to a wider audience. But I was clear and objective, separating it into two parts: the objective answer and the rambling. In this case there was even reinforcement from the moderators.

  • 1

    @Peterkrauss I do not intend to resume the discussion. The fact that I was mentioned in the body of the answer forced me to express my disagreement. Since I had made comments, I believe the most that would be fitting would be another comment, but since that is how it is, so be it. If anyone is interested in my point of view, I have expressed it. I do not want to turn this into a personal matter, because it is a purely technical matter and part of my everyday work, so I am sure of what was said. If you are curious, see the link I put in a comment on the question, because it is very close to the answer I would give.


14

Until relatively recently, several operating systems did not support UTF-8. There are still many applications from that time in use and, in many cases, companies will not bother to update them merely for reasons of good practice.

In addition, IDEs like Eclipse and Visual Studio adopt the operating system's encoding as the default, which in the case of Windows here in Brazil is CP1252 (WINDOWS-1252). I don't know whether there is any special reason for an IDE like Eclipse not to use UTF-8, but whenever I create a new workspace in Eclipse I have to set UTF-8 as the default manually.
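Just to illustrate how the platform default leaks into Java tooling, a minimal sketch (assuming a JDK on the PATH; app.jar is only a placeholder):

# show the JVM's effective default charset (on pt-BR Windows this tends to be Cp1252):
java -XshowSettings:properties -version 2>&1 | grep file.encoding

# force UTF-8 regardless of the OS default when launching an application:
java -Dfile.encoding=UTF-8 -jar app.jar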

Another factor that hinders the change is that some source-code versioning (SCM) systems do not handle encoding changes well. Even when they do, the IDEs or client tools can still get confused. I've seen several cases where developers had trouble merging because of encoding issues in SVN and CVS.
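One way to at least detect the problem before a merge, as a rough sketch (assuming the GNU file utility with --mime-encoding is available; the *.java filter is just an example):

# list source files whose detected encoding is neither ASCII nor UTF-8
find . -name '*.java' -print0 |
  xargs -0 file --mime-encoding |
  grep -vE 'us-ascii|utf-8'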

Finally, in practice it is not impossible, and it is not even difficult, if everyone involved in the project buys into the idea and does it properly. So the main impediments to the use of UTF-8 in modern systems are, in my opinion, the following:

  • Difficulty working with the tools
  • No consensus on the team about adopting UTF-8
  • Not having good practices and standardization as a priority
  • Lack of knowledge about good practice
  • Lack of time
  • Lack of motivation, because "I'm used to working a certain way"

It may also be a combination of these factors.

6

Because "ANSI" is more simple and solves most cases. For more details see my other answer on the subject.

  • UTF-8 is slower.
  • It occupies more space in most cases (see the comparison sketch just below this list).
  • It is hard to manipulate correctly (to implement and even to use).
  • There are reliability and ambiguity problems.
  • Although this is changing, it cannot be used in various situations and will never reach 100% dominance (there will always be legacy applications, and there will always be professionals who understand that there are better formats, even if many people want to advertise that one format practically cures cancer).
  • There are other problems that I discuss in another answer.
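On the space point, a quick comparison anyone can run in a UTF-8 terminal (just a sketch; 'ação' is only a sample accented string):

printf 'ação' | wc -c                                    # 6 bytes in UTF-8 (ç and ã take 2 bytes each)
printf 'ação' | iconv -f UTF-8 -t WINDOWS-1252 | wc -c   # 4 bytes in Windows-1252 (1 byte per character)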
  • 7

    But only UTF-8 brings your loved one back in 3 days. :)

3

Among the basic points to be observed when adopting standards, we must analyze the technical side and, above all, the business model.

If you want a system that is rigid and regionally limited, choose a local charset such as ANSI, Win-1252, Shift-JIS or Big5, among others.

If you want a flexible system that can be made globally accessible, use a suitable encoding such as UTF-8 or UTF-16. There is also UTF-32, also known as UCS-4.
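For a rough idea of how the sizes of these Unicode encodings compare, a small sketch (assuming a UTF-8 terminal and GNU iconv; 'coração' is only a sample string):

printf 'coração' | wc -c                               # UTF-8: 9 bytes (7 characters, 2 of them taking 2 bytes)
printf 'coração' | iconv -f UTF-8 -t UTF-16LE | wc -c  # UTF-16: 14 bytes (2 per character)
printf 'coração' | iconv -f UTF-8 -t UTF-32LE | wc -c  # UTF-32: 28 bytes (4 per character)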

UTF-8 is recommended worldwide for use in globally accessible systems.

Disagreements arise because of the difference between the environments in which web programmers and desktop programmers work. When the two collide, the fighting begins.

Desktop programmers typically deal more with specific local markets. Web programmers need to think globally all the time because what they develop usually becomes public on the internet, with global access.

Desktop software that runs in the corner bakery will never have an Indian, Malay, Chinese or Arab user accessing the system. So its developers do not care about internationalization standards and, in a way, they are right.

However, there is a catch. At this point we enter the discussion about business model.

Even for a programmer who develops only localized software (for the bar on the corner), imagine when that programmer thinks about exporting the product to another country. He is invited to work in South Korea, but his product does not follow an internationalized standard, so he loses out to a competitor whose product does.

0

In the UTF-8 vs. ANSI (etc.) question, it should not be forgotten that:
at the very least we have to deal with the past, with legacy texts.

It has probably all been said already. Still, here is a bit of my personal saga:

(1) I have dealt with texts in which several languages coexist: Greek characters, mathematical symbols, Cyrillic, Chinese. (2) I worked on a PT-Chinese dictionary before Unicode was popular... (3) I have dealt with translation alignment (e.g. EN-RU).

I hate UTF-8. But I hate even more:

  • ... not being able to type the characters I want
  • ... that there are three hundred different encodings
  • ... text formats that do not indicate the encoding used.

I have had to process large amounts of legacy texts. And I find that:

  • although tools are short-lived (in 5 years they are born, rot, die and become irrelevant), texts/content last a very long time;
  • ignoring the encodings used in the past would mean giving up the past;
  • hundreds of different encodings show up (try iconv -l to recall some of the names).

By the way: do not forget the tools tied to encodings (e.g. iconv, recode)!

1) What encodings are there?

iconv -l          # 1173 names, some of which are aliases

2) How do I convert my text from Windows Cyrillic (WINDOWS-1251) to UTF-8?

iconv -f WINDOWS-1251 -t utf8 file.txt > file-utf8.txt

3) How do I find out what on earth the encoding of this old Portuguese-language file is?

# brute force: try every encoding iconv knows and report those that produce the expected "ção"
for a in $(iconv -l)
do
  iconv -c -f "$a" -t utf8 antiguidade.txt | grep -q "ção" && echo "$a"
done

4) And by the way, libraries like glibc have functions like iconv() that make it easy to export/import text in multiple encodings (analogous functions exist in just about every language).
