Use UTF-8 or Latin1?

Viewed 5,033 times

2

I started a new project and, when creating the database (MySQL), I didn't think twice and set CHARSET=utf8. The application will support Portuguese and English, and users are expected to write only in these two languages.

In one specific module, users can write a procedure, that is, a relatively long text, for which I will use a WYSIWYG HTML editor. Users format their text and I store the resulting HTML in the database. For this column I chose VARCHAR(65535), so as to make better use of the space in the database.

However, MySQL reported that the most I can get in a VARCHAR is 21845, because of UTF-8 (each character takes at most 3 bytes).
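The figure above is just the byte ceiling divided by the worst-case width of a character. A minimal sketch of that arithmetic, assuming the 65,535-byte ceiling from the question and worst-case widths of 1, 3 and 4 bytes for latin1, utf8 (the 3-byte variant) and utf8mb4; the names `MAX_BYTES_PER_CHAR` and `max_varchar_chars` are made up for illustration:

```python
# Worst-case bytes per character for a few MySQL character sets (assumed values).
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

# Byte ceiling for the column, taken from the VARCHAR(65535) in the question.
MAX_BYTES = 65535


def max_varchar_chars(charset: str) -> int:
    """Largest VARCHAR(n) whose worst-case byte length fits in MAX_BYTES."""
    return MAX_BYTES // MAX_BYTES_PER_CHAR[charset]


for name in MAX_BYTES_PER_CHAR:
    print(name, max_varchar_chars(name))
```

Other columns in the same row also count against the row size, so the usable length in a real table may be lower than what this prints.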

Question: is it still worth using Latin1 today, to guarantee that each character takes only 1 byte? Or is that obsolete, and is it better to go with UTF-8?

  • 5
  • @RBZ I read the answer you pointed to. However, it is 3 years old and contains some information that is now incorrect.

  • I believe nothing has changed with respect to the answer in that post. In any case, let's wait for someone who can confirm this! ;]

  • 5

    Being 3 years old changes nothing: the choice of an encoding depends on your need (the need of the application you are building). If one encoding were simply better than the other, the "worse" one would not exist; it would probably have been removed or discontinued by now. And even if something had changed, the right thing to do would be to put a bounty on the existing question asking for updated answers, and perhaps to comment on the existing answer stating what has changed.

  • 3

    If you will not be serving non-Western users and do not intend to fill your application with emoji, Latin-1 is still a good choice: it is more compact and more performant (because of the 1:1 mapping between bytes and characters). If you want to internationalize, you can use any of the Unicode encodings. Each case is different. The important thing is to understand that blind praise of UTF-8 shows a lack of knowledge of what happens "under the hood". Unicode spends more space to offer more characters. UTF-8 costs more processing than "full" (fixed-width) Unicode, but it balances space and versatility (it only takes extra space for characters above code point 127).

3 answers

11


There is no single correct answer when choosing an encoding. The choice must be made according to your needs. That is why databases support several of them.

If there is no chance of your system receiving any special character, as in the case you describe, where the content will always be HTML in which you can, in principle, replace every special character with its Unicode character reference (i.e. &#nnnn;, where nnnn is the Unicode code point), then you probably do not need to store this data in UTF-8. You can even keep the entire database in a UTF-8 character set and collation and give only that HTML column a different one.
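As a rough illustration of that replacement step, here is a minimal sketch in Python that turns every non-ASCII character into a numeric character reference before the HTML is stored. The function name `to_numeric_entities` and the sample text are made up for the example; the `xmlcharrefreplace` error handler is part of Python's standard codecs.

```python
import html


def to_numeric_entities(markup: str) -> str:
    """Replace every non-ASCII character with an &#nnnn; reference."""
    return markup.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "<p>Manutenção: 10 €</p>"
escaped = to_numeric_entities(sample)
print(escaped)

# The escaped text is pure ASCII, so it fits in a single-byte column as well.
escaped.encode("latin-1")

# A browser (or html.unescape) turns the references back into the characters.
print(html.unescape(escaped) == sample)
```

The trade-off is that each escaped character now occupies several ASCII bytes, so text with many accented letters can end up larger than the same text in UTF-8.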

But often you have no control over how the HTML is written to the column: you may not have a filter to convert characters when, for example, a user pastes a special character. If that is the case, the best strategy is to use Unicode.

Another question is whether you choose a VARCHAR column or a TEXT column to store this kind of information. Each type has its advantages and disadvantages, especially if you intend to filter or sort on this content. TEXT columns can also be indexed, but you must choose a prefix length for the index to use when comparing characters. MySQL also has FULLTEXT search features that can be applied to both column types.

If it is only a matter of storing and retrieving the data, I would recommend a TEXT column, so that you don't have to worry about size limits, given that you have no control over the user's input.

Another point is that, nowadays, worrying about whether a column takes 1 or 2 bytes per character does not make much sense given the cost per byte of disk storage. It matters only if you have a system with a very large amount of data that must be replicated across multiple instances and your provider's storage is expensive.
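To put a number on that overhead, here is a minimal sketch that measures the same HTML fragment in both encodings. The helper `encoded_sizes` and the sample sentence are illustrative assumptions; real content will vary with how many accented letters it contains.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in Latin-1 and in UTF-8."""
    return {
        "chars": len(text),
        "latin1": len(text.encode("latin-1")),
        "utf8": len(text.encode("utf-8")),
    }


sample = "<p>Desligue a máquina antes da manutenção.</p>"
sizes = encoded_sizes(sample)
extra = sizes["utf8"] - sizes["latin1"]
print(sizes)
print(f"UTF-8 overhead: {extra} bytes ({extra / sizes['latin1']:.1%})")
```

Since HTML tags and unaccented letters are plain ASCII, only the accented characters cost an extra byte in UTF-8.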

If this is your primary concern and you are not sure whether the content will contain Unicode characters, choose UTF-8. It will simplify your database scripts, the conversions when reading the data in your program, and displaying it on HTML pages.

-1

In MySQL you have the types MEDIUMTEXT (up to 16 MB) and LONGTEXT (up to 4 GB), so there is no need to worry about limits imposed by the encoding. Standardize on UTF-8 and be happy :)

  • Don't these two types take up space in the database unnecessarily when the column is optional?

  • I find the answer entirely pertinent. I just don't want to waste space in the database.

  • No, they only take up the space they need. The only "waste" is 3 or 4 bytes per row and column to store the length of the string (in MEDIUMTEXT and LONGTEXT respectively).
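The length prefix mentioned in the last comment also determines how large a value can be. A minimal sketch of that relationship, assuming prefix sizes of 1 to 4 bytes for TINYTEXT through LONGTEXT; the names `PREFIX_BYTES`, `max_bytes` and `stored_bytes` are made up, and the storage figure is only an approximation that ignores engine-level overhead:

```python
# Length-prefix size in bytes for each MySQL TEXT type (assumed values).
PREFIX_BYTES = {"TINYTEXT": 1, "TEXT": 2, "MEDIUMTEXT": 3, "LONGTEXT": 4}


def max_bytes(text_type: str) -> int:
    """Largest value length that an n-byte length prefix can describe."""
    return 2 ** (8 * PREFIX_BYTES[text_type]) - 1


def stored_bytes(text_type: str, value: str, encoding: str = "utf-8") -> int:
    """Approximate storage: the encoded value plus its length prefix."""
    return len(value.encode(encoding)) + PREFIX_BYTES[text_type]


for name in PREFIX_BYTES:
    print(name, max_bytes(name), stored_bytes(name, ""))
```

Note that these limits are in bytes, so the number of characters that fit depends on the encoding of the column.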

-3

Use latin1 only if you are sure you will never need characters beyond those it covers. latin1 is no longer recommended because it represents too few characters (even its own creators recommend UTF-8 as its replacement). Currently, UTF-8 is the most suitable, as it can encode almost all characters of all languages in 1, 2 or 3 bytes. It rarely needs 4 bytes.
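The 1-to-4-byte range is easy to check directly. A minimal sketch, with sample characters picked only for illustration and a hypothetical helper `utf8_len`:

```python
# One sample character for each UTF-8 width.
samples = {
    "a": "ASCII letter",
    "ç": "Portuguese letter",
    "€": "euro sign",
    "😀": "emoji",
}


def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_len(ch)} byte(s)")
```

The 4-byte case matters in MySQL because, as far as I know, the character set named utf8 stores at most 3 bytes per character, and utf8mb4 is the one that accepts characters such as emoji.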

  • 2

    Why? Latin1 has exactly the characters you say it lacks. Even the information about UTF-8 is wrong.

  • 2

    I use latin1 (actually Windows-1252, for desktop) for most things, simply because it has all the special characters that Portuguese uses and is much more performant than UTF-8. It would be good to review the concepts and edit the post to improve it.

  • Why is the information about UTF-8 wrong?

  • On the other hand, do you think 256 characters can represent anything you want? For example, try inserting symbols such as € (euro) into the databases where you say you use latin1, and then tell us whether the result is satisfactory.
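The euro example can be tried outside the database too. A minimal sketch using Python's own codecs, where `latin-1` means strict ISO-8859-1 and `cp1252` is the Windows-1252 variant mentioned in the comments; the helper `can_encode` is made up for the example:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("latin-1", "cp1252", "utf-8"):
    print(enc, "€:", can_encode("€", enc), "ção:", can_encode("ção", enc))
```

This only describes Python's codecs. MySQL's documentation describes its latin1 character set as based on cp1252, so the result in the database may not match strict ISO-8859-1; it is worth testing on the actual server.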
