There is no single correct answer when choosing an encoding. The choice must be made according to your needs. That is why databases support several encodings.
If your system has no chance of receiving any special character, as in the case you describe, where the content will always be HTML in which you can, a priori, replace every special character with its Unicode representation (i.e. &#nnnn;, where nnnn is the Unicode code point), then you probably do not need to store this data in UTF-8. You can even keep your entire database in a UTF-8 collation and give just that HTML field a different collation.
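As a rough illustration of the &#nnnn; replacement described above, here is a minimal Python sketch. The function name to_ascii_html and the sample string are made up for this example; the "xmlcharrefreplace" error handler is part of Python's standard codecs. It assumes the input is already a decoded string.

```python
def to_ascii_html(html: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler: any
    # character the target codec cannot encode becomes &#nnnn; (decimal).
    return html.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample content, only for illustration.
sample = "<p>Ação e reação: 10 €</p>"
print(to_ascii_html(sample))
# b'<p>A&#231;&#227;o e rea&#231;&#227;o: 10 &#8364;</p>'
```

After this step the content is pure ASCII, so it can be stored unchanged in a field of any ASCII-compatible encoding.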
But often you have no control over how the HTML will be written to the field: you have no filter to convert the cases where the user pastes some special character, and so on. If that is the case, the best strategy is to use Unicode.
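To see why unfiltered input is a problem for a single-byte field, the sketch below lists the characters of a pasted string that a given encoding cannot represent. The helper unencodable_chars and the sample text are hypothetical. It uses Python's "latin-1" codec (ISO-8859-1) as a stand-in, which may not match exactly what a particular database calls latin1.

```python
def unencodable_chars(text: str, encoding: str = "latin-1") -> list:
    """Return the distinct characters in text that the encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Keep first-seen order and skip repeats.
            if ch not in bad:
                bad.append(ch)
    return bad


# Hypothetical pasted content: "á" fits in Latin-1, the rest does not.
print(unencodable_chars("Olá, 日本 😀"))
# ['日', '本', '😀']
```

An empty result means the text would fit in the single-byte field. Anything else would have to be converted first or stored in a Unicode field.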
Another question is whether you choose a varchar field or a text field to store this kind of information. Each field type has its advantages and disadvantages, especially if you intend to filter or sort on this content. text fields can also be indexed, but the index needs a prefix length, which you must choose, that limits how many characters are compared. MySQL also has FULL TEXT SEARCH features that can be applied to both field types.
If it is only a matter of storing and retrieving the data, I would recommend a text field, so you would not need to worry about size limitations if you have no control over the user input.
Another aspect is that nowadays worrying about whether a field takes 1 byte or 2 bytes per character does not make much sense, given the cost per byte of disk storage. It only matters if you have a system with a very large amount of data that needs to be replicated across multiple instances and your provider's storage is expensive.
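To put numbers on the bytes-per-character question, this small Python sketch measures the same text in three encodings. The function encoded_sizes and the sample strings are illustrative only, and the input must be representable in Latin-1 for the first measurement to work.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in a few common encodings."""
    return {
        "latin-1": len(text.encode("latin-1")),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("<p>Hello</p>"))
# {'latin-1': 12, 'utf-8': 12, 'utf-16': 24}
print(encoded_sizes("<p>Ação</p>"))
# {'latin-1': 11, 'utf-8': 13, 'utf-16': 22}
```

For markup that is mostly ASCII, UTF-8 costs the same as Latin-1 and only grows for the accented characters.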
If this is your primary concern and you are not sure whether the content will use Unicode, choose UTF-8. This will simplify your database scripts, your conversions when reading the data in the program, and displaying it on HTML pages.
Possible duplicate of Which encoding to choose for a database?
– rbz
@RBZ I did read the answer you pointed to. However, it is 3 years old and contains some information that is now incorrect.
– Tulio F. Polachini
I don't think the answer in that post would be any different. In any case, let's wait for someone who can confirm this! ;]
– rbz
Being 3 years old changes nothing: the choice of type depends on the need (the need of the application you are building). If one encoding were simply better than the others, the "worse" ones would not exist; they would probably have been removed or discontinued by now. And even if something had changed, the right thing to do would be to put a bounty on the existing question asking for updated answers, and perhaps comment on the existing answer pointing out what has changed.
– Guilherme Nascimento
If you are not going to serve non-Western users and do not intend to fill your application with emoticons, Latin-1 is still a good choice: it is more compact and performs better (thanks to the 1:1 mapping between bytes and characters). If you want to internationalize, you can use any of the Unicode variants. Every case is different. The important thing is to understand that blind praise of UTF-8 comes from not knowing what happens "under the hood". Unicode spends more space to have more characters. UTF-8 costs a lot more processing than "full" Unicode, but it balances space and versatility (it only takes extra space above char 127).
– Bacco