What are the differences between utf8 and utf8mb4?

Viewed 14,650 times

20

After creating my MySQL database on a local server (XAMPP), I tried to import it into a Windows server, but the server would not accept the script I had exported from the database. So I copied the script over table by table and found that only one part of it caused an error:

ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

By removing every occurrence of this from the exported script, I was able to load the database onto the Windows server. However, some problems are occurring: on some pages of the website accented characters are replaced by symbols, and there are other problems that I do not know whether to attribute to the absence of the line above.

I would like to understand the difference(s) between utf8 and utf8mb4, to see whether this may be causing the problems on the website.

  • utf8mb4 allows an extra byte in the encoding, and that is basically it. For languages in current use, utf8mb4 is the same as the 3-byte version. Your problem is probably elsewhere in the code.

  • But in those lines with the CHARSET and COLLATE options, is that all they are for, allowing an extra byte in the encoding?

  • 1

    It basically changes nothing, except that it takes up more space in the database when you declare a CHAR column. CHAR(10) reserves 30 bytes in utf8, 40 bytes in utf8mb4, and 10 bytes in latin1. BMP characters, which are the ones utf8 supports, are encoded identically in utf8mb4.

  • Oh, I get it, so that shouldn't be the cause of the problem here. Post your comment as an answer so I can accept it.

  • I can't promise, but if I gather some more technical references I will post it as an answer. I just wanted to move things along so you have a basic idea. What is missing is an answer with good sources for people to consult (I think answers of this kind deserve a more detailed explanation, so if I can, I will elaborate later).

  • Great, I'll be waiting.

  • 4

    Of course, if someone wants to post a detailed answer that talks things through, feel free (as long as it explains things better; otherwise I recommend leaving it as a comment as well. If it is just nonsense, a comment "saves" the person from downvotes).

1 answer

23


In the past, programming languages supported only the ASCII encoding, which defines 128 symbols. This encoding is excellent for English, producing very compact text in which each letter takes only one byte. With the growth of the internet and an increasingly globalized world, problems quickly began to arise: people in Brazil, for example, could not use accents in their words. That was when initiatives began to create an encoding that would bring together all the symbols used around the world.

ASCII defines only 128 symbols, so the first bit of every byte is zero in this encoding. The UTF-8 standard took advantage of this and defined its first 128 symbols to be exactly the same as ASCII. When a character outside that range is needed, UTF-8 sets the first bit to 1 and uses the leading bits of the first byte to say whether the character takes 2, 3 or 4 bytes. As a result, a program that uses UTF-8 is fully compatible with any ASCII text.
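To make the variable length concrete, here is a small Python sketch. The sample characters and the helper name `utf8_length` are chosen only for illustration; they are not part of the question.

```python
# Sketch: how many bytes UTF-8 needs for a few sample characters.
# utf8_length is a hypothetical helper name used only in this example.

def utf8_length(ch: str) -> int:
    """Return the number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))

samples = ["A", "é", "€", "😺"]  # ASCII, accented Latin, BMP symbol, emoji

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex(' ')}")

# Pure ASCII text produces the same bytes in both encodings.
print("hello".encode("ascii") == "hello".encode("utf-8"))
```

The first byte of an ASCII character starts with bit 0, while every byte of a multi-byte character starts with bit 1, which is why the two never get confused.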

The problem is that MySQL did not fully adhere to the UTF-8 standard. It implemented only the symbols that take up to 3 bytes and left out the rest. What MySQL calls utf8 is not actually UTF-8, just a subset of it. To fix this, starting with version 5.5 MySQL implemented the full standard, from 1 to 4 bytes, and since the name utf8 was already taken, it called the new implementation utf8mb4. In short: MySQL's utf8 is not UTF-8, while utf8mb4 fully follows the UTF-8 standard.
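A rough way to check whether a given piece of text would fit in the 3-byte utf8 is to look for characters above U+FFFF, since those are the ones that need 4 bytes. The function name `fits_mysql_utf8` below is made up for this sketch, and it only approximates what the server does.

```python
# Sketch: detect text that needs 4-byte characters.
# fits_mysql_utf8 is a hypothetical helper, not a MySQL or library function.

def fits_mysql_utf8(text: str) -> bool:
    """True if every character is in the BMP (at most 3 bytes in UTF-8)."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_mysql_utf8("ação e coração"))  # accents are fine
print(fits_mysql_utf8("gato 😺"))          # the emoji needs 4 bytes
```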

Still, utf8 and utf8mb4 are highly compatible: the overwhelming majority of characters are the same in both. If you switch from one to the other you probably won't see any difference. Unless, of course, people start using animals as letters (emoji are among the characters that need the 4th byte), in which case they will be upset when #û&ý appears in place of kittens. Even if you use every existing accent, there will be no problem!

The point is that the MySQL default (before version 8.0) is the Latin1 encoding, also known as ISO 8859-1, which defines the characters of the Latin-alphabet languages and can be used perfectly well for Portuguese. When you stopped declaring utf8mb4, MySQL fell back to this encoding. Since your application is probably in UTF-8, and the two encodings represent accented characters differently but represent ASCII the same way, the errors appear only in accented characters.
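The effect described above can be reproduced outside the database. This is a minimal Python sketch, using Python's `latin-1` codec as a stand-in for MySQL's latin1 and a sample word picked for illustration:

```python
# Sketch: UTF-8 bytes read back as Latin-1 garble only the accented letters.
original = "ação"

utf8_bytes = original.encode("utf-8")   # what a UTF-8 application sends
misread = utf8_bytes.decode("latin-1")  # what a Latin-1 reader sees

print(misread)  # aÃ§Ã£o

# Plain ASCII survives the same round trip unchanged.
print("acao".encode("utf-8").decode("latin-1"))  # acao
```

Each accented letter becomes two symbols because its two UTF-8 bytes are read as two separate Latin-1 characters.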

Maybe this part of the script failed because the MySQL version on the server does not support utf8mb4. If that is the case, just use utf8 in its place; the accents will be compatible.
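If the dump is large, the substitution can be scripted instead of done by hand. Below is a minimal sketch in Python; the function name `downgrade_charset` is hypothetical, and it assumes the dump only uses collations that have a utf8 counterpart, such as the utf8mb4_unicode_ci from the question.

```python
# Sketch: replace utf8mb4 with utf8 in an exported SQL script.
# downgrade_charset is a hypothetical helper name used for this example.

def downgrade_charset(sql: str) -> str:
    """Rewrite charset and collation names from utf8mb4 to utf8."""
    return sql.replace("utf8mb4", "utf8")

line = ("ENGINE=InnoDB AUTO_INCREMENT=2 "
        "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci")

print(downgrade_charset(line))
# ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
```

A plain text replacement like this would also touch any data in the dump that happens to contain the string utf8mb4, so it is worth reviewing the result before importing it.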

  • 5

    In principle, if you do not use the 4th byte, the rest of the characters are the same. The only other thing that changes is how much space MySQL reserves for fixed-length columns, depending on whether it is mb4 or not. I gave my +1 for the detailed answer, in particular for "the Latin1 encoding, also known as ISO 8859-1 that defines all the characters of the Latin language and can be very well used in Portuguese", which some people insist on not understanding. And ISO-8859-1 has the added advantage of taking only 1 byte per character and being faster to process.
