What are the differences between utf8 and utf8mb4?

Question


Asked 9 years, 2 months ago

Viewed 14,650 times

20

After creating my MySQL database on a local server (XAMPP), I tried to import it into a Windows server, but the script I had exported from the database would not import. So I copied the scripts over table by table and found that only one part of the script caused an error:

ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

After removing every occurrence of this from the exported script, I was able to load the database onto the Windows server. However, some problems have appeared: on some pages of the website, accented characters are replaced by symbols, and there are other problems that may or may not be caused by the absence of the line above.

I would like to understand the difference(s) between utf8 and utf8mb4, to see whether this could be causing the problems on the website.

utf8mb4 allows one extra byte per character in the encoding; that is basically it. For text in current languages, utf8mb4 is the same as the 3-byte version. Your problem is probably elsewhere in the code.

– Bacco

2016/04/22 at 19:17
But those lines with the CHARSET and COLLATE options, is that all they are for, allowing an extra byte in the encoding?

– DiChrist

2016/04/22 at 19:18
1

It changes basically nothing, other than taking up more space in the database when you declare a CHAR column: CHAR(10) reserves 30 bytes in utf8, 40 bytes in utf8mb4, and 10 bytes in latin1. BMP characters, which are the ones utf8 supports, are encoded identically in utf8mb4.

– Bacco

2016/04/22 at 19:20
Oh, I get it, so that shouldn't be the cause of the problem here. Post your comment as an answer so I can accept it.

– DiChrist

2016/04/22 at 19:24
I can't promise, but if I gather some more technical references, I'll post it as an answer. I just wanted to move things along so you'd have a basic idea. I think what an answer is missing right now is good sources for people to consult (answers of this kind deserve a more detailed explanation, so if I get the chance I'll elaborate later).

– Bacco

2016/04/22 at 19:25
Great, I'll be waiting.

– DiChrist

2016/04/22 at 19:30
4

Obviously, if someone wants to post a detailed answer and explain things properly, feel free (as long as it explains things better; otherwise I recommend leaving it as a comment too. If it turns out to be nonsense, a comment at least "saves" the person from downvotes).

– Bacco

2016/04/22 at 19:32


1 answer


by Sérgio Mucciaccia • **2,745** points · Answer 1 · 2016-05-07T01:28:30+00:00

In the past, programming languages supported only the ASCII encoding, which defines 128 symbols. This encoding works very well for English, producing compact text in which each letter takes only one byte. With the growth of the internet and an increasingly globalized world, problems quickly arose: people in Brazil, for example, could not use accents in their words. That was when initiatives began to create an encoding that would bring together all the symbols used around the world.

ASCII defines only 128 symbols, so the first bit of every byte is zero in this encoding. The UTF-8 standard took advantage of this and defined its first 128 symbols to be exactly the same as ASCII. When a character outside that range is needed, UTF-8 sets the first bit to 1 and uses the leading bits of the first byte to say whether the character takes 2, 3 or 4 bytes. As a result, a program that uses UTF-8 is fully compatible with any ASCII text.
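To make the variable length concrete, here is a small Python sketch (Python is just a convenient choice for illustration, and the four sample characters are my own) that encodes a few characters and reports how many bytes each one takes:

```python
# Sample characters, written as escapes so the file itself stays ASCII:
# "a" (U+0061), c-cedilla (U+00E7), euro sign (U+20AC), cat emoji (U+1F63A).
samples = ["a", "\u00e7", "\u20ac", "\U0001F63A"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # The high bits of the first byte announce the sequence length:
    # 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), first byte {encoded[0]:08b}")
```

The ASCII letter takes one byte with a leading 0 bit, the accented letter two, the euro sign three, and the emoji four.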

The problem is that MySQL did not fully adhere to the UTF-8 standard. It implemented only characters of up to 3 bytes and left out the rest. What MySQL calls utf8 is not actually UTF-8, just a subset of it. To fix this, starting with version 5.5, MySQL implemented the full standard, from 1 to 4 bytes, and since the name utf8 was already taken, it called the new implementation utf8mb4. In short: MySQL's utf8 is not UTF-8, while utf8mb4 fully follows the UTF-8 standard.
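A rough way to check whether a given text would fit in MySQL's 3-byte utf8 is to test that every character is at or below U+FFFF, since those are exactly the code points UTF-8 encodes in at most 3 bytes. The helper name `fits_mysql_utf8` below is hypothetical, not part of MySQL or of any library:

```python
def fits_mysql_utf8(text: str) -> bool:
    """Return True if every character of `text` needs at most 3 bytes in UTF-8.

    Hypothetical helper for illustration: code points up to U+FFFF (the Basic
    Multilingual Plane) encode in 1 to 3 bytes; anything above needs 4 bytes
    and therefore requires utf8mb4.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_mysql_utf8("a\u00e7\u00e3o"))       # accented Portuguese word
print(fits_mysql_utf8("gato \U0001F63A"))      # text containing a cat emoji
```

Accented Portuguese text passes the check; text containing an emoji does not.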

Even so, utf8 and utf8mb4 are highly compatible: the vast majority of characters are encoded identically in both. If you switch from one to the other you probably won't notice any difference. The exception is characters that need 4 bytes, such as emoji: if Chinese users start using animals as letters, they will be upset when #û&ý appears in place of kittens. Even if you use every existing accent, there will be no problem!

The point is that MySQL's default encoding (in versions before 8.0) is latin1, also known as ISO 8859-1, which defines the characters of Western European languages and works perfectly well for Portuguese. When you stopped declaring utf8mb4, MySQL fell back to this encoding. Your application is probably using UTF-8, and the two encodings represent ASCII the same way but accented characters differently, so the errors show up only in the accents.
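The effect is easy to reproduce outside the database. This Python sketch (the function name is mine, and it only simulates the mismatch; it is not what MySQL does internally) encodes a string as UTF-8 and then reads the same bytes back as ISO 8859-1:

```python
def misread_as_latin1(text: str) -> str:
    """Encode `text` as UTF-8, then decode those bytes as ISO 8859-1.

    This simulates one side writing UTF-8 while the other side assumes latin1.
    """
    return text.encode("utf-8").decode("latin-1")


original = "a\u00e7\u00e3o"  # the Portuguese word for "action"
garbled = misread_as_latin1(original)

print(len(original), len(garbled))   # each accented letter becomes two characters
print(misread_as_latin1("abc") == "abc")  # pure ASCII survives unchanged
```

Each accented letter is two bytes in UTF-8, and latin1 shows each of those bytes as a separate character, which is why plain ASCII text looks fine while accented words turn into symbols.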

This part of the script may have failed because the MySQL version on the server does not support utf8mb4. If that is the case, just use utf8 in its place; the accents will be compatible.
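If you go that route, the substitution can be done on the exported script instead of deleting the charset declarations. The sketch below is a minimal illustration, assuming the dump only mentions utf8mb4 in charset and collation names; `downgrade_charset` is a hypothetical helper, and a blind replace like this would also alter any row data that happens to contain the text "utf8mb4":

```python
def downgrade_charset(sql: str) -> str:
    """Replace utf8mb4 with utf8 in charset and collation names.

    Hypothetical helper: a plain text substitution, so utf8mb4_unicode_ci
    becomes utf8_unicode_ci along with the charset itself.
    """
    return sql.replace("utf8mb4", "utf8")


# The table options from the question's exported script.
line = "ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
print(downgrade_charset(line))
# prints: ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
```

Keeping an explicit charset and collation this way avoids falling back to the server default.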