What is the difference between charset and collation in a database?

Asked

Viewed 2,693 times

26

Every time I create a database, I always come across these definitions and I don’t really know the difference.

I understand what it is charset, but not collation.

2 answers

22


Charset

The Charset defines the character set that you will use to store data in text form, where each character receives a number. A very common example is the 256-character ASCII table. If you indicate that your system’s charset is ASCII, this means that your system will only recognize the 256 characters contained in it. Therefore sometimes we see texts without accentuation on the internet, with a question mark instead of the character, because the page’s charset does not contain the character that should be displayed. Currently, the standard charset is UTF-8, which allows up to 1,114,112 characters, which allows you to display characters in any language, including Chinese and Japanese.

Collation

Collation is the term used to define the set of rules that the server will use for sorting and comparing texts, that is, how the operators =, >, <, order by, etc. For example, depending on the configured Collation, the system will sort the character 'ö' between the characters ːo' and ːp', using another collation, this character can be ordered at another position. So it can give conflict when making queries that relate tables with different collations. In addition, the collation also defines whether the system will differentiate accented characters or whether it will be case sensitive, for example the collation Latin1_general_ci_as defines that the system should treat characters as case insensitive (IC) and acentue sensitive (AS). Examples:

  • latin1_general_ci: No distinction between capital letters and lowercase. Searching for "test", records such as "Test" or "TEST" will be returned.
  • latin1_general_cs: Distinguishes upper and lower case letters. Seeking by "test" will only return "test". Options such as "Test" and "TEST" will not be returned.
  • latin1_swedish_ci: Does not distinguish between lower and upper case letters, nor accented characters and with cedilla, ie the record that contains the "Intuition" word will be returned when there is a search for word "innuendo"

Even if you change the collation of the database the previously created objects will not have the collation changed, for that you will have to recreate the object.

This subject is very complex, I tried to summarize to the maximum, I hope it helps.

  • 3

    It was a good summary, has more information than the current answer accepted. Won my +1.

  • 3

    Your reply was excellent

18

Charset is the set of symbols and encodes

Let’s assume you have the letters "A", "B", "a" and "b". For each letter shall be associated with a number: "A" = 0, "B" = 1, "a" = 2 and "b" = 3. A letter "A" is a symbol and 0 is the encounter to "A". The combination of symbol + encounter is a charset.

Already the collation is a set of character comparison rules.

I have withdrawn this reply from this excellent reference.

  • It’s much clearer now, thank you!

  • 1

    I even gave an improved answer, just to be clear.

  • 2

    Related: http://answall.com/a/21358/4751 (What are collations for)

  • 1

    Excellent response

Browser other questions tagged

You are not signed in. Login or sign up in order to post.