Basically, what localeCompare and Intl.Collator offer is a way to compare strings according to specific rules other than the "standard comparison". It is often said that both exist to apply the alphabetical order of a specific language, but in fact they go a little further.
The "standard comparison" of strings is done with the operators > and <, and is described in detail here. Basically, it follows the order of the Unicode code points of each character (to know what a code point is, read here).
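As a minimal sketch of what this "standard" order means in practice (the sample strings are my own, chosen only for illustration, and "ä" is written as an escape so that it is a single precomposed character):

```javascript
const aUmlaut = '\u00E4'; // "ä" as a single precomposed code point

// Numeric values behind each character
console.log('Z'.codePointAt(0));     // 90
console.log('a'.codePointAt(0));     // 97
console.log('z'.codePointAt(0));     // 122
console.log(aUmlaut.codePointAt(0)); // 228

// The < and > operators compare these numeric values (strictly speaking,
// UTF-16 code units, which match code points for characters like these)
console.log('Z' < 'a');     // true: uppercase "Z" sorts before lowercase "a"
console.log(aUmlaut > 'z'); // true: "ä" sorts after "z"

// sort() without a comparator uses the same kind of ordering
const plainSorted = ['zebra', aUmlaut + 'bc', 'Zulu', 'abc'].sort();
console.log(plainSorted); // [ "Zulu", "abc", "zebra", "äbc" ]
```

This is why a plain sort puts every uppercase word before every lowercase one and pushes accented letters to the end.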
But this "standard" order is not sufficient for all use cases. Many languages, even when they use the same characters, have different rules for ordering them. For example, in German alphabetical order the character ä comes before z, but in Swedish it is the other way around.
console.log('ä'.localeCompare('z', 'de')); // -1 (or some other negative value)
console.log('ä'.localeCompare('z', 'sv')); // 1 (or some other positive value)
Note the second parameter: it indicates the locale to be used. Many people summarize the locale as being just a "language", but it is actually a set of parameters that can define the language, region, variants/dialects and rules for alphabetical ordering, plus the format of dates, numbers, monetary values, etc. All this is condensed into an identifier. In the example above, I used the identifiers de and sv, indicating respectively the German and Swedish languages (these codes are defined by ISO 639). Below we will see other, more complex identifiers.
The return value in the first case was -1, and in the second it was 1 (tested in Chrome; other browsers may return other values). When the return value is a negative number, it indicates that the character ä is "smaller" than z (that is, in a sort, ä comes before z). When it is positive, it indicates that ä is "greater" (in a sort, ä comes after z), and when it is zero, it means that they are "equal" (i.e., in a sort, they are considered equivalent).
It is important to note that the language specification only says that the returned value must be positive, negative or zero (i.e., it is not guaranteed to always be -1 or 1).
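Since only the sign is guaranteed, code that inspects the result should compare it against zero. A small sketch (the strings and variable names are only illustrative):

```javascript
// Only the sign of the result is meaningful, so compare against zero
// instead of checking for exactly -1 or 1
const result = 'a'.localeCompare('b', 'en');

if (result < 0) {
  console.log('"a" sorts before "b"');
} else if (result > 0) {
  console.log('"a" sorts after "b"');
} else {
  console.log('"a" and "b" are considered equivalent');
}

// Math.sign normalizes any result to -1, 0 or 1 when a fixed value is needed
const normalized = Math.sign(result);
console.log(normalized); // -1
```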
Example of sorting an array of strings:
let words = ['teste', 'äbc', 'zebra'];
// sort the words according to the rules of the German language
console.log(words.sort((a, b) => a.localeCompare(b, 'de'))); // [ "äbc", "teste", "zebra" ]
// sort the words according to the rules of the Swedish language
console.log(words.sort((a, b) => a.localeCompare(b, 'sv'))); // [ "teste", "zebra", "äbc" ]
// using Collator
let alemao = new Intl.Collator('de');
console.log(words.sort(alemao.compare)); // [ "äbc", "teste", "zebra" ]
let sueco = new Intl.Collator('sv');
console.log(words.sort(sueco.compare)); // [ "teste", "zebra", "äbc" ]
Note in the example above that using the compare method of an Intl.Collator has the same effect as using localeCompare. But according to the documentation, using a Collator performs better when you need to do many comparisons at once (for example, when sorting an array of strings). Apart from that detail, basically everything I say about localeCompare also applies to Intl.Collator.
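A minimal sketch of the reuse pattern, assuming the list below (which is made up for illustration): the collator is created once and its compare function is handed to sort, instead of passing the locale again on every comparison.

```javascript
// Create the collator once and reuse its compare function for every comparison
const collator = new Intl.Collator('de');
const names = ['zebra', 'teste', 'abc', 'Zulu', 'bola'];

// slice() copies the array so that the original stays untouched
const sortedWithCollator = names.slice().sort(collator.compare);
const sortedWithLocaleCompare = names.slice().sort((a, b) => a.localeCompare(b, 'de'));

// Both approaches produce the same order
console.log(sortedWithCollator);      // [ "abc", "bola", "teste", "zebra", "Zulu" ]
console.log(sortedWithLocaleCompare); // [ "abc", "bola", "teste", "zebra", "Zulu" ]
```

I did not measure the performance difference here; the sketch only shows that the results are interchangeable.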
Remember that the ordering rules are not limited to a "letter by letter" comparison. In Slovak, for example, the digraph "ch" comes after "h" in alphabetical order:
let words = ['chave', 'casa', 'hoje'];
// in Slovak, "ch" comes after "h"
console.log(words.sort((a, b) => a.localeCompare(b, 'sk-SK'))); // ["casa", "hoje", "chave"]
// in Portuguese, "normal" order
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // ["casa", "chave", "hoje"]
Note: I did not use Slovak words :-)
Also note that the identifiers now include a country code (a code identifying a country, as defined by ISO 3166). In this case we have sk-SK (sk is the Slovak language code and SK is the code for Slovakia) and pt-BR (pt is the code for the Portuguese language, and BR is the code for Brazil). This serves to make the locale more specific, since there may be variations, such as pt-PT (Portuguese from Portugal).
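To see the parts of an identifier programmatically, one option is Intl.Locale. This API is not mentioned in the original answer and requires a reasonably recent browser or Node.js version, so take this as a sketch:

```javascript
// Intl.Locale parses an identifier and exposes its parts
const ptBR = new Intl.Locale('pt-BR');
console.log(ptBR.language); // "pt" (the language code)
console.log(ptBR.region);   // "BR" (the country code)

const skSK = new Intl.Locale('sk-SK');
console.log(skSK.language); // "sk"
console.log(skSK.region);   // "SK"

// An identifier without a country code simply has no region
const pt = new Intl.Locale('pt');
console.log(pt.region); // undefined
```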
This does not always influence the behavior of localeCompare (pt-BR and pt-PT have the same alphabetical ordering rules), but there are other aspects in which a variant can make a difference. For example, en-US (American English) and en-GB (British English) have the same alphabetical ordering rules, but different date formats (respectively, month/day/year and day/month/year). That is, for localeCompare it makes no difference.
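A short sketch of that contrast, assuming an environment with locale data for both variants (the words and the date are arbitrary):

```javascript
// Collation: both variants order these words the same way
const us = Math.sign('apple'.localeCompare('Banana', 'en-US'));
const gb = Math.sign('apple'.localeCompare('Banana', 'en-GB'));
console.log(us === gb); // true

// Date formatting: the same date is written differently.
// timeZone: 'UTC' keeps the output independent of the machine's time zone.
const date = new Date(Date.UTC(2020, 0, 31)); // 31 January 2020
const usDate = date.toLocaleDateString('en-US', { timeZone: 'UTC' });
const gbDate = date.toLocaleDateString('en-GB', { timeZone: 'UTC' });
console.log(usDate); // month/day/year order
console.log(gbDate); // day/month/year order
```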
In addition to the locale, it is possible to pass a number of options that override the behavior of the locale. For example:
let words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// locale default: uppercase after lowercase, accented letters after unaccented ones
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // ["a", "casa", "Casa", "sabia", "sabiá", "sábia"]
words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// putting uppercase before lowercase (the other rules remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {caseFirst: 'upper'}))); // ["a", "Casa", "casa", "sabia", "sabiá", "sábia"]
words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// accents make no difference (the "uppercase after" rule remains)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {sensitivity: 'case'}))); // ["a", "casa", "Casa", "sábia", "sabia", "sabiá"]
I re-initialized the array before each call because sort modifies the array itself, and I wanted to show in the third case that "sábia", "sabia" and "sabiá" remain in the same order (provided you test in a browser in which sorting is stable; I tested in Chrome 81, but since version 70 it already implements stable sorting, which is required by the ES2019 specification).
Anyway, note that if I use only the rules of the locale pt-BR, it considers that uppercase letters come after lowercase ones (so "casa" comes before "Casa"), and that accented letters come after unaccented ones.
But using the options I can override this behavior. For example, with caseFirst: 'upper' I say that uppercase letters must come first. The other locale rule (about accented letters) remains.
In the third example, sensitivity: 'case' considers letters with and without accents to be equal (in fact it applies the "base letters" rule, which the other answer already exemplified). So the "uppercase after" rule was maintained, and the other words ("sábia", "sabia" and "sabiá") did not change position because they were considered "equal" (remember that I ran the test in Chrome, which already implements stable sorting; in other browsers the order may change).
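For reference, here is a small sketch of the four values that sensitivity accepts, checked against a single pair of letters. The helper sameUnder is hypothetical and exists only for this illustration:

```javascript
// Hypothetical helper: true when the two strings compare as equal
const sameUnder = (sensitivity, a, b) =>
  a.localeCompare(b, 'pt-BR', { sensitivity }) === 0;

const accented = '\u00E1'; // "á" as a single precomposed code point

// 'base': accents and case are both ignored
console.log(sameUnder('base', 'a', accented), sameUnder('base', 'a', 'A'));       // true true
// 'accent': accents matter, case is ignored
console.log(sameUnder('accent', 'a', accented), sameUnder('accent', 'a', 'A'));   // false true
// 'case': case matters, accents are ignored
console.log(sameUnder('case', 'a', accented), sameUnder('case', 'a', 'A'));       // true false
// 'variant' (the default when sorting): accents and case both matter
console.log(sameUnder('variant', 'a', accented), sameUnder('variant', 'a', 'A')); // false false
```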
It is also possible to use some Unicode extensions in the locale identifier. They are introduced by the "-u" suffix, followed by the options (the full list can be found here, and a more detailed XML, here).
One of them is kf, which is the "Collation parameter key for ordering by case". That is, it has the same functionality as the caseFirst option:
let words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// putting uppercase before lowercase (the other rules remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kf-upper'))); // ["a", "Casa", "casa", "sabia", "sabiá", "sábia"]
words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// the options argument overrides the Unicode extension (that is, here it will be "lowercase first")
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kf-upper', {caseFirst: 'lower'}))); // ["a", "casa", "Casa", "sabia", "sabiá", "sábia"]
The syntax may seem confusing, but here is how it breaks down:
- pt-BR: locale code (with language and country)
- the suffix "-u" indicates that Unicode extensions follow
- "kf" is the extension itself, followed by its value (the XML already cited lists the values of each extension, although not all of them are supported by JavaScript).
In this case, the value of "kf" is "upper", hence the full locale identifier is pt-BR-u-kf-upper. Note that it has the same effect as using the option caseFirst: 'upper'. However, if I also pass the option, the option takes precedence (behaviour described in the documentation). So in the second case, caseFirst: 'lower' overrode kf-upper, and the rule applied was "lowercase first".
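One way to check which setting actually won is resolvedOptions(), which reports the configuration a collator ended up with. A sketch, assuming an engine that supports the kf extension (the Chrome results above suggest V8 does):

```javascript
// resolvedOptions() reports the settings a collator actually ended up with
const fromExtension = new Intl.Collator('pt-BR-u-kf-upper');
console.log(fromExtension.resolvedOptions().caseFirst); // "upper"

// The options argument wins over the extension in the identifier
const overridden = new Intl.Collator('pt-BR-u-kf-upper', { caseFirst: 'lower' });
console.log(overridden.resolvedOptions().caseFirst); // "lower"
```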
As for kn ("Collation parameter key for numeric handling"), it is equivalent to the numeric option, which indicates whether strings containing digits should be compared by their numeric value. By default, the string '10' is considered less than '2', because the digits 1 and 2 are just characters, and in a lexicographic comparison '10' comes before '2'. But if we consider the numeric value, then 2 must come before 10. Example:
let words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];
// the default is lexicographic comparison
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // [ "abc", "abc 10", "abc 2", "casa", "Casa" ]
// the "kn" extension with the value "true" takes the numeric value into account (the other rules, such as "uppercase after", remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kn-true'))); // [ "abc", "abc 2", "abc 10", "casa", "Casa" ]
// the "numeric" option with the value true, equivalent to "kn-true" (the other rules, such as "uppercase after", remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {numeric: true}))); // [ "abc", "abc 2", "abc 10", "casa", "Casa" ]
And of course, it is possible to combine more than one extension into the same identifier (which is equivalent to using the respective options):
let words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];
// consider the numeric value and put uppercase first
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kn-true-kf-upper'))); // [ "abc", "abc 2", "abc 10", "Casa", "casa" ]
words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];
// the same thing, using options
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {caseFirst: 'upper', numeric: true}))); // [ "abc", "abc 2", "abc 10", "Casa", "casa" ]
There are some options for which I could not reproduce the behavior or find a case that makes a difference, but I mention them here anyway.
The extension "co" (Collation type key, quoted here) has several values (such as "big5han", "dict", "direct", "ducet", etc.), which affect the behavior of one or more locales. For example, some cases are described here:
- the value "search" (which in JavaScript, I don't know why, was placed in the usage option) makes the Collator enter a dedicated string search mode. The example cited is the Czech language, in which a search for "a" should never find an "á", but in terms of sorting it makes no difference (I could not write JavaScript code in which it made a difference)
- the value "pinyin" sorts Chinese characters based on their transliteration into Latin characters. As I don't know Chinese, I did not find a good example either (and needless to say, it only affects Chinese locales)
- there are many other values for which my pathetic knowledge of other languages was not enough to find examples. There does not seem to be much detailed documentation on these options, or I simply failed to find it
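For the usage option mentioned in the first item, a common pattern is to build a collator for searching and test entries for equivalence with compare. The sketch below is my own illustration, not the Czech case from the cited text: it sets sensitivity explicitly so the result does not depend on locale-specific search defaults, and the data is made up.

```javascript
// A collator configured for searching rather than sorting.
// sensitivity: 'base' is set explicitly so that accents and case are ignored.
const searcher = new Intl.Collator('pt-BR', { usage: 'search', sensitivity: 'base' });
console.log(searcher.resolvedOptions().usage); // "search"

// Hypothetical data: find every entry equivalent to the search term
const entries = ['sabia', 'sábia', 'sabiá', 'Sabia', 'casa'];
const matches = entries.filter((entry) => searcher.compare(entry, 'sabia') === 0);
console.log(matches); // [ "sabia", "sábia", "sabiá", "Sabia" ]
```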
The localeMatcher option can be "lookup" or "best fit". The lookup matcher follows the algorithm described in BCP 47, which basically tries to find the specified locale and, if it is not available in the system, tries progressively more "generic" ones until it finds one that is available.
For example, suppose I want the locale zh-Hans-CN (Chinese language (zh), with simplified characters (Hans), country code CN - China), but I use the identifier zh-Hans-CN-u-alguma-coisa (assuming "alguma-coisa" is a valid extension). If the extension is not available or not supported, it tries zh-Hans-CN. If this variant is not available, it tries zh-Hans, and if that is not available either, it tries zh (and if even that is not available, the system/browser default locale is used).
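The fallback chain described above can be sketched as follows. The function lookupCandidates is a hypothetical helper of my own, a simplification of the BCP 47 lookup and not the engine's actual implementation; Intl.Collator.supportedLocalesOf is the standard method that tells which identifiers the environment supports.

```javascript
// Simplified, hypothetical helper: lists the identifiers a lookup-style
// matcher would try, from most specific to most generic
function lookupCandidates(tag) {
  // Set aside the Unicode extension ("-u-..."), which is negotiated separately
  const base = tag.replace(/-u-.*$/, '');
  const subtags = base.split('-');
  const candidates = [];
  while (subtags.length > 0) {
    candidates.push(subtags.join('-'));
    subtags.pop(); // drop the last subtag and try a more generic identifier
  }
  return candidates;
}

console.log(lookupCandidates('zh-Hans-CN-u-kf-upper')); // [ "zh-Hans-CN", "zh-Hans", "zh" ]
console.log(lookupCandidates('pt-BR'));                 // [ "pt-BR", "pt" ]

// Which of the candidates does this environment support for collation?
// (the output depends on the system/browser)
console.log(Intl.Collator.supportedLocalesOf(lookupCandidates('zh-Hans-CN-u-kf-upper')));
```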
The best fit matcher, on the other hand, can look for a more suitable variant. The only concrete example I found was in this article, which describes the case of es-GT (Spanish spoken in Guatemala). If it were not available and I used lookup, only es (Spanish) would be returned. But best fit could return es-MX (Spanish spoken in Mexico). I ran this test, but it did not work for me and the locale es was returned:
let s = 'es-GT';
console.log(Intl.Collator(s, {localeMatcher: 'best fit'}).resolvedOptions().locale); // es
console.log(Intl.Collator(s, {localeMatcher: 'lookup'}).resolvedOptions().locale); // es
The other options and parameters have been detailed in another answer, and I do not want to repeat everything here.
Finally, it is worth remembering that if no locale is specified (for example, string.localeCompare(outraString)), the default locale configured in the system/browser/environment is used (which, in turn, varies according to the implementation).
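A quick way to see which default applies in a given environment, and to pass options without choosing a locale (a sketch; the variable names are illustrative):

```javascript
// Which locale does this environment use when none is specified?
const defaultLocale = new Intl.Collator().resolvedOptions().locale;
console.log(defaultLocale); // depends on the system/browser/environment

// Passing undefined as the locale keeps the default while still allowing options
const byNumber = '2'.localeCompare('10', undefined, { numeric: true });
console.log(byNumber < 0); // true: 2 comes before 10 when compared numerically
```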
Regarding localeMatcher, don't the values "best fit" and "lookup" do exactly the same thing? "lookup" searches for the most suitable language tag following the BCP 47 standard, and "best fit" is the default and also looks for the most suitable one. Wouldn't that be the same thing? Sources: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl#Locale_negotiation – felipe cardozo
And it does not even explain the example from my question, that is, what this code means: console.log("2".localeCompare("10", "en-u-kn-true")); or what happens when comparing words, as in 'check'.localeCompare('against'); – felipe cardozo
The locales argument also accepts other values that I did not understand, such as co, kn and kf. Sources: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Collator/Collator#Parameters – felipe cardozo