How does the localeCompare() method work?

Question

Asked 5 years, 4 months ago

Viewed 1,239 times

13

The only thing I know about this method is that it compares strings to determine which comes first, which comes after, or whether they are equal, and then returns a number representing the result: -1, 1 or 0 (or, depending on the browser, other values such as -2 or 2).

That is easy to understand when I compare, for example, "a" and "c" to see which comes first. But what if I compare words with several letters: what exactly does it compare?

var test = 'JavaScript';

console.log(test.localeCompare('abc'));

I also don’t understand which arguments, and how many, I can pass to localeCompare(), nor what they mean. For example:

console.log(test.localeCompare('a', 'de', { sensitivity: 'base' }));
console.log("2".localeCompare("10", undefined, {numeric: true}));
console.log("2".localeCompare("10", "en-u-kn-true"));

I’ve read the MDN references for localeCompare() and Collator(), but to be honest I didn’t understand what they say or their examples. Could someone explain, in an easy-to-understand way, how this method works?

2 answers

8

Basically, what localeCompare and Intl.Collator offer is a way to compare strings taking into account specific rules other than "standard comparison". It is often said that both serve to consider alphabetical order according to a specific language, but in fact they go a little further.

The "standard comparison" of strings is made with the operators > and <, and is described in detail here. Basically, it follows the order of the UTF-16 code units of each character, which for most characters matches the order of their Unicode code points (to know what a code point is, read here).
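To make the difference concrete, here is a small sketch comparing the two approaches. The sample strings are my own and are only meant as an illustration:

```javascript
// The default sort (like '<') compares UTF-16 code units: 'C' (67) comes before 'a' (97).
const byCodeUnit = ['b', 'a', 'C'].sort();

// localeCompare follows the alphabetical rules of the given locale instead.
const byLocale = ['b', 'a', 'C'].sort((x, y) => x.localeCompare(y, 'en'));

console.log('Z' < 'a');  // true
console.log(byCodeUnit); // [ 'C', 'a', 'b' ]
console.log(byLocale);   // [ 'a', 'b', 'C' ]
```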

But this "standard" order is not sufficient for all use cases. Many languages, even when they use the same characters, have different rules for ordering them. For example, German alphabetical order places the character ä before z, but in Swedish it is the other way around.

console.log('ä'.localeCompare('z', 'de')); // -1 (or some other negative value)
console.log('ä'.localeCompare('z', 'sv')); // 1 (or some other positive value)

Note the second parameter: it indicates the locale to be used. Many people summarize the locale as just a "language", but it is actually a set of parameters that can define the language, region, variants/dialects and rules for alphabetical ordering, plus the format of dates, numbers, monetary values, etc. All of this is condensed into an identifier. In the example above, I used the identifiers de and sv, indicating respectively the German and Swedish languages (these codes are defined by ISO 639). Below we will see other, more complex identifier options.

The return in the first case was -1, and in the second it was 1 (tested in Chrome; other browsers may return other values). When the return is a negative number, it indicates that the character ä is "smaller" than z (that is, in a sort, ä would come before z). When the return is positive, it indicates that it is "greater" (in a sort, ä would come after z), and when it is zero, it means they are "equal" (i.e., in a sort, they would be considered equivalent).

It is important to note that the language specification only says that the returned value must be positive, negative or zero (i.e., it is not guaranteed to always be -1 or 1).
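Since only the sign is guaranteed, code that needs exactly -1, 0 or 1 can normalize the result. A minimal sketch, assuming a helper named compareSign (the name is hypothetical, not part of any API):

```javascript
// Hypothetical helper: collapses any negative/positive result to -1/1.
function compareSign(a, b, locale) {
  return Math.sign(a.localeCompare(b, locale));
}

console.log(compareSign('a', 'b', 'en')); // -1
console.log(compareSign('b', 'a', 'en')); // 1
console.log(compareSign('a', 'a', 'en')); // 0
```

In most cases it is simpler to test the result with < 0, > 0 or === 0 and never compare it against -1 or 1 directly.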

Example of use to sort a string array:

let words = ['teste', 'äbc', 'zebra'];

// sort the words according to the rules of the German language
console.log(words.sort((a, b) => a.localeCompare(b, 'de'))); // [ "äbc", "teste", "zebra" ]

// sort the words according to the rules of the Swedish language
console.log(words.sort((a, b) => a.localeCompare(b, 'sv'))); // [ "teste", "zebra", "äbc" ]

// using Collator
let alemao = new Intl.Collator('de');
console.log(words.sort(alemao.compare)); // [ "äbc", "teste", "zebra" ]

let sueco = new Intl.Collator('sv');
console.log(words.sort(sueco.compare)); // [ "teste", "zebra", "äbc" ]

Note in the example above that using the compare method of an Intl.Collator has the same effect as using localeCompare. But according to the documentation, using a Collator performs better when you need to do many comparisons at once (for example, when sorting an array of strings). Apart from that detail, basically everything I say about localeCompare also goes for Intl.Collator.
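The usual pattern for many comparisons is to build the collator once and reuse its compare function. A short sketch (the extra word 'abc' is my own addition to the sample data):

```javascript
// Build the collator once and reuse its compare function for every comparison.
const german = new Intl.Collator('de');
const names = ['zebra', 'teste', 'äbc', 'abc'];

// compare is already bound to the collator, so it can be passed to sort directly.
const sorted = [...names].sort(german.compare);

console.log(sorted); // [ 'abc', 'äbc', 'teste', 'zebra' ]
```

Copying the array with the spread operator keeps the original untouched, since sort works in place.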

Remember that the ordering rules are not limited to a "letter to letter" comparison. In Slovak, for example, the digraph "ch" is placed after the "h" in alphabetical order:

let words = ['chave', 'casa', 'hoje'];

// in Slovak, "ch" comes after "h"
console.log(words.sort((a, b) => a.localeCompare(b, 'sk-SK'))); // ["casa", "hoje", "chave"]

// in Portuguese, "normal" order
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // ["casa", "chave", "hoje"]

Note: I did not use Slovak words :-)

Also note that the identifiers now include the country code (as defined by ISO 3166). Here we have sk-SK (sk is the Slovak language code and SK is the code for Slovakia) and pt-BR (pt is the code for the Portuguese language and BR is the code for Brazil). This makes the locale more specific, since there may be variations, such as pt-PT (Portuguese from Portugal).

This does not always influence the behavior of localeCompare (pt-BR and pt-PT have the same alphabetical ordering rules), but there are other aspects in which a variant can make a difference. For example, en-US (American English) and en-GB (British English) have the same alphabetical ordering rules but different date formats (respectively, month/day/year and day/month/year). That is, for localeCompare it would make no difference.

In addition to the locale, it is possible to pass a number of options that override the behavior of the locale. For example:

let words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// locale default: uppercase after lowercase, accented letters after unaccented ones
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // ["a", "casa", "Casa", "sabia", "sabiá", "sábia"]

words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// putting uppercase before lowercase (the other rules remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {caseFirst: 'upper'}))); // ["a", "Casa", "casa", "sabia", "sabiá", "sábia"]

words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// accents make no difference (the "uppercase after" rule remains)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {sensitivity: 'case'}))); // ["a", "casa", "Casa", "sábia", "sabia", "sabiá"]

I re-initialized the array before each call because sort modifies the array itself, and I wanted to show in the third case that "sábia", "sabia" and "sabiá" remain in the same order (provided you test in a browser whose sort is stable; I tested in Chrome 81, but Chrome has implemented stable sorting since version 70, as required by the ES2019 specification).

Anyway, note that if I use only the rules of the pt-BR locale, it considers that uppercase letters come after lowercase ones (so "casa" comes before "Casa"), and accented letters come after unaccented ones.

But using the options I can override this behavior. For example, with caseFirst: 'upper' I say that uppercase letters must come first. The other locale rule (about accented letters) remains.

In the third example, sensitivity: 'case' considers letters with or without an accent to be equal (in fact it applies the "base letters" rule, which the other answer already illustrated). So the "uppercase after" rule was maintained, and the other words ("sábia", "sabia" and "sabiá") did not change position because they were considered "equal" (remember that I tested in Chrome, which already implements stable sorting; in other browsers the order may change).

It is also possible to use some Unicode extensions in the locale identifier. This is indicated by the "-u" suffix, followed by the options (the full list can be found here, and a more detailed XML, here).

One of them is kf, the "Collation parameter key for ordering by case". That is, it has the same functionality as the caseFirst option:

let words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// putting uppercase before lowercase (the other rules remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kf-upper'))); // ["a", "Casa", "casa", "sabia", "sabiá", "sábia"]

words = ['a', 'Casa', 'casa', 'sábia', 'sabia', 'sabiá'];
// options override the Unicode extension (that is, here it will be "lowercase first")
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kf-upper', {caseFirst: 'lower'}))); // ["a", "casa", "Casa", "sabia", "sabiá", "sábia"]

The syntax may seem confusing, but here’s the deal:

pt-BR: locale code (with language and country)
the "-u" suffix indicates that Unicode extensions follow
"kf" is the extension itself, followed by its value (the XML cited above lists the values of each extension, although not all of them are supported by JavaScript).

In this case, the value of "kf" is "upper", hence the full locale code is pt-BR-u-kf-upper. Note that it has the same effect as using the option caseFirst: 'upper'. However, if I also pass the option, the option takes precedence (behavior described in the documentation). So in the second case, caseFirst: 'lower' overrode kf-upper and the "lowercase first" rule was applied.
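One way to check which setting won is to ask the collator itself through resolvedOptions(). This is only a sketch: it assumes the engine supports caseFirst and has data for pt-BR, in which case I would expect "upper" for the first collator and "lower" for the second.

```javascript
// resolvedOptions() reports the settings the collator actually ended up with.
const fromExtension = new Intl.Collator('pt-BR-u-kf-upper');
const withOption = new Intl.Collator('pt-BR-u-kf-upper', { caseFirst: 'lower' });

// Assumption: the runtime supports caseFirst and the pt-BR locale.
console.log(fromExtension.resolvedOptions().caseFirst);
console.log(withOption.resolvedOptions().caseFirst);
```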

As for kn ("Collation parameter key for numeric handling"), it is equivalent to the numeric option, which indicates whether strings containing digits should be compared by their numerical value. The default is to consider the string '10' smaller than '2', because the digits 1 and 2 are just characters and, in a lexicographic comparison, '10' comes before '2'. But if we consider the numerical value, then 2 must come before 10. Example:

let words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];
// the default is lexicographic comparison
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR'))); // [ "abc", "abc 10", "abc 2", "casa", "Casa" ]

// the "kn" extension with the value "true" takes the numerical value into account (the other rules, such as "uppercase after", remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kn-true'))); // [ "abc", "abc 2", "abc 10", "casa", "Casa" ]

// the "numeric" option with the value "true", equivalent to "kn-true" (the other rules, such as "uppercase after", remain)
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {numeric: true}))); // [ "abc", "abc 2", "abc 10", "casa", "Casa" ]
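A typical use of this option is a "natural" sort of labels that end in a number. A small sketch with sample data of my own:

```javascript
const items = ['item 10', 'item 2', 'item 1'];

// Lexicographic: '1' comes before '2', so 'item 10' lands before 'item 2'.
const plain = [...items].sort((a, b) => a.localeCompare(b, 'en'));

// Numeric: runs of digits are compared by their value.
const natural = [...items].sort((a, b) => a.localeCompare(b, 'en', { numeric: true }));

console.log(plain);   // [ 'item 1', 'item 10', 'item 2' ]
console.log(natural); // [ 'item 1', 'item 2', 'item 10' ]
```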

And of course, it is possible to combine more than one extension into the same identifier (which is equivalent to using the respective options):

let words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];

// consider the numerical value and put uppercase first
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR-u-kn-true-kf-upper'))); // [ "abc", "abc 2", "abc 10", "Casa", "casa" ]

words = ['abc', 'abc 10', 'abc 2', 'Casa', 'casa'];
// the same thing, using options
console.log(words.sort((a, b) => a.localeCompare(b, 'pt-BR', {caseFirst: 'upper', numeric: true}))); // [ "abc", "abc 2", "abc 10", "Casa", "casa" ]

There are some options that I could not reproduce, or for which I could not find a case that makes a difference, but I leave them mentioned here.

The extension "co" (Collation type key, mentioned here) has several values (such as "big5han", "dict", "direct", "ducet", etc.), which affect the behavior of one or more locales. For example, some cases are described here:

the value "search" (which in JavaScript, I don’t know why, was put in the usage option) makes the Collator enter a dedicated string-search mode. The example cited is the Czech language, in which a search for "a" should never find an "á", even though for sorting it makes no difference (I couldn’t write JavaScript code that showed a difference)
the value "pinyin" sorts Chinese characters based on their transliteration into Latin characters. As I don’t know Chinese, I couldn’t come up with a good example either (and, needless to say, it only affects Chinese locales)
there are many other values, for which my pathetic knowledge of other languages was not enough to find examples. There does not seem to be much detailed documentation on these options (or I was just not competent enough to find it)

The localeMatcher option can be lookup or best fit. lookup follows the algorithm described in BCP 47, which basically tries to find the requested locale and, if it is not available on the system, tries progressively more "generic" ones until it finds one that is available.

For example, suppose I ask for the locale zh-Hans-CN (Chinese language (zh), with simplified characters (Hans), country code CN for China), but using the identifier zh-Hans-CN-u-alguma-coisa (assuming "alguma-coisa", i.e. "something", is a valid extension). If the extension is not available or not supported, it tries zh-Hans-CN. If that is not available, it tries zh-Hans, and then zh (and if even that is not available, the default locale of the system/browser is used).

best fit, on the other hand, can look for a more suitable variant. The only concrete example I found was in this article, which describes the case of es-GT (Spanish spoken in Guatemala). If it were not available and I used lookup, only es (Spanish) would be returned. But best fit could return es-MX (Spanish spoken in Mexico). I ran this test, but it did not work for me and the locale es was returned:

let s = 'es-GT';
console.log(Intl.Collator(s, {localeMatcher: 'best fit'}).resolvedOptions().locale); // es
console.log(Intl.Collator(s, {localeMatcher: 'lookup'}).resolvedOptions().locale); // es
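A related check, which I am adding only as a sketch, is Intl.Collator.supportedLocalesOf. It returns the subset of the requested tags that the runtime can serve without falling back to the default locale. The list of requested tags below is my own, and the output depends on the environment, so no expected result is shown:

```javascript
// Ask which of these tags the runtime supports for collation.
const requested = ['es-GT', 'es', 'pt-BR'];
const supported = Intl.Collator.supportedLocalesOf(requested, { localeMatcher: 'lookup' });

console.log(supported); // varies by browser/runtime
```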

The other options and parameters have been detailed in the other answer, and I do not want to repeat everything here.

Finally, it is worth remembering that if no locale is given (for example, string.localeCompare(outraString)), the default locale configured in the system/browser/environment is used (which, in turn, varies according to the implementation).
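To find out which default locale a given environment resolves to, a collator created without arguments can be inspected. A minimal sketch:

```javascript
// With no locale argument, the collator resolves to the runtime's default locale.
const defaultLocale = new Intl.Collator().resolvedOptions().locale;

console.log(defaultLocale); // depends on the system/browser/environment
```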

by Samir Braga • **10,016** points · Answer 1 · 2020-04-12T18:09:15+00:00

Let’s go to the definition of the method:

interface String {
    /**
     * Determines whether two strings are equivalent in the current or specified locale.
     * @param that String to compare to target string
     * @param locales A locale string or array of locale strings that contain one or more language or locale tags. If you include more than one locale string, list them in descending order of priority so that the first entry is the preferred locale. If you omit this parameter, the default locale of the JavaScript runtime is used. This parameter must conform to BCP 47 standards; see the Intl.Collator object for details.
     * @param options An object that contains one or more properties that specify comparison options. see the Intl.Collator object for details.
     */
    localeCompare(that: string, locales?: string | string[], options?: Intl.CollatorOptions): number;
}

Determines whether two strings are equivalent in the current or specified locale.

As we can see, this method takes the locale (language) into account, given in the second argument by means of language tags, to determine whether a given string comes before, after, or at the same position as another in alphabetical sort order. This is because each language has its own alphabet, and each alphabet has its own ordering.

A number is returned, which may be:

Negative: if the string on which the method is called comes before the compared string in the defined sort order.
Positive: if the string on which the method is called comes after the compared string in the defined sort order.
Zero: if the string on which the method is called is equivalent to the compared string in the defined sort order.

Now, the options of the third argument:

interface CollatorOptions {
    usage?: string;
    localeMatcher?: string;
    numeric?: boolean;
    caseFirst?: string;
    sensitivity?: string;
    ignorePunctuation?: boolean;
}

Case First

Options: "upper" or "lower". Determines which of these "cases" will be sorted first. When not given, the criteria adopted by the language are used.

Example

const items = ['Português', 'português'];

console.log(items.sort((a, b) => a.localeCompare(b, 'pt-BR', { caseFirst: 'lower' })))
console.log(items.sort((a, b) => a.localeCompare(b, 'pt-BR', { caseFirst: 'upper' })))

Ignore Punctuation

Determines whether or not punctuation should be considered in the ordering. Generally (I do not know enough to generalize), punctuation characters are sorted first when considered.

Various conventions also exist for the handling of strings containing spaces, modified letters (such as those with diacritics), and non-letter characters such as marks of punctuation.

Source: https://en.wikipedia.org/wiki/Alphabetical_order

Example

const items = ['Ele disse', '"olá mundo!"'];

console.log(items.sort((a, b) => a.localeCompare(b, 'pt-BR', { ignorePunctuation: false })))
console.log(items.sort((a, b) => a.localeCompare(b, 'pt-BR', { ignorePunctuation: true })))
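The effect is easier to see in a direct comparison than in a sort. A small sketch with strings of my own, assuming an ICU-based engine such as V8, where ignored punctuation no longer distinguishes the strings:

```javascript
// With punctuation considered, the hyphen makes the strings different.
const strict = 'a-b'.localeCompare('ab', 'en', { ignorePunctuation: false });

// With punctuation ignored, only the letters are compared.
const relaxed = 'a-b'.localeCompare('ab', 'en', { ignorePunctuation: true });

console.log(strict !== 0);  // true
console.log(relaxed === 0); // true
```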

Numeric

Determines whether the sort will take into account numerical criteria or not. In the example below, the string "30" comes before the string "8" when it is not using the numeric criterion, since the comparison is made character by character as in a word.

Example

const numbers = ['8', '30'];

console.log(numbers.sort((a, b) => a.localeCompare(b, 'pt-BR', { numeric: false })))
console.log(numbers.sort((a, b) => a.localeCompare(b, 'pt-BR', { numeric: true })))

Locale Matcher

Determines which algorithm is used to match the requested language tag against the locales available in the runtime. Values can be:

lookup: if there is no exact match for the requested language tag, it falls back to a tag that matches part of it.
best fit: aims to return a result at least as good as lookup's, and possibly a better one. (I did not find a good reference for this algorithm.)

I couldn’t find a good example for this option. I accept suggestions.

Sensitivity

Options:

base - strings that do not have the same base letters are not considered equal.
accent - strings that do not have the same base letters or accents are not considered equal.
case - strings that do not have the same base letters or "case" are not considered equal.
variant - strings that do not have the same base letters, accents or case are not considered equal (default).

Example

/**
 * In Portuguese, "a" and "á" are the same letter.
 * The only difference is the accent.
 * "a" is the base letter of "á"
 */
 
// They have the same base letter
console.log("base: ", 'á'.localeCompare('a', 'pt-BR', { sensitivity: 'base' }))

// But not the same accents
console.log("accent: ", 'á'.localeCompare('a', 'pt-BR', { sensitivity: 'accent' }))

console.log("base: ", 'A'.localeCompare('a', 'pt-BR', { sensitivity: 'base' }))

// In this one, the "case" is not the same either
console.log("case: ", 'A'.localeCompare('a', 'pt-BR', { sensitivity: 'case' }))
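The four sensitivity levels can be summarized by checking which pairs compare as equal (result 0). A sketch, assuming a helper named same (hypothetical, defined here only for the illustration):

```javascript
// Hypothetical helper: true when two strings are "equal" at the given sensitivity.
const same = (a, b, sensitivity) =>
  a.localeCompare(b, 'pt-BR', { sensitivity }) === 0;

console.log(same('á', 'a', 'base'));    // true  (accents ignored)
console.log(same('A', 'a', 'base'));    // true  (case ignored)
console.log(same('á', 'a', 'accent'));  // false (accents count)
console.log(same('A', 'a', 'accent'));  // true  (case ignored)
console.log(same('á', 'a', 'case'));    // true  (accents ignored)
console.log(same('A', 'a', 'case'));    // false (case counts)
console.log(same('á', 'a', 'variant')); // false
console.log(same('A', 'a', 'variant')); // false
```

A comparison with sensitivity: 'base' is a common way to do a case-insensitive and accent-insensitive equality check.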

Usage

Options: "sort" and "search". Defines whether the comparison is for sorting or for searching purposes. I also couldn’t find good examples to illustrate the difference.
