How does String.prototype.normalize work in JavaScript?

I was reading an answer here on the site and came across the String.prototype.normalize method in its second code example.

I had come across this method before, but honestly I never understood how it works, since I found the documentation very hard to follow.

So the question is: what is it for, and how should it be used? Also, how do the different arguments that can be passed (NFC, NFD, NFKC, NFKD) change the output?

  • 2

    Why the negative?

1 answer



The NFC, NFD, NFKC and NFKD normalization forms are defined by Unicode.

In short, some characters can be represented in more than one way. For example, á (the letter a with an acute accent) can be represented in two ways according to Unicode:

  1. composed - as the single code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE) (á)
  2. decomposed - as a combination of two code points (in this order):
       • U+0061 (LATIN SMALL LETTER A)
       • U+0301 (COMBINING ACUTE ACCENT)

The first form is called NFC (Canonical Composition), and the second, NFD (Canonical Decomposition). The two forms above are considered "canonically equivalent" when it comes to representing the letter a with an acute accent. That is, they are two ways of representing the same thing.

The acute accent (U+0301) in this case is one of the so-called Combining Diacritical Marks (or combining characters): characters that can be combined with others (such as the accents used in Portuguese). They always appear after the character they apply to (in the example above, after the a), and if there is more than one, normalization always returns them in a predefined order (there are rules that define their relative order).

The catch is that, just by looking at a text (depending on the font used and the characters involved), there is no way to know whether it is in NFD or NFC, since both produce the same symbol (in this case, "á"). It is also worth remembering that not every accented character has a composed form; some can only be represented in decomposed form (a base letter followed by one or more combining characters).
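
As a minimal sketch of the two canonical forms, the snippet below converts "á" between NFC and NFD and prints the code points. The helper name toHex is only illustrative, and the strings are written with escapes so that the form of each one is unambiguous.

```javascript
// Hypothetical helper: list the code points of a string as hex strings.
function toHex(s) {
  return Array.from(s, c => c.codePointAt(0).toString(16));
}

const composed = '\u00e1';    // "á" as a single code point (NFC)
const decomposed = 'a\u0301'; // "a" followed by COMBINING ACUTE ACCENT (NFD)

console.log(toHex(composed.normalize('NFD')));   // [ '61', '301' ]
console.log(toHex(decomposed.normalize('NFC'))); // [ 'e1' ]

// The strings render the same but are different sequences of code points.
console.log(composed === decomposed);                  // false
console.log(composed.normalize('NFD') === decomposed); // true

// Calling normalize() without an argument uses NFC.
console.log(decomposed.normalize() === composed);      // true
```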


There are also the NFKD (Compatibility Decomposition) and NFKC (Compatibility Composition) forms, which are based on the concept that there are characters that are "compatible", but not canonically equivalent.

One example is the Letterlike Symbols block: characters that look like letters but are not exactly the letters themselves. For example, DOUBLE-STRUCK CAPITAL H: ℍ.

The code point of this character is U+210D, and when normalized to NFKD or NFKC it becomes the uppercase letter "H" (code point U+0048):

let str = String.fromCodePoint(0x210d);

// print the string and its code point

console.log(str); // ℍ
console.log(str.codePointAt(0).toString(16)); // 210d

console.log(str.normalize('NFKC')); // H
console.log(str.normalize('NFKC').codePointAt(0).toString(16)); // 48

console.log(str.normalize('NFKD')); // H
console.log(str.normalize('NFKD').codePointAt(0).toString(16)); // 48

This conversion cannot be reverted from "H" back to "ℍ", since several other characters also become "H" in the NFKD and NFKC forms.
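
To illustrate why the conversion is one-way, here is a small sketch with a few code points that all normalize to the plain letter "H" under NFKC. The list is just a sample chosen for illustration, not an exhaustive one.

```javascript
// A sample of code points whose compatibility mapping is the letter "H":
// U+210B SCRIPT CAPITAL H, U+210C BLACK-LETTER CAPITAL H,
// U+210D DOUBLE-STRUCK CAPITAL H, U+24BD CIRCLED LATIN CAPITAL LETTER H,
// U+FF28 FULLWIDTH LATIN CAPITAL LETTER H
const lookalikes = [0x210b, 0x210c, 0x210d, 0x24bd, 0xff28];

const normalized = lookalikes.map(cp => String.fromCodePoint(cp).normalize('NFKC'));

console.log(normalized); // [ 'H', 'H', 'H', 'H', 'H' ]
```

After normalization there is no trace of which of these characters the "H" came from.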

In the example above, both forms resulted in the same character, "H". But if the resulting character has diacritical marks, NFKC returns it in composed form (with the accent, if a corresponding code point exists, just as NFC does), while NFKD returns it decomposed (with the base character separated from the combining characters, just as NFD does). Example:

// convert a string into an array of code points (as hex)
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }

// ANGSTROM SIGN - https://www.fileformat.info/info/unicode/char/212b/index.htm
let str = String.fromCodePoint(0x212b);
console.log(str); // Å
console.log(codepoints(str)); // [ 212b ]

console.log(str.normalize('NFKC')); // Å
console.log(codepoints(str.normalize('NFKC'))); // [ c5 ]

console.log(str.normalize('NFKD')); // Å
console.log(codepoints(str.normalize('NFKD'))); // [ 41, 30a ] 

In the example above, I used the character ANGSTROM SIGN (code point U+212B), which is basically the letter "A" with a "ball" (ring) on top: Å.

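This character is equivalent to the "letter A with the ball" (LATIN CAPITAL LETTER A WITH RING ABOVE), which in turn has two forms:

  1. composed (NFC) - the code point U+00C5
  2. decomposed (NFD) - the letter A (U+0041) followed by COMBINING RING ABOVE (U+030A)

Strictly speaking, Unicode defines the mapping from ANGSTROM SIGN to U+00C5 as a canonical one, so NFC and NFD apply it as well; the example still shows that NFKC composes the result and NFKD decomposes it.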

Depending on the font, all 3 options (the ANGSTROM SIGN character, the letter A with "ball" in NFC and NFD) can be displayed in the same way (some fonts may have a slightly different symbol for Angstrom, for example, but this behavior varies widely).

Roughly speaking, the NFC and NFD forms do not change the "essence" of the characters involved (since the results are canonically equivalent). The NFKC and NFKD forms do change this "essence", since they produce different characters, and in a one-way fashion (the mapping is many-to-one: several other letterlike characters can also become an "A" when normalized to NFKC or NFKD).
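
A quick way to check whether a mapping is canonical or compatibility-only is to compare what NFC and NFKC do to the same character. The sketch below uses illustrative variable names. One detail worth knowing: Unicode defines the mapping from ANGSTROM SIGN (U+212B) to U+00C5 as canonical, so NFC changes it too, unlike ℍ.

```javascript
const doubleStruckH = '\u210d'; // ℍ DOUBLE-STRUCK CAPITAL H
const angstrom = '\u212b';      // Å ANGSTROM SIGN

// NFC leaves ℍ untouched: its mapping to "H" is compatibility-only.
console.log(doubleStruckH.normalize('NFC') === doubleStruckH); // true
console.log(doubleStruckH.normalize('NFKC'));                  // H

// ANGSTROM SIGN changes even under NFC, because its mapping is canonical.
console.log(angstrom.normalize('NFC').codePointAt(0).toString(16)); // c5
```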

In addition, the NFKC and NFKD forms can change the meaning of a text. For example:

let str = '3' + String.fromCodePoint(0xb2);
console.log(str); // 3²
console.log(str.normalize('NFKC')); // 32

I used the digit 3 followed by SUPERSCRIPT TWO (U+00B2), so the string corresponds to "3²" (three squared). But in NFKC the character ² becomes the digit 2, so the normalized string becomes "32", which means something quite different from the original.


One of the uses of NFD normalization is to remove accents (actually, any combining characters) from a string, as was done in the answer you linked and in this other one as well.
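
One common way to do this (not necessarily the exact code from the linked answers) is to decompose with NFD and then delete every combining mark. The function name removeAccents is hypothetical, and the sketch assumes an engine that supports Unicode property escapes (\p{...}) in regular expressions.

```javascript
// Hypothetical helper: decompose to NFD, then strip all combining marks.
function removeAccents(s) {
  // \p{M} matches any combining mark; the "u" flag enables \p{...} escapes.
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(removeAccents('sabi\u00e1'));                      // sabia
console.log(removeAccents('S\u00e3o Paulo, ma\u00e7\u00e3'));  // Sao Paulo, maca
```

Letters that have no canonical decomposition (such as ø) are left unchanged by this approach.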

The NFC form can help when reversing strings, as I explain in this answer. If the string is in NFD and I simply reverse the order of the code points, the combining character ends up before the character it was applied to, and gets applied to a different character instead.
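
The effect can be seen in this small sketch. The helper reverseCodePoints is a deliberately naive, hypothetical function that reverses code point by code point.

```javascript
// Naive reversal: Array.from splits the string into code points.
function reverseCodePoints(s) {
  return Array.from(s).reverse().join('');
}

const nfd = 'sabia\u0301'; // "sabiá" in NFD

// Reversing the NFD string puts the combining accent first,
// so it no longer follows the "a" it belonged to.
const reversedNfd = reverseCodePoints(nfd);
console.log(reversedNfd.codePointAt(0).toString(16)); // 301

// Normalizing to NFC first keeps the accented letter as one code point.
const reversedNfc = reverseCodePoints(nfd.normalize('NFC'));
console.log(reversedNfc === '\u00e1ibas'); // true
```

This only helps when a composed form exists; sequences that stay decomposed even in NFC would still be broken by a naive reversal.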

Another use is sorting strings alphabetically or searching: by normalizing all terms to the same form, you avoid variations of the same characters, which simplifies the respective algorithms.

The NFKC and NFKD forms are used, among other things, to normalize ligatures, such as the character LATIN SMALL LIGATURE FF (U+FB00): ﬀ.

It looks like two letters "f" together, but it is a single character. When normalized to NFKC or NFKD, it becomes two characters "f" (U+0066 - LATIN SMALL LETTER F):

let str = String.fromCodePoint(0xfb00);
console.log(str); // ﬀ
console.log(str.normalize('NFKC')); // ff

The ﬀ ligature and the two letters "f" are not considered canonically equivalent, only "compatible". In this case, it is accepted that they may look different (which does not happen with the accented a, which looks the same in both NFC and NFD), although they may have the same meaning, depending on the context. Here, normalization also helps with sorting and searching: imagine that the text contains ﬀ but a user searches for ff because they do not know how to type ﬀ on their keyboard.
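
A minimal sketch of that search scenario follows. The function includesNormalized is a hypothetical helper that normalizes both the text and the search term to NFKC before comparing; it does not handle case differences.

```javascript
// Hypothetical helper: compare in NFKC so compatibility characters match.
function includesNormalized(text, term) {
  return text.normalize('NFKC').includes(term.normalize('NFKC'));
}

const text = 'o\ufb00ice'; // "office" written with the ff ligature (U+FB00)

console.log(text.includes('ff'));            // false
console.log(includesNormalized(text, 'ff')); // true
```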

Many characters similar to ﬀ were added to Unicode for compatibility with older character sets that already had them. Thus, mappings were also created between these characters and their respective NFKC and NFKD forms.


Also, the different forms can affect the behavior of your program, depending on how you work with the strings:

let s1 = 'sabi\u00e1';  // "sabiá" in NFC
let s2 = 'sabia\u0301'; // "sabiá" in NFD

// one is in NFC, the other in NFD, so they are different
console.log(s1 == s2); // false

// after normalizing, both are equal
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true

// convert a string into an array of code points (as hex)
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }

// printing the code points makes the difference visible
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]

In the example above, we have the same string in NFC and in NFD. You cannot tell the difference just by looking at the strings, since both are rendered the same way. Note that the comparison with == returned false, and only after normalizing both to the same form do they become equal.

This problem can occur when the user types a string (which may have been copied and pasted from somewhere else where it was in NFD, without the user noticing, since there is no visual difference) and you compare it to another string in your code that is in a different form. That is, if s1 is what the user typed and s2 is a string hardcoded in your code, and one is in NFC and the other in NFD, the comparison may fail. Even printing s1 with console.log would not reveal the problem, since it would be shown as "sabiá" regardless of being in NFC or NFD.


Another case where this can make a difference is with regular expressions. For example, if I want to check for strings with exactly 5 characters:

let s = 'sabiá';
let r = /^.{5}$/; // contains exactly 5 characters

console.log(r.test(s.normalize('NFC'))); // true
console.log(r.test(s.normalize('NFD'))); // false

It is often said that in a regex the dot matches any character (except line breaks), but in fact it matches a code point (in JavaScript, without the u flag, a UTF-16 code unit, which is the same thing for the characters in this example). Since the NFD string has 6 code points (the "á" is decomposed into two), the regex does not match in that case. For these cases, some languages/engines support the shortcut \X, which matches a grapheme cluster (that is, the accented a is treated as a "single thing", regardless of being in NFC or NFD, and always matches \X). Unfortunately, JavaScript regex does not support this shortcut, so in that case the solution is to normalize. This question has more information about what a grapheme cluster is.
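
If normalizing is not an option, a rough stand-in for \X can be built from Unicode property escapes: one non-mark code point followed by any number of combining marks. This is only an approximation (it does not cover emoji sequences or other complex grapheme clusters), and it assumes an engine that supports \p{...} with the u flag. The variable names are illustrative.

```javascript
// Approximation of "exactly 5 grapheme clusters":
// each cluster is a non-mark code point plus any following combining marks.
const fiveClusters = /^(?:\P{M}\p{M}*){5}$/u;

const s = 'sabi\u00e1';

console.log(fiveClusters.test(s.normalize('NFC'))); // true
console.log(fiveClusters.test(s.normalize('NFD'))); // true

// For comparison, the plain dot counts the combining accent separately.
console.log(/^.{5}$/.test(s.normalize('NFD')));     // false
```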

Remember that regex problems are not restricted to the dot. For example, if I have the regex [áéíóú] to find accented letters, written in NFC, but the string being checked is in NFD, no match will be found, as happened in this question.
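
A short sketch of this situation, with the character class written using escapes so that it is certain to be in NFC (the variable names are illustrative):

```javascript
// Same as /[áéíóú]/ with every letter in composed (NFC) form.
const accented = /[\u00e1\u00e9\u00ed\u00f3\u00fa]/;

const input = 'sabia\u0301'; // "sabiá" in NFD

console.log(accented.test(input));                  // false
console.log(accented.test(input.normalize('NFC'))); // true
```

Normalizing the input to the same form used in the regex is enough to make the match work.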

You can read more about these problems involving JavaScript and Unicode in this article.


To learn more about Unicode and other related terms, see:

Finally, it is worth remembering that the normalization rules are defined by Unicode and are not specific to JavaScript. Many other languages implement normalization, for example Python, Java, C# and Ruby.

  • 1

    @Luizfelipe I updated the answer with a few more things..

  • 1

    I learned a lot.
