You probably already know what a Unicode code point is (if you don't, read here). In short, every character (emojis included) has an associated numeric value, called a code point.
In the case of the 😀 emoji, the code point is U+1F600 ("GRINNING FACE").
Internally, JavaScript represents code points above U+FFFF as surrogate pairs (this link describes the algorithm). Basically, the code point U+1F600 is "broken"/decomposed into two values, 0xD83D and 0xDE00, because JavaScript strings are sequences of UTF-16 code units.
const str = '😀';
console.log(str.codePointAt(0).toString(16)); // 1f600
console.log(str.charCodeAt(0).toString(16)); // d83d
console.log(str.charCodeAt(1).toString(16)); // de00
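The decomposition above can be reproduced by hand. The following is only a sketch of the standard UTF-16 arithmetic; `toSurrogatePair` is a hypothetical helper name, not a built-in, and the emoji is written as the escape `\u{1F600}` so the example survives copy and paste.

```javascript
// Hypothetical helper (not part of JavaScript): splits a code point
// above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('expected a code point between U+10000 and U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// Putting the two code units back together yields the original emoji.
const rebuilt = String.fromCharCode(high, low);
console.log(rebuilt === '\u{1F600}'); // true
```

In everyday code there is no need to do this manually: `codePointAt` and `String.fromCodePoint` already handle the pair.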
The documentation of split has a warning:
When the empty string ("") is used as a separator, the string is not split by user-perceived characters (grapheme clusters) or Unicode characters (code points), but by UTF-16 code units. This destroys surrogate pairs.
That is, when you call split with an empty separator, you iterate over each half of the surrogate pair individually. This also affects other aspects of the string, such as its size:
const str = '😀';
console.log(str.length); // 2
The size is 2 because length also counts code units (the documentation says: "the length of the string, in UTF-16 code units"), and a surrogate pair uses 2 code units.
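To make the effect of the empty separator visible, here is a small sketch. The variable names are only illustrative, and the emoji is again written as an escape.

```javascript
const face = '\u{1F600}'; // GRINNING FACE, written as an escape

// split('') cuts between UTF-16 code units, so the pair is torn apart.
const units = face.split('');
console.log(units.length); // 2

// Each piece is a lone surrogate, which is not a valid character by itself.
const hexUnits = units.map((u) => u.charCodeAt(0).toString(16));
console.log(hexUnits.join(' ')); // d83d de00

// Joining the pieces again restores the original string.
console.log(units.join('') === face); // true
```

Since each element is half of a pair, printing or processing the elements one by one produces broken output.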
The for...of statement and the spread syntax, on the other hand, operate on the string's code points, so the emoji is handled correctly in those cases. The documentation says that the iterator of a String iterates over the code points of the string (and the spread syntax uses the iterator protocol under the hood, which is why it works properly).
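A short sketch of that difference, assuming a sample string with one emoji between two ASCII letters (the string and variable names are only illustrative):

```javascript
const text = 'a\u{1F600}b'; // "a", GRINNING FACE, "b"

// for...of uses the string iterator, which yields whole code points.
const seen = [];
for (const ch of text) {
  seen.push(ch.codePointAt(0).toString(16));
}
console.log(seen.join(' ')); // 61 1f600 62

// Spread syntax goes through the same iterator.
console.log([...text].length); // 3

// split('') works on code units instead, so it reports one element more.
console.log(text.split('').length); // 4

// The iterator can also be driven by hand.
const iterator = text[Symbol.iterator]();
iterator.next(); // skips "a"
const second = iterator.next().value;
console.log(second === '\u{1F600}'); // true
```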
Another way to get an array of code points is Array.from:
const str = 'Olá! 😀';
console.log(Array.from(str)); // [ "O", "l", "á", "!", " ", "😀" ]
To learn more, I suggest the article "JavaScript has a Unicode problem".
Keep in mind that this problem is not exclusive to emojis: it happens with any character whose code point is greater than U+FFFF.
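As an illustration, here are two characters above U+FFFF that are not emojis. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) and U+20000 (a CJK ideograph) is mine, picked only as examples.

```javascript
const clef = '\u{1D11E}';     // MUSICAL SYMBOL G CLEF
const ideograph = '\u{20000}'; // CJK ideograph from Extension B

// Both take two UTF-16 code units, just like the emoji.
console.log(clef.length, ideograph.length); // 2 2

// Hypothetical helper: lists the code units of a string in hexadecimal.
const codeUnits = (s) =>
  s.split('').map((u) => u.charCodeAt(0).toString(16)).join(' ');

console.log(codeUnits(clef));      // d834 dd1e
console.log(codeUnits(ideograph)); // d840 dc00

// Iterating by code point keeps each character in one piece.
console.log(Array.from(clef + ideograph).length); // 2
```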
Another detail is that not every character (in the sense of "a single 'drawing' that we see on the screen") is made of only one code point. There are stranger cases that only Unicode can give you.
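A sketch of two such cases follows. It assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js versions do), which is my suggestion for splitting by user-perceived characters and is not something the answer itself relies on.

```javascript
// "e" followed by COMBINING ACUTE ACCENT: rendered as a single "é".
const accented = 'e\u0301';
console.log(accented.length);             // 2 code units
console.log(Array.from(accented).length); // 2 code points

// A family emoji: man + ZWJ + woman + ZWJ + girl.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(family.length);             // 8 code units
console.log(Array.from(family).length); // 5 code points

// Intl.Segmenter groups code points into grapheme clusters.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = (s) => [...segmenter.segment(s)].map((part) => part.segment);

console.log(graphemes(accented).length); // 1
console.log(graphemes(family).length);   // 1
```

So even iterating by code point is not enough when the goal is to split a string into what a reader sees as individual characters.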
The problem is in split(), which probably wasn't prepared for this; it doesn't seem to have anything to do with the iterator or anything like that. – Maniero
Excerpt from the documentation of split() about the empty separator: "If the separator is an empty string, str is converted to an array of characters." Emojis are made of surrogate pairs: console.log('BC'.charCodeAt(0)); console.log('BC'.charCodeAt(1)); and split() treats them as 2 characters. Documentation on surrogate pairs – Augusto Vasques