Why, when an iterator is not used, is an emoji "broken" into two parts?


Considering the code passage below:

const str = 'Olá! 😀';

str.split('').forEach((char) => console.log(char));

Note that all characters of the string were separated correctly, except the emoji, which was split into two parts, each rendered as an unrecognizable symbol (�).

But if I use some newer language feature (one that works with the iterator protocol), the emoji is not divided into two parts. See these two examples:

const str = 'Olá! 😀';

for (const char of str) {
  console.log(char);
}

Or an example even more similar to the first code snippet of this question. The only difference is that, instead of split, the spread syntax is used:

const str = 'Olá! 😀';

[...str].forEach((char) => console.log(char));

Spread syntax uses the ECMAScript iterator protocol behind the scenes, so I think there is some relationship.

Why does this happen? What causes this difference?

  • 1

The problem is in split(), which probably wasn't prepared for this; it doesn't seem to have anything to do with iterators or the like.

  • 2

Straight from the documentation of split() on the empty separator: "If the separator is an empty string, str is converted into an array of characters." Emojis are formed by surrogate pairs (console.log('😀'.charCodeAt(0)); console.log('😀'.charCodeAt(1));), and split() understands them as 2 characters. See the documentation on surrogate pairs.

1 answer

14


You probably already know what a Unicode code point is (if you don’t know, read here). But in summary, every character (including emojis) has an associated numeric value, which is called code point.

In the case of the 😀 emoji, its code point is U+1F600 ("GRINNING FACE").

What happens is that internally JavaScript represents code points above U+FFFF as surrogate pairs <- this link has the description of the algorithm, but basically the code point U+1F600 is "broken"/decomposed into two values: 0xD83D and 0xDE00 (because internally the strings are stored in UTF-16).

const str = '😀';

console.log(str.codePointAt(0).toString(16)); // 1f600

console.log(str.charCodeAt(0).toString(16)); // d83d
console.log(str.charCodeAt(1).toString(16)); // de00
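As a sketch of that decomposition (following the standard UTF-16 algorithm described in the linked page; the function name here is just illustrative), the two halves can be computed by hand:

```javascript
// Decompose a code point above U+FFFF into its UTF-16 surrogate pair.
// Sketch of the standard algorithm, not a built-in API.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16)); // d83d
console.log(low.toString(16));  // de00
```

Putting the two halves back together with String.fromCharCode(0xD83D, 0xDE00) yields the emoji again, which is exactly what the engine does when rendering the string.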


In the documentation of split there’s a warning:

When the empty string ("") is used as a separator, the string is not split by user-perceived characters (grapheme clusters) or Unicode characters (code points), but by UTF-16 code units. This destroys surrogate pairs.

That is, by doing the split, you will be iterating individually through the parts of the surrogate pair. This also influences other aspects of the string, such as its length:

const str = '😀';

console.log(str.length); // 2

The length is 2 because length also counts code units (the documentation says: "the length of the string, in UTF-16 code units"), and a surrogate pair uses 2 code units.
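A quick way to see both behaviors side by side (a sketch; any string containing a code point above U+FFFF would behave the same):

```javascript
const str = '😀';

// split('') works on UTF-16 code units, so the surrogate pair is broken apart:
const units = str.split('');
console.log(units.length);                        // 2
console.log(units[0].charCodeAt(0).toString(16)); // d83d (a lone high surrogate)

// Spread syntax works on code points, so the emoji stays whole:
const points = [...str];
console.log(points.length); // 1
console.log(points[0]);     // 😀
```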


The for...of operator and the spread syntax, on the other hand, operate on the string's code points, so the emoji is treated correctly in these cases: the documentation says that a String's iterator iterates over the string's code points (and spread syntax uses the iterator protocol under the hood, so it works properly).
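To make that relationship explicit, you can call the string's iterator by hand via Symbol.iterator (a sketch of what for...of and spread do internally):

```javascript
const str = 'Olá! 😀';

// Obtain the same iterator that for...of and [...] use under the hood.
const it = str[Symbol.iterator]();

let result = it.next();
while (!result.done) {
  console.log(result.value); // each value is a full code point, not a code unit
  result = it.next();
}
```

Each call to next() advances by a whole code point, which is why the emoji comes out in one piece.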

Another way to get an array of code points is by using Array.from:

const str = 'Olá! 😀';

console.log(Array.from(str)); // [ "O", "l", "á", "!", " ", "😀" ]


To learn more, I suggest the article JavaScript has a Unicode problem.

Note that this problem does not occur only with emojis, but with any character whose code point is greater than U+FFFF.

Another detail is that not every character (in the sense of "a single drawing" that we see on the screen) is composed of only one code point. There are even more bizarre cases that only Unicode brings you.
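For example (a sketch; Intl.Segmenter requires a reasonably recent engine), the letter "é" can be written as 'e' followed by the combining acute accent U+0301: two code points, but one user-perceived character. Iterating by code points still splits it; grapheme segmentation keeps it together:

```javascript
const str = 'e\u0301'; // "é" as two code points: 'e' + combining acute accent

console.log(str.length);      // 2 (code units)
console.log([...str].length); // 2 (code points -- still two!)

// Intl.Segmenter groups code points into user-perceived characters
// (grapheme clusters):
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(str)].map((s) => s.segment);
console.log(graphemes.length); // 1
console.log(graphemes[0]);     // é
```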
