What is the Unicode flag "u" in regular expressions? What is its function?

Question

Asked 4 years, 4 months ago

Some time ago I discovered that regular expressions can also take the u flag, which I think is short for Unicode.

What is the purpose of this flag?
I know it was added in a recent version of ECMAScript, so which regular expression behaviors does it change?
Is it related to some other flag?

by hkotsubo • **55,826** points · Answer 1 · 2021-03-16T12:56:17+00:00

This flag changes some aspects of how the regex interprets both the pattern and the string.

Interpretation of code points in surrogate pairs

For example, suppose the string contains the emoji 💩 (PILE OF POO). It is a "character" (in the sense of having a code point defined by Unicode - read here for more details). In this case, its code point is U+1F4A9 and, as we have seen here, code points above U+FFFF are stored internally as surrogate pairs (here the emoji is decomposed into two values: 0xD83D and 0xDCA9).

Now imagine a regex that checks whether the string has exactly one character: /^.$/. This regex has the beginning and end of the string (the markers ^ and $) and, between them, the dot, which matches any single character except line breaks.

Without the u flag, the dot treats each half of the surrogate pair as a separate character. With the flag, it treats both halves of the surrogate pair as a single code point:

const s = '💩';
console.log(s.length); // 2 <- the code point uses 2 code units
console.log(s.charCodeAt(0).toString(16)); // d83d
console.log(s.charCodeAt(1).toString(16)); // dca9

// testing whether there is exactly one code point
console.log(/^.$/.test(s)); // false
console.log(/^.$/u.test(s)); // true

// testing whether there are two code points
console.log(/^.{2}$/.test(s)); // true
console.log(/^.{2}$/u.test(s)); // false

Notice that without the flag, a regex with a single dot fails, because it treats each half of the surrogate pair as a separate character (it only matches when the regex looks for two of them).

With the flag, the dot interprets the surrogate pair as a single code point and finds a match (and fails when we look for two code points).

We can also see this behavior if we search specifically for the parts of the surrogate pair:

const s = '💩';

// searching for the parts of the surrogate pair
console.log(/\uD83D/.test(s)); // true
console.log(/\uDCA9/.test(s)); // true

// with the u flag this no longer works (the parts of the surrogate pair are no longer checked separately)
console.log(/\uD83D/u.test(s)); // false
console.log(/\uDCA9/u.test(s)); // false
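
As a complementary sketch (the variable names are only illustrative): with the u flag, an escape for one half of a surrogate pair still matches an isolated surrogate, that is, one that is not part of a valid pair in the string.

```javascript
// A high surrogate on its own, not followed by a low surrogate
const loneHigh = '\uD83D';
// The same high surrogate followed by an ordinary letter
const highThenLetter = '\uD83Da';
// A complete surrogate pair (U+1F4A9)
const fullPair = '\uD83D\uDCA9';

const highSurrogate = /\uD83D/u;

console.log(highSurrogate.test(loneHigh));       // true
console.log(highSurrogate.test(highThenLetter)); // true
console.log(highSurrogate.test(fullPair));       // false
```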

Or, if I want to check whether the string has two 💩:

const s = '💩💩';

// without the flag, each part of the surrogate pair is treated separately
console.log(/\uD83D\uDCA9{2}/.test(s)); // false
// that is, the {2} is applied only to \uDCA9
console.log(/\uD83D\uDCA9{2}/.test("\uD83D\uDCA9\uDCA9")); // true

// with the flag, the {2} is applied to the whole surrogate pair
console.log(/\uD83D\uDCA9{2}/u.test(s)); // true
console.log(/\uD83D\uDCA9{2}/u.test("\uD83D\uDCA9\uDCA9")); // false

In this case, without the flag, the quantifier {2} is applied only to \uDCA9 (since each part of the surrogate pair is treated separately), while with the flag the {2} is applied to the whole surrogate pair, correctly detecting that there are two emojis in the string.

This also affects the size of the match returned:
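
The same reasoning applies to character classes. The sketch below is not from the original examples (the constant names are illustrative): without the flag, a class containing the emoji actually contains its two code units, and each match consumes only one of them.

```javascript
const s = '💩';

// Without the flag, the class holds the code units D83D and DCA9
const classWithoutFlag = /^[💩]$/;
const twoUnitsWithoutFlag = /^[💩]{2}$/;
// With the flag, the class holds the single code point U+1F4A9
const classWithFlag = /^[💩]$/u;

console.log(classWithoutFlag.test(s));    // false
console.log(twoUnitsWithoutFlag.test(s)); // true
console.log(classWithFlag.test(s));       // true

// A range above U+FFFF, written with code point escapes (needs the flag)
const emoticons = /^[\u{1F600}-\u{1F64F}]$/u;
console.log(emoticons.test('\u{1F600}')); // true
```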

const s = '💩';

// returns an array with 2 elements (the parts of the surrogate pair)
console.log(s.match(/./g).map(s => s.codePointAt(0).toString(16))); // ["d83d", "dca9"]

// returns an array with 1 element (the emoji itself)
console.log(s.match(/./gu).map(s => s.codePointAt(0).toString(16))); // ["1f4a9"]

Of course, if the string only has characters whose code points are at most U+FFFF, none of this is a concern.
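
One practical consequence, as a minimal sketch: counting code points instead of code units. The helper name countCodePoints is hypothetical, and [\s\S] is used instead of the dot so that line breaks are counted too.

```javascript
// Counts code points by matching every character, line breaks included.
// With the u flag, each match consumes a whole code point.
function countCodePoints(str) {
  const matches = str.match(/[\s\S]/gu);
  return matches === null ? 0 : matches.length;
}

const text = 'a💩b';
console.log(text.length);           // 4 (code units)
console.log(countCodePoints(text)); // 3 (code points)
console.log(countCodePoints(''));   // 0
```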

Unicode Property Escapes

The u flag also enables the use of Unicode property escapes:

const s = "平仮名";

console.log(s.match(/\p{L}/gu)); // ["平", "仮", "名"]
console.log(s.match(/\p{L}/g)); // null

In this case, \p{L} matches any code point that Unicode defines as a letter (all categories starting with "L" from this list). But that only works with the u flag enabled.
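
A few other properties, as an illustrative sketch (the sample string is made up): Script=... selects a writing system, Lu is the "uppercase letter" category, and \P (uppercase P) negates the property.

```javascript
const sample = '平仮名 ひらがな ABC';

const han = sample.match(/\p{Script=Han}+/gu);
const hiragana = sample.match(/\p{Script=Hiragana}+/gu);
const uppercase = sample.match(/\p{Lu}+/gu);
// \P{L} matches everything that is NOT a letter (here, the two spaces)
const nonLetters = sample.match(/\P{L}+/gu);

console.log(han);        // ["平仮名"]
console.log(hiragana);   // ["ひらがな"]
console.log(uppercase);  // ["ABC"]
console.log(nonLetters); // [" ", " "]
```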

I will not list all the available Unicode properties, but the documentation gives a good idea of the existing options.
Just to give one example, a possible use is to work around the limitation of shorthands such as \w and \b, which in JavaScript only consider ASCII characters (even with the u flag enabled). Example:

// \b and \w do not take accented characters into account
console.log('sábia sabiá'.match(/\b\w+\b/gu)); // ["s", "bia", "sabi"]

// \p{L} takes accented characters into account
console.log('sábia sabiá'.match(/(?<!\p{L})\p{L}+(?!\p{L})/gu)); // ["sábia", "sabiá"]
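
A slightly broader variant, as a sketch only: a hypothetical helper unicodeWords that approximates a Unicode-aware \w+. Including \p{M} (combining marks) is a design choice of this sketch, so that accents written in decomposed form stay inside the word.

```javascript
// Hypothetical helper: a Unicode-aware stand-in for /\w+/g.
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers,
// plus the underscore, which \w also accepts.
function unicodeWords(str) {
  return str.match(/[\p{L}\p{M}\p{N}_]+/gu) || [];
}

console.log(unicodeWords('sábia sabiá'));   // ["sábia", "sabiá"]
console.log(unicodeWords('snake_case 42')); // ["snake_case", "42"]
console.log(unicodeWords('!?'));            // []
```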

Escapes and attribute `pattern`

There is another detail (which has already been described in this answer): when the u flag is present, only a few characters can be escaped with \, namely the syntax characters ^ $ \ . * + ? ( ) [ ] { } | (and the slash /).

That is, a regex containing something like \- works normally without the flag (it is interpreted as a hyphen), but with the flag it raises an error:

// without the u flag: works
let semUnicode = /\d\-\d/;
console.log('valid:', semUnicode);

// with the u flag: error
let comUnicode = /\d\-\d/u;
console.log('This message is not printed because the line above raises an error');

That's because \- is redundant, since a plain - is already interpreted as a hyphen (except inside a character class, where the hyphen has a special meaning: [a-z] is a letter from a to z, while [a\-z] is "the letter a, or a hyphen, or the letter z").
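
To observe the error without aborting the script, the pattern can be built with the RegExp constructor inside a try/catch (an invalid regex literal is rejected when the script is parsed, so it cannot be caught at runtime). This is a sketch, and compiles is a hypothetical helper name:

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(compiles('\\d\\-\\d', ''));  // true:  \- tolerated without the flag
console.log(compiles('\\d\\-\\d', 'u')); // false: \- rejected with the flag
console.log(compiles('\\d-\\d', 'u'));   // true:  a plain hyphen is enough
console.log(compiles('[a\\-z]', 'u'));   // true:  \- is allowed inside a class
```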

Another point is that in the pattern attribute of an HTML field, the regex is always compiled with the u flag enabled (newer revisions of the HTML specification use the v flag instead).

Unicode code point escapes

Another feature enabled by the u flag is the code point escape for values above U+FFFF:

const s = '💩';

// without the flag, it searches literally for "u{1f4a9}"
let r = /\u{1f4a9}/;
console.log(r.test(s)); // false
console.log(r.test('u{1f4a9}')); // true

// with the flag, it searches for the code point U+1F4A9
r = /\u{1f4a9}/u;
console.log(r.test(s)); // true
console.log(r.test('u{1f4a9}')); // false

For values up to U+FFFF, as we have seen in the examples above, just use \uHHHH (where HHHH is the value of the code point in hexadecimal); this does not need the flag (the flag only changes the interpretation of surrogate pairs). But for values above U+FFFF we need the syntax \u{...} (the value of the code point goes between braces), and for this to work the u flag must be enabled (otherwise the regex searches literally for the characters u, {, etc.).
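
A related pitfall, shown as a sketch (the constant names are illustrative): when the braces contain only digits, the pattern without the flag reads \u as a literal "u" followed by a quantifier, so it silently matches something quite different.

```javascript
// Without the flag: \u is a literal "u" and {2} is a quantifier ("uu")
const withoutFlag = /^\u{2}$/;
// With the flag: \u{2} is the code point U+0002
const withFlag = /^\u{2}$/u;

console.log(withoutFlag.test('uu'));     // true
console.log(withoutFlag.test('\u0002')); // false
console.log(withFlag.test('uu'));        // false
console.log(withFlag.test('\u0002'));    // true

// With the flag, \u{...} and an escaped surrogate pair are equivalent
console.log(/^\u{1f4a9}$/u.test('\uD83D\uDCA9')); // true
```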

Relationship with other flags

There is no direct relationship with other flags, but they can be used together (as in some of the examples above, which also have the g flag).

Still, the combination can make a difference in some cases, thanks to the quirks of Unicode. For example, if I want to search for all the code points that match the letter "s" case-insensitively (that is, using the i flag):

console.log('With the flag:');
let r = /s/iu;
for (let i = 0; i <= 0x10ffff; i++) {
    const s = String.fromCodePoint(i);
    if (r.test(s)) {
        console.log(`${i.toString(16)} = ${s}`);
    }
}

console.log('\n-----------------\nWithout the flag:');
r = /s/i;
for (let i = 0; i <= 0x10ffff; i++) {
    const s = String.fromCodePoint(i);
    if (r.test(s)) {
        console.log(`${i.toString(16)} = ${s}`);
    }
}

Without the u flag, only the letters s and S are found. But with the flag, the letter ſ (LATIN SMALL LETTER LONG S) is also found.
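
The same result can be checked directly, without scanning every code point. In this sketch, the KELVIN SIGN (U+212A) is an extra example that is not in the loops above; it behaves the same way with respect to the letter "k".

```javascript
const longS = '\u017F';  // ſ LATIN SMALL LETTER LONG S
const kelvin = '\u212A'; // K KELVIN SIGN

console.log(/s/i.test(longS));   // false
console.log(/s/iu.test(longS));  // true

console.log(/k/i.test(kelvin));  // false
console.log(/k/iu.test(kelvin)); // true
```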

Terminology

The name of the flag (u) comes from "Unicode", as stated in the documentation. The language specification also uses this term:

Unicode is true if the RegExp object's [[OriginalFlags]] internal slot contains "u" and otherwise is false.
…
When the Unicode flag is true, "all characters" means the CharSet containing all code point values; otherwise "all characters" means the CharSet containing all code unit values.

In fact, the second paragraph above describes the behavior mentioned at the beginning: whether or not the parts of a surrogate pair are treated as a single unit.

Finally, RegExp instances have the unicode property, which indicates whether the flag is active or not:

let r = /abc/;
console.log(r.unicode); // false

r = /abc/u;
console.log(r.unicode); // true
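
This property, together with source and flags, makes it possible to derive a Unicode-aware copy of an existing regex. The sketch below uses a hypothetical helper, withUnicodeFlag, and assumes the pattern is still valid with the flag (as seen above, escapes such as \- would make the constructor throw).

```javascript
// Hypothetical helper: returns the regex itself if it already has the u flag,
// otherwise a copy with the u flag added
function withUnicodeFlag(regex) {
  if (regex.unicode) return regex;
  return new RegExp(regex.source, regex.flags + 'u');
}

const original = /^.$/i;
const upgraded = withUnicodeFlag(original);

console.log(original.unicode);    // false
console.log(upgraded.unicode);    // true
console.log(upgraded.flags);      // "iu"
console.log(upgraded.test('💩')); // true
```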
