What does the \s shortcut mean in regex?

28

I see a lot of people using \s in regex assuming it means ' ' (space), but it doesn't, or at least not only that.

So what does \s actually mean in regex?

2 answers

23


\s covers much more than just ' ' (space).

\s = [ \t\n\r\f\v]
  • ' ' (space)
  • \t TAB (horizontal tab)
  • \n new line (line feed)
  • \r carriage return (moves the cursor back to the beginning of the line)
  • \f form feed (page break)
  • \v vertical TAB (used in printer configurations)
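
A quick way to check the list above is to test each character against \s. This is only a sketch in Python 3, using the re.ASCII flag so that only the ASCII whitespace set is considered; the names CANDIDATES and matched_labels are mine, made up for illustration.

```python
import re

# The six characters listed above, keyed by a readable label.
CANDIDATES = {
    "space": " ",
    "tab": "\t",
    "line feed": "\n",
    "carriage return": "\r",
    "form feed": "\f",
    "vertical tab": "\v",
}

# re.ASCII restricts \s to ASCII whitespace in Python 3.
ascii_whitespace = re.compile(r"\s", re.ASCII)


def matched_labels():
    """Return the labels of the candidates that the pattern matches."""
    return [
        label
        for label, ch in CANDIDATES.items()
        if ascii_whitespace.fullmatch(ch)
    ]


print(matched_labels())
```

In Python all six labels come back; other engines may give a different answer for the vertical tab.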

9

The list of characters that the \s shortcut matches may vary from one language/API/tool/engine to another, and even with the configuration options available in each.

In general, \s always matches at least the following characters:

  • ' ' (space)
  • \t TAB
  • \n line feed
  • \r carriage return
  • \f form feed

The vertical tab (\v) (or "LINE TABULATION") is also matched in several languages, such as Java, JavaScript, Ruby and Python.

But in PHP, for example, \s does not match the vertical tab. According to the documentation:

\s any whitespace character

The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32)

That is to say, it only includes HT (horizontal tab), LF (line feed), FF (form feed), CR (Carriage Return) and space.

And in Perl, the vertical tab was only added in version 5.18, as indicated in the documentation:

\s means the five characters [ \f\n\r\t], and starting in Perl v5.18, the vertical tab;

In any case, this list may vary with each language, API or tool/engine (Google Docs, for example, uses the RE2 engine, which does not match the vertical tab), so always consult the documentation to be sure.


Unicode

In addition, many languages have settings that enable a "Unicode mode", which makes \s match many more characters than those already mentioned.

For example, in Java, if the regex is compiled with the UNICODE_CHARACTER_CLASS option, \s matches all the characters that have the Unicode property White_Space (the list is in the Unicode Character Database file PropList.txt). So this code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Matcher matcher = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher("");
// loop over every Unicode code point
for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
    String s = new String(new int[] { i }, 0, 1);
    matcher.reset(s);
    if (matcher.find()) {
        // if it matches \s, print the code point and the character name
        System.out.printf("%06X, %s\n", i, Character.getName(i));
    }
}

Generates the following output:

000009, CHARACTER TABULATION
00000A, LINE FEED (LF)
00000B, LINE TABULATION
00000C, FORM FEED (FF)
00000D, CARRIAGE RETURN (CR)
000020, SPACE
000085, NEXT LINE (NEL)
0000A0, NO-BREAK SPACE
001680, OGHAM SPACE MARK
002000, EN QUAD
002001, EM QUAD
002002, EN SPACE
002003, EM SPACE
002004, THREE-PER-EM SPACE
002005, FOUR-PER-EM SPACE
002006, SIX-PER-EM SPACE
002007, FIGURE SPACE
002008, PUNCTUATION SPACE
002009, THIN SPACE
00200A, HAIR SPACE
002028, LINE SEPARATOR
002029, PARAGRAPH SEPARATOR
00202F, NARROW NO-BREAK SPACE
00205F, MEDIUM MATHEMATICAL SPACE
003000, IDEOGRAPHIC SPACE

Now, if we remove the UNICODE_CHARACTER_CLASS option, the default is to match only the 6 characters already mentioned ([ \t\n\r\f\v]):

Matcher matcher = Pattern.compile("\\s").matcher("");
// ... rest of the code is the same

The output becomes:

000009, CHARACTER TABULATION
00000A, LINE FEED (LF)
00000B, LINE TABULATION
00000C, FORM FEED (FF)
00000D, CARRIAGE RETURN (CR)
000020, SPACE


Python has something similar, but in Python 3 the behavior is the opposite of Java's. By default, the regex is already in "Unicode mode" and the \s shortcut matches all the Unicode whitespace characters. Writing code similar to the previous example:

import unicodedata as u
import re

r = re.compile(r'\s')
for i in range(0x10ffff + 1):
    s = chr(i)
    if r.search(s):
        print('{:02X} {}'.format(i, u.name(s, '')))

Since this regex is in "Unicode mode" by default, the output is:

09 
0A 
0B 
0C 
0D 
1C 
1D 
1E 
1F 
20 SPACE
85 
A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE

If we use the ASCII flag instead, \s matches only the characters [ \t\n\r\f\v]:

r = re.compile(r'\s', re.ASCII)
# ... rest of the code is the same

Output:

09 
0A 
0B 
0C 
0D 
20 SPACE

Note: in Python 2 the behavior is the same as in Java. By default, \s is equivalent only to the characters [ \f\n\r\v\t], and "Unicode mode" is enabled with the UNICODE flag.


Another detail is that Python returned 4 more characters than Java (1C, 1D, 1E and 1F). This may be due to the Unicode version that each one uses internally (tested with Java 8, which uses Unicode 6.2.0, and Python 3.8, which uses Unicode 12.1.0), or to some internal implementation detail of each language that takes into account factors other than the White_Space property. The second explanation looks more likely: the Python documentation for str.isspace() also treats a character as whitespace when its bidirectional class is WS, B or S, which is the case for these four control characters, and \s appears to follow the same definition. Either way, this confirms that the behavior of the \s shortcut does vary with the language used.
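
To look closer at those four extra characters, here is a small sketch in Python 3 that prints their general category and bidirectional class from the unicodedata module, next to whether \s matches them. EXTRA and describe are names I made up for this illustration, and the exact values depend on the Unicode data bundled with your Python version.

```python
import re
import unicodedata

# The four characters that Python matched and Java did not.
EXTRA = ["\x1c", "\x1d", "\x1e", "\x1f"]


def describe(ch):
    """Return (code point, general category, bidi class, matched by the pattern)."""
    return (
        "{:02X}".format(ord(ch)),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        re.fullmatch(r"\s", ch) is not None,
    )


for ch in EXTRA:
    print(describe(ch))
```

None of them has the general category Zs, but they do carry a bidirectional class of B or S, which would explain why Python counts them as whitespace.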

And even different libraries for the same language can behave differently. For example, if I change the code above to use the regex module (an excellent module that extends the functionality of the re module), the returned character list is the same as the one from the Java code.


Final considerations

Other languages and tools may or may not support a "Unicode mode" (which in turn may or may not be the default), and may or may not provide an option to enable or disable it. Others may also support Unicode properties, for example \p{IsWhite_Space} to search for all the Unicode whitespace characters, which may or may not be equivalent to \s. Therefore, it is always worth reading the documentation to be sure which characters \s matches. As an addendum, the same goes for other shortcuts, such as \d, \w, \b, etc., because their behavior may also vary with the language/engine and its configuration modes.
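
To illustrate that other shortcuts are affected too, here is a minimal sketch in Python 3 comparing \d and \w with and without the ASCII flag. The helper matches and the two sample characters are my own choices for the example.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE (a decimal digit)
E_ACUTE = "\u00e9"             # LATIN SMALL LETTER E WITH ACUTE


def matches(pattern, text, flags=0):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text, flags) is not None


# Default for str patterns in Python 3 is Unicode mode.
print(matches(r"\d", ARABIC_INDIC_THREE))            # True
print(matches(r"\d", ARABIC_INDIC_THREE, re.ASCII))  # False
print(matches(r"\w", E_ACUTE))                       # True
print(matches(r"\w", E_ACUTE, re.ASCII))             # False
```

So a pattern like \d+ used to validate "only 0 to 9" can accept more than expected unless the ASCII flag (or an explicit [0-9]) is used.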

Of course, if you are working with controlled input and know exactly which characters the text can contain, it may not make much difference whether you use \s in Unicode mode or not, or even a literal space in the regex. But it's important to know the implications, because there are cases where the results differ (for example, if you want to match spaces but not line breaks, the choice matters).
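
A small sketch of that last case in Python 3: splitting the same text on a literal space and on \s gives different results as soon as there is a line break in it. The sample text is made up.

```python
import re

text = "first line\nsecond  line"

# A literal space leaves the line break inside the second piece.
by_space = re.split(r" +", text)

# \s also treats the line break as a separator.
by_whitespace = re.split(r"\s+", text)

print(by_space)       # ['first', 'line\nsecond', 'line']
print(by_whitespace)  # ['first', 'line', 'second', 'line']
```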

Finally, some languages support other similar shortcuts, such as POSIX character classes. For example, in Java it is possible to use \p{Blank}, and in PHP it is possible to use [:blank:], and both correspond to [ \t] (a space or TAB), although in Java this also changes when it is in "Unicode mode". There are also engines that support the \R shortcut, which matches line breaks (but again with differences; for example, in Java it corresponds to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029], and in PHP it only matches \r, \n or \r\n).

Depending on the case, using these variations, when available, may be better than using \s (for example, if I want to ignore line breaks, or match only them).
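
When the engine has no such shortcut, the same idea can be spelled out by hand. As far as I know Python's built-in re module does not accept \R, so the sketch below simply writes out the alternation quoted above for Java. LINE_BREAK and split_lines are hypothetical names for this example.

```python
import re

# Hand-written equivalent of the alternation quoted above for Java's \R.
# \r\n comes first so that a CRLF pair counts as a single line break.
LINE_BREAK = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on line breaks only, leaving spaces and tabs alone."""
    return LINE_BREAK.split(text)


print(split_lines("a\r\nb\nc\u2028d"))  # ['a', 'b', 'c', 'd']
print(split_lines("x y\tz"))            # ['x y\tz']
```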
