How does the metacharacter \t work in a regex?


I have this variable:

y <- c('stack overflow', 'user number  2', 'nova\nlinha', 'nova \n linha')

And these functions with their results:

library(tidyverse)

With \n:

str_detect(string = y, regex(pattern = '\\n'))

[1] FALSE FALSE  TRUE  TRUE

With \s:

str_detect(string = y, regex(pattern = '\\s'))
[1] TRUE TRUE TRUE TRUE

The function returns TRUE for both 'nova\nlinha' and 'nova \n linha', even though the first has no spaces and the second does.

I tried using \t, as suggested in this question:

str_detect(string = y, regex(pattern = '[ \\t]'))
[1]  TRUE  TRUE FALSE  TRUE

It worked as expected.

But then I had some doubts. According to the regex documentation, \t does something different: it looks for a tab in the string. This leaves me with two questions:

  • What is the difference between a tab and a space, and in which situations are tabs more common than spaces?

  • Why did the last call I wrote work? I used it, but I didn't understand its logic (this one: str_detect(string = y, regex(pattern = '[ \\t]'))).

NOTE: I use R on Linux. In R string literals you need a double backslash (\\) instead of a single one (\), so, for example, instead of the conventional \s I have to write \\s.

  • I answered the regex part, but as for tabs versus spaces, I think that is too broad a subject and goes beyond the scope of regex: https://www.google.com/search?q=tab+vs+space

1 answer


Yes, \t looks for a TAB. The last case works because the brackets form a character class, and the regex finds a match if the string has any character belonging to the class.

In this case, [ \\t] is a class containing a space and a TAB (\t). Notice that there is a space right after the [. If the string contains any one of these characters, that is enough: it does not need to contain all of them, because a single occurrence already gives the regex a match. None of the strings has a TAB, but the first, second and fourth have a space, so the result is TRUE TRUE FALSE TRUE.

In fact, if you remove the space from inside the brackets:

str_detect(string = y, regex(pattern = '[\\t]'))

The result is FALSE for all strings, because the character class now contains only a TAB (the space is gone) and none of the strings has a TAB. If one of the strings were, for example, 'com\tTAB', it would return TRUE for that one.

In that case, though, the expression could simply be '\\t', since there is nothing to gain from a character class that contains only one character.
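A minimal sketch of the points above, assuming the stringr package is installed. The vector y2 is hypothetical: it is the y from the question with the example string 'com\tTAB' appended, so that at least one element contains a real TAB.

```r
library(stringr)

# y from the question, plus one hypothetical string containing a TAB
y2 <- c('stack overflow', 'user number  2', 'nova\nlinha', 'nova \n linha', 'com\tTAB')

# Character class: matches if the string has a space OR a TAB
space_or_tab <- str_detect(y2, '[ \\t]')

# TAB only: the class '[\\t]' and the bare '\\t' behave the same
tab_class <- str_detect(y2, '[\\t]')
tab_bare  <- str_detect(y2, '\\t')

print(space_or_tab)
print(tab_class)
print(tab_bare)
```

Only the last element matches the TAB-only patterns, while the third element ('nova\nlinha') matches neither pattern because it contains only a line break.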


As for \s, it is a shorthand that matches a space, a TAB or a line break (its exact meaning may vary depending on the language or engine used). That's why it detects the \n in the third string, even though that string has no space or TAB.
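To see the difference between \s and the explicit class more directly, one option is to count matches instead of just detecting them. This is only an illustrative sketch, again assuming stringr is available; str_count is not used in the question, and the variable names are made up for the example.

```r
library(stringr)

y <- c('stack overflow', 'user number  2', 'nova\nlinha', 'nova \n linha')

# \s counts spaces, TABs and line breaks alike
n_whitespace <- str_count(y, '\\s')

# The explicit class counts only spaces and TABs, so '\n' is ignored
n_space_or_tab <- str_count(y, '[ \\t]')

print(n_whitespace)
print(n_space_or_tab)
```

The two counts differ exactly in the strings that contain a line break: the third string has one \s match but no space or TAB, and the fourth has one more \s match than it has spaces.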
