The metacharacters \b and \B are anchors that mark the boundaries of a word within a string (where it starts, ends, or both). They work according to the ASCII standard, where a word character is only [A-Za-z0-9_]. That is, only the parts of the string made of these characters are treated as words.
I have this variable:
x <- c('melodia', 'bom-dia!', 'bom_dia!', 'radial', 'dia', 'diafragma')
x
And the following code using \b:
library(tidyverse)
\bdia marks the beginning (everything that begins with dia, according to the ASCII standard):
str_subset(string = x, regex(pattern = '\\bdia'))
[1] "bom-dia!" "dia" "diafragma"
dia\b marks the end (everything that ends with dia, according to the ASCII standard):
str_subset(string = x, regex(pattern = 'dia\\b'))
[1] "melodia" "bom-dia!" "bom_dia!" "dia"
\bdia\b marks both ends (everything that starts and ends with dia, according to the ASCII standard):
str_subset(string = x, regex(pattern = '\\bdia\\b'))
[1] "bom-dia!" "dia"
bom-dia! is returned because - is not part of the ASCII word-character class [A-Za-z0-9_] mentioned above.
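A quick way to check which separators count as word characters is sketched below. It uses base R's grepl with perl = TRUE instead of stringr, an assumption made only so the snippet runs without extra packages; the vector name chars is hypothetical.

```r
# Characters to classify; chosen to mirror the separators that appear in x.
chars <- c("-", "_", "!", "d")

# perl = TRUE selects PCRE, where \w is [A-Za-z0-9_] by default.
is_word <- grepl("^\\w$", chars, perl = TRUE)

# Named logical vector: TRUE means the character is a word character.
print(setNames(is_word, chars))
```

Under that assumption, _ is a word character while - and ! are not, which is consistent with \bdia returning bom-dia! but not bom_dia!.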
Now using \B:
str_subset(string = x, regex(pattern = '\\Bdia'))
[1] "melodia" "bom_dia!" "radial"
str_subset(string = x, regex(pattern = 'dia\\B'))
[1] "radial" "diafragma"
str_subset(string = x, regex(pattern = '\\Bdia\\B'))
[1] "radial"
Since \B is the inverse of \b (as is the case with \w and \W, \s and \S, and so on), these results were expected, but this is only a simple example.
Consider the following situation. I have the variable:
y <- c('5203. ._2302', '4243424', '52033.23021', '5201w2211', '53210ggsd3333')
And the code:
str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w'))
[1] "52033.23021" "5201w2211" "53210ggsd3333"
That returns everything that begins with 5, followed by 3 digits, followed by any word character \\w ([A-Za-z0-9_]). OK.
What I don't understand is the code below:
str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w\\b'))
[1] "52033.23021"
str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w\\B'))
[1] "5201w2211" "53210ggsd3333"
Since the first pattern matches everything that ends with \w, it should match everything, right? And the second, instead of matching, should reject everything.
That's what I don't get.
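To see what each pattern is actually tested against, the sketch below extracts the substring matched by the common prefix \\b5\\d{3}\\w and the character that immediately follows it. It uses base R (regexpr and regmatches with perl = TRUE) rather than stringr, an assumption made only to keep the snippet self-contained; the names m, hit, matched and next_char are hypothetical.

```r
y <- c('5203. ._2302', '4243424', '52033.23021', '5201w2211', '53210ggsd3333')

# Locate the first match of the common prefix in each string (-1 means no match).
m <- regexpr('\\b5\\d{3}\\w', y, perl = TRUE)
hit <- as.vector(m) > 0

# Substring consumed by the pattern, for the strings that matched.
matched <- regmatches(y, m)

# Character right after the match: this is what a trailing \b or \B looks at.
next_pos <- (as.vector(m) + attr(m, "match.length"))[hit]
next_char <- substr(y[hit], next_pos, next_pos)

print(data.frame(
  string       = y[hit],
  matched      = matched,
  next_char    = next_char,
  next_is_word = grepl('^\\w$', next_char, perl = TRUE)
))
```

In this sketch the pattern consumes only the first five characters of each string, so a trailing \b or \B is checked between the fifth character and the one after it, not at the end of the whole string.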
Therefore,
- what is the difference between \b and \B, in terms of how they are applied and when each should be used?