Difference between the metacharacters \b and \B

The metacharacters \b and \B are anchors that mark the boundaries of a string (where it starts, ends, or both). They work on the ASCII word class, which contains only [A-Za-z0-9_]. That is, only parts of the string that match this class are identified.

I have this variable:

x <- c('melodia', 'bom-dia!', 'bom_dia!', 'radial', 'dia', 'diafragma')
x

And the following code using \b:

library(tidyverse)

\bdia marks the beginning (everything that begins with dia, according to the ASCII standard):

str_subset(string = x, regex(pattern = '\\bdia'))
[1] "bom-dia!"  "dia"       "diafragma"

dia\b marks the end (all that ends with dia, according to the ASCII standard):

str_subset(string = x, regex(pattern = 'dia\\b'))
[1] "melodia"  "bom-dia!" "bom_dia!" "dia"     

\bdia\b marks both ends (everything that starts and ends with dia, according to the ASCII standard):

str_subset(string = x, regex(pattern = '\\bdia\\b'))
[1] "bom-dia!" "dia"     

bom-dia! is returned because - is not part of the ASCII class given above.

Now using \B:

str_subset(string = x, regex(pattern = '\\Bdia'))
[1] "melodia"  "bom_dia!" "radial"  

str_subset(string = x, regex(pattern = 'dia\\B'))
[1] "radial"    "diafragma"

str_subset(string = x, regex(pattern = '\\Bdia\\B'))
[1] "radial" 

Since \B is the inverse of \b (as is the case with \w and \W, \s and \S, and so on), this result was expected, but only because the example is simple.

Consider the following situation. I have the variable:

y <- c('5203. ._2302', '4243424', '52033.23021', '5201w2211', '53210ggsd3333')

And the code:

str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w'))
[1] "52033.23021"   "5201w2211"     "53210ggsd3333"

That returns everything that begins with 5, followed by 3 digits and then any \\w character ([A-Za-z0-9_]). OK.

What I don’t understand is the code below:

str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w\\b'))
[1] "52033.23021"

str_subset(string = y, regex(pattern = '\\b5\\d{3}\\w\\B'))
[1] "5201w2211"     "53210ggsd3333"

Since the first pattern covers everything that ends with \w, it should match everything, right? And the second, instead of matching, should reject everything.

That’s what I don’t get.

Therefore,

  • what is the difference between \b and \B, in terms of how they are applied and when to use each?

1 answer


First of all, it is worth clarifying that the shortcut \b (known as a word boundary) is a zero-length match: it does not correspond to a character, but to a position in the string. Specifically, it is a position that has an alphanumeric character before it and no alphanumeric character after it (or vice versa). Here "alphanumeric" means anything that \w matches: a letter, a digit, or _.

So much so that the regex a\b finds a match in the string "a": the end of the string is also a position that has an alphanumeric character before it and none after it.

\B, in turn, is any position in the string that is not a \b, that is, a position where the characters before and after it are both alphanumeric (or both non-alphanumeric, with the start and end of the string counting as non-alphanumeric).
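The definitions above can be checked with a minimal sketch. It assumes base R with perl = TRUE rather than stringr, only to avoid package dependencies; the helper name has is made up for illustration.

```r
# Assumption: base R with perl = TRUE (PCRE) instead of stringr, so the
# sketch has no package dependencies. `has` is a hypothetical helper.
has <- function(pattern, s) grepl(pattern, s, perl = TRUE)

end_is_boundary   <- has("a\\b", "a")    # end of string right after a word character
end_is_not_B      <- has("a\\B", "a")    # the same position is therefore not \B
between_two_words <- has("a\\B", "ab")   # word character on both sides
between_two_non   <- has("-\\B", "-!")   # non-word character on both sides

# \b on its own matches a position, so the match has length zero.
m <- regexpr("\\b", "dia", perl = TRUE)
boundary_start  <- as.integer(m)
boundary_length <- attr(m, "match.length")

print(c(end_is_boundary, end_is_not_B, between_two_words, between_two_non))
print(c(boundary_start, boundary_length))
```

The first print should show TRUE FALSE TRUE TRUE: every position is either a \b or a \B, never both.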


So in your examples both regexes begin with \\b5\\d{3}\\w. There is a \b before the digit 5 and, since 5 is alphanumeric, this \b only matches if there is no alphanumeric character before the 5 (i.e., the 5 is preceded by a non-alphanumeric character, or it is at the beginning of the string). After the 5 we have 3 digits (\d{3}), followed by \w (a shortcut that matches a letter, a digit, or _).

What changes is what comes after the \w. In the first case, we have \b. Since \w matches an alphanumeric character, the \b checks that what comes next is not alphanumeric (or that the string ends there). That’s why it only takes "52033.23021". See how this string corresponds to the regex:

   5  203    3     .23021
\b 5  \d{3}  \w  \b

The second \b corresponds to the position between the 3 and the period: it is indeed a position that has an alphanumeric character before it and a non-alphanumeric character after it.

The other strings do not match for the following reasons:

  • '5203. ._2302': after the 5 and the 3 digits there is a period, which does not match \w
  • '4243424': does not have the 5
  • '5201w2211': has the 5, the 3 digits and the letter w, which matches \w. But after the letter w there is the digit 2, so the position between the letter w and the digit 2 is not a \b.
  • '53210ggsd3333': has the 5, the 3 digits and the digit 0, which matches \w. But after the 0 there is the letter g, so the position between the digit 0 and the letter g is not a \b.
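To see which characters the regex actually consumed, and what sits right after the match, here is a small sketch. It assumes base R with perl = TRUE instead of stringr; the variable names are illustrative.

```r
# Assumption: base R (perl = TRUE) is used in place of stringr.
y <- c('5203. ._2302', '4243424', '52033.23021', '5201w2211', '53210ggsd3333')

hit <- regexpr('\\b5\\d{3}\\w\\b', y, perl = TRUE)
found <- hit > 0                      # regexpr returns -1 where nothing matched

matched_strings <- y[found]           # strings that contain a match
matched_text    <- regmatches(y, hit) # the part of each string that matched

# Character immediately after the match: it must not be a word character.
end_pos   <- hit[found] + attr(hit, "match.length")[found]
next_char <- substr(matched_strings, end_pos, end_pos)

print(matched_strings)
print(matched_text)
print(next_char)
```

Only "52033.23021" should be reported, with "52033" as the matched text and "." as the following character.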

The regex '\\b5\\d{3}\\w\\B', on the other hand, has a \B at the end, that is, a position where the characters before and after it are both alphanumeric (or both non-alphanumeric). So it takes the strings "5201w2211" and "53210ggsd3333":

   5  201    w     2211
\b 5  \d{3}  \w  \B

   5  321    0     ggsd3333
\b 5  \d{3}  \w  \B

The \w takes an alphanumeric character, so the \B only matches if the next character is also alphanumeric. Note above that in the first string it worked because the \B corresponds to the position between the letter w and the digit 2, and in the second string it is the position between the digit 0 and the letter g. And this is why it does not take the other strings:

  • '5203. ._2302': after the 5 and the 3 digits there is a period, which does not match \w
  • '4243424': does not have the 5
  • '52033.23021': has the 5, the 3 digits and the digit 3, which matches \w. But after the digit 3 there is a period, which is not alphanumeric. Thus, the position between the 3 and the period is not a \B.
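The two regexes can be compared side by side with the sketch below. It assumes base R with perl = TRUE instead of stringr, and the column names are made up for illustration. In this data each string has a single place where the common prefix can start, so exactly one of \b or \B applies after it; a string with several candidate positions could match both.

```r
# Assumption: base R (perl = TRUE) is used in place of stringr.
y <- c('5203. ._2302', '4243424', '52033.23021', '5201w2211', '53210ggsd3333')
prefix <- '\\b5\\d{3}\\w'

with_prefix <- grepl(prefix, y, perl = TRUE)                  # common part only
with_b      <- grepl(paste0(prefix, '\\b'), y, perl = TRUE)   # ... then a boundary
with_B      <- grepl(paste0(prefix, '\\B'), y, perl = TRUE)   # ... then a non-boundary

comparison <- data.frame(y, with_prefix, with_b, with_B)
print(comparison)

# In this data the \b and \B results split the prefix matches in two.
splits_cleanly <- all(with_prefix == (with_b | with_B)) && !any(with_b & with_B)
print(splits_cleanly)
```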

Finally, it’s worth clearing up a point of confusion in your question. \b does not correspond only to the beginning or end of the string, but to any position in the string that has an alphanumeric character before it and none after it (or vice versa). For example:

x <- c('ele podia ter feito isso', 'que dia feliz', 'o diabo que te carregue', 'foi adiado para amanhã')

str_subset(string = x, regex(pattern = '\\bdia'))
[1] "que dia feliz"           "o diabo que te carregue"

str_subset(string = x, regex(pattern = 'dia\\b'))
[1] "ele podia ter feito isso" "que dia feliz"           

str_subset(string = x, regex(pattern = '\\bdia\\b'))
[1] "que dia feliz"

The first case (\\bdia) takes any word that starts with "dia", the second case (dia\\b) takes any word that ends with "dia", and the third case (\\bdia\\b) takes exactly the word "dia". Note that these words do not need to be at the beginning or end of the string.

Note also that none of the regexes takes the word "adiado", because the characters before and after "dia" are also alphanumeric, so those positions are not a \b.

Finally, if you want to specifically mark the beginning or end of the string, use the markers ^ and $.
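The difference between the string anchors and \b can be seen in the sketch below. It assumes base R with perl = TRUE instead of stringr, and s is a made-up, ASCII-only vector adapted from the sentences above.

```r
# Assumptions: base R (perl = TRUE) in place of stringr; `s` is a
# hypothetical vector adapted from the sentences in this answer.
s <- c('dia', 'que dia feliz', 'o diabo que te carregue', 'foi adiado')

starts_string <- s[grepl('^dia', s, perl = TRUE)]        # "dia" at the start of the string
starts_word   <- s[grepl('\\bdia', s, perl = TRUE)]      # "dia" at the start of any word
whole_string  <- s[grepl('^dia$', s, perl = TRUE)]       # the string is exactly "dia"
whole_word    <- s[grepl('\\bdia\\b', s, perl = TRUE)]   # "dia" as a whole word anywhere

print(starts_string)
print(starts_word)
print(whole_string)
print(whole_word)
```

With ^ and $ only the string "dia" itself should be returned, while the \b versions should also return sentences in which "dia" starts a word or stands alone in the middle of the string.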
