Backreference \x does not match the corresponding groups when I change their order

Question

Asked 5 years, 9 months ago

Viewed 89 times

4

The backreference \x (where x is a group number) matches again whatever was captured by an earlier group () in the regex.

For example:

library(stringr)

a <- 'quero-quero'

str_extract(string = a, pattern = regex(pattern = '(quero)-\\1'))

[1] "quero-quero"

This result is fine. The case is relatively simple, since there is only one group. The next cases, however, do not match the corresponding groups as I expected:

b <- 'lentamente é mente lenta'

str_extract(string = b, pattern = regex(pattern = '(lenta)(mente) é \\2 \\1'))

[1] "lentamente é mente lenta"

This works, but when I reverse the order of the backreferences, I no longer get a match:

str_extract(string = b, pattern = regex(pattern = '(lenta)(mente) é \\1 \\2'))

[1] NA

The same thing happens in the following cases:

Additional example 1:

c <- 'bandeirante é bandeira band'

str_extract(string = c, pattern = regex(pattern = '(((band)eira)nte) é \\2 \\3'))

[1] "bandeirante é bandeira band"

# reversing the groups
str_extract(string = c, pattern = regex(pattern = '(((band)eira)nte) é \\3 \\2'))

[1] NA

Additional example 2:

d <- 'indolor é sem dor'

str_extract(string = d, pattern = regex(pattern = 'in(d)ol(or) é sem \\1\\2'))

[1] "indolor é sem dor"

# reversing the groups

str_extract(string = d, pattern = regex(pattern = 'in(d)ol(or) é sem \\2\\1'))

[1] NA

I read that, but I couldn't understand what happened. I expected the groups to be returned, just in a different order. But I noticed that it is not that trivial.

Note: In R, a double backslash (\\) must be used, as stated here.

2 answers

4

When some part of a regex is in parentheses, it forms a capture group. Groups are numbered according to the order in which they appear in the expression. That is, in this regex:

(lenta)(mente)

We have two pairs of parentheses, and therefore two capture groups. The first (group 1) is the one that contains the string "lenta" and the second (group 2) contains "mente".
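The numbering can be checked directly. The sketch below is only an illustration: it uses base R's regexec and regmatches instead of stringr so that it runs without extra packages, it borrows the nested pattern from the question, and the variable names are made up for this example.

```r
# Groups are numbered by the position of their opening parenthesis,
# so with nested groups the outermost one comes first.
x <- 'bandeirante'

# regexec returns the position of the full match and of every capture group;
# regmatches turns those positions into the matched substrings.
m <- regmatches(x, regexec('(((band)eira)nte)', x))[[1]]

whole  <- m[1]  # the full match
group1 <- m[2]  # captured by (((band)eira)nte)
group2 <- m[3]  # captured by ((band)eira)
group3 <- m[4]  # captured by (band)

print(m)
```

This is why, in the question's example, \\2 stands for "bandeira" and \\3 for "band".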

Backreferences serve to refer to an existing group, so that we don't have to write the same thing again. This means that the two regexes below match exactly the same strings:

(lenta)(mente) é \\2 \\1
lentamente é mente lenta

Indeed, the code below:

library(stringr)

a <- 'lentamente é mente lenta'

str_extract(string = a, pattern = '(lenta)(mente) é \\2 \\1')
str_extract(string = a, pattern = 'lentamente é mente lenta')

Results in:

[1] "lentamente é mente lenta"
[1] "lentamente é mente lenta"

Both expressions match the same string ("lentamente é mente lenta"). Notice that \\1 is just a shortcut to "take what was found in the first capture group and put here", and \\2 does the same thing for the second group.

The regex (lenta)(mente) é \\1 \\2, on the other hand, is the same as lentamente é lenta mente. It is a different regex from the first one, so it cannot find a match when you search the string "lentamente é mente lenta". Indeed, the code below:

library(stringr)

a <- c('lentamente é mente lenta', 'lentamente é lenta mente')

str_extract(string = a, pattern = '(lenta)(mente) é \\1 \\2')

Returns the following:

[1] NA                         "lentamente é lenta mente"

In your examples the usefulness of this may not be so clear, but consider the example below:

library(stringr)

b <- c('ab', 'cc')

str_extract(string = b, pattern = '([a-z])\\1')

This regex searches for a letter from a to z ([a-z]) followed by the same letter. That is, it searches for two identical letters in a row. The result is:

[1] NA   "cc"

Notice that this is different from [a-z][a-z]: that regex matches two letters, and each can be any letter from a to z (they don't have to be the same letter). If I want to match two identical letters in a row, I can't use it. The problem is that I have no way of knowing in advance which letter is repeated (of course I could write aa|bb|cc|dd..., but that would be neither practical nor smart). Only by using the backreference \1 can I guarantee that the second letter is the same one that was captured by the parentheses. It is a smart way to refer to an excerpt that was found earlier.
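A minimal sketch of that difference, assuming a small made-up vector of words and using base R (grepl, regexpr, regmatches) with perl = TRUE rather than stringr:

```r
# Hypothetical sample data for illustration only.
words <- c('book', 'cat', 'letter')

# ([a-z])\\1 requires the second letter to equal the captured first one.
has_double <- grepl('([a-z])\\1', words, perl = TRUE)

# [a-z][a-z] accepts any two letters in a row, equal or not.
any_two <- grepl('[a-z][a-z]', words, perl = TRUE)

# Extract the first doubled letter from each word that has one;
# words without a match are dropped by regmatches.
doubles <- regmatches(words, regexpr('([a-z])\\1', words, perl = TRUE))

print(has_double)
print(any_two)
print(doubles)
```

Only the pattern with the backreference tells 'cat' apart from the other two words.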

That said, putting the backreferences in a different order (as in (lenta)(mente) é \\2 \\1) does not cause the groups to be reversed in the original string. When searching for a match, you are checking whether the string corresponds to the expression, and in this case the expression says that the string must contain snippets that repeat in a given order. The search for a match, by itself, does not modify the original string.

If you want to modify something in the string, you should do a substitution, as already shown in Rui's answer. For example:

b <- 'lentamente'

sub('(lenta)(mente)', '\\1\\2 é \\2 \\1', b)

The first parameter is the regex (lenta)(mente), which has two capture groups (the first is "lenta" and the second is "mente"). The second parameter specifies the replacement, and notice that I use each backreference more than once.

In this case, the replacement string says the following:

place the first capture group (\\1), then the second (\\2)
add a space, the letter é, and another space
place the second capture group, a space, and the first capture group

The result is:

[1] "lentamente é mente lenta"

by Rui Barradas • **15,422** points · Answer 1 · 2019-10-13T21:30:28+00:00

I am going to simplify the code a little, since there is no need to call regex. For the purposes of the question, the following is equivalent.

library(stringr)

b <- 'lentamente é mente lenta'

str_extract(string = b, pattern = '(lenta)(mente) é \\2 \\1')
str_extract(string = b, pattern = '(lenta)(mente) é \\1 \\2')

What is happening is this:

In the first case, the pattern '(lenta)(mente) é \\2 \\1' can be found in b, and therefore str_extract can extract it. No problem.
In the second case, the pattern '(lenta)(mente) é \\1 \\2' does not occur in b. This pattern expands to '(lenta)(mente) é lenta mente'; the backreferences \1 and \2 are correctly replaced by the previously captured strings. Since the whole regex is not found in b, the result is NA.

Now see, with sub, that R handles the backreferences correctly.

sub('(lenta)(mente)', '\\2 \\1', b)
#[1] "mente lenta é mente lenta"

sub('(lenta)(mente)', '\\1 \\2', b)
#[1] "lenta mente é mente lenta"

The results of the other examples of the question are analogous.
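As a rough check of that claim, the sketch below takes the two reversed patterns from the question and tests them against the original strings and against strings rewritten in the order the patterns actually describe. It uses base R's grepl with perl = TRUE instead of str_extract, and the rewritten strings and variable names are made up for illustration.

```r
# Reversed patterns from the question's additional examples.
c_rev <- '(((band)eira)nte) é \\3 \\2'  # expands to 'bandeirante é band bandeira'
d_rev <- 'in(d)ol(or) é sem \\2\\1'     # expands to 'indolor é sem ord'

# Original strings: the repeated parts are in the other order, so no match.
c_original <- grepl(c_rev, 'bandeirante é bandeira band', perl = TRUE)
d_original <- grepl(d_rev, 'indolor é sem dor', perl = TRUE)

# Hypothetical strings written in the order the reversed patterns require.
c_swapped <- grepl(c_rev, 'bandeirante é band bandeira', perl = TRUE)
d_swapped <- grepl(d_rev, 'indolor é sem ord', perl = TRUE)

print(c(c_original, d_original, c_swapped, d_swapped))
```

In both cases the backreferences are resolved correctly; the original strings just do not contain the text in the order the reversed patterns ask for.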