When some part of the regex is in parentheses, it forms a capture group. Groups are numbered in the order in which their opening parentheses appear in the expression. For example, in this regex:
(lenta)(mente)
There are two pairs of parentheses, and therefore two capture groups. The first (group 1) contains the string "lenta" and the second (group 2) contains "mente".
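To see the numbering in practice, here is a minimal sketch that lists what each group captured. It uses base R (regexec and regmatches) instead of stringr so that it runs without extra packages; the variable names are only illustrative.

```r
# Base R sketch: list the full match and each numbered capture group.
x <- "lentamente"

# regexec() records the position of the whole match and of every group;
# regmatches() turns those positions into the matched text.
m <- regmatches(x, regexec("(lenta)(mente)", x))[[1]]

full_match <- m[1]  # the whole match
group1 <- m[2]      # text captured by the first pair of parentheses
group2 <- m[3]      # text captured by the second pair of parentheses

print(m)
```

The first element is the whole match, and the following elements are the groups in numbered order.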
Backreferences refer to an existing group, so that we don't have to write the same thing again. This means that the two regexes below are equivalent:
(lenta)(mente) é \\2 \\1
lentamente é mente lenta
This is why the code below:
library(stringr)
a <- 'lentamente é mente lenta'
str_extract(string = a, pattern = '(lenta)(mente) é \\2 \\1')
str_extract(string = a, pattern = 'lentamente é mente lenta')
Results in:
[1] "lentamente é mente lenta"
[1] "lentamente é mente lenta"
Both expressions match the same string ("lentamente é mente lenta"). Notice that \\1 is just a shortcut for "take what was found in the first capture group and put it here", and \\2 does the same thing for the second group.
By contrast, the regex (lenta)(mente) é \\1 \\2 is the same as lentamente é lenta mente. It is a different regex from the first one, so it finds no match in the string "lentamente é mente lenta". This is why the code below:
library(stringr)
a <- c('lentamente é mente lenta', 'lentamente é lenta mente')
str_extract(string = a, pattern = '(lenta)(mente) é \\1 \\2')
Returns the following:
[1] NA "lentamente é lenta mente"
In your examples the usefulness of this may not be obvious, but consider the example below:
library(stringr)
b <- c('ab', 'cc')
str_extract(string = b, pattern = '([a-z])\\1')
The regex searches for a letter from a to z ([a-z]) followed by that same letter. That is, it searches for two identical letters in a row. The result is:
[1] NA "cc"
Notice that this is different from [a-z][a-z]: that regex matches two letters, and each can be any letter from a to z (they don't have to be the same letter). If I want two identical letters in a row, I can't use it. The problem is that there is no way to know in advance which letter is repeated (of course I could write aa|bb|cc|dd..., but that would be neither practical nor smart). Only by using the backreference \\1 can I guarantee that the second letter is the same one that was captured by the parentheses. It is a smart way to refer to an excerpt that was matched earlier.
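The difference between the two patterns can be checked side by side. The sketch below uses base R's grepl and a made-up vector of words (the extra entries "book" and "data" are not from the question, they are only there for illustration).

```r
# Hypothetical input: two strings from the example plus two extra words.
words <- c("ab", "cc", "book", "data")

# Any two consecutive lowercase letters, equal or not.
any_two <- grepl("[a-z][a-z]", words)

# A lowercase letter immediately followed by the same letter.
same_two <- grepl("([a-z])\\1", words)

print(data.frame(words, any_two, same_two))
```

Every word satisfies the first pattern, but only "cc" and "book" (because of "oo") satisfy the one with the backreference.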
That said, referencing the groups out of order (as in (lenta)(mente) é \\2 \\1) does not cause the groups to be reversed in the original string. When searching for a match, you are only checking whether the string corresponds to the expression, and in this case the expression says that the string contains snippets that repeat in a given order. The search for a match, by itself, does not modify the original string.
If you want to modify something in the string, you should make a substitution, as already indicated in Rui's answer. For example:
b <- 'lentamente'
sub('(lenta)(mente)', '\\1\\2 é \\2 \\1', b)
The first parameter is the regex (lenta)(mente), which has two capture groups (the first is "lenta" and the second is "mente"). The second parameter specifies the replacement, and notice that it uses the backreferences more than once.
In this case, the replacement string says the following:
- place the first capture group (\\1), and then the second (\\2)
- add a space, the letter é and another space
- place the second capture group, a space and the first capture group
The result is:
[1] "lentamente é mente lenta"
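The same idea can be used to reorder pieces of a string. The sketch below is only an illustration with made-up data (names in "Surname, Name" form, which are not from the question): the backreferences in the replacement swap the two captured parts, and the input vector itself is left untouched.

```r
# Hypothetical input: names written as "Surname, Name".
names_in <- c("Silva, Ana", "Souza, Rui")

# Group 1 captures the surname and group 2 the given name;
# the replacement writes them back in the opposite order.
names_out <- sub("([A-Za-z]+), ([A-Za-z]+)", "\\2 \\1", names_in)

print(names_out)  # reordered copies
print(names_in)   # the original vector is not modified
```

sub returns a new vector, so the result has to be assigned if you want to keep it. It replaces only the first match in each string; gsub would replace all of them.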