It depends on the format these citations can have. One option would be:
str_extract_all(line, "(?<=@)\\w+")
Returning:
[1] "REF1" "REF2" "REF3" "REF4" "REF5"
This regex uses a lookbehind (the part between (?<= and the closing parenthesis), which checks whether something exists immediately before the current position. In this case, the lookbehind contains only the @.
The detail is that the @, being inside a lookbehind, is not part of the match, so the regex returns only what comes after it, which here is \\w+ (the backslash is doubled because it must be escaped inside an R string). The shortcut \w means "letters, digits, or the character _", and the quantifier + means "one or more occurrences".
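To make the effect of the lookbehind concrete, here is a minimal sketch comparing the same pattern with and without it. The input string is made up for illustration (the original line is not shown here), and the variable names with_at and without_at are hypothetical.

```r
library(stringr)

# Hypothetical input: the original `line` is not shown, so this one is invented.
line <- "See @REF1, @REF2 and also @REF3."

# Without a lookbehind, the @ is consumed and becomes part of each match.
with_at <- str_extract_all(line, "@\\w+")[[1]]

# With the lookbehind, the @ is only checked, not included in the match.
without_at <- str_extract_all(line, "(?<=@)\\w+")[[1]]

print(with_at)     # "@REF1" "@REF2" "@REF3"
print(without_at)  # "REF1" "REF2" "REF3"
```

Note that str_extract_all returns a list with one element per input string, which is why [[1]] is used to get the character vector.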
Another answer suggested using \\w*, but the * means "zero or more occurrences", so if there is a lone @ (with no character matching \w after it), an empty match is returned. See the difference:
line <- 'Teste @ abc @REF1'
str_extract_all(line, "(?<=@)\\w+")
str_extract_all(line, "(?<=@)\\w*")
The first returns:
[1] "REF1"
And the second returns:
[1] "" "REF1"
If you want, you can be more specific (but then it depends on the exact format of the citation). For example, if the format is always "3 uppercase letters followed by 1 digit", you can use:
str_extract_all(line, "(?<=@)[A-Z]{3}[0-9]")
Since no further details were given about the format, I leave only this suggestion, but ideally you should be as specific as possible to avoid false positives.
For example, since \w also matches the character _, the text @___ is considered valid. But of course, if you know that such cases do not occur in your strings, using \w is not much of a problem. It all depends on your data.
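The false-positive case can be checked with a short sketch like the one below. The input string is invented for illustration, and the names loose and strict are hypothetical; the stricter pattern is the "3 uppercase letters and 1 digit" example from above, not necessarily the right one for your data.

```r
library(stringr)

# Hypothetical input mixing underscore-only noise with a citation-like token.
line <- "Noise @___ and a citation @ABC1"

# \w also matches "_", so the underscore-only token is returned as well.
loose <- str_extract_all(line, "(?<=@)\\w+")[[1]]

# A more specific pattern rejects it.
strict <- str_extract_all(line, "(?<=@)[A-Z]{3}[0-9]")[[1]]

print(loose)   # "___" "ABC1"
print(strict)  # "ABC1"
```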
Thank you! To be more specific, I know that all citations are in the format @surname: @Vieira2019. So I know that the last 4 characters will be digits. I don't know if this can help?! – Willian Vieira
@Willianvieira In this case it could be "(?<=@)[A-Za-z]+[0-9]{4}" (a variable number of letters, ending with exactly 4 digits). – hkotsubo
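As a quick check of the pattern from the last comment, here is a sketch on an invented input (only @Vieira2019 comes from the comments; the other tokens are hypothetical). The second pattern, with a negative lookahead (?![0-9]), is an assumption of mine and not part of the original suggestion: it is one possible way to reject tokens with more than 4 trailing digits, if that matters for your data.

```r
library(stringr)

# Hypothetical input; only @Vieira2019 is taken from the comments.
line <- "As shown by @Vieira2019 and @Silva2020, but not @2019 or @Vieira."

# Pattern from the comment: letters followed by 4 digits.
pattern <- "(?<=@)[A-Za-z]+[0-9]{4}"
found <- str_extract_all(line, pattern)[[1]]
print(found)  # "Vieira2019" "Silva2020"

# With 5 trailing digits, the pattern still matches the first 4 of them.
partial <- str_extract_all("@Vieira20191", pattern)[[1]]
print(partial)  # "Vieira2019"

# Assumed variant: the negative lookahead forbids a further digit after the 4.
strict_pattern <- "(?<=@)[A-Za-z]+[0-9]{4}(?![0-9])"
rejected <- str_extract_all("@Vieira20191", strict_pattern)[[1]]
print(length(rejected))  # 0
```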