Regular expression to extract citations in R


5

I would like to extract all the reference keys I use in a markdown document. Each key begins with the character @.

Here is an example of the different ways I can cite a document using such a key:

line <- 'According to @REF1, north American trees are lagging behind climate change [@REF2; @REF3]. However, some species have shown range limits expansion following warming temperature [*e.g.* white spruce @REF4; @REF5].'

From this example, I would like to obtain a vector with the following keys:

citations
# [1] REF1
# [1] REF2
# [1] REF3
# [1] REF4
# [1] REF5

3 answers

4


It depends on the format these citations may have. One option would be:

library(stringr)

str_extract_all(line, "(?<=@)\\w+")

Returning:

[1] "REF1" "REF2" "REF3" "REF4" "REF5"

This regex uses a lookbehind, the part between (?<= and ), which checks whether something exists immediately before the current position. In this case, the lookbehind contains only the @.

Because the @ is inside a lookbehind, it is not part of the match, so the regex returns only what comes after it, which in this case is \\w+. The shorthand \w means "a letter, a digit, or the character _", and the quantifier + means "one or more occurrences".

Another answer suggested \\w*, but * means "zero or more occurrences", so a lone @ (with no \w character after it) produces an empty match. See the difference:
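As a small illustration of the lookbehind, the sketch below compares a pattern that consumes the @ with one that only looks behind for it. It uses base R (regmatches with gregexpr and perl = TRUE) instead of stringr so that it runs without extra packages. The string and variable names are made up for the example.

```r
# Example string; the keys are placeholders for illustration
txt <- "See @REF1 and [@REF2; @REF3]."

# Without a lookbehind: the @ is part of each match
with_at <- unlist(regmatches(txt, gregexpr("@\\w+", txt, perl = TRUE)))

# With a lookbehind: the @ must precede the match but is not included in it
keys <- unlist(regmatches(txt, gregexpr("(?<=@)\\w+", txt, perl = TRUE)))

print(with_at)
print(keys)
```

The first vector keeps the leading @ on every element, while the second contains only the keys.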

line <- 'Teste @ abc @REF1'

str_extract_all(line, "(?<=@)\\w+")
str_extract_all(line, "(?<=@)\\w*")

The first returns:

[1] "REF1"

And the second returns:

[1] ""     "REF1"

If you want, you can be more specific (but then it depends on the exact format of the citation). For example, if the format is always "3 uppercase letters and 1 digit", you can use:

str_extract_all(line, "(?<=@)[A-Z]{3}[0-9]")

As no more details were given about the format, I leave only this suggestion, but ideally you should be as specific as possible to avoid false positives.

For example, since \w also matches the character _, the string @___ is considered valid. But of course, if you know that such cases do not occur in your strings, using \w is not much of a problem. It all depends.
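To make the false-positive point concrete, here is a minimal sketch comparing the loose \\w+ pattern with the stricter "3 uppercase letters and 1 digit" pattern. It is written in base R with perl = TRUE so it needs no packages, and the test string is invented for the illustration.

```r
# Invented test string: one real key and one stray token made of underscores
txt <- "Valid key @REF1 and a stray token @___ in the text."

# Loose pattern: \w also matches "_", so "___" is returned as a key
loose <- unlist(regmatches(txt, gregexpr("(?<=@)\\w+", txt, perl = TRUE)))

# Stricter pattern: exactly 3 uppercase letters followed by 1 digit
strict <- unlist(regmatches(txt, gregexpr("(?<=@)[A-Z]{3}[0-9]", txt, perl = TRUE)))

print(loose)
print(strict)
```

Only the stricter pattern leaves out the underscore token, at the cost of assuming a fixed key format.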

  • Thank you! To be more specific, I know that all citations are in the format @surname, e.g. @Vieira2019, so I know that the last 4 characters will be digits. I don't know if this helps?

  • 2

    @Willianvieira In this case it could be "(?<=@)[A-Za-z]+[0-9]{4}" (any number of letters, followed by exactly 4 digits).
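A quick sketch of the pattern from the comment above, assuming keys of the form surname plus four digits. The key @Silva2020 and the sentence are hypothetical, and the code uses base R with perl = TRUE so it runs without extra packages.

```r
# Hypothetical sentence: two surname-plus-4-digit keys and one key in another format
txt <- "As shown by @Vieira2019 and later work [@Silva2020; @REF1]."

# Letters followed by exactly 4 digits, preceded by an @ that is not captured
pattern <- "(?<=@)[A-Za-z]+[0-9]{4}"
keys <- unlist(regmatches(txt, gregexpr(pattern, txt, perl = TRUE)))

print(keys)
```

The key @REF1 is skipped here because it does not end in four digits.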

3

With the example you provided, I managed to do it like this.

library(tidyverse)

# using the tidyverse
citations <- str_extract_all(line, "@\\w*") %>% 
  as_vector() %>% 
  str_remove("@")

citations
#> [1] "REF1" "REF2" "REF3" "REF4" "REF5"

# base R
gregexpr(pattern = "@\\w*", text = line) %>% 
  regmatches(line, m = .) %>% 
  unlist() %>% 
  gsub("@","",.)
#> [1] "REF1" "REF2" "REF3" "REF4" "REF5"

Created on 2019-11-07 by the reprex package (v0.3.0)

1

There is no need for stringr in this exercise. We can use regmatches with gregexpr.

regmatches extracts the substrings found by gregexpr; the latter finds all matches of the pattern in the string.

regmatches returns a list whose only element, in this case, is a character vector with the desired substrings. So you wrap the regmatches(...) call in unlist to get the desired result.

As for the pattern, I believe a more flexible approach is the POSIX class [[:alnum:]], which matches upper- and lower-case letters as well as digits. Note the parameter perl = TRUE in gregexpr: it is needed for the lookbehind (?<=...), which R's default regex engine does not support. We also add the quantifier + at the end of the pattern; it is a "greedy" quantifier, that is, it matches one or more letters or digits and always takes the longest possible substring. Finally, the lookbehind (?<=...) restricts the matches to pieces of the string that come right after an @, as already explained in the answers above.
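The list structure described above can be inspected with a short sketch. The input string is a shortened, made-up version of the one in the question, and the variable names are only illustrative.

```r
# Shortened, made-up input string
txt <- "According to @REF1, trees are lagging behind [@REF2; @REF3]."

# gregexpr returns the match positions; regmatches turns them into substrings
m <- gregexpr("(?<=@)[[:alnum:]]+", txt, perl = TRUE)
res <- regmatches(txt, m)

# One list element per input string; here there is a single string
print(is.list(res))
print(length(res))

# unlist flattens the list into a plain character vector
keys <- unlist(res)
print(keys)
```

With a vector of several input strings, the list would have one element per string, and unlist would merge all the keys into one vector.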

So the solution looks like this:

unlist(regmatches(line, gregexpr("(?<=@)[[:alnum:]]+", line, perl = TRUE)))

Which returns:

[1] "REF1" "REF2" "REF3" "REF4" "REF5"

Note that, in this benchmark, this solution is approximately 5x faster than unlist(stringr::str_extract_all(...)):

microbenchmark::microbenchmark(
  base= unlist(regmatches(line, gregexpr("(?<=@)[[:alnum:]]+", line, perl = TRUE))),
  stringr = unlist(str_extract_all(line, "(?<=@)[[:alnum:]]+"))
)

Result:

Unit: microseconds
    expr      min        lq      mean   median       uq      max neval cld
    base  323.824  335.3105  381.1747  366.854  388.370 1300.399   100  a 
 stringr 1596.873 1632.4270 1772.1564 1666.888 1723.958 5634.818   100   b
