Regex - Extract numbers

I have a column in a data.frame that is similar to this structure:

d <- structure(list(value = c("           2019s/v282930ahead of print        ", 
"           2018s/v252627         ", "           2017s/v222324         ", 
"           2016s/v192021         ", "           2015s/v161718         "
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

I would like to turn it into 4 other columns with the functions mutate and str_extract, if possible (although other suggestions are welcome).

The columns would be:

Year - first 4 digits

Number_1 - 2 digits after string "s/v"

Number_2 - 2 digits after Number_1

Number_3 - 2 digits after Number_2

So, for the first row, the new columns would be:

Year Number_1 Number_2 Number_3
2019       28       29       30

What I’m trying to do is this:

library(dplyr)
library(stringr)

d %>% 
  mutate(value = str_trim(value), 
         year = str_extract(value, "\\d{4}"),
         Number_1 = str_extract(value, "(s/v)\\d{2}"))
         # Number_2 = str_extract(value, "(s/v)\\d{2}") - I don't know
         # Number_3 = str_extract(value, "(s/v)\\d{2}") - I don't know

Could someone give some tips?

1 answer

Positive lookbehind

The search should use the positive lookbehind operator: (?<=not_returned)returned.

This operator only allows a match at a position where the text immediately before it matches the pattern inside the parentheses, (?<=here). The text matched by the lookbehind is not included in the result.

To see how this operator works, I recommend running the following code:

d %>% 
  pull(value) %>% 
  str_view("(?<=s/v)")

The output of str_view shows that the regex engine only returns what it finds if it comes immediately after s/v.
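To make the difference concrete, here is a minimal sketch comparing a pattern with and without the lookbehind on a single string. The variable names `x`, `with_prefix`, and `digits_only` are only illustrative and do not come from the question.

```r
library(stringr)

# One trimmed value from the question's data (illustrative variable name)
x <- "2019s/v282930ahead of print"

# Without a lookbehind, the literal "s/v" is part of the returned match
with_prefix <- str_extract(x, "s/v\\d{2}")
with_prefix
#> [1] "s/v28"

# With a positive lookbehind, "s/v" must come right before the match,
# but it is not returned
digits_only <- str_extract(x, "(?<=s/v)\\d{2}")
digits_only
#> [1] "28"
```

This is why the asker's original pattern, "(s/v)\\d{2}", returns the prefix together with the digits: a plain group still consumes the text it matches.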

The answer

Using this operator, we can solve the problem with the regular expressions used in the code below.

library(tidyverse)
d <- structure(list(value = c("           2019s/v282930ahead of print        ", 
                              "           2018s/v252627         ", "           2017s/v222324         ", 
                              "           2016s/v192021         ", "           2015s/v161718         "
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

d %>% 
  mutate(value = str_trim(value), 
         year = str_extract(value, "\\d{4}"),
         Number_1 = str_extract(value, "(?<=s/v)\\d{2}"),
         Number_2 = str_extract(value, "(?<=s/v\\d{2})\\d{2}"),
         Number_3 = str_extract(value, "(?<=s/v\\d{4})\\d{2}")
  )
#> # A tibble: 5 x 5
#>   value                       year  Number_1 Number_2 Number_3
#>   <chr>                       <chr> <chr>    <chr>    <chr>   
#> 1 2019s/v282930ahead of print 2019  28       29       30      
#> 2 2018s/v252627               2018  25       26       27      
#> 3 2017s/v222324               2017  22       23       24      
#> 4 2016s/v192021               2016  19       20       21      
#> 5 2015s/v161718               2015  16       17       18

Created on 2020-04-24 by the reprex package (v0.3.0)
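Since the question says other suggestions are welcome, here is a possible alternative that is not part of the answer above: a sketch using stringr's `str_match` with capture groups, so that a single pattern pulls out all four pieces. The names `value`, `m`, and `parts` are illustrative, and the sketch assumes every string follows the year, "s/v", three 2-digit numbers layout.

```r
library(stringr)

# Two already-trimmed values from the question's data (illustrative subset)
value <- c("2019s/v282930ahead of print", "2018s/v252627")

# One pattern with four capture groups: the year, then three 2-digit numbers
m <- str_match(value, "(\\d{4})s/v(\\d{2})(\\d{2})(\\d{2})")

# Column 1 of the matrix holds the full match; columns 2 to 5 hold the groups
parts <- data.frame(
  year     = m[, 2],
  Number_1 = m[, 3],
  Number_2 = m[, 4],
  Number_3 = m[, 5],
  stringsAsFactors = FALSE
)
parts
```

As with `str_extract`, the resulting columns are character strings; wrap them in `as.integer()` if numeric values are needed. Rows that do not fit the assumed layout would come back as `NA`.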

  • Perfect, Tomas! Thank you very much! Following on from your excellent explanation, do you have any material to recommend (preferably in Portuguese) for studying regex applied to R? I've been looking at a few things, but I hadn't come across lookbehind. Anyway, it was well worth it!

  • Regex in R is much like regex in Perl. I recommend experimenting on regex101.com

  • There is also this material here

  • I’ll take a look, thanks!
