Regex - Extract numbers

Question

Asked 5 years, 3 months ago

I have a column in a data.frame that is similar to this structure:

d <- structure(list(value = c("           2019s/v282930ahead of print        ", 
"           2018s/v252627         ", "           2017s/v222324         ", 
"           2016s/v192021         ", "           2015s/v161718         "
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

I would like to turn it into 4 other columns with the functions mutate and str_extract, if possible (although other suggestions are welcome).

The columns would be:

Year - first 4 digits

Number_1 - 2 digits after string "s/v"

Number_2 - 2 digits after Number_1

Number_3 - 2 digits after Number_2

So the result for the first row of the new columns would be:

Year Number_1 Number_2 Number_3
2019       28       29       30

What I’m trying to do is this:

library(dplyr)
library(stringr)

d %>% 
  mutate(value = str_trim(value), 
         year = str_extract(value, "\\d{4}"),
         Number_1 = str_extract(value, "(s/v)\\d{2}"))
         # Number_2 = str_extract(value, "(s/v)\\d{2}")  - I don't know
         # Number_3 = str_extract(value, "(s/v)\\d{2}")  - I don't know

Could someone give some tips?

1 answer

by Tomás Barcellos • **5,562** points · Answer 1 · 2020-04-24T23:07:22+00:00

Positive lookbehind

The search should use the "positive lookbehind" operator: (?<=not_returned)returned.

With this operator, the regex engine matches only at positions immediately preceded by the pattern inside the parentheses, (?<=here), and that pattern is not included in the returned match.

To see how this operator works, I recommend running the following code:

d %>% 
  pull(value) %>% 
  str_view("(?<=s/v)")

The str_view output shows that the regex engine matches at the position immediately after s/v, so whatever follows the lookbehind in the pattern is returned only if it comes right after s/v.
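
As a small illustration of the difference, the sketch below runs the pattern from the question and the lookbehind version on a single trimmed string. The variable names x, with_prefix and without_prefix are made up for this example.

```r
library(stringr)

# One trimmed value from the question's data
x <- "2019s/v282930ahead of print"

# Without a lookbehind, the literal "s/v" is part of the returned match
with_prefix <- str_extract(x, "s/v\\d{2}")

# With a lookbehind, "s/v" must precede the match but is not returned
without_prefix <- str_extract(x, "(?<=s/v)\\d{2}")

print(with_prefix)     # "s/v28"
print(without_prefix)  # "28"
```

This is why the attempt in the question returns "s/v28" rather than "28".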

The answer

Using this operator, we can solve the problem with the regular expressions used in the code below.

library(tidyverse)
d <- structure(list(value = c("           2019s/v282930ahead of print        ", 
                              "           2018s/v252627         ", "           2017s/v222324         ", 
                              "           2016s/v192021         ", "           2015s/v161718         "
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

d %>% 
  mutate(value = str_trim(value), 
         year = str_extract(value, "\\d{4}"),
         Number_1 = str_extract(value, "(?<=s/v)\\d{2}"),
         Number_2 = str_extract(value, "(?<=s/v\\d{2})\\d{2}"),
         Number_3 = str_extract(value, "(?<=s/v\\d{4})\\d{2}")
  )
#> # A tibble: 5 x 5
#>   value                       year  Number_1 Number_2 Number_3
#>   <chr>                       <chr> <chr>    <chr>    <chr>   
#> 1 2019s/v282930ahead of print 2019  28       29       30      
#> 2 2018s/v252627               2018  25       26       27      
#> 3 2017s/v222324               2017  22       23       24      
#> 4 2016s/v192021               2016  19       20       21      
#> 5 2015s/v161718               2015  16       17       18

Created on 2020-04-24 by the reprex package (v0.3.0)
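
Since the question welcomes other suggestions, here is a hedged alternative that is not part of the original answer: a single pattern with four capture groups passed to str_match, with as.integer added on the assumption that numeric columns are wanted. Only two of the five rows are used, and the names m and result are illustrative.

```r
library(stringr)

# Two rows of the question's data, with the original padding
value <- c("           2019s/v282930ahead of print        ",
           "           2018s/v252627         ")

# One pattern, four capture groups: the year, then three two-digit numbers
m <- str_match(str_trim(value), "(\\d{4})s/v(\\d{2})(\\d{2})(\\d{2})")

# Column 1 of the matrix is the full match; columns 2 to 5 are the groups
result <- data.frame(
  year     = as.integer(m[, 2]),
  Number_1 = as.integer(m[, 3]),
  Number_2 = as.integer(m[, 4]),
  Number_3 = as.integer(m[, 5])
)

print(result)
```

The lookbehind version keeps each column's pattern independent, while this version reads the whole structure in one pass; which one is clearer depends on how regular the input is.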