How to extract separate data from an out-of-format string

When extracting text line by line from a PDF in R, I run into the following situation: a string of interest contains three pieces of information separated by long runs of spaces, as in the example below.

a <- "WDFG V/AA 8952                              123546514            sdfasfasfa"

I need to extract only "WDFG V/AA 8952".

Since there are many PDFs, I need a method that can be run in a loop to scale the process.

1 answer

I believe this regular expression solves the problem.

The data are these:

a <- c("João Fernando Freitas                             123546514            sdfasfasfa",
       "WDFG V/AA 8952                              123546514            sdfasfasfa")

And the regular expression:

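# Match the wide gap (three or more spaces) and the rest of the line, keeping only group 1
b <- sub("(^[[:alnum:]]*| ) {2,}([[:alnum:]]| )*$", "\\1", a)
# Strip the leftover leading/trailing whitespace
b <- trimws(b)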
b
#[1] "João Fernando Freitas" "WDFG V/AA 8952" 

If you want to run this in a loop, it helps to wrap the code above in a function.

extraiNome <- function(x){
  b <- sub("(^[[:alnum:]]*| ) {2,}([[:alnum:]]| )*$", "\\1", x)
  trimws(b)
}

extraiNome(a)
#[1] "João Fernando Freitas" "WDFG V/AA 8952" 
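
Since the question mentions many PDFs, here is a minimal sketch of how the same idea could be applied to several files at once. It is only an illustration under stated assumptions: `first_field` and `pdf_lines` are hypothetical names, fields are assumed to be separated by two or more consecutive spaces (not tabs), and the second input line is made up to show a case where the later columns contain punctuation. It uses a simpler pattern than the one above, which drops everything from the first wide gap onward regardless of what follows.

```r
# Hypothetical helper: keep only the first field of each line, assuming
# fields are separated by two or more consecutive spaces.
first_field <- function(x) {
  # Trim first so that leading indentation is not mistaken for a field gap,
  # then drop the first run of 2+ spaces and everything after it.
  sub(" {2,}.*$", "", trimws(x))
}

# Assumed input: one character vector of text lines per PDF.
pdf_lines <- list(
  file1 = "WDFG V/AA 8952                              123546514            sdfasfasfa",
  file2 = "  ABC 12/3        45.6        x-y"  # made-up line with punctuation
)

# Apply the helper to every PDF; the result is a list with the same names.
first_fields <- lapply(pdf_lines, first_field)
first_fields$file1
#[1] "WDFG V/AA 8952"
first_fields$file2
#[1] "ABC 12/3"
```

Because `sub` is vectorized, the helper also works directly on a vector holding all the lines of one file; `lapply` is only needed to go across files.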
  • Thanks, that solved the problem.

  • Rui, could you please help me adapt the code to extract the first term of the string from the question as I have now edited it? Thank you. I need to extract only "WDFG V/AA 8952".

  • @Henriquefariadeoliveira I just tested, and the function in the answer does this with the new string as well.
