Text mining with R (stringr)

1

I have 15 strings and want to remove the first 70 characters and the last 200 characters from each.

I tried the following code to remove the beginnings and it didn’t work:

 texto2009a <- texto2009 %>% map(str_sub(., 1, 72) <- " ")
  • Welcome to Stack Overflow! Unfortunately, this question cannot be reproduced by anyone trying to answer it. Please take a look at this link to see how to ask a reproducible question in R, so that people who want to help can do so in the best possible way.

    1) What happens when the text is less than 70 characters long? 2) And less than 200? 3) And, by the way, less than 70 + 200?
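The edge cases raised in this comment can be checked directly. This is a minimal sketch, assuming stringr is installed and using small counts (3 and 4) in place of 70 and 200; the vector `textos` and the names `m`, `n` and `nothing_left` are made up for illustration.

```r
library(stringr)

# Made-up inputs: long enough, exactly m + n long, and shorter than m
textos <- c("abcdefghij", "abcdefg", "abc")

m <- 3  # characters to drop from the start
n <- 4  # characters to drop from the end

# str_sub() gives "" when no characters remain between the two positions
res <- str_sub(textos, m + 1, -n - 1)
res

# Flag the elements that have nothing left after trimming
nothing_left <- str_length(textos) <= m + n
nothing_left
```

Whether an empty string is acceptable for short inputs, or whether they should become `NA` instead, depends on what the asker intends to do with the result.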

2 answers

5

A made-up example, which you can adapt to your case:

x<-c('Bem-vinda ao Stack Overflow em Português')

library(stringr)

str_sub(x, 2, -10) # start and end are inclusive, so each index is one more than the number of characters dropped
#[1] "em-vinda ao Stack Overflow em "

Here 2 and -10 remove the first 1 and the last 9 characters: each index is the number of initial or final characters you want to remove, plus one.

5


Building on @Giovani's answer, I wrote a small function that bridges the gap between what str_sub does and what the question asks for.

On the page help("str_sub"), section Details:

Details

Substrings are inclusive - they include the characters at both start and end positions. str_sub(string, 1, -1) will return the complete substring, from the first character to the last.

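Now, the question asks (paraphrased by me):

remove the first m characters and the last n characters

It is therefore necessary to start at position m + 1 and to end at position -n - 1. In the function below, ultimos is passed as a negative number, so the end position is ultimos - 1.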
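The index arithmetic can be checked against an explicit positive end position. This is a sketch, assuming stringr is installed; the string `s` and the values of `m` and `n` are illustrative only.

```r
library(stringr)

s <- "abcdefghijklmnopqrstuvwxyz"  # 26 characters
m <- 3  # drop the first m characters
n <- 4  # drop the last n characters

# Negative end: -1 is the last character, so -(n + 1) is the last one kept
by_negative <- str_sub(s, m + 1, -(n + 1))

# Same range written with an explicit positive end position
by_length <- str_sub(s, m + 1, str_length(s) - n)

identical(by_negative, by_length)
str_length(s) - str_length(by_negative)  # number of characters removed, m + n
```

The negative form has the advantage of not needing the length of each string.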

library(stringr)

str_sub_als <- function(s, primeiros = 70, ultimos = -200){
    str_sub(s, primeiros + 1, ultimos - 1)
}

x <- c("1234567890", "abcdefghijklmnopqrstuvwxyz")

str_sub(x, 3, -4)
#[1] "34567"                 "cdefghijklmnopqrstuvw"

str_sub_als(x, 3, -4)
#[1] "456"                 "defghijklmnopqrstuv"
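For the case in the question, str_sub() is already vectorised over its input, so a map() call should not be needed. The sketch below uses made-up data standing in for texto2009, which is not shown in the question, and assumes it is a character vector; `texto_exemplo` and `res` are hypothetical names.

```r
library(stringr)

# Made-up stand-in for texto2009:
# 70 filler characters, a payload, then 200 filler characters
texto_exemplo <- c(
  paste0(strrep("a", 70), "first payload", strrep("z", 200)),
  paste0(strrep("b", 70), "second payload", strrep("y", 200))
)

# Keep everything from position 71 up to the 201st character from the end
res <- str_sub(texto_exemplo, 71, -201)
res
#[1] "first payload"  "second payload"
```

If texto2009 is a list rather than a character vector, it would need to be converted or mapped over first; that detail is not stated in the question.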
