How to select strings that start with a given word

Question

Asked 6 years, 4 months ago

I'm working with a data frame in R 3.5.2 and would like to create a new object, ES_1_4, containing only the rows whose value in the Pathways column starts with the string REACTOME_. That prefix is only the beginning of each value; various other words follow it. How do I indicate that the string does not end at the underscore of REACTOME_ and that whatever comes after it can be anything?

This is what I tried:

ES_1_4 = ES_1_3[ES_1_3$Pathways == "REACTOME_", ]

2 answers

Use the grep function. It lets you filter like this based on only a fragment of the string:

ES_1_4 = ES_1_3[grep("REACTOME_", ES_1_3$Pathways), ]

With the command above, the new object ES_1_4 will contain every row of ES_1_3 that has the string REACTOME_ anywhere in the Pathways column.
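As a minimal sketch of what this call does, assuming a small made-up data frame (the values in Pathways and the score column are hypothetical, not the asker's data): grep() returns the positions of the matching elements, while grepl() returns one TRUE/FALSE per element, and either can serve as the row index.

```r
# Hypothetical toy data; only the column name Pathways comes from the question.
ES_1_3 <- data.frame(
  Pathways = c("REACTOME_APOPTOSIS", "KEGG_CELL_CYCLE", "REACTOME_SIGNALING"),
  score = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)

# grep() gives the integer positions of the elements that contain the pattern.
idx <- grep("REACTOME_", ES_1_3$Pathways)

# grepl() gives a logical vector with one entry per element instead.
mask <- grepl("REACTOME_", ES_1_3$Pathways)

# Both select the same rows.
by_idx <- ES_1_3[idx, ]
by_mask <- ES_1_3[mask, ]
```

The logical form tends to be handier when the match has to be combined with other conditions using &.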

by Tomás Barcellos • **5,562** points · Answer 1 · 2019-03-06T01:25:32+00:00

To make a regular expression match only at the beginning of the string, put ^ at the start of the pattern. So you can do:

library(tidyverse)

ES_1_3 <- tibble(  # tibble() replaces the deprecated data_frame()
  Pathways = c("REACTOME_final", "inicio_REACTOME_"),
  outra_coluna = 1:2
)

ES_1_3[grep("^REACTOME_", ES_1_3$Pathways), ]
#> # A tibble: 1 x 2
#>   Pathways       outra_coluna
#>   <chr>                 <int>
#> 1 REACTOME_final            1

ES_1_3[str_starts(ES_1_3$Pathways, "REACTOME_"), ]
#> # A tibble: 1 x 2
#>   Pathways       outra_coluna
#>   <chr>                 <int>
#> 1 REACTOME_final            1

Created on 2019-03-05 by the reprex package (v0.2.1)

As of stringr 1.4.0 (the latest version at the time of writing), you can use the str_starts() function, which appears in the second solution. With it, there is no need to remember the regex symbol that marks the beginning of the string.
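For comparison, base R (3.3.0 and later) also provides startsWith(), which tests for a literal prefix without any regex. The sketch below reuses the two example values from this answer in a plain data frame; it is an alternative under that assumption, not one of the original solutions.

```r
# Same example values as above, in a base R data frame.
ES_1_3 <- data.frame(
  Pathways = c("REACTOME_final", "inicio_REACTOME_"),
  outra_coluna = 1:2,
  stringsAsFactors = FALSE
)

# startsWith() compares the prefix literally, so no ^ anchor is needed.
ES_1_4 <- ES_1_3[startsWith(ES_1_3$Pathways, "REACTOME_"), ]
```

Because the prefix is taken literally, characters that are special in a regex (such as . or [) need no escaping here.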

Note that both solutions above return what the question asked for (only the rows whose value starts with the word). This differs from @Marcusnunes's answer, which finds the word at any position in the string:

ES_1_3[grep("REACTOME_", ES_1_3$Pathways), ]
#> # A tibble: 2 x 2
#>   Pathways         outra_coluna
#>   <chr>                   <int>
#> 1 REACTOME_final              1
#> 2 inicio_REACTOME_            2