How to select rows that contain a text pattern in any column of a data frame

3

I want to select only the rows that contain the text "Try" in any column, ignoring case, similar to grep in Linux. Here is an example:

my.data <- data.frame(
  A = c("prot trypsina catalic", "7", "123", NA, "1419", "ab", "ab", "ab"),
  B = c("1416", "7", "123trypsina", "1011", "1416", "ab", "TRYPSIN", "ab"),
  c = c("b", "a", "trypsina123", "trypsin", "no", "ab", "ab", "ab"),
  d = 1:8,
  e = rep("please", 8))

The desired result would be as follows:

my.data[c(1,3,4,7),]
                      A           B           c d      e
1 prot trypsina catalic        1416           b 1 please
3                   123 123trypsina trypsina123 3 please
4                  <NA>        1011     trypsin 4 please
7                    ab     TRYPSIN          ab 7 please

2 answers

3


A solution, in a line:

my.data[apply(my.data, 1, function(x) any(grepl("tryp", x, ignore.case = TRUE))), ]
#>                       A           B           c d      e
#> 1 prot trypsina catalic        1416           b 1 please
#> 3                   123 123trypsina trypsina123 3 please
#> 4                  <NA>        1011     trypsin 4 please
#> 7                    ab     TRYPSIN          ab 7 please

Explanation:

The function grepl returns a logical vector indicating where the pattern is present; the option ignore.case = TRUE makes the search case-insensitive. For example, for column B:

grepl("tryp", my.data$B, ignore.case = TRUE)
#> [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
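Column A in the example contains an NA, so it may help to see how grepl treats missing values. This is a small sketch using the values of column A from the question; the names col_a and hits_a are only illustrative.

```r
# Column A from the question, including the missing value in position 4
col_a <- c("prot trypsina catalic", "7", "123", NA, "1419", "ab", "ab", "ab")

# grepl() reports FALSE (not NA) for missing values, so an NA cell
# never counts as a match and never leaks an NA into any()
hits_a <- grepl("tryp", col_a, ignore.case = TRUE)
hits_a
#> [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

This is why row 4, which has NA in column A, can still be selected through its match in column c.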

any is used to see if there is any true occurrence, regardless of position. For example:

any(c(FALSE, FALSE, TRUE))
#> [1] TRUE

apply applies a function along one dimension of an array or, as here, of a data frame; the 1 means the function is applied to each row. Combined with grepl and any, the result is a logical vector indicating which rows contain at least one occurrence of the pattern:

apply(my.data, 1, function(x) any(grepl("tryp", x, ignore.case = TRUE)))
#> [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

This logical vector is then used to index the rows of the data frame, in the form df[logical_vector, ]
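The same steps can also be written out one at a time. The sketch below is an alternative to the apply version, not the answer's own code: it searches column by column with lapply and combines the results with Reduce, which avoids converting the whole data frame to a character matrix as apply does. The names hits and keep are illustrative.

```r
my.data <- data.frame(
  A = c("prot trypsina catalic", "7", "123", NA, "1419", "ab", "ab", "ab"),
  B = c("1416", "7", "123trypsina", "1011", "1416", "ab", "TRYPSIN", "ab"),
  c = c("b", "a", "trypsina123", "trypsin", "no", "ab", "ab", "ab"),
  d = 1:8,
  e = rep("please", 8))

# One logical vector per column: TRUE where the cell matches the pattern
hits <- lapply(my.data, function(col) grepl("tryp", col, ignore.case = TRUE))

# Combine the columns element-wise with OR:
# TRUE if at least one column matched in that row
keep <- Reduce(`|`, hits)

# Index the rows with the logical vector
my.data[keep, ]
```

With the example data, keep should be TRUE for rows 1, 3, 4 and 7, the same rows as in the desired result.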

3

A solution using tidyverse functions is the following:

library(tidyverse)

my.data %>%
    filter_all(any_vars(str_detect(., pattern = "(?i)tryp")))
#>                       A           B           c d      e
#> 1 prot trypsina catalic        1416           b 1 please
#> 2                   123 123trypsina trypsina123 3 please
#> 3                  <NA>        1011     trypsin 4 please
#> 4                    ab     TRYPSIN          ab 7 please

Created on 2021-06-12 by the reprex package (v2.0.0)

What the above command does is this:

  • filter_all applies the filter condition to all columns of the data frame
  • any_vars keeps a row if any column satisfies the logical condition
  • the logical condition is given by str_detect, which matches any cell containing the string tryp; the (?i) prefix makes the match case-insensitive
  • Fantastic! Thank you very, very much! A big hug, Marcus Nunes

  • It’s great to know that my answer helped you in some way. Please consider voting for and accepting the answer, so that other people who run into the same problem in the future have a reference for solving it.

  • Hi Marcus Nunes, good morning. I swear I’m trying, but it seems I don’t have enough reputation yet, because I get the following message: "Thanks for the feedback! Votes from users with a reputation lower than 15 are recorded, but do not publicly change the score shown on the post." I’m really sorry; I hope to contribute on another occasion.

  • Don’t worry. The contribution has already been made. Over time, as your reputation grows, these points start to appear publicly. The important thing is to always contribute to the site, not necessarily just on the answers to your own questions, but whenever some content helps you in some way. Thanks for the help!

  • Thank you, a big hug!
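A note on the tidyverse answer: in dplyr 1.0.4 and later, filter_all and any_vars are superseded, and the same filter can be written with if_any. The following is a sketch of that variant, assuming recent versions of dplyr and stringr are installed; the name res and the explicit as.character conversion are additions for illustration, not part of the original answer.

```r
library(dplyr)
library(stringr)

my.data <- data.frame(
  A = c("prot trypsina catalic", "7", "123", NA, "1419", "ab", "ab", "ab"),
  B = c("1416", "7", "123trypsina", "1011", "1416", "ab", "TRYPSIN", "ab"),
  c = c("b", "a", "trypsina123", "trypsin", "no", "ab", "ab", "ab"),
  d = 1:8,
  e = rep("please", 8))

# Keep a row if any column contains "tryp", ignoring case.
# as.character() makes the coercion of the integer column d explicit.
res <- my.data %>%
  filter(if_any(everything(),
                ~ str_detect(as.character(.x), regex("tryp", ignore_case = TRUE))))
res
```

Here regex("tryp", ignore_case = TRUE) plays the same role as the (?i) prefix in the original pattern.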
