filter in dplyr using a categorical variable

Question

Asked 8 years, 2 months ago

Viewed 748 times

Suppose I have the following data set:

set.seed(12)
dados <- data.frame(grupos=rep(letters[1:5], 5), valores=rnorm(25))
head(dados)
  grupos    valores
1      a -1.8323176
2      b -0.0560389
3      c  0.6692396
4      d  0.8067977
5      e  0.2374370
6      a  0.7894452

How can I keep only the rows whose group is equal to a or b? I know how to filter the rows for a single level:

library(dplyr)

dados %>%
  filter(grupos=="a")
  grupos    valores
1      a -1.8323176
2      a  0.7894452
3      a -0.9993864
4      a  0.3844801
5      a -1.3305330

dados %>%
  filter(grupos=="b")
  grupos     valores
1      b -0.05603890
2      b  0.37511302
3      b -0.03578014
4      b  0.65215941
5      b  1.64394981

I could run each of these filters separately and combine the results afterwards. However, my real problem is more complicated: it is a data frame with 26,691 rows, from which I must keep 1,116 different values. Filtering each of these values individually and then combining them at the end is impractical.

Have you tried dados %>% filter(grupos=="a"|grupos=="b")?

– José

2017/06/09 at 02:38
Yes, I tried that. The problem is that there are 1,116 different levels of interest in my original dataset. To use this solution, I would have to write something like dados %>% filter(grupos=="a1"|grupos=="a2"|...|grupos=="a1116"), which I find impractical.

– Marcus Nunes

2017/06/09 at 09:59
An alternative is to use a regular expression, as long as you can identify a common pattern in the groups you want to keep. Example: dados <- dados[stringr::str_which(dados$grupos, "(a|b)"), ]. Or build the pattern from a vector: pattern <- stringr::str_replace_all(toString(letters[c(1, 2)]), ",\\s", "|"); dados <- dados[stringr::str_which(dados$grupos, pattern), ]

– José

2017/06/09 at 11:25

1 answer


by Carlos Cinelli • **16,826** points · Answer 1 · 2017-06-10T07:42:25+00:00

In this case you can use the %in% operator, which tests each value of grupos for membership in a vector:

dados %>%
  filter(grupos %in% c("a", "b"))
   grupos valores
1       a -1.4806
2       b  1.5772
3       a -0.2723
4       b -0.3153
5       a -0.7777
6       b -1.2939
7       a -0.7035
8       b  1.1889
9       a  0.2236
10      b  2.0072
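
For the larger case described in the question (1,116 values of interest), the same idea can be sketched by storing the wanted levels in a vector and passing that vector to %in%. This is only a minimal sketch: the name niveis_desejados is hypothetical, and in the real data it would hold the actual levels, however they are obtained.

```r
library(dplyr)

# Same example data as in the question
set.seed(12)
dados <- data.frame(grupos = rep(letters[1:5], 5), valores = rnorm(25))

# Hypothetical vector of levels to keep; in the real problem it would
# contain the 1,116 values of interest instead of just two
niveis_desejados <- c("a", "b")

# Keep every row whose group appears in the vector
filtrado <- dados %>%
  filter(grupos %in% niveis_desejados)

nrow(filtrado)  # 10 rows: 5 for "a" and 5 for "b"
```

Because the condition is a single membership test, the code stays the same length no matter how many levels the vector holds, and it works whether grupos is stored as a character vector or as a factor.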