Select first lines depending on group efficiently

Question

Asked 11 years, 2 months ago

Suppose I have the following data frame:

set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100))

I want to select the rows with the smallest scores in each group, where the number of rows to keep per group is given by the table below:

qtds <- data.frame(grupo = levels(base$grupo), qtd = c(1, 2, 3, 4))
qtds

  grupo qtd
1     a   1
2     b   2
3     c   3
4     d   4

That is, I want to select the row with the smallest score in group a, the two rows with the smallest scores in group b, and so on.

At the moment, I'm doing it like this:

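library(dplyr)

# For each group, keep the qtd rows with the smallest score
# and append them to the result
novaBase <- data.frame()
for(i in levels(base$grupo)){
  novaBase <- rbind(novaBase,
                    base %>% 
                      filter(grupo == i) %>% 
                      filter(row_number(score) <= qtds$qtd[qtds$grupo == i])
                    )
}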

   grupo        score
1      a 0.0003950703
2      b 0.0003950703
3      b 0.0039051792
4      c 0.0003950703
5      c 0.0221628349
6      c 0.0039051792
7      d 0.0269371939
8      d 0.0003950703
9      d 0.0221628349
10     d 0.0039051792

This works, but it seems very inefficient to me, and the code is hard to follow. Does anyone know a better way?

1 answer

by Carlos Cinelli • **16,826** points · Answer 1 · 2015-01-28T13:34:47+00:00

One way to do it with dplyr would be:

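# Attach each group's quota (qtd) to its rows; merge() joins on the shared column grupo
base2 <- merge(base, qtds)

# Sort by score, then keep the first qtd rows within each group
base2 %>% group_by(grupo) %>% arrange(score) %>% slice(1:unique(qtd))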
Source: local data frame [10 x 3]
Groups: grupo

   grupo      score qtd
1      a 0.03014575   1
2      b 0.03014575   2
3      b 0.03780258   2
4      c 0.03014575   3
5      c 0.03780258   3
6      c 0.05638315   3
7      d 0.03014575   4
8      d 0.03780258   4
9      d 0.05638315   4
10     d 0.09151028   4
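
A variant of the same idea, as a rough sketch rather than a definitive version: join the quota table onto the data and filter on the within-group rank, reusing the `row_number(score)` test from the question. This avoids relying on `unique(qtd)` returning a single value per group. The name `resultado` is made up here for illustration, and the example data are rebuilt so the snippet runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100))
qtds <- data.frame(grupo = levels(base$grupo), qtd = c(1, 2, 3, 4))

# 'resultado' is a hypothetical name, not from the original post
resultado <- base %>%
  inner_join(qtds, by = "grupo") %>%   # attach each group's quota
  group_by(grupo) %>%
  filter(row_number(score) <= qtd) %>% # keep rows ranked within the quota
  ungroup() %>%
  arrange(grupo, score)

# Number of rows kept per group
table(resultado$grupo)
```

Since the filter compares each row's rank with its own `qtd`, the quota column stays in the result; drop it with `select(-qtd)` if it is not needed.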