In R, what is the best way to select sets of internal lists within a list of lists?

Question

Asked 8 years, 7 months ago

6

I have a list of lists like the one below:

lista <- list(num = list(1:10, 11:20, 21:30),
              chr = list(letters[1:13], letters[14:26], LETTERS[1:13]))

I'd like to turn it into a data.frame, but for that the vectors inside the two internal lists would have to be the same length. To get there, I would like to keep only the first 10 elements of each vector (losing the observations that get cut off is not a problem).

I managed to do this with a not very elegant function (it uses a loop, posted below), and I wondered whether there are more efficient ways to do it.

As we have little documentation on R in Portuguese, I thought it reasonable to ask: in R, what is the best way to select sets of internal lists within a list of lists?

1

I do not understand why you would answer your own question. Wouldn't it be more useful to put the code from this link in your original post? Imagine someone answers your question and that answer gets votes. It will rank ahead of your answer, stripping your question of its context. That is not helpful for those who come to search SO in the future.

– Marcus Nunes

2016/11/17 at 00:25
1

This is an option offered by Stack Overflow. I understand your concern, but my answer is one possible answer to the question. If someone else has another solution, they can post it below.

– Tomás Barcellos

2016/11/17 at 00:40

2 answers

7

I find the following a more concise way to do what you need:

library(purrr) # for the map function
library(tidyr) # for the unnest function
library(dplyr) # for the as_data_frame function
map(lista, ~map(.x, ~.x[1:10])) %>%
  as_data_frame() %>%
  unnest()

The result is this:

# A tibble: 30 × 2
     num   chr
   <int> <chr>
1      1     a
2      2     b
3      3     c
4      4     d
5      5     e
6      6     f
7      7     g
8      8     h
9      9     i
10    10     j
# ... with 20 more rows

Another way, which also looks nice, is:

lista %>%
  as_data_frame() %>%
  mutate(chr = map(chr, ~.x[1:10])) %>%
  unnest()

List columns, that is, data.frame columns that are themselves lists, are being widely used and were popularized by Hadley Wickham. See the discussion of list-columns in R for Data Science.

In the list-column example above I only modified the `chr` column, but you could modify all the columns using:

lista %>%
  as_data_frame() %>%
  mutate_all(funs(map(., ~.x[1:10]))) %>%
  unnest()
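
For readers on a more recent tidyverse, the same idea can be sketched without `as_data_frame()`, `funs()` and `mutate_all()`, which, as far as I know, have since been deprecated or superseded, and with an explicit `cols` argument, which newer versions of `unnest()` ask for. This is only a sketch under those assumptions; the name `resultado` is made up for illustration, and explicit `function()` definitions are used instead of nested `~` lambdas just to keep the two levels easy to tell apart.

```r
library(purrr)   # map()
library(tidyr)   # unnest()
library(dplyr)   # mutate(), across()
library(tibble)  # as_tibble()

lista <- list(num = list(1:10, 11:20, 21:30),
              chr = list(letters[1:13], letters[14:26], LETTERS[1:13]))

resultado <- lista %>%
  as_tibble() %>%                                  # 3 rows, two list-columns
  mutate(across(everything(),                      # for every list-column...
                function(col) map(col, function(v) v[1:10]))) %>%  # ...keep 10 elements per vector
  unnest(cols = everything())                      # expand to one row per element

resultado
```

After the truncation every row holds vectors of the same length in both columns, which is what lets `unnest()` expand them side by side into 30 rows.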

Complementing Tomás's benchmark

> lista <- list(
+   num = lapply(1:10, function(x) sample(1:100, 20)),
+   chr = lapply(1:10, function(x) sample(letters, 20))
+ )
> microbenchmark(
+   solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+   solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: microseconds
           expr      min       lq      mean   median       uq      max neval
  solucao_tomas  419.026  439.375  466.7568  454.947  476.889  695.780   100
 solucao_daniel 2456.108 2559.625 2745.8009 2680.130 2836.733 4466.647   100
> lista <- list(
+   num = lapply(1:1000, function(x) sample(1:100, 20)),
+   chr = lapply(1:1000, function(x) sample(letters, 20))
+ )
> microbenchmark(
+   solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+   solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: milliseconds
           expr       min       lq     mean   median       uq      max neval
  solucao_tomas 13.559905 14.15854 14.64829 14.56517 14.83060 16.89264   100
 solucao_daniel  9.871144 10.27053 11.07952 10.80652 11.29402 19.82793   100
> lista <- list(
+   num = lapply(1:10000, function(x) sample(1:100, 20)),
+   chr = lapply(1:10000, function(x) sample(letters, 20))
+ )
> microbenchmark(
+   solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+   solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: milliseconds
           expr       min        lq     mean    median       uq      max neval
  solucao_tomas 156.63202 171.06855 195.3683 180.86325 227.1462 271.7314   100
 solucao_daniel  80.93934  91.22597 100.5079  96.73947 104.7544 154.6254   100

That is, when the list is small, Tomás' solution using a for loop is more efficient, but the difference is on the order of microseconds (efficiency is not very important when the objects are small). As the objects grow, the solution using purrr, dplyr and tidyr becomes more efficient. With lists of 10,000 elements it is roughly 2x faster. In other words, this solution is efficient when it matters, that is, when the objects are large.
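
To make the comparison concrete, the ratios can be computed directly from the median column of the three benchmark runs above. The numbers below are copied from that output; the data frame and column names (`mediana`, `n_vetores`, `razao`) are made up for this illustration.

```r
# Median timings copied from the three microbenchmark runs above.
# The first run is reported in microseconds and the other two in
# milliseconds; each ratio compares two values from the same run,
# so the units cancel out.
mediana <- data.frame(
  n_vetores = c(10, 1000, 10000),             # inner vectors per component
  tomas     = c(454.947, 14.56517, 180.86325),
  daniel    = c(2680.130, 10.80652, 96.73947)
)

# razao > 1 means the purrr/dplyr/tidyr solution was faster
mediana$razao <- mediana$tomas / mediana$daniel
print(mediana)
```

The ratio comes out at roughly 0.17 for 10 vectors, 1.35 for 1,000 and 1.87 for 10,000, which is where the "about 2x" figure comes from.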


by Tomás Barcellos • **5,562** points · Answer 1 · 2016-11-16T23:57:55+00:00

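# Keep only the positions given in `vetor` from each element of the list `x`
pegar_elem <- function(x, vetor){
  xx <- x
  for (i in seq_along(xx)) {
    xx[[i]] <- xx[[i]][vetor]  # subset the i-th inner vector
  }
  return(xx)
}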

lista2 <- lapply(lista, pegar_elem, 1:10)
as.data.frame(sapply(lista2, unlist))
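
One detail worth checking in the last line: `sapply()` simplifies its result to a single matrix, and a matrix holds only one type, so the integers in `num` end up as character. A possible variant, sketched here as an assumption about what is wanted (a data.frame whose columns keep their original types), builds the data.frame from a list instead. It also writes the selection with nested `lapply()` rather than a loop; the names `interna`, `via_sapply` and `df` are made up for the example.

```r
lista <- list(num = list(1:10, 11:20, 21:30),
              chr = list(letters[1:13], letters[14:26], LETTERS[1:13]))

# Same selection as pegar_elem(), written with nested lapply() instead of a loop
lista2 <- lapply(lista, function(interna) lapply(interna, `[`, 1:10))

# sapply() returns one matrix here; mixing integer and character
# values in a matrix coerces everything to character
via_sapply <- sapply(lista2, unlist)
is.character(via_sapply)

# lapply() keeps one vector per column, so each column keeps its own type
df <- as.data.frame(lapply(lista2, unlist), stringsAsFactors = FALSE)
str(df)
```

With this variant `num` stays integer and `chr` stays character, which avoids a conversion step later if the numbers are needed for computation.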

EDITED

Although Daniel Falbel's answers are more elegant, for the record:

microbenchmark({as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
               {map(lista, ~map(.x, ~.x[1:10])) %>% as_data_frame() %>% unnest()},
               {lista %>% as_data_frame() %>% mutate(chr = map(chr, ~.x[1:10])) %>% unnest()})

      min       lq      mean    median        uq      max neval
  353.818  367.506  395.5651  395.2225  413.3585  561.525   100
 3735.283 3774.977 3929.1722 3811.0775 3879.1725 6091.565   100
 4090.128 4157.026 4313.8627 4179.2685 4267.0385 6863.874   100