What are list-columns of a data.frame?

The use of list-columns in a data.frame is being encouraged. But, after all,

  • what are list-columns?

  • when are they commonly used?

  • can they be created with base R, or only as tibbles?

For example,

data.frame(idade = 1:5, nome = letters[1:5], lista = lapply(1:5, rnorm))

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 2, 3, 4, 5

tibble::tibble(idade = 1:5, nome = letters[1:5], lista = lapply(1:5, rnorm))
# A tibble: 5 x 3
  idade nome  lista    
  <int> <chr> <list>   
1     1 a     <dbl [1]>
2     2 b     <dbl [2]>
3     3 c     <dbl [3]>
4     4 d     <dbl [4]>
5     5 e     <dbl [5]>

1 answer

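List-columns (or list columns) are columns of a data frame that are themselves lists, so each row can hold a vector of any length or even an arbitrary object. They can be useful in many situations when working with the tidyverse, and they are mainly used as intermediate structures.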

They can also be used in base R, but you have to wrap the list in the function I() so that data.frame() does not throw an error. Example:

data.frame(idade = 1:5, nome = letters[1:5], lista = I(lapply(1:5, rnorm)))

  idade nome        lista
1     1    a 0.178046....
2     2    b 0.407768....
3     3    c -0.84749....
4     4    d -0.44864....
5     5    e 1.229863....
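
To make the structure more concrete, here is a minimal base R sketch of two ways to obtain a list-column and of how to inspect it afterwards. The object names df_i and df_assign and the seed are only illustrative assumptions, not part of the original example.

```r
set.seed(42)  # arbitrary seed, only so the random draws are reproducible

# Option 1: protect the list with I() when building the data.frame
df_i <- data.frame(idade = 1:5, nome = letters[1:5], lista = I(lapply(1:5, rnorm)))

# Option 2: build the data.frame first, then assign the list as a new column
df_assign <- data.frame(idade = 1:5, nome = letters[1:5])
df_assign$lista <- lapply(1:5, rnorm)

is.list(df_i$lista)            # TRUE: the column itself is a list
lengths(df_assign$lista)       # 1 2 3 4 5: each cell holds a vector of a different length
df_assign$lista[[3]]           # the numeric vector stored in row 3
sapply(df_assign$lista, mean)  # one summary value per row
```

Note that each cell is accessed with [[ ]], exactly as for any other list.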

A good illustration of list-columns is calling, inside mutate(), a vectorized function that returns more than one value per element. For example:

library(tidyverse)

df <- tribble(
  ~x1,
  "a,b,c", 
  "d,e,f,g"
) 

df %>% 
  mutate(x2 = stringr::str_split(x1, ","))
#> # A tibble: 2 x 2
#>   x1      x2       
#>   <chr>   <list>   
#> 1 a,b,c   <chr [3]>
#> 2 d,e,f,g <chr [4]>

Next, it is common to flatten the data.frame again with the unnest() function from tidyr:

df %>% 
  mutate(x2 = stringr::str_split(x1, ",")) %>% 
  unnest(x2)
#> # A tibble: 7 x 2
#>   x1      x2   
#>   <chr>   <chr>
#> 1 a,b,c   a    
#> 2 a,b,c   b    
#> 3 a,b,c   c    
#> 4 d,e,f,g d    
#> 5 d,e,f,g e    
#> 6 d,e,f,g f    
#> # ... with 1 more row
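
Unnesting is not the only option. As a rough sketch, the list-column can also be summarised row by row while it is still nested, for example with lengths() or one of the purrr map functions. The column names n and last_piece and the object name res are hypothetical, chosen only for this illustration.

```r
library(tidyverse)

df <- tribble(
  ~x1,
  "a,b,c",
  "d,e,f,g"
)

res <- df %>%
  mutate(
    x2 = str_split(x1, ","),                    # list-column: one character vector per row
    n = lengths(x2),                            # number of pieces in each row
    last_piece = map_chr(x2, ~ .x[length(.x)])  # last piece of each row
  )

res$n           # 3 4
res$last_piece  # "c" "g"
```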

There are many other interesting use cases. Another example I like is the one produced by the rsample package:

library(tidyverse)
library(rsample)

vfold_cv(mtcars, v = 5) %>% 
  mutate(
    modelos = map(splits, ~lm(mpg ~ ., data = analysis(.x))),
    mse = map2_dbl(modelos, splits, ~mean((assessment(.y)$mpg - predict(.x, assessment(.y)))^2))
    )

#  5-fold cross-validation 
# A tibble: 5 x 4
  splits         id    modelos    mse
* <list>         <chr> <list>   <dbl>
1 <split [25/7]> Fold1 <S3: lm> 40.4 
2 <split [25/7]> Fold2 <S3: lm>  5.99
3 <split [26/6]> Fold3 <S3: lm>  9.11
4 <split [26/6]> Fold4 <S3: lm> 11.6 
5 <split [26/6]> Fold5 <S3: lm> 21.3 

In the example above we fit a model for each cross-validation fold and then compute the mean squared error (MSE) on the observations held out in that fold.
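
Because the fitted models stay in the data frame as a list-column, they can be reused later. The following is a sketch under some assumptions: the object name resultados, the extra column r2 and the seed are illustrative and not part of the original answer.

```r
library(tidyverse)
library(rsample)

set.seed(1)  # arbitrary seed, only to make the folds reproducible

resultados <- vfold_cv(mtcars, v = 5) %>%
  mutate(
    modelos = map(splits, ~ lm(mpg ~ ., data = analysis(.x))),
    mse = map2_dbl(modelos, splits, ~ mean((assessment(.y)$mpg - predict(.x, assessment(.y)))^2)),
    r2 = map_dbl(modelos, ~ summary(.x)$r.squared)  # in-sample R squared of each fitted model
  )

mean(resultados$mse)     # cross-validated estimate of the MSE
resultados$modelos[[1]]  # the lm object fitted on the first fold
```

Each element of modelos is an ordinary lm object, so any function that accepts a model (summary(), coef(), predict()) can be mapped over the column.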
