How to iterate over the rows of a data.frame using `dplyr`?

5

I am trying to analyze the rows (cases) of a data.frame with dplyr, but without success. I created two functions for this:

f1 <- function(x) {
  c(s = sum(x), 
    m = mean(x), 
    v = var(x))
}

f2 <- function(x) {
  apply(x, 1, f1)
}

My data.frame (data_1):

for (i in 1:6) {
  assign(paste('var', i, sep = '_'), 
         runif(30, 20, 100))
}

data_1 <- do.call(
  cbind.data.frame, 
  mget(ls(pattern = '^var_'))
)

My attempt with the dplyr functions:

library(dplyr)

data_1 %>%
  mutate_at(.vars = vars (starts_with('v')),
            .funs = funs(.= f2))

data_1 %>%
  mutate_if(is.numeric, .funs = funs(.= f2))

Error in mutate_impl(.data, dots) : Evaluation error: dim(X) must have a positive length.

Since the analysis is done by row and I have three functions (sum, mean and variance), the expected result is three new columns.

3 answers

3


Here are two ways to do what you ask, one with base R and the other with the dplyr package.

First I am going to rebuild the data, using set.seed to make the results reproducible, and in an easier and more natural way than with calls to assign.

set.seed(1234)    # Makes the results reproducible

data_1 <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_1) <- paste0("var", 1:6)

See? Much easier.

Base R solution.

cbind(data_1, t(f2(data_1)))

dplyr solution.

library(dplyr)

data_1 %>%
  bind_cols(data_1 %>% f2() %>% t() %>% as.data.frame())

This statement calls bind_cols with what comes from the pipe %>% as its first argument and the result of applying f2() to the data as its second. Before that, though, the matrix returned by f2() has to be transposed and converted to a data.frame.

It may be simpler to have f2 return its output already in the format that bind_cols requires.
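To see why the transpose is needed, here is a small check of the shape of what f2() returns. It is only a sketch that reuses f1, f2 and the data construction from this thread; the names raw and fixed are illustrative and not part of the original answer.

```r
# f1 and f2 as defined in the question
f1 <- function(x) c(s = sum(x), m = mean(x), v = var(x))
f2 <- function(x) apply(x, 1, f1)

set.seed(1234)
data_1 <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_1) <- paste0("var", 1:6)

raw <- f2(data_1)
dim(raw)         # 3 x 30: one row per statistic, one column per original row
rownames(raw)    # "s" "m" "v"

# Transposing puts one row per original row, which is what bind_cols() needs
fixed <- as.data.frame(t(raw))
dim(fixed)       # 30 x 3
```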

f2b <- function(x) {
  apply(x, 1, f1) %>% t() %>% as.data.frame()
}

data_1 %>%
  bind_cols(data_1 %>% f2b())

3

The error

The error message indicates that apply(), called from f2(), is being run on an object that does not have two dimensions. This happens because mutate tries to apply the function to each column separately, and a single column is a vector, which indeed does not have two dimensions.
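The same failure can be reproduced outside dplyr. The sketch below uses made-up values (x and m are illustrative names) to show that apply() rejects a plain vector and accepts a matrix.

```r
x <- c(10, 20, 30)   # a plain vector, like a single data.frame column
dim(x)               # NULL: a vector has no dimensions

# Capture the error message instead of stopping the script
msg <- tryCatch(
  apply(x, 1, sum),
  error = function(e) conditionMessage(e)
)
msg                  # "dim(X) must have a positive length"

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)     # works on a two-dimensional object: 9 12
```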

The solution

Performing row-wise operations is a non-trivial problem within the tidyverse. This is because this package/philosophy was designed to work with tables in long format and by groups.

The best evidence of this is that three major tidyverse developers have each made an effort to tackle the problem. Hadley Wickham created purrrlyr, Jenny Bryan has written about the topic, and Romain François himself, the current maintainer of dplyr, recently created a package for it.

The answer I offer, then, is to use purrr::transpose() to solve the problem.

purrr offers the function transpose(), which turns lista[[1]][[2]] into lista[[2]][[1]]. Using this function we can create a list-column with one element per row.
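To see what transpose() does on a small input before applying it to the data, here is a minimal sketch. The list lista and the name virada are illustrative only.

```r
library(purrr)

# Two "columns" of two values each
lista <- list(a = c(1, 2), b = c(3, 4))
lista[["b"]][[1]]          # 3

# After transposing, the two index levels have swapped places
virada <- transpose(lista)
virada[[1]][["b"]]         # 3

# Each element now holds one "row"; unlist() turns it into a named vector
map(virada, unlist)
```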

library(dplyr)
library(purrr)

tidy_data <- data_1 %>% 
  as_tibble() %>% 
  mutate(linhas = transpose(data_1) %>% map(unlist))

tidy_data
# A tibble: 30 x 7
    var1  var2  var3  var4  var5  var6 linhas   
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>   
 1  29.1  56.5  89.2  33.3  79.5  55.1 <dbl [6]>
 2  69.8  41.2  23.3  92.0  93.3  38.3 <dbl [6]>
 3  68.7  44.4  45.4  30.7  99.6  26.6 <dbl [6]>
 4  69.9  60.6  21.1  30.5  95.4  88.0 <dbl [6]>
 5  88.9  34.5  39.1  28.4  58.9  38.8 <dbl [6]>
 6  71.2  80.8  76.5  60.9  42.7  99.1 <dbl [6]>
 7  20.8  36.1  44.6  44.0  40.1  68.2 <dbl [6]>
 8  38.6  40.7  60.7  22.1  60.3  99.9 <dbl [6]>
 9  73.3  99.4  24.1  44.8  59.8  50.0 <dbl [6]>
10  61.1  84.6  65.2  79.4  45.5  64.4 <dbl [6]>
# ... with 20 more rows

Once this is done, just apply your function to each row with mutate() + map().

tidy_data %>% 
  mutate(estats = map(linhas, f1))

# A tibble: 30 x 8
    var1  var2  var3  var4  var5  var6 linhas    estats   
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>    <list>   
 1  29.1  56.5  89.2  33.3  79.5  55.1 <dbl [6]> <dbl [3]>
 2  69.8  41.2  23.3  92.0  93.3  38.3 <dbl [6]> <dbl [3]>
 3  68.7  44.4  45.4  30.7  99.6  26.6 <dbl [6]> <dbl [3]>
 4  69.9  60.6  21.1  30.5  95.4  88.0 <dbl [6]> <dbl [3]>
 5  88.9  34.5  39.1  28.4  58.9  38.8 <dbl [6]> <dbl [3]>
 6  71.2  80.8  76.5  60.9  42.7  99.1 <dbl [6]> <dbl [3]>
 7  20.8  36.1  44.6  44.0  40.1  68.2 <dbl [6]> <dbl [3]>
 8  38.6  40.7  60.7  22.1  60.3  99.9 <dbl [6]> <dbl [3]>
 9  73.3  99.4  24.1  44.8  59.8  50.0 <dbl [6]> <dbl [3]>
10  61.1  84.6  65.2  79.4  45.5  64.4 <dbl [6]> <dbl [3]>
# ... with 20 more rows

The solution above leaves the result in a list-column. If you are not comfortable with those, we can expand the mutate() call to get:

tidy_data %>% 
  mutate(s = map_dbl(linhas, sum),
         m = map_dbl(linhas, mean),
         # sd() is the standard deviation; use var here for the variance, as in f1
         v = map_dbl(linhas, sd))

# A tibble: 30 x 10
    var1  var2  var3  var4  var5  var6 linhas        s     m     v
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>    <dbl> <dbl> <dbl>
 1  29.1  56.5  89.2  33.3  79.5  55.1 <dbl [6]>  343.  57.1  24.0
 2  69.8  41.2  23.3  92.0  93.3  38.3 <dbl [6]>  358.  59.7  29.7
 3  68.7  44.4  45.4  30.7  99.6  26.6 <dbl [6]>  315.  52.6  27.4
 4  69.9  60.6  21.1  30.5  95.4  88.0 <dbl [6]>  365.  60.9  30.0
 5  88.9  34.5  39.1  28.4  58.9  38.8 <dbl [6]>  289.  48.1  22.4
 6  71.2  80.8  76.5  60.9  42.7  99.1 <dbl [6]>  431.  71.9  19.0
 7  20.8  36.1  44.6  44.0  40.1  68.2 <dbl [6]>  254.  42.3  15.4
 8  38.6  40.7  60.7  22.1  60.3  99.9 <dbl [6]>  322.  53.7  26.9
 9  73.3  99.4  24.1  44.8  59.8  50.0 <dbl [6]>  351.  58.6  25.8
10  61.1  84.6  65.2  79.4  45.5  64.4 <dbl [6]>  400.  66.7  13.9
# ... with 20 more rows

Another possible solution would be to reshape the table into long format, group the data by row and create a summary with the statistics.

These solutions are more robust than converting the table to a matrix, because a matrix holds a single type: if the table contains a character column, all the data are coerced to character.
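The long-format approach mentioned above could look roughly like the sketch below. It assumes tidyr 1.0.0 or later for pivot_longer(); the names row_id, variable, value, row_stats and result are illustrative and not from the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1234)
data_1 <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_1) <- paste0("var", 1:6)

row_stats <- data_1 %>%
  mutate(row_id = row_number()) %>%            # remember which row each value came from
  pivot_longer(cols = starts_with("var"),      # one line per (row, variable) pair
               names_to = "variable",
               values_to = "value") %>%
  group_by(row_id) %>%
  summarise(s = sum(value),
            m = mean(value),
            v = var(value))

# row_stats is ordered by row_id, so it lines up with the original rows
result <- bind_cols(data_1, select(row_stats, -row_id))
```

Because the statistics are computed on a single numeric column, no matrix conversion takes place at any point.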

3

An option using the data.table package:

library(data.table)

# convert the data frame to a data.table
data_2 = as.data.table(data_1)

# create the indices that indicate the calculation is done for each row
data_2[, i := .I]

# vector with the names of the columns to use in the calculation
colNam = paste0('var_', 1:6)

# compute sum, mean and variance for each ID i (that is, each row)
data_2[, ":=" (s = sum(.SD),
               m = rowMeans(.SD),
               v = apply(.SD, 1, var)),
       by = i, .SDcols = colNam]

# remove the ID column
data_2[, i := NULL]
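As a possible variation on the code above, the row index can be skipped altogether by using functions that already work row by row. This is only a sketch, not the answerer's method; it rebuilds the data with the var_1 ... var_6 names from the question so that it runs on its own.

```r
library(data.table)

set.seed(1234)
data_2 <- as.data.table(replicate(6, runif(30, 20, 100)))
colNam <- paste0("var_", 1:6)
setnames(data_2, colNam)

# rowSums(), rowMeans() and apply(..., 1, var) each return one value per row,
# so no grouping by a row ID is needed
data_2[, `:=`(
  s = rowSums(.SD),
  m = rowMeans(.SD),
  v = apply(.SD, 1, var)
), .SDcols = colNam]
```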
