To evaluate code speed, it is very important to isolate each part of the problem. In your case, you are measuring the time of two operations:
- Creating a matrix of random values with 1000 rows and 200 columns
- Calculating the variance of each column
I would organize the problem as follows.
Create matrices in R
gerar_for <- function() {
  # pre-allocate the full 1000 x 200 matrix, then fill one column per iteration
  matriz <- matrix(rep(NA, 1000 * 200), nrow = 1000, ncol = 200)
  for (i in 1:200) {
    matriz[, i] <- rnorm(1000, i)
  }
  matriz
}
gerar_mapply <- function() {
  # mapply iterates over mean = 1:200 and binds the results into a matrix
  mapply(rnorm, n = 1000, mean = 1:200)
}
gerar_for_slow <- function() {
  # grows the matrix one column at a time, copying everything on each cbind
  matriz <- NULL
  for (i in 1:200) {
    matriz <- cbind(matriz, rnorm(1000, i))
  }
  matriz
}
microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  "for-slow" = gerar_for_slow()
)
Unit: milliseconds
     expr      min        lq     mean    median        uq      max neval cld
      for  15.6097  16.76431  20.0785  18.26528  20.10932 163.0261   100  a
   mapply  15.5994  17.43291  22.1635  18.68548  21.00221 153.6971   100  a
 for-slow 148.6910 169.03706 217.5798 178.62365 295.26370 373.7119   100   b
The microbenchmark function is very useful for comparing the speed of functions, because it runs each one many times (100 by default), ensuring that a time difference is not just the result of a momentary slowdown on your machine.
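The number of runs can be controlled with the times argument of microbenchmark (100 is its default), for example to get tighter estimates:

microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  times = 1000  # run each expression 1000 times instead of the default 100
)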
From the table above, we see that there is not much difference between the first two approaches, while the version that grows the matrix by dynamically allocating memory (for-slow) is much slower: each cbind copies the entire matrix built so far.
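For reference, the same matrix can also be generated with base R's sapply; this variant was not benchmarked above, but it should behave much like the mapply version:

gerar_sapply <- function() {
  # sapply simplifies the list of 1000-value columns into a 1000 x 200 matrix
  sapply(1:200, function(i) rnorm(1000, i))
}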
Calculate the variance
var_for <- function(matriz) {
  # pre-allocate the result vector, then fill one position per iteration
  variancias <- numeric(200)
  for (i in 1:200) {
    variancias[i] <- var(matriz[, i])
  }
  variancias
}
var_apply <- function(matriz) {
  # apply var over the columns (MARGIN = 2)
  apply(matriz, 2, var)
}
var_for_slow <- function(matriz) {
  # grows the result vector one element at a time
  variancias <- NULL
  for (i in 1:200) {
    variancias <- c(variancias, var(matriz[, i]))
  }
  variancias
}
matriz <- gerar_for()
microbenchmark::microbenchmark(
  "for" = var_for(matriz),
  "apply" = var_apply(matriz),
  "for-slow" = var_for_slow(matriz)
)
Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval cld
      for 5.187810 5.506842 6.672243 5.834702 7.041265  24.80995   100   a
    apply 6.053562 6.822156 9.412554 7.345083 8.566811 152.58045   100   a
 for-slow 5.304672 5.587136 6.798713 6.063436 7.600376  13.52065   100   a
In the table above we see that, in this case, there is not much difference between the three approaches: the cost is dominated by the 200 calls to var, and growing a vector of 200 numbers is cheap compared to repeatedly copying a 1000-row matrix.
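If this step ever became the bottleneck, a fully vectorized sketch is possible using only base R (colMeans and colSums run their loops in C; var_vetorizada is a name made up here):

var_vetorizada <- function(matriz) {
  centrada <- sweep(matriz, 2, colMeans(matriz))  # subtract each column's mean
  colSums(centrada^2) / (nrow(matriz) - 1)        # sample variance per column
}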
Comparison:
From what I understand, you're basically comparing the use of apply and for.
The advantage of for is the ease of writing code where one iteration depends on the result of the previous one; this is not so simple with apply. The disadvantage of for is that it is easy to write slow code, as in the gerar_for_slow function above. Another drawback is that you usually have to write more lines of code.
apply is more or less the opposite of for: it is hard to write code where one iteration depends on the previous one, but it is easier to write code that is not slow. A sketch of such a dependent loop, with its functional counterpart, follows below.
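As an illustration (a hypothetical sketch assuming a non-empty numeric input; caminho_for and caminho_funcional are names made up here), a running total where each step needs the previous value is natural with for, while the functional counterpart is Reduce rather than apply:

caminho_for <- function(passos) {
  x <- numeric(length(passos))
  x[1] <- passos[1]
  for (i in seq_along(passos)[-1]) {
    x[i] <- x[i - 1] + passos[i]  # each step uses the previous result
  }
  x
}

caminho_funcional <- function(passos) {
  # Reduce with accumulate = TRUE keeps every partial sum
  Reduce(`+`, passos, accumulate = TRUE)
}

Both reproduce base R's cumsum(passos), which is itself a vectorized C function.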
For me, the greatest advantage of using apply is that it gets you used to thinking of R as a functional language, which makes it much easier to learn and go deeper into the language.
About vectorization
apply should not be considered vectorization in R; it is simply an alternative way of writing a for loop.
For a loop to be considered vectorized, it has to be implemented in a lower-level language (C, Fortran, C++, etc.), which is what happens in many base R functions. For example:
soma_for <- function(vetor) {
  # sums element by element with an R-level loop
  soma <- 0
  for (i in seq_along(vetor)) {
    soma <- soma + vetor[i]
  }
  soma
}
soma_vetorizada <- function(vetor) {
  # sum is a primitive: its loop runs in C
  sum(vetor)
}
vetor <- rnorm(1000)
microbenchmark::microbenchmark(
  "for" = soma_for(vetor),
  "vetorizada" = soma_vetorizada(vetor)
)
Unit: microseconds
       expr    min     lq     mean  median     uq      max neval cld
        for 45.723 45.909 75.11931 46.0165 46.294 2773.788   100   b
 vetorizada  1.575  1.607 10.93954  1.6575  1.727  913.892   100  a
So we see the speed difference between the two implementations: comparing the medians, the vectorized sum is roughly 28 times faster here.
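You can check where sum's loop lives by printing the function: there is no R-level body, only a call into the C implementation:

sum
# function (..., na.rm = FALSE)  .Primitive("sum")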