To evaluate code speed, it is very important to isolate each part of the problem. In your case, you are measuring the time of two operations:
- Creating a matrix of random values with 1000 rows and 200 columns
- Calculating the variance of each column
I would organize the problem as follows.
Create matrices in R
gerar_for <- function() {
  # pre-allocate the full 1000 x 200 matrix, then fill one column per iteration
  matriz <- matrix(rep(NA, 1000 * 200), nrow = 1000, ncol = 200)
  for (i in 1:200) {
    matriz[, i] <- rnorm(1000, i)
  }
  matriz
}
gerar_mapply <- function() {
  # mapply iterates over mean = 1:200 and binds the results into a matrix
  mapply(rnorm, n = 1000, mean = 1:200)
}
gerar_for_slow <- function() {
  # grows the matrix one column at a time, copying everything on each cbind
  matriz <- NULL
  for (i in 1:200) {
    matriz <- cbind(matriz, rnorm(1000, i))
  }
  matriz
}
microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  "for-slow" = gerar_for_slow()
)
Unit: milliseconds
     expr      min        lq     mean    median        uq      max neval cld
      for  15.6097  16.76431  20.0785  18.26528  20.10932 163.0261   100  a
   mapply  15.5994  17.43291  22.1635  18.68548  21.00221 153.6971   100  a
 for-slow 148.6910 169.03706 217.5798 178.62365 295.26370 373.7119   100   b
The microbenchmark function is very useful for comparing the speed of functions, because it runs each one many times (100 by default), ensuring that a time difference is not just the result of a momentary slowdown on your machine.
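The number of runs can be controlled with the times argument of microbenchmark (100 is its default), for example to get tighter estimates:

microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  times = 1000  # run each expression 1000 times instead of the default 100
)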
From the table above, we see that there is not much difference between the first two approaches, while the version that grows the matrix by dynamically allocating memory (for-slow) is much slower: each cbind copies the entire matrix built so far.
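For reference, the same matrix can also be generated with base R's sapply; this variant was not benchmarked above, but it should behave much like the mapply version:

gerar_sapply <- function() {
  # sapply simplifies the list of 1000-value columns into a 1000 x 200 matrix
  sapply(1:200, function(i) rnorm(1000, i))
}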
Calculate the variance
var_for <- function(matriz) {
  # pre-allocate the result vector, then fill one position per iteration
  variancias <- numeric(200)
  for (i in 1:200) {
    variancias[i] <- var(matriz[, i])
  }
  variancias
}
var_apply <- function(matriz) {
  # apply var over the columns (MARGIN = 2)
  apply(matriz, 2, var)
}
var_for_slow <- function(matriz) {
  # grows the result vector one element at a time
  variancias <- NULL
  for (i in 1:200) {
    variancias <- c(variancias, var(matriz[, i]))
  }
  variancias
}
matriz <- gerar_for()
microbenchmark::microbenchmark(
  "for" = var_for(matriz),
  "apply" = var_apply(matriz),
  "for-slow" = var_for_slow(matriz)
)
Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval cld
      for 5.187810 5.506842 6.672243 5.834702 7.041265  24.80995   100   a
    apply 6.053562 6.822156 9.412554 7.345083 8.566811 152.58045   100   a
 for-slow 5.304672 5.587136 6.798713 6.063436 7.600376  13.52065   100   a
In the table above we see that, in this case, there is not much difference between the three approaches: the cost is dominated by the 200 calls to var, and growing a vector of 200 numbers is cheap compared to repeatedly copying a 1000-row matrix.
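If this step ever became the bottleneck, a fully vectorized sketch is possible using only base R (colMeans and colSums run their loops in C; var_vetorizada is a name made up here):

var_vetorizada <- function(matriz) {
  centrada <- sweep(matriz, 2, colMeans(matriz))  # subtract each column's mean
  colSums(centrada^2) / (nrow(matriz) - 1)        # sample variance per column
}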
Comparison:
From what I understand, you're basically comparing the use of apply and for.
The advantage of for is the ease of writing code where one iteration depends on the result of the previous one; this is not so simple with apply. The disadvantage of for is that it is easy to write slow code, as in the gerar_for_slow function above. Another drawback is that you usually have to write more lines of code.
apply is more or less the opposite of for: it is hard to write code where one iteration depends on the previous one, but it is easier to write code that is not slow. A sketch of such a dependent loop, with its functional counterpart, follows below.
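As an illustration (a hypothetical sketch assuming a non-empty numeric input; caminho_for and caminho_funcional are names made up here), a running total where each step needs the previous value is natural with for, while the functional counterpart is Reduce rather than apply:

caminho_for <- function(passos) {
  x <- numeric(length(passos))
  x[1] <- passos[1]
  for (i in seq_along(passos)[-1]) {
    x[i] <- x[i - 1] + passos[i]  # each step uses the previous result
  }
  x
}

caminho_funcional <- function(passos) {
  # Reduce with accumulate = TRUE keeps every partial sum
  Reduce(`+`, passos, accumulate = TRUE)
}

Both reproduce base R's cumsum(passos), which is itself a vectorized C function.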
For me, the greatest advantage of using apply is that it gets you used to thinking of R as a functional language, which makes it much easier to learn and go deeper into the language.
About vectorization
apply should not be considered vectorization in R; it is simply an alternative way of writing a for loop.
For a loop to be considered vectorized, it has to be implemented in a lower-level language (C, Fortran, C++, etc.), which is what happens in many base R functions. For example:
soma_for <- function(vetor) {
  # sums element by element with an R-level loop
  soma <- 0
  for (i in seq_along(vetor)) {
    soma <- soma + vetor[i]
  }
  soma
}
soma_vetorizada <- function(vetor) {
  # sum is a primitive: its loop runs in C
  sum(vetor)
}
vetor <- rnorm(1000)
microbenchmark::microbenchmark(
  "for" = soma_for(vetor),
  "vetorizada" = soma_vetorizada(vetor)
)
Unit: microseconds
       expr    min     lq     mean  median     uq      max neval cld
        for 45.723 45.909 75.11931 46.0165 46.294 2773.788   100   b
 vetorizada  1.575  1.607 10.93954  1.6575  1.727  913.892   100  a
So we see the speed difference between the two implementations: comparing the medians, the vectorized sum is roughly 28 times faster here.
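You can check where sum's loop lives by printing the function: there is no R-level body, only a call into the C implementation:

sum
# function (..., na.rm = FALSE)  .Primitive("sum")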