lapply vs for
lapply and for are both functions in R. Yes, for is a function too, and a primitive one:
`for`(i, 1:10, {print(i)})
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for receives a variable name, a sequence of values and an expression, and evaluates that expression once for each value in the sequence, assigning the value to the variable with the given name each time. The important word here is expression: for evaluates expressions.
lapply receives a vector (in the sense of a low-level R vector, that is, anything that can be created using vector) and a function. lapply then applies this function to each element of the vector passed as argument and returns the results in a list.
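As a small illustration of the point about vectors, the sketch below passes both an atomic vector and a list to lapply; the variable names are made up for this example. In both cases the result is a list with one element per input element.

```r
# An atomic (numeric) vector as input: the result is still a list
squares <- lapply(c(1, 2, 3), function(v) v^2)

# A list as input works the same way; nchar counts the characters of each string
word_lengths <- lapply(list("a", "bcd"), nchar)

is.list(squares)
is.list(word_lengths)
```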
Why do I need to call print inside the for?
The for does not return any result. It only evaluates the expression that was passed as argument. If this expression does not print anything to the console, then the for will not print anything there.
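Strictly speaking, a for call does have a value, but it is always NULL and it is returned invisibly, which is why nothing useful can be captured from it. A minimal sketch (the names res and i are only for illustration):

```r
# The values computed in the body are discarded; nothing is printed
res <- for (i in 1:3) i * 2

is.null(res)   # the loop itself evaluates to NULL

# The loop variable is an ordinary variable and keeps its last value
i
```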
lapply by itself does not print anything to the console either. The point is that lapply returns a list with the result of applying the function to each element of the vector. When such a value is returned at the top level of an interactive session without being assigned to a variable, R prints it to the console. So the following works:
lapply(1:3, function(x) x)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Why can I create objects with lapply and not with for?
As we saw above, lapply returns a list. Since the function returns a value, we can assign a name to the result. That is, we can create a variable that stores the results.
x <- lapply(1:3, function(x) x)
We also saw earlier that the for returns nothing. The for only evaluates the expression we passed as argument, once for each value of the vector, with the variable set to that value. Since the for does not return anything, it is useless to save its result in a variable. What we generally do is make the for evaluate expressions that save values into objects.
x <- vector(mode = "list", length = 3)
for (i in 1:3) {
  x[[i]] <- i
}
Why use for or lapply?
In data analysis, most of the time we want a loop in order to apply the same function to each element of a vector. In that case, instead of writing:
x <- vector(mode = "list", length = 3)
for (i in 1:3) {
  x[[i]] <- i
}
This version is not only longer, it also has a much greater chance of containing an error: you can forget the index, or rename the index at some point and forget to update it inside the expression, and so on.
We write:
x <- lapply(1:3, function(x) x)
Do you agree that there is less to get wrong there? In fact, it is more or less the same reason why we use for instead of a while to do the same thing:
x <- vector(mode = "list", length = 3)
i <- 1
while (i <= 3) {
  x[[i]] <- i
  i <- i + 1
}
x
There’s a lot more to get wrong, right?
Basically, we trade a little flexibility for fewer places to go wrong. The answer by @Jdemello has an interesting discussion about when to use for, while or lapply.
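One concrete index mistake of the kind mentioned above, sketched under the assumption of an empty input (the names input, runs_bad, runs_ok and out are hypothetical): the common 1:length() idiom runs the body even when there is nothing to iterate over, while lapply has no index to get wrong.

```r
input <- list()   # hypothetical empty input, an assumption for this sketch

# 1:length(input) is 1:0, that is c(1, 0), so the body runs twice
runs_bad <- 0
for (i in 1:length(input)) runs_bad <- runs_bad + 1

# seq_along(input) is empty, so the body never runs
runs_ok <- 0
for (i in seq_along(input)) runs_ok <- runs_ok + 1

# lapply needs no index at all and simply returns an empty list
out <- lapply(input, function(el) el)
```

If you do write the for by hand, seq_along() is the safer way to build the index sequence.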
Performance
It’s a myth in R that for is very slow compared to lapply. What is actually slow in R is loops in general when compared to vectorized code. To demonstrate this, consider the benchmark below:
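library(magrittr)  # provides the %>% pipe used below

# Loop via lapply
fun_lapply <- function(n) {
  lapply(1:n, function(x) x + 1)
}

# Loop via for, with a preallocated result list
fun_for <- function(n) {
  out <- vector(mode = "list", length = n)
  for (i in 1:n) {
    out[[i]] <- i + 1
  }
  out
}

# Vectorized version: a single addition over the whole vector
fun_vet <- function(n) {
  as.list(1:n + 1)
}

bench::mark(
  fun_lapply(1000),
  fun_for(1000),
  fun_vet(1000)
) %>% dplyr::select(expression, min, mean, median, max)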
# A tibble: 3 x 5
  expression            min     mean   median      max
  <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>
1 fun_lapply(1000)    302µs  328.3µs  319.1µs   2.33ms
2 fun_for(1000)      44.6µs   48.7µs   47.1µs   2.06ms
3 fun_vet(1000)        14µs     18µs   16.7µs   3.73ms
Note that in this example lapply is much slower than the for, and the vectorized version is much faster than both loops. Here we can see the overhead caused by lapply because the function we are applying is very simple, just an addition. In practice, the difference between lapply and for is minimal, because what actually costs time is the function being run inside the loop.
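To get a feel for that last point, here is a rough sketch in base R in which each step is made artificially expensive. The function slow_step and its 1 ms pause are assumptions made for this illustration, not part of the benchmark above, and the timings will vary by machine.

```r
# Hypothetical costly step: a short sleep stands in for real work
slow_step <- function(x) {
  Sys.sleep(0.001)
  x + 1
}

n <- 50

# Time the lapply version
t_lapply <- system.time(
  res_lapply <- lapply(1:n, slow_step)
)[["elapsed"]]

# Time the for version with a preallocated list
t_for <- system.time({
  res_for <- vector(mode = "list", length = n)
  for (i in 1:n) res_for[[i]] <- slow_step(i)
})[["elapsed"]]

c(lapply = t_lapply, `for` = t_for)
```

With the cost of each step dominating, the two elapsed times should come out close to each other, and both versions produce the same list.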
So, in conclusion, for performance: use vectorization. When you can’t, use lapply, because you will have much less chance of making a mistake.
It is not true that the for is slower than the lapply. – Daniel Falbel
Well, on my machine the for loop is marginally slower, but the for loop has indeed improved a lot in recent versions of R and the difference is imperceptible. In my opinion, the great benefit of lapply is that it does not modify existing objects in the environment. – JdeMello
Which benchmark are you running? – Daniel Falbel
I’m using the microbenchmark package. With a vector with 100,00 elements, lapply came out faster on average. – JdeMello