lapply vs for
lapply and for are both functions in base R. Yes, for is a function too (a primitive one), so it can be called like any other function:
`for`(i, 1:10, {print(i)})
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for receives a variable name, a sequence of values, and an expression. It evaluates the expression once for each value in the sequence, assigning that value to the named variable each time. The important word here is expression: for evaluates expressions.
lapply receives a vector (in the sense of a low-level R vector, that is, anything that can be created using vector, which includes lists) and a function. It then applies this function to each element of the vector and returns the results in a list.
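As a small illustration of that description (the input values are made up for the example), lapply accepts both an atomic vector and a list, and in both cases returns a list with one element per input element:

```r
# Atomic vector input: the result is still a list, one element per input value.
squares <- lapply(1:3, function(v) v^2)

# List input with elements of different types (hypothetical example data).
mixed <- list(a = 1:3, b = "text", c = TRUE)
classes <- lapply(mixed, class)

str(squares)
str(classes)  # the names of the input are carried over to the result
```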
why do I need to call print inside the for?
The for does not return any useful result. It only evaluates the expression that was passed as an argument. If this expression does not print anything to the console, then the for will not print anything either.
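A quick way to check this, as a sketch: strictly speaking for does hand back a value, but it is always NULL, returned invisibly, so there is nothing useful to capture.

```r
# Capture whatever the loop "returns".
res <- for (i in 1:3) i + 1

is.null(res)  # the values computed in the body are discarded

# Printing has to happen explicitly inside the body.
for (i in 1:3) print(i + 1)
```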
lapply by itself does not print anything to the console either. The point is that lapply returns a list with the result of applying the function to each element of the vector. When a value is returned at the top level of an interactive session without being assigned to a variable, R prints it to the console. So the following works:
lapply(1:3, function(x) x)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
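The auto-printing rule can be inspected with base R's withVisible, which reports whether the value of an expression would be printed at top level. A short sketch (the names bare, assigned, looped and y are only for illustration):

```r
# A bare call yields a visible value, so the console prints it.
bare <- withVisible(lapply(1:3, function(x) x))

# An assignment yields an invisible value, so nothing is printed.
assigned <- withVisible(y <- lapply(1:3, function(x) x))

# A for loop yields an invisible value as well.
looped <- withVisible(for (i in 1:3) i)

c(bare = bare$visible, assigned = assigned$visible, looped = looped$visible)
```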
why can I create objects with lapply and not with for?
As we saw above, lapply returns a list. Since the function returns a value, we can assign a name to the result. That is, we can create a variable that stores the results.
x <- lapply(1:3, function(x) x)
We also saw earlier that the for returns nothing useful. It only evaluates the expression we passed as an argument, once for each value of the vector, with the variable set to that value. Since the for does not return anything, it is useless to save its result in a variable. What we usually do instead is make the for evaluate expressions that save values into objects.
x <- vector(mode = "list", length = 3)
for (i in 1:3) {
x[[i]] <- i
}
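One consequence worth seeing in code: because for evaluates its expression in the current environment, both the object it fills in and the index variable remain there afterwards, while the argument of the function given to lapply stays local to each call. A minimal sketch (the names out_for, out_lapply and elem are made up for the illustration):

```r
out_for <- vector(mode = "list", length = 3)
for (i in 1:3) {
  out_for[[i]] <- i * 10
}
exists("i")  # the index is still defined after the loop
i            # and holds the last value it was given

out_lapply <- lapply(1:3, function(elem) elem * 10)
exists("elem")  # FALSE, assuming no object named elem existed beforehand

identical(out_for, out_lapply)
```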
why use for or lapply?
In data analysis, most of the time we write a loop in order to apply the same function to each element of a vector. In this case, instead of writing:
x <- vector(mode = "list", length = 3)
for (i in 1:3) {
x[[i]] <- i
}
which, besides being longer, leaves much more room for error (for example, forgetting the index, or renaming the index at some point and forgetting to change it inside the expression),
we write:
x <- lapply(1:3, function(x) x)
Do you agree that there is less to get wrong there? In fact, it is more or less the same reason why we use for instead of a while to do the same thing:
x <- vector(mode = "list", length = 3)
i <- 1
while (i <= 3) {
x[[i]] <- i
i <- i + 1
}
x
There’s a lot more to get wrong, right?
Basically, we trade a little flexibility for fewer places to go wrong. The answer by @Jdemello has an interesting discussion about when to use for, while or lapply.
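One concrete example of such a slip, not taken from the code above but common in practice: when the input is empty, 1:length(x) counts down from 1 to 0, so the loop body runs twice on elements that do not exist, whereas lapply simply returns an empty list. The variable names here are hypothetical:

```r
empty <- list()

# 1:length(empty) is c(1, 0), so the loop body runs twice.
iterations <- 0
for (i in 1:length(empty)) {
  iterations <- iterations + 1
}
iterations

# seq_along() gives a zero-length sequence, so the loop is skipped.
safe_iterations <- 0
for (i in seq_along(empty)) {
  safe_iterations <- safe_iterations + 1
}
safe_iterations

# lapply needs no index at all.
length(lapply(empty, function(el) el))
```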
performance
It’s a myth in R that for is very slow compared to lapply. What is actually slow in R is any explicit loop when compared to vectorized code. To demonstrate this, consider the benchmark below:
fun_lapply <- function(n) {
lapply(1:n, function(x) x + 1)
}
fun_for <- function(n) {
out <- vector(mode = "list", length = n)
for (i in 1:n) {
out[[i]] <- i + 1
}
out
}
fun_vet <- function(n) {
as.list(1:n + 1)
}
library(magrittr)  # provides the %>% pipe used below
bench::mark(
fun_lapply(1000),
fun_for(1000),
fun_vet(1000)
) %>% dplyr::select(expression, min, mean, median, max)
# A tibble: 3 x 5
  expression            min     mean   median      max
  <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>
1 fun_lapply(1000)    302µs  328.3µs  319.1µs   2.33ms
2 fun_for(1000)      44.6µs   48.7µs   47.1µs   2.06ms
3 fun_vet(1000)        14µs     18µs   16.7µs   3.73ms
See how, in this example, lapply is much slower than the for, and the vectorized version is much faster than both loops. Here we can see the overhead caused by lapply because the function we are applying is very simple, just an addition. In practice, the difference between lapply and for is minimal, because what actually costs time is the function that runs inside the loop.
So, in conclusion, for performance: use vectorization. When you can’t, use lapply, because you will have much less chance of making a mistake.
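To make the recommendation concrete, here is a sketch showing that the three styles compute exactly the same thing, so the choice between them is about speed and safety rather than results. The last line uses base R's system.time instead of bench; its numbers will vary by machine and R version, and the object names are made up for the example:

```r
n <- 1000

by_lapply <- lapply(1:n, function(x) x + 1)

by_for <- vector(mode = "list", length = n)
for (i in 1:n) {
  by_for[[i]] <- i + 1
}

by_vec <- as.list(1:n + 1)  # vectorized: one addition over the whole vector

identical(by_lapply, by_for)
identical(by_for, by_vec)

# Rough timing of the vectorized form on a larger input.
system.time(as.list(seq_len(1e5) + 1))
```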
it is not true that the for is slower than the lapply.
– Daniel Falbel
Well, on my machine, the for loop is marginally slower, but in fact the for loop has improved a lot in recent versions of R and the difference is imperceptible. In fact, in my opinion, the great benefit of lapply is that it does not modify existing objects in the environment.
– JdeMello
which benchmark are you running?
– Daniel Falbel
I’m using the package microbenchmark. With a vector with 100,00 elements, my operation gave lapply faster on average.
– JdeMello