Differences and similarities between apply and for loop functions

5

I have this list:

dataset<-data.frame(matrix(runif(6*30,20,100),ncol=6))
cluster<-kmeans(dataset,centers=3)
cluster
dataset$kmeans<-as.factor(cluster[['cluster']])
mylist<-split(dataset,dataset$kmeans)
names(mylist)<-paste0('dataset',seq_along(mylist))

Suppose I want to know the names of the variables in each of the data frames in this list. With lapply:

lapply(mylist,function(x){
  names(x)
})

#$dataset1
#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"

#$dataset2
#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"

#$dataset3
#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"

With for (without print):

for(i in mylist){
  names(i)
}

# nothing is printed to the console

With for (with print):

for(i in mylist){
  print(names(i))
}

#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"
#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"
#[1] "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "kmeans"

In addition, it is possible to create an object with lapply:

x<-lapply(mylist,function(x){
  names(x)
})

But not with for:

x<-for(i in mylist){
  print(names(i))
}

I always thought the only difference between lapply and for was processing speed, as shown in this question. But from these details I’ve realized that there is more to it.

Thus:

  • what are the differences and similarities between the apply functions and for (and how the blocks of code inside them are built)?
  • why do I need print in for, and why can’t I create an object with it the same way I can with the apply functions?

3 answers

4


lapply vs for

lapply and for are both functions in R. Yes, for is a function too (a primitive one, whereas lapply is a regular R function that calls internal C code):

`for`(i, 1:10, {print(i + 1)})
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11

  • for receives a variable name, a sequence of values, and an expression. It evaluates the expression once for each value of the sequence, binding that value to the given variable name each time. The important word here is expression: for evaluates expressions.

  • lapply receives a vector (in the sense of a low-level R vector, that is, anything that can be created with vector(), including lists) and a function. lapply then applies this function to each element of the vector and returns the results in a list.
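As a side check, base R's is.primitive() and typeof() report how each of the two is implemented. This is only a minimal sketch, and the variable names are illustrative:

```r
# `for` is implemented as a primitive; lapply is an ordinary R closure
# whose body hands the work over to internal C code.
for_is_primitive    <- is.primitive(`for`)
lapply_is_primitive <- is.primitive(lapply)

for_is_primitive     # TRUE
lapply_is_primitive  # FALSE
typeof(`for`)        # "special"
typeof(lapply)       # "closure"
```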

why do I need to call print inside the for?

The for does not return any useful result (its value is always an invisible NULL). It only evaluates the expression that was passed as argument. If this expression does not print anything to the console, then the for will not print anything either.
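A minimal sketch of what this means in practice: assigning a for loop to a name stores NULL, and whatever the loop body did survives only through its side effects. The names res and seen are illustrative.

```r
seen <- character(0)

# The loop body runs three times, but the loop's own value is NULL
res <- for (v in c("a", "b", "c")) {
  seen <- c(seen, toupper(v))
}

is.null(res)  # TRUE
seen          # "A" "B" "C"
```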

lapply by itself does not print anything to the console either. The point is that lapply returns a list with the result of applying the function to each element of the vector. When a value is returned at the top level of an interactive session without being assigned to a variable, R prints it to the console. So the following works:

lapply(1:3, function(x) x)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

why can I create objects with lapply and not with for?

As we saw above, lapply returns a list. Because the function returns a value, we can assign a name to the result. That is, we can create a variable that stores the results.

x <- lapply(1:3, function(x) x)

We also saw earlier that for returns nothing useful. It only evaluates the expression we passed, once for each value in the sequence. Since for does not return anything, there is no point in saving its result in a variable. What we usually do instead is have the for evaluate expressions that save values into objects.

x <- vector(mode = "list", length = 3)
for (i in 1:3) {
  x[[i]] <- i
}

why use for or lapply?

In data analysis, most of the time we write a loop because we want to apply the same function to each element of a vector. In this case, instead of writing:

x <- vector(mode = "list", length = 3)
for (i in 1:3) {
  x[[i]] <- i
}

which, besides being longer, is much more error-prone (for example, you may forget the index, or rename the index at some point and forget to update it in the expression),

we write:

x <- lapply(1:3, function(x) x)

Do you agree that there is less to get wrong there? It is more or less the same reason why we use for instead of while to do the same thing:

x <- vector(mode = "list", length = 3)
i <- 1
while (i <= 3) {
  x[[i]] <- i
  i <- i + 1
}
x

There’s a lot more to get wrong, right? Basically, we trade a little flexibility for fewer places to go wrong. The chapter that @Jdemello linked has an interesting discussion about when to use for, while, or lapply.
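One concrete example of a place to go wrong in a hand-written loop, as a sketch: when the input is empty, 1:length(x) counts down from 1 to 0 instead of producing an empty sequence, whereas seq_along(x) and lapply handle that case without extra code. The variable names are illustrative.

```r
empty <- list()

# 1:length(empty) is c(1, 0), so a loop over it would run twice
bad_index  <- 1:length(empty)
good_index <- seq_along(empty)   # integer(0): the loop body never runs

length(bad_index)    # 2
length(good_index)   # 0

# lapply simply returns an empty list for empty input
length(lapply(empty, function(el) el))  # 0
```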

performance

It’s a myth that for is very slow in R compared to lapply. What is actually true is that loops of either kind are slow in R compared to vectorized code. To demonstrate this, consider the benchmark below:

fun_lapply <- function(n) {
  lapply(1:n, function(x) x + 1)
}

fun_for <- function(n) {
  out <- vector(mode = "list", length = n)
  for (i in 1:n) {
    out[[i]] <- i + 1
  }
  out
}

fun_vet <- function(n) {
  as.list(1:n + 1)
}

library(magrittr)  # provides the %>% pipe used below

bench::mark(
  fun_lapply(1000),
  fun_for(1000),
  fun_vet(1000)
) %>% dplyr::select(expression, min, mean, median, max)

# A tibble: 3 x 5
  expression            min     mean   median      max
  <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>
1 fun_lapply(1000)    302µs  328.3µs  319.1µs   2.33ms
2 fun_for(1000)      44.6µs   48.7µs   47.1µs   2.06ms
3 fun_vet(1000)        14µs     18µs   16.7µs   3.73ms

See how, in this example, lapply is much slower than for, and the vectorized version is much faster than both loops. Here we can see the overhead of lapply because the function being applied is very simple, just an addition. In practice, the difference between lapply and for is minimal, because the real cost lies in the function that runs on each iteration.

So, in conclusion, for performance: use vectorization. When you can’t, use lapply, because you will have much less chance of making a mistake.
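Before comparing timings, it is worth confirming that the three versions really compute the same thing. The sketch below redefines the three functions from the benchmark so that it runs on its own, and uses base R's system.time() as a rough stand-in for bench::mark. Timings vary by machine and R version, so none are shown.

```r
fun_lapply <- function(n) lapply(1:n, function(x) x + 1)

fun_for <- function(n) {
  out <- vector(mode = "list", length = n)
  for (i in 1:n) out[[i]] <- i + 1
  out
}

fun_vet <- function(n) as.list(1:n + 1)

# All three return the same list
same_lapply_for <- identical(fun_lapply(1000), fun_for(1000))
same_lapply_vet <- identical(fun_lapply(1000), fun_vet(1000))

# Rough timing: repeat each call so the elapsed time is measurable
timing <- system.time(for (r in 1:200) fun_lapply(1000))
timing[["elapsed"]]
```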

3

Similarities

The apply family is a hidden loop. The code below tries to reconstruct, syntactically, the similarity between lapply and a for.

resultado <- vector("integer", 10)
for (i in seq_along(resultado)) {
  resultado[[i]] <- i
}
resultado
# [1]  1  2  3  4  5  6  7  8  9 10

resultado2 <- sapply(seq_along(resultado), function(i) {
  i
})
resultado2
#  [1]  1  2  3  4  5  6  7  8  9 10

Another similarity between for and apply is that both are functions (like everything that "happens" in R).

class(`for`)
# [1] "function"
class(apply)
# [1] "function"

The differences begin when we look at the philosophies underlying these two functions.

Differences

The for returns NULL invisibly. For this reason, it is designed for side effects. Put another way, for is a function that takes as arguments the elements it will iterate over, executes code in the environment in which it was called for each of these elements, and always returns an invisible NULL. Something like this:

meu_for <- function(nome, iteradores, expr) {
  env <- parent.frame()  # run in the caller's environment, as for does
  contador <- 1
  while (contador <= length(iteradores)) {
    # bind the current element to the loop variable in the caller
    assign(as.character(substitute(nome)), iteradores[[contador]], envir = env)

    eval(substitute(expr), env)
    contador <- contador + 1
  }
  invisible(NULL)
}

meu_for(i, 1:10, {
  i
})
# Prints nothing

meu_for(i, 1:10, {
  print(i)
})
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

The apply family, on the other hand, is more closely tied to the functional paradigm, which is averse to side effects. As a result, these functions return a value instead of expecting you to produce the result through side effects.
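A small sketch of that contrast, using two throwaway environments (demo_env and fun_env are illustrative names) so the effect is easy to inspect: the for loop leaves its loop variable and anything it assigned behind in the environment where it ran, while lapply leaves behind only the value we chose to keep.

```r
demo_env <- new.env()
evalq({
  for (i in 1:3) total <- i * 2
}, demo_env)
for_leftovers <- ls(demo_env)      # "i" "total"

fun_env <- new.env()
evalq({
  doubled <- lapply(1:3, function(j) j * 2)
}, fun_env)
lapply_leftovers <- ls(fun_env)    # "doubled"
```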

print and creation of objects in the for

The fact that for always returns an invisible NULL is why the loop does not print while apply does. This is because, in R, evaluating an object at the top level of an interactive session is the same as calling print(object). In the middle of a function body, however, evaluating an object does not print it.

funcao <- function() {
  "Does not print me"
  "But prints me, because I am the function's return value"
}

funcao()
# [1] "But prints me, because I am the function's return value"
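The visibility of a result can be inspected with base R's withVisible(), which reports the value of an expression together with the flag that decides whether the top level would print it. A minimal sketch, with illustrative names:

```r
loop_result  <- withVisible(for (i in 1:3) i)
apply_result <- withVisible(lapply(1:3, function(i) i))

loop_result$value     # NULL
loop_result$visible   # FALSE: nothing is auto-printed
apply_result$visible  # TRUE: the list is auto-printed at the top level
```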

Also, as Jenny Bryan has already said, someone has probably written a loop for you in functions such as base::lapply or purrr::map. The advantage of relying on these forms is that

  1. You can take advantage of improvements and optimizations made by dedicated people.
  2. The code "inside" the loop is contained in a function and can therefore be tested and reused more easily.

1

Differences between lapply() and for

This is a non-exhaustive comment on that topic.

Indeed, lapply tends to be faster than for in R. That is because lapply() is partly written in a low-level language (C). Under the hood, if you inspect lapply, you will see the following:

> lapply
function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<bytecode: 0x0000000002dcc0e8>
<environment: namespace:base>

When we inspect .Internal, we get:

> base::.Internal
function (call)  .Primitive(".Internal")

.Primitive functions are written in C and tend to be more efficient. However, they are built differently from what we are used to in R. That’s why lapply() tends to be faster than for.

However, bear in mind that a for loop may be preferable to lapply, for example when you have to modify part of an object:

set.seed(1)
df <- data.frame(num_1 = runif(100, 0, 100),
                 num_2 = rnorm(100, 100, 20),
                 char_1 = sample(letters, 100, replace = TRUE),
                 stringsAsFactors = FALSE)

# modify only the numeric columns
cols <- grep(x = names(df), pattern = "(?i)^num", value = TRUE, perl = TRUE)

With for:

for(i in cols){
  df[[i]] <- round(df[[i]]) 
}

With lapply:

invisible(
lapply(cols, function(x){
  df[[x]] <<- round(df[[x]])  # use x, the function's argument (i is not defined here)
})
)

In this case, we have to use <<- inside lapply in order to modify df permanently. Moreover, we need invisible() to keep the output of lapply from being printed to the console.[1]
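An alternative that avoids <<- altogether, sketched here on a small stand-in data frame (df_demo and num_cols are illustrative names, not part of the example above): apply the function to the selected columns and assign the resulting list back in one step.

```r
df_demo <- data.frame(
  num_1  = c(1.2, 2.7, 3.6),
  num_2  = c(10.49, 20.51, 30.01),
  char_1 = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

num_cols <- grep("^num", names(df_demo), value = TRUE)

# lapply returns a list of rounded columns; assigning it to df_demo[num_cols]
# replaces those columns in the current environment, no <<- needed
df_demo[num_cols] <- lapply(df_demo[num_cols], round)

df_demo$num_1  # 1 3 4
df_demo$num_2  # 10 21 30
```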

Besides, for allows certain control-flows such as next:

for(i in seq_along(df)){

  if(is.character(df[[i]])) next # skip this iteration if the condition is met

  df[[i]] <- round(df[[i]]) 
}
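For comparison, a rough functional counterpart, assuming a small stand-in data frame (df_small is an illustrative name): inside the function passed to lapply, an early return() plays the role that next plays in the loop, while break has no direct equivalent.

```r
df_small <- data.frame(
  num_1  = c(1.4, 2.6),
  char_1 = c("x", "y"),
  stringsAsFactors = FALSE
)

# df_small[] <- ... keeps the data frame structure while replacing its columns
df_small[] <- lapply(df_small, function(col) {
  if (is.character(col)) return(col)  # "skip": hand the column back unchanged
  round(col)
})

df_small$num_1   # 1 3
df_small$char_1  # "x" "y"
```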

If you want to "create" an object with for, you have to create it before you start the loop, because...

why do I need print in for, and why can’t I create an object with it the same way I can with the apply functions?

for is a loop and does not return a useful object. apply and its variants are functions that return objects. To create an object with for, you need to define it before the loop starts (using your mylist):

x <- vector(mode = "character", length = length(mylist))  # preallocate a character vector
for(i in seq_along(mylist)){
  x[[i]] <- print(names(mylist)[[i]])
}

Output:

[1] "dataset1"
[1] "dataset2"
[1] "dataset3"
> x
[1] "dataset1" "dataset2" "dataset3"

[1] http://adv-r.had.co.nz/Functionals.html#functionals-not

  • It is not true that for is slower than lapply.

  • Well, on my machine, the for loop is marginally slower, but in fact the for loop has improved a lot in recent versions of R and the difference is imperceptible. In my opinion, the great benefit of lapply is that it does not modify existing objects in the environment.

  • Which benchmark are you running?

  • I’m using the microbenchmark package. With a vector of 100,00 elements, lapply came out faster on average in my run.
