Why are loops slow in R? How to avoid them?

6

It is very common to hear (or read) that loops are inefficient in R and should be avoided (see this link, another link, or even this one).

And proving this statement is simple:

numeros <- rnorm(10000)

com_loop <- function(vetor) {
  res <- 0
  for (i in seq_along(vetor)) {
    res <- res + vetor[i]
  }
  res
}

microbenchmark::microbenchmark(
  loop = com_loop(numeros),
  vetorizado = sum(numeros)
)

Unit: microseconds
       expr     min      lq      mean  median       uq      max neval
       loop 494.709 512.670 562.71062 514.723 551.9285 3074.480   100
 vetorizado   9.750  10.263  10.77702  10.264  10.2640   28.226   100

The questions I ask are:

  1. Why are loops slow in R?
  2. What alternatives are there? (packages, strategies, etc.)

2 answers

9


Excellent questions. Below are my two cents on them.

1. Why are loops slow in R?

Loops are slow in R because of an intrinsic feature of interpreted languages. Any code written in R (which is an interpreted language, like Python or Ruby) is read and translated by the interpreter while the program is running, and only then executed.

C, on the other hand, is a compiled language. All code written in C is compiled ahead of time into an executable in the native machine code of the machine's operating system and processor, and only after that is it executed.

If we write a loop in an interpreted language like R, the translation step from R code to machine operations occurs on every iteration of the loop. This adds several extra steps to the execution of the program, steps that do not exist in a compiled language, and each of these intermediate steps adds to the total execution time.
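One contributor to the per-iteration cost, offered here as an illustration rather than a full account: in R, operators such as `[` and `+` are themselves function calls that the interpreter has to look up and evaluate on every pass through the loop body. The sketch below uses a made-up vector `x`.

```r
x <- c(10, 20, 30)

# Indexing and addition are ordinary function calls under the hood
second_a <- x[2]
second_b <- `[`(x, 2)

total_a <- x[1] + x[2]
total_b <- `+`(x[1], x[2])

# Even the loop construct itself is a function
for_is_function <- is.function(`for`)
```

So a loop body like `res <- res + vetor[i]` involves several such calls per iteration, repeated once for every element.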

I understand that this may not directly answer your question. Let me rephrase it as follows:

Why are loops in R slower than vectorized code?

It may not seem like it, but the answer to this question is in the description above. Many of R's built-in functions, such as the sum from your example, were written in C, C++ or Fortran. Note the output that appears at the prompt when you type sum:

sum
function (..., na.rm = FALSE)  .Primitive("sum")

This function was not written in R. The .Primitive marker means it is implemented in R's compiled internals, written in C, C++ or Fortran, which makes it much faster: the loop over the elements runs as compiled code instead of being interpreted step by step. Hence the difference in execution time between com_loop and the vetorizado version in the example from your question.
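A quick way to check whether a function is implemented in compiled code or in R is to inspect its type. The sketch below uses only base R; `com_loop_sketch` is a stand-in for the `com_loop` function from the question, redefined here so the block runs on its own.

```r
# Stand-in for the question's com_loop, redefined so this block is self-contained
com_loop_sketch <- function(vetor) {
  res <- 0
  for (i in seq_along(vetor)) {
    res <- res + vetor[i]
  }
  res
}

# sum is a primitive: it is implemented inside R's compiled internals
sum_is_primitive <- is.primitive(sum)
sum_type <- typeof(sum)

# com_loop_sketch is an ordinary R function (a closure) run by the interpreter
loop_is_primitive <- is.primitive(com_loop_sketch)
loop_type <- typeof(com_loop_sketch)

# Both compute the same total for this small input
x <- c(1.5, 2.5, 3)
same_result <- isTRUE(all.equal(com_loop_sketch(x), sum(x)))
```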

2. What alternatives exist? (packages, strategies, etc)

Basically, there are three strategies for optimizing R code. They will not always work, though, because every case is different.

  1. Use vectorized code

For example, the apply family of functions has an advantage over loops. Often (though not always), using functions from this family will make your code faster. After all, R is a language that works best with vectors. The apply family takes advantage of this feature of R and can therefore end up faster than a for (or while, etc.).

Besides, in my opinion, they make the code cleaner and easier to audit later.
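As a minimal sketch of this first strategy, the example below computes the squares of a vector in three ways: an explicit loop, `vapply` from the apply family, and a fully vectorized expression. The names `squares_loop`, `squares_vapply` and `squares_vec` are made up for illustration; timings vary by machine, so none are shown.

```r
x <- c(1, 2, 3, 4, 5)

# Explicit loop, with the result vector preallocated
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# apply family: vapply calls the function once per element
# and checks that each result is a single number
squares_vapply <- vapply(x, function(v) v^2, numeric(1))

# Fully vectorized: the arithmetic runs over the whole vector in compiled code
squares_vec <- x^2
```

Of the three, the vectorized expression usually asks the least of the interpreter, because the loop over the elements happens inside compiled code.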

  2. Parallelize the code

Use the parallel processing power of your computer. Instead of using a single core to do the job, distribute it across more cores. The best-known packages for this are parallel, doMC and foreach.

Unfortunately, I tried in the past and never managed to make them work on Windows. I suspect this is because doMC and the forking functions in parallel rely on process forking, which Windows does not support. However, they are easy to use on macOS and Linux.
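For reference, here is a minimal sketch of the parallel strategy using only the `parallel` package that ships with R. It assumes a socket (PSOCK) cluster, which the `parallel` documentation describes as available on all platforms, Windows included. The worker count of 2 and the function `slow_square` are arbitrary choices for illustration.

```r
library(parallel)

# Hypothetical example function: stands in for an expensive computation
slow_square <- function(v) {
  v^2
}

# Start a socket cluster with 2 workers (an arbitrary choice for this sketch)
cl <- makeCluster(2)

# Distribute the work across the workers; the result is a list, like lapply
res_parallel <- parLapply(cl, 1:8, slow_square)

# Always release the workers when done
stopCluster(cl)

# The sequential equivalent, for comparison
res_sequential <- lapply(1:8, slow_square)
```

For very small tasks like this one, the cost of starting workers and sending data usually outweighs any gain. Parallelism tends to pay off only when each task is expensive.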

  3. Read the book The R Inferno. It covers many strategies beyond the two I mentioned above. The book opened my eyes in the past, showing what I was doing wrong when writing my code. It presents 9 strategies in much more detail than this summary, and I'm sure it will answer many of your questions.

3

Complementing @Marcos Nunes' answer, which is excellent, the text that made me understand the difference between loops and vectorization was this one: Vectorization in R: Why?

R is a high-level language, meaning it takes care of many of the details of interpreting your code for you. For example, when you write code like this:

i <- 5.0

You don't need to tell the computer:

  1. that 5.0 is a floating-point number;
  2. that "i" must hold a numeric data type;
  3. to find a place in memory for the number 5.0;
  4. to register "i" as a pointer to that place in memory;
  5. that i <- 5.0 must be converted to binary, as this is done when you press Enter;
  6. if you change the value of "i" to, for example, i <- "b", that "i" no longer holds a number but a character string.

When you put this inside a for, R repeats this interpretation process on every iteration. And that's what makes the loop slow.

On the other hand, if you put all the values in a single vector, this interpretation process takes place only once, reducing the processing time. This is also why vectors accept only one data type: you cannot have integers, factors and characters in the same vector, because that would break the logic of the vector, which is to perform those six steps once.
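The one-type-per-vector rule can be seen directly with `c()`, which coerces mixed inputs to a single common type. This is a small sketch using base R only; the variable names are made up for illustration.

```r
# Mixing types in c() coerces everything to the most general type
mixed_num <- c(1L, 2.5)
mixed_chr <- c(1L, 2.5, "a")

type_num <- typeof(mixed_num)
type_chr <- typeof(mixed_chr)

# A list, by contrast, can hold elements of different types
mixed_list <- list(1L, 2.5, "a")
list_types <- vapply(mixed_list, typeof, character(1))
```

Here `mixed_num` ends up as a double vector and `mixed_chr` as a character vector, while the list keeps each element's own type.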
