Excellent questions. Here are my two cents on them.
1. Why are loops slow in R?
Loops are slow in R because of an intrinsic feature of interpreted languages. Code written in R (an interpreted language, like Python or Ruby) is read and translated into machine instructions by the interpreter while the program is running.
C, on the other hand, is a compiled language. Code written in C is first compiled into an executable in the native machine code of the target operating system and processor, and only then is it run.
If we write a loop in an interpreted language like R, the translation from R code to machine instructions happens at every iteration of the loop. This adds several extra steps to the execution of the program, steps that do not exist in a compiled language, and each of them adds to the total running time.
I understand that this may not directly answer your question, so let me rephrase it as follows:
Why are loops in R slower than vectorized code?
Although it may not seem like it, the answer to this question is in the description above. Many of R's built-in functions, such as the sum from your example, are written in C, C++ or Fortran. Note the output that appears at the prompt when you type sum:
sum
function (..., na.rm = FALSE) .Primitive("sum")
This function was not written in R. The .Primitive marker shows that it is implemented in R's internal compiled code (C, in the case of sum), just as many other built-in functions are implemented in C, C++ or Fortran. Compiled code does not pay the interpretation cost at each iteration, so it runs much faster. Hence the difference in execution time between the com_loop and vetorizado codes from the example in your question.
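To make the comparison concrete, here is a minimal sketch. It does not reproduce the com_loop and vetorizado code from the question; sum_loop is a hypothetical helper written only for this illustration, and the timings will vary from machine to machine.

```r
# Hypothetical helper: sums a numeric vector with an explicit R-level loop.
sum_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- as.numeric(1:1e6)

# Both approaches give the same result...
result_loop <- sum_loop(x)
result_vectorized <- sum(x)

# ...but the loop is evaluated by the interpreter one iteration at a time,
# while sum() traverses the whole vector inside compiled code.
time_loop <- system.time(sum_loop(x))
time_vectorized <- system.time(sum(x))
print(time_loop)
print(time_vectorized)

# sum is a primitive (implemented internally); mean is an ordinary R closure.
print(is.primitive(sum))
print(is.primitive(mean))
```

On most machines the loop version should take noticeably longer, although recent versions of R byte-compile loops, which narrows the gap.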
2. What alternatives exist? (packages, strategies, etc)
Basically, there are three strategies to try to optimize code in R. However, they will not always work, because every case is different.
- Use vectorized code
For example, the apply family of functions has an advantage over loops. Often (though not always), using functions from this family will make your code faster. After all, R is a language that works best with vectors. The apply functions take advantage of this and can therefore end up faster than a for (or while, etc.) loop, although the gain varies from case to case.
Besides, in my opinion, they make the code cleaner and easier to audit later.
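As a small sketch of what this looks like in practice, the same computation is written below first with a for loop and then with vapply, one member of the apply family. The samples list is made-up data for illustration only.

```r
# Hypothetical input: a named list of numeric vectors.
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Loop version: preallocate the result, then fill it element by element.
means_loop <- numeric(length(samples))
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}
names(means_loop) <- names(samples)

# apply-family version: vapply() also checks that each result is a single number.
means_vapply <- vapply(samples, mean, numeric(1))

print(means_loop)
print(means_vapply)
```

The vapply version says in one line what the loop says in five, which is the readability advantage mentioned above. Whether it is also faster depends on the case.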
- Parallelize the code
Use the parallel processing power of your computer. Instead of using a single core to do the job, distribute it across more cores. The best-known packages for this are parallel, doMC and foreach.
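Unfortunately, I tried in the past and never managed to make them work on Windows. I suspect this is because doMC and the fork-based functions of parallel (such as mclapply) rely on process forking, which Windows does not support. They are easy to use on macOS and Linux, though.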
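As a hedged sketch of the parallel strategy: the parallel package can also start socket-based workers with makeCluster(), which does not depend on forking and so should be the route to try on Windows as well. The slow_square function and the choice of two workers are assumptions made only for this example.

```r
library(parallel)

# Hypothetical slow task: square a number after a short pause.
slow_square <- function(n) {
  Sys.sleep(0.1)
  n^2
}

inputs <- 1:8

# Start two worker processes (a socket cluster is the default type).
cluster <- makeCluster(2)

# parLapply() splits the inputs across the workers and collects the results.
results_parallel <- parLapply(cluster, inputs, slow_square)

# Always release the workers when done.
stopCluster(cluster)

# Sequential reference for comparison.
results_sequential <- lapply(inputs, slow_square)
```

Starting the workers has a cost of its own, so for very small tasks the parallel version can end up slower than the sequential one.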
- Read the book R Inferno. It covers many strategies beyond the two I mentioned above. The book opened my eyes in the past, showing what I was doing wrong when writing my code. It has 9 strategies, described in much more detail than in this summary, and I’m sure it will clear up many of your doubts.