How to parallelize on multiple levels in R?

I’ve been researching how to parallelize for loops in R and found the foreach package, which, as I understand it (correct me if I’m wrong), replaces for as follows:

library(foreach)
vetor <- rep(NA, 10)
n <- seq_len(10)
foreach(j = n) %dopar% {
vetor[j] <- j + 1
}

My question is what to do when I have nested loops such as for(){for(){}}, for(){for(){for(){}}} and so on. Is it possible to parallelize the inner levels as well?

  • It seems the answer is the %:% operator. Take a look.

  • I just found out that I need to study doParallel (library(doParallel)) to make foreach (library(foreach)) work properly.

  • @Tomassbarcellos the point of the operator is that with %do% foreach behaves like a regular for, while with %dopar% it is the parallelized version. However, you first have to configure the parallel backend with doParallel.

  • Using %dopar% without first registering a backend with doParallel does not run in parallel: foreach falls back to sequential execution and issues a warning. Before you run foreach(...) %dopar% {...}, you need to run doParallel::registerDoParallel(). Afterwards, just close the cluster it created with doParallel::stopImplicitCluster().
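
Putting these comments together, here is a minimal sketch (not from the original thread) of a two-level loop written with foreach, %:% and %dopar%. The worker count of 2 and the toy computation i * 10 + j are assumptions made only for illustration. Note that foreach returns its results instead of modifying a variable such as vetor in the calling environment, so the values are collected through .combine.

```r
library(foreach)
library(doParallel)

# Register a parallel backend; 2 workers is an arbitrary choice for this sketch.
registerDoParallel(cores = 2)

# %:% merges the two foreach calls into a single stream of (i, j) tasks.
# The inner results are combined with c(), the outer ones with rbind(),
# which gives one row per i and one column per j.
res <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:4, .combine = c) %dopar% {
    i * 10 + j
  }

# Release the implicit cluster, if one was created.
stopImplicitCluster()

dim(res)
```

Because %:% hands the backend a single pool of tasks, there is still only one level of workers; the nesting is in the loop structure, not in the parallelism.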

1 answer

Generally, it doesn’t pay to parallelize at more than one level. It is possible, but it will not make your code run faster unless the first level of parallelism fails to use all of the computer’s idle resources.

Nowadays the easiest way to write parallel code in R is to use the future package in combination with furrr.

Here is a classic example of parallelization:

library(furrr)
#> Loading required package: future
library(purrr)
plan(multisession)

fun <- function(x) {
  Sys.sleep(1)
  x
}

system.time(
  map(1:4, fun)  
)
#>    user  system elapsed 
#>   0.004   0.001   4.020

system.time(
  future_map(1:4, fun)  
)
#>    user  system elapsed 
#>   0.077   0.012   1.297

Created on 2019-02-13 by the reprex package (v0.2.1)

In the example, the parallel version takes a little more than 1s while the sequential version takes 4s, as expected.

Now let’s add a second level of parallelization.

library(furrr)
#> Loading required package: future
library(purrr)
plan(multisession)

fun <- function(x) {
  Sys.sleep(1)
  x
}

system.time(
  future_map(1:4, ~map(1:4, fun))  
)
#>    user  system elapsed 
#>   0.090   0.012   4.391

system.time(
  future_map(1:4, ~future_map(1:4, fun))  
)
#>    user  system elapsed 
#>   0.065   0.005   4.223

Created on 2019-02-13 by the reprex package (v0.2.1)

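Note that the two forms take very similar times. This happens because the first level of parallelization already uses all of the computer’s idle CPU resources, so the second level has nothing left to gain. In addition, future runs nested futures sequentially by default, so the inner future_map does not start extra workers unless a nested plan is set up explicitly.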

The first level might not use all of the computer’s resources. If, for example, my computer had 8 cores instead of 4, parallelizing only on the first level would leave 4 cores underutilized. In that case it would make sense to parallelize the second level as well. However, this is rare: in general we parallelize loops in which the number of iterations is greater than the number of cores.
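
For that rare case, here is a sketch of how a second level could be enabled with future. It is an illustration under assumptions, not part of the timings above: the worker counts (2 outer, 2 inner) are arbitrary, and wrapping the inner count in I() follows my reading of the future documentation on nested topologies, where it marks the value as intentional so that the built-in protection against nested parallelism does not reduce it.

```r
library(future)
library(furrr)

# Two-level topology: 2 outer workers, each allowed 2 inner workers.
# The counts are illustrative; choose them so that outer * inner does not
# exceed the number of cores on your machine.
plan(list(
  tweak(multisession, workers = 2),
  tweak(multisession, workers = I(2))
))

fun <- function(x) {
  Sys.sleep(1)
  x
}

# The outer future_map uses the first plan in the list,
# the inner future_map uses the second one.
res <- future_map(1:2, ~ future_map(1:2, fun))

# Go back to sequential processing, which shuts the workers down.
plan(sequential)

str(res)
```

Wrapping the call in system.time(), as in the examples above, is the simplest way to check whether the second level actually pays off on a given machine.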
