
I am trying to start parallelizing workloads in R. Since I am only just approaching this way of writing code, I have run a benchmark to understand how effective it could be.

Here is the code:

library(doParallel)
library(doSNOW)
library(foreach)
library(tictoc)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)

tic()
result <- foreach(i = 1:10000, .combine = c) %dopar% {
  i^2
}
toc()

stopCluster(cl)

tic()
result <- c()
for (i in 1:10000) {
  result[i] <- i^2
}
toc()

and here are the results:

> library(doParallel)
> library(doSNOW)
> library(foreach)
> library(tictoc)
> no_cores <- detectCores() - 1
>
> cl <- makeCluster(no_cores)
> registerDoParallel(cl)
>
> tic()
> result <- foreach(i = 1:10000, .combine = c) %dopar% {
+   i^2
+ }
> toc()
3.83 sec elapsed
>
> stopCluster(cl)
>
> tic()
> result <- c()
> for (i in 1:10000) {
+   result[i] <- i^2
+ }
> toc()
0.02 sec elapsed

It looks like the serial execution took less time. Is there anything wrong with my approach?

  • Please check my updated answer, I think it provides further support for your understanding. Commented Aug 23, 2018 at 11:59

1 Answer


Parallelization adds some overhead to your code. It only pays off if the task to be solved is "difficult"/"time-consuming" enough. You have tested with a very simple example, which R can solve quickly without parallelization. Try thinking of more complex examples.
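For instance, here is a rough sketch of what a "heavier" per-iteration task could look like (not from the original question; the matrix size and iteration count are arbitrary choices, and whether the parallel version actually wins depends on your machine and number of cores):

library(doParallel)
library(foreach)
library(tictoc)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# parallel: each iteration inverts a random 500 x 500 matrix,
# so the per-iteration work is large compared to the communication overhead
tic()
res_par <- foreach(i = 1:200, .combine = c) %dopar% {
  m <- matrix(rnorm(500 * 500), nrow = 500)
  sum(solve(m))
}
toc()

stopCluster(cl)

# serial version of the same amount of work
tic()
res_seq <- numeric(200)
for (i in 1:200) {
  m <- matrix(rnorm(500 * 500), nrow = 500)
  res_seq[i] <- sum(solve(m))
}
toc()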

Below I have simulated some more "difficult" tasks by defining how long they take via Sys.sleep. As expected, the parallel code solves the task in much less time on 3 cores: the three iterations of the loop together take only a bit more (due to the overhead) than the longest single iteration takes on its own.

UPDATE: Another important aspect I did not spot in your code at first sight is how you store your results. For parallel code it usually makes sense to split your results data structure (vector, data.table, etc.) into chunks. In any case, you should initialize your results data structure, since you usually know its length and type in advance (such initialization should also be the standard approach in non-parallel loops). This can increase speed significantly. I have provided a direct comparison of base and parallel options below for your simple calculation example.
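As a minimal illustration of the pre-allocation point alone (the loop bound n is an arbitrary choice here, and exact timings will differ per machine):

library(tictoc)

n <- 1e6

# growing the result vector element by element (as in your serial loop)
tic()
res_grow <- c()
for (i in 1:n) {
  res_grow[i] <- i^2
}
toc()

# pre-allocating the result vector with its known length and type
tic()
res_pre <- numeric(n)
for (i in 1:n) {
  res_pre[i] <- i^2
}
toc()

identical(res_grow, res_pre)  # should be TRUE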

Furthermore, at the end there is a benchmark that gives you an idea of the overhead of parallel processing. I have adapted one of the parallel functions from the example above by setting up the cluster, etc. outside of the function. The results at the end therefore show the pure calculation time, whereas in the example above the parallel functions include the time for setting up the parallel processing.

library(doParallel)
library(foreach)
library(microbenchmark)

n_cores <- 3
cl <- makeCluster(n_cores)
registerDoParallel(cl)

microbenchmark(
  (foreach(i = 1:3) %dopar% {Sys.sleep(i)})
  ,(for (i in 1:3) {Sys.sleep(i)})
  ,times = 1)
# Unit: seconds
#                                        expr      min       lq     mean   median       uq      max neval
#  (foreach(i = 1:3) %dopar% {Sys.sleep(i) }) 3.046903 3.046903 3.046903 3.046903 3.046903 3.046903     1
#            (for (i in 1:3) {Sys.sleep(i) }) 6.164373 6.164373 6.164373 6.164373 6.164373 6.164373     1

stopCluster(cl)

par_sqrt_loop = function(n) {
  n_cores <- 3
  cl = makeCluster(n_cores)
  registerDoParallel(cl)
  res = vector(mode = "numeric", length = n)
  res = foreach(i = 1:n) %dopar% {
    res[i] = i^2
  }
  stopCluster(cl)
  unlist(res)
}

# there might be further options to increase speed
# depending on the problem to be solved (e.g. multicombine)
# but for the purpose of demonstration the approach below seems ok...
par_sqrt_loop_w_chunking = function(n) {
  n_cores <- 3
  cl <- makeCluster(n_cores)
  registerDoParallel(cl)
  res = vector(mode = "numeric", length = n)
  chunks_res = list(c1 = 1:1000, c2 = 1001:2000, c3 = 2001:3000)
  res = foreach(i = 1:n, combine = c) %dopar% {
    res = res[chunks_res[[i]]]
    res[i] = i^2
  }
  stopCluster(cl)
  unlist(res)
}

base_sqrt_loop = function(n) {
  res = vector(mode = "numeric", length = n)
  for (i in 1:n) {
    res[i] = i^2
  }
  res
}

base_sqrt_vect = function(n) {
  res = (1:n)^2
  res
}

# results are the same
a = base_sqrt_loop(3)
b = base_sqrt_vect(3)
c = par_sqrt_loop(3)
d = par_sqrt_loop_w_chunking(3)
all.equal(a, b) # TRUE
all.equal(a, c) # TRUE
all.equal(a, d) # TRUE

# check difference of timings
microbenchmark(
  base_sqrt_loop(1e5)
  ,base_sqrt_vect(1e5)
  ,par_sqrt_loop(1e5)
  ,par_sqrt_loop_w_chunking(1e5)
  ,times = 1)
# Unit: milliseconds
#                            expr          min           lq         mean       median           uq          max neval
#           base_sqrt_loop(1e+05)     9.829663     9.829663     9.829663     9.829663     9.829663     9.829663     1
#           base_sqrt_vect(1e+05)     5.370965     5.370965     5.370965     5.370965     5.370965     5.370965     1
#            par_sqrt_loop(1e+05) 48908.724402 48908.724402 48908.724402 48908.724402 48908.724402 48908.724402     1
# par_sqrt_loop_w_chunking(1e+05)   793.252624   793.252624   793.252624   793.252624   793.252624   793.252624     1

# for the fastest parallel option from above
# keep the overhead of setting up clusters, etc. out of the function
n_cores = 3
cl = makeCluster(n_cores)
registerDoParallel(cl)

par_sqrt_loop_w_chunking_reduced_overhead = function(n) {
  res = vector(mode = "numeric", length = n)
  chunks_res = list(c1 = 1:1000, c2 = 1001:2000, c3 = 2001:3000)
  res = foreach(i = 1:n, combine = c) %dopar% {
    res = res[chunks_res[[i]]]
    res[i] = i^2
  }
  unlist(res)
}

microbenchmark(
  par_sqrt_loop_w_chunking_reduced_overhead(1e5)
  ,times = 1)

stopCluster(cl)
# Unit: milliseconds
#                                              expr      min       lq     mean   median       uq      max neval
# par_sqrt_loop_w_chunking_reduced_overhead(1e+05) 97.80002 97.80002 97.80002 97.80002 97.80002 97.80002     1