The code below produces different results on Windows and Ubuntu. I understand this is because the two platforms handle parallel processing differently.
Summarizing:
I cannot insert / rbind data in parallel on Linux (mclapply, mcmapply), while I can do it on Windows.
Thanks @Hong Ooi for pointing out that mclapply does not run in parallel on Windows, yet the question below is still valid.
Of course, there are no multiple inserts into the same data.frame; each insert is performed into a separate data.frame.

    library(R6)
    library(parallel)

    # storage objects generator
    cl <- R6Class(
      classname = "cl",
      public = list(
        data = data.frame(NULL),
        initialize = function() invisible(self),
        insert = function(x) self$data <- rbind(self$data, x)
      )
    )

    N <- 4L # number of entities
    i <- setNames(seq_len(N),paste0("n",seq_len(N)))

    # random data.frames
    set.seed(1)
    ldt <- lapply(i, function(i) data.frame(replicate(sample(3:10,1),sample(letters,1e5,rep=TRUE))))

    # entity storage
    lcl1 <- lapply(i, function(i) cl$new())
    lcl2 <- lapply(i, function(i) cl$new())
    lcl3 <- lapply(i, function(i) cl$new())

    # insert data
    invisible({
      mclapply(names(i), FUN = function(n) lcl1[[n]]$insert(ldt[[n]]))
      mcmapply(FUN = function(dt, cl) cl$insert(dt), ldt, lcl2, SIMPLIFY=FALSE)
      lapply(names(i), FUN = function(n) lcl3[[n]]$insert(ldt[[n]]))
    })

    ### Windows
    sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000
    sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000
    sapply(lcl3, function(cl) nrow(cl$data)) # lapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000

    ### Unix
    sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
    # n1 n2 n3 n4
    #  0  0  0  0
    sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
    # n1 n2 n3 n4
    #  0  0  0  0
    sapply(lcl3, function(cl) nrow(cl$data)) # lapply
    #     n1     n2     n3     n4
    # 100000 100000 100000 100000

And the question:
How can I rbind in parallel into separate data.frames on a Linux platform?
P.S. Off-memory storage such as SQLite cannot be considered as a solution in my case.
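
For context on the behaviour above: on Unix-alikes, mclapply and mcmapply fork the R session, so each worker modifies its own copy of the R6 objects and those changes never reach the parent process. On Windows the same calls run serially in the parent, which is why the inserts appear to work there. One possible workaround, shown as a minimal sketch rather than a definitive solution, is to let the workers perform the rbind and return the resulting data.frame, and to assign the results to the storage objects in the parent. The names ids, cores and res, and the small 1000-row data, are hypothetical and chosen only for illustration.

```r
library(R6)
library(parallel)

# Storage class mirroring the one in the question
cl <- R6Class(
  classname = "cl",
  public = list(
    data = data.frame(NULL),
    insert = function(x) {
      self$data <- rbind(self$data, x)
      invisible(self)
    }
  )
)

# Hypothetical small inputs, only for illustration
ids <- setNames(paste0("n", 1:4), paste0("n", 1:4))
set.seed(1)
ldt <- lapply(ids, function(n) data.frame(x = sample(letters, 1000, replace = TRUE)))
lcl <- lapply(ids, function(n) cl$new())

# Forking is unavailable on Windows, where mc.cores must be 1
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Workers do the rbind and RETURN the result; they do not try to
# modify the parent's R6 objects, since those changes would be lost
res <- mclapply(ids, function(n) rbind(lcl[[n]]$data, ldt[[n]]), mc.cores = cores)

# The parent process stores the returned data.frames (results keep input order)
for (k in seq_along(ids)) lcl[[k]]$data <- res[[k]]

sapply(lcl, function(obj) nrow(obj$data))
```

The trade-off in this sketch is that each result is serialized back from the worker to the parent, so for large data.frames the transfer cost may offset part of the gain from running rbind in parallel.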
data.table package. I know my tip doesn't answer your question directly, but it still might help performance. The data.table package plays nicer with large data sets (GBs range) than base R's data.frames. data.table package and its different extensions...