
I have code that works perfectly for my purpose (it reads some files matching a specific pattern, reads the matrix within each file, and computes something for each file pair; the final output is a matrix whose dimension equals the number of files). It looks like this:

m <- 100
output <- matrix(0, m, m)
lista <- list.files(pattern = "q")
listan <- as.matrix(lista)
n <- nrow(listan)
for (i in 1:n) {
  AA <- read.table(listan[i,], header = FALSE)
  A <- as.matrix(AA)
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    BB <- read.table(listan[j,], header = FALSE)
    B <- as.matrix(BB)
    V <- sqrt(dVarX * sqrt(mean(B * B)))
    output[i,j] <- sqrt(mean(A * B)) / V
  }
}

My problem is that it takes a lot of time (I have about 5000 matrices, which means 5000x5000 loops). I would like to parallelize it, but I need some help! Waiting for your kind suggestions!

Thank you in advance!

Gab

  • Touching the disk is slow. Think about how many times you're reading in each matrix from the disk. Why not do that only once per matrix? Commented Feb 5, 2013 at 17:30
  • possible duplicate of stackoverflow.com/questions/14316203/parallelize-a-r-code Commented Feb 5, 2013 at 17:32
  • To add to the comment by @joran, the Memory usage section of ?read.table explicitly says, "Use scan instead for matrices." Commented Feb 5, 2013 at 17:38
  • ...and that's just the reading-from-disk part. You're also duplicating the calculation of sqrt(mean(B*B)) for each matrix. Parallelizing code this inefficient is like trying to speed up your commute to work by running from your house to your car instead of walking. Commented Feb 5, 2013 at 17:44
  • @joran you're right! But I'm very new to R (I started before Christmas!) and to programming, which is why I need a lot of help. Anyway, I wrote a command (using llply) that reads each matrix from disk only once and builds a list, but a list of 5000 matrices was too big for my 12 GB of RAM. Commented Feb 6, 2013 at 10:44
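To make the two suggestions above concrete without holding 5000 matrices in RAM: the per-file quantity sqrt(mean(A*A)) can be precomputed in a single pass and stored as a plain numeric vector (n numbers, not n matrices), so only the cross term mean(A*B) still needs both files inside the pair loop. A rough standalone sketch (the qdemo* file names and 5x5 size are made up for illustration; substitute your real pattern = "q" files):

```r
# Demo data: three small files of whitespace-separated numbers.
# Hypothetical names/sizes -- replace with your real files.
set.seed(1)
for (f in paste0("qdemo", 1:3)) write(runif(25), f, 5)

files <- paste0("qdemo", 1:3)
n <- length(files)

# One pass over the files: store each file's RMS, sqrt(mean(A*A)),
# in a numeric vector instead of recomputing it for every pair.
rms <- vapply(files, function(f) {
  A <- scan(f, quiet = TRUE)
  sqrt(mean(A * A))
}, numeric(1))

# Pair loop: only the cross term mean(A*B) needs both files now.
output <- matrix(0, n, n)
for (i in 1:n) {
  A <- scan(files[i], quiet = TRUE)
  for (j in i:n) {
    B <- scan(files[j], quiet = TRUE)
    output[i, j] <- sqrt(mean(A * B)) / sqrt(rms[i] * rms[j])
  }
}
```

A quick sanity check: each diagonal entry is sqrt(mean(A*A)) / rms[i], which should be exactly 1.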

1 Answer


The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.

Since your matrices are being written by another R process, you really should save them in R's binary format. Every matrix has to be read from disk at least once, so the only way to make your program faster is to make reading from disk faster.

Here's an example that shows you how much faster it could be:

# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
  m <- matrix(runif(700*700), 700, 700)
  f <- paste0("f", i)
  write(m, f, 700)              # text format
  saveRDS(m, paste0(f, ".rds")) # binary format
}
# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)
# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)

First, let's run your code using scan, which is already a lot faster than your current read.table solution.

system.time({
  for (i in 1:n) {
    A <- scan(files[i], quiet=TRUE)
    for (j in i:n) {
      B <- scan(files[j], quiet=TRUE)
      o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#   31.37    0.78   32.58

Now, let's re-run that code using the files saved in R's binary format:

system.time({
  for (i in 1:n) {
    fA <- paste0(files[i], ".rds")
    A <- readRDS(fA)
    for (j in i:n) {
      fB <- paste0(files[j], ".rds")
      B <- readRDS(fB)
      o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#    2.42    0.39    2.92

So the binary format is ~10x faster! And the output is the same:

all.equal(o1, o2)
# [1] TRUE
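The commenters' point about not recomputing sqrt(mean(B*B)) for every pair combines naturally with the binary format. A standalone sketch (it generates its own small fdemo* .rds files for illustration; the names and 10x10 size are made up, so point it at your real files):

```r
# Standalone sketch: regenerate a few small demo .rds files.
set.seed(21)
files <- paste0("fdemo", 0:4)
for (f in files) saveRDS(matrix(runif(100), 10, 10), paste0(f, ".rds"))
n <- length(files)

# Precompute each matrix's RMS once, instead of once per pair.
rms <- vapply(files,
              function(f) sqrt(mean(readRDS(paste0(f, ".rds"))^2)),
              numeric(1))

o3 <- matrix(NA, n, n)
for (i in 1:n) {
  A <- readRDS(paste0(files[i], ".rds"))
  for (j in i:n) {
    B <- readRDS(paste0(files[j], ".rds"))
    o3[i, j] <- sqrt(mean(A * B)) / sqrt(rms[i] * rms[j])
  }
}
```

This reads each file once for the RMS pass plus once per pair for the cross term, but drops the redundant norm computations from the inner loop; the diagonal comes out as exactly 1, which makes a handy sanity check.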

1 Comment

Thank you so much for the code and all the clear and useful explanations. Really appreciated!!
