
I have code that works perfectly for my purpose (it reads some files matching a specific pattern, reads the matrix within each file, and computes something for each file pair; the final output is a matrix whose dimension equals the number of files). It looks like this:

m <- 100
output <- matrix(0, m, m)
lista <- list.files(pattern = "q")
listan <- as.matrix(lista)
n <- nrow(listan)
for (i in 1:n) {
  AA <- read.table(listan[i,], header = FALSE)
  A <- as.matrix(AA)
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    BB <- read.table(listan[j,], header = FALSE)
    B <- as.matrix(BB)
    V <- sqrt(dVarX * sqrt(mean(B * B)))
    output[i,j] <- sqrt(mean(A * B)) / V
  }
}

My problem is that it takes a lot of time (I have about 5000 matrices, which means 5000x5000 loops). I would like to parallelize it, but I need some help! Waiting for your kind suggestions!

Thank you in advance!

Gab

  • Touching the disk is slow. Think about how many times you're reading in each matrix from the disk. Why not do that only once per matrix? Commented Feb 5, 2013 at 17:30
  • possible duplicate of stackoverflow.com/questions/14316203/parallelize-a-r-code Commented Feb 5, 2013 at 17:32
  • To add to the comment by @joran, the Memory usage section of ?read.table explicitly says, "Use scan instead for matrices." Commented Feb 5, 2013 at 17:38
  • ...and that's just the reading-from-disk part. You're also duplicating the calculation of sqrt(mean(B*B)) for each matrix. Parallelizing code this inefficient is like trying to speed up your commute to work by running from your house to your car instead of walking. Commented Feb 5, 2013 at 17:44
  • @joran you're right! But I'm very new to R (I started before Christmas!) and to programming, which is why I need a lot of help. Anyway, I wrote a command (using llply) that reads each matrix from disk only once and builds a list, but a list of 5000 matrices was too big for my 12 GB of RAM. Commented Feb 6, 2013 at 10:44
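To make the two suggestions above concrete without holding 5000 matrices in RAM: the per-file quantity sqrt(mean(A*A)) can be precomputed in a single pass and stored as a plain numeric vector (n numbers, not n matrices), so only the cross term mean(A*B) still needs both files inside the pair loop. A rough standalone sketch (the qdemo* file names and 5x5 size are made up for illustration; substitute your real pattern = "q" files):

```r
# Demo data: three small files of whitespace-separated numbers.
# Hypothetical names/sizes -- replace with your real files.
set.seed(1)
for (f in paste0("qdemo", 1:3)) write(runif(25), f, 5)

files <- paste0("qdemo", 1:3)
n <- length(files)

# One pass over the files: store each file's RMS, sqrt(mean(A*A)),
# in a numeric vector instead of recomputing it for every pair.
rms <- vapply(files, function(f) {
  A <- scan(f, quiet = TRUE)
  sqrt(mean(A * A))
}, numeric(1))

# Pair loop: only the cross term mean(A*B) needs both files now.
output <- matrix(0, n, n)
for (i in 1:n) {
  A <- scan(files[i], quiet = TRUE)
  for (j in i:n) {
    B <- scan(files[j], quiet = TRUE)
    output[i, j] <- sqrt(mean(A * B)) / sqrt(rms[i] * rms[j])
  }
}
```

A quick sanity check: each diagonal entry is sqrt(mean(A*A)) / rms[i], which should be exactly 1.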

1 Answer


The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.

Since your matrices are being written by another R process, you really should save them in R's binary format. Every matrix has to be read from disk at least once, so the only way to make your program faster is to make reading from disk faster.

Here's an example that shows you how much faster it could be:

# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
  m <- matrix(runif(700*700), 700, 700)
  f <- paste0("f", i)
  write(m, f, 700)              # text format
  saveRDS(m, paste0(f, ".rds")) # binary format
}
# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)
# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)

First, let's run your code using scan, which is already a lot faster than your current read.table solution.

system.time({
  for (i in 1:n) {
    A <- scan(files[i], quiet=TRUE)
    for (j in i:n) {
      B <- scan(files[j], quiet=TRUE)
      o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#   31.37    0.78   32.58

Now, let's re-run that code using the files saved in R's binary format:

system.time({
  for (i in 1:n) {
    fA <- paste0(files[i], ".rds")
    A <- readRDS(fA)
    for (j in i:n) {
      fB <- paste0(files[j], ".rds")
      B <- readRDS(fB)
      o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
#    user  system elapsed
#    2.42    0.39    2.92

So the binary format is ~10x faster! And the output is the same:

all.equal(o1, o2)
# [1] TRUE
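The commenters' point about not recomputing sqrt(mean(B*B)) for every pair combines naturally with the binary format. A standalone sketch (it generates its own small fdemo* .rds files for illustration; the names and 10x10 size are made up, so point it at your real files):

```r
# Standalone sketch: regenerate a few small demo .rds files.
set.seed(21)
files <- paste0("fdemo", 0:4)
for (f in files) saveRDS(matrix(runif(100), 10, 10), paste0(f, ".rds"))
n <- length(files)

# Precompute each matrix's RMS once, instead of once per pair.
rms <- vapply(files,
              function(f) sqrt(mean(readRDS(paste0(f, ".rds"))^2)),
              numeric(1))

o3 <- matrix(NA, n, n)
for (i in 1:n) {
  A <- readRDS(paste0(files[i], ".rds"))
  for (j in i:n) {
    B <- readRDS(paste0(files[j], ".rds"))
    o3[i, j] <- sqrt(mean(A * B)) / sqrt(rms[i] * rms[j])
  }
}
```

This reads each file once for the RMS pass plus once per pair for the cross term, but drops the redundant norm computations from the inner loop; the diagonal comes out as exactly 1, which makes a handy sanity check.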

1 Comment

Thank you so much for the code and all the clear and useful explanations. Really appreciated!!
