7

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:

for (i in 1:99000) { subset.data <- data[index[, i], ] } 

Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset the original data matrix to draw resamples of the data.

# The basic setup B <- 1000 # no. of bootstrap replications n <- 250 # no. of observations m <- 100 # no. of models/data series # Create index matrix with B columns and n rows. # Each column represents a resampling of the data. # (actually block resamples, but doesn't matter here). boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B) # Make matrix with m data series of length n. sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m) subsetMatrix <- function(data, index) { # fn definition for timing subset.data <- data[index, ] return(subset.data) } # check how long it takes. Rprof("subsetMatrix.out") for (i in 1:(m - 1)) { for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000 boot.data <- subsetMatrix(sample.data, boot.index[, b]) # do some other stuff } # do some more stuff } Rprof() summaryRprof("subsetMatrix.out") # > summaryRprof("subsetMatrix.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 9.96 100 9.96 100 # In the actual application: ######### # > summaryRprof("seq_testing.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 1.98 15.76 2.20 17.52 # makeIndex 1.08 8.60 2.12 16.88 # makeStats 0.66 5.25 9.66 76.91 # runif 0.60 4.78 0.72 5.73 # apply 0.30 2.39 0.42 3.34 # is.data.frame 0.22 1.75 0.22 1.75 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # estimateMCS 0.10 0.80 12.56 100.00 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # # $by.total # total.time total.pct self.time self.pct # estimateMCS 12.56 100.00 0.10 0.80 # makeStats 9.66 76.91 0.66 5.25 # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 2.20 17.52 1.98 15.76 # makeIndex 2.12 16.88 1.08 8.60 # runif 0.72 5.73 0.60 4.78 # doTest 0.68 5.41 0.00 0.00 # apply 0.42 3.34 0.30 2.39 # aperm 0.26 2.07 0.00 0.00 # is.data.frame 0.22 1.75 0.22 1.75 # sweep 0.20 1.59 0.00 0.00 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # unlist 0.06 0.48 0.00 0.00 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # mean 0.02 0.16 0.00 0.00 # # $sample.interval # [1] 0.02 # # $sampling.time # [1] 12.56' 

Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but still not very pleasing :/

  • Is there a better way to resample the data / get rid of the loop?
  • Can apply, Vectorize, replicate etc. come in anywhere?
  • Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?

Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!

related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix

from there

mapply(function(row) return(sample.data[row,]), row = boot.index) replicate(B, apply(sample.data, 2, sample, replace = TRUE)) 

didn't really do it for me.

3
  • 2
    [ is extremely fast and unlikely to be the problem. Your first summaryRprof is a bit useless since the only thing you are doing there is use subsetMatrix. Your second summaryRprof might be revealing in showing that other operations like lookupMatrix or colMeans are taking much more time than subsetMatrix but you are not showing us enough of your code or profiles. That your code is generally slow is in my opinion the result of that double for loop. If your algorithm can be vectorized, we can help you. But we need to see your code and a reproducible example. Commented Dec 8, 2013 at 19:54
  • Thanks for your comments. @DWin, running the whole code works for me. Commented Dec 8, 2013 at 22:34
  • @flodel, I uploaded the code at github, but I didn't want to complicate things. Instead of the first Rprof a system.time would have done it. I only defined the whole subsetMatrix (same as lookupMatrix) function to measure the time it takes in the overall application. [ involves making space in memory(?). Would it be faster to simply manipulate pointers in C? Commented Dec 8, 2013 at 22:44

1 Answer 1

3

I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:

makeStats <- function(data, index) { data.mean <- colMeans(data) m <- nrow(data) n <- ncol(index) tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m)) weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index)) boot.data.mean <- t(data) %*% weights - data.mean return(list(data.mean = data.mean, boot.data.mean = boot.data.mean)) } makeIndex <- function(B, blocks){ n <- ncol(blocks) l <- nrow(blocks) z <- ceiling(n/l) start.points <- sample.int(n, z * B, replace = TRUE) index <- blocks[, start.points] keep <- c(rep(TRUE, n), rep(FALSE, z*l - n)) boot.index <- matrix(as.vector(index)[keep], nrow = n, ncol = B) return(boot.index) } 

This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)

Sign up to request clarification or add additional context in comments.

2 Comments

This is awesome. Thank you so much!! I'll take it as an exercise to go through the rest of the code then.
No problem. Here is another one I noticed but was too lazy to try because you used it a lot: sweep is in general slower than using a double transpose: e.g. sweep(x, 2, y, FUN = "-") versus t(t(x) - y). That's only recommended if x is a matrix, not data.frame.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.