Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) { subset.data <- data[index[, i], ] } Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset the original data matrix to draw resamples of the data.
# The basic setup B <- 1000 # no. of bootstrap replications n <- 250 # no. of observations m <- 100 # no. of models/data series # Create index matrix with B columns and n rows. # Each column represents a resampling of the data. # (actually block resamples, but doesn't matter here). boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B) # Make matrix with m data series of length n. sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m) subsetMatrix <- function(data, index) { # fn definition for timing subset.data <- data[index, ] return(subset.data) } # check how long it takes. Rprof("subsetMatrix.out") for (i in 1:(m - 1)) { for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000 boot.data <- subsetMatrix(sample.data, boot.index[, b]) # do some other stuff } # do some more stuff } Rprof() summaryRprof("subsetMatrix.out") # > summaryRprof("subsetMatrix.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 9.96 100 9.96 100 # In the actual application: ######### # > summaryRprof("seq_testing.out") # $by.self # self.time self.pct total.time total.pct # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 1.98 15.76 2.20 17.52 # makeIndex 1.08 8.60 2.12 16.88 # makeStats 0.66 5.25 9.66 76.91 # runif 0.60 4.78 0.72 5.73 # apply 0.30 2.39 0.42 3.34 # is.data.frame 0.22 1.75 0.22 1.75 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # estimateMCS 0.10 0.80 12.56 100.00 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # # $by.total # total.time total.pct self.time self.pct # estimateMCS 12.56 100.00 0.10 0.80 # makeStats 9.66 76.91 0.66 5.25 # subsetMatrix 6.78 53.98 6.78 53.98 # colMeans 2.20 17.52 1.98 15.76 # makeIndex 2.12 16.88 1.08 8.60 # runif 0.72 5.73 0.60 4.78 # doTest 0.68 5.41 0.00 0.00 # apply 0.42 3.34 0.30 2.39 # aperm 0.26 2.07 0.00 0.00 # is.data.frame 0.22 1.75 0.22 1.75 # sweep 0.20 1.59 0.00 0.00 # ceiling 0.18 1.43 0.18 1.43 # aperm.default 0.14 1.11 0.14 1.11 # array 0.12 0.96 0.12 0.96 # as.vector 0.10 0.80 0.10 0.80 # matrix 0.08 0.64 0.08 0.64 # lapply 0.06 0.48 0.06 0.48 # unlist 0.06 0.48 0.00 0.00 # / 0.04 0.32 0.04 0.32 # : 0.04 0.32 0.04 0.32 # rowSums 0.04 0.32 0.04 0.32 # - 0.02 0.16 0.02 0.16 # > 0.02 0.16 0.02 0.16 # mean 0.02 0.16 0.00 0.00 # # $sample.interval # [1] 0.02 # # $sampling.time # [1] 12.56' Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but still not very pleasing :/
- Is there a better way to resample the data / get rid of the loop?
- Can apply, Vectorize, replicate etc. come in anywhere?
- Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
from there
mapply(function(row) return(sample.data[row,]), row = boot.index) replicate(B, apply(sample.data, 2, sample, replace = TRUE)) didn't really do it for me.
[is extremely fast and unlikely to be the problem. Your firstsummaryRprofis a bit useless since the only thing you are doing there is usesubsetMatrix. Your secondsummaryRprofmight be revealing in showing that other operations likelookupMatrixorcolMeansare taking much more time thansubsetMatrixbut you are not showing us enough of your code or profiles. That your code is generally slow is in my opinion the result of that doubleforloop. If your algorithm can be vectorized, we can help you. But we need to see your code and a reproducible example.Rprofasystem.timewould have done it. I only defined the wholesubsetMatrix(same aslookupMatrix) function to measure the time it takes in the overall application.[involves making space in memory(?). Would it be faster to simply manipulate pointers in C?