
I am relatively new to data.table and was hoping to use its fast subsetting to carry out some bootstrapping procedures.

In my example I have two columns of 1 million random normals, and I want to take a sample of some of the rows and calculate the correlation between the two columns. I was hoping for some of the 100x speed improvements suggested on the data.table webpage, but perhaps I am misusing data.table. If so, how should the function be structured to get that speed improvement?

Please see below for my example:

require(data.table)

n <- 1e6
set.seed(1)
q <- data.frame(a = rnorm(n), b = rnorm(n))
q.dt <- data.table(q)

df.samp <- function() {cor(q[sample(seq(n), n * 0.01), ])[2, 1]}
dt.samp <- function() {q.dt[sample(seq(n), n * 0.01), cor(a, b)]}

require(microbenchmark)
microbenchmark(median(sapply(seq(100), function(y) {df.samp()})),
               median(sapply(seq(100), function(y) {dt.samp()})),
               times = 100)

Unit: milliseconds
                                                expr       min        lq    median        uq       max neval
median(sapply(seq(100), function(y) { df.samp() })) 1547.5399 1673.1460 1747.0779 1860.3371 2028.6883   100
median(sapply(seq(100), function(y) { dt.samp() }))  583.4724  647.0869  717.7666  764.4481  989.0562   100
  • My theory: you ARE seeing the effects of the improved sampling, but it is the additional step of running cor() on all your samples that is the irreducible time bottleneck. Commented Aug 28, 2013 at 16:09
  • Your speed test is a little convoluted. Maybe try samp <- sample.int(n, n/100); microbenchmark(q[samp,], q.dt[samp])? I'm seeing data.table subsetting as about twice as fast with that (see the sketch below). Commented Aug 28, 2013 at 17:01
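A minimal sketch of that comparison, reusing the n, q, and q.dt objects from the question; the index vector is drawn once up front, so the benchmark isolates the subsetting step itself:

require(data.table)
require(microbenchmark)

samp <- sample.int(n, n / 100)   # one fixed set of row indices

microbenchmark(q[samp, ],        # data.frame subsetting
               q.dt[samp],       # data.table subsetting
               times = 100)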

1 Answer


In addition to @DWin's comment:

If you profile your code, you will see that the most costly repeated function calls are those to seq (which needs to be called at most once) and sample.

Rprof()
median(sapply(seq(2000), function(y) { dt.samp() }))
Rprof(NULL)
summaryRprof()
# $by.self
#                 self.time self.pct total.time total.pct
# "seq.default"        3.70    35.10       3.70     35.10
# "sample.int"         2.84    26.94       2.84     26.94
# "[.data.table"       1.84    17.46      10.52     99.81
# "sample"             0.34     3.23       6.90     65.46
# "[[.data.frame"      0.16     1.52       0.34      3.23
# "length"             0.14     1.33       0.14      1.33
# "cor"                0.10     0.95       0.26      2.47
# <snip>

Faster subsetting doesn't help with that.
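For instance, here is a minimal restructuring along those lines (the dt.samp2 name is just illustrative), assuming the n and q.dt objects from the question: build the index vector once, outside the repeated call, so the cost of seq is no longer paid on every replicate.

idx <- seq_len(n)   # built once, not once per bootstrap replicate

dt.samp2 <- function() q.dt[sample(idx, n * 0.01), cor(a, b)]

median(sapply(seq_len(100), function(y) dt.samp2()))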


1 Comment

Actually, you can eliminate seq() entirely by doing sample.int(n, n * 0.01) (sketched below).
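For completeness, a sketch of that suggestion (the dt.samp3 name is illustrative), again assuming n and q.dt from the question; sample.int(n, k) draws k indices from 1:n directly, so the full index vector is never materialized:

dt.samp3 <- function() q.dt[sample.int(n, n * 0.01), cor(a, b)]

median(sapply(seq_len(100), function(y) dt.samp3()))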
