
I am relatively new to data.table and was hoping to use its fast subsetting to carry out some bootstrapping procedures.

In my example I have two columns of 1 million random normals, and I want to take a sample of some of the rows and calculate the correlation between the two columns. I was hoping for some of the 100x speed improvements suggested on the data.table webpage, but perhaps I am misusing data.table. If so, how should the function be structured to get that speed improvement?

Please see below for my example:

require(data.table)

n <- 1e6
set.seed(1)
q <- data.frame(a = rnorm(n), b = rnorm(n))
q.dt <- data.table(q)

df.samp <- function() {cor(q[sample(seq(n), n * 0.01), ])[2, 1]}
dt.samp <- function() {q.dt[sample(seq(n), n * 0.01), cor(a, b)]}

require(microbenchmark)
microbenchmark(median(sapply(seq(100), function(y) {df.samp()})),
               median(sapply(seq(100), function(y) {dt.samp()})),
               times = 100)

Unit: milliseconds
                                                expr       min        lq    median        uq       max neval
median(sapply(seq(100), function(y) { df.samp() })) 1547.5399 1673.1460 1747.0779 1860.3371 2028.6883   100
median(sapply(seq(100), function(y) { dt.samp() }))  583.4724  647.0869  717.7666  764.4481  989.0562   100
  • My theory: you ARE seeing the effects of the improved sampling, but it is the additional step of running cor() on all your samples that is the irreducible time bottleneck. Commented Aug 28, 2013 at 16:09
  • Your speed test is a little convoluted. Maybe try samp <- sample.int(n, n/100); microbenchmark(q[samp,], q.dt[samp])? I'm seeing data.table subsetting as about twice as fast with that (see the sketch below). Commented Aug 28, 2013 at 17:01
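A minimal sketch of that comparison, reusing the n, q, and q.dt objects from the question; the index vector is drawn once up front, so the benchmark isolates the subsetting step itself:

require(data.table)
require(microbenchmark)

samp <- sample.int(n, n / 100)   # one fixed set of row indices

microbenchmark(q[samp, ],        # data.frame subsetting
               q.dt[samp],       # data.table subsetting
               times = 100)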

1 Answer


In addition to @DWin's comment:

If you profile your code, you will see that the most costly repeated function calls are those to seq (which needs to be called at most once) and sample.

Rprof()
median(sapply(seq(2000), function(y) { dt.samp() }))
Rprof(NULL)
summaryRprof()
# $by.self
#                 self.time self.pct total.time total.pct
# "seq.default"        3.70    35.10       3.70     35.10
# "sample.int"         2.84    26.94       2.84     26.94
# "[.data.table"       1.84    17.46      10.52     99.81
# "sample"             0.34     3.23       6.90     65.46
# "[[.data.frame"      0.16     1.52       0.34      3.23
# "length"             0.14     1.33       0.14      1.33
# "cor"                0.10     0.95       0.26      2.47
# <snip>

Faster subsetting doesn't help with that.
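For instance, here is a minimal restructuring along those lines (the dt.samp2 name is just illustrative), assuming the n and q.dt objects from the question: build the index vector once, outside the repeated call, so the cost of seq is no longer paid on every replicate.

idx <- seq_len(n)   # built once, not once per bootstrap replicate

dt.samp2 <- function() q.dt[sample(idx, n * 0.01), cor(a, b)]

median(sapply(seq_len(100), function(y) dt.samp2()))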


1 Comment

Actually, you can eliminate seq() entirely by doing sample.int(n, n * 0.01) (sketched below).
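For completeness, a sketch of that suggestion (the dt.samp3 name is illustrative), again assuming n and q.dt from the question; sample.int(n, k) draws k indices from 1:n directly, so the full index vector is never materialized:

dt.samp3 <- function() q.dt[sample.int(n, n * 0.01), cor(a, b)]

median(sapply(seq_len(100), function(y) dt.samp3()))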
