Bootstrapping multiple columns in data.table in a scalable fashion R

Question

This is a follow-up to an earlier question. In the original question, the OP wanted to bootstrap two fixed columns, x1 and x2:

    library(data.table)
    library(boot)

    set.seed(1000)
    data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
    stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
    data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

However, I think this problem can be nicely extended to handle any number of columns by treating them as groups. For instance, let's use the iris dataset. Say I want to calculate the bootstrap mean of all four measurements for each species. I can use melt to reshape the data to long format and then group by the Species, variable combination to get the means in one go. I think this approach will scale well.

    data(iris)
    iris = data.table(iris)
    iris[,mean(Sepal.Length),by=Species]
    iris[,ID:=.N,]
    iris_deep = melt(iris
                     ,id.vars = c("ID","Species")
                     ,measure.vars = c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"))

    #define a mean bootstrap function
    stat <- function(x, i) {x[i, m=mean(value),]}
    iris_deep[, list(list(boot(.SD, stat, R = 100))), by = list(Species,variable)]$V1

That is my attempt. However, the bootstrapping part does not work: R throws the following error:

Error in mean(value) : object 'value' not found

Can someone please take a crack at this?

renato vitolo · Accepted Answer · 2016-08-17 20:10:46Z


I tried this (with added parentheses enclosing m=mean(value)) and it appears to work:

stat <- function(x, i) {x[i, (m=mean(value))]}
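A likely explanation for why the parentheses help: written as `m=mean(value)`, the expression is passed to `[` as a named argument rather than as the `j` expression, so it is evaluated outside the data.table's scope, where no `value` exists. Dropping the name altogether should behave the same as wrapping it in parentheses. The sketch below is only an illustration under that assumption; `dt`, `res` and the column name `b` are made up here, the ID column is filled with `seq_len(.N)` instead of `.N`, and `copy(.SD)` is used as a precaution so that each boot object keeps its own copy of the data.

```r
library(data.table)
library(boot)

dt <- as.data.table(iris)
dt[, ID := seq_len(.N)]  # one ID per row (assumption: this is what the ID column is for)

iris_deep <- melt(dt,
                  id.vars = c("ID", "Species"),
                  measure.vars = c("Sepal.Length", "Sepal.Width",
                                   "Petal.Length", "Petal.Width"))

# Unnamed j expression: evaluated inside the data.table, so `value` is found
stat <- function(x, i) x[i, mean(value)]

set.seed(1)
res <- iris_deep[, list(b = list(boot(copy(.SD), stat, R = 100))),
                 by = list(Species, variable)]

# One boot object per Species/variable combination
length(res$b)
res$b[[1]]$t0  # observed mean for the first group (setosa, Sepal.Length)
```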

edited Aug 17, 2016 at 20:10

answered Aug 17, 2016 at 8:24


StupidWolf · Accepted Answer · 2020-09-02 22:06:37Z

We can make full use of each bootstrap resample by calculating the mean of every variable within each group from the same resample, instead of rerunning the bootstrap for each variable.

So if we do something like this, it calculates the mean for each variable:

    iris = data.table(iris)
    iris[sample(nrow(iris),replace=TRUE),lapply(.SD,mean,na.rm=TRUE),by=Species]

Because boot requires a vector / matrix output, we need to modify the output above, and provide names for the vector:

    d = function(dat,ind){
      k = dat[ind,lapply(.SD,mean,na.rm=TRUE),by=Species]
      k_vec = unlist(k[,-1])
      names(k_vec) = paste(rep(colnames(k)[-1],each=nrow(k)),
                           rep(k$Species,(ncol(k)-1)),sep="_")
      k_vec
    }

    d(iris,sample(nrow(iris),replace=TRUE))

    Sepal.Length_versicolor  Sepal.Length_virginica     Sepal.Length_setosa
                  5.8784314               6.4851852               4.9688889
     Sepal.Width_versicolor   Sepal.Width_virginica      Sepal.Width_setosa
                  2.7392157               2.9814815               3.3977778
    Petal.Length_versicolor  Petal.Length_virginica     Petal.Length_setosa
                  4.1980392               5.5037037               1.4644444
     Petal.Width_versicolor   Petal.Width_virginica      Petal.Width_setosa
                  1.2960784               2.0944444               0.2333333

Then use boot with strata = iris$Species so that every resample draws the same number of rows from each species:

bo_strata = boot(iris,d,R=1000,strata=iris$Species)
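One detail worth guarding against: with `by=Species` the order of the species in the result follows their first appearance in the resampled rows, as the sample output above shows. A hedged variant is sketched below that uses `keyby` so the statistics always come back in the same order, and then summarises the result. The names `dt`, `group_means`, `bo` and `se` are made up for this illustration, and R = 200 is an arbitrary choice to keep it quick.

```r
library(data.table)
library(boot)

dt <- as.data.table(iris)

# Hypothetical helper: per-species means of every column as a named vector.
# keyby sorts the result by Species, so the statistics are in the same
# order in every resample.
group_means <- function(dat, ind) {
  k <- dat[ind, lapply(.SD, mean, na.rm = TRUE), keyby = Species]
  vals <- unlist(k[, -1])
  names(vals) <- paste(rep(colnames(k)[-1], each = nrow(k)),
                       rep(as.character(k$Species), times = ncol(k) - 1),
                       sep = "_")
  vals
}

set.seed(1)
bo <- boot(dt, group_means, R = 200, strata = dt$Species)

# Bootstrap standard error of each of the 12 statistics
se <- apply(bo$t, 2, sd)
names(se) <- names(bo$t0)
round(se, 3)

# Percentile interval for the first statistic (Sepal.Length_setosa)
boot.ci(bo, type = "perc", index = 1)
```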

We can compare the bootstrap distributions from this approach with those from the approach in the question:

    stat <- function(x, i) {x[i, (m=mean(value))]}
    bo_melt = iris_deep[, list(list(boot(.SD, stat, R = 1000))), by = list(Species,variable)]$V1

    par(mfrow=c(4,3))
    par(mar=c(3,3,3,3))
    for(i in 1:ncol(bo_strata$t)){
      plot(density(bo_strata$t[,i]),main=names(bo_strata$t0)[i],col="#43658b")
      lines(density(bo_melt[[i]]$t),col="#ffa372")
      legend("topright",fill=c("#43658b","#ffa372"),c("strata","other"))
    }
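If a numerical check is preferred over density plots, the bootstrap standard errors of the two approaches can be put side by side. This is only a rough sketch: `wide_stat`, `long_stat`, `bo_wide`, `bo_long` and `comparison` are names invented here, R = 500 is arbitrary, and `keyby` is used in both approaches so that the rows line up.

```r
library(data.table)
library(boot)

dt <- as.data.table(iris)
vars <- setdiff(names(dt), "Species")

# Approach 1: one stratified bootstrap over the wide table
wide_stat <- function(dat, ind) {
  k <- dat[ind, lapply(.SD, mean), keyby = Species]
  unlist(k[, -1])
}
set.seed(1)
bo_wide <- boot(dt, wide_stat, R = 500, strata = dt$Species)

# Approach 2: a separate bootstrap per (variable, Species) group of the long table
long <- melt(dt, id.vars = "Species", measure.vars = vars)
long_stat <- function(x, i) x[i, mean(value)]
bo_long <- long[, list(b = list(boot(copy(.SD), long_stat, R = 500))),
                keyby = list(variable, Species)]

# Both results are ordered by variable first, then Species
comparison <- data.table(
  variable = bo_long$variable,
  Species  = bo_long$Species,
  se_wide  = apply(bo_wide$t, 2, sd),
  se_long  = sapply(bo_long$b, function(b) sd(b$t))
)
print(comparison)
```

The two columns should be similar but not identical, since they come from different random resamples.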
