Is centering needed when bootstrapping the sample mean?

Question

When reading about how to approximate the distribution of the sample mean I came across the nonparametric bootstrap method. Apparently one can approximate the distribution of $\bar{X}_n-\mu$ by the distribution of $\bar{X}_n^*-\bar{X}_n$, where $\bar{X}_n^*$ denotes the sample mean of the bootstrap sample.

My question then is: Do I need the centering? What for?

Couldn't I just approximate $\mathbb{P}\left(\bar{X}_n \leq x\right)$ by $\mathbb{P}\left(\bar{X}_n^* \leq x\right)$?

I don't see why we need to center anything. All the samples discussed here are of the same size, right?
– Bitwise, Commented Oct 12, 2012 at 17:25
Same size, yes. I don't see the reason for the centering either. Would anybody be able to come up with a mathematical explanation of why we do or do not have to do that? I mean, can we prove that the bootstrap works or does not work if we do not center?
– Christin, Commented Oct 12, 2012 at 18:20
(Btw, a proof that the bootstrap works for the case where we centered can be found in Bickel, P.J. and D.A. Freedman (1981), Some asymptotic theory for the bootstrap.)
– Christin, Commented Oct 12, 2012 at 18:24
Maybe we do the centering to be able to use the Central Limit Theorem, which gives us that $n^{\frac{1}{2}}(\bar{X}_n-\mu)$ converges to the same distribution as $n^{\frac{1}{2}}(\bar{X}_n^*-\bar{X}_n)$, namely to $\mathcal{N}(0,\sigma^2)$. Maybe there are no asymptotics available for the case without centering that tell us whether it works.
– kelu, Commented Oct 14, 2012 at 11:11

Peter Ellis · Accepted Answer · 2013-02-10 00:05:11Z

Yes, you can approximate $\mathbb{P}\left(\bar{X}_n \leq x\right)$ by $\mathbb{P}\left(\bar{X}_n^* \leq x\right)$, but it is not optimal. This is a form of the percentile bootstrap. However, the percentile bootstrap does not perform well if you are seeking to make inferences about the population mean unless you have a large sample size. (It does perform well with many other inference problems, including when the sample size is small.) I take this conclusion from Wilcox's Modern Statistics for the Social and Behavioral Sciences, CRC Press, 2012. A theoretical proof is beyond me, I'm afraid.
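
To make the difference concrete, here is a sketch of the two 95% intervals side by side. The notation $q^*_{\alpha}$ for the $\alpha$-quantile of the simulated bootstrap means $\bar{X}_n^*$ is my own shorthand, not taken from Wilcox. Using the uncentered distribution of $\bar{X}_n^*$ directly gives the percentile interval:

$\left[q^*_{0.025},\ q^*_{0.975}\right]$

Using the centered approximation instead, that is, treating $\bar{X}_n-\mu$ as distributed like $\bar{X}_n^*-\bar{X}_n$, means solving $q^*_{0.025}-\bar{X}_n \leq \bar{X}_n-\mu \leq q^*_{0.975}-\bar{X}_n$ for $\mu$, which gives

$\left[2\bar{X}_n-q^*_{0.975},\ 2\bar{X}_n-q^*_{0.025}\right]$

The two intervals coincide when the bootstrap distribution is symmetric about $\bar{X}_n$ and differ when it is skewed, so the choice only matters in the skewed case.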

A variant on the centering approach goes the next step and scales the centered bootstrap statistic by the re-sample standard deviation and sample size, calculated the same way as a t statistic. The quantiles from the distribution of these t statistics can be used to construct a confidence interval or perform a hypothesis test. This is the bootstrap-t method and it gives superior results when making inferences about the mean.

Let $s^*$ be the standard deviation of a bootstrap re-sample, using $n-1$ as denominator, and let $s$ be the standard deviation of the original sample. Let

$T^*=\frac{\bar{X}_n^*-\bar{X}}{s^*/\sqrt{n}}$

The 97.5th and 2.5th percentiles of the simulated distribution of $T^*$, written $T^*_{0.975}$ and $T^*_{0.025}$, give a 95% confidence interval for $\mu$:

$\left(\bar{X}-T^*_{0.975} \frac{s}{\sqrt{n}},\ \bar{X}-T^*_{0.025} \frac{s}{\sqrt{n}}\right)$
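
The reason the upper percentile appears in the lower limit can be seen from a short rearrangement. This is a sketch that assumes the distribution of $T^*$ is a good stand-in for that of the real t statistic, so that approximately

$\mathbb{P}\left(T^*_{0.025} \leq \frac{\bar{X}-\mu}{s/\sqrt{n}} \leq T^*_{0.975}\right) \approx 0.95$

Multiplying through by $s/\sqrt{n}$, subtracting $\bar{X}$, and multiplying by $-1$ (which flips the inequalities) gives

$\bar{X}-T^*_{0.975} \frac{s}{\sqrt{n}} \leq \mu \leq \bar{X}-T^*_{0.025} \frac{s}{\sqrt{n}}$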

Consider the simulation results below, showing that with a badly skewed mixed distribution the confidence intervals from this method contain the true value more frequently than either the percentile bootstrap method or a traditional inversion of a t statistic with no bootstrapping.

    compare.boots <- function(samp, reps = 599){
        # "samp" is the actual original observed sample
        # "s" is a re-sample for bootstrap purposes
        n <- length(samp)
        boot.t <- numeric(reps)
        boot.p <- numeric(reps)
        for(i in 1:reps){
            s <- sample(samp, replace=TRUE)
            boot.t[i] <- (mean(s)-mean(samp)) / (sd(s)/sqrt(n))
            boot.p[i] <- mean(s)
        }
        conf.t <- mean(samp)-quantile(boot.t, probs=c(0.975,0.025))*sd(samp)/sqrt(n)
        conf.p <- quantile(boot.p, probs=c(0.025, 0.975))
        return(rbind(conf.t, conf.p, "Trad T test"=t.test(samp)$conf.int))
    }

    # Tests below will be for case where sample size is 15
    n <- 15

    # Create a population that is normally distributed
    set.seed(123)
    pop <- rnorm(1000,10,1)
    my.sample <- sample(pop,n)

    # All three methods have similar results when normally distributed
    compare.boots(my.sample)

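This gives the following (conf.t is the bootstrap t method; conf.p is the percentile bootstrap method). The column labels are inherited from the quantile names of conf.t; in every row the first column is the lower limit and the second is the upper limit.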

                   97.5%     2.5%
    conf.t      9.648824 10.98006
    conf.p      9.808311 10.95964
    Trad T test 9.681865 11.01644

With a single example from a skewed distribution:

    # create a population that is a mixture of two normal and one gamma distribution
    set.seed(123)
    pop <- c(rnorm(1000,10,2),rgamma(3000,3,1)*4, rnorm(200,45,7))
    my.sample <- sample(pop,n)
    mean(pop)
    compare.boots(my.sample)

This gives the following. Note that "conf.t" - the bootstrap t version - gives a wider confidence interval than the other two. Basically, it is better at responding to the unusual distribution of the population.

    > mean(pop)
    [1] 13.02341
    > compare.boots(my.sample)
                    97.5%     2.5%
    conf.t      10.432285 29.54331
    conf.p       9.813542 19.67761
    Trad T test  8.312949 20.24093

Finally, here are a thousand simulations to see which version gives confidence intervals that are most often correct:

    # simulation study
    set.seed(123)
    sims <- 1000
    results <- matrix(FALSE, sims,3)
    colnames(results) <- c("Bootstrap T", "Bootstrap percentile", "Trad T test")

    for(i in 1:sims){
        pop <- c(rnorm(1000,10,2),rgamma(3000,3,1)*4, rnorm(200,45,7))
        my.sample <- sample(pop,n)
        mu <- mean(pop)
        x <- compare.boots(my.sample)
        for(j in 1:3){
            results[i,j] <- x[j,1] < mu & x[j,2] > mu
        }
    }

    apply(results,2,sum)

This gives the results below - the numbers are the times out of 1,000 that the confidence interval contains the true value of a simulated population. Notice that the true success rate of every version is considerably less than 95%.

             Bootstrap T Bootstrap percentile          Trad T test 
                     901                  854                  890 
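
As a rough check on how much of these differences could be simulation noise, the counts can be converted into coverage proportions with binomial Monte Carlo standard errors. This is a sketch added for illustration and is not part of the original simulation: the names `hits`, `coverage` and `mc.se` are made up here, and the counts are copied by hand from the output above.

```r
# Counts copied from the simulation output above (out of 1,000 runs)
sims <- 1000
hits <- c("Bootstrap T" = 901, "Bootstrap percentile" = 854, "Trad T test" = 890)

# Estimated coverage and its binomial Monte Carlo standard error
coverage <- hits / sims
mc.se <- sqrt(coverage * (1 - coverage) / sims)

round(cbind(coverage, mc.se), 3)
```

Each standard error comes out at roughly 0.01, so the shortfall from 95% is well outside the noise for all three methods. The three methods were applied to the same samples, so these standard errors are only a rough guide when comparing the methods with one another.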

Thank you, that was very informative. This .pdf (from a lesson) describes a caveat to your conclusion: psychology.mcmaster.ca/bennett/boot09/percentileT.pdf This is a summary of what Bennett says: Many datasets consist of numbers that are >=0 (e.g. counts), in which case the CI should not contain negative values. With the bootstrap-t method this can occur, making the confidence interval implausible. The requirement that the data be >=0 violates the normal distribution assumption. This is not a problem when constructing a percentile bootstrap CI.
– Hannes Ziegler, Commented Aug 10, 2016 at 15:18
