Why n-1 instead of n in pooled sample variance

Question

I am currently self-learning hypothesis testing and am looking at the independent samples t-test, whose test statistic involves the pooled sample variance (https://libguides.library.kent.edu/spss/independentttest), $$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$ where $n_1, n_2$ are the sample sizes of the two samples and $S_1^2, S_2^2$ their respective sample variances. This test assumes that the two population variances are equal, $\sigma_1^2 = \sigma_2^2$.

I understand that the pooled sample variance is computed as a weighted average with weights $w_i = n_i -1$ for $i=1,2$. However, I am unsure why $n_i-1$ is used as a weight instead of $n_i$. I understand that $n-1$ is used instead of $n$ so that the usual sample variance is an unbiased estimator of the variance (Bessel's correction), but I cannot see why it is necessary for the pooled sample variance, since the statistic $$ \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2} $$ is also an unbiased estimator.

Can anyone explain this to me? Thanks.

The factors of $n_i-1$ merely undo the divisions that were made in computing the $S_i^2$ in the first place, thereby producing a sum of squared residuals in the numerator. For this test we don't really care about bias: what matters far more is the distribution of the test statistic. The distribution of your last statistic unfortunately depends on the ratio of sample sizes. — whuber
– whuber ♦, Commented May 25, 2022 at 15:41
In computing $S_p^2,$ you have found $S_i^2, i=1,2,$ each of which requires computing a sample mean $\bar X_i, i = 1,2.$ So, $(n_1+n_2 -2)S_p^2/\sigma^2 \sim\mathsf{Chisq}(\nu=n_1+n_2-2).$ — BruceET
– BruceET, Commented May 25, 2022 at 16:02
Since $\operatorname{Var}S_i^2=...=\frac{2\sigma^4}{n_i-1}$, $S_p^2$ can be seen as an inverse-variance weighted linear combination of $S_1^2$ and $S_2^2$, and it is thus more efficient than $S_a^2$ as an estimator of $\sigma^2$. — Jarle Tufto
– Jarle Tufto, Commented May 26, 2022 at 15:14
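The efficiency remark in the last comment can be checked with exact formulas. The sketch below assumes independent normal samples, for which $\operatorname{Var}S_i^2 = 2\sigma^4/(n_i-1)$. The function names `var_pooled` and `var_averaged` are made up for this illustration, and $S_a^2$ denotes the $n_i$-weighted statistic from the question.

```r
# Exact variances of the two estimators of sigma^2 for normal samples,
# using Var(S_i^2) = 2 * sigma^4 / (n_i - 1) and independence of S_1^2, S_2^2.
# Function names are hypothetical, chosen for this illustration only.
var_pooled <- function(n1, n2, sigma = 1) {
  # weights (n_i - 1) / (n1 + n2 - 2); the sum simplifies to 2 sigma^4 / nu
  2 * sigma^4 / (n1 + n2 - 2)
}
var_averaged <- function(n1, n2, sigma = 1) {
  # weights n_i / (n1 + n2)
  (n1^2 * 2 * sigma^4 / (n1 - 1) + n2^2 * 2 * sigma^4 / (n2 - 1)) / (n1 + n2)^2
}

sizes <- data.frame(n1 = c(2, 5, 3), n2 = c(3, 5, 30))
sizes$pooled   <- var_pooled(sizes$n1, sizes$n2)
sizes$averaged <- var_averaged(sizes$n1, sizes$n2)
print(sizes)
```

The two variances coincide when $n_1 = n_2$, and the pooled one is smaller otherwise, although the gap is modest for the sizes tried here.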

BruceET · Accepted Answer · 2022-05-26 00:06:32Z

For a two-sample t test on samples from populations with the same variance $\sigma^2,$ you have two proposed variance estimates

$$ S_p^2 = \frac{(n_1 - 1)S^2_1+(n_2-1)S_2^2}{n_1+n_2-2},$$

and

$$ S_a^2 = \frac{n_1S^2_1+n_2S^2_2}{n_1+n_2}. $$

For $S_p^2,$ you have found $S_i^2,\ i=1,2,$ each of which requires computing a sample mean $\bar X_i,\ i=1,2,$ and so uses up one degree of freedom. So,

$$ \frac{\nu S_p^2}{\sigma^2} \sim \mathsf{Chisq}(\nu),$$ where $\nu = n_1+n_2 - 2.$
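A short sketch of why this holds, assuming independent normal samples with common variance $\sigma^2$ (the notation $X_{ij}$ for the $j$-th observation in sample $i$ is introduced here only for illustration): the factors $n_i-1$ turn each sample variance back into a sum of squared residuals, and independent chi-squared variables add.

$$ (n_i-1)S_i^2 = \sum_{j=1}^{n_i}\left(X_{ij}-\bar X_i\right)^2, \qquad \frac{(n_i-1)S_i^2}{\sigma^2} \sim \mathsf{Chisq}(n_i-1), \quad i=1,2, $$

$$ \frac{(n_1+n_2-2)\,S_p^2}{\sigma^2} = \frac{(n_1-1)S_1^2}{\sigma^2} + \frac{(n_2-1)S_2^2}{\sigma^2} \sim \mathsf{Chisq}(n_1+n_2-2). $$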

For $S_a^2,$ the distribution theory is not so clear. You say something about $S_a^2$ being unbiased, but that hardly specifies a distribution. Let's use the same degrees of freedom $\nu$ as above for an experiment.
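One way to see the difficulty, under the same normality assumption, is to write the scaled statistic in terms of $Q_i = (n_i-1)S_i^2/\sigma^2 \sim \mathsf{Chisq}(n_i-1)$ (the symbol $Q_i$ is introduced here for illustration):

$$ \frac{(n_1+n_2)\,S_a^2}{\sigma^2} = \frac{n_1}{n_1-1}\,Q_1 + \frac{n_2}{n_2-1}\,Q_2 . $$

The two coefficients are equal only when $n_1 = n_2$, in which case the result is a scaled chi-squared variable. Otherwise it is an unequally weighted sum of independent chi-squared variables, whose distribution depends on the sample sizes and is not chi-squared. For $n_1 = 2,\ n_2 = 3$ it is $2Q_1 + 1.5\,Q_2.$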

Simulation: Begin by looking at $m = 100\,000$ samples x1 of size $n_1 = 2$ from $\mathsf{Norm}(\mu_1 = 100, \sigma_1 = 15)$ and x2 of size $n_2=3$ from $\mathsf{Norm}(\mu_2 = 110, \sigma_2 = 15).$
We find the sample variances, the pooled variance estimate, and the averaged variance estimate. Then we look at the corresponding chi-squared random variables.

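    set.seed(2022)
    n1 = 2;  m = 10^5
    # each row of M1 is one sample of size n1
    M1 = matrix(rnorm(n1*m, 100, 15), nrow=m)
    v1 = apply(M1, 1, var)            # sample variances S_1^2
    n2 = 3
    M2 = matrix(rnorm(n2*m, 110, 15), nrow=m)
    v2 = apply(M2, 1, var)            # sample variances S_2^2
    pool = ((n1-1)*v1 + (n2-1)*v2)/(n1+n2-2)
    q.p = (n1+n2-2)*pool/15^2
    avg.v = (n1*v1 + n2*v2)/(n1+n2)   # S_a^2 with weights n_i, as defined above
    q.a = (n1+n2)*avg.v/15^2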

Then we compare the results with the density functions of the corresponding chi-squared distribution. For the pooled estimate $S_p^2$ we get a good match, but for $S_a^2$ the fit is not good.

R code for graphs:

    par(mfrow=c(1,2))
     hist(q.p, prob=T, ylim=c(0,.35), col="skyblue2", main="Pooled")
      curve(dchisq(x, n1+n2-2), add=T, lwd=2, col="orange")
     hist(q.a, prob=T, ylim=c(0,.35), col="skyblue2", main="Averaged")
      curve(dchisq(x, n1+n2-1), add=T, lwd=2, col="orange")
    par(mfrow=c(1,1))
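For readers who cannot see the histograms, a rough numerical version of the same comparison might look like the sketch below. It reruns the simulation in self-contained form, with `q.a` computed directly from the definition of $S_a^2$ given earlier. Comparing both statistics to $\mathsf{Chisq}(\nu)$ with $\nu = n_1+n_2-2$ is an assumption about the intended reference distribution, and the names `moments`, `ks.p` and `ks.a` are introduced here only for illustration.

```r
# Self-contained re-run of the simulation, followed by a comparison of
# simulated moments and Kolmogorov-Smirnov distances against Chisq(nu).
set.seed(2022)
m <- 10^5; n1 <- 2; n2 <- 3; sigma <- 15
v1 <- apply(matrix(rnorm(n1 * m, 100, sigma), nrow = m), 1, var)
v2 <- apply(matrix(rnorm(n2 * m, 110, sigma), nrow = m), 1, var)

nu  <- n1 + n2 - 2
q.p <- ((n1 - 1) * v1 + (n2 - 1) * v2) / sigma^2   # nu * S_p^2 / sigma^2
q.a <- (n1 * v1 + n2 * v2) / sigma^2               # (n1 + n2) * S_a^2 / sigma^2

# Chisq(nu) has mean nu and variance 2 * nu.
moments <- rbind(pooled   = c(mean = mean(q.p), var = var(q.p)),
                 averaged = c(mean = mean(q.a), var = var(q.a)))
print(round(moments, 2))

# Largest gap between the empirical CDF and the Chisq(nu) CDF.
ks.p <- unname(ks.test(q.p, "pchisq", df = nu)$statistic)
ks.a <- unname(ks.test(q.a, "pchisq", df = nu)$statistic)
print(c(pooled = ks.p, averaged = ks.a))
```

The pooled row should sit close to mean $\nu$ and variance $2\nu$, while the averaged row should not. The same holds for the KS distances, which should be near zero only for the pooled statistic.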

+1 Very nice, clear, full analysis. It leaves one wondering, though: which statistic, if either, leads to a better (more powerful) test? — whuber
– whuber ♦, Commented May 26, 2022 at 14:03
