Clarification on ANOVA mechanism

Question

This website explains ANOVA and F ratio as follows:

"ANOVA partitions the variability among all the values into one component that is due to variability among group means (due to the treatment) and another component that is due to variability within the groups (also called residual variation)... Each sum-of-squares is associated with a certain number of degrees of freedom... and the mean square (MS) is computed by dividing the sum-of-squares by the appropriate number of degrees of freedom... The F ratio is the ratio of two mean square values..."

My questions are:

How can ANOVA know to partition the variability into 2 distinct components (due to the treatment and due to inherent variation)?

Which two mean squares does it refer to ("The F ratio is the ratio of two mean square values")? Are they the mean squares due to treatment and due to inherent variation?

Thanks in advance.

Of possible interest: How to visualize what ANOVA does?, Partitioning sum of squares (and probably many more). — chl
– chl, Commented Oct 18, 2020 at 7:00

BruceET · Accepted Answer · 2020-10-18 17:39:24Z

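Consider the following data simulated in R according to the model for a one-factor ANOVA with three levels of the factor and ten replications at each level. The intent was for each level to have variance $\sigma^2 = 3^2 = 9$; as noted in the comments, the third group was actually generated with standard deviation 4.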

    set.seed(2020)
    x1 = rnorm(10, 20, 3)
    x2 = rnorm(10, 21, 3)
    x3 = rnorm(10, 22, 4)
    x = c(x1,x2,x3)
    gp = as.factor(rep(1:3, each=10))

Here is a stripchart in R showing the ten observations in each group.

stripchart(x ~ gp, pch="|", ylim=c(.5,3.5))

The ANOVA table is given below:

    anova(lm(x~gp))
    Analysis of Variance Table

    Response: x
              Df Sum Sq Mean Sq F value  Pr(>F)  
    gp         2 140.48  70.240   4.463 0.02115 *
    Residuals 27 424.93  15.738                  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

MSA = $15.7382$ (the Residuals mean square in the table) is the average of the sample variances within the three groups. This is one way to estimate $\sigma^2.$ [Never mind that it is not a very good estimate; with only 30 observations altogether, we can't expect a really close estimate.]

    mean(c(var(x1),var(x2),var(x3)))
    [1] 15.7382
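Why a plain average of the three variances is enough here: a minimal sketch, assuming the same simulated data as above. With equal group sizes, the plain average coincides with the pooled variance (each group variance weighted by its degrees of freedom), and that pooled value is what the Residuals line of the ANOVA table reports. The names `n`, `v`, and `pooled` are introduced only for this illustration.

```r
set.seed(2020)
x1 = rnorm(10, 20, 3);  x2 = rnorm(10, 21, 3);  x3 = rnorm(10, 22, 4)
x = c(x1, x2, x3);  gp = as.factor(rep(1:3, each = 10))

n = tapply(x, gp, length)   # group sizes (all 10 here)
v = tapply(x, gp, var)      # sample variance within each group

# Pooled variance: weight each group variance by its degrees of freedom
pooled = sum((n - 1) * v) / sum(n - 1)

pooled                                      # pooled within-group variance
mean(v)                                     # plain average; identical when group sizes are equal
anova(lm(x ~ gp))["Residuals", "Mean Sq"]   # Residuals mean square from the ANOVA table
```

With unequal group sizes the plain average and the pooled variance would generally differ, and the pooled form is the one that matches the table.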

If all three groups had the same mean $\mu$ (the assumption of the null hypothesis), then the three group means $(\bar X_1,\bar X_2, \bar X_3)$ would each have a normal distribution with mean $\mu$ and variance $\sigma^2/10.$ So, if $H_0$ were true, we could also estimate $\sigma^2$ as $10$ times the variance of the 'sample' of three $\bar X_i$s:

    10*var(c(mean(x1),mean(x2),mean(x3)))
    [1] 70.23971

Thus MS(Group) = $70.2397.$ [Because $H_0$ is false, this estimate is much too large; the three means also express the differences among groups.]

So ANOVA "knows" how to get the two variance estimates through the two procedures we have just seen: one uses the variability within the groups, the other uses the variability among the group means.
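The "partition" in the quoted passage can be checked directly. The following is a sketch, assuming the same simulated data as above: the total sum of squares about the grand mean splits exactly into a between-group part and a within-group part, and these two parts should agree with the Sum Sq column of the ANOVA table. The names `grand.mean`, `gp.means`, and `ss.*` are introduced only for this illustration.

```r
set.seed(2020)
x1 = rnorm(10, 20, 3);  x2 = rnorm(10, 21, 3);  x3 = rnorm(10, 22, 4)
x = c(x1, x2, x3);  gp = as.factor(rep(1:3, each = 10))

grand.mean = mean(x)
gp.means   = ave(x, gp)   # for each observation, the mean of its own group

ss.total   = sum((x - grand.mean)^2)          # all variability
ss.between = sum((gp.means - grand.mean)^2)   # variability among group means
ss.within  = sum((x - gp.means)^2)            # variability within groups

c(total = ss.total, between = ss.between, within = ss.within)
all.equal(ss.total, ss.between + ss.within)   # the partition is exact

# Dividing by the degrees of freedom gives the two mean squares
c(MS.group = ss.between / 2, MS.resid = ss.within / 27)
```

The split is an algebraic identity, so it holds for any data; only the interpretation of the two pieces depends on the model.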

If $H_0$ is true the two variance estimates tend to be about the same so that the F-ratio would tend to be about $1.$ The larger the F-ratio is above $1,$ the stronger the evidence against $H_0.$ In our case $F = 4.463 > 1.$ Taking numerator and denominator degrees of freedom into account, $4.463$ is judged to be "significantly" larger than $1.$
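One way to make "significantly larger than 1" concrete is to compare the observed ratio with a critical value. This is a small sketch that takes the two mean squares from the ANOVA table as given and assumes the conventional 5% level; `ms.group`, `ms.resid`, `f.ratio`, and `crit` are names chosen here for illustration.

```r
ms.group = 70.240   # Mean Sq for gp, copied from the ANOVA table
ms.resid = 15.738   # Mean Sq for Residuals, copied from the ANOVA table

f.ratio = ms.group / ms.resid   # the F value in the table, about 4.463
crit    = qf(0.95, 2, 27)       # 95th percentile of F(2, 27)

f.ratio
crit
f.ratio > crit   # TRUE means H0 is rejected at the 5% level
```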

The variance estimate in the numerator of $F$ involves both $\sigma^2$ and the differences among the group population means $\mu_i.$ The variance estimate in the denominator involves only $\sigma^2.$
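In symbols, as a sketch under the usual textbook assumptions (a balanced design with $k$ groups, $n$ replications per group, and a common variance $\sigma^2$; here $k = 3$ and $n = 10$), the expected mean squares are

$$E[\text{MS(Group)}] = \sigma^2 + \frac{n}{k-1}\sum_{i=1}^{k}(\mu_i - \bar\mu)^2, \qquad E[\text{MSA}] = \sigma^2,$$

where $\bar\mu = \frac{1}{k}\sum_{i=1}^{k}\mu_i$ is notation introduced here for the average of the group population means. When all $\mu_i$ are equal the extra term vanishes and both mean squares estimate $\sigma^2,$ which is why the ratio tends to be near $1$ under $H_0.$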

Here is a plot of the density function of the distribution $\mathsf{F}(2, 27).$ The (tiny) area under the density curve to the right of the vertical dotted line is the P-value $0.02115.$

    curve(df(x, 2, 27), 0, 10, lwd=2, ylab="PDF", xlab="F", main="Density of F(2,27)")
     abline(v = 4.463, col="red", lwd=2, lty="dotted")
     abline(h=0, col="green2"); abline(v=0, col="green2")
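The "area under the density curve" reading of the P-value can be checked numerically. This is a sketch that takes the observed F ratio from the ANOVA table; `f.obs`, `area`, and `p.val` are names chosen here for illustration.

```r
f.obs = 4.463   # observed F ratio, copied from the ANOVA table

# Area under the F(2, 27) density to the right of f.obs, by numerical integration
area = integrate(function(t) df(t, 2, 27), lower = f.obs, upper = Inf)$value

# The same upper-tail probability from the distribution function
p.val = pf(f.obs, 2, 27, lower.tail = FALSE)

area
p.val
```

Both values should agree with the Pr(>F) entry in the table, up to rounding.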

Fantastic answer! At first glance, one would assume the three groups x1, x2 and x3 came from the same population (given their similar means and variances), but then ANOVA rejected that assumption! Would you mind clarifying what the two numbers (2, 27) mean? Thanks. — Nemo
– Nemo, Commented Oct 18, 2020 at 11:41
2 is numerator DF for F-statistic (3 gps - 1 = 2); 27 is denominator DF (3(10 reps - 1)=27). // The simple formulas I illustrated for MSA and MS(Gp) are for a balanced design. Slightly messier if different numbers of replications in 3 groups. // Added stripchart to show exact values of the ten replications in each group. — BruceET
– BruceET, Commented Oct 18, 2020 at 17:40
Great answer (+1), just wondering if x3 = rnorm(10, 22, 4) should have a standard deviation of 3 rather than 4 ? — Robert Long
– Robert Long, Commented Oct 18, 2020 at 17:43
Typo. Thanks for spotting that. If I correct it, changes would propagate as follows: MS(Gp) = 51.5, MSA = 14.5, F=3.55, P=0.043. May change it all later, but it may cause extra confusion to make changes immediately. // As it stands: an unintended demo of sensitivity of ANOVA to heteroscedasticity. — BruceET
– BruceET, Commented Oct 18, 2020 at 18:08
No worries, you're welcome. I feel your pain. This happens to me all the time. I use simulation in many of my answers and regularly find a typo that changes small details (and sometimes large details) in what follows :/ LOL to your last sentence :D — Robert Long
– Robert Long, Commented Oct 18, 2020 at 18:31
