Reproducibility of statistical test results
This is a short, simple exercise to assess the reproducibility of decisions based on statistical testing.
Consider a null hypothesis H0 with a set of alternative hypotheses containing H1 and H2. Set up the statistical hypothesis test procedure at a significance level of 0.05 so that it has a power of 0.8 if H1 is true. Further assume that the power for H2 is 0.5. To assess the reproducibility of the test result, consider the experiment of executing the test procedure twice, with the two runs independent of each other. Starting with the situation where H0 is true, the probabilities for the outcomes of the joint experiment are displayed in Table 1. The probability of not being able to reproduce decisions, i.e. of the two runs reaching different decisions, is 0.0475 + 0.0475 = 0.095.
Table 1. Frequencies, if H0 is true
\begin{array} {|r|r|r|} \hline \text{Frequency of decision} & \text{Reject H0} & \text{Retain H0} \\ \hline \text{Reject H0} & 0.0025 & 0.0475 \\ \hline \text{Retain H0} & 0.0475 & 0.9025 \\ \hline \end{array}
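The entries of such a table follow from multiplying the per-run probabilities, provided the two runs are independent. A minimal sketch of that arithmetic is given here; the function names `joint_decision_table` and `discordance` are made up for illustration, and assigning rows to the first run and columns to the second is an assumption (the table is symmetric, so the choice does not affect the numbers).

```python
def joint_decision_table(p_reject):
    """2x2 joint probabilities for two independent runs of the same test.

    Rows: decision in the first run; columns: decision in the second run,
    each in the order (reject H0, retain H0).
    """
    p_retain = 1.0 - p_reject
    return [
        [p_reject * p_reject, p_reject * p_retain],
        [p_retain * p_reject, p_retain * p_retain],
    ]


def discordance(p_reject):
    """Probability that the two runs reach different decisions."""
    table = joint_decision_table(p_reject)
    return table[0][1] + table[1][0]


# Rejection probabilities taken from the text:
# the significance level under H0, the power under H1 and under H2.
scenarios = {"H0": 0.05, "H1": 0.8, "H2": 0.5}
for name, p in scenarios.items():
    table = [[round(x, 4) for x in row] for row in joint_decision_table(p)]
    print(name, table, round(discordance(p), 4))
```

Calling `joint_decision_table` with the rejection probability of a given scenario gives the corresponding table, and `discordance` adds up its two off-diagonal cells.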
The frequencies change as the true state of nature changes. Assuming H1 is true, H0 is rejected, as designed, with a probability (power) of 0.8. The resulting frequencies for the different outcomes of the joint experiment are displayed in Table 2. The probability of not being able to reproduce decisions is 0.16 + 0.16 = 0.32.
Table 2. Frequencies, if H1 is true
\begin{array} {|r|r|r|} \hline \text{Frequency of decision} & \text{Reject H0} & \text{Retain H0} \\ \hline \text{Reject H0} & 0.64 & 0.16 \\ \hline \text{Retain H0} & 0.16 & 0.04 \\ \hline \end{array}
Assuming H2 is true, H0 will be rejected with a probability of 0.5. The resulting frequencies for the different outcomes of the joint experiment are displayed in Table 3. The probability of not being able to reproduce decisions is 0.25 + 0.25 = 0.5.
Table 3. Frequencies, if H2 is true
\begin{array} {|r|r|r|} \hline \text{Frequency of decision} & \text{Reject H0} & \text{Retain H0} \\ \hline \text{Reject H0} & 0.25 & 0.25 \\ \hline \text{Retain H0} & 0.25 & 0.25 \\ \hline \end{array}
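The tabulated frequencies can also be checked empirically. The sketch below is only one possible setup, and none of it is prescribed by the text: it assumes a one-sided z-test whose statistic is normally distributed with unit variance, and it picks the mean shift of that statistic so that the rejection probability equals 0.05, 0.8 or 0.5. The names `shift_for_power` and `simulate_discordance`, the seed and the number of simulated pairs are all illustrative.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
# One-sided critical value of the assumed z-test (about 1.645).
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA)


def shift_for_power(power):
    """Mean shift of the z statistic that yields the requested rejection probability."""
    return Z_CRIT + NormalDist().inv_cdf(power)


def simulate_discordance(shift, n_pairs=200_000, seed=1):
    """Fraction of simulated pairs of runs in which exactly one run rejects H0."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_pairs):
        first = rng.gauss(shift, 1.0) > Z_CRIT
        second = rng.gauss(shift, 1.0) > Z_CRIT
        discordant += first != second
    return discordant / n_pairs


# Under H0 the rejection probability is the significance level itself (shift 0).
for label, power in [("H0", ALPHA), ("H1", 0.8), ("H2", 0.5)]:
    print(label, round(simulate_discordance(shift_for_power(power)), 3))
```

With enough simulated pairs, the printed fractions should land close to the sums of the off-diagonal cells in Tables 1 to 3.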
The test procedure was designed to control type I errors (rejecting the null hypothesis even though it is true) at a probability of 0.05 and to limit type II errors (not rejecting the null hypothesis even though it is false and H1 is true) to a probability of 0.2. In both cases, with either H0 or H1 assumed to be true, this leads to non-negligible frequencies, 0.095 and 0.32, respectively, of "non-reproducible", "contradictory" decisions if the same experiment is executed twice. The situation gets worse, with a frequency of up to 0.5 for "non-reproducible", "contradictory" decisions, if the true state of nature lies between the null hypothesis and the alternative hypothesis used to design the experiment.
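The three frequencies are instances of a single expression. Writing p for the probability that one run rejects H0 and d(p) for the probability of contradictory decisions (both symbols are introduced here for illustration), and assuming the two runs are independent:

\begin{align} d(p) &= p(1-p) + (1-p)p = 2p(1-p), \\ d(0.05) &= 0.095, \quad d(0.8) = 0.32, \quad d(0.5) = 0.5, \\ d'(p) &= 2 - 4p = 0 \iff p = 0.5 . \end{align}

Since d is a downward-opening parabola in p, the stationary point at p = 0.5 is its maximum, which is why 0.5 is the worst case.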
The situation can also get better: if type I errors are controlled more strictly, or if the true state of nature is far away from the null, which results in a power to reject the null that is close to 1.
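To get a feeling for the size of the improvement, the frequency of contradictory decisions can be evaluated for a few stricter significance levels and higher powers. This is only a sketch: the grid values other than 0.05 and 0.8 are arbitrary illustrative choices, not taken from the exercise, and the two runs are again assumed to be independent.

```python
def discordance(p_reject):
    """Probability that two independent runs disagree, given the per-run rejection probability."""
    return 2.0 * p_reject * (1.0 - p_reject)


# Illustrative grids; only 0.05 and 0.8 appear in the exercise itself.
alphas = [0.05, 0.01, 0.001]  # H0 true: rejection probability equals alpha
powers = [0.8, 0.9, 0.99]     # alternative true: rejection probability equals the power

under_null = {alpha: discordance(alpha) for alpha in alphas}
under_alternative = {power: discordance(power) for power in powers}

for alpha, d in under_null.items():
    print(f"H0 true, alpha = {alpha}: {d:.4f}")
for power, d in under_alternative.items():
    print(f"alternative true, power = {power}: {d:.4f}")
```

Both directions move the rejection probability away from 0.5, towards 0 under the null and towards 1 under the alternative, and that is what shrinks the off-diagonal cells.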
Thus, if you want more reproducible decisions, use a stricter (smaller) significance level and increase the power of your tests. Not very astonishing ...