
I would like to know if there is a difference between the pH values of samples measured using two different methods (group 2 vs 1). Since the groups are paired and the values in each group are not normally distributed, I used a Wilcoxon signed-rank test.

```r
library(dplyr)
library(rstatix)

# test for normality
dat0_normality <- dat0 |>
  group_by(group) |>
  shapiro_test(ph)
```


```r
# medians and CI5-95
dat0_medians <- dat0 |>
  group_by(group) |>
  summarise(
    nb     = n(),
    median = round(quantile(ph, .50, na.rm = TRUE), 4),
    ci5    = round(quantile(ph, .05, na.rm = TRUE), 4),
    ci95   = round(quantile(ph, .95, na.rm = TRUE), 4)
  )
```


```r
# Wilcoxon signed-rank test
dat0_wilcox <- dat0 |>
  pairwise_wilcox_test(ph ~ group, paired = TRUE, detailed = TRUE, exact = TRUE) |>
  adjust_pvalue(method = "bonferroni") |>
  add_significance()
```


Here is my interpretation, perhaps inaccurate (I am not a statistician):
The absolute median of differences (AMD) between group 2 and group 1 is +0.008458209 (the estimate), i.e. a relative median of differences (RMD) of +0.116% (100 × 0.008458209 / 7.2800), with a CI5-95 of +0.08% to +0.15%.
Given the low adjusted p-value of 1.26e-6, these AMD and RMD may be considered statistically significant.

However, from a clinical and analytical perspective, these AMD and RMD are negligible (the analytical acceptability threshold is ±0.55%, nearly five times wider than the RMD of 0.116%).
Confused by this discrepancy between the highly significant p-value and the clinically/analytically negligible difference, I have two questions:

  1. Is the Wilcoxon signed-rank test appropriate for the objective?
  2. How can we explain the significance of the test given such close medians?

Boxplot for illustration:

Note: The effect size is large, but does this explain the low p-value?

```r
library(coin)
library(libcoin)

# effect size
dat0_effsize <- dat0 |>
  wilcox_effsize(ph ~ group, paired = TRUE)
```


Data:

```r
dat0 <- structure(list(
  group = structure(rep(c(1L, 2L), 60), levels = c("1", "2"), class = "factor"),
  ph = c(7.343, 7.362, 7.323, 7.336, 7.411, 7.418, 7.318, 7.304, 7.216, 7.238,
         7.271, 7.277, 7.265, 7.281, 7.282, 7.297, 7.28, 7.294, 7.221, 7.23,
         7.255, 7.262, 7.207, 7.211, 7.421, 7.437, 7.265, 7.28, 7.315, 7.329,
         7.332, 7.352, 7.322, 7.319, 7.33, 7.318, 7.338, 7.343, 7.395, 7.401,
         7.366, 7.374, 7.247, 7.255, 7.361, 7.354, 7.39, 7.391, 7.392, 7.406,
         7.289, 7.293, 7.338, 7.344, 7.383, 7.386, 7.38, 7.399, 7.241, 7.265,
         7.257, 7.258, 7.014, 6.999, 7.41, 7.317, 7.28, 7.277, 6.955, 6.962,
         7.253, 7.256, 7.163, 7.174, 7.284, 7.29, 7.221, 7.247, 7.109, 7.128,
         7.444, 7.447, 7.412, 7.425, 7.376, 7.375, 7.189, 7.194, 7.23, 7.244,
         7.21, 7.213, 7.14, 7.149, 7.312, 7.322, 7.277, 7.285, 7.352, 7.359,
         7.32, 7.327, 7.22, 7.246, 7.219, 7.24, 7.125, 7.134, 7.021, 7.01,
         7.33, 7.337, 7.087, 7.094, 6.917, 6.927, 6.932, 6.942, 6.954, 6.987)),
  row.names = c(NA, -120L), class = c("tbl_df", "tbl", "data.frame"))
```

Thanks for your help.


1 Answer

A first reason why you observed what you did (a very small actual difference between the medians, yet a very significant p-value from a Wilcoxon signed-rank test (WSRt)) is that, contrary to what is (sadly) too often read, taught, and found in various sources, the WSRt is not a test of the medians.

In its most general form, the WSRt is a test of the pseudomedian (aka Hodges-Lehmann estimator) of the paired differences. As such, it has no distributional assumptions (no assumption on the distribution of the parent population, its shape, etc.). The pseudomedian is a measure of "location", but it is not particularly intuitive, has no clear/systematic relationship to the mean or median, and will probably not be understood by your audience. If, and only if, the distribution of the differences is symmetric does the WSRt become a test of the median (and of course of the mean as well). Note that the WSRt is not well behaved at all (as a test of the median) if the symmetry assumption is violated; indeed, one can actually use the WSRt as a test of symmetry: test the sample against its observed median.
You should have gotten a hint of this: the WSRt told you that the estimate of the median difference was 0.008458, while the sample median of the differences is really 0.007500. It gave you the pseudomedian of the paired differences, not their median...
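To see the distinction concretely, here is a quick check in base R (using the `dat0` object from the question): compare the plain median of the paired differences with the median of all Walsh averages (d_i + d_j)/2, which is the Hodges-Lehmann pseudomedian that `wilcox.test` reports as its estimate.

```r
# Paired differences, group 2 minus group 1 (dat0 as posted in the question)
d <- dat0$ph[dat0$group == "2"] - dat0$ph[dat0$group == "1"]

# Plain median of the differences: ~0.0075
median(d)

# Pseudomedian: median of all Walsh averages (d_i + d_j)/2, i <= j
# -> close to the 0.008458 estimate the WSRt reported
walsh <- outer(d, d, "+") / 2
median(walsh[upper.tri(walsh, diag = TRUE)])

# wilcox.test's confidence-interval machinery returns the same kind of estimate
suppressWarnings(wilcox.test(d, conf.int = TRUE))$estimate
```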

However, the story goes deeper. If we do what we should have done first, and look at (i.e. plot) our data, this is what we would have seen:

[Histogram of the paired differences]

The paired differences are not symmetric (skewness = 4, kurtosis = 24), but more importantly, the vast majority of the values are negative (I computed Group1 - Group2). And yes, there is one straggler, for which I would seriously investigate a transcription error (714 instead of 741? ...). Hence, the median of the paired differences is certainly not 0.
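The plot and the shape statistics can be reproduced along these lines (a sketch; the exact skewness/kurtosis values depend on which estimator your package uses):

```r
# Paired differences, Group1 - Group2, as used for the histogram above
d <- dat0$ph[dat0$group == "1"] - dat0$ph[dat0$group == "2"]
hist(d, breaks = 20, main = "Paired differences (Group1 - Group2)")

# Moment-based skewness and kurtosis, computed by hand
z <- (d - mean(d)) / sd(d)
mean(z^3)  # skewness: strongly positive, driven largely by the one straggler
mean(z^4)  # kurtosis: far above the Gaussian value of 3
```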
Indeed, if I run a "real" test of the median (i.e. a sign test, which truly tests the median, but is considered to have low power), we also observe a very significant p-value (< 1e-06), because, out of 60 differences, 51 were negative.
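That sign test is just a binomial test on the signs of the differences; a minimal base-R version (assuming `dat0` from the question):

```r
# Sign test: under H0 (median difference = 0), the number of negative
# differences is Binomial(n, 0.5)
d <- dat0$ph[dat0$group == "1"] - dat0$ph[dat0$group == "2"]
d <- d[d != 0]                              # drop exact ties, if any
binom.test(sum(d < 0), length(d), p = 0.5)  # 51 of 60 negative, p < 1e-6
```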
What is happening is that method 1 produces systematically lower pH values than method 2.

Now, you also ask why, in spite of the differences being small, and well below the threshold for "analytical importance" (I use importance instead of significance, to avoid confusion), the test result is statistically significant. This (and its opposite) is in fact quite common: if your test is overpowered (for the analytically important difference), then you are very likely to get "statistically significant" results which are not "important". And conversely, if your test is underpowered, you are likely to observe an "analytically important" difference which is not "statistically significant".
With a sample size of 60 (per group), and a paired test, you actually have quite a lot of power (for a WSRt, you had roughly 100% power for a relative difference of 0.55%, and 50% power for the effect you observed...).
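Those power figures can be sanity-checked by simulation; this is only a rough sketch (resampling the observed differences, recentred and then shifted by a hypothetical true effect, counting rejections at alpha = 0.05):

```r
set.seed(1)
d <- dat0$ph[dat0$group == "2"] - dat0$ph[dat0$group == "1"]

# Approximate WSRt power under a pure location shift of the observed
# difference distribution (bootstrap resampling)
sim_power <- function(shift, B = 1000) {
  mean(replicate(B, {
    dd <- sample(d - median(d), replace = TRUE) + shift
    suppressWarnings(wilcox.test(dd)$p.value) < 0.05
  }))
}

sim_power(0.0055 * 7.28)  # at the 0.55% analytical threshold: essentially 1
sim_power(0.0075)         # at the observed median difference
```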

Now, given your sample, why don't we run a paired t-test? With a sample size of 60, the CLT is your best friend, and your marginal distribution does not look that bad (if it is symmetric enough for a WSRt, it is certainly normal enough for the CLT to converge, given a sample size of 60). What do we get? A very small difference (0.00685), but a significant test (p = 0.002). Same story...
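That paired t-test, in base R (recent R versions expect the two paired vectors rather than a formula for paired tests):

```r
# Paired t-test on the same data
x1 <- dat0$ph[dat0$group == "1"]
x2 <- dat0$ph[dat0$group == "2"]
t.test(x1, x2, paired = TRUE)  # mean difference ~ -0.00685, small p-value
```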

So: you picked an inappropriate test to test medians (the WSRt does not test the median), you were overpowered for the "analytically important" difference, and you have a systematic bias between the 2 methods.
Personally, I feel the last one (systematic bias) is the critical issue (without it, the median difference would have been closer to 0). I would definitely investigate why that is.
I also noted that you tested only around pH=7. This may make sense for your use of the methods, but I would expand the range a bit to have a better understanding of the behavior of the 2 methods...

  • A brilliant and very clear explanation, thank you very much. The differences are mostly positive (higher values for group 2), with 51 pairs having a positive difference and 9 pairs having a negative difference. If I exclude the pair with the extreme negative difference (group 1: ph=7.410 and group 2: ph=7.317; difference = -0.093), the estimate is logically slightly higher (0.008541) and the p-value even lower (2.04e-07). Therefore, should we deduce that the WSRt is not sensitive to outliers? (Commented Aug 19 at 7:28)
  • On the other hand, the paired t-test you suggest gives an even lower p-value when keeping this outlier difference (p=6.67e-09), but a much higher one when excluding it (p=0.0018). Would the best approach therefore be to use a paired t-test after excluding outliers? Finally, since pH is physiologically regulated within fairly narrow ranges (normal values 7.35 to 7.45, critical if decompensated at <7.25 or >7.55), it is impossible to extend the selection of values more broadly. (Commented Aug 19 at 7:28)
  • +1 very good answer. Also @denis: I'm not so sure about this: "So you picked an inappropriate test (WSRt does not test the median)". It is true that the WSRt doesn't test the median; however, I wonder whether it's really the median that is most important here, or whether what is of real interest is that there is an overall tendency of negativity (or generally non-equality in distributions with systematic differences in one direction), for which the WSRt is just fine. (Commented Aug 19 at 10:36)
  • @denis Note that trying out several tests until one gives you what you want to see will invalidate results. Technically, you've got to pick your test in advance, and not only because another one on the same data gave you a result different from what you expect. Of course there is an argument to use another test if the first one was grossly wrong from a statistical perspective, but I'm not sure whether this is the case here; see earlier comment. (Commented Aug 19 at 10:39)
  • @denis Deletion of outliers is rarely a good idea, unless you have a strong indication (from other sources than the observed data values alone) that they're indeed erroneous. (Same issue as with tests picked dependently on the data, really. Standard theory of tests assumes you don't manipulate the data based on what you observe, so it becomes invalid if you do so.) (Commented Aug 19 at 10:40)
