
I have created two machine learning models and want to run significance tests on the results of various metrics (sensitivity, specificity, Cohen's kappa etc.) to see if there's any statistically significant differences in the results.

My dataset has 500 cases. On each run, a random sampling function selects 400 cases to train each model, and both models are then tested on the remaining 100 cases. Because the two models are tested on the same 100 cases, a paired comparison is needed. This process is repeated 1000 times.

  1. I now have 1000 values for sensitivity etc. for each of the two models - which test should I use to compare them to obtain a p-value?

  2. I've read papers which use Wilcoxon Signed-Rank tests - could I do this?

  3. Does the fact that the random train-test split occurs 1000 times necessitate a statistical correction to be carried out?
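To make question 2 concrete, a Wilcoxon signed-rank test on the paired per-repetition values can be run as sketched below. The arrays and effect sizes are made up for illustration (real values would come from the 1000 repetitions), and `scipy` is assumed to be available. Note that because the 1000 train-test splits draw from the same 500 cases, the repetitions are not independent samples, which relates to question 3.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-repetition sensitivities for the two models over the
# 1000 random train-test splits (placeholders for the real results).
sens_a = rng.normal(0.80, 0.05, size=1000)
sens_b = sens_a + rng.normal(0.02, 0.03, size=1000)  # model B slightly better

# Wilcoxon signed-rank test on the paired values.
stat, p_value = wilcoxon(sens_a, sens_b)
print(f"W = {stat:.1f}, two-sided p = {p_value:.3g}")
```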

Comments:

  • Why use a test? Use 95% confidence intervals and check whether they overlap. This also lets you see directly whether a significant difference is relevant at all (only if the confidence intervals are far enough apart). (Commented May 26 at 17:47)
  • Seems like a duplicate of this, but my vote would be binding. (Commented Oct 6 at 1:29)

1 Answer


Perhaps you might find the answer in this widely cited paper. A paired t-test doesn't seem appropriate in this case; instead, a 5x2cv paired t-test is suggested.
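For reference, the 5x2cv paired t-test can be sketched as below. The test statistic is the first-fold difference of the first replication divided by the square root of the mean per-replication variance estimate, compared against a t-distribution with 5 degrees of freedom. The fold differences used here are hypothetical numbers for illustration.

```python
import numpy as np
from scipy.stats import t as t_dist

def five_by_two_cv_t(diffs):
    """5x2cv paired t-test.

    diffs: array of shape (5, 2) -- per-fold differences in the chosen
    metric (model A minus model B) from 5 replications of 2-fold CV.
    Returns the t statistic and a two-sided p-value (5 degrees of freedom).
    """
    diffs = np.asarray(diffs, dtype=float)
    fold_means = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - fold_means) ** 2).sum(axis=1)  # per-replication variance
    t_stat = diffs[0, 0] / np.sqrt(s2.mean())
    p = 2 * t_dist.sf(abs(t_stat), df=5)
    return t_stat, p

# Hypothetical per-fold metric differences from 5 replications of 2-fold CV.
d = [[0.02, 0.01], [0.03, 0.00], [0.01, 0.02], [0.02, 0.03], [0.00, 0.01]]
t_stat, p = five_by_two_cv_t(d)
print(f"t = {t_stat:.3f}, p = {p:.3f}")
```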

EDIT: I read further on the subject, and it seems there are two other possible options:

I don't know how the validity of a paired permutation test compares to that of the usual paired t-test. In particular, I'm still not sure whether the exchangeability assumption of the permutation test is violated under Monte Carlo cross-validation.
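A sign-flip version of the paired permutation test can be sketched as follows. Under the null hypothesis the distribution of each paired difference is symmetric about zero, so signs are exchangeable; the doubt raised above is whether this exchangeability holds across Monte Carlo cross-validation repetitions that share training data. The differences below are hypothetical.

```python
import numpy as np

def paired_permutation_test(diffs, n_perm=10_000, seed=0):
    """Sign-flip permutation test on paired metric differences.

    Randomly flips the sign of each difference and compares the absolute
    mean of each permuted sample to the observed absolute mean; the
    p-value is the fraction at least as extreme, with the +1 correction
    so the estimate can never be exactly zero.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    return (1 + np.sum(perm_means >= observed)) / (n_perm + 1)

# Hypothetical paired differences in some metric across repetitions.
rng = np.random.default_rng(1)
d = rng.normal(0.02, 0.02, size=200)
p = paired_permutation_test(d)
print(f"permutation p = {p:.4g}")
```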

