
I have two binary classifiers and would like to check whether there is a statistically significant difference between their areas under the ROC curve (AUROC). I have reason to use AUROC as my evaluation metric.

For each classifier, I have 15 runs, as I do 5-fold cross-validation with 3 random seeds for initialisation. For evaluation, I use unseen/independent test data. This means I have 15 paired AUROC values for the two classifiers.

According to this article in Nature, the DeLong test is (often) used for significance testing with AUROCs. However, as it depends on the variance and covariance of the underlying predictions, I suspect that I cannot apply the DeLong test directly to these 15 AUROC values. To use the DeLong test, I would have to concatenate all predictions on the test data across the 15 versions of each classifier. Would this be correct?
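To make the pooling idea concrete, here is a rough, untested sketch of what I have in mind (the array names are placeholders for my actual predictions, and whether pooling the 15 versions like this is statistically sound is exactly what I am unsure about):

```python
import numpy as np
from scipy.stats import norm

def delong_test(y_true, scores_1, scores_2):
    """Two-sided DeLong test for the difference of two correlated AUROCs."""
    y_true = np.asarray(y_true)
    scores_1, scores_2 = np.asarray(scores_1), np.asarray(scores_2)
    pos_1, neg_1 = scores_1[y_true == 1], scores_1[y_true == 0]
    pos_2, neg_2 = scores_2[y_true == 1], scores_2[y_true == 0]
    m, n = len(pos_1), len(neg_1)

    def components(pos, neg):
        # psi = 1 if the positive is scored higher, 0.5 on ties, 0 otherwise
        # (builds an m x n matrix, so this is only for moderate test sizes)
        psi = (pos[:, None] > neg[None, :]).astype(float) \
            + 0.5 * (pos[:, None] == neg[None, :])
        return psi.mean(axis=1), psi.mean(axis=0), psi.mean()  # V10, V01, AUC

    v10_1, v01_1, auc_1 = components(pos_1, neg_1)
    v10_2, v01_2, auc_2 = components(pos_2, neg_2)

    s10 = np.cov(np.vstack([v10_1, v10_2]))  # 2x2 covariance over positives
    s01 = np.cov(np.vstack([v01_1, v01_2]))  # 2x2 covariance over negatives
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n

    z = (auc_1 - auc_2) / np.sqrt(var)
    return auc_1, auc_2, z, 2 * norm.sf(abs(z))

# The pooling step I describe above: concatenate the 15 versions' predictions
# on the same independent test set, repeating the labels to keep rows aligned.
# y_pooled = np.concatenate([y_test] * 15)
# s_a = np.concatenate(preds_model_a)   # list of 15 score arrays, classifier A
# s_b = np.concatenate(preds_model_b)   # list of 15 score arrays, classifier B
# auc_a, auc_b, z, p = delong_test(y_pooled, s_a, s_b)
```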

Would it be a good idea to use a paired t-test on these 15 AUROC pairs (assuming the differences between the paired values are normally distributed)?
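For clarity, this is the paired t-test option I mean, as a small sketch with stand-in numbers in place of my real per-run AUROC values (paired by fold and seed):

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel

rng = np.random.default_rng(0)
aucs_a = rng.uniform(0.80, 0.90, size=15)   # stand-in: classifier A, 5 folds x 3 seeds
aucs_b = rng.uniform(0.78, 0.88, size=15)   # stand-in: classifier B, same fold/seed order

diffs = aucs_a - aucs_b
print(shapiro(diffs))             # rough check of the normality assumption
print(ttest_rel(aucs_a, aucs_b))  # paired t-test on the 15 AUROC pairs
```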

Are there any arguments favouring either the DeLong test or the paired t-test?

  • Welcome to Cross Validated! Please see this related, even if not quite duplicate, post. – Commented Jul 7 at 16:12
  • stats.stackexchange.com/q/358101/22311 – Commented Jul 8 at 10:51

1 Answer


Edit: So you clarified three important details about your situation:

  1. You are doing grouped k-fold cross-validation, and you are already at the maximal number of groups. 1 group = 1 patient with several samples, and you understandably don't want to train and test on data from the same patient at the same time.
  2. Your data is a time series (per patient/group).
  3. There is a computational limit, so upping the number of folds or seeds by orders of magnitude would be infeasible.

In this case, there are two methods you could try:

  1. The direct simulation approach (from my original answer below) won't cut it, but 15 data points is still tiny. You can try to increase the number of folds (and therefore the number of data points) somewhat by applying the standard cross-validation trick for time series data:

    1. Split the time series at a given time point near its end, then use the first (more-than-)half as training data and the remaining portion as the hold-out set.
    2. Split up both parts into $k'$ further sub-slices, and act as if they were cross-validation folds themselves. Assuming your training/testing time scales linearly with the number of samples, this shouldn't take significantly longer than training and testing took on the original, single "big" fold. However, it will get you more data points to work with.
    3. (There are other approaches to cross-validation on time series data, too.)
    4. Actually, thinking about this: since you train on some patients and evaluate on a held-out patient, you can still do something similar, but instead of doing a proper cross-validation train-test split within the test patient's time series, you can simply split the test patient's data into smaller windows and treat each window as a distinct test set. This keeps the training and test patients completely separate while still giving you more data points to work with (a sketch of this follows right after this list).
  2. Once you have your data thus augmented (and especially if you decide not to perform this augmentation step), I would still strongly advise against running a t-test. The t-test is particularly sensitive to skew in the distribution, and ROC AUC values are (hopefully) expected to be heavily skewed towards 1 if your classifier is any good. Therefore, I would expect a t-test to have a massively inflated type I error rate. You could look into a non-parametric test instead, such as the Wilcoxon signed-rank test (for paired values, as in your setup) or the Mann-Whitney U test (for independent samples); see the second sketch after this list.
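Here is a minimal sketch of the windowing idea from point 1.4 (hypothetical names; it just splits one held-out patient's time series into contiguous windows and scores each window separately):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def windowed_aucs(y_patient, scores_patient, n_windows=5):
    """AUROC of one trained model on each contiguous window of one test patient."""
    aucs = []
    for y_w, s_w in zip(np.array_split(y_patient, n_windows),
                        np.array_split(scores_patient, n_windows)):
        if len(np.unique(y_w)) == 2:      # AUROC needs both classes in the window
            aucs.append(roc_auc_score(y_w, s_w))
    return aucs
```

And a sketch of the non-parametric comparison (stand-in numbers; with paired AUROC values the Wilcoxon signed-rank test is the paired counterpart, while the Mann-Whitney U test would be for independent samples):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
aucs_a = rng.uniform(0.85, 0.95, size=15)           # stand-in for classifier A
aucs_b = aucs_a - rng.uniform(0.00, 0.03, size=15)  # stand-in for classifier B

print(wilcoxon(aucs_a, aucs_b))   # paired test, two-sided by default
```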


Original answer below:

Since there is only one test, instead of worrying about the exact theoretical distribution (and about whether the other assumptions hold, e.g. whether this kind of pairing is sensible at all), I'd honestly just perform a whole lot more cross-validation runs, perhaps with an increased k (number of folds) and certainly an increased number of random re-initialisations/seeds. You could then compute an empirical p-value by counting the cases in which one model's performance was better than that of the other.

Given that test sets are usually meaningfully smaller than training sets, and most models are evaluated much faster than they are fitted, this should be computationally feasible (training already was, and evaluation shouldn't add much). For example, with 10-fold CV and 100 different seeds, you would already get an approximation of the p-value with a resolution of 0.001.
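A sketch of what I mean, with stand-in numbers (one AUROC per model per fold/seed combination):

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 1000                                  # e.g. 10 folds x 100 seeds
auc_a = rng.normal(0.88, 0.02, size=n_runs)    # stand-in for model A's AUROCs
auc_b = rng.normal(0.87, 0.02, size=n_runs)    # stand-in for model B's AUROCs

# Fraction of paired runs in which model B does at least as well as model A;
# with n_runs runs, the resolution of this estimate is 1 / n_runs.
p_empirical = np.mean(auc_b >= auc_a)
print(p_empirical)
```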

  • I agree with your statement, but this would be infeasible. I am performing leave-one-patient-out cross-validation in a medical context, so creating additional folds is impossible. I decided to leave this detail out, as it might be irrelevant for others in the future. Furthermore, as one run (1 fold and 1 seed) easily lasts a couple of hours, it is rather impractical to just add more runs. – Commented Jul 7 at 17:06
  • @IsaacNuketon Hm, so to be clear, you have 5 patients, and training on a sample of 4 then evaluating on a single sample lasts hours? What kind of data are you working with, exactly? Also, I'd be very skeptical of any metric of model accuracy computed on a training set of 4-5 samples; I don't think you can get any sort of statistically meaningful result from that. In fact, how can you compute an AUROC from a single sample at all? That sounds totally invalid to me. – Commented Jul 7 at 17:35
  • I never mentioned that each patient only has one sample. In fact, each patient can give me thousands of samples, as this is time series data and the series are windowed. The exact number of samples per patient varies, as the total length of the time series varies across patients. I should clarify that I am comparing across datasets: I have an unseen test set for all datasets but one. This specific dataset has 5 patients. As I had reason to choose 5-fold cross-validation, I divided this specific dataset into 5 folds. Each of these test sets has thousands of samples from either 1 patient or a small group of patients. – Commented Jul 7 at 19:04
  • @IsaacNuketon Ah, so you are basically doing group k-fold. I'll update my answer soon. – Commented Jul 8 at 5:41
  • @IsaacNuketon Please see my updated answer. – Commented Jul 8 at 6:11
