I have two binary classifiers and would like to check whether there is a statistically significant difference between their areas under the ROC curve (AUROC). I have reason to use AUROC as my evaluation metric.
For each classifier, I have 15 runs: 5-fold cross-validation repeated with 3 random seeds for initialisation. Every run is evaluated on unseen, independent test data. This gives me 15 paired AUROC values, one pair per run.
According to this article in Nature, the DeLong test is (often) used for significance testing with AUROCs. However, because it depends on the variance and covariance of the AUROC estimates, which are computed from the individual predictions, I suspect that I cannot apply the DeLong test to these 15 AUROC values directly. To use the DeLong test, I would instead have to concatenate all predictions on the test data across the 15 versions of each classifier. Is this correct?
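To make concrete what the DeLong test needs as input, here is a minimal sketch in Python, assuming NumPy and SciPy are available. The function name `delong_test` and its arguments are my own illustrative choices, not a library API. It takes one label vector and one score vector per classifier on the same test samples; whether the pooled predictions from my 15 runs are a valid input is exactly what I am unsure about.

```python
import numpy as np
from scipy.stats import norm


def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two correlated AUROCs (illustrative sketch).

    y_true: binary labels; scores_a / scores_b: scores of the two
    classifiers for the same samples, in the same order.
    Returns (auc_a, auc_b, z, p_value).
    """
    y_true = np.asarray(y_true).astype(bool)
    aucs, v10, v01 = [], [], []
    for scores in (scores_a, scores_b):
        scores = np.asarray(scores, dtype=float)
        pos, neg = scores[y_true], scores[~y_true]
        # psi[i, j] = 1 if positive i outranks negative j, 0.5 on ties, else 0
        psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
        aucs.append(psi.mean())        # AUROC as the mean of the kernel
        v10.append(psi.mean(axis=1))   # one structural component per positive
        v01.append(psi.mean(axis=0))   # one structural component per negative
    m, n = len(v10[0]), len(v01[0])
    # 2x2 covariance matrix of the two AUROC estimates
    cov = np.cov(np.vstack(v10)) / m + np.cov(np.vstack(v01)) / n
    var_diff = cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1]
    z = (aucs[0] - aucs[1]) / np.sqrt(var_diff)
    p_value = 2.0 * norm.sf(abs(z))
    return aucs[0], aucs[1], z, p_value
```

Note that the variance is estimated from per-sample components, which is why a list of 15 AUROC values alone is not enough. The sketch also divides by zero if the two score vectors rank the samples identically.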
Would it be a good idea to use a paired t-test on these 15 AUROC pairs (assuming the differences within the pairs are normally distributed)?
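For reference, this is roughly how I would run the paired t-test, sketched with SciPy. The helper name `paired_auroc_ttest` is hypothetical; `auroc_a` and `auroc_b` stand for my two lists of 15 AUROC values, ordered so that entry i of both lists comes from the same fold and seed. The Shapiro-Wilk test is included only as a rough check of the normality assumption, and the t-test itself treats the 15 differences as independent, which is an assumption on my part.

```python
import numpy as np
from scipy import stats


def paired_auroc_ttest(auroc_a, auroc_b):
    """Paired t-test on per-run AUROC values (illustrative sketch).

    Returns (mean_difference, shapiro_p, t_statistic, p_value).
    """
    a = np.asarray(auroc_a, dtype=float)
    b = np.asarray(auroc_b, dtype=float)
    diff = a - b
    # Rough normality check on the paired differences
    _, shapiro_p = stats.shapiro(diff)
    # Two-sided paired t-test; equivalent to a one-sample t-test on diff
    t_stat, p_value = stats.ttest_rel(a, b)
    return diff.mean(), shapiro_p, t_stat, p_value
```

It would be called as `paired_auroc_ttest(auroc_a, auroc_b)` with the two lists of 15 values.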
Are there any arguments favouring either the DeLong test or the paired t-test in this setting?