$\begingroup$

I am using test-set (percentile) bootstrapping to quantify the uncertainty of various model performance metrics, such as AUROC, AUPR, etc.

To avoid any confusion, the approach is simply:

  1. bootstrap the test set
  2. compute the target metric on each bootstrapped test set, giving me a distribution of metric values
  3. compute percentiles of the metric distribution, giving me a CI.

(Yes I know that more efficient ways to obtain CIs are available for various specific metrics; I am using this in a toolbox as a catch-all UQ approach that works for any/all metrics I can throw at it.)
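The three steps above can be sketched as follows (illustrative Python; `roc_auc_score` stands in for any metric, and returning NaN for degenerate resamples is my assumption, not part of the question):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def percentile_bootstrap_ci(y_true, y_score, metric, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap CI for an arbitrary test-set metric."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # 1. resample the test set with replacement
        try:
            stats.append(metric(y_true[idx], y_score[idx]))  # 2. metric per resample
        except ValueError:                 # degenerate resample (e.g. one class only)
            stats.append(np.nan)
    # 3. percentiles of the metric distribution give the CI
    return tuple(np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Toy data: scores mildly informative of the labels.
y_true = rng.integers(0, 2, size=200)
y_score = rng.random(200) + 0.5 * y_true
lo, hi = percentile_bootstrap_ci(y_true, y_score, roc_auc_score)
```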

Now to my question: especially in smaller test sets or with strong class imbalance, it often happens that there are either no positives or no negatives in a bootstrapped sample, prompting various metrics to divide by zero.

One approach to address this issue, which I have seen suggested on stats.SE and which the pROC package apparently implements for exactly this reason, is to use stratified bootstrapping, i.e., to keep the number of positives (or whatever count appears in the metric's denominator) constant across the bootstrapped samples. Are there any important drawbacks to this approach? Why is this not the/a default approach? Or is it?
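The stratified variant only changes the resampling step: draw with replacement within each class, so the class counts never vary. A minimal sketch (illustrative Python; pROC's own implementation is in R):

```python
import numpy as np

def stratified_bootstrap_indices(y_true, rng):
    """Resample with replacement within each class, keeping class counts fixed."""
    idx = [rng.choice(np.flatnonzero(y_true == cls),
                      size=int((y_true == cls).sum()), replace=True)
           for cls in np.unique(y_true)]
    return np.concatenate(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)   # strong imbalance: only 5 positives
idx = stratified_bootstrap_indices(y, rng)
# Every resample contains exactly 5 positives, so AUROC etc. stay computable.
```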

$\endgroup$

1 Answer

$\begingroup$

Unsurprisingly, the no free lunch theorem is alive and well in this resampling task too. There are inherent drawbacks in stratified bootstrapping, as with any other statistical process.

The main drawback of stratified bootstrapping is that it may under-represent the variability in the data, especially for small samples as here, because it constrains the range of resampling. Depending on how the stratification is implemented, this can give inconsistent variance estimates (e.g., see Pons (2007), Bootstrap of means under stratified sampling). In that regard, Hesterberg (2015), in What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum, comments: "For example, the U.K. Department of Work and Pensions wanted to bootstrap a survey of welfare cheating. They used a stratified sampling procedure that resulted in two subjects in each stratum—so an uncorrected bootstrap standard error would be too small by a factor of $\sqrt{\frac{n_i-1}{n_i}} = \sqrt{\frac{1}{2}}$." ($n_i$ is the number of subjects in each stratum.) More recently, Habineza et al. (2024), On bootstrap based variance estimation under fine stratification, look into this in more detail under fine stratification, i.e., when the population is divided into numerous small strata, each containing a relatively small number of sampling units. While not directly analogous to "smaller test sets or with strong class imbalance", this is a similar setup that can amplify the issue of underestimated variability, so we should be aware of it.
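For intuition, the deflation factor Hesterberg quotes is easy to check numerically (a sketch; $n_i = 2$ reproduces his $\sqrt{1/2}$):

```python
import math

def se_deflation(n_i):
    """Factor by which the naive within-stratum bootstrap SE is too small."""
    return math.sqrt((n_i - 1) / n_i)

print(se_deflation(2))   # sqrt(1/2) ~ 0.707: the uncorrected SE is ~29% too small
```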

The above being said, the real elephant in the room is whether the observed sample proportions accurately represent the true population proportions. The stratified bootstrap cannot remedy situations where the original sample is biased. (Additionally, the stratified bootstrap requires a more complex implementation than a standard bootstrap.)

All in all, the standard bootstrap should typically be the first approach. Even when our metric is incomputable in some resamples, the resulting NA data points are valid observations. Of course, if there are sound reasons to stratify our sample (e.g., known subpopulation proportions, class prevalences, etc.), then we should use the stratified bootstrap, as it strengthens the validity of our analysis.
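If we do keep the NA replicates, one transparent way to summarise them (my own sketch, not something prescribed above) is to report the NA rate alongside the percentile CI over the computable replicates, so the reader sees how often the metric failed:

```python
import numpy as np

def summarise_bootstrap(stats, alpha=0.05):
    """Percentile CI over the computable replicates, plus the NA rate."""
    stats = np.asarray(stats, dtype=float)
    na_rate = float(np.isnan(stats).mean())
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return {"ci": (lo, hi), "na_rate": na_rate}

print(summarise_bootstrap([0.80, 0.90, np.nan, 0.85]))
```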

$\endgroup$
  • $\begingroup$ Very interesting, thank you for the comprehensive answer! My first intuition was actually also to treat the resulting NAs as valid observations - but then, the question becomes how to handle these NAs when computing CIs (or whatever measure of uncertainty)? It can't be the optimal approach to return infinite CIs for any case in which any of the bootstrapped samples returned NA for the metric...? Of course, one can think of heuristics - discard up to 5% NAs and compute CIs on the remaining samples or whatever - but that comes with its own biases, of course, and doesn't feel very principled... $\endgroup$ Commented Dec 28, 2024 at 22:27
  • $\begingroup$ I am happy I could help. That said, if resampling-generated NAs are an absolute showstopper, the work-around is there: stratify the sampling process, correct the variance (if needed), and continue the analysis. Side-note on your comment: if a sample has missing values, we should examine the missingness mechanism (MCAR, MAR, NMAR) and, once we are satisfied we understand it, proceed accordingly (e.g., for MCAR). Nothing changes because the sample comes from a bootstrap. If anything, we know that the sampling is indeed completely random! :) $\endgroup$ Commented Dec 29, 2024 at 0:39
