All other things being equal, resampling does not really improve performance; it only changes which type of error is most common. This is why the training data should follow the expected distribution in the population: if the classifier is intended to be applied to data where the positive class makes up 2% of instances, then keep that 2% proportion in the training data.
To see why, let's assume that your training data is distributed 50-50.
If the features are very good indicators of the target, then the model can distinguish the two classes well and will achieve close to perfect performance on any class distribution.
In the general case where the features are not that good, the model cannot always distinguish the two classes well, so there are instances that the model "isn't sure" how to classify. Since there is no majority class in the training data, these ambiguous instances get predicted as positive or negative in roughly equal proportion, causing around the same number of false positive (FP) and false negative (FN) errors. However, on a test set with only 2% positive instances, almost all of the ambiguous instances are actually negative, so the number of FN becomes very small while the number of FP becomes very large. In other words, balancing the training set gives better recall but worse precision.
If the task requires favouring recall over precision, it can make sense to resample in this way. But I think it should be done only for that reason, and in my opinion only after having tested the model on the regular distribution first.
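Here is a minimal simulation sketch of this trade-off, assuming scikit-learn and synthetic Gaussian data (the class separations, sample sizes and model are illustrative assumptions, not anything specific to your data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

def sample(n, pos_rate, sep):
    """Two overlapping Gaussian classes: only the class prior (pos_rate) changes,
    the class-conditional feature distributions stay the same."""
    y = (rng.random(n) < pos_rate).astype(int)
    X = rng.normal(size=(n, 2))
    X[y == 1] += sep  # shift the positive class by `sep` in every feature
    return X, y

for sep, label in [(5.0, "strong features"), (1.5, "weak features")]:
    # Test set follows the population distribution: ~2% positives.
    X_test, y_test = sample(100_000, 0.02, sep)
    for train_name, train_rate in [("balanced 50-50", 0.5), ("natural 2%", 0.02)]:
        X_tr, y_tr = sample(20_000, train_rate, sep)
        pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_test)
        print(f"{label}, trained on {train_name:14s} -> "
              f"precision={precision_score(y_test, pred, zero_division=0):.2f}, "
              f"recall={recall_score(y_test, pred):.2f}")
```

With the weak features, the model trained on balanced data typically shows much higher recall but much lower precision than the one trained on the natural 2% distribution, whereas with the strong features both training distributions give close to perfect results.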