It's hard to think of a more eloquent way of phrasing this question, so I'll just ask it directly: would a classifier trained on data where examples of some of the classes are infrequent/rare be a bad model? I'm mainly interested in decision trees (C4.5).
I think the answer is no, but that you will get a high error rate on the rare classes, because the classifier will tend to label their members as instances of the more frequent classes. This has been my experience so far.
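To make that concrete, here's a minimal sketch of the behaviour I'm describing. It uses scikit-learn's `DecisionTreeClassifier` rather than C4.5 (the split criteria differ, but the imbalance effect seems to be the same), and numeric features from `make_classification` rather than the categorical variables I actually have:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Three classes, one of them rare (about 2% of the data).
X, y = make_classification(
    n_samples=5000,
    n_classes=3,
    n_informative=5,
    weights=[0.59, 0.39, 0.02],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Overall accuracy looks fine, but recall on the rare class tends to be
# much lower: its members mostly get absorbed into the majority classes.
print(classification_report(y_test, clf.predict(X_test)))
```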
I'm also wondering when it's acceptable to remove these rare examples and when it's considered bad practice (i.e. doing it just to lower the error). My guess is that it's okay to remove them if there's a good reason to do so, and you explain that reasoning when you report your results.
I'm not really interested in building the best possible classifier; I'm more interested in understanding the relationships between the variables and the structure of the data. All my variables are categorical and the data is non-linear, so decision trees have so far been the best tool I've found for this. (SVMs and ensemble methods are more accurate, but you can't really see the internal model structure, which you do get with decision trees.)
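For what it's worth, by "seeing the internal model structure" I mean something like the following. Continuing the sketch above (`export_text` is scikit-learn's rule printer; the feature names are just placeholders):

```python
from sklearn.tree import export_text

# Print the fitted tree as nested if/else rules; this is the kind of
# visible structure that SVMs and ensemble methods don't expose.
print(export_text(clf, feature_names=[f"x{i}" for i in range(X.shape[1])]))
```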
Thanks.