Say I have a classifier that segments my feature vectors (e.g. representing applicants) into 3 distinct segments A, B, C. It assigns each applicant a score between 0 (worst) and 1 (best) with, e.g., a logistic regression trained on historical data (ground-truth labels: 1 = great, 0 = bad), and then applies 2 thresholds (A/B and B/C).
Applicants in segment A are approved, while applicants in segment C are rejected. I'm unsure about applicants in segment B, so I reject them as well. But I am worried about missing out on some good applicants from segment B.
So I'm wondering if the following approach makes sense: I cluster all applicants with, e.g., k-means. For every "good" cluster, i.e. one with a high proportion of segment-A applicants, I re-assign all segment-B applicants in that cluster to segment A and approve them.
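To make the idea concrete, here is a minimal sketch of the pipeline I have in mind, using scikit-learn on synthetic data. The threshold values, the number of clusters, and the 60% cutoff defining a "good" cluster are all illustrative assumptions, not tuned values:

```python
# Sketch of the proposed pipeline: score applicants with logistic
# regression, split into segments A/B/C via two thresholds, then
# promote segment-B applicants that fall into "good" clusters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic stand-in for historical applicant data (labels: 1 = great, 0 = bad)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Score between 0 (worst) and 1 (best)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

t_ab, t_bc = 0.7, 0.3  # assumed A/B and B/C thresholds
segment = np.where(scores >= t_ab, "A", np.where(scores >= t_bc, "B", "C"))

# Cluster all applicants in feature space
cluster = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(X)

# A cluster is "good" if enough of its members are already in segment A
min_share_A = 0.6  # assumed cutoff
for c in np.unique(cluster):
    in_c = cluster == c
    if (segment[in_c] == "A").mean() >= min_share_A:
        segment[in_c & (segment == "B")] = "A"  # promote B -> A

approved = segment == "A"
```

(For brevity the sketch fits and scores on the same data; in practice the scoring model would of course be trained on held-out historical data.)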
My question is: Are there any intuitions, examples, or better yet theoretical results explaining why this approach can or cannot work, i.e. lead to better classification accuracy with respect to the ground-truth labels (1 = great, 0 = bad)?
What I've tried so far:
- Experiments show that I can indeed find some good applicants from segment B with clustering (using a large number of clusters), but on average never better than simply taking the top x% of segment-B applicants by assigned score.
- In an initial literature search, I couldn't find any papers or questions on this site about applying clustering after classification. What seems common instead is applying clustering as a pre-processing step.