k-nearest neighbors where # of objects in each class differs vastly

Question

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).

How should I combat this? If I use a k of 18, for example, and there are 7 B's in the neighbors (way more than the average B's in a group of 18), the test data will still be classified as A when it should probably be B.
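To make the arithmetic concrete, here is a minimal Python sketch of the situation described above. With an 18:1 ratio, about 18/19 of a B is expected among 18 neighbors by chance, so 7 B's is far above the base rate, yet a plain majority vote still returns A. The `prior_adjusted_vote` helper is a hypothetical illustration, not part of any knn package: it divides each class's share of the neighbors by its overall frequency before comparing.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Plain k-NN vote: the most common label among the neighbors wins."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def prior_adjusted_vote(neighbor_labels, priors):
    """Hypothetical helper: divide each class's share of the neighbors by
    its prior frequency and return the class with the largest ratio."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return max(priors, key=lambda c: (counts[c] / k) / priors[c])

# The case from the question: k = 18, 7 neighbors are B, 11 are A.
neighbors = ["B"] * 7 + ["A"] * 11
priors = {"A": 18 / 19, "B": 1 / 19}  # 18:1 class ratio

expected_b = len(neighbors) * priors["B"]  # B's expected by chance alone
plain = majority_vote(neighbors)
adjusted = prior_adjusted_vote(neighbors, priors)
print(round(expected_b, 2), plain, adjusted)
```

Whether the adjusted vote is appropriate depends on whether the 18:1 ratio also holds for the data being classified.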

I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?

Guy haimovitz · Accepted Answer · 2016-06-03 03:27:21Z

There is no such rule. For your case, I would try a very small k, probably between 3 and 6.
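Since there is no rule, one way to pick among small values of k is simply to try them and see how often the rare class is recovered. The following is a rough Python sketch under loud assumptions: the data are synthetic (two Gaussian blobs, 180 A's and 10 B's), and `knn_predict` and `recall_b_loo` are hypothetical helpers written for this example, not functions from any library. Note that with only 10 B's, a held-out B can have at most 9 B neighbors, so at k = 18 it can at best tie.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.
    On a tie, the tied label that appears first among the neighbors wins."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def recall_b_loo(data, k):
    """Leave-one-out recall on class B: the share of B points still
    classified as B when each one is held out of the training set."""
    b_indices = [i for i, (_, label) in enumerate(data) if label == "B"]
    hits = 0
    for i in b_indices:
        train = data[:i] + data[i + 1:]
        if knn_predict(train, data[i][0], k) == "B":
            hits += 1
    return hits / len(b_indices)

# Synthetic toy data (not from the question): 180 A's and 10 B's, i.e. 18:1.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(180)]
data += [((rng.gauss(2.5, 0.5), rng.gauss(2.5, 0.5)), "B") for _ in range(10)]

recalls = {k: recall_b_loo(data, k) for k in (3, 4, 5, 6, 18)}
for k, r in recalls.items():
    print(f"k={k:2d}  recall on B = {r:.2f}")
```

On real data, the same comparison would be run on a held-out set or with cross-validation, looking at recall on B rather than overall accuracy, since always predicting A already scores about 95% accuracy at 18:1.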

As for the dataset: unless your test data or real-world data occur in about the same ratio you mentioned (18:1), I would remove some A's to get more accurate results. I would not advise doing this if the ratio is indeed close to the real-world data, because you would lose the effect of the ratio (a rarer class should be predicted less often).
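Removing some A's can be sketched as random undersampling of the majority class. This is a minimal Python illustration, not a definitive recipe: `undersample` is a hypothetical helper, the 3:1 target ratio is an arbitrary assumption, and the feature values are placeholders.

```python
import random

def undersample(rows, majority="A", ratio=3.0, seed=0):
    """Randomly drop majority-class rows until there are at most `ratio`
    of them per minority-class row. Rows are (features, label) pairs."""
    major = [r for r in rows if r[1] == majority]
    minor = [r for r in rows if r[1] != majority]
    keep = min(len(major), int(ratio * len(minor)))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(major, keep) + minor

# Toy training set with the question's 18:1 ratio (placeholder features).
train = [((float(i),), "A") for i in range(180)]
train += [((float(i),), "B") for i in range(10)]
balanced = undersample(train, ratio=3.0)

n_a = sum(1 for _, label in balanced if label == "A")
n_b = sum(1 for _, label in balanced if label == "B")
print(n_a, n_b)
```

Only the training set should be resampled this way; the test set is better left at its natural ratio so the evaluation reflects the data the classifier will actually see.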

Thanks for the advice! I did end up removing some A's to lower the ratio, and used a smaller k. I'm pleased with the results.
