0

I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).

How should I combat this? If I use a k of 18, for example, and there are 7 B's in the neighbors (way more than the average B's in a group of 18), the test data will still be classified as A when it should probably be B.

I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?

1 Answer 1

1

Ther is no such rule, for your case i would try a very small k probably between 3 and 6.

About the dataset, unless your test data or real world data are found in about the same ratio you have mentioned ( 18:1 ) i would remove some A's for more accurate results, i wont advise you doing it if the ratio is indeed close to the real world data because you will lose the effect of the ratio (lower probability classify for a lower probability data).

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the advice! I did end up removing some A's to lower the ratio, and used a smaller k. I'm pleased with the results.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.