$\begingroup$

My friend was reading a textbook and had this question:

Suppose that you observe $(X_1,Y_1),\ldots,(X_{100},Y_{100})$, which you assume to be i.i.d. copies of a random pair $(X,Y)$ taking values in $\mathbb{R}^2 \times \{1,2\}$. You plot the data and see the following:

[Figure: scatter plot of the data; image not available]

where black circles represent those $X_i$ with $Y_i=1$ and the red triangles represent those $X_i$ with $Y_i=2$. A practitioner tells you that their misclassification costs are equal, $c_1 = c_2 = 1$, and would like advice on which algorithm to use for prediction. Given the options:

  • Linear discriminant analysis;
  • $K$-nearest neighbours with $K=5$;
  • $K$-nearest neighbours with $K=90$.

Which algorithm would be best here? I think it should be $K$-nearest neighbours with $K=5$, since I would expect accuracy to get worse as $K$ grows. What would be your choice and why?

$\endgroup$

1 Answer
$\begingroup$

You can choose the optimal method using cross-validation. If your sample size is relatively small, use leave-one-out cross-validation... I would not be surprised if $K = 5$ worked well. Linear discriminant analysis (LDA) will not work here because it implies linear decision boundaries, unless you enlarge the set of predictors with non-linear transformations.
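To make the cross-validation suggestion concrete, here is a minimal Python sketch of leave-one-out cross-validation over the three candidates. It is only an illustration under stated assumptions: the original figure is not available, so `make_rings` generates hypothetical data (class 1 in a disc, class 2 in a ring around it) as a stand-in for a non-linear class boundary, and all function names are made up for this example. Only the standard library is used.

```python
import math
import random


def make_rings(n=100, seed=0):
    """Hypothetical stand-in data: class 1 in a disc, class 2 in a surrounding ring."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = 1 if i % 2 == 0 else 2
        radius = rng.uniform(0.0, 1.0) if label == 1 else rng.uniform(2.0, 3.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        data.append(((radius * math.cos(angle), radius * math.sin(angle)), label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (ties go to class 1)."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes_for_1 = sum(1 for _, label in nearest if label == 1)
    return 1 if 2 * votes_for_1 >= len(nearest) else 2


def lda_predict(train, x):
    """Two-class LDA in 2-D: pooled covariance, empirical priors, equal costs."""
    groups = {c: [p for p, y in train if y == c] for c in (1, 2)}
    means = {c: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
             for c, pts in groups.items()}
    # Pooled within-class covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for c, pts in groups.items():
        mx, my = means[c]
        for px, py in pts:
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
    dof = len(train) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy ** 2

    def score(c):
        mx, my = means[c]
        # a = Sigma^{-1} mu_c, using the closed-form inverse of a 2x2 matrix.
        ax = (syy * mx - sxy * my) / det
        ay = (sxx * my - sxy * mx) / det
        prior = len(groups[c]) / len(train)
        return x[0] * ax + x[1] * ay - 0.5 * (mx * ax + my * ay) + math.log(prior)

    return 1 if score(1) >= score(2) else 2


def loo_error(data, predict):
    """Leave-one-out misclassification rate: hold out each point once."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors += predict(train, x) != y
    return errors / len(data)


data = make_rings()
results = {
    "LDA": loo_error(data, lda_predict),
    "KNN, K=5": loo_error(data, lambda tr, x: knn_predict(tr, x, 5)),
    "KNN, K=90": loo_error(data, lambda tr, x: knn_predict(tr, x, 90)),
}
for name, err in results.items():
    print(f"{name}: leave-one-out error = {err:.2f}")
```

With equal misclassification costs, the plain error rate is the natural quantity to compare. Note that with 100 observations each leave-one-out fold has only 99 training points, so a 90-neighbour vote uses nearly the whole sample and tends toward predicting the overall majority class regardless of where the query point lies. The numbers printed here describe the synthetic data only, not the textbook's data set.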

Also, the picture above is a classic case where support vector machines (SVM) with a Gaussian kernel could be of use. R has a friendly implementation of SVM in the "kernlab" package.
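For reference, a sketch of what "Gaussian kernel" means here, written in the parametrisation that kernlab's `rbfdot` kernel uses (a single width parameter $\sigma > 0$). The labels are assumed to be recoded from $\{1,2\}$ to $\tilde{y}_i \in \{-1,+1\}$; the coefficients $\alpha_i$ and the offset $b$ are whatever the fitted SVM returns.

$$
k(x, x') = \exp\!\left(-\sigma \lVert x - x' \rVert^2\right),
\qquad
\hat{f}(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n} \alpha_i \tilde{y}_i \, k(X_i, x) + b\right).
$$

Because $k$ depends on $x$ only through distances to the training points, the resulting decision boundary can be curved or closed, which is what a linear rule such as LDA on the raw predictors cannot produce. Both $\sigma$ and the SVM cost parameter would again be chosen by cross-validation.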

$\endgroup$
  • $\begingroup$ Hi. But why would I not use $K=90$? $\endgroup$ Commented Jan 21, 2021 at 8:50
  • $\begingroup$ I did not say you shouldn't. What I meant: use cross-validation to decide. $\endgroup$ Commented Jan 21, 2021 at 8:51
  • $\begingroup$ No, I mean my question was, given you only have this graph and you had to choose between $K=90$ or $K=5$, what would you choose? And why? $\endgroup$ Commented Jan 21, 2021 at 8:52
  • $\begingroup$ Why would you base your decision on the graph only? Is this a homework problem? $\endgroup$ Commented Jan 21, 2021 at 9:01
  • $\begingroup$ I personally wouldn't decide only based on a graph, but the question is from a textbook. It's not homework though! I'd like to hear a good explanation as to why someone would choose one algo over the other, given they only have this graph:) $\endgroup$ Commented Jan 21, 2021 at 9:04
