K-Nearest Neighbors (K-NN) By Mohamed Gamal
Agenda
▪ What is K-NN?
▪ How does K-NN work?
▪ K-NN algorithm and structure
▪ Advantages of K-NN
▪ Disadvantages of K-NN
▪ Example
What is KNN?
▪ K-Nearest Neighbors (KNN) is a simple yet powerful supervised, non-linear machine learning algorithm used for classification and regression tasks.
▪ It is a non-parametric algorithm, meaning it makes no assumptions about the underlying data distribution and does not learn a model during a training phase.
▪ Instead, it memorizes the entire training dataset and makes predictions based on the similarity of new data points to the known data points.
How does KNN work? (Algorithm)
▪ Step 1: Select the number K of neighbors.
▪ Step 2: Calculate the distance from the new data point to every point in the training data.
▪ Step 3: Take the K nearest neighbors, i.e., the K points with the smallest distances.
▪ Step 4: Among these K neighbors, count the number of data points in each category.
▪ Step 5: Assign the new data point to the category with the most neighbors (majority vote).
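The five steps above can be sketched as a short from-scratch classifier (a minimal illustration; the function and variable names are my own, not from the slides):

```python
from collections import Counter
from math import sqrt

def knn_classify(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    Step 1: k is chosen by the caller.
    Step 2: compute the (Euclidean) distance from the query to every training point.
    Step 3: take the k training points with the smallest distances.
    Step 4: count how many of those k neighbors fall in each category.
    Step 5: assign the query to the category with the most votes.
    """
    distances = [
        (sqrt(sum((q - p) ** 2 for q, p in zip(query, point))), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances)[:k]                # k smallest distances
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, with two small clusters, `knn_classify([(1, 1), (2, 1), (8, 8), (9, 8)], ["blue", "blue", "orange", "orange"], (2, 2), k=3)` returns "blue".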
Worked illustration (K = 5):
1) Choose K = 5.
2) Calculate the distances from the new data point to all existing points.
3) Choose the K = 5 neighbors with the minimum distances.
4) Among the selected K nearest neighbors, count the number of points in each class/category: Blue: 3, Orange: 2.
5) Assign the new data point to the class/category with the majority of votes: Blue.
Ways to calculate the distance in KNN
▪ Minkowski Distance (named after the German mathematician Hermann Minkowski): 𝑑(𝑥, 𝑦) = (Σᵢ |𝑥ᵢ − 𝑦ᵢ|ᵖ)^(1/𝑝)
▪ Manhattan Distance (also called taxicab or city-block distance): the Minkowski distance with 𝑝 = 1, i.e., 𝑑(𝑥, 𝑦) = Σᵢ |𝑥ᵢ − 𝑦ᵢ|
▪ Euclidean Distance (the shortest straight-line distance between two points): the Minkowski distance with 𝑝 = 2, i.e., 𝑑(𝑥, 𝑦) = √(Σᵢ (𝑥ᵢ − 𝑦ᵢ)²)
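The three metrics can be written as one function, since Manhattan and Euclidean are just the Minkowski distance with p = 1 and p = 2 (a small sketch; function names are my own):

```python
def minkowski(a, b, p):
    """General Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """p = 1: sum of absolute coordinate differences (taxicab distance)."""
    return minkowski(a, b, p=1)

def euclidean(a, b):
    """p = 2: straight-line (shortest) distance between two points."""
    return minkowski(a, b, p=2)
```

For the points (0, 0) and (3, 4), the Manhattan distance is 3 + 4 = 7, while the Euclidean distance is √(9 + 16) = 5.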
How to select the value of K?
• There is no fixed rule for determining the best value of K, so you need to try several values and pick the one that performs best.
• A commonly used default value for K is 5.
• A very low value such as K = 1 or K = 2 can be noisy and makes the model sensitive to outliers.
• Large values of K smooth out the predictions, but they can blur the boundaries between classes and make prediction slower.
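One common way to "try some values" is to score each candidate K on held-out points, e.g. with leave-one-out validation (a toy sketch on made-up data; the dataset and function names are my own):

```python
from collections import Counter
from math import sqrt

def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: classify each point using all the other points."""
    correct = 0
    for i, (q, true_label) in enumerate(zip(points, labels)):
        # Distances from the held-out point q to every other point.
        dists = sorted(
            (sqrt(sum((a - b) ** 2 for a, b in zip(q, p))), lab)
            for j, (p, lab) in enumerate(zip(points, labels))
            if j != i
        )
        votes = Counter(lab for _, lab in dists[:k])
        correct += votes.most_common(1)[0][0] == true_label
    return correct / len(points)

# Try a few odd values of K and keep the best one (odd K avoids tied
# votes in two-class problems).
points = [(1, 1), (2, 2), (1, 2), (2, 1), (8, 8), (9, 9), (8, 9), (9, 8), (5, 5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "a"]
best_k = max([1, 3, 5], key=lambda k: loo_accuracy(points, labels, k))
```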
Advantages:
▪ Simple and effective algorithm
▪ Quick calculation time
▪ High accuracy (with small datasets)
▪ No need to make additional assumptions about the data

Disadvantages:
▪ Accuracy depends on the quality of the data (e.g., noise can affect accuracy)
▪ With large data, the prediction stage might be slow
▪ Sensitive to the scale of the data and to irrelevant features
▪ Requires a large amount of memory, since it needs to store all of the training data
Example
▪ The figure shows the dataset and the new data entry (axes: Brightness, Saturation).
▪ Assume that:
• The value of K is 5.
• Euclidean distance is used.
▪ Note: you can calculate the distance using any other measure (e.g., Manhattan, Minkowski, etc.).
Calculating the distances from the new data entry (20, 35) to each point in the dataset:
▪ 𝑑₁ = √((40 − 20)² + (20 − 35)²) = 25
▪ 𝑑₂ = √((50 − 20)² + (50 − 35)²) = 33.54
▪ 𝑑₃ = √((60 − 20)² + (90 − 35)²) = 68.01
▪ 𝑑₄ = √((10 − 20)² + (25 − 35)²) = 14.14
▪ 𝑑₅ = √((70 − 20)² + (70 − 35)²) = 61.03
▪ 𝑑₆ = √((60 − 20)² + (10 − 35)²) = 47.17
▪ 𝑑₇ = √((25 − 20)² + (80 − 35)²) = 45.28
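These distances can be recomputed with a short script (coordinates read off the formulas above; note that 𝑑₄ = √200 ≈ 14.14 and 𝑑₇ = √2050 ≈ 45.28):

```python
from math import sqrt

new_point = (20, 35)  # the new data entry
dataset = [(40, 20), (50, 50), (60, 90), (10, 25), (70, 70), (60, 10), (25, 80)]

# Euclidean distance from the new point to each point in the dataset.
for i, (x, y) in enumerate(dataset, start=1):
    d = sqrt((x - new_point[0]) ** 2 + (y - new_point[1]) ** 2)
    print(f"d{i} = {d:.2f}")
```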
𝐾 = 5
▪ Based on the 5 selected neighbors with the lowest distances (𝑑₄, 𝑑₁, 𝑑₂, 𝑑₇, 𝑑₆), the majority of votes are for the Red class; therefore, the new entry is classified as Red.
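The voting step can be reproduced from the computed distances. Note: the class label attached to each distance below is an illustrative stand-in; the actual labels come from the slide's figure, which is not reproduced in the text.

```python
from collections import Counter

# (distance, class) pairs from the worked example.
# NOTE: the class labels are assumed for illustration only.
neighbors = [
    (25.00, "Red"), (33.54, "White"), (68.01, "White"), (14.14, "Red"),
    (61.03, "White"), (47.17, "White"), (45.28, "Red"),
]
k = 5
nearest = sorted(neighbors)[:k]                 # 5 smallest distances
votes = Counter(label for _, label in nearest)  # Red: 3, White: 2
print(votes.most_common(1)[0][0])               # majority class: "Red"
```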
KNN failure cases
▪ Jumbled data: if the data points of all the different classes are thoroughly mixed together, KNN will fail, because the K nearest neighbors of a query point will belong to many different classes.
▪ Outlier points: isolated mislabeled or anomalous points can dominate the vote, especially for small K.
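The outlier case is easy to demonstrate: with K = 1 a single mislabeled point decides the prediction, while a larger K lets the surrounding majority override it (a toy sketch on made-up data):

```python
from collections import Counter
from math import sqrt

def knn(points, labels, query, k):
    """Majority vote among the k nearest points to `query`."""
    dists = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(query, p))), lab)
        for p, lab in zip(points, labels)
    )
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]

# A "blue" cluster with a single "orange" outlier sitting inside it,
# plus a genuine "orange" cluster far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (9, 9), (9, 8), (8, 9)]
labels = ["blue"] * 4 + ["orange"] + ["orange"] * 3
query = (1.4, 1.4)

print(knn(points, labels, query, k=1))  # "orange" -- the outlier decides
print(knn(points, labels, query, k=5))  # "blue"   -- the majority overrides it
```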
Thank You!
