Among all supervised machine learning algorithms, K Nearest Neighbours (KNN) is one of the simplest and easiest to understand. It can be applied to both regression and classification problems. In this article, we will learn the intuition and mathematics behind the KNN algorithm in detail.
KNN is a supervised machine learning algorithm that can be used for both regression and classification problems. There are three main variants of the KNN algorithm: Brute, KD Tree, and Ball Tree.
The KNN algorithm differs from most other machine learning algorithms in that no work is done on the training data; instead, all the work happens at prediction time. The model simply stores the training data in memory and performs calculations only when testing data arrives.
Let’s consider binary classification data (having two classes in the target output). One class is represented by red colour and the other by blue colour. The plot below represents the training data.
When a testing point arrives, its K nearest neighbours are found by calculating its distance to all the stored training data points. For a classification problem, the majority class among the k nearest neighbours is predicted as the output; for a regression problem, the average of the target values of the k nearest neighbours is predicted.
This variant uses a brute-force approach, since the distance of the testing point is calculated against every training data point. Therefore it is called Brute KNN.
In the graph below, consider the black point as testing data with k selected as 3. The 3 nearest neighbours of the black point are checked, and red is predicted as the output class for the black data point, since red is the class of the majority of those neighbours.
To implement the KNN algorithm, we need to choose the value of K and the distance metric used to measure the distance between the testing point and the training data points. Hence K and the distance metric are the two important parameters in KNN.
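The brute-force procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the dataset, the choice of Euclidean distance, and the function names are all assumptions made here.

```python
# Minimal from-scratch sketch of brute-force KNN for classification and
# regression. The toy dataset and Euclidean metric are assumptions.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k, task="classification"):
    # Brute force: compute the distance from the query to every training point.
    dists = sorted(zip((euclidean(x, query) for x in train_X), train_y))
    neighbours = [label for _, label in dists[:k]]
    if task == "classification":
        # Majority vote among the k nearest neighbours.
        return Counter(neighbours).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return sum(neighbours) / k

# Toy binary-classification data: a "red" cluster near (1, 1)
# and a "blue" cluster near (5, 5).
X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
y = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_predict(X, y, (1.5, 1.5), k=3))  # → red
```

A query near the red cluster has three red nearest neighbours, so the majority vote returns red.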
How to Find the Optimal Value of K in KNN?
Selecting a good value of k is crucial for getting accurate predictions from KNN. Let’s see the effect of different values of k on the performance of KNN.
Case 1: When K=1
k=1 means the model looks at only the single nearest neighbour of the testing point. A very small value of k such as 1 produces highly variable, unstable decision boundaries: a small change in the training data can cause a large change in the testing output.
Case 2: When K=N
k=N (where N is the number of training data points) means every training point counts as a neighbour of the testing point. In this case, the model will always predict the majority class, regardless of the input.
What should be the value of K?
The value of K should be an odd number (for binary classification) to avoid ties between the classes.
The goal is to find the optimal value of k for which both training and testing error are low. An error curve is drawn by running the model for different values of k.
When the value of k is low, the training error is low (since the nearest neighbour of a training point is the point itself) and the testing error is very high; this condition is called overfitting.
As the value of k is increased, the testing error starts decreasing until it reaches a minimum at the optimal value of k.
When the value of k is increased further, both training and testing error become high, as the model tends toward always predicting the majority class; this condition is called underfitting.
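The behaviour at both extremes can be checked numerically. The sketch below builds a small synthetic dataset (all data, cluster centres, and split sizes are assumptions made here) and computes the error rate for several values of k: k = 1 drives the training error to zero, while k = N collapses to the majority class.

```python
# Sketch of an error curve over k on a tiny synthetic dataset.
import math
import random
from collections import Counter

random.seed(0)

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_classify(train, query, k):
    dists = sorted((euclidean(x, query), label) for x, label in train)
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

def error_rate(train, data, k):
    wrong = sum(knn_classify(train, x, k) != label for x, label in data)
    return wrong / len(data)

# Two noisy clusters: class "a" around (0, 0), class "b" around (3, 3).
def sample(n, cx, cy, label):
    return [((cx + random.gauss(0, 1.5), cy + random.gauss(0, 1.5)), label)
            for _ in range(n)]

train = sample(40, 0, 0, "a") + sample(20, 3, 3, "b")
test = sample(10, 0, 0, "a") + sample(5, 3, 3, "b")

for k in (1, 5, 15, 60):
    print(k, error_rate(train, train, k), error_rate(train, test, k))
```

At k = 1 each training point is its own nearest neighbour, so the training error is exactly zero; at k = 60 (= N here) every query, even one sitting in the middle of the "b" cluster, is assigned the majority class "a".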
Distance Measure in KNN
The distance metric is another crucial parameter that can affect the performance of the model. Below are different distance metrics that can be used to calculate the distance between two data points:
- Manhattan Distance
- Euclidean Distance
- Minkowski Distance
- Hamming Distance
1. Manhattan Distance (L1 norm)
It is the sum of the absolute differences between the coordinates of two data points. The formula of Manhattan distance is given as:

d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
2. Euclidean Distance (L2 norm)
It is the most commonly used distance metric. It requires scaled data as input to give meaningful results. The formula of Euclidean distance is given as:

d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
For example, the distance between city A and city B by road is like the Manhattan distance, while the straight-line distance by air is the Euclidean distance.
3. Minkowski Distance (Lp norm)
It is a generalization of Manhattan and Euclidean distance. The formula of Minkowski distance is given as:

d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)

where p is the order of the norm. For p = 1, Minkowski distance becomes Manhattan distance, and for p = 2 it becomes Euclidean distance.
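The three metrics, and the fact that Minkowski with p = 1 and p = 2 reduces to Manhattan and Euclidean distance respectively, can be sketched as follows (the example points are arbitrary):

```python
# Sketch of the three distance metrics from the formulas above.
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (1, 2), (4, 6)
print(manhattan(a, b))     # → 7
print(euclidean(a, b))     # → 5.0
print(minkowski(a, b, 1))  # same as Manhattan
print(minkowski(a, b, 2))  # same as Euclidean
```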
4. Hamming Distance
It is used to calculate the distance between two boolean vectors (containing 0 and 1 values) or between categorical values. If the values at a position are the same, the distance contribution is 0; otherwise it is 1.
For example, for the boolean vectors A = (1, 0, 1, 1, 0) and B = (0, 0, 1, 0, 1), the number of bits that differ between A and B is 3. Hence the Hamming distance is 3.
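A minimal sketch in Python, using two illustrative boolean vectors:

```python
# Hamming distance: count the positions where two equal-length vectors differ.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

A = [1, 0, 1, 1, 0]
B = [0, 0, 1, 0, 1]
print(hamming(A, B))  # → 3
```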
Advantages of KNN
- Categorical values can be used while training the model (with a suitable metric such as Hamming distance); there is no need to convert categorical data into numerical form.
- The KNN algorithm doesn’t make any assumptions about the distribution of the data; it is a non-parametric algorithm.
Drawbacks of KNN
- KNN is very slow at prediction time, as it measures the distance of the testing data to all training data points; hence it is considered a lazy learner.
- A large amount of space is required to store training data.
Thank you for reading this article till the end. In this article, we learned the KNN algorithm and the important parameters that affect it. In the next article, we will learn about two more versions of the KNN algorithm: KD Tree and Ball Tree.
Feel free to give your feedback and ask any questions in the comment box below.