The K Nearest Neighbor algorithm is one of the simplest supervised machine learning algorithms, used for solving both regression and classification problems. In the previous article, we learned about the working of the Brute KNN algorithm, which is the basic version of KNN.
The other two types of KNN algorithms are Ball Tree and KD Tree. Both are based on tree data structures and are computationally less expensive than the Brute method for finding the nearest neighbours of a data point, so they can be used as advanced variants of the KNN algorithm.
If you are interested in learning these two types of KNN algorithms in detail, this article is for you. It is recommended to go through the Brute KNN algorithm first.
K Nearest Neighbor
K Nearest Neighbor is a supervised machine learning algorithm, although it can be used for unsupervised tasks as well. No work is done on the training data during training; the data is simply stored in memory.
When a testing data point arrives, its distance to all the training data points is calculated to find the k nearest neighbours, which makes this algorithm computationally very expensive.
We need a fast and optimized method to find the nearest neighbours. Using a tree data structure can speed up the search process. Below are three approaches to finding the nearest neighbours for a data point.
- Brute Force Approach
This is the basic approach, in which the distance of the data point to every other data point is calculated and then the k nearest neighbours are selected. Let N be the number of training points in the dataset and D the dimension, i.e. the number of features. The complexity of this algorithm is O(N·D) per query.
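The brute-force search can be sketched in a few lines of NumPy (a minimal sketch; the small 2-D training set below is illustrative):

```python
import numpy as np

def brute_knn(train, query, k):
    """Brute-force k-NN: measure the query against every
    training point, so one query costs O(N * D)."""
    # Euclidean distance from the query to all N training points
    dists = np.linalg.norm(train - query, axis=1)
    # indices of the k smallest distances
    return np.argsort(dists)[:k]

# small illustrative 2-D training set
train = np.array([[1, 9], [2, 3], [4, 1], [3, 7], [5, 4],
                  [6, 8], [7, 2], [8, 8], [7, 9], [9, 6]])
print(brute_knn(train, np.array([8, 5]), k=2))  # two nearest neighbours
```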
- KD Tree
KD Tree stands for K-Dimensional Tree. It is a tree-based data structure that helps in reducing search complexity. The complexity of a query is O(D·log N), which is much less than that of the Brute Force Approach.
- Ball Tree
KD Tree becomes inefficient for datasets with higher dimensions. Ball Tree is another tree-based data structure that handles multidimensional data much more efficiently.
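In scikit-learn, all three approaches are available through the `algorithm` parameter of `NearestNeighbors`. This short sketch (assuming scikit-learn is installed) shows that they return the same neighbours and differ only in how the search is organised:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 9], [2, 3], [4, 1], [3, 7], [5, 4],
              [6, 8], [7, 2], [8, 8], [7, 9], [9, 6]])

# the three approaches differ only in the `algorithm` argument
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=2, algorithm=algorithm).fit(X)
    dist, idx = nn.kneighbors([[8, 5]])
    print(algorithm, idx[0])  # same neighbours from all three
```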
K Dimensional Tree (KD Tree) – KNN Algorithm
In the Brute KNN approach, the distance of the new testing data point to all the training data points is calculated to find the k nearest neighbours. In the KD Tree-based approach, the total space of the training data is divided into multiple blocks, and the distance is calculated only with the data points in the relevant block instead of with all the training data points.
How is KD Tree Built?
A KD tree is built by repeatedly splitting the training data along one dimension at a time. Let's build a KD tree for 2-dimensional data, which has an x and a y-axis.
Consider training data points with 2 dimensions (x,y) – (1,9), (2,3), (4,1), (3,7), (5,4), (6,8),(7,2), (8,8), (7,9), (9,6). Below are the steps to divide the space into multiple parts using KD Tree.
- Pick a random dimension.
Let us select the x dimension; the x values are 1, 2, 4, 3, 5, 6, 7, 8, 7, 9.
- Find the median
Sort the above data: 1, 2, 3, 4, 5, 6, 7, 7, 8, 9. With an even number of points there are two middle values, 5 and 6; here the upper one, 6, is taken as the median.
- Split the data into approximately equal halves.
Split the data on the x-axis at the median, x = 6.
- Repeat the above three steps
In the next iteration, the other dimension, y, is selected, its median is found, and the data is again split. The second and third steps are repeated to divide the space into multiple parts. In the graph below, you can see that the space is first divided at x = 6 and then at y = 4 and y = 8.
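The splitting procedure above can be sketched as a small recursive builder (a minimal sketch; for simplicity it cycles through the dimensions in order rather than picking one at random):

```python
import numpy as np

def build_kd_tree(points, depth=0):
    """Minimal KD-tree sketch: split the points at the median,
    alternating the splitting dimension at every level."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]            # alternate x, y, x, y, ...
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                    # median index
    return {
        "point": points[mid].tolist(),        # splitting point
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

pts = np.array([[1, 9], [2, 3], [4, 1], [3, 7], [5, 4],
                [6, 8], [7, 2], [8, 8], [7, 9], [9, 6]])
tree = build_kd_tree(pts)
print(tree["point"])           # root splits on x at (6, 8), i.e. x = 6
print(tree["left"]["point"])   # left half then splits on y at (5, 4)
print(tree["right"]["point"])  # right half splits on y at (8, 8)
```

The first three splits land at x = 6, y = 4, and y = 8, matching the partition described above.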
Prediction using KD Tree
When a testing data point arrives, its distance is calculated only with the training data in the same block to find the k nearest neighbours.
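One way to see why this is fast is to sketch the classic KD-tree nearest-neighbour search: descend to the block containing the query, then backtrack, visiting the other side of a split only when the splitting plane is closer than the best match found so far. This is a minimal sketch, not scikit-learn's implementation:

```python
import numpy as np

def build(points, depth=0):
    # median split, cycling through the dimensions
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Return (point, distance) of the nearest training point."""
    if node is None:
        return best
    dist = np.linalg.norm(node["point"] - query)
    if best is None or dist < best[1]:
        best = (node["point"], dist)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)   # search the query's own block first
    if abs(diff) < best[1]:             # other block may hide a closer point
        best = nearest(far, query, best)
    return best

train = np.array([[1, 9], [2, 3], [4, 1], [3, 7], [5, 4],
                  [6, 8], [7, 2], [8, 8], [7, 9], [9, 6]])
point, dist = nearest(build(train), np.array([8, 5]))
print(point, round(dist, 3))
```

For the query (8, 5) the search finds (9, 6) while skipping the entire left half of the space, because the plane x = 6 is farther from the query than the best distance already found.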
Ball Tree – KNN Algorithm
Similar to the KD Tree, in the Ball Tree the total space of the training data is divided into multiple balls (circular blocks), and the distance of the testing data is calculated only with the training points in the relevant ball instead of with all the training data points.
How is Ball Tree Built?
Ball Tree is built by splitting the complete space into multiple smaller circular blocks.
Consider training data with two-dimensional data (x,y) – (1,2), (2,6), (3,4), (5,6), (7,8), (8,3). Plot it on a graph. Below are the steps to divide the space into multiple parts using the Ball Tree.
- Select a random data point (5,6).
- Find the point farthest from that point: (1,2).
- Again find the point farthest from the current point: (7,8).
- Project all the points on the line joining the farthest points.
- Find the median to divide the space into two halves.
- Find the centroid of each half. The centroids are denoted by black dots.
- In each half, find the point farthest from the centroid and draw a circle around the centroid. (1,2) is the farthest point in the first ball and (8,3) is the farthest point in the other ball. The radius of each circle is the distance from the centroid to its farthest point.
In the image below, ball 1 contains all the training data points, which are split into two balls, ball 2 and ball 3.
All of the above seven steps are repeated in each individual ball, and the data is further divided into smaller balls.
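The first split in the steps above can be sketched directly (a minimal sketch; `seed_index=3` fixes the "random" starting point to (5, 6) so the run matches the worked example):

```python
import numpy as np

train = np.array([[1, 2], [2, 6], [3, 4], [5, 6], [7, 8], [8, 3]], dtype=float)

def split_ball(points, seed_index):
    """One Ball Tree split, following the steps above."""
    seed = points[seed_index]
    # farthest point from the seed, then farthest from that point
    p1 = points[np.linalg.norm(points - seed, axis=1).argmax()]
    p2 = points[np.linalg.norm(points - p1, axis=1).argmax()]
    # project every point onto the line joining the two farthest points
    line = p2 - p1
    t = (points - p1) @ line / (line @ line)
    # split the points at the median projection
    left, right = points[t <= np.median(t)], points[t > np.median(t)]
    # centroid and radius (distance to the farthest member) of each ball
    balls = [(half.mean(axis=0),
              np.linalg.norm(half - half.mean(axis=0), axis=1).max())
             for half in (left, right)]
    return left, right, balls

left, right, balls = split_ball(train, seed_index=3)  # start from (5, 6)
print(left.tolist())          # members of the first ball
print(balls[0][0].tolist())   # its centroid
```

Running this reproduces the example: (1,2) is farthest from (5,6), (7,8) is farthest from (1,2), and the median split puts (1,2), (2,6), (3,4) in one ball and (5,6), (7,8), (8,3) in the other, with (1,2) and (8,3) as the farthest points from their respective centroids.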
Prediction using Ball Tree
Let (8,5) be a testing data point. It belongs to ball 3, but ball 3 is further divided into ball 6 and ball 7.
Calculate the distance of (8,5) to the centroids of ball 6 and ball 7. The centroid of ball 6 is found to be nearest to the testing data point (8,5), hence the training data points in ball 6 are used for predicting the output of (8,5).
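That lookup can be sketched as follows. The contents of balls 6 and 7 below are a hypothetical split of ball 3's points, since the article's exact deeper split is not reproduced here:

```python
import numpy as np

# hypothetical deeper split of ball 3 = {(5,6), (7,8), (8,3)} into
# balls 6 and 7 (illustrative stand-ins, not the article's figures)
ball_6 = {"centroid": np.array([8.0, 3.0]), "points": np.array([[8, 3]])}
ball_7 = {"centroid": np.array([6.0, 7.0]), "points": np.array([[5, 6], [7, 8]])}

query = np.array([8, 5])

# descend to the child ball whose centroid is nearest to the query
chosen = min((ball_6, ball_7),
             key=lambda b: np.linalg.norm(b["centroid"] - query))
# only the points inside that ball are compared against the query
dists = np.linalg.norm(chosen["points"] - query, axis=1)
print(chosen["points"][dists.argmin()])
```

With these stand-in centroids, ball 6 is chosen because its centroid is closer to (8, 5), and only its members are used for the prediction.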
By the end of this article, I hope the working of the KD Tree and Ball Tree is clear to you. Feel free to share your feedback and ask any doubts in the comment box below.