KMeans is an unsupervised machine-learning algorithm used for clustering. K-means++ is a modified version of the K-means algorithm in which the centroid initialization process is updated.
In K-means, the initial value of the centroid is randomly selected. K-means++ rather chooses an initialization algorithm to initialize the centroid with values such that the overall model performance improves.
KMeans++ Initialisation Algorithm
- Randomly select any data point as one of the centroids.
- Calculate the distance of all other data points from this centroid.
- Select the farthest data point as the next centroid.
- Repeat the 2 and 3 steps until you get k centroids.
We use the farthest distance to select the next centroid since the probability of a data point being a centroid is directly proportional to distance. Hence higher the distance, the higher the probability.
K-means++ only changes the initialization process. All the other steps for clustering remain the same as in K-means.
Effect of KMeans++ on Model Performance
In the K-means algorithm, the initial values of centroids are randomly selected. This may cause a problem since K-means are highly sensitive to the initialization process.
K-means++ helps improve the performance of K-means by ensuring a smarter initialization technique. Having a separate algorithm for centroid initialization seems an additional task but it overall reduces the time and complexity of clustering.