A Simple Explanation - By Varsha Saini

KMeans is an unsupervised machine-learning algorithm used for clustering. K-means++ is a modified version of the K-means algorithm in which the centroid initialization process is updated.

In K-means, the initial value of the centroid is randomly selected. K-means++ rather chooses an initialization algorithm to initialize the centroid with values such that the overall model performance improves.

KMeans++ Initialisation Algorithm

  1. Randomly select any data point as one of the centroids.
  2. Calculate the distance of all other data points from this centroid.
  3. Select the farthest data point as the next centroid.
  4. Repeat the 2 and 3 steps until you get k centroids.

We use the farthest distance to select the next centroid since the probability of a data point being a centroid is directly proportional to distance. Hence higher the distance, the higher the probability.

K-means++ only changes the initialization process. All the other steps for clustering remain the same as in K-means.

Effect of KMeans++ on Model Performance

In the K-means algorithm, the initial values of centroids are randomly selected. This may cause a problem since K-means are highly sensitive to the initialization process.

K-means++ helps improve the performance of K-means by ensuring a smarter initialization technique. Having a separate algorithm for centroid initialization seems an additional task but it overall reduces the time and complexity of clustering.