Frequently Asked Questions

How Can You Select k For K-means?

The Elbow Curve and Silhouette Analysis are two common methods for estimating a suitable value of k.

Elbow Curve

  • The Elbow curve is a graphical method for choosing a value of k for the K-means algorithm.
  • It uses WCSS (within-cluster sum of squares).
  • The data is divided into k clusters using the K-means clustering algorithm. WCSS is the sum of the squared distances between each data point and the centroid of the cluster it is assigned to.
  • The Elbow curve is created by plotting the number of clusters on the x-axis and WCSS on the y-axis.

    Image by towarddatascience
  • The point where the elbow forms, i.e. where adding more clusters stops reducing WCSS sharply, is selected as the value of k.
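The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data set, the helper names (`init_centroids`, `kmeans`, `wcss`) and the deterministic farthest-point seeding are all made up for this example, and the seeding replaces the random starts that K-means normally uses so that the output is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dist(point, centroids):
    """Squared distance from a point to its closest centroid."""
    return min(squared_distance(point, c) for c in centroids)


def init_centroids(points, k):
    # Assumption: deterministic farthest-point seeding instead of random starts.
    # Start from the first point, then keep adding the point that is farthest
    # from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: nearest_sq_dist(p, centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns the final list of centroids."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wcss(points, centroids):
    """Within-cluster sum of squares: squared distance to the nearest centroid."""
    return sum(nearest_sq_dist(p, centroids) for p in points)


# Hypothetical toy data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

# One WCSS value per candidate k; these are the y-values of the elbow curve.
wcss_by_k = {k: wcss(points, kmeans(points, k)) for k in range(1, 7)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

On this toy data WCSS falls steeply up to k=3 and changes little afterwards, which is the elbow. In practice a library implementation is the usual choice; for example, scikit-learn's `KMeans` exposes the WCSS of a fitted model as `inertia_`.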

Problem with Elbow Curve Method

In many real-world problems, the curve bends gradually and no clear elbow appears, so it is difficult to pick a value of k from the plot. In such cases, silhouette analysis is often more informative.

Image by towarddatascience

Silhouette Analysis

Silhouette Analysis is based on the principle that points in the same cluster should be closer to one another than to points in other clusters.

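The silhouette coefficient, or silhouette score, is used to choose k: the value of k with the highest average score is preferred. For a single point, the score is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.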

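  • The best value of the silhouette coefficient is +1. It means that the data points within a cluster are very compact and far from the data points in other clusters.
  • The worst value of the silhouette coefficient is -1. It means that points are, on average, closer to another cluster than to their own, i.e. they have probably been assigned to the wrong cluster.
  • A silhouette coefficient of 0 denotes overlapping clusters, with points lying on or near the boundary between two clusters.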
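A small sketch may make these values concrete. It computes the silhouette coefficient with the standard definition (b - a) / max(a, b) in plain Python and compares three candidate clusterings of a toy data set. The data, the hand-written label lists (standing in for the output of K-means at k=2, 3 and 4) and the function name `silhouette_values` are assumptions made for this illustration only.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_values(points, labels):
    """Silhouette coefficient of every point; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a point alone in its cluster scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster
        # (the point itself contributes 0 to the sum, hence len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical toy data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

# Hand-written labelings standing in for K-means results at each k.
candidate_labels = {
    2: [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],  # two groups merged
    3: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],  # one cluster per group
    4: [0, 0, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2],  # one group split in two
}

# Average silhouette score per k; the k with the highest average is preferred.
scores = {}
for k, labels in candidate_labels.items():
    values = silhouette_values(points, labels)
    scores[k] = sum(values) / len(values)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

Merging two groups (k=2) or splitting one (k=4) lowers the average score, so the labeling with one cluster per group wins. For real data, scikit-learn provides `sklearn.metrics.silhouette_score`, which returns the same kind of average over all points.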

In the example graph (silhouette score plotted against k), the silhouette score is highest at k=3. Hence the number of clusters, i.e. the value of k, should be 3.

Other Popular Questions