K-Means is one of the most popular unsupervised machine learning algorithms used for clustering. The major challenge is selecting a suitable value of k (the number of clusters). The Elbow Curve and Silhouette Analysis are two methods for estimating the optimal value of k.
Silhouette Analysis is based on the principle that points in the same cluster should be closer to one another than to points in other clusters.
Silhouette Coefficient
The silhouette coefficient s[i] measures how similar point i is to the other points in its own cluster compared to the points in other clusters:
s[i] = (b[i] - a[i]) / max(a[i], b[i])
where
1. a[i] = average distance of point i from all the other points in the same cluster.
2. b[i] = average distance of point i from all the points in the nearest neighboring cluster, i.e. the smallest such average over all other clusters.
3. s[i] = silhouette coefficient (also called the silhouette score) of point i.
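The definitions above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper name silhouette_coefficient, the Euclidean distance, the toy data, and the convention of returning 0 for a point that is alone in its cluster are assumptions, not part of the original definitions. It also assumes the data contain at least two clusters.

```python
import numpy as np

def silhouette_coefficient(X, labels, i):
    """Return (a_i, b_i, s_i) for point i, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X - X[i], axis=1)  # distance from point i to every point

    # b[i]: smallest average distance to the points of any other cluster
    other_clusters = [c for c in np.unique(labels) if c != labels[i]]
    b_i = min(dists[labels == c].mean() for c in other_clusters)

    # a[i]: average distance to the other points of the same cluster
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    if not same.any():
        return 0.0, b_i, 0.0  # assumed convention: a lone point gets s = 0
    a_i = dists[same].mean()

    s_i = (b_i - a_i) / max(a_i, b_i)
    return a_i, b_i, s_i

# Toy data (hypothetical): two clusters of two points each
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
a, b, s = silhouette_coefficient(X, labels, 0)
print(a, b, s)
```

For point 0, a is the distance to its single cluster mate and b is the mean distance to the two points of the other cluster, so s comes out positive and fairly close to 1.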
Silhouette Analysis
Case 1: If b[i] > a[i], the average distance of point i from the points of the neighboring cluster is greater than its average distance from the points in its own cluster.
Here max(a[i], b[i]) = b[i], so s[i] = (b[i] - a[i]) / b[i] = 1 - a[i]/b[i], which is positive. If b[i] >> a[i], then a[i]/b[i] approaches 0 and s[i] approaches 1. Hence the maximum value of the silhouette coefficient is 1.
Case 2: If b[i] = a[i], the average distance of point i from the points of the neighboring cluster is equal to its average distance from the points in its own cluster. The numerator b[i] - a[i] is 0, so s[i] = 0.
Case 3: If b[i] < a[i], the average distance of point i from the points of the neighboring cluster is less than its average distance from the points in its own cluster.
Here max(a[i], b[i]) = a[i], so s[i] = (b[i] - a[i]) / a[i] = b[i]/a[i] - 1, which is negative. If b[i] << a[i], then b[i]/a[i] approaches 0 and s[i] approaches -1. Hence the minimum value of the silhouette coefficient is -1.
The value of the silhouette coefficient therefore lies between -1 and +1: -1 <= s[i] <= 1.
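The three cases can be checked numerically with a small sketch. The function name silhouette and the sample values of a and b are made up for illustration; the formula is s = (b - a) / max(a, b).

```python
def silhouette(a, b):
    """Silhouette coefficient from the two average distances a and b."""
    return (b - a) / max(a, b)

# Case 1: b much larger than a, so s is close to +1
case_far = silhouette(a=1.0, b=1000.0)

# Case 2: b equal to a, so s is exactly 0
case_equal = silhouette(a=2.0, b=2.0)

# Case 3: b much smaller than a, so s is close to -1
case_near = silhouette(a=1000.0, b=1.0)

print(case_far, case_equal, case_near)
```

Whatever positive values are plugged in, the result stays inside the interval from -1 to +1, because the numerator can never exceed the denominator in absolute value.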
Points to Note in Silhouette Analysis
- The best value of the silhouette coefficient is +1. It means that within-cluster data points are very compact and the points in one cluster are far from the data points in other clusters.
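- The worst value of the silhouette coefficient is -1. It means that a point is, on average, closer to the points of a neighboring cluster than to those of its own cluster, so it has probably been assigned to the wrong cluster.
- A silhouette coefficient of 0 denotes overlapping clusters: the point lies on or near the boundary between two clusters.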
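To see these points on actual numbers, the per-point coefficients can be inspected with scikit-learn's silhouette_samples. The sketch below is only an illustration: the six points and their labels are invented, and point 2 is deliberately given the wrong label so that its coefficient comes out negative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical data: two tight groups of three points each
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # group near the origin
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # group near (10, 10)
])

# Point 2 sits in the first group but is labelled as part of the second
labels = np.array([0, 0, 1, 1, 1, 1])

scores = silhouette_samples(X, labels)  # one coefficient per point

for idx, score in enumerate(scores):
    flag = "possibly misassigned" if score < 0 else ""
    print(idx, round(float(score), 3), flag)
```

Points with a negative coefficient are the ones worth examining first when a clustering looks suspicious.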
Optimal Value of k using Silhouette Score
In the graph below, the average silhouette score is highest at k=3. Hence the number of clusters, i.e. the value of k, should be 3.
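A selection like this could be reproduced roughly as follows with scikit-learn. Treat it as a sketch under stated assumptions: the synthetic data from make_blobs (three well-separated centers), the candidate range k = 2 to 6, and the KMeans settings are all chosen for illustration and are not the data behind the graph.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three well-separated blobs in 2D
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean of s[i] over all points

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(float(score), 3))
print("best k:", best_k)
```

Plotting scores against k gives the kind of curve described above, with the peak marking the suggested number of clusters.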