Silhouette Analysis

A Simple Explanation - By Varsha Saini

K-Means is one of the most popular unsupervised machine learning algorithms used for clustering. The major challenge is selecting a suitable value of k (the number of clusters). The Elbow Curve and Silhouette Analysis are two common methods for estimating a good value of k.

Silhouette Analysis is based on the principle that points in the same cluster should be closer to each other than to points in other clusters.

Silhouette Coefficient

The silhouette coefficient s[i] measures how similar point i is to the other points in its own cluster compared with the points in other clusters:

s[i] = (b[i] - a[i]) / max(a[i], b[i])

where

1. a[i] = average distance from point i to all the other points in the same cluster.


2. b[i] = average distance from point i to all the points in the nearest neighboring cluster, i.e. the other cluster for which this average is smallest.


3. s[i] = silhouette coefficient (also called the silhouette score) of point i.
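The three quantities above can be computed by hand for a tiny data set. The following is a minimal sketch in plain Python, assuming Euclidean distance and the usual definition s[i] = (b[i] - a[i]) / max(a[i], b[i]). The function name silhouette_values and the four sample points are made up for illustration, and a point that is alone in its cluster is given s[i] = 0 by convention.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def silhouette_values(points, labels):
    """Return three lists (a, b, s) with one entry per point."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    a_vals, b_vals, s_vals = [], [], []
    for i, label in enumerate(labels):
        # b[i]: smallest mean distance from point i to any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        # a[i]: mean distance from point i to the other points of its cluster.
        same = [j for j in clusters[label] if j != i]
        if same:
            a = sum(dist(points[i], points[j]) for j in same) / len(same)
            s = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        else:
            a, s = 0.0, 0.0  # assumption: singleton clusters score 0
        a_vals.append(a)
        b_vals.append(b)
        s_vals.append(s)
    return a_vals, b_vals, s_vals


# Hypothetical toy data: two tight pairs of points, far apart.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
a, b, s = silhouette_values(points, labels)
for i in range(len(points)):
    print(f"point {i}: a={a[i]:.3f} b={b[i]:.3f} s={s[i]:.3f}")
```

Because each pair is compact and the two pairs are far apart, every point ends up with a coefficient well above 0.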

Silhouette Analysis

Case 1: If b[i]>a[i], the average distance of point i from other cluster data points is greater than the average distance of point i from all the points in the same cluster.

If b[i] >> a[i], then a[i]/b[i] approaches 0 and s[i] = 1 - a[i]/b[i] approaches 1. Hence the maximum value of the silhouette coefficient is 1.

Case 2: If b[i]=a[i], the average distance of point i from other cluster data points is equal to the average distance of point i from all the points in the same cluster. In this case s[i]=0.


Case 3: If b[i]<a[i], the average distance of point i from other cluster data points is less than the average distance of point i from all the points in the same cluster.

If b[i] << a[i], then b[i]/a[i] approaches 0 and s[i] = b[i]/a[i] - 1 approaches -1. Hence the minimum value of the silhouette coefficient is -1.

-1 <= s[i] <= 1
The value of the silhouette coefficient lies between -1 and +1.
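The three cases can be checked numerically. This is a small sketch assuming the usual definition s[i] = (b[i] - a[i]) / max(a[i], b[i]). The helper name silhouette_from_ab and the sample distances are chosen only for illustration.

```python
def silhouette_from_ab(a, b):
    """Silhouette coefficient of one point from its a and b distances."""
    return (b - a) / max(a, b)


case_far = silhouette_from_ab(a=0.01, b=100.0)    # Case 1: b much larger than a
case_equal = silhouette_from_ab(a=2.0, b=2.0)     # Case 2: b equal to a
case_wrong = silhouette_from_ab(a=100.0, b=0.01)  # Case 3: b much smaller than a

print(case_far, case_equal, case_wrong)
```

The first value is just below 1, the second is exactly 0, and the third is just above -1, matching the bounds stated above.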

Points to Note in Silhouette Analysis

  1. The best value of the silhouette coefficient is +1. It means that within-cluster data points are very compact and the points in one cluster are far from the data points in other clusters.
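  2. The worst value of the silhouette coefficient is -1. It suggests that the point is closer to a neighboring cluster than to its own and may have been assigned to the wrong cluster.
  3. A silhouette coefficient near 0 denotes overlapping clusters: the point lies on or close to the boundary between two clusters.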

Optimal Value of k using Silhouette Score

In the graph below, the silhouette score is highest at k=3. Hence the number of clusters, i.e. the value of k, should be 3.
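A graph like this can be produced by running K-Means for several values of k and recording the mean silhouette score for each one. The sketch below is one way to do this, assuming scikit-learn and NumPy are installed. The synthetic data (three well-separated blobs), the range of k, and the variable names are assumptions made for illustration, not the data behind the graph above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: 100 points around each of three well-separated centers.
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# Mean silhouette score for each candidate k (the score needs k >= 2).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette score = {score:.3f}")
print("best k:", best_k)
```

Plotting scores against k (for example with matplotlib) gives the kind of curve described above. On this synthetic data the peak is expected at k=3, since the data was generated from three blobs.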