Clustering Algorithms Data Scientists Need to Know | Clustering in Machine Learning

In the previous topics on machine learning algorithms, we covered supervised algorithms, which use labeled data to train the model. In some cases, however, labeled data is not available. Algorithms that train models on unlabeled datasets are known as unsupervised machine learning algorithms.

Unsupervised algorithms cannot be applied directly to regression or classification problems because the data has no corresponding target variable. Instead, they find patterns in the available data and group the data points into categories so that points similar to each other fall into the same category.

What is Clustering in Machine Learning?

Clustering is an unsupervised machine learning technique. It is the process of dividing data points into groups such that points in the same group have similar properties and points in different groups have dissimilar properties in some sense. It is basically a grouping of objects based on the similarity and dissimilarity between them.

Clusters are the individual groups created using unsupervised machine learning.

Why Do We Need Clustering?

Many businesses use cluster analysis to identify consumers who are similar to each other so that the same actions can be taken for a cluster. These groups are created based on their properties. Each group differs from the others in some way.

For example, retail companies often use clustering to identify groups of households that are similar to each other so that the same kind of offers can be given to consumers in the same group.

Streaming services often use cluster analysis to identify viewers with similar behavior so that the same movie or series can be suggested to viewers in the same group.

Types of Clustering

  1. Flat Clustering
    1. K-Means
    2. K-Medoids
  2. Hierarchical Clustering
    1. Top-Down Clustering
      • Divisive
    2. Bottom-Up Clustering
      • Agglomerative Clustering
  3. Density Based Clustering
    1. DBSCAN Clustering

Flat Clustering

Flat clustering creates k clusters (groups), where k is decided by the user in advance. It is the simplest type of clustering: the clusters are created without any explicit structure that relates them to each other.

Types of Flat Clustering

K-Means and K-Medoids are the two most popular types of flat clustering.

1. K-Means

The k-means algorithm starts by randomly selecting k points as centroids, and every other data point is assigned to its nearest centroid. Each centroid is then updated to the mean of the points assigned to it, and the process is repeated until the centroids stop moving.

k (the number of clusters) in k-means is selected by the user in advance, and the mean of the data points in a cluster is taken as the centroid of that cluster.
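
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the sample points are all assumptions made for the example, and Euclidean distance is assumed as the similarity measure.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster: the centroids have stopped moving.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical sample data: two small, well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = k_means(points, k=2)
```

On this sample data the first three points end up in one cluster and the last three in the other. Which of the two clusters is numbered 0 depends on the random initial centroids.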

2. K-Medoids

The k-medoids algorithm starts by randomly selecting k data points as medoids, and every other data point is assigned to its nearest medoid. The medoids are then updated by trying other data points as candidate medoids and keeping a candidate whenever it lowers the total distance between the points and their medoids. This process is repeated until the medoids stop changing.

k (the number of clusters) in k-medoids is selected by the user in advance. The medoid of a cluster is the actual data point with the smallest total distance to the other points in that cluster, so unlike a k-means centroid it is always a member of the dataset.
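
A rough sketch of k-medoids in plain Python follows. To stay short it uses the simpler alternating variant (assign points, then re-pick the best representative inside each cluster) rather than a full search that tries every data point as a replacement. The names `assign` and `k_medoids`, the `max_iter` and `seed` parameters, and the sample points are assumptions made for this example, and Euclidean distance is assumed.

```python
import math
import random


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: math.dist(p, medoids[c]))
        for p in points
    ]


def k_medoids(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # only possible with duplicate points
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total
            # distance to all other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(math.dist(m, q) for q in members))
            )
        if new_medoids == medoids:  # medoids stopped changing
            break
        medoids = new_medoids
    return assign(points, medoids), medoids


# Hypothetical sample data.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, medoids = k_medoids(points, k=2)
```

Every returned medoid is one of the input points, which is the key difference from a k-means centroid. Because this variant only improves medoids within the current clusters, it can settle on a poor grouping when the random starting medoids are unlucky, so running it with several seeds and keeping the result with the lowest total distance is a common precaution.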

Advantages of Flat Clustering

  1. Simple to implement.
  2. Flat clustering is computationally fast for a small value of k (the number of clusters).
  3. With a large number of variables, K-Means may be computationally faster than hierarchical clustering.
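  4. Can be generalized to clusters of different shapes and sizes, for example elliptical clusters, although plain k-means works best when the clusters are roughly spherical and of similar size.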

Disadvantages of Flat Clustering

  1. The number of clusters must be defined in advance.
  2. The initial values of the centroid have a huge impact on the final results.

Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is a type of clustering in which the data is divided into groups arranged in a tree-like structure.

Types of Hierarchical Clustering

There are two approaches to hierarchical clustering: top-down and bottom-up.

1. Top-Down Approach

In top-down (divisive) hierarchical clustering, the complete data is considered one cluster (represented by the root), which is split into groups as one moves down the tree until every data point becomes a cluster by itself.

2. Bottom-Up Approach

In bottom-up (agglomerative) hierarchical clustering, each data point starts as a separate cluster (represented by a leaf node), and clusters are merged as one moves up the tree until all data points form a single cluster.
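
The bottom-up procedure can be illustrated with a short, unoptimized sketch. The text above does not say how the distance between two clusters is measured, so single linkage (distance between the closest pair of members) is assumed here purely for illustration. The function name `agglomerative`, the `n_clusters` parameter, and the sample points are likewise hypothetical.

```python
import math


def agglomerative(points, n_clusters=1):
    """Bottom-up clustering with single linkage (an illustrative choice)."""
    # Start with every point in its own cluster (the leaves of the tree).
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history, i.e. the tree read from leaves to root

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two closest clusters...
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        # ...and merge them (b > a, so removing b leaves index a valid).
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters, merges


# Hypothetical sample data; clusters are reported as lists of point indices.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
# clusters == [[0, 1], [2, 3], [4]]
```

Calling `agglomerative(points)` with the default `n_clusters=1` continues merging up to the root, and `merges` then records the whole tree. Stopping early at a chosen number of clusters corresponds to cutting that tree at a given level.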

Advantages of Hierarchical Clustering

  1. Hierarchical clustering can divide the complete data into clusters without knowing the number of clusters in advance.
  2. It can draw inferences from a given dataset on its own, without labeled examples.

Disadvantages of Hierarchical Clustering

  1. Highly time-consuming process in case of large datasets.
  2. Sensitive to outliers.

Density Based Clustering

In this method, clusters are created based on the density of the data: points in dense regions are grouped into one cluster, whereas points in sparse regions separate the clusters and are typically treated as noise.

DBSCAN Clustering

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This algorithm works on the principle that clusters are dense regions separated by regions of lower density. It identifies clusters by estimating the local density around each data point.
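
A compact sketch of the idea is shown below. It assumes Euclidean distance, counts a point as part of its own neighborhood, and uses a brute-force neighbor search, so it is only suitable for small examples. The function name `dbscan`, the `NOISE` marker, and the sample data are assumptions for this illustration; `eps` is the neighborhood radius and `min_pts` the minimum number of points that make a region dense.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also a core point: keep expanding
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3),
          (4.5, 0.5)]
labels = dbscan(points, eps=1.0, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The isolated point has no neighbors within `eps`, so it is labelled as noise instead of being forced into a cluster, and the number of clusters was never specified.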

Advantages of Density-Based Clustering

  1. It is robust to outliers.
  2. No need to specify the number of clusters in advance.
  3. Easy to cluster arbitrary shapes.

Disadvantages of Density-Based Clustering

  1. Fails to identify clusters if the data points are highly sparse.
  2. DBSCAN has two parameters, eps (the neighborhood radius) and minPts (the minimum number of points needed to form a dense region), and finding good values for them is difficult.

Two Types of Clustering

  1. Hard Clustering
  2. Soft Clustering

1. Hard Clustering

In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.

2. Soft Clustering

In soft clustering, instead of putting each data point into a single cluster, each point is assigned a probability or likelihood of belonging to each cluster.
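
The difference between the two can be shown with a small sketch that assigns a single point to a set of fixed cluster centers. The soft weights here are computed as a normalized Gaussian-style score of the distance to each center, which is one possible choice assumed for illustration and not a method prescribed above. The function names, the `scale` parameter, and the sample centers are all hypothetical.

```python
import math


def hard_assign(point, centers):
    """Hard clustering: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def soft_assign(point, centers, scale=1.0):
    """Soft clustering: return one membership probability per center.

    Assumption: weight is proportional to exp(-d^2 / (2 * scale^2)),
    where d is the Euclidean distance to the center.
    """
    sq = [math.dist(point, c) ** 2 for c in centers]
    m = min(sq)  # subtract the minimum for numerical stability
    weights = [math.exp(-(s - m) / (2 * scale ** 2)) for s in sq]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical cluster centers.
centers = [(0.0, 0.0), (4.0, 0.0)]

near_first = soft_assign((1.0, 0.0), centers)  # mostly cluster 0
halfway = soft_assign((2.0, 0.0), centers)     # [0.5, 0.5]
hard = hard_assign((1.0, 0.0), centers)        # 0
```

A point halfway between the two centers gets equal membership in both under soft clustering, while hard clustering has to pick exactly one of them.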

Applications of Clustering

Clustering has applications in various domains, such as retail for customer segmentation, insurance for target marketing, and medicine for image processing. Below are a few applications of clustering:

  • Movie recommender systems.
  • Email marketing.
  • Sales and Marketing.
  • Identifying fraudulent or criminal activity.

Issues with the Unsupervised Modeling Approach

These are some issues you may encounter when applying unsupervised machine learning techniques:

  • Model complexity increases with an increase in the number of features.
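  • Since the training data is not labeled in advance, there is no ground truth to check against, so the results may be less accurate and are harder to validate.
  • Model training can take a long time because the algorithm has to explore many possible groupings of the data.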