In machine learning we often encounter the problem of an imbalanced dataset, where the majority class has more influence during training and can hurt the model's performance on the minority class. A model will generally perform better if the training data is balanced.
- Balanced dataset: a dataset in which every class in the target output appears in roughly equal numbers.
- Imbalanced dataset: a dataset in which there is a large difference between the counts of the different classes in the target output.
When you have an imbalanced dataset, there are two ways to balance it:
- Oversampling: generating more data for the minority class.
- Undersampling: removing data from the majority class.
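The two approaches can be illustrated with plain random resampling, the simplest form of each. A minimal sketch using NumPy, on a hypothetical 90/10 two-class dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority (class 0), 10 minority (class 1)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Random oversampling: duplicate minority rows (with replacement)
# until they match the majority count
over_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
X_over = np.vstack([X[majority_idx], X[over_idx]])
y_over = np.concatenate([y[majority_idx], y[over_idx]])

# Random undersampling: keep only as many majority rows as there are minority rows
under_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
X_under = np.vstack([X[under_idx], X[minority_idx]])
y_under = np.concatenate([y[under_idx], y[minority_idx]])

print(np.bincount(y_over))   # both classes now 90
print(np.bincount(y_under))  # both classes now 10
```

Random oversampling only duplicates existing points, which is exactly the weakness SMOTE addresses by synthesising new ones.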
SMOTE stands for Synthetic Minority Oversampling Technique. It is an oversampling technique used to create synthetic data points for the minority class.
Working of SMOTE Algorithm
- A sample from the minority class is selected.
- K Nearest Neighbours is applied within the minority class for some value of K.
- Line segments are drawn from that sample to each of its K nearest minority neighbours.
- A random point is selected along one of those segments; that point becomes a new synthetic sample for the minority class.
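The steps above can be sketched with NumPy and scikit-learn's NearestNeighbors. This is a minimal illustration, not the reference implementation; the function name smote_sketch and the toy data are made up for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # the first neighbour of each point is the point itself, so drop column 0
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))            # pick a minority sample
        nb = X_min[rng.choice(neighbours[j])]   # pick one of its k neighbours
        gap = rng.random()                      # random point along the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

X_min = rng.normal(size=(10, 2))        # toy minority class
new = smote_sketch(X_min, n_new=80)
print(new.shape)                        # 80 new synthetic samples
```

Every synthetic point lies on a segment between two existing minority points, which is also the root of the limitations discussed next.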
Limitations of SMOTE
- It only works with continuous features; it is not designed to generate categorical synthetic data.
- Every synthetic sample lies on a line segment between two existing minority samples, which can bias the generated data and lead to an overfitted model.
Borderline SMOTE
In this variant, synthetic data is generated only from the minority samples that lie near the border separating the minority class from the other classes.
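The key step in Borderline SMOTE is identifying which minority samples sit on the border. A minimal sketch of that step, assuming NumPy and scikit-learn (the toy clusters and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy data: a majority cluster and a minority cluster that partly overlap
X_maj = rng.normal(loc=0.0, size=(60, 2))
X_min = rng.normal(loc=2.0, size=(15, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 60 + [1] * 15)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
# neighbours are looked up in the FULL dataset; drop the self-match in column 0
neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]

# Fraction of majority points among each minority sample's k neighbours
maj_frac = (y[neigh] == 0).mean(axis=1)

# "Danger" points sit on the border: at least half (but not all) of their
# neighbours are majority. Only these are used as interpolation bases.
danger = X_min[(maj_frac >= 0.5) & (maj_frac < 1.0)]
print(len(danger), "borderline minority samples out of", len(X_min))
```

Ordinary SMOTE interpolation is then applied starting only from the danger points.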
ADASYN (Adaptive Synthetic) SMOTE
In this variant, more synthetic data is generated for minority samples that lie in regions dominated by the majority class, i.e. the samples that are harder to learn, instead of simply copying the existing minority distribution uniformly.
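ADASYN's adaptive part is the weighting: each minority sample gets a synthetic-sample budget proportional to how many majority points surround it. A minimal sketch of that weighting, assuming NumPy and scikit-learn (the toy data and rounding scheme are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

X_maj = rng.normal(loc=0.0, size=(80, 2))
X_min = rng.normal(loc=1.5, size=(20, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 80 + [1] * 20)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self-match

# r_i: fraction of majority neighbours around each minority sample.
# Higher r_i means the sample is harder to learn.
r = (y[idx] == 0).mean(axis=1)

# Normalise into a distribution and allocate the synthetic budget
# proportionally, so harder regions get more synthetic samples.
n_to_generate = len(X_maj) - len(X_min)   # 60 samples needed for balance
if r.sum() > 0:
    per_sample = np.round(r / r.sum() * n_to_generate).astype(int)
else:
    per_sample = np.full(len(X_min), n_to_generate // len(X_min))
print(per_sample.sum(), "synthetic samples, weighted toward hard points")
```

After the budget is allocated, each minority sample generates its share of synthetic points by the same neighbour-interpolation step as plain SMOTE.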