In machine learning we often encounter the problem of an imbalanced dataset, where the majority class has more influence during training and can hurt the model's performance on the minority class. A model will generally perform better if the training data is balanced.
- Balanced dataset: a dataset in which every class in the target output appears in roughly equal numbers.
- Imbalanced dataset: a dataset in which there is a large difference between the counts of the different classes in the target output.
When you have an imbalanced dataset, there are two ways to balance it:
- Oversampling: generating more data for the minority class.
- Undersampling: removing data from the majority class.
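The two approaches can be illustrated with plain random resampling, the simplest form of each. A minimal sketch using NumPy, on a hypothetical 90/10 two-class dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority (class 0), 10 minority (class 1)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Random oversampling: duplicate minority rows (with replacement)
# until they match the majority count
over_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
X_over = np.vstack([X[majority_idx], X[over_idx]])
y_over = np.concatenate([y[majority_idx], y[over_idx]])

# Random undersampling: keep only as many majority rows as there are minority rows
under_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
X_under = np.vstack([X[under_idx], X[minority_idx]])
y_under = np.concatenate([y[under_idx], y[minority_idx]])

print(np.bincount(y_over))   # both classes now 90
print(np.bincount(y_under))  # both classes now 10
```

Random oversampling only duplicates existing points, which is exactly the weakness SMOTE addresses by synthesising new ones.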
SMOTE stands for Synthetic Minority Oversampling Technique. It is an oversampling technique used to create synthetic data points for the minority class.
Working of SMOTE Algorithm
- A sample from the minority class is selected.
- K Nearest Neighbours is applied within the minority class for some value of K.
- Line segments are drawn from that sample to each of its K nearest minority neighbours.
- A random point is selected along one of those segments; that point becomes a new synthetic sample for the minority class.
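The steps above can be sketched with NumPy and scikit-learn's NearestNeighbors. This is a minimal illustration, not the reference implementation; the function name smote_sketch and the toy data are made up for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # the first neighbour of each point is the point itself, so drop column 0
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))            # pick a minority sample
        nb = X_min[rng.choice(neighbours[j])]   # pick one of its k neighbours
        gap = rng.random()                      # random point along the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

X_min = rng.normal(size=(10, 2))        # toy minority class
new = smote_sketch(X_min, n_new=80)
print(new.shape)                        # 80 new synthetic samples
```

Every synthetic point lies on a segment between two existing minority points, which is also the root of the limitations discussed next.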
Limitations of SMOTE
- It only works with continuous features; it is not designed to generate categorical synthetic data.
- Every synthetic sample lies on a line segment between two existing minority samples, which can bias the generated data and lead to an overfitted model.
Borderline SMOTE
In this variant, synthetic data is generated only from the minority samples that lie near the border separating the minority class from the other classes.
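The key step in Borderline SMOTE is identifying which minority samples sit on the border. A minimal sketch of that step, assuming NumPy and scikit-learn (the toy clusters and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy data: a majority cluster and a minority cluster that partly overlap
X_maj = rng.normal(loc=0.0, size=(60, 2))
X_min = rng.normal(loc=2.0, size=(15, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 60 + [1] * 15)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
# neighbours are looked up in the FULL dataset; drop the self-match in column 0
neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]

# Fraction of majority points among each minority sample's k neighbours
maj_frac = (y[neigh] == 0).mean(axis=1)

# "Danger" points sit on the border: at least half (but not all) of their
# neighbours are majority. Only these are used as interpolation bases.
danger = X_min[(maj_frac >= 0.5) & (maj_frac < 1.0)]
print(len(danger), "borderline minority samples out of", len(X_min))
```

Ordinary SMOTE interpolation is then applied starting only from the danger points.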
ADASYN (Adaptive Synthetic) SMOTE
In this variant, more synthetic data is generated for minority samples that lie in regions dominated by the majority class, i.e. the samples that are harder to learn, instead of simply copying the existing minority distribution uniformly.
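ADASYN's adaptive part is the weighting: each minority sample gets a synthetic-sample budget proportional to how many majority points surround it. A minimal sketch of that weighting, assuming NumPy and scikit-learn (the toy data and rounding scheme are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

X_maj = rng.normal(loc=0.0, size=(80, 2))
X_min = rng.normal(loc=1.5, size=(20, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 80 + [1] * 20)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self-match

# r_i: fraction of majority neighbours around each minority sample.
# Higher r_i means the sample is harder to learn.
r = (y[idx] == 0).mean(axis=1)

# Normalise into a distribution and allocate the synthetic budget
# proportionally, so harder regions get more synthetic samples.
n_to_generate = len(X_maj) - len(X_min)   # 60 samples needed for balance
if r.sum() > 0:
    per_sample = np.round(r / r.sum() * n_to_generate).astype(int)
else:
    per_sample = np.full(len(X_min), n_to_generate // len(X_min))
print(per_sample.sum(), "synthetic samples, weighted toward hard points")
```

After the budget is allocated, each minority sample generates its share of synthetic points by the same neighbour-interpolation step as plain SMOTE.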