Data Imputation

A Simple Explanation - By Varsha Saini

Some values may be missing in the data used to train a machine learning model. Data imputation is the process of filling in the missing data with substitute values so that most of the information is retained.

Data imputation is required because many libraries that implement machine learning algorithms cannot work with missing values.

One way of dealing with missing values is to drop the rows that contain them, but this can lead to a loss of information. When many rows have at least one missing value, this method is often not feasible because it sharply reduces the size of the training data.
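As a rough illustration of how much data row-dropping can cost, here is a minimal sketch using pandas. The DataFrame and its column names (age, income, city) are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every column has at least one missing value.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "city": ["Pune", "Delhi", "Delhi", None, "Pune"],
})

# dropna() removes every row that contains at least one missing value.
dropped = df.dropna()

print(f"rows before: {len(df)}, rows after: {len(dropped)}")
```

In this toy data only one of the five rows is complete, so dropping rows discards most of the values that were actually observed.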

Choosing good substitutes for the missing values is vital, as they can affect the performance of the final model.

Some data imputation methods are:

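1. Fill in missing values with a summary statistic such as the mean, median, or mode.

Each missing value is replaced by the mean, median, or mode of the observed values of that feature. The mean and median apply to numeric features (the median is less sensitive to outliers), while the mode also works for categorical features.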
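Summary-statistic imputation can be sketched with pandas as follows. The data and column names are hypothetical, and which statistic goes with which column is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0, 25.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0, 58000.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

filled = df.copy()

# Numeric feature: replace missing values with the mean of the observed values.
filled["age"] = df["age"].fillna(df["age"].mean())

# Numeric feature: the median is less affected by extreme values.
filled["income"] = df["income"].fillna(df["income"].median())

# Categorical feature: use the most frequent category (the mode).
filled["city"] = df["city"].fillna(df["city"].mode()[0])

print(filled)
```

The statistics are computed from the observed values only, since pandas skips missing values by default in mean(), median(), and mode().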

2. Predict the missing values using a supervised ML model.

Treat the feature that has missing values as the target output and all other features as independent variables. The rows where the target value is available are used to train the model, and the trained model then predicts the missing values in the remaining rows.
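The idea can be sketched with scikit-learn. This is a minimal example on made-up data: the columns experience and salary are hypothetical, and linear regression is only one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: "salary" has missing values, "experience" is complete.
df = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "salary": [30.0, 35.0, np.nan, 45.0, np.nan, 55.0],
})

# Split into rows where the target is known and rows where it is missing.
known = df[df["salary"].notna()]
unknown = df[df["salary"].isna()]

# Train on the rows where the target value is available.
model = LinearRegression()
model.fit(known[["experience"]], known["salary"])

# Predict the missing target values and write them back.
imputed = df.copy()
imputed.loc[unknown.index, "salary"] = model.predict(unknown[["experience"]])

print(imputed)
```

Here the observed salaries happen to lie exactly on a straight line, so the predictions fall on the same line. With real data the predictions are only estimates, and the other features must themselves be free of missing values (or be imputed first) before they can be used as inputs.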

3. Add a “Missing” category.

For categorical variables, missing values can be replaced by a new category, “Missing”, so that the absence of a value is itself treated as a level of the feature.
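In pandas this amounts to a single fillna call. The column name city and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical feature with two missing entries.
df = pd.DataFrame({"city": ["Pune", None, "Delhi", None, "Pune"]})

# Replace every missing entry with the explicit label "Missing".
df["city_filled"] = df["city"].fillna("Missing")

# "Missing" now shows up as its own category in the counts.
print(df["city_filled"].value_counts())
```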

4. Use multiple imputation.

Instead of filling each missing value once, multiple imputation fills it several times, producing several completed versions of the same data. Each version is analyzed separately and the results are then combined, so that the final estimate reflects the uncertainty about the missing values rather than hiding it behind a single guess.
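One way to sketch multiple imputation is with scikit-learn's IterativeImputer, which is still marked experimental and needs an explicit enabling import. Setting sample_posterior=True makes each run draw imputed values at random, so different seeds give different completed datasets. The synthetic data, the number of imputations (m = 5), and the choice of the column mean as the quantity of interest are all assumptions made for this example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: the second column depends on the first, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
X = np.column_stack([x, y])

# Remove 40 values from the second column to simulate missing data.
X_missing = X.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 1] = np.nan

# Create m completed versions of the data, each with a different random seed.
m = 5
completed = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed.append(imputer.fit_transform(X_missing))

# Analyze each completed dataset separately (here: the mean of column 1) ...
estimates = np.array([data[:, 1].mean() for data in completed])

# ... then pool the results. The spread between the estimates shows how
# much the answer depends on the imputed values.
pooled_estimate = estimates.mean()
between_variance = estimates.var(ddof=1)

print(f"pooled mean: {pooled_estimate:.3f}")
print(f"between-imputation variance: {between_variance:.5f}")
```

A full analysis would also account for the variance within each completed dataset when pooling. This sketch only shows the average of the estimates and how much they vary between imputations.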