Frequently Asked Questions

How to Treat Outliers?

Treating outliers is an important step in data preprocessing, especially when building machine learning models. Outliers are data points that significantly deviate from the majority of the data, and they can have a disproportionate impact on model performance. Here are several common techniques for handling outliers:

1. Identify Outliers

  • Before treating outliers, you need to identify them. This can be done through various methods, such as visualizations (box plots, scatter plots), statistical tests (z-scores, IQR), or domain knowledge.

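For instance, here is a minimal sketch of the IQR and z-score checks in Python; the sample data and thresholds (1.5 * IQR, |z| > 3) are illustrative, not universal rules:

```python
# A minimal sketch of two common outlier checks; data and thresholds
# are illustrative.
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95, 10, 12, 11])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(iqr_outliers)  # flags the value 95

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]
```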

2. Remove Outliers

  • One straightforward approach is to remove outliers from the dataset. This should be done cautiously and only when you are confident that the outliers are due to errors or anomalies in the data.
  • Be aware that removing outliers can result in a loss of valuable information, so it should be done judiciously.

When can you remove outliers?

Below are situations when you can remove outliers:

  1. When outliers are likely due to data entry errors: If you have reason to believe that certain data points are erroneous, removing them can improve the quality of your dataset.
  2. When outliers are far from the expected range: If you know that your data should fall within a certain range and any data outside that range is likely an outlier, you can remove those values.
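Here is a hedged sketch of IQR-based removal with pandas; the DataFrame and its "price" column are assumptions made for the example:

```python
# A sketch of dropping rows outside the IQR fences; the column name
# "price" is assumed for illustration.
import pandas as pd

df = pd.DataFrame({"price": [100, 105, 98, 102, 9999, 101, 97]})

q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences; the row with 9999 is dropped.
df_clean = df[df["price"].between(lower, upper)]
```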

3. Transform Data

  • Transformations such as the log or square root can compress extreme values, making the data more normally distributed and reducing the influence of outliers on certain algorithms.
  • Box-Cox and Yeo-Johnson are other transformation methods to consider.

When can you apply transformations to treat outliers?

Below are cases when you can use transformation techniques to treat outliers:

  1. When the data is highly skewed: Log transformation can be effective in reducing the impact of extreme values in positively skewed data.
  2. When the data represents exponential growth: Log transformation can linearize exponential growth patterns.
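Below is a minimal sketch of these transformations; the data is illustrative and strictly positive so that the log and Box-Cox transforms are valid:

```python
# A minimal sketch of common variance-reducing transformations.
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([1.0, 2.0, 3.0, 4.0, 500.0]).reshape(-1, 1)

log_x = np.log1p(x)   # log transform (log1p also handles zeros safely)
sqrt_x = np.sqrt(x)   # square root transform

# Yeo-Johnson also works with zero and negative values; Box-Cox needs x > 0.
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)
x_bc = PowerTransformer(method="box-cox").fit_transform(x)
```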

4. Winsorization

  • Winsorization replaces extreme values with a specified percentile value. For example, you can replace values above the 95th percentile with the value at the 95th percentile and values below the 5th percentile with the value at the 5th percentile.

When can you apply winsorization to treat outliers?

Below are cases when you can use the winsorization technique to treat outliers:

  • When you want to keep the data distribution intact: Winsorization can be useful when you want to address outliers but still maintain the original data distribution to some extent.
  • When you have some tolerance for outliers: Winsorization allows you to set a threshold for how extreme values should be replaced, making it a flexible approach.
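Here is a short sketch using SciPy's winsorize; on this tiny sample the limits are set to 10% per tail (rather than 5%) so the replacement is actually visible:

```python
# A short sketch of winsorization; limits are illustrative and should be
# chosen for your own data.
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Replace the lowest 10% with the next-lowest retained value and the
# highest 10% with the next-highest (here: 1 -> 2 and 100 -> 9).
x_wins = winsorize(x, limits=(0.10, 0.10))
```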

5. Scaling

  • Standardizing or scaling your data (e.g., Z-score or Min-Max scaling) puts features on a comparable scale, which limits how strongly an outlying feature dominates scale-sensitive algorithms. Note that Min-Max scaling is itself sensitive to outliers (they define the min and max), so a robust scaler based on the median and IQR is often preferable when outliers are present.

When can you apply scaling to treat outliers?

Below are cases when you can use scaling to treat outliers:

  • When using algorithms that are sensitive to feature scale: Scaling (e.g., Z-score scaling) can help mitigate the impact of outliers when using methods like gradient descent in linear models.
  • In clustering algorithms: Scaling can be important for distance-based clustering algorithms like K-Means.
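Below is a minimal sketch contrasting standard scaling with scikit-learn's median/IQR-based RobustScaler; the data is illustrative:

```python
# A minimal sketch: RobustScaler centers on the median and scales by the
# IQR, so a single outlier distorts it far less than StandardScaler.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_std = StandardScaler().fit_transform(X)  # mean and std are pulled by 1000
X_rob = RobustScaler().fit_transform(X)    # median and IQR are not
```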

6. Use Robust Algorithms

  • Some machine learning algorithms are inherently robust to outliers. For example, tree-based algorithms and support vector machines are less affected by outliers compared to linear models like linear regression.

When should you use robust algorithms to treat outliers?

Below are cases when you can use robust algorithms to treat outliers:

  • When you have many outliers in your data: Algorithms like Random Forests, Decision Trees, or Robust Regression (e.g., RANSAC) can handle outliers well and may not require explicit outlier treatment.
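Here is a hedged sketch of RANSAC regression on synthetic data; RANSAC repeatedly fits on random subsets and keeps the model with the largest set of consenting inliers:

```python
# A sketch of RANSAC on synthetic data with injected outliers.
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, size=100)
y[:10] += 50  # inject gross outliers

ransac = RANSACRegressor(random_state=0).fit(X, y)
print(ransac.estimator_.coef_)       # close to 3 despite the outliers
print((~ransac.inlier_mask_).sum())  # points the model treated as outliers
```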

7. Feature Engineering

  • Creating new features that capture the relationships between variables differently can sometimes help models be less sensitive to outliers.

When should you use feature engineering methods to treat outliers?

Below are cases when you can use feature engineering methods to treat outliers:

  • When you can create new features that capture outlier information: Feature engineering can sometimes create new variables that are less sensitive to outliers, such as using percentiles.
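Below is a minimal sketch of a percentile-rank feature; the "income" column is an assumption for the example. Ranks are bounded, so only the ordering of an outlier matters, not its magnitude:

```python
# A minimal sketch of a rank-based feature; column names are illustrative.
import pandas as pd

df = pd.DataFrame({"income": [30_000, 45_000, 52_000, 61_000, 5_000_000]})

# Percentile rank maps every value into (0, 1]; the outlier just gets ~1.0
# instead of dominating the feature's scale.
df["income_pct_rank"] = df["income"].rank(pct=True)
```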

8. Domain Knowledge

  • Understanding the domain and the problem context can guide outlier treatment. Sometimes, what might look like an outlier could be a valid data point in a specific context.

When is domain knowledge enough to treat outliers?

  • When you understand the domain and context of the data: Your knowledge of the problem can help you make informed decisions about how to treat outliers based on their relevance to the problem.

9. Clipping (Capping)

  • Similar to winsorization, you can clip (or cap) extreme values by setting a threshold beyond which any data point is replaced with the threshold value.

What is the difference between winsorization and clipping?

Winsorization replaces extreme values with values at specific percentiles, so it retains some information about the original data distribution, whereas clipping replaces extreme values with a fixed, pre-set threshold, without regard for the data distribution.

When should you use clipping methods to treat outliers?

Below are cases when you can use clipping methods to treat outliers:

  • When you want to cap extreme values without removing them: This is useful when you want to retain the information from extreme values while reducing their impact.
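Here is a short sketch of fixed-threshold clipping with NumPy; the bounds [0, 100] stand in for known domain limits and are purely illustrative:

```python
# A short sketch of clipping at fixed domain bounds.
import numpy as np

x = np.array([-20, 1, 2, 3, 4, 5, 250])

# Any value outside [0, 100] is replaced by the nearest bound.
x_clipped = np.clip(x, 0, 100)  # -> [0, 1, 2, 3, 4, 5, 100]
```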

10. Anomaly Detection Algorithms

  • Consider using anomaly detection techniques such as Isolation Forest, One-Class SVM, or clustering-based methods to specifically identify and treat outliers.

When should you use anomaly detection algorithms to treat outliers?

  • When the goal is specifically to identify and isolate outliers: Anomaly detection techniques like Isolation Forest or One-Class SVM are designed for this purpose and can be very effective.
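Here is a hedged sketch of Isolation Forest on synthetic data; the contamination parameter (the expected outlier fraction) is an assumption you must set or estimate for your own data:

```python
# A sketch of Isolation Forest; `contamination` is an assumed outlier rate.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(95, 2)),
               rng.uniform(6, 8, size=(5, 2))])  # 5 planted outliers

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = iso.predict(X)   # -1 = outlier, +1 = inlier
outliers = X[labels == -1]
```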

11. Binning or Discretization

  • Transforming continuous data into discrete bins can help mitigate the impact of outliers, especially in decision tree-based models.
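Below is a minimal sketch of equal-frequency (quantile) binning with pandas; the bin count and labels are illustrative:

```python
# A minimal sketch of quantile binning; the outlier 1000 simply lands in
# the top bin, so its magnitude no longer matters.
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
binned = pd.qcut(s, q=4, labels=["q1", "q2", "q3", "q4"])
```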

12. Model-Based Detection and Correction

  • Some machine learning models can detect and handle outliers as part of their learning process. For example, robust regression techniques like RANSAC can automatically down-weight or discard outliers.
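As an illustration, here is a minimal sketch of Huber regression (scikit-learn's HuberRegressor), which down-weights rather than discards large residuals; the data is synthetic:

```python
# A minimal sketch of Huber regression on synthetic corrupted data.
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 0.3, size=100)
y[::20] += 40  # corrupt a few targets

huber = HuberRegressor().fit(X, y)
print(huber.coef_)  # close to 2 despite the corrupted points
```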

The choice of method depends on the nature of your data, the problem you’re trying to solve, and the impact of outliers on your specific application. It’s often a good practice to try multiple approaches and evaluate their effects on model performance to determine the most suitable outlier treatment method for your dataset and task.

