What are Outliers?
Outliers are data points that differ significantly from the other data points in a dataset. An outlier is the odd one out, i.e., something unusual in comparison to the rest.
Why Is Outlier Analysis Important?
- Statistical measures such as the mean are highly affected by outliers.
- Outliers may skew the data and hence harm model performance.
- There are a few cases where outliers are very crucial like fraud detection. The points that seem different from other points are likely to be suspicious.
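To make the first point concrete, here is a minimal sketch using Python's standard `statistics` module. The numbers are hypothetical and chosen only for illustration; the point is that a single extreme value moves the mean a lot while the median barely changes.

```python
import statistics

# Hypothetical readings, invented for this illustration.
clean = [10, 12, 11, 13, 12]
with_outlier = clean + [100]  # same readings plus one extreme value

mean_clean = statistics.mean(clean)
median_clean = statistics.median(clean)
mean_with = statistics.mean(with_outlier)
median_with = statistics.median(with_outlier)

# The mean more than doubles, while the median stays at 12.
print("without outlier:", mean_clean, median_clean)
print("with outlier:   ", mean_with, median_with)
```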
What Causes Outliers?
- It may be a natural occurrence.
- It may be an error: a data entry error, an experimental error, or a human error.
How to Detect Outliers?
There are several methods to detect outliers.
- z-score
- Inter Quartile Range
- Plots
1. z-score
The z-score tells how many standard deviations a value lies from the mean. It can be used to find outliers using the formula below.
z-score = (x - μ) / σ, where μ = mean, σ = standard deviation, x = current value
Calculate Outlier using z-score
If z-score > 3 or z-score < -3 (that is, |z-score| > 3), the value is considered to be an outlier.
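The z-score rule can be sketched in a few lines of Python using only the standard library. The function name `zscore_outliers`, the `threshold` parameter, and the sample data are assumptions made for this illustration. The sketch uses the population standard deviation for σ.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical sample: 19 ordinary readings and one extreme reading.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 10, 13, 12, 11, 12, 10, 11, 100]

outliers = zscore_outliers(data)
print(outliers)
```

Note that with 10 or fewer data points no value can have |z-score| > 3, so this rule only makes sense on reasonably large samples.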
2. Inter Quartile Range
It is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile).
Percentile Calculation
- Example: 5,6,7,1,2,8,10
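- Sort the values in ascending order: 1,2,5,6,7,8,10
- 0th percentile = 1 because no value is less than 1
- 10th percentile is roughly 2 because only a small share of the values (1 of 7, about 14%) is less than 2; the exact figure depends on the percentile convention used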
Quartiles
- The first quartile (q1) is the 25th percentile value.
- The second quartile (q2) is the 50th percentile value. It is also the median of the data.
- The third quartile (q3) is the 75th percentile value.
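As a rough illustration, the quartiles of the example values above can be computed with `statistics.quantiles` from Python's standard library (Python 3.8+). Quartile values depend on the interpolation convention, so the sketch shows both of the methods that function offers. Other tools may use yet another convention.

```python
import statistics

data = [5, 6, 7, 1, 2, 8, 10]  # example values from the text

# quantiles() sorts the data itself; n=4 returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method="exclusive"
print("exclusive:", q1, q2, q3)

# method="inclusive" treats the data as including its own min and max,
# which can give different values for q1 and q3.
q1_inc, q2_inc, q3_inc = statistics.quantiles(data, n=4, method="inclusive")
print("inclusive:", q1_inc, q2_inc, q3_inc)
```

In both cases the second quartile equals the median, 6. The first and third quartiles differ between the two methods, so it is worth stating which convention is used when reporting outlier bounds.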
Outlier Formula Using IQR
- Arrange data in ascending order.
- Calculate the first quartile (q1) and third quartile (q3).
- Calculate Inter Quartile Range (IQR).
IQR = q3 - q1
- Find Lower Bound and Upper Bound.
Lower Bound = q1 - 1.5*IQR
Upper Bound = q3 + 1.5*IQR
- Values greater than the Upper Bound or less than the Lower Bound are considered to be outliers.
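The IQR steps can be sketched as follows. The helper names `iqr_bounds` and `iqr_outliers`, the multiplier parameter `k`, and the sample data are assumptions for this example. The quartiles come from `statistics.quantiles` with its default method, so other tools may report slightly different bounds.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # sorts internally
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data: the earlier example values plus one extreme value.
data = [1, 2, 5, 6, 7, 8, 10, 50]

lower, upper = iqr_bounds(data)
outliers = iqr_outliers(data)
print(lower, upper, outliers)
```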
3. Types of Plots to Detect Outliers
- Scatter Plot
- Box Plot
1. Scatter Plot
In scatterplots, points that are far away from others are considered outliers.
2. Box Plot
The box plot is one of the most commonly used plots for finding outliers. It is based on the interquartile range described above.
- Calculate the first quartile (q1) and third quartile (q3).
- Calculate Inter Quartile Range (IQR).
IQR = q3 - q1
- Find Lower Bound (Minimum) and Upper Bound (Maximum).
Minimum = q1 - 1.5*IQR
Maximum = q3 + 1.5*IQR
- Values greater than Maximum or less than Minimum are considered to be outliers.
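To show what a box plot draws without relying on a plotting library, here is a minimal sketch that computes the numbers behind the plot. It assumes the common convention in which the whiskers stop at the most extreme data points that still lie within the Minimum and Maximum limits, and anything beyond is drawn as an individual point. The function name `box_plot_stats` and the data are hypothetical.

```python
import statistics

def box_plot_stats(values, k=1.5):
    """Compute the box, whisker ends, and outlier points of a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    minimum, maximum = q1 - k * iqr, q3 + k * iqr  # whisker limits
    inside = [x for x in data if minimum <= x <= maximum]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the limits
        "whisker_high": inside[-1],  # largest value within the limits
        "outliers": [x for x in data if x < minimum or x > maximum],
    }

# Hypothetical data with one extreme value.
stats = box_plot_stats([1, 2, 5, 6, 7, 8, 10, 50])
print(stats)
```

Plotting libraries may compute the quartiles with a different interpolation convention, so the box edges they draw can differ slightly from these numbers.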
How to Handle Outliers?
- Remove the outliers.
- Replace the outlier values, for example with mean imputation.
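Both options can be sketched with the IQR rule as the detector. The function names and data below are hypothetical. The sketch also assumes that the replacement value is the mean of the non-outlier values; using the mean of all values is another possible choice, but that mean is itself pulled toward the outliers.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values):
    """Option 1: drop every value outside the IQR bounds."""
    lower, upper = iqr_bounds(values)
    return [x for x in values if lower <= x <= upper]

def impute_outliers_with_mean(values):
    """Option 2: replace each outlier with the mean of the other values."""
    lower, upper = iqr_bounds(values)
    inliers = [x for x in values if lower <= x <= upper]
    replacement = statistics.fmean(inliers)  # assumption: mean of non-outliers
    return [x if lower <= x <= upper else replacement for x in values]

data = [1, 2, 5, 6, 7, 8, 10, 50]  # hypothetical data with one outlier
removed = remove_outliers(data)
imputed = impute_outliers_with_mean(data)
print(removed)
print(imputed)
```

Removal shortens the dataset, while imputation keeps its length. Which one is appropriate depends on whether the outlier is an error or a genuine observation.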