Descriptive Statistics – Part 2 | Normal Distribution | Central Limit Theorem | Outliers | Box Plot

In this article, we will continue with descriptive statistics. The basic topics are already covered in Descriptive Statistics Part 1. This article will cover important topics like Normal Distribution, Central Limit Theorem, Outliers, and some plots like Box Plots.

Normal Distribution

A random variable \(X\) with mean \(\mu\) and standard deviation \(\sigma\) is said to be normally distributed if it has the following properties:

  • Mean = Median = Mode
  • No Skewness
  • It follows a bell curve
  • It is symmetrical on both sides of the mean

Empirical Rule of Normal Distribution

  • About 68% of the data lies within one standard deviation of the mean (one to the left and one to the right).
  • About 95% of the data lies within two standard deviations of the mean.
  • About 99.7% of the data lies within three standard deviations of the mean.
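
The percentages above can be checked numerically. The following is a minimal sketch that uses the error function from Python's standard library; the helper name `coverage_within` is made up for this illustration.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean (the same for any mean and sd)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {coverage_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The figures 68%, 95% and 99.7% are these values rounded.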

Formula of Normal Distribution

Using the mean and standard deviation of a normal distribution, the probability density function can be used to fit the normal curve to your data.

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
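
To make the density function concrete, here is a small sketch that evaluates it for a given mean and standard deviation. The function name `normal_pdf` and its default arguments are assumptions chosen for this example.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution at x:
    1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


# The curve peaks at the mean and is symmetric around it.
print(round(normal_pdf(0.0), 4))  # 0.3989
print(normal_pdf(1.0) == normal_pdf(-1.0))  # True
```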
>> The Normal Distribution is also called the Gaussian Distribution.
>> It is one of the most widely used probability distributions.
>> Many statistical tests assume that the data is normally distributed.

Central Limit Theorem (CLT)

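According to the Central Limit Theorem, if many random samples of the same size are taken from a population with a finite mean and variance, the distribution of their sample means approaches a normal distribution as the sample size increases, whatever the shape of the population distribution.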
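
A quick simulation can illustrate the idea. This sketch assumes an exponential population (clearly not normal, with mean 1 and standard deviation 1) and a fixed random seed; both are choices made for the example, not part of the theorem.

```python
import random
import statistics


def sample_means(sample_size: int, num_samples: int = 2000, seed: int = 0) -> list:
    """Draw num_samples samples of the given size from an exponential
    population and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


for n in (2, 30, 200):
    means = sample_means(n)
    # The centre stays near the population mean (1.0) while the spread
    # shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each sample size would show the shape moving from skewed towards a bell curve.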

Standard Normal Distribution

It is a special type of normal distribution where the mean is 0 and the standard deviation is 1.

How to Convert Normal Distribution to Standard Normal Distribution?

Normal Distribution can be converted to Standard Normal Distribution by using a z-score or Standardization.

Z – Score Formula

A z-score tells you how many standard deviations a value lies above or below the mean.

\[ \text{z-score} = \frac{x - \mu}{\sigma} \]

where
μ = mean
σ = standard deviation
x = current value

z-score > 0 (positive z-score): the value is greater than the mean.
z-score < 0 (negative z-score): the value is less than the mean.
z-score = 0 (zero z-score): the value is equal to the mean.
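
As a small worked example, the sketch below computes z-scores for three values. The exam scores (mean 70, standard deviation 10) are made-up numbers used only for illustration.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean:
    (x - mu) / sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical exam scores with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(55, 70, 10))  # -1.5 (below the mean)
print(z_score(70, 70, 10))  # 0.0  (equal to the mean)
```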

Standardization

Standardization is a feature scaling method that uses the z-score to rescale data to a mean of 0 and a standard deviation of 1. It is also called z-score normalization. When the original data is normally distributed, the standardized data follows the standard normal distribution.

Standardization shifts and rescales the data but does not change the shape of its distribution, so skewed data stays skewed after standardization. If you need the standardized values to follow a standard normal distribution, you first need to transform the skewed data so that it is approximately normal.
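
Here is a minimal sketch of standardizing a list of values. It assumes the population standard deviation is used (the article does not say which variant is intended), and the heights are made-up data.

```python
import statistics


def standardize(values: list) -> list:
    """Rescale values to mean 0 and standard deviation 1 using z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 165, 170, 180]  # hypothetical data
scaled = standardize(heights_cm)
print(scaled)  # [-1.5, -0.5, 0.0, 0.5, 1.5]
```

After scaling, the values have mean 0 and standard deviation 1, while their relative spacing is unchanged.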

How to Convert a Skewed Distribution to a Normal Distribution?

Skewed data can often be brought closer to a normal distribution by applying a transformation. The log transformation is the most common approach for right-skewed data with positive values.
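
The effect of a log transformation can be seen by measuring skewness before and after. This sketch uses a simple population skewness (the mean of the cubed z-scores) and a made-up right-skewed list of incomes; the helper name `skewness` is an assumption for this example.

```python
import math
import statistics


def skewness(values: list) -> float:
    """Population skewness: the mean of the cubed z-scores.
    Values near 0 indicate a roughly symmetric distribution."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]  # hypothetical, right-skewed
logged = [math.log(v) for v in incomes]  # requires strictly positive values

print(round(skewness(incomes), 2))
print(round(skewness(logged), 2))
```

The logged values are noticeably less skewed than the raw ones, although a log transformation reduces skew rather than guaranteeing a perfectly normal result.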

Outliers

Values that are very far from the other observations in the data are called Outliers.

How to Calculate Outliers?

There are many methods to detect outliers. Three common ones are:

  1. z-score
  2. Inter Quartile Range
  3. Plots

1. z-score

z-score can be used to find outliers using the below formula.

\[ \text{z-score} = \frac{x - \mu}{\sigma} \]

where μ = mean, σ = standard deviation, x = current value

Calculate Outlier using z-score

If z-score > 3 or z-score < -3, the value is considered to be an outlier.
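
The rule can be sketched as a small function that flags values whose z-score lies above 3 or below -3. The sensor readings are made-up data, and the use of the population standard deviation is an assumption.

```python
import statistics


def zscore_outliers(values: list, threshold: float = 3.0) -> list:
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]


readings = [10, 11, 12, 12, 12, 13, 13, 14, 11, 12, 13, 12, 11, 13, 12, 95]
print(zscore_outliers(readings))  # [95]
```

Note that an extreme value also inflates the mean and standard deviation it is compared against, so in very small samples (around ten values or fewer) no point can reach a z-score beyond 3 and this rule will not flag anything.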

2. Inter Quartile Range

It is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile).

Percentile Calculation

  • Example: 5,6,7,1,2,8,10
  • Sort the values in ascending order: 1,2,5,6,7,8,10
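  • 0th percentile = 1 because no value is less than 1
  • The 10th percentile lies between 1 and 2 because roughly 10% of the values fall below it. The exact figure depends on the calculation method (linear interpolation gives 1.6).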
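
Percentile conventions differ slightly between tools, so results for small datasets can vary. The sketch below assumes linear interpolation between the sorted values, which is one common convention; the function name `percentile` is chosen for this example.

```python
import math


def percentile(values: list, p: float) -> float:
    """p-th percentile (0 to 100) using linear interpolation
    between neighbouring sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)  # fractional position in the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [5, 6, 7, 1, 2, 8, 10]
print(percentile(data, 0))              # 1.0
print(round(percentile(data, 10), 2))   # 1.6
print(percentile(data, 50))             # 6.0
```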

Quartiles

  • The first quartile (q1) is the 25th percentile value.
  • The second quartile (q2) is the 50th percentile value. It is also the median of the data.
  • The third quartile (q3) is the 75th percentile value.
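
For the example data used earlier, the three quartiles can be computed with Python's standard library. The `method="inclusive"` setting is an assumption made here; the default `"exclusive"` method gives slightly different values on small datasets.

```python
import statistics

data = [5, 6, 7, 1, 2, 8, 10]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 3.5 6.0 7.5
print(q2 == statistics.median(data))  # True
```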

Outlier Formula Using IQR

  1. Arrange data in ascending order.
  2. Calculate the first quartile (q1) and third quartile (q3).
  3. Calculate Inter Quartile Range (IQR).
    IQR = q3 – q1
  4. Find Lower Bound and Upper Bound.
    Lower Bound = q1-1.5*IQR
    Upper Bound = q3+1.5*IQR
  5. Values greater than the Upper Bound or less than the Lower Bound are considered to be outliers.
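
The steps above can be sketched as follows. The quartile method (`"inclusive"`) and the sample data, with 30 added as an obvious extreme value, are assumptions made for this example.

```python
import statistics


def iqr_outliers(values: list) -> tuple:
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers


data = [5, 6, 7, 1, 2, 8, 10, 30]
print(iqr_outliers(data))  # (-2.125, 14.875, [30])
```

`statistics.quantiles` sorts the data internally, so the explicit sorting step is not needed in code.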

3. Types of Plots to Calculate Outliers

  1. Scatter Plot
  2. Box Plot

1. Scatter Plot

In scatterplots, points that are far away from others are considered outliers.

2. Box Plot

Box Plot is one of the most commonly used plots to find outliers. It is based on the concept of the interquartile range we learned above in this article.

  1. Calculate the first quartile (q1) and third quartile (q3).
  2. Calculate Inter Quartile Range (IQR).
    IQR = q3-q1
  3. Find the Minimum (Lower Bound) and Maximum (Upper Bound).
    Minimum = q1-1.5*IQR
    Maximum = q3+1.5*IQR
  4. Values greater than Maximum or less than Minimum are considered to be outliers.
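
To show what a box plot actually draws, the sketch below computes its ingredients without any plotting library. It assumes the common convention that the whiskers stop at the most extreme data points that still lie within the Minimum and Maximum limits, with everything beyond drawn as individual outlier points; the function name `box_plot_stats` and the data are made up for this example.

```python
import statistics


def box_plot_stats(values: list) -> dict:
    """Compute the box (q1, median, q3), whisker ends and outliers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    minimum = q1 - 1.5 * iqr  # lower limit for non-outliers
    maximum = q3 + 1.5 * iqr  # upper limit for non-outliers
    inside = [v for v in ordered if minimum <= v <= maximum]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the limits
        "whisker_high": inside[-1],  # largest value within the limits
        "outliers": [v for v in ordered if v < minimum or v > maximum],
    }


stats = box_plot_stats([5, 6, 7, 1, 2, 8, 10, 30])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])  # 1 10 [30]
```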

End Notes

Thank you for reading this article. By the end, we are familiar with some important concepts in statistics like Normal Distribution, Standardization, Outliers, Boxplot, etc.

I hope this article was informative. You can read my other article on Statistics here. Feel free to ask any query or give your feedback in the comment box below.

Happy Learning!