Descriptive Statistics – Part 1 | Fundamentals Of Statistics For Data Scientists and Analysts

Statistics is the root of data analysis. It is a fundamental topic you should learn when preparing for Data Science, as it helps you understand data and draw conclusions from it. This article starts a series on the statistics and probability required for data science, beginning with the basic topics of Statistics.

Statistics

Statistics is the area of study that deals with the ways to gather data, review data, manipulate data, and draw insights from it.

It is widely used across industries to solve business problems by predicting the outcomes of different actions. Conclusions can be drawn from past data, and corrective actions can be taken for future improvement.

Statistics has wide applications in fields such as medicine, supply chain, logistics, and finance.

Types of Statistics

  1. Descriptive Statistics
  2. Inferential Statistics

Descriptive Statistics

Descriptive Statistics deals with summarizing the features of data, such as its distribution, central tendency, and variability. It involves analyzing, exploring, and presenting findings about a data set drawn from a sample or an entire population.

Inferential Statistics

Once the data is collected, analyzed, and summarised, Inferential Statistics is used to draw conclusions about a larger population from that data using various analytical tools.

Random Variable

A Random Variable (X) represents the possible outcomes of a random event. A few random events are:

  1. Tossing a Coin: the possible outcomes are Head and Tail.
  2. Rolling a Die: the outcome can be any value from 1 to 6.
  3. Drawing a Card: the outcome can be any card in the deck.

Types of Random Variables

  1. Categorical Variable
  2. Numerical Variable

1. Categorical Variable

Categorical Variables are those that take values from a finite, countable set of categories.

Race, Sex, and Age Group are a few examples of Categorical Variables.

Cardinality: It is the number of unique values in a categorical variable. For example, Sex has values of Female and Male and hence its Cardinality is 2.
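Cardinality can be checked quickly in Python with the built-in `set`; a minimal sketch using a hypothetical sample of a Sex variable:

```python
# Hypothetical sample of a categorical variable
sex = ["Female", "Male", "Male", "Female", "Female"]

# Cardinality = number of unique values in the variable
cardinality = len(set(sex))
print(cardinality)  # 2
```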

There are 2 Types of Categorical Variables:

  1. Ordinal Categorical Variable
  2. Nominal Categorical Variable

a. Ordinal Categorical Variable

Ordinal Categorical Variables are those whose values have an inherent order.

For example, a Temperature variable with the values Low, Medium, and High has the order High > Medium > Low.

Education Level, Income Level, and Customer Satisfaction Rating are a few other examples of Ordinal Categorical Variables.

b. Nominal Categorical Variable

Nominal Categorical Variables are those whose values have no order; all values have equal priority.

For example, a Gender variable has Female and Male as its values. There is no order between them; both have equal priority.

Name, Phone Number, and Eye Color are a few other examples of Nominal Categorical variables.

2. Numeric Variable

Numeric Variables are those whose values are numbers having quantifiable properties.

There are 2 Types of Numeric Variables:

  1. Discrete Numeric Variable
  2. Continuous Numeric Variable

a. Discrete Numeric Variable

Discrete Numeric Variables are those whose values are countable.

For example, Number of Children, whose value can be any whole number from 0 to n.

b. Continuous Numeric Variable

Continuous Numeric Variables are those that can take infinitely many values within a range.

For Example Height, Weight, Speed, etc.

Different Measures of Data

  1. Univariate Measure
  2. Bivariate Measure

1. Univariate Measure

Univariate measures find insights about a single feature.

  1. Measure of Central Tendency
  2. Measure of Asymmetry
  3. Measure of Variability

a. Measure of Central Tendency

There are three measures of Central Tendency:

  1. Mean
  2. Median
  3. Mode

These measures of Central Tendency represent the central value of the data and can be used to describe it. They are used for numerical data.

1. Mean

The average of a given data.

2. Median

The Middle Value in a given data.

3. Mode

The most occurring value in a given data.

Example: 2,3,7,4,5,1,10,1

Mean

  • (2+3+7+4+5+1+10+1)/8 = 4.125

Median

  • Ascending Order = 1,1,2,3,4,5,7,10
  • Middle Values =3 and 4
  • Since there are two middle values, calculate average = (3+4)/2 = 3.5

Mode

  • 1, since one is the most occurring value.
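The three measures above can be verified on the same data with Python's standard `statistics` module:

```python
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]

mean = statistics.mean(data)      # (2+3+7+4+5+1+10+1)/8
median = statistics.median(data)  # average of the two middle values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 4.125 3.5 1
```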

Can there be more than one mode in a given data?

Yes, more than one mode is possible, but the mode loses its usefulness as a summary when there are many modes.

If the data has a lot of outliers, the Mean should not be used, since it is highly affected by outliers. In such cases, the Median is a better option.
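A small sketch of this effect, using a hypothetical dataset: one extreme value drags the mean far away while the median barely moves.

```python
import statistics

values = [10, 12, 11, 13, 12]
with_outlier = values + [200]  # one extreme outlier

mean_before, median_before = statistics.mean(values), statistics.median(values)
mean_after, median_after = statistics.mean(with_outlier), statistics.median(with_outlier)

# The mean jumps from 11.6 to 43.0, while the median stays at 12
print(mean_before, median_before)  # 11.6 12
print(mean_after, median_after)    # 43.0 12.0
```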

b. Measure of Asymmetry

Asymmetry can be measured using the concept of skewness.

Skewness

Skewness refers to the asymmetry of data whose distribution curve deviates from the Normal Distribution to the left or the right.

It measures the extent to which the curve deviates from the Normal Distribution.

Types of Skewness

  1. Left (Negative) Skewness

    In left-skewed data, the long tail (and any outliers) lies on the left side of the curve.
    Mean < Median

  2. Right (Positive) Skewness

    In right-skewed data, the long tail (and any outliers) lies on the right side of the curve.
    Mean > Median

  3. Zero Skewness

    In zero-skewed data, the curve is symmetric: both tails mirror each other.
    Mean = Median = Mode

How to Calculate Skewness?

There are two methods to calculate skewness:

  1. Pearson Mode Skewness

    Skewness = (X̄ − Mo) / s

    where
    X̄ = Mean value
    Mo = Mode value
    s = Standard deviation of the sample data

  2. Pearson Median Skewness

    Skewness = 3(X̄ − Md) / s

    where
    Md = Median value

The direction of skewness is given by the sign, and the magnitude of the coefficient indicates how much the sample distribution differs from a normal distribution. The larger the absolute value, the more the distribution differs from a normal distribution; a value of zero means no skewness at all.
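Pearson median skewness can be computed for the earlier example data with the standard library; this sketch assumes the sample standard deviation:

```python
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]

mean = statistics.mean(data)
median = statistics.median(data)
s = statistics.stdev(data)  # sample standard deviation

skewness = 3 * (mean - median) / s
print(round(skewness, 3))  # positive, so the data is right-skewed
```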

c. Measure of Variability

  1. Variance
  2. Standard Deviation
  3. Coefficient of Variation

a. Variance

Variance measures the variability in data, i.e. the spread of the data around the mean. The higher the variance, the more spread out the data.

Variance (σ²) = Σ (xᵢ − x̄)² / n

where
xᵢ = value of the current data point
n = number of data points
i = iterator which moves from 1 to n
x̄ = mean

In the above formula, the distance from the mean is squared so that positive and negative deviations do not cancel each other out.

b. Standard Deviation

Standard Deviation is calculated by taking the square root of Variance.

Standard Deviation (σ) = √Variance = √( Σ (xᵢ − x̄)² / n )

It is the most commonly used measure of variability since its unit of measurement is the same as the unit of the original variable.
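Both formulas can be checked against the standard library; a minimal sketch using the population variance (dividing by n):

```python
import math
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]
n = len(data)
mean = sum(data) / n

# Population variance: mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / n
std_dev = math.sqrt(variance)

print(variance)                    # 8.609375
print(statistics.pvariance(data))  # same value from the standard library
```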

c. Coefficient of Variation

It is the ratio of the Standard Deviation to the Mean, also called the relative standard deviation.

Coefficient of Variation (CV) = σ / x̄

where x̄ = mean and σ = standard deviation

The Standard Deviation cannot be used to compare the spread of two datasets measured on different scales, while the unit-free Coefficient of Variation can.
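A sketch of such a comparison using the ratio of standard deviation to mean, with hypothetical height and weight samples on very different scales:

```python
import statistics

# Hypothetical measurements on very different scales
heights_m = [1.6, 1.7, 1.8, 1.75, 1.65]  # metres
weights_kg = [60, 72, 85, 78, 64]        # kilograms

cv_height = statistics.stdev(heights_m) / statistics.mean(heights_m)
cv_weight = statistics.stdev(weights_kg) / statistics.mean(weights_kg)

# The unit-free ratios are directly comparable;
# the raw standard deviations (metres vs kilograms) are not
print(round(cv_height, 3), round(cv_weight, 3))
```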

2. Bivariate Measure

Bivariate measures describe the relationship between two variables.

  1. Covariance
  2. Correlation

a. Covariance

Covariance is a measure of the relationship between two random variables and to what extent they change together.

The formula for covariance is similar to that of variance, except that covariance is computed over two variables whereas variance is computed over one.

Covariance = Cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where
x is variable 1 and y is variable 2
xᵢ = current value of variable 1
x̄ = mean of variable 1
yᵢ = current value of variable 2
ȳ = mean of variable 2

1. Positive Covariance

If both variables x and y increase together or decrease together, i.e. the variables move in the same direction, then the covariance is positive.

2. Negative Covariance

If variable x increases while variable y decreases, or vice versa, i.e. the variables move in opposite directions, then the covariance is negative.

Covariance only tells the direction in which two variables are related, not the strength of the relationship.
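A minimal sketch of the sample covariance formula above (dividing by n − 1), with hypothetical data illustrating both signs:

```python
def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # moves with x    -> positive covariance
z = [10, 8, 6, 4, 2]  # moves against x -> negative covariance

print(covariance(x, y))  # 5.0
print(covariance(x, z))  # -5.0
```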

b. Correlation

Correlation is a measure of the relationship between two variables. It helps to find the direction and the strength in which two variables are related to each other.

Pearson Correlation Coefficient (r) = Cov(x, y) / (σₓ · σᵧ), where σₓ and σᵧ are the standard deviations of x and y.

Pearson Correlation Coefficient between two variables:

  • The sign represents the direction of the relationship
  • The magnitude represents the strength, i.e. how strongly the two variables are related
  • The coefficient value lies between -1 and 1
  • -1 represents a perfect negative correlation
  • +1 represents a perfect positive correlation
  • 0 represents no correlation
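The coefficient can be sketched directly from the formula above; the input data here is hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0, perfect positive correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0, perfect negative correlation
```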

Correlation is not Causation

Even when two variables are correlated, it doesn’t mean one is the cause and the other is the effect.

For example, sales of umbrellas and ice cream are highly correlated. Does that mean buying an umbrella causes buying ice cream? Not really. Even if two variables are highly correlated, it does not follow that one is causing the other.

End Notes

By the end of this article, we have covered basic and important topics in statistics. These topics will help you learn other concepts like inferential statistics, hypothesis testing, and machine learning. You can continue with Descriptive Statistics in Part 2.

I hope this article was informative. Feel free to ask questions or give feedback in the comment box below. You can go through this series of articles if you want to learn the statistics and probability required for data science.

Happy Learning!