Statistics is the root of data analysis. It is a basic topic you should definitely learn if you are preparing for Data Science as it helps you to understand data and draw conclusions from it. I am starting this series where we will learn the statistics and probability required for data science. In this article, we will start with the basic topics of Statistics.
Statistics is the area of study that deals with the ways to gather data, review data, manipulate data, and draw insights from it.
It is widely used across industries to solve business problems by predicting the outcomes of different actions. Conclusions can be drawn from past data, and corrective actions can be taken for future improvement.
Statistics has wide applications in fields such as medicine, supply chain, logistics, and finance.
Types of Statistics
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics deals with summarizing the features of data such as distribution, central tendency, and variability. It involves analyzing, exploring, and presenting findings about a data set derived from a sample or an entire population.
Inferential Statistics goes a step further: once the data has been collected, analyzed, and summarized, it uses analytical tools to draw conclusions about the wider population from the collected sample.
A Random Variable (X) is a variable whose value is the outcome of a random event. A few random events are:
- Tossing a Coin: The possible outcomes are Heads or Tails.
- Rolling a Die: The possible outcome is any whole number from 1 to 6.
- Drawing a Card: The possible outcome is any card in the deck.
Types of Random Variables
- Categorical Variable
- Numeric Variable
1. Categorical Variable
Categorical Variables are those whose values come from a finite set of categories or labels.
Race, Sex, and Age Group are a few examples of Categorical Variables.
Cardinality: It is the number of unique values in a categorical variable. For example, Sex has values of Female and Male and hence its Cardinality is 2.
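As a minimal sketch of the idea, cardinality can be computed by counting the distinct values in a column. The `sex` list below is made-up illustration data, not taken from any real dataset.

```python
# Hypothetical sample column of a categorical variable.
sex = ["Female", "Male", "Male", "Female", "Female"]

# A set keeps only the unique values, so its size is the cardinality.
cardinality = len(set(sex))

print(cardinality)  # 2
```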
There are 2 Types of Categorical Variables:
- Ordinal Categorical Variable
- Nominal Categorical Variable
a. Ordinal Categorical Variable
Ordinal Categorical Variables are those whose values have some order in them.
For example, a Temperature variable may take the values Low, Medium, and High. These values have a natural order: High > Medium > Low.
Education Level, Income Level, and Customer Satisfaction Rating are a few other examples of Ordinal Categorical Variables.
b. Nominal Categorical Variable
Nominal Categorical Variables are those whose values do not have any order. All the values have equal priority.
For example, a Gender variable may take the values Female and Male. There is no order among the values; Female and Male have equal priority.
Name, Phone Number, and Eye Color are a few other examples of Nominal Categorical variables.
2. Numeric Variable
Numeric Variables are those whose values are numbers having quantifiable properties.
There are 2 Types of Numeric Variables:
- Discrete Numeric Variable
- Continuous Numeric Variable
a. Discrete Numeric Variable
Discrete Numeric Variables are those whose values are countable.
For example, Number of Children, whose value can only be a whole number from 0 to n.
b. Continuous Numeric Variable
Continuous Numeric Variables are those that can take any of the infinitely many values within a range.
For example: Height, Weight, Speed, etc.
Different Measures of Data
- Univariate Measure
- Bivariate Measure
1. Univariate Measure
A univariate measure summarizes one single feature on its own. The univariate measures are:
- Measure of Central Tendency
- Measure of Asymmetry
- Measure of Variability
a. Measure of Central Tendency
There are three measures of Central Tendency:
- Mean
- Median
- Mode
These measures of Central Tendency represent the central value of the data and can be used to describe it. Mean and median are used for numerical data, while mode can also be used for categorical data.
- Mean: The average of the given data.
- Median: The middle value of the given data.
- Mode: The most frequently occurring value in the given data.
For example, take the data 2, 3, 7, 4, 5, 1, 10, 1.
Mean
- (2+3+7+4+5+1+10+1)/8 = 33/8 = 4.125
Median
- Ascending Order = 1,1,2,3,4,5,7,10
- Middle Values = 3 and 4
- Since there are two middle values, calculate their average = (3+4)/2 = 3.5
Mode
- 1, since 1 is the most frequently occurring value.
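The worked example above can be checked with a short sketch using Python's built-in `statistics` module, applied to the same eight values (2, 3, 7, 4, 5, 1, 10, 1). The variable names are chosen here only for illustration.

```python
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # average of the two middle values when the count is even
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)  # 4.125 3.5 1
```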
Can there be more than one mode in the given data?
Yes, more than one mode is possible. However, the more modes there are, the less useful the mode becomes as a single summary of the data.
If the data has a lot of outliers, the Mean should not be used, as it is highly affected by outliers. In such cases, the Median is a better option.
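To see why, here is a small sketch that adds a single extreme value to the example data and compares how far the mean and the median move. The outlier value of 1000 is an arbitrary choice made for illustration.

```python
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]
with_outlier = data + [1000]  # one arbitrary extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

# The mean jumps by more than 100, while the median barely moves.
print(mean_before, mean_after)
print(median_before, median_after)  # 3.5 4
```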
b. Measure of Asymmetry
Asymmetry can be measured using the concept of skewness.
Skewness refers to the asymmetry in data whose distribution curve deviates from a symmetric shape, such as that of the Normal Distribution, towards either the left or the right.
It measures the extent to which the curve has deviated from that symmetric shape.
Types of Skewness
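Left (Negative) Skewness
In left-skewed data, the longer tail of the distribution, along with its outliers, lies on the left side of the curve.
Typically, Mean < Median
Right (Positive) Skewness
In right-skewed data, the longer tail of the distribution, along with its outliers, lies on the right side of the curve.
Typically, Mean > Median
Zero Skewness
In zero-skewed data, the distribution is symmetric, so the tails on both ends of the curve balance each other.
Mean = Median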
How to Calculate Skewness?
There are two methods to calculate skewness:
Pearson Mode Skewness = $\frac{\bar{X} - Mo}{s}$
$\bar{X}$ = Mean value
$Mo$ = Mode value
$s$ = Standard deviation of the sample data
Pearson Median Skewness = $\frac{3(\bar{X} - Md)}{s}$
$Md$ = Median value
The sign of the coefficient gives the direction of skewness, and its magnitude indicates how far the sample distribution is from a symmetric distribution such as the normal distribution. The larger the magnitude, the more the distribution differs from it. A value of zero means no skewness at all.
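As a rough sketch, both Pearson coefficients can be computed for the example data 2, 3, 7, 4, 5, 1, 10, 1 with Python's `statistics` module. This assumes the usual definitions: mode skewness is (mean - mode) / s and median skewness is 3 * (mean - median) / s, with s taken as the sample standard deviation.

```python
import statistics

data = [2, 3, 7, 4, 5, 1, 10, 1]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

pearson_mode_skewness = (mean - mode) / s
pearson_median_skewness = 3 * (mean - median) / s

# Both values are positive here, which indicates right (positive) skewness.
print(round(pearson_mode_skewness, 3), round(pearson_median_skewness, 3))
```

The positive sign agrees with the data itself: the value 10 sits far out on the right and pulls the mean (4.125) above the median (3.5).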
c. Measure of Variability
- Variance
- Standard Deviation
- Coefficient of Variation
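a. Variance
Variance is a measure of the variability in data, i.e. the spread of the data around the mean. The higher the variance, the more spread out the data.
Variance = $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
$x_i$ = value of the current data point
$\bar{x}$ = mean of the data
$n$ = number of data points
$i$ = index which runs from 1 to n
This is the population form; when the data is a sample, the sum is divided by n - 1 instead of n.
In the above formula, the distance from the mean is squared so that positive and negative deviations do not cancel each other out.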
b. Standard Deviation
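Standard Deviation is calculated by taking the square root of the Variance.
Standard Deviation = $\sigma = \sqrt{\text{Variance}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, with $\bar{x}$ the mean of the data.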
It is the most commonly used measure of variability since its unit of measurement is the same as the unit of the original variable.
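The following sketch computes variance and standard deviation by hand for the example data 2, 3, 7, 4, 5, 1, 10, 1. It shows both the population form (divide by n) and the sample form (divide by n - 1); which of the two applies depends on whether the data is the whole population or a sample from it.

```python
import math

data = [2, 3, 7, 4, 5, 1, 10, 1]
n = len(data)
mean = sum(data) / n

# Sum of squared distances from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # divide by n
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

# Standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(population_variance)  # 8.609375
print(round(population_std, 3), round(sample_std, 3))
```

The same results are available from `statistics.pvariance`, `statistics.variance`, `statistics.pstdev`, and `statistics.stdev` in the standard library.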
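c. Coefficient of Variation
It is the ratio of the Standard Deviation to the Mean. It is also known as the relative standard deviation.
Coefficient of Variation = $\frac{\sigma}{\bar{x}}$
where $\sigma$ = standard deviation and $\bar{x}$ = mean
Standard Deviation depends on the unit and scale of the data, so it cannot be used directly to compare the variability of two datasets, while the Coefficient of Variation, being unitless, can be used.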
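As an illustration, the sketch below computes the relative standard deviation (standard deviation divided by mean) for two made-up datasets measured in different units. The helper name `relative_std` and both lists are hypothetical and exist only for this example.

```python
import statistics

def relative_std(values):
    """Return the population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

heights_cm = [150, 160, 170, 180, 190]  # hypothetical heights in centimetres
weights_kg = [50, 60, 70, 80, 90]       # hypothetical weights in kilograms

std_heights = statistics.pstdev(heights_cm)
std_weights = statistics.pstdev(weights_kg)

ratio_heights = relative_std(heights_cm)
ratio_weights = relative_std(weights_kg)

# The two standard deviations are equal, but relative to their means
# the weights vary more than the heights.
print(round(std_heights, 3), round(std_weights, 3))
print(round(ratio_heights, 3), round(ratio_weights, 3))
```

Because the ratio has no unit, it allows the spread of centimetres and kilograms to be compared directly, which the raw standard deviations do not.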
2. Bivariate Measure
A bivariate measure describes the relationship between two variables. The two common bivariate measures are covariance and correlation.
Covariance is a measure of the relationship between two random variables and to what extent they change together.
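The formula for Covariance is similar to that of Variance, except that Covariance is computed over two variables whereas Variance is computed over one.
Cov(x, y) = $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
x is variable1 and y is variable2.
$x_i$ = current value of variable1
$\bar{x}$ = mean of variable1
$y_i$ = current value of variable2
$\bar{y}$ = mean of variable2
As with variance, the sample version divides by n - 1 instead of n.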
1. Positive Covariance
If variable x and variable y tend to increase or decrease together, that is, both variables move in the same direction, then the Covariance is positive.
2. Negative Covariance
If variable x increases while variable y decreases, or vice versa, that is, the variables move in opposite directions, then the Covariance is negative.
Covariance only tells the direction in which two variables are related, not the strength of the relationship, because its magnitude depends on the units of the variables.
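A minimal sketch of covariance, assuming the population form (divide by n), is shown below. The function name `covariance` and the three lists are made up for illustration: `y` rises with `x`, while `z` falls as `x` rises.

```python
def covariance(xs, ys):
    """Population covariance: mean product of the deviations from each mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # moves in the same direction as x
z = [10, 8, 6, 4, 2]  # moves in the opposite direction to x

cov_xy = covariance(x, y)
cov_xz = covariance(x, z)

print(cov_xy)  # 4.0
print(cov_xz)  # -4.0
```

Multiplying every value of `y` by 100 would multiply the covariance by 100 as well, which is why the magnitude alone says little about the strength of the relationship.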
Correlation is a measure of the relationship between two variables. It helps to find the direction and the strength in which two variables are related to each other.
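Pearson Correlation Coefficient = $r = \frac{Cov(x, y)}{\sigma_x \sigma_y}$, where $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.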
Properties of the Pearson Correlation Coefficient between two variables:
- The sign represents the direction of the relationship
- The magnitude represents the strength, i.e. how strongly the two variables are related
- The coefficient value lies between -1 and 1
- -1 represents a perfect negative correlation
- +1 represents a perfect positive correlation
- 0 represents no linear correlation
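These properties can be checked with a short sketch that computes the Pearson coefficient as covariance divided by the product of the two standard deviations. The function name `pearson_correlation` and the sample lists are hypothetical and used only for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # increases in step with x
z = [10, 8, 6, 4, 2]  # decreases as x increases

r_xy = pearson_correlation(x, y)
r_xz = pearson_correlation(x, z)

# r_xy is (up to floating-point rounding) +1 and r_xz is -1.
print(round(r_xy, 6), round(r_xz, 6))
```

Unlike covariance, the result does not change if `y` is rescaled, for example from metres to centimetres, because the units cancel in the division.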
Correlation is not Causation
Even when two variables are correlated, it doesn’t mean one is the cause and the other is the effect.
For example, suppose sales of umbrellas and ice cream are highly correlated. Does that mean buying an umbrella causes buying ice cream? Not really. A third factor, such as the season, may be driving both. So even if two variables are highly correlated, it is not necessarily true that one is causing the other.
By the end of this article, we have learned the basic and important topics in statistics. The topics learned today will help you in learning other concepts like inferential statistics, hypothesis testing, and machine learning. You can continue with Descriptive Statistics in part 2.
I hope this article was informative. Feel free to ask any query or give your feedback in the comment box below. You can go through this series of articles if you want to learn the statistics and probability required for data science.