For those learning statistics for data science, hypothesis testing is often considered one of the most difficult topics. Let’s break it into simple parts and make it easy to understand.
While working with sample data, we make a lot of assumptions about the population it comes from. It therefore becomes important to test whether an assumption is correct. Hypothesis testing is a tool for making statistical inferences about the population. The assumption concerns the population, and the result of hypothesis testing also applies to the population, not only to the sample.
A hypothesis is an idea that can be tested. Hypothesis testing can be applied to cases for which data is available.
Houses in the US are Costly.
The above statement is not a hypothesis, since we don’t have anything to compare with.
When the Price of the House > $374,900, it is considered costly.
The above statement is a hypothesis since we have a value to compare with.
Types of Hypothesis
- Null Hypothesis (H0)
- Alternate Hypothesis (H1)
1. Null Hypothesis
- The Null Hypothesis is the Hypothesis to be tested.
- It is denoted by H0.
- It states that there is no significant difference in a given set of observations.
2. Alternate Hypothesis
- The Alternate Hypothesis is a contradictory statement to the Null Hypothesis.
- It is denoted by H1 or Ha.
- It states that there is a significant difference in a given set of observations.
Using hypothesis testing, we either reject or fail to reject the null hypothesis. It works on the principle of "innocent until proven guilty": the null hypothesis can be rejected only if there is enough evidence in the data against it. Failing to reject the null hypothesis does not prove that it is true.
Null Hypothesis: Average Marks of Students at Delhi University is 70%.
Alternative Hypothesis: Average Marks of Students at Delhi University is not 70%.
Null Hypothesis: Not more than 60% of registered voters in Delhi voted for the election.
Alternative Hypothesis: More than 60% of registered voters in Delhi voted for the election.
| H0 | H1 / Ha |
|---|---|
| equal (=) | not equal (≠) or greater than (>) or less than (<) |
| greater than or equal to (≥) | less than (<) |
| less than or equal to (≤) | more than (>) |
Null and Alternative Hypotheses are Mutually Exclusive.
We aim to reject the null hypothesis if it is false, but there is a chance that the null hypothesis is rejected even when it is true.
- The significance level is denoted by α.
- It is the probability of rejecting a correct null hypothesis.
- Common values for α are 0.01, 0.05 and 0.10, which correspond to 1%, 5% and 10% respectively.
- The value of α is selected based on the certainty you need: 0.01 demands more certainty than 0.05, which in turn demands more than 0.10.
The significance level is a measure of the strength of the evidence that must be present in the sample before rejecting the null hypothesis.
If you want to be 95% confident, you accept a 5% (100-95) risk of rejecting a null hypothesis that is true. Likewise, 90% confidence corresponds to a significance level of 10%, and 99% confidence to a significance level of 1%.
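To see what "a 5% risk of rejecting a true null hypothesis" means in practice, here is a minimal simulation sketch. It is not from the original article: it assumes a two-sided z-test with a known population standard deviation, and the population values (mean 70, standard deviation 10), the sample size and the function name `false_rejection_rate` are all hypothetical choices made for illustration. Every sample is drawn from a population for which the null hypothesis really is true, so each rejection is a mistake.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def false_rejection_rate(alpha, trials=2000, n=30, seed=0):
    """Share of trials in which a two-sided z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    true_mean, sigma = 70.0, 10.0    # hypothetical population; H0: mean == 70 is true
    # Cut-off on the z scale: reject when |z| is larger than this value.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - true_mean) / (sigma / sqrt(n))
        if abs(z) > critical:
            rejections += 1
    return rejections / trials


for alpha in (0.01, 0.05, 0.10):
    rate = false_rejection_rate(alpha)
    print(f"alpha = {alpha:.2f}: rejected a true H0 in {rate:.1%} of trials")
```

Under these assumptions, the share of wrong rejections should land close to the chosen α, which is why α can be read as the risk you are willing to accept.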
How to select Significance Level?
The value of the significance level depends on the problem on which hypothesis testing is applied. A low significance value is used in cases for which high certainty is required and vice versa.
For example, if you want to check whether a machine is working properly, you may go with a 0.01 significance level, since you can tolerate little or no error.
For cases like analyzing human behavior, you may go with a 0.10 significance level, since human behavior can be very uncertain.
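The choice of significance level can change the conclusion drawn from the same data. The sketch below illustrates this with the p-value method, which this article does not cover in detail: the p-value is the probability of seeing data at least as extreme as the sample if the null hypothesis were true, and the null hypothesis is rejected when the p-value is smaller than α. The numbers (100 students, sample mean 72, population standard deviation 10) and the function name `two_sided_p_value` are hypothetical, and a z-test with a known standard deviation is assumed.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, hypothesized_mean, sigma, n):
    """p-value for H0: mean == hypothesized_mean against H1: mean != hypothesized_mean."""
    z = (sample_mean - hypothesized_mean) / (sigma / sqrt(n))
    # Probability of a |z| at least this large when H0 is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data for the "average marks is 70%" example.
p_value = two_sided_p_value(sample_mean=72, hypothesized_mean=70, sigma=10, n=100)

decisions = {}
for alpha in (0.01, 0.05, 0.10):
    decisions[alpha] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}, p-value = {p_value:.4f}: {decisions[alpha]}")
```

With these made-up numbers, the null hypothesis is rejected at the 0.05 and 0.10 levels but not at the stricter 0.01 level.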
Types of Hypothesis Testing
- One-Tailed Test
a. Left Tailed Test
b. Right Tailed Test
- Two-Tailed Test
Tails are the extreme ends of the probability distribution used for the test.
1. One-Tailed Test
In a one-tailed test, the rejection region is either on the left or right tail.
In Left Tailed Test, the rejection region is on the left tail.
In Right Tailed Test, the rejection region is on the right tail.
2. Two-Tailed Test
In a two-tailed test, the rejection region is present on both tails.
How much of the tail is considered the rejection region depends on the value of the significance level α.
For a significance level of 0.05,
- In the left-tailed test, the rejection region is the 5% area on the left tail.
- In the right-tailed test, the rejection region is the 5% area on the right tail.
- In the two-tailed test, the rejection region is the 2.5% area on the left tail plus the 2.5% area on the right tail.
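The rejection regions above can be turned into cut-off values on the test-statistic scale. The following is a minimal sketch, assuming the test statistic follows a standard normal distribution (a z-test); the helper names `critical_values` and `in_rejection_region` are made up for this illustration and are not part of any library.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Return (lower, upper) z cut-offs; None means that tail has no rejection region."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.inv_cdf(alpha), None
    if tail == "right":
        return None, std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        # alpha is split equally between the two tails.
        return std_normal.inv_cdf(alpha / 2), std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'left', 'right' or 'two'")


def in_rejection_region(z_stat, alpha, tail):
    """True when the observed z statistic falls inside the rejection region."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and z_stat < lower
    above = upper is not None and z_stat > upper
    return below or above


for tail in ("left", "right", "two"):
    print(tail, critical_values(0.05, tail), in_rejection_region(1.8, 0.05, tail))
```

For α = 0.05 the cut-offs are roughly -1.645 (left-tailed), 1.645 (right-tailed) and ±1.96 (two-tailed). An observed z of 1.8 therefore falls in the rejection region of the right-tailed test only.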
Errors in Hypothesis Testing
No Test can be 100% accurate. There is a possibility to make mistakes while doing hypothesis testing too. There are two types of errors in hypothesis testing.
- Type 1 Error
- Type 2 Error
1. Type 1 Error
- When a true null hypothesis is rejected.
- It is also called False Positive.
- The probability of making this error is α, the significance level.
- Since the value of the significance level is selected by you, the responsibility of making this error is on you.
2. Type 2 Error
- When a false null hypothesis is not rejected.
- It is also called False Negative.
- The probability of making this error is β.
- β depends on the sample size and variance.
The probability of rejecting a false null hypothesis is 1-β, also called the power of the test. Our goal is to increase the power of the test, which can be done by increasing the sample size.
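To make the link between sample size and power concrete, here is a rough sketch for one specific case: a right-tailed z-test with a known population standard deviation. The formula used, power = 1 - Φ(z_critical - (μ1 - μ0) / (σ / √n)), applies to that case only, and the values below (μ0 = 70, true mean μ1 = 73, σ = 10) as well as the function name `power_right_tailed` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_right_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test of H0: mean == mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)   # cut-off for rejecting H0
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # how far the true mean moves the z statistic
    return 1 - std_normal.cdf(critical - shift)


for n in (10, 30, 100):
    power = power_right_tailed(mu0=70, mu1=73, sigma=10, n=n)
    print(f"n = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Under these assumptions, power rises and β falls as n grows, because a larger sample shrinks the standard error σ / √n.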
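The confusion matrix below summarizes Type 1 and Type 2 errors.
| Decision | H0 is true | H0 is false |
|---|---|---|
| Reject H0 | Type 1 Error (False Positive), probability α | Correct decision, probability 1-β (power) |
| Fail to reject H0 | Correct decision, probability 1-α | Type 2 Error (False Negative), probability β |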
Thank you for reading this article till the end. I hope it made your concepts about hypothesis testing clear. If you are interested in learning how to perform hypothesis testing with an example using different methods, like the p-value method and the critical value method, you can read this article.
Feel free to ask any query or give your feedback in the comment box below.