What are Confidence Intervals? | Inferential Statistics for Data Science

Hello everyone. In this series on Statistics for Data Science, we have already covered Descriptive Statistics in Part 1 and Part 2. In this article, we will look at what statistics is, the types of statistics, and inferential statistics in detail, including topics like confidence intervals.

Statistics

Statistics is the root of Data Analysis. It is a basic topic you should learn if you want to start with Data Science as it helps you to understand data and draw conclusions from it. Statistics deals with the ways to gather data, review data, manipulate data, and draw insights from it.

Statistical methods have applications in different areas such as medicine, business, economics, social science, and others.

Types of Statistics

  1. Descriptive Statistics
  2. Inferential Statistics

Descriptive Statistics

This field of Statistics deals with the ways to organize, represent and describe a collection of data using tables, graphs, and summarization. Descriptive Statistics is already covered in Part 1 and Part 2.

Inferential Statistics

Inferential statistics exists because you rarely have data for the whole population; usually only a sample is available.

Inferential Statistics is the set of statistical methods used to draw conclusions (inferences) about a population from sample data.

Two major areas of inferential statistics are Hypothesis Testing and Regression Analysis. You can learn about Linear Regression here.

Descriptive statistics describes the data you have collected, while
inferential statistics lets you generalize from that sample and make
predictions about the population.

Estimation

An estimate is a value, calculated from sample data, that approximates some property (parameter) of the population.

  1. Point Estimation
  2. Interval Estimation

1. Point Estimation

Point estimation is the process of finding a single value from sample data that represents a population parameter. For example, the sample mean is a point estimate of the population mean.
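As a small illustration, the sample mean can be computed as a point estimate of the population mean. The sample values below are made up for this sketch and are not data from the article.

```python
from statistics import mean

# Hypothetical sample of five heights in cm (assumed values for illustration).
sample = [172, 168, 175, 169, 171]

# The sample mean is a single number used as the point estimate
# of the unknown population mean.
point_estimate = mean(sample)
print(point_estimate)  # 171
```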

2. Interval Estimation

Interval estimation is the process of finding a range of values from sample data that is likely to contain the population parameter. For the symmetric intervals used in this article, the point estimate lies exactly in the middle of the interval estimate.

Confidence Interval is the most commonly used Interval Estimate.

Estimation is the process of calculation, and an estimate is the calculated value.

Confidence Interval

  • A CI is a range of values, computed from sample data, within which you expect your population parameter to be. The confidence level describes how often intervals built this way would contain the parameter over repeated sampling.
  • A confidence interval is computed at a chosen confidence level. 90%, 95%, and 99% are the most commonly used confidence levels.
  • A confidence interval is the point estimate plus and minus the variation in that estimate (also called the Margin of Error).
  • [Point Estimate - Reliability Factor*Standard Error, Point Estimate + Reliability Factor*Standard Error] is used to calculate the confidence interval. The first part is used to calculate the lower limit and the second part is used to calculate the upper limit. We will understand the formula in the later part of this article.
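The general recipe of point estimate plus and minus reliability factor times standard error can be sketched in a few lines. The function name `interval_estimate` and the numbers passed to it are assumptions chosen for illustration only.

```python
def interval_estimate(point_estimate, reliability_factor, standard_error):
    """Return (lower limit, upper limit) around a point estimate."""
    margin_of_error = reliability_factor * standard_error
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Hypothetical inputs: sample mean 50, standard error 2,
# reliability factor 1.96 (the usual z-table value for a 95% level).
lower, upper = interval_estimate(50.0, 1.96, 2.0)
print(round(lower, 2), round(upper, 2))  # 46.08 53.92
```

The later sections only change where the reliability factor and the standard error come from.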

Misconception about Confidence Interval

  • A common misreading of a 95% confidence interval is that 95% of the population data lies in this range.
  • It actually means that we are 95% confident that this range contains the population mean: if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain it.
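A short simulation can make the difference between the two readings concrete. The sketch below assumes a made-up normal population (mean 100, standard deviation 15) and builds a 95% interval from each of many samples; it then checks how often the interval contains the true mean and how much of the population falls inside a single interval. All names and numbers are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population (assumed values): normal with known mean and sd.
MU, SIGMA = 100.0, 15.0
N = 30          # size of each sample
TRIALS = 2000   # number of repeated samples

z_crit = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value
margin = z_crit * SIGMA / sqrt(N)      # same for every sample, since sigma is known

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    lower, upper = x_bar - margin, x_bar + margin
    if lower <= MU <= upper:
        hits += 1

# Fraction of intervals that captured the true population mean.
coverage = hits / TRIALS

# Share of individual population values inside the last interval built.
population = NormalDist(MU, SIGMA)
share_of_population = population.cdf(upper) - population.cdf(lower)

print(f"intervals containing the mean: {coverage:.3f}")
print(f"population values inside one interval: {share_of_population:.3f}")
```

The first number should come out close to 0.95, while the second is far smaller, which is why the interval should not be read as a range covering 95% of the data.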

z-test

Before learning the z-test, it is recommended to go through this article.

  • A z-test is one of the hypothesis testing methods.
  • It is used when the sample size is large and population variance is known.
  • The test statistic (z-statistic) is assumed to follow Normal Distribution.
  • A z-statistic is a number representing the result from the z-test.
z-statistic = $\dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

where
x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size
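The z-statistic is the distance between the sample mean and the population mean, measured in standard errors (σ/√n). A minimal sketch, with a hypothetical function name and made-up inputs:

```python
from math import sqrt

def z_statistic(sample_mean, population_mean, population_sd, n):
    """z = (x̄ - μ) / (σ / √n)"""
    standard_error = population_sd / sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical values: x̄ = 52, μ = 50, σ = 10, n = 100.
z = z_statistic(52.0, 50.0, 10.0, 100)
print(z)  # 2.0
```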

t-test

  • A t-test is one of the hypothesis testing methods.
  • It is used when the sample size is small and the population variance is unknown.
  • The test statistic (t-statistic) is assumed to follow Student’s T-Distribution.
  • A t-statistic is a number representing the result from the t-test.
t-statistic = $\dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

where
x̄ = sample mean
μ = population mean
s = sample standard deviation
n = sample size

Degree of Freedom = n-1

Student's T-Distribution was one of the major breakthroughs in the field of
statistics, as it lets you make inferences from small samples with
unknown population variance.
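The t-statistic has the same shape as the z-statistic, but it uses the sample standard deviation s (computed with n-1 in the denominator) in place of σ. A minimal sketch with a made-up sample and a hypothetical population mean of 12:

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, population_mean):
    """t = (x̄ - μ) / (s / √n), with s the sample standard deviation."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev divides by n - 1
    return (mean(sample) - population_mean) / standard_error

# Hypothetical small sample (assumed values for illustration).
sample = [12, 15, 11, 14, 13]
t = t_statistic(sample, 12)
degrees_of_freedom = len(sample) - 1

print(round(t, 4), degrees_of_freedom)  # 1.4142 4
```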

CI when Population Variance is Known

  • When Population Variance is Known, a z-test is used.
  • It is assumed that population data follows Normal Distribution.
Confidence Interval Formula = $\bar{x} \pm z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}$

where
x̄= point estimate (mean)
z  = critical value (calculated from z table)
σ = population standard deviation
α = significance level
n  = sample size
σ/√n = standard error
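With a known population standard deviation, the interval is the sample mean plus and minus the z critical value times σ/√n. The sketch below uses Python's standard-library `statistics.NormalDist` in place of a z table; the function name and the input numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, population_sd, n, confidence=0.95):
    """CI for the mean when the population standard deviation is known."""
    alpha = 1 - confidence                          # significance level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided critical value
    margin = z_crit * population_sd / sqrt(n)       # critical value * standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical values: x̄ = 50, σ = 10, n = 100, 95% confidence.
low, high = z_confidence_interval(50.0, 10.0, 100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```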

CI when Population Variance is Unknown

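  • When Population Variance is Unknown, a t-test is used.
  • It is assumed that the population data follows a Normal Distribution; the test statistic then follows the Student's T-Distribution with n-1 degrees of freedom.
Confidence Interval Formula = $\bar{x} \pm t_{n-1,\,\alpha/2} \cdot \dfrac{s}{\sqrt{n}}$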

where
x̄ = point estimate (mean)
s = sample standard deviation
α = significance level
n  = sample size
n-1 = degrees of freedom
t = critical value (calculated from t table)
s/√n = standard error
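When the population variance is unknown, the interval is the sample mean plus and minus the t critical value times s/√n. Python's standard library has no t-distribution, so this sketch hard-codes a few two-sided 95% critical values copied from a standard t table; the dictionary, the function name, and the sample are all assumptions for illustration. A statistics library such as SciPy could supply the critical value instead.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values from a standard t table, keyed by
# degrees of freedom (only a few rows are included in this sketch).
T_TABLE_95 = {4: 2.776, 9: 2.262, 29: 2.045}

def t_confidence_interval_95(sample):
    """95% CI for the mean when the population variance is unknown."""
    n = len(sample)
    t_crit = T_TABLE_95[n - 1]                 # look up by degrees of freedom
    margin = t_crit * stdev(sample) / sqrt(n)  # critical value * standard error
    x_bar = mean(sample)
    return x_bar - margin, x_bar + margin

# Hypothetical small sample (assumed values for illustration).
sample = [12, 15, 11, 14, 13]
low, high = t_confidence_interval_95(sample)
print(round(low, 3), round(high, 3))  # 11.037 14.963
```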

1. CI with known population variance is typically narrower than CI with
unknown population variance because there is more uncertainty in the latter
case.

2. z critical values can be looked up in a z-table and t critical values
can be looked up in a t-table.
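One way to see why the t-based interval tends to be wider is to compare critical values at the same confidence level. The sketch below takes the 95% z value from the standard library and a few 95% t values copied from a standard t table (the selection of rows is an assumption for illustration).

```python
from statistics import NormalDist

# Two-sided 95% z critical value (about 1.960).
z_95 = NormalDist().inv_cdf(0.975)

# Two-sided 95% t critical values from a standard t table,
# keyed by degrees of freedom.
T_TABLE_95 = {4: 2.776, 9: 2.262, 29: 2.045, 120: 1.980}

for df, t_crit in T_TABLE_95.items():
    print(f"df={df:>3}  t={t_crit:.3f}  z={z_95:.3f}  t/z={t_crit / z_95:.3f}")
```

Every t value is larger than the z value, and the gap shrinks as the degrees of freedom grow, so the extra width matters most for small samples.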

Margin of Error

Confidence Interval can be calculated by adding and subtracting the margin of error from the point estimate (mean).

A smaller ME results in narrower confidence intervals and hence more precise results. The margin of error can be controlled by changing some parameters in the confidence interval formula.

For Narrow Confidence Intervals

  • The z and t critical values are in the numerator, so a lower confidence level (which gives a smaller critical value) narrows the interval.
  • The standard deviation is in the numerator, so less variability in the data narrows the interval.
  • The sample size is in the denominator, so a larger sample narrows the interval.
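The effect of each parameter on the margin of error can be checked numerically. This sketch uses the z-based margin of error (critical value times σ/√n) with an assumed standard deviation of 10; the function name and the chosen sample sizes and confidence levels are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(confidence, sd, n):
    """z-based margin of error: critical value * sd / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sd / sqrt(n)

SD = 10.0  # hypothetical standard deviation

# Larger samples shrink the margin of error (95% confidence).
for n in (25, 100, 400):
    print(f"n={n:>3}  ME={margin_of_error(0.95, SD, n):.2f}")

# Higher confidence levels widen it (n = 100).
for level in (0.90, 0.95, 0.99):
    print(f"level={level:.2f}  ME={margin_of_error(level, SD, 100):.2f}")
```

Because the sample size sits under a square root, quadrupling n only halves the margin of error.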

High Confidence 

  • A low standard deviation means the data is more concentrated around the mean, so we can be more confident that the estimate is close to the true value.
  • The more samples you have in your data, the more certain you can be of the estimate.

End Notes

Thank you for reading this article. We are now familiar with inferential statistics, confidence intervals, and tests like the z-test and t-test. Next, we will learn Hypothesis Testing, which is an advanced topic in statistics for Data Science.

I hope this article was informative. Feel free to ask any query or give your feedback in the comment box below.

Happy Learning!