While training a model with any machine learning algorithm, the target is to get predictions as close to the actual values as possible. Today we will learn about gradient descent, a popular algorithm used to optimize machine learning models.
What is Gradient Descent?
It is not a machine learning algorithm but rather an optimization algorithm, used to find the best values of the coefficients so that the overall cost of the model is minimized.
For example, in linear regression the cost function is the mean squared error:

$$\text{Cost} = \frac{1}{2n}\sum_{i=1}^{n}\left(m\,x_i + c - y_i\right)^2$$

where $n$ is the number of training examples (the factor $\frac{1}{2}$ is conventional and keeps the slope formulas below clean).
The gradient descent algorithm is used to find values of m and c (the coefficients, or weights) such that the overall cost is minimized, i.e., the predicted y is as close to the actual y as possible.
Convergence Theorem
```
repeat until convergence {
    m[i] := m[i-1] - (learning_rate * slope_m)
    c[i] := c[i-1] - (learning_rate * slope_c)
}
```
where the slopes are the partial derivatives of the cost with respect to $m$ and $c$; for a single training example $(x_i, y_i)$ they are:

$$\text{slope\_m} = x_i\,(m\,x_i + c - y_i) \qquad\qquad \text{slope\_c} = (m\,x_i + c - y_i)$$
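To make the update rule concrete, here is a minimal sketch in plain Python of one update step for a single training example, following the formulas above (the starting coefficients, the data point, and the learning rate are illustrative values):

```python
# current coefficients and a single training example (illustrative values)
m, c = 0.5, 0.0
x_i, y_i = 2.0, 5.0           # a point from the line y = 2x + 1
learning_rate = 0.01

# slopes for this one example, as defined above
error = m * x_i + c - y_i     # (m*x[i] + c - y[i]) = -4.0 here
slope_m = x_i * error         # x[i] * (m*x[i] + c - y[i])
slope_c = error

# one gradient descent update
m = m - learning_rate * slope_m   # 0.5 - 0.01 * (-8.0) = 0.58
c = c - learning_rate * slope_c   # 0.0 - 0.01 * (-4.0) = 0.04
print(m, c)
```

In the batch and mini-batch variants described later, these per-example slopes are averaged over many examples before the update is applied.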
Case 1: If Slope is Positive
m[i] = m[i-1] - [(learning rate) * (+slope_m)]
The value of the coefficient is reduced, hence the coefficient shifts toward the left.
Case 2: If Slope is Negative
m[i] = m[i-1] - [(learning rate) * (-slope_m)]
     = m[i-1] + [(learning rate) * slope_m]
The value of the coefficient is increased, hence the coefficient shifts toward the right.
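As a quick worked example with illustrative numbers: if m[i-1] = 2.0, the learning rate is 0.1, and slope_m = -4, then m[i] = 2.0 - 0.1 × (-4) = 2.4, so the coefficient moves to the right. With slope_m = +4 instead, m[i] = 2.0 - 0.1 × 4 = 1.6, moving to the left. In both cases the update moves the coefficient against the slope, i.e., downhill.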
Learning Rate
It is a hyperparameter that controls the amount by which the coefficients are adjusted, i.e., how far they shift to the left or right at each step. Since it is a hyperparameter, its value is set by us and can be tuned.
Case 1: If Learning Rate is very High
A high value for the learning rate generally causes the updates to overshoot the minimum, so the cost can diverge instead of decreasing; this overshoot problem is often described as exploding gradients.
Case 2: If Learning Rate is too Low
Keeping a low value for the learning rate makes gradient descent very slow, as it creeps toward the global minimum in tiny steps; the updates can become so small that training effectively stalls (an effect sometimes compared to the vanishing gradient problem).
How to Find Optimal Value for Learning Rate?
Initially keep a relatively high value for the learning rate, and if you find the cost overshooting or diverging, decrease it until you reach a value at which training converges smoothly.
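One simple way to do this in practice is a small sweep: run the same training loop with a few candidate learning rates and keep the one with the lowest final cost. A minimal sketch, assuming the linear-regression setup from earlier (the candidate rates and toy data are illustrative):

```python
import numpy as np

def final_cost(x, y, learning_rate, n_steps=200):
    """Run gradient descent and return the cost after training."""
    m, c = 0.0, 0.0
    for _ in range(n_steps):
        error = m * x + c - y
        m -= learning_rate * np.mean(x * error)
        c -= learning_rate * np.mean(error)
    return np.mean((m * x + c - y) ** 2) / 2

# toy data from y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

# very high rates overshoot (cost blows up); very low rates barely move
for lr in [10.0, 1.0, 0.1, 0.01, 0.001]:
    print(lr, final_cost(x, y, lr))
```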
Global Minima
It is the point at which the cost is lower than at every other point.
Local Minima
It is a point at which the cost is lower than at all nearby points, but not lower than at some points farther away.
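To see the difference concretely, we can run gradient descent on a non-convex function from two different starting points. A minimal sketch, using the illustrative function f(x) = x⁴ - 3x² + x, which has one local and one global minimum:

```python
def grad(x):
    # derivative of f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, n_steps=1000):
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x

print(descend(2.0))    # settles near x ≈ 1.13, a local minimum
print(descend(-2.0))   # settles near x ≈ -1.30, the global minimum
```

Gradient descent only follows the local slope, so which minimum it reaches depends on where it starts.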
Types of Gradient Descent
There are three types of gradient descent:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini Batch Gradient Descent
1. Batch Gradient Descent
In batch gradient descent, the cost gradient is calculated over the complete training data at each step.
For a single step, the per-example gradients of all training data are averaged, and this average is used to update the coefficients.
It becomes very slow and computationally expensive for large datasets.
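A minimal sketch of batch gradient descent for the linear-regression cost above, on illustrative toy data (every example contributes to each update):

```python
import numpy as np

# toy data from y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

m, c = 0.0, 0.0
learning_rate = 0.1

for _ in range(500):                  # one update per pass over ALL data
    error = m * x + c - y
    slope_m = np.mean(x * error)      # average per-example slope
    slope_c = np.mean(error)
    m -= learning_rate * slope_m
    c -= learning_rate * slope_c

print(m, c)   # should end up close to 3 and 2
```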
2. Stochastic Gradient Descent
In stochastic gradient descent, the weights are updated after each individual training example.
Since an update is made for every single example, the cost fluctuates a lot and may not decrease at every step, but it does decrease in the long run.
It works well for large datasets: it converges faster because the parameters are updated much more frequently.
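A minimal sketch of stochastic gradient descent on the same kind of toy data (the epoch count and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

m, c = 0.0, 0.0
learning_rate = 0.05

for epoch in range(20):
    for i in rng.permutation(len(x)):      # visit examples in random order
        error = m * x[i] + c - y[i]
        m -= learning_rate * x[i] * error  # update from a single example
        c -= learning_rate * error

print(m, c)   # close to 3 and 2, though individual steps are noisy
```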
3. Mini Batch Gradient Descent
In mini-batch gradient descent, the training dataset is divided into mini-batches, and the cost gradient is calculated over one batch at each step.
For a single step, the per-example gradients of the current batch are averaged, and this average is used to update the coefficients; the updated values are then used in the next iteration.
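A minimal sketch of mini-batch gradient descent on the same kind of toy data (batch_size and the other settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

m, c = 0.0, 0.0
learning_rate = 0.1
batch_size = 16

for epoch in range(100):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]   # one mini-batch of indices
        error = m * x[batch] + c - y[batch]
        m -= learning_rate * np.mean(x[batch] * error)
        c -= learning_rate * np.mean(error)

print(m, c)   # close to 3 and 2
```

This strikes a balance between the stability of batch gradient descent and the frequent, cheap updates of stochastic gradient descent.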
End Notes
Thank you for reading till the end. By now, we are familiar with the idea behind gradient descent and its variants.
I hope this article was informative. Feel free to ask any query or give your feedback in the comment box below.
Happy Learning!