## Varsha Saini # Gradient Descent Algorithm – Detailed Explanation

While training a model using any machine learning algorithm, the target is to get prediction results as close to actual values as possible. Today we will learn a popular algorithm that can be used to optimize a machine learning algorithm.

It is not a machine learning algorithm but rather an optimization algorithm used to find the best values of coefficients such that the overall cost of the model is reduced.

For Example, In Linear Regression Cost Function is:

`Cost = `

Gradient Descent algorithm is used to find such values of m and c (coefficients or weights) such that the overall cost is reduced i.e y-predicted is as close to y-actual as possible.

## Convergence Theorem

```repeat until converge {

m[i] : = m[i-1] - (learning rate * slope_m)

c[i] : =  c[i-1] -  (learning rate * slope_c)

}```

where

slope_m = and       slope_c =  = x[i]*(m*x[i]+c-y[i]) = (m*x[i]+c-y[i]) ### Case 1: If Slope is Positive

`m[i]  = m[i-1] - [(learning rate) * (+ slope_m)]`

The value of the coefficient is reduced hence coefficient shifts towards the left.

### Case 2: If Slope is Negative

```m[i]  = m[i-1] - [(learning rate) * (- slope_m)]

= m[i-1] + [(learning rate) * (slope_m)]```

The value of the Coefficient is increased hence coefficient shifts towards the right. ## Learning Rate

It is a hyperparameter that can control the amount by which coefficients are adjusted i.e. shifted to either direction left or right. Since it is a hyperparameter, we can change its value.

### Case 1: If Learning Rate is very High

A high value for the learning rate generally causes an overshoot problem, also called Exploding gradient.

### Case 2: If Learning Rate is too Low

Keeping a low value for the learning rate makes gradient descent very slow as it moves slowly towards the global minima. It can cause the Vanishing Gradient problem. #### How to Find Optimal Value for Learning Rate?

Initially keep a high value for the learning rate and if you find it is overshooting, decrease its value until you reach an optimal value.

### Global Minima

It is the point that is minimum when compared to all other points.

### Local Minima

It is the point that is minimum when compared to the nearest points but not when compared to points at some distance. There are 3 types of Gradient Descent :

In the case of Batch Gradient Descent, the Cost Gradient is calculated by considering complete training data for each step.

For one single step, the average is taken from the gradient of all training data and the calculated average is used to update coefficient parameters.

It becomes very slow and computationally expensive for large datasets.

In the case of Stochastic Gradient Descent, weights are updated for each training data value.

Since the gradient is updated for each value, the cost will fluctuate a lot and may not decrease at each step but will decrease in long run.

It is very good in the case of large datasets. It converges faster when the dataset is large as it causes updates to the parameters more frequently.

### 3. Mini Batch Gradient Descent

In the Case of Mini Batch Gradient Descent, the training dataset is divided into mini-batches and the cost gradient is calculated by considering one batch for each step.

For one single step, the average is taken from the gradient of the current batch and the calculated average is used to update coefficient parameters. The updated value is used in the next iteration. ## End Notes

Thank you for reading till the end. By the end of this article, we are familiar with the idea behind gradient descent and its variations.

Happy Learning!

### 1 thought on “Gradient Descent Algorithm – Detailed Explanation”

1. Pingback: Logistic Regression - Varsha Saini