The performance of a machine learning model commonly suffers from one of two major issues: overfitting or underfitting. In this article, we will learn what overfitting is, why it occurs, and how to prevent it.
Overfitting in Machine Learning
Overfitting is the situation where a machine learning model performs very well on the training data but poorly on the testing data. In this case the bias (the error on the training data) is low and the variance (the error on the testing data) is high.
Overfitting occurs because the model learns the training data too well, i.e. it even learns the noise in the data. The random fluctuations in the training data are picked up as concepts by the model, but those concepts do not carry over to new data. Therefore, when the model is tested on new data, it performs poorly.
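To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed; the dataset, polynomial degrees, and noise level are illustrative choices, not from the article) that fits a simple and a very flexible model to the same noisy data. The flexible model chases the noise and ends up with much higher test error.

```python
# Minimal sketch of overfitting: a degree-15 polynomial fits the noisy training
# points almost perfectly but generalizes worse than a simple linear fit.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=40)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # The degree-15 model has very low training error (low "bias" in the
    # article's informal sense) but much higher test error (high "variance").
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```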
Reasons for Overfitting
Below are a few reasons for overfitting:
- The training data has too many features relative to the number of samples.
- The training data is noisy, i.e. it contains outliers and errors.
- The model is too complex for the amount of data available.
How to Prevent Overfitting
Below are a few ways overfitting can be prevented:
- Train with more data.
- Reduce model complexity.
- Use regularization methods such as ridge regression (L2 penalty) and lasso regression (L1 penalty), as shown in the sketch after this list.
- Use dropout layers in the case of neural networks.
- Use techniques like cross-validation.
- Use pruning methods (pre-pruning or post-pruning) in the case of decision trees.
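As a concrete illustration of two items from this list, regularization and cross-validation, the sketch below (assuming scikit-learn; the dataset and the alpha value are illustrative) compares plain linear regression with ridge regression using cross-validated scores on a small, high-dimensional dataset.

```python
# Minimal sketch of regularization plus cross-validation (assumes scikit-learn).
# Ridge regression adds an L2 penalty on the coefficients, which discourages
# the model from fitting noise; cross-validation estimates generalization error.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Small, noisy dataset with many features -- a setting that invites overfitting.
X, y = make_regression(n_samples=60, n_features=40, noise=20.0, random_state=0)

plain = LinearRegression()
regularized = Ridge(alpha=10.0)  # alpha is illustrative; in practice, tune it

for name, model in [("plain", plain), ("ridge", regularized)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:6s} mean CV R^2 = {scores.mean():.3f}")
```

With this setup the regularized model typically scores better on held-out folds, because the penalty keeps the coefficients small instead of letting them absorb the noise.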
Bias Variance Tradeoff
Bias and variance trade off against each other: as model complexity increases, bias tends to decrease while variance tends to increase, and vice versa.
Loosely speaking, bias is the error on the training data and variance is the error on the testing data.
- In overfitting, the model performs very well on the training dataset, so bias is low, but it performs badly on the testing dataset, so variance is high.
- In underfitting, the model performs well on neither the training data nor the testing data, so both bias and variance are high.
- For a balanced model, the error on both training and testing data should be low, so both bias and variance should be low.
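For readers who want the more formal statement behind this tradeoff, the standard bias-variance decomposition of the expected squared test error (a textbook result, stated here alongside the informal definitions above) is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Here \(\hat{f}\) is the learned model, \(f\) is the true function, and \(\sigma^2\) is the noise in the data; reducing model complexity lowers the variance term but typically raises the bias term, which is the tradeoff described above.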