What are Evaluation Metrics in Machine Learning?
Evaluating a statistical or machine learning model after it is built is essential to check its quality. Evaluation metrics measure the performance of the model, and different metrics are used depending on the problem we are solving.
Why do we Require Evaluation Metrics?
- After we train a machine learning model, we test it on unseen data; evaluation metrics quantify how well the trained model performs on that data.
- It is important to learn the different model evaluation metrics so you can decide which one matters for a particular project.
The most common machine learning tasks are regression and classification. Let us look at the evaluation metrics for each of them.
Evaluation Metrics for Regression
Below is a list of model evaluation metrics for regression:
1. Mean Absolute Error (MAE)
Mean absolute error is calculated by taking the sum of the absolute errors (differences between actual and predicted values) and dividing it by the number of data points.
MAE = $\frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$, where $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of data points.
MAE is the most robust of these metrics to outliers, and it returns the error in the same units as the output variable. However, the MAE function is not differentiable at zero, which is a drawback when we apply optimizers like gradient descent.
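As a quick sketch, MAE can be computed directly with NumPy (the function name and sample values below are illustrative, not from any particular library; in practice scikit-learn's `sklearn.metrics.mean_absolute_error` does the same job):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values: absolute errors 0.5, 0.0, 1.5 average to 2/3
mae = mean_absolute_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])
```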
2. Mean Squared Error (MSE)
Mean squared error is calculated by taking the sum of the squared errors (differences between actual and predicted values) and dividing it by the number of data points.
MSE = $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Squaring removes the negative terms, but it makes MSE sensitive to outliers, since large errors are magnified. Optimizers like gradient descent can easily be applied because the function is differentiable.
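The same sketch works for MSE with one change, squaring instead of taking the absolute value (again an illustrative implementation, not a library function):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Same sample data as before: squared errors 0.25, 0.0, 2.25 average to 2.5/3
mse = mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])
```

Note how the outlier-like error of 1.5 contributes 2.25 here versus 1.5 under MAE, which is exactly why MSE is less robust to outliers.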
3. Root Mean Squared Error (RMSE)
Root mean squared error is simply the square root of the mean squared error. It returns the error in the same units as the output variable, but it is still not as robust to outliers as mean absolute error.
RMSE = $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
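A sketch of RMSE is just the square root of the MSE computation (illustrative naming, as above):

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse = root_mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])
```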
4. R squared Error or Coefficient of Determination
R-squared measures the performance of the model relative to a baseline rather than reporting a raw loss. The baseline model simply predicts the mean of the target for every point, and the current model is compared against it. The R-squared statistic is the proportion of variation in the target variable that is explained by the linear regression model.
Also known as the goodness of fit, it quantifies how much better the regression line is than the mean line: R-Squared = $1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares around the mean.
A disadvantage of R-squared is that its value stays constant or increases whenever a feature is added, even an insignificant one, whereas we would expect it to decrease when an unimportant feature is included in the data.
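The comparison against the mean baseline can be sketched as follows (illustrative code; scikit-learn's `sklearn.metrics.r2_score` computes the same quantity):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the current model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean baseline
    return float(1 - ss_res / ss_tot)

# Predictions close to the actual values give an R-squared close to 1
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```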
5. Adjusted R Squared Error
It is a modified version of R-squared that also includes the number of features in its formula. Adjusted R-squared decreases if an insignificant feature is added to the data.
Adjusted R-Squared = $1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
where
- n = number of data points in our dataset.
- k = number of independent variables.
- $R^2$ = R-squared value of the model.
It is one of the most important model evaluation metrics.
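The adjustment is a one-line formula once $R^2$, $n$, and $k$ are known (illustrative function name):

```python
def adjusted_r_squared(r2, n, k):
    """Penalize R-squared by the number of independent variables k,
    given n data points."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With R-squared 0.98, 50 data points, and 5 features
adj_r2 = adjusted_r_squared(0.98, n=50, k=5)
```

Because the penalty grows with $k$, adding a feature that barely raises $R^2$ lowers the adjusted value.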
Evaluation Metrics for Classification
Below is a list of model evaluation metrics for classification. All of them can be explained using a confusion matrix.
Confusion Matrix for Binary Classification
A confusion matrix for binary classification has two classes, class 0 (negative) and class 1 (positive). The rows represent the actual class and the columns represent the class predicted by the machine learning model:

|          | Predicted 0         | Predicted 1         |
|----------|---------------------|---------------------|
| Actual 0 | True Negative (TN)  | False Positive (FP) |
| Actual 1 | False Negative (FN) | True Positive (TP)  |
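A minimal sketch of how such a matrix is tallied (scikit-learn's `sklearn.metrics.confusion_matrix` provides the same, with rows as actual and columns as predicted classes):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[i, j] counts samples whose actual class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

cm = confusion_matrix([0, 0, 1, 1, 1, 0], [0, 1, 1, 0, 1, 0])
# [[2, 1],   <- actual 0: 2 TN, 1 FP
#  [1, 2]]   <- actual 1: 1 FN, 2 TP
```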
1. Accuracy
Accuracy is defined as the ratio of correctly predicted values to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). It is best used when the dataset is balanced.
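In terms of the confusion-matrix cells (illustrative helper, counts are made up):

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# 40 TP, 45 TN, 5 FP, 10 FN -> 85 correct out of 100
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
```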
2. Recall or Sensitivity
Recall is defined as the ratio of values correctly predicted as the positive class to the total number of values that actually belong to the positive class: Recall = TP / (TP + FN). It is used when a false negative is more costly.
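As a sketch, only the actual positives matter in the denominator (illustrative helper):

```python
def recall(tp, fn):
    """True positives divided by all samples that are actually positive."""
    return tp / (tp + fn)

# 40 TP and 10 FN: 40 of the 50 actual positives were found
rec = recall(tp=40, fn=10)
```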
3. Precision
Precision is defined as the ratio of values that actually belong to the positive class to the total number of values predicted as the positive class: Precision = TP / (TP + FP). It is used when a false positive is more costly. (Note that precision is not the same as specificity, which is TN / (TN + FP), the recall of the negative class.)
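Here the denominator covers everything the model called positive (illustrative helper):

```python
def precision(tp, fp):
    """True positives divided by all samples predicted as positive."""
    return tp / (tp + fp)

# 40 TP and 5 FP: 40 of the 45 positive predictions were correct
prec = precision(tp=40, fp=5)
```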
4. F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It is used when both false positives and false negatives are equally important.
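The harmonic mean can be sketched as a one-liner (illustrative name; scikit-learn's `sklearn.metrics.f1_score` works from labels rather than precomputed precision and recall):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Using precision 40/45 and recall 40/50 from the examples above
f1 = f1_score(40 / 45, 40 / 50)
```

The harmonic mean punishes imbalance: if either precision or recall is near zero, F1 is near zero, unlike the arithmetic mean.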