As a machine learning engineer, you want the model you build to actually solve the business problem, which makes model evaluation a crucial step: it is how you check the impact of your model on that problem. Different model evaluation techniques are available for different types of ML problems.
Standard model evaluation metrics in classification problems are accuracy, recall, precision, and F1 score. But these metrics evaluate the model overall and are sometimes not enough to check its performance in the context of a particular business.
Lift & Gain Analysis
- Lift and gain analysis evaluates model predictions in a business context and checks whether the model is actually benefiting the business.
- It is used in classification problems with imbalanced data.
- It helps evaluate model performance on a portion of the population.
- It measures how much better the predictions are with the machine learning model than without it.
Let’s understand lift and gain analysis using Python on the Telco Customer Churn dataset from Kaggle.
Step 1: Load the Data
The data can be downloaded from the Telco Customer Churn dataset page on Kaggle. It has 21 columns and 7043 records.
```python
import pandas as pd

data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.shape
```
The code below prints the column names and their data types.
```python
data.info()
```
Many columns have the object data type; these need to be converted to a numerical data type.
Step 2: Data Preprocessing
The data needs to be processed before applying any machine learning model. The code below performs the following tasks:
- handles the missing values.
- handles the categorical features.
- removes non-critical features.
```python
yes_no_col = ['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity',
              'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
              'StreamingMovies', 'PaperlessBilling', 'Churn']

# Convert all columns with values Yes and No to 1 and 0 respectively
# (any other value, such as 'No internet service', also becomes 0)
for i in yes_no_col:
    data[i] = data[i].apply(lambda x: 1 if x == 'Yes' else 0)

# One-hot encode the remaining multi-valued categorical columns
col_one_hot_encode = ['PaymentMethod', 'Contract', 'MultipleLines', 'InternetService']
data = pd.get_dummies(data, columns=col_one_hot_encode, drop_first=True)

# Convert the gender column with values Male and Female to 1 and 0 respectively
data['gender'] = data['gender'].apply(lambda x: 1 if x == 'Male' else 0)

# Keep the target separately
y = data['Churn']

# Drop the identifier and the target from the features
columns_to_drop = ['customerID', 'Churn']
data.drop(columns_to_drop, axis=1, inplace=True)

# Fill missing values: blank strings become '0'
data = data.replace(' ', '0')

# Convert TotalCharges from string to float
data['TotalCharges'] = data['TotalCharges'].astype('float')
```
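The missing-value handling above can be checked on a toy column. This is a minimal sketch on hypothetical three-row data, not the Telco file. The `pd.to_numeric(..., errors='coerce')` variant is an alternative shown for comparison, not the approach this article uses: it marks blanks as NaN instead of turning them into 0.

```python
import pandas as pd

# Hypothetical toy data: one blank string stands in for a missing TotalCharges value
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Approach used in this article: replace the blank with '0', then cast to float
as_zero = toy["TotalCharges"].replace(" ", "0").astype("float")

# Alternative (an assumption, not the article's method): unparseable values become NaN
as_nan = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(as_zero.tolist())
print(int(as_nan.isna().sum()))
```

Replacing blanks with 0 keeps every row, while the NaN variant makes the missing rows visible so they can be dropped or imputed deliberately.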
Step 3: Model Training
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stratified 70/30 split so both sets keep the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.3, stratify=y, random_state=101
)

model = LogisticRegression()
model.fit(X_train, y_train)
```
Step 4: Scoring
In the code below, we print a classification report that includes precision, recall, and F1 score for both classes 0 and 1.
```python
from sklearn.metrics import classification_report

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```
In the above classification report, class 1 represents the churned customers and class 0 represents the customers who did not churn. As we can see, the model’s ability to predict class 1 is very low. Should we still use this model?
We can evaluate the model’s benefit in the context of the business problem using lift and gain analysis.
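To see why overall metrics can hide a weak class 1, consider a minimal sketch on hypothetical labels (100 customers, 10 churners; these numbers are made up for illustration). A trivial "model" that never predicts churn still scores high accuracy while finding no churners at all.

```python
# Hypothetical imbalanced test set: 100 customers, 10 of whom churned (class 1)
y_true = [1] * 10 + [0] * 90

# A trivial "model" that predicts class 0 (no churn) for everyone
y_pred = [0] * 100

# Accuracy: share of all predictions that are correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: share of actual churners that were predicted as churners
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_class_1 = true_positives / sum(y_true)

print(accuracy)        # 0.9
print(recall_class_1)  # 0.0
```

This is why a per-class or per-segment view, such as the decile analysis that follows, is useful on imbalanced data.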
Step 5: Start Lift and Gain Analysis
Get the predicted probability of the class of interest (here class 1, i.e. churned) and sort the data by it in descending order. The function predict_proba() returns the probability of each class for every data point, and its column 1 holds the probability of class 1, whereas predict() directly returns class 0 or class 1 as the output.
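```python
# Work on a copy so that adding columns does not trigger pandas chained-assignment warnings
X_test = X_test.copy()

# Find the probability of each data point belonging to class 1
X_test['Prob'] = model.predict_proba(X_test)[:, 1]

# Sort the data in descending order of probability (ascending=False)
X_test = X_test.sort_values(by='Prob', ascending=False)

# Attach the actual labels; pandas aligns them on the index
X_test['Churn'] = y_test
X_test['Prob']
```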
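The difference between predict() and predict_proba() can be checked on a small synthetic problem. This is a minimal sketch under stated assumptions: the one-feature data below is randomly generated for illustration and has nothing to do with the Telco dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: one feature, label tends to be 1 when the feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, 2); columns follow clf.classes_
labels = clf.predict(X)        # hard 0/1 labels

# Column 1 is the probability of class 1, the quantity used for ranking
prob_class_1 = proba[:, 1]

print(proba.shape)
print(clf.classes_)
```

For a binary logistic regression, the hard label is 1 exactly when the class 1 probability exceeds 0.5, so the probabilities carry strictly more information than the labels and allow customers to be ranked.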
Step 6: Divide the Data into Deciles.
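The probabilities stored in the column “Prob” can be divided into n equal-sized bins. With n=10 these bins are deciles: decile 1 holds the highest probabilities and decile 10 the lowest.
```python
# qcut bins from lowest to highest value, so the labels run from 10 down to 1
X_test['Decile'] = pd.qcut(X_test['Prob'], 10, labels=[i for i in range(10, 0, -1)])
```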
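The labelling direction is easy to get backwards, so here is a minimal sketch of the same `pd.qcut` call on hypothetical scores (20 evenly spaced values chosen for illustration). It shows that the highest score lands in decile 1 and the lowest in decile 10.

```python
import pandas as pd

# Hypothetical scores for 20 customers: 0.05, 0.10, ..., 1.00
scores = pd.Series([i / 20 for i in range(1, 21)], name="Prob")

# Same call shape as above: 10 quantile bins, labelled 10 (lowest scores) down to 1 (highest)
deciles = pd.qcut(scores, 10, labels=list(range(10, 0, -1)))

# Each decile receives the same number of customers
print(deciles.value_counts(sort=False).tolist())

# Decile of the highest and of the lowest score
print(deciles.loc[scores.idxmax()], deciles.loc[scores.idxmin()])
```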
Step 7: Calculate the Number of Cases and Number of Responses in Each Decile.
The benefit of gain and lift analysis comes from a pattern common in business, where roughly 80% of the revenue comes from 20% of the customers. Counting cases and responses per decile is the main part of the decile analysis used in the gain and lift chart calculation. In the code below, the number of customers and the number of customers who actually belong to class 1 are calculated for each decile.
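```python
# Number of actual churners (class 1) in each decile
res = (
    pd.crosstab(X_test['Decile'], X_test['Churn'])[1]
    .reset_index()
    .rename(columns={1: 'Number of Responses'})
)

# Number of customers in each decile
lg = (
    X_test['Decile']
    .value_counts(sort=False)
    .rename_axis('Decile')
    .reset_index(name='Number of Cases')
)

# Combine both counts. 'Decile' is an ordered categorical (10 < 9 < ... < 1),
# so sorting in descending order puts decile 1 (highest probability) first.
lg = (
    pd.merge(lg, res, on='Decile')
    .sort_values(by='Decile', ascending=False)
    .reset_index(drop=True)
)
lg
```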
We obtain the Number of Cases (the number of records in each decile) and the Number of Responses (the number of actual positives in each decile).
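The same table can be sketched as a self-contained function. `decile_table` is a hypothetical helper, not part of the article's code or of any library, and it differs from the steps above in one respect: it bins on the rank of the probability rather than on the probability itself, which is an assumption made here so that tied probabilities cannot break the equal-sized bins.

```python
import pandas as pd

def decile_table(prob, actual, n_bins=10):
    """Hypothetical helper: count cases and actual positives per score decile."""
    df = pd.DataFrame({"Prob": list(prob), "Actual": list(actual)})
    # Rank 1 = highest probability; method="first" breaks ties by order of appearance
    ranks = df["Prob"].rank(method="first", ascending=False)
    # Lowest ranks (highest probabilities) fall into decile 1
    df["Decile"] = pd.qcut(ranks, n_bins, labels=list(range(1, n_bins + 1))).astype(int)
    return (
        df.groupby("Decile")["Actual"]
        .agg(["count", "sum"])
        .rename(columns={"count": "Number of Cases", "sum": "Number of Responses"})
        .reset_index()
    )

# Hypothetical example: 20 customers sorted by score, 5 of them actual positives
prob = [i / 20 for i in range(20, 0, -1)]
actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
t = decile_table(prob, actual)
print(t)
```

A model that ranks well concentrates the responses in the first few deciles, as in this toy example.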
Step 8: Calculate and Interpret Gain
Gain is the ratio of the cumulative number of responses (actual positives) up to a given decile to the total number of positive observations in the data.
```python
import numpy as np

# Calculate the cumulative number of responses
lg['Cumulative Responses'] = lg['Number of Responses'].cumsum()

# Calculate the percentage of positives in each decile compared to the total number of responses
lg['% of Events'] = np.round(((lg['Number of Responses'] / lg['Number of Responses'].sum()) * 100), 2)

# Calculate the gain at each decile as the cumulative percentage of events
lg['Gain'] = lg['% of Events'].cumsum()
lg
```
Gain at a given decile is the cumulative percentage of actual churned customers captured up to and including that decile. In decile 2, the gain is 50.44, which means that 50.44% of the actual churned customers are covered in the top 20% of the data when it is ranked by the above logistic regression model.
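The gain calculation can be traced by hand with a minimal sketch in plain Python. `cumulative_gain` is a hypothetical helper and the response counts are illustrative, not the Telco results. Unlike the pandas code above, it rounds after accumulating rather than before, which can differ in the second decimal place.

```python
def cumulative_gain(responses_per_decile):
    """Return the cumulative gain (%) per decile, top decile first."""
    total = sum(responses_per_decile)
    gains, running = [], 0
    for r in responses_per_decile:
        running += r  # responses captured up to and including this decile
        gains.append(round(100 * running / total, 2))
    return gains

# Hypothetical counts of actual churners in deciles 1..10 (100 churners in total)
responses = [30, 20, 15, 10, 8, 6, 5, 3, 2, 1]
gains = cumulative_gain(responses)
print(gains)
```

With these made-up counts, the top two deciles capture 50% of all churners, and the gain always reaches 100% at the last decile.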
Step 9: Calculate and Interpret Lift
Lift measures how much better we can expect to do with the predictive model than without it. It is calculated as the ratio of the gain percentage to the percentage of the population selected at a given decile, which is the gain expected from random selection.
```python
# Convert the Decile column to int
lg['Decile'] = lg['Decile'].astype('int')

# Calculate lift: gain (%) divided by the % of the population selected (decile * 10)
lg['lift'] = np.round((lg['Gain'] / (lg['Decile'] * 10)), 2)
lg
```
In decile 2, the lift is 2.52. It means that by selecting just the top 20% of the data ranked by the above logistic regression model, we find 2.52 times as many actual churned customers as we would by selecting 20% of the data at random.
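The lift figure follows directly from the gain. As a minimal sketch (`lift_from_gain` is a hypothetical helper written for this illustration), dividing the decile 2 gain of 50.44% quoted above by the 20% of the population selected gives the lift:

```python
def lift_from_gain(gain_pct, decile, n_deciles=10):
    """Lift = cumulative gain (%) divided by the % of the population selected so far."""
    population_pct = decile * 100 / n_deciles
    return round(gain_pct / population_pct, 2)

# Gain of 50.44% at decile 2 is the figure quoted in this article
lift_decile_2 = lift_from_gain(50.44, 2)
print(lift_decile_2)   # 2.52

# At the last decile everyone is selected, so the lift is always 1
lift_decile_10 = lift_from_gain(100.0, 10)
print(lift_decile_10)  # 1.0
```

A lift above 1 means the model beats random selection at that depth. Lift typically shrinks toward 1 as more of the population is included.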
Gain and Lift Chart
Let us see what the gain and lift charts look like. Both compare the model with a baseline (random) model. The greater the area or distance between the model line and the baseline line, the better the model.
1. Gain Chart
The orange line denotes the gain from a random model and the blue line represents the gain from the ML model built above. We can see that the gain from the machine learning model is far better than that of the random model.
2. Lift Chart
The orange line denotes the lift from a random model and the blue line represents the lift from the current machine learning model. The lift chart measures how much better we can expect to do with the predictive model compared to the random model. We can see that the lift from the machine learning model is far better than that of the random model.
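Charts like the two described above could be drawn with matplotlib. This is a sketch under stated assumptions, not the code that produced the article's charts: `plot_gain_and_lift` is a hypothetical helper, it expects a table with integer 'Decile' (1 = top), 'Gain' and 'lift' columns as built in Steps 8 and 9, and the example table uses made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_gain_and_lift(lg):
    """Hypothetical helper: draw gain and lift charts from a decile table."""
    pct_population = list(lg["Decile"] * 10)
    fig, (ax_gain, ax_lift) = plt.subplots(1, 2, figsize=(12, 4))

    # Gain chart: the model curve starts at (0, 0); random selection is the diagonal
    ax_gain.plot([0] + pct_population, [0] + list(lg["Gain"]),
                 marker="o", color="tab:blue", label="Model")
    ax_gain.plot([0, 100], [0, 100], linestyle="--", color="tab:orange", label="Random")
    ax_gain.set_xlabel("% of population selected")
    ax_gain.set_ylabel("Gain (%)")
    ax_gain.set_title("Gain Chart")
    ax_gain.legend()

    # Lift chart: random selection has a lift of 1 at every depth
    ax_lift.plot(pct_population, list(lg["lift"]),
                 marker="o", color="tab:blue", label="Model")
    ax_lift.plot([pct_population[0], pct_population[-1]], [1, 1],
                 linestyle="--", color="tab:orange", label="Random")
    ax_lift.set_xlabel("% of population selected")
    ax_lift.set_ylabel("Lift")
    ax_lift.set_title("Lift Chart")
    ax_lift.legend()
    return fig

# Hypothetical decile table (illustrative values, not the Telco results)
example = pd.DataFrame({"Decile": list(range(1, 11)),
                        "Gain": [30, 50, 65, 75, 83, 89, 94, 97, 99, 100]})
example["lift"] = (example["Gain"] / (example["Decile"] * 10)).round(2)

fig = plot_gain_and_lift(example)
# Call plt.show() when running as a script; notebooks display the figure automatically
```

With the `lg` table from Step 9, the call would be `plot_gain_and_lift(lg)`.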
Applications
- It is mostly used in target marketing analysis.
- It is also used in domains like risk modelling, supply chain analytics, etc.
Conclusion
In this article, we have seen the idea behind lift and gain analysis along with its implementation in Python. Next time, try evaluating your model with the method learnt today and check whether it is actually solving your business problem.
Happy Learning!