Model Evaluation using Lift and Gain Analysis | Lift and Gain Charts

As a machine learning engineer, you want the model you build to actually solve the business problem, which makes model evaluation a crucial step: it is how you check the impact of your model on that problem. Different evaluation techniques are available for different types of ML problems.

Standard evaluation metrics for classification problems are Accuracy, Recall, Precision, and F1 Score. But these metrics evaluate the model as a whole and are sometimes not enough to judge its performance in the context of a particular business.

Lift & Gain Analysis

  • Lift and Gain analysis evaluates model predictions in a business context and checks whether the model is actually making a difference to the business.
  • It is especially useful in classification problems with imbalanced data.
  • It helps evaluate model performance on a portion of the population, for example the top 20% of customers ranked by predicted probability.
  • It quantifies how much better the predictions from the machine learning model are than selecting cases without a model.

Let’s understand lift and gain analysis using Python on the Telco customer churn dataset from Kaggle.

Step 1: Load the Data

The data can be downloaded from the Telco customer churn dataset page on Kaggle. It has 21 columns and 7043 records.


import pandas as pd

data=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

data.shape

The code below lists the column names and their data types.

data.info()

Many columns have the object data type; these need to be converted to a numerical data type.
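Before converting anything, it can help to see which columns are non-numeric and whether any of them hide blank strings, which `isnull()` would not report. The following is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the Telco file):

```python
import pandas as pd

# Toy frame standing in for the Telco data; values are invented for illustration.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "Partner": ["Yes", "No", "No"],
    "TotalCharges": ["29.85", "1889.5", " "],  # numbers stored as text, one blank
})

# Columns that are not stored as numbers (typically text).
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count entries that are empty once whitespace is stripped.
blank_counts = {
    col: int((toy[col].str.strip() == "").sum()) for col in non_numeric_cols
}

print(non_numeric_cols)
print(blank_counts)
```

A text column that should be numeric and has a non-zero blank count is a candidate for the missing-value handling in the next step.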

Step 2: Data Preprocessing

The data needs to be processed before applying any machine learning model. The code below performs the following tasks:

  • handles missing values.
  • encodes the categorical features.
  • removes non-critical features.

yes_no_col=['Partner','Dependents','PhoneService','OnlineSecurity',
'OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
'StreamingMovies','PaperlessBilling','Churn']

#convert all columns with values yes and no to 1 and 0 respectively (any other value also becomes 0).
for i in yes_no_col:
    data[i]=data[i].apply(lambda x: 1 if x == 'Yes' else 0)

#one hot encoding
col_one_hot_encode=['PaymentMethod','Contract','MultipleLines','InternetService']
data=pd.get_dummies(data,columns=col_one_hot_encode,drop_first=True)

#convert gender column with values male and female to 1 and 0 respectively.
data['gender']=data['gender'].apply(lambda x: 1 if x == 'Male' else 0)

y=data['Churn']

#drop the column
columns_to_drop=['customerID','Churn']
data.drop(columns_to_drop,axis=1,inplace=True)

#fill missing values: blank strings (found in TotalCharges) are replaced with '0'
data=data.replace(' ','0')

#convert datatype to float
data['TotalCharges']=data['TotalCharges'].astype('float')
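As an alternative to the replace-then-cast approach above, the conversion can be done in one pass with `pd.to_numeric`. This is a sketch on a toy column, assuming the only problem values are blank strings; filling with 0 simply mirrors the choice made above and is not the only reasonable option.

```python
import pandas as pd

# Toy column with one blank entry, standing in for data['TotalCharges'].
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" turns anything that cannot be parsed (here the blank) into NaN.
total = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(total.isna().sum())

# Fill the gaps with 0 to mirror the article's choice.
toy["TotalCharges"] = total.fillna(0.0)

print(n_missing)
print(toy["TotalCharges"].tolist())
```

One advantage of this form is that the number of coerced values is visible before they are filled.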

Step 3: Model Training

Let us use Logistic Regression to understand Lift and Gain analysis in detail. Logistic Regression is a binary classification algorithm that returns the probability that a data point belongs to a particular class. Our dataset has two classes, class 0 and class 1, and their probabilities sum to one: P(0) + P(1) = 1.

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

X_train,X_test,y_train,y_test=train_test_split(data,y,test_size=0.3,stratify=y,
random_state=101)

model=LogisticRegression()

model.fit(X_train,y_train)
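To make the statement P(0) + P(1) = 1 concrete, here is a small NumPy sketch of how a binary logistic regression turns a linear score into the two class probabilities. The coefficients, intercept and feature values are hypothetical and are not those of the model fitted above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept, NOT taken from the fitted model.
weights = np.array([0.8, -1.2])
intercept = -0.5

# Two made-up customers with two features each.
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])

p1 = sigmoid(X @ weights + intercept)  # P(class 1)
p0 = 1.0 - p1                          # P(class 0)

# A hard 0/1 label corresponds to thresholding P(class 1) at 0.5.
hard_labels = (p1 > 0.5).astype(int)

print(p0 + p1)
print(hard_labels)
```

Lift and gain analysis works with the probabilities themselves rather than the thresholded labels, which is why the later steps rank customers by P(class 1).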

Step 4: Scoring

The code below prints a classification report that includes precision, recall, and F1 score for both classes, 0 and 1.


from sklearn.metrics import classification_report

predictions=model.predict(X_test)

print(classification_report(y_test,predictions))

In the above classification report, class 1 represents the customers who churned and class 0 represents those who did not. As we can see, the model’s ability to predict class 1 is quite low. Should we still use this model?

We can evaluate the model benefits in context with the business problem using the lift and gain analysis.
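To see how the class 1 numbers in a classification report come about, and why accuracy alone can look fine on imbalanced data, the sketch below computes them by hand for ten made-up labels. The arrays are illustrative only and unrelated to the Telco results.

```python
import numpy as np

# Ten made-up customers: 3 actual churners (1) and 7 non-churners (0).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # churners correctly flagged
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # non-churners wrongly flagged
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # churners that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = float((y_true == y_pred).mean())

print(precision, recall, f1, accuracy)
```

Here accuracy is 0.7 even though only one of the three churners is found, which is the kind of gap that motivates looking at lift and gain.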

Step 5: Start Lift and Gain Analysis

Get the predicted probability of the class of interest (here class 1, i.e. churned) and sort the data by it in descending order. The function predict_proba() returns the probability that a data point belongs to each class, so we take the column for class 1, whereas predict() directly returns class 0 or class 1 as the output.

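#work on a copy so the original test features are not modified
X_test=X_test.copy()
#find probability of data belonging to class 1
X_test['Prob']= model.predict_proba(X_test)[:,1]
#sort the data in descending order by keeping ascending=False
X_test = X_test.sort_values(by = 'Prob', ascending = False)
#attach the true labels; pandas aligns them on the index
X_test['Churn'] = y_test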

X_test['Prob']

The above output shows the predicted churn probabilities, sorted from highest to lowest.
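One detail worth checking in this step is that assigning the true labels after sorting still pairs each customer with the right label, because pandas aligns on the index rather than on row position. A minimal sketch with made-up index values and probabilities:

```python
import pandas as pd

# Made-up probabilities and labels sharing the same customer index.
scores = pd.DataFrame({"Prob": [0.2, 0.9, 0.5]}, index=[101, 102, 103])
labels = pd.Series([0, 1, 0], index=[101, 102, 103], name="Churn")

# Sort by probability first, as in the step above.
ranked = scores.sort_values(by="Prob", ascending=False)

# The assignment matches rows by index label, not by position.
ranked["Churn"] = labels

print(ranked)
```

Customer 102, the only churner, ends up in the first row with its own label even though the row order changed.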

Step 6: Divide the Data into Deciles.

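The probabilities stored in the column “Prob” can be divided into n equal-sized groups. With n=10, these groups are called deciles. The labels are assigned in reverse, so that decile 1 contains the customers with the highest predicted probability.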


X_test['Decile']=pd.qcut(X_test['Prob'],10,labels=[i for i in range (10,0,-1)])
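The effect of the reversed labels is easiest to see on a small example. The sketch below applies the same `pd.qcut` call to 20 evenly spaced made-up probabilities and checks which decile the highest and lowest values land in.

```python
import numpy as np
import pandas as pd

# 20 evenly spaced made-up probabilities: 0.05, 0.10, ..., 1.00
probs = pd.Series(np.linspace(0.05, 1.0, 20))

# Same call as above: 10 quantile bins, labelled 10 (lowest) down to 1 (highest).
deciles = pd.qcut(probs, 10, labels=list(range(10, 0, -1)))

counts = deciles.value_counts()
top = int(deciles[probs.idxmax()])     # decile of the highest probability
bottom = int(deciles[probs.idxmin()])  # decile of the lowest probability

print(top, bottom)
print(counts.tolist())
```

Because `qcut` bins by quantiles, each decile receives the same number of rows here; with heavily tied probabilities the bin edges can coincide and `qcut` may raise an error unless duplicates are handled.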

Step 7: Calculate the Number of Cases and Number of Responses in Each Decile.

Lift and Gain analysis is useful because of a pattern that is common in business: a small share of customers, say 20%, often accounts for most of the outcome, say 80% of the revenue. Counting cases and responses per decile is the core of the decile analysis behind the gain and lift charts. The code below counts, for each decile, the total number of customers (cases) and the number of customers who actually belong to class 1 (responses).


res=pd.crosstab(X_test['Decile'],X_test['Churn'])[1].reset_index().rename(columns={1:'Number of Responses'})

#count rows per decile; rename_axis/reset_index(name=...) gives the same column names across pandas versions
lg=X_test['Decile'].value_counts(sort=False).rename_axis('Decile').reset_index(name='Number of Cases')

lg=pd.merge(lg,res,on='Decile').sort_values(by='Decile',ascending=False).reset_index(drop=True)

lg

We obtain the Number of Cases (the number of records in each decile) and the Number of Responses (the number of actual positives in each decile).
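The same two counts can also be produced in a single `groupby` call. This is an alternative sketch rather than the approach used above; the toy frame uses only three groups for brevity, and the column names simply mirror the ones in this step.

```python
import pandas as pd

# Toy data: three groups of three customers each, with made-up churn labels.
toy = pd.DataFrame({
    "Decile": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Churn":  [1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# count -> rows in the group (cases); sum of 0/1 labels -> actual positives (responses)
summary = (
    toy.groupby("Decile")["Churn"]
       .agg(**{"Number of Cases": "count", "Number of Responses": "sum"})
       .reset_index()
)

print(summary)
```

Summing a 0/1 label column is what makes the second aggregate equal to the number of churners in each group.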

Step 8: Calculate and Interpret Gain

Gain at a given decile is the cumulative “Number of Responses” (actual positives) up to and including that decile, divided by the total number of positive observations in the data and expressed as a percentage.

import numpy as np

#Calculate the cumulative
lg['Cumulative Responses'] = lg['Number of Responses'].cumsum()

#Calculate the percentage of positives in each decile compared to the total number of responses
lg['% of Events'] = np.round(((lg['Number of Responses']/lg['Number of Responses'].sum())*100),2)

#Calculate the Gain in each decile
lg['Gain'] = lg['% of Events'].cumsum()
lg

Gain is the cumulative percentage of actual churned customers captured up to a given decile. In decile 2, the gain is 50.44, which means 50.44% of the actual churned customers are already covered in the top 20% of the data when it is ranked by the above logistic regression model.
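Written as a formula, with r_i denoting the Number of Responses in decile i (decile 1 holding the highest predicted probabilities), the gain at decile k can be expressed as follows. The symbols r_i and k are notation introduced here for illustration.

```latex
\[
\text{Gain}_k \;=\; 100 \times \frac{\sum_{i=1}^{k} r_i}{\sum_{i=1}^{10} r_i},
\qquad k = 1, \dots, 10 .
\]
```

By construction the gain reaches 100 at decile 10. The code above obtains the same quantity by cumulatively summing the rounded per-decile percentages, so its values can differ from this formula in the last decimal place.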

Step 9: Calculate and Interpret Lift

Lift measures how much better we can expect to do with the predictive model than without it. It is calculated as the ratio of the gain percentage to the random percentage at a given decile, where the random percentage is the share of the data selected so far (10% at decile 1, 20% at decile 2, and so on).

#convert Decile column to int
lg['Decile']=lg['Decile'].astype('int')

#calculate lift
lg['lift']=np.round((lg['Gain']/(lg['Decile']*10)),2)

lg

In decile 2, the lift is 2.52 (50.44 / 20). It means that by selecting just the top 20% of the data as ranked by the above logistic regression model, we find 2.52 times as many actual churned customers as we would by selecting 20% of the data at random.
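The calculation above uses Decile*10 as the random percentage, which assumes every decile holds the same number of cases. A slightly more general sketch divides the gain by the cumulative percentage of cases instead. The table below is made up (four equal bins rather than ten deciles) purely to keep the arithmetic easy to follow.

```python
import pandas as pd

# Made-up summary table with four bins, ordered from highest to lowest probability.
lg_toy = pd.DataFrame({
    "Bin": [1, 2, 3, 4],
    "Number of Cases": [50, 50, 50, 50],
    "Number of Responses": [30, 10, 6, 4],
})

# Share of the population selected so far, in percent (the "random" baseline).
cum_cases_pct = (
    lg_toy["Number of Cases"].cumsum() / lg_toy["Number of Cases"].sum() * 100
)

# Share of all positives captured so far, in percent (the gain).
gain = (
    lg_toy["Number of Responses"].cumsum()
    / lg_toy["Number of Responses"].sum() * 100
)

lg_toy["Gain"] = gain.round(2)
lg_toy["lift"] = (gain / cum_cases_pct).round(2)

print(lg_toy)
```

With equal-sized bins this gives the same result as dividing by the bin number times the bin width, and the lift in the last bin is always 1 because the whole population has been selected.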

Gain and Lift Chart

Let us see what the gain and lift charts look like. Both compare the model with a baseline (random) model. The greater the area or distance between the model line and the baseline line, the better the model.

1. Gain Chart

The orange line denotes the gain from a random model and the blue line represents the gain from the ML model we built. We can see that the gain from the machine learning model is far better than that of the random model.
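A gain chart like the one described can be drawn with matplotlib. The sketch below is one possible way to do it, assuming equal-sized groups; the function names and the sample gain values are hypothetical. The plotting import is kept inside the plotting function so the helper that prepares the points can be used on its own.

```python
import numpy as np

def gain_curve_points(gain_pct):
    """Return (x, y_model, y_random) for a gain chart, starting at the origin."""
    gain_pct = np.asarray(gain_pct, dtype=float)
    n_groups = len(gain_pct)
    x = np.linspace(0, 100, n_groups + 1)         # % of customers selected
    y_model = np.concatenate([[0.0], gain_pct])   # % of churners captured by the model
    y_random = x.copy()                           # random selection captures k% in k% of data
    return x, y_model, y_random

def plot_gain(gain_pct):
    """Draw the model and random gain curves; needs matplotlib installed."""
    import matplotlib.pyplot as plt
    x, y_model, y_random = gain_curve_points(gain_pct)
    fig, ax = plt.subplots()
    ax.plot(x, y_model, marker="o", label="Model")        # first line: default blue
    ax.plot(x, y_random, linestyle="--", label="Random")  # second line: default orange
    ax.set_xlabel("% of customers (ranked by predicted probability)")
    ax.set_ylabel("% of churned customers captured (gain)")
    ax.set_title("Gain Chart")
    ax.legend()
    return ax

# Illustrative gain values for four equal groups (not the Telco results).
x, y_model, y_random = gain_curve_points([40, 70, 90, 100])
```

With the table built earlier, calling plot_gain(lg['Gain']) would be the corresponding usage.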

2. Lift Chart

The orange line denotes the lift from a random model, which is always 1, and the blue line represents the lift from the current machine learning model. The lift chart shows how much better we can expect to do with the predictive model compared to the random model. We can see that the lift from the machine learning model is far better than that of the random model, especially in the top deciles.
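A lift chart can be sketched in the same way. The baseline is a flat line at 1, since random selection gives no improvement. As before, the function names and sample values are hypothetical, and matplotlib is imported only inside the plotting function.

```python
import numpy as np

def lift_curve_points(lift):
    """Return (groups, y_model, y_random) for a lift chart."""
    y_model = np.asarray(lift, dtype=float)
    groups = np.arange(1, len(y_model) + 1)  # 1 = highest predicted probability
    y_random = np.ones_like(y_model)         # random selection has lift 1 everywhere
    return groups, y_model, y_random

def plot_lift(lift):
    """Draw the model and random lift curves; needs matplotlib installed."""
    import matplotlib.pyplot as plt
    groups, y_model, y_random = lift_curve_points(lift)
    fig, ax = plt.subplots()
    ax.plot(groups, y_model, marker="o", label="Model")        # first line: default blue
    ax.plot(groups, y_random, linestyle="--", label="Random")  # second line: default orange
    ax.set_xlabel("Decile (1 = highest predicted probability)")
    ax.set_ylabel("Lift")
    ax.set_title("Lift Chart")
    ax.legend()
    return ax

# Illustrative lift values for four equal groups (not the Telco results).
groups, y_model, y_random = lift_curve_points([2.4, 1.6, 1.23, 1.0])
```

With the table built earlier, calling plot_lift(lg['lift']) would be the corresponding usage.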

Applications

  1. It is mostly used in target marketing analysis, for example deciding which customers to contact in a campaign.
  2. It is also used in domains like risk modelling, supply chain analytics, etc.

Conclusion

In this article, we have seen the idea behind lift and gain analysis along with its implementation in Python. Next time, try evaluating your model with this method and check whether it is actually solving your business problem.

Feel free to share your feedback or ask any questions in the comment box below.

Happy Learning!