A Comprehensive Guide to TPOT – AutoML Implementation on Classification and Regression using Python

Varsha Saini
November 21, 2022

We all know the power of machine learning: it solves many business problems and is applied across many domains. But not everyone is a data scientist who is familiar with concepts such as ML modelling, model selection, and hyperparameter tuning.

AutoML makes the benefits of machine learning available to people with little or no ML background. It automates the task of applying machine learning models to real-world problems, covering every step from raw data to a trained, production-ready model.


What is AutoML?

AutoML, short for automated machine learning, is the process of automating the task of building a machine learning model. It enables anyone with minimal knowledge to train high-quality machine learning models in their area of interest.

How Does AutoML Work?

The goal of AutoML is to find the best machine-learning model, with optimized hyperparameters, for your specific problem, which can be classification, regression, NLP, forecasting, computer vision, etc.

It builds many candidate pipelines, often in parallel, by iterating over different combinations of machine-learning models and their hyperparameters and generating a score for each. The pipeline with the highest score is considered the best fit for your problem statement.
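
The search described above can be sketched in a few lines of scikit-learn. This is only a minimal illustration, not how any particular AutoML package is implemented: the candidate models, the hyperparameter values, and the iris dataset are assumptions chosen to keep the example small, and the candidates are scored one after another rather than in parallel.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical search space: each entry is one (model, hyperparameters) candidate.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn_k3": KNeighborsClassifier(n_neighbors=3),
    "knn_k7": KNeighborsClassifier(n_neighbors=7),
    "tree_depth2": DecisionTreeClassifier(max_depth=2, random_state=0),
    "tree_depth4": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Score every candidate with 5-fold cross-validation (mean accuracy).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# The candidate with the highest mean score is kept as the "best" model.
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 4))
```

An AutoML tool does the same kind of bookkeeping over a far larger space and usually with a smarter search strategy than plain enumeration.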

AutoML Packages

A few of the popular AutoML packages are listed below:

AutoGluon
H2O AutoML
MLBoX
TPOT
TransformogrifAI

In this article, we will explore the TPOT AutoML package in detail and see how you can use it in your project.

TPOT

TPOT stands for Tree-based Pipeline Optimization Tool. It is a Python automated machine-learning library that automates the creation of machine-learning pipelines, using genetic programming to search the space of possible pipelines. It is built on top of Python's scikit-learn library.

TPOT Installation

The command below installs the tpot library.

 pip install tpot

TPOT Architecture

The picture below explains the TPOT architecture in detail. It is taken from http://automl.info/tpot/

TPOT Implementation

Since TPOT is built on top of the scikit-learn library, its code looks very similar. The process of building a TPOT pipeline and most of the methods that can be called on it are the same as in scikit-learn.

TPOT can be applied to supervised machine learning problems. We will learn how to create a TPOT pipeline for both classification and regression, step by step.

TPOT Parameters

Let us understand some of the important TPOTClassifier parameters:

generations: number of iterations the pipeline optimization process runs
population_size: number of individuals to retain in every generation
offspring_size: number of offspring to produce in each generation (equal to population_size by default)
mutation_rate: mutation rate of the genetic programming algorithm, a value in [0, 1]
crossover_rate: crossover rate of the genetic programming algorithm, a value in [0, 1]
Constraint: mutation_rate + crossover_rate <= 1
scoring: metric used to evaluate the pipelines
cv: cross-validation strategy (for example, the number of folds)
n_jobs: number of jobs that can be run in parallel
max_time_mins: maximum number of minutes TPOT is allowed to spend optimizing the pipeline
max_eval_time_mins: maximum number of minutes TPOT is allowed to spend evaluating a single pipeline
verbosity: how much information TPOT displays while running the pipelines
{0: doesn’t show anything, 1: minimal information, 2: more information along with progress bar, 3: everything}
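
To make the roles of generations, population_size, offspring_size, mutation_rate and crossover_rate more concrete, here is a toy evolutionary search in plain Python. It is a deliberately simplified sketch and not TPOT's implementation: TPOT evolves trees of pipeline operators and its selection scheme is more involved, whereas this example evolves bit strings, uses the number of 1-bits as a stand-in fitness, and keeps the fittest individuals from parents plus offspring. The function name evolve and the genome representation are assumptions made for illustration.

```python
import random


def evolve(fitness, genome_length, generations=5, population_size=20,
           offspring_size=None, mutation_rate=0.9, crossover_rate=0.1, seed=42):
    """Toy evolutionary search over bit strings (illustration only)."""
    assert mutation_rate + crossover_rate <= 1.0
    rng = random.Random(seed)
    offspring_size = offspring_size or population_size

    # Generation 0: a random population of candidate genomes.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = [max(map(fitness, population))]

    for _ in range(generations):
        offspring = []
        for _ in range(offspring_size):
            roll = rng.random()
            if roll < crossover_rate:
                # Crossover: splice two parents at a random cut point.
                a, b = rng.sample(population, 2)
                cut = rng.randrange(1, genome_length)
                child = a[:cut] + b[cut:]
            elif roll < crossover_rate + mutation_rate:
                # Mutation: flip one random bit of a single parent.
                child = list(rng.choice(population))
                i = rng.randrange(genome_length)
                child[i] = 1 - child[i]
            else:
                # Otherwise copy a parent unchanged.
                child = list(rng.choice(population))
            offspring.append(child)

        # Keep the fittest population_size individuals from parents + offspring.
        combined = sorted(population + offspring, key=fitness, reverse=True)
        population = combined[:population_size]
        history.append(fitness(population[0]))

    return population[0], history


# Stand-in fitness: count of 1-bits (a real run would use a cross-validation score).
best, history = evolve(fitness=sum, genome_length=16)
print(history)  # best fitness after generation 0, 1, ..., 5
```

Because the best individuals are always carried over, the best fitness in this sketch can never decrease from one generation to the next.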

TPOT Functions

Below are the methods that can be called on a TPOT object, along with their parameters:

fit(X_train, y_train): runs the TPOT optimization process on the training data, where X holds the independent variables and y the dependent variable.
predict(X_test): predicts the output for the test data.
score(X_test, y_test): compares the predicted output with the actual output and returns the score. The scoring function can be customized.
export(output_file_name): exports the optimized pipeline as Python code.
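
Because these methods follow the scikit-learn estimator interface, the first three calls can be tried on an ordinary scikit-learn model before committing to a longer TPOT run. The sketch below uses KNeighborsClassifier purely as a stand-in (an assumption for illustration); it is not a TPOT object and has no export() method.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = KNeighborsClassifier(n_neighbors=3)  # stand-in estimator, not TPOT
model.fit(X_train, y_train)                  # learn from the training split
predictions = model.predict(X_test)          # one predicted label per test row
accuracy = model.score(X_test, y_test)       # mean accuracy for classifiers

print(predictions[:10], accuracy)
```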

1. TPOT Classifier

Below is the TPOTClassifier constructor. All the parameters are shown with their default values, which can be modified as required.

TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1,
                          scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None,
                          warm_start=False,
                          memory=None,
                          use_dask=False,
                          periodic_checkpoint_folder=None,
                          early_stop=None,
                          verbosity=0,
                          disable_update_check=False,
                          log_file=None)

TPOTClassifier Implementation

Let us see the implementation of TPOTClassifier using Python in a five-step process.

Step 1: Load Dataset

We will load data directly from sklearn datasets. The load_digits dataset has numerical digits as classes, therefore it is a multiclass classification problem.

from sklearn.datasets import load_digits
digits = load_digits() 
digits.data.shape

(1797, 64)

Step 2: Split the data into train and test

The data is divided into the train (75%) and the test (25%). The shape of the data is printed after the split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1347, 64), (450, 64), (1347,), (450,))
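
The split above is random, so the rows that land in each set change from run to run. If you want a repeatable split, a variant like the following may help. The random_state and stratify arguments are additions that are not part of the original code: random_state fixes the shuffle, and stratify keeps the proportion of each digit similar in both sets.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# random_state fixes the shuffle; stratify keeps class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.25, random_state=42, stratify=digits.target,
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train))  # training examples per digit
print(np.bincount(y_test))   # test examples per digit
```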

Step 3: Fit training data on TPOT

The training data is fitted with TPOTClassifier using the fit() function. The optimizer runs with the parameters passed to the constructor, which were explained above.

from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
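
Before running the call above, it can help to estimate how much work it implies. The TPOT documentation gives the total number of pipelines evaluated as population_size + generations × offspring_size, with offspring_size equal to population_size by default. The helper below (a hypothetical name, not part of TPOT) applies that formula to the settings used here.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Total pipelines: population_size + generations * offspring_size."""
    if offspring_size is None:
        offspring_size = population_size  # default when not set
    return population_size + generations * offspring_size


total = pipelines_evaluated(generations=5, population_size=20)
cv_folds = 5

print(total)             # 120 pipelines
print(total * cv_folds)  # up to 600 model fits with 5-fold cross-validation
```

The second figure is an upper bound for rough planning only, since a pipeline that times out or fails is not fitted on every fold.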

Step 4: Predict on testing data

The output for the test data is predicted using the predict() function.

pipeline_optimizer.predict(X_test)

Step 5: Calculate Score

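The score is calculated using the score() function, which applies the scoring metric chosen when the object was created (accuracy by default; it can be customized). The per-generation lines in the output below are printed while fit() runs, because verbosity=2: since generations is 5, TPOT runs 5 generations and reports the best internal cross-validation score after each one, followed by the best pipeline found for the given data. The final number is the value returned by score() on the test data.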

print(pipeline_optimizer.score(X_test, y_test))

Generation 1 – Current best internal CV score: 0.9806939281288723
Generation 2 – Current best internal CV score: 0.9806939281288723
Generation 3 – Current best internal CV score: 0.9806966818119236
Generation 4 – Current best internal CV score: 0.981434668869613
Generation 5 – Current best internal CV score: 0.9829189040341457

Best pipeline: KNeighborsClassifier(GaussianNB(input_matrix), n_neighbors=3, p=2, weights=uniform)
0.9888888888888889

Can We Use a Custom-Made Score Function?

We have already seen how to use the score() function to evaluate the fitted model. A custom function can also be used for the evaluation.

In the code below, we create a function my_custom_accuracy() and pass it to scikit-learn's make_scorer() function, which turns it into a scorer that can be used for scoring. The parameter greater_is_better=True means that higher values are considered better.

If you want to create a scoring function of your own, use the below code to create a TPOTClassifier object. All other steps will remain the same as we have already learnt in the five-step process above.

 
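from sklearn.metrics import make_scorer
from tpot import TPOTClassifier
def my_custom_accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return float(sum(y_pred == y_true)) / len(y_true)
# Wrap the function so TPOT can use it as the scoring metric
my_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, scoring=my_custom_scorer)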
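
A custom metric is worth checking on its own before handing it to a long optimization run. The sketch below calls the function on a tiny hand-made example and then uses the wrapped scorer with scikit-learn's cross_val_score. The sample labels and the KNeighborsClassifier stand-in are assumptions for illustration, and the function is written with np.sum so that it expects NumPy arrays.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def my_custom_accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels (expects NumPy arrays).
    return float(np.sum(y_pred == y_true)) / len(y_true)


# Direct call on a tiny hand-made example: 3 of 4 labels match.
y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 2, 1])
print(my_custom_accuracy(y_true, y_pred))  # 0.75
print(accuracy_score(y_true, y_pred))      # 0.75, scikit-learn's built-in accuracy

# The wrapped scorer plugs into scikit-learn routines that accept `scoring`.
my_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)
X, y = load_digits(return_X_y=True)
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X, y, cv=5, scoring=my_custom_scorer
)
print(cv_scores.mean())
```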

TPOT Configurations

TPOT ships with a few built-in configurations of operators and parameters that work well for optimizing machine learning pipelines. Below is a list of the built-in TPOT configurations:

Default TPOT
TPOT light
TPOT MDR
TPOT sparse
TPOT NN
TPOT cuML

Let us see how we can use these configurations in the TPOT pipeline.

 
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,config_dict='TPOT light')

The configurations above are built in, but we can also define our own. The code below creates a TPOT object using a custom configuration.

 
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,config_dict=tpot_config)
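
To get a feel for how large such a custom configuration is, you can count the hyperparameter combinations it defines per operator. The helper count_settings below is hypothetical (it is not part of TPOT), and the count covers single operators only; pipelines that chain several operators make the real search space larger.

```python
from math import prod

tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}


def count_settings(param_grid):
    """Number of distinct hyperparameter combinations for one operator."""
    # An empty grid means the operator is used with its defaults: one setting.
    return prod(len(values) for values in param_grid.values())


for operator, grid in tpot_config.items():
    print(operator, count_settings(grid))

total_settings = sum(count_settings(grid) for grid in tpot_config.values())
print(total_settings)  # 1 + 12 + 12 = 25
```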

2. TPOT Regressor

Below is the TPOTRegressor constructor. All the parameters are shown with their default values, which can be modified as required.

TPOTRegressor(generations=100, population_size=100,
                         offspring_size=None, mutation_rate=0.9,
                         crossover_rate=0.1,
                         scoring='neg_mean_squared_error', cv=5,
                         subsample=1.0, n_jobs=1,
                         max_time_mins=None, max_eval_time_mins=5,
                         random_state=None, config_dict=None,
                         template=None,
                         warm_start=False,
                         memory=None,
                         use_dask=False,
                         periodic_checkpoint_folder=None,
                         early_stop=None,
                         verbosity=0,
                         disable_update_check=False)

TPOTRegressor Implementation

Let us see the implementation of TPOTRegressor using Python in a five-step process. It is exactly the same as what we have seen in the TPOTClassifier pipeline.

Step 1: Load Dataset

We will work on the Boston house price prediction dataset which is loaded directly from a URL.

import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
X.shape,y.shape

((506, 13), (506,))
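
The two slicing lines are easier to follow on a toy array. In the raw file each record is spread over two rows: a row of 11 values followed by a shorter row of 3 values, which read_csv pads with NaN. The sketch below reproduces that layout with two made-up records (the numbers are placeholders, not real data) and applies the same slicing.

```python
import numpy as np

nan = np.nan
# Two toy records, each spread over two rows as in the raw file:
# an 11-value row followed by a 3-value row padded with NaN.
raw = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    [12, 13, 99, nan, nan, nan, nan, nan, nan, nan, nan],
    [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
    [32, 33, 77, nan, nan, nan, nan, nan, nan, nan, nan],
])

# Even rows hold features 1-11; the first two values of odd rows hold features 12-13.
X = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value of each odd row is the target.
y = raw[1::2, 2]

print(X.shape)  # (2, 13)
print(y)        # [99. 77.]
```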

Step 2: Split the data into train and test

Split the data into train (75%) and test (25%) sets using the train_test_split function of the sklearn library.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((379, 13), (127, 13), (379,), (127,))

Step 3: Fit training data on TPOT

Fit TPOTRegressor on the training data. The parameters can be modified according to your needs. The meaning of each parameter was explained earlier in this article.

from tpot import TPOTRegressor
reg = TPOTRegressor(verbosity=2, population_size=50, generations=10, random_state=35)
reg.fit(X_train, y_train)

Step 4: Predict on testing data

Predict on testing data.

reg.predict(X_test)

Step 5: Calculate Score

The score() function is used to evaluate the model on the test data. Its metric can be replaced with a custom-built function using the method we learnt earlier in this article. The per-generation lines in the output below are printed while fit() runs, because verbosity=2: as generations is 10, TPOT reports the best internal cross-validation score after each of the 10 generations. The best pipeline is shown at the end and can be used to build your machine learning model, and the final number is the value returned by score().

print(reg.score(X_test, y_test))

Generation 1 – Current best internal CV score: -11.179972234350528
Generation 2 – Current best internal CV score: -11.179972234350528
Generation 3 – Current best internal CV score: -11.179972234350528
Generation 4 – Current best internal CV score: -10.86819403913402
Generation 5 – Current best internal CV score: -9.774902464576105
Generation 6 – Current best internal CV score: -9.774902464576105
Generation 7 – Current best internal CV score: -9.774902464576105
Generation 8 – Current best internal CV score: -9.774902464576105
Generation 9 – Current best internal CV score: -9.774902464576105
Generation 10 – Current best internal CV score: -9.774902464576105

Best pipeline: XGBRegressor(ExtraTreesRegressor(input_matrix, bootstrap=False, max_features=0.55, min_samples_leaf=15, min_samples_split=5, n_estimators=100), learning_rate=0.1, max_depth=9, min_child_weight=3, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=1.0, verbosity=0)
-10.508523517471932
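
The scores in this output are negative because the default scoring for TPOTRegressor is neg_mean_squared_error, as shown in the constructor above: the mean squared error is negated so that a larger value is always better. A small conversion, sketched here with the test score printed above, turns it back into an error that is easier to read.

```python
import math

neg_mse = -10.508523517471932  # test-set score printed above

mse = -neg_mse         # flip the sign to recover the mean squared error
rmse = math.sqrt(mse)  # root mean squared error, in the units of the target

print(round(mse, 3), round(rmse, 3))
```

The RMSE comes out at roughly 3.24, in the same units as the target variable.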

The code used in this article is from the TPOT documentation.

Advantages of AutoML

It helps in finding the best machine-learning model for your problem with the least effort.
It saves a huge amount of time and effort.

Disadvantages of AutoML

It can return several different good solutions for the same dataset, which can be confusing.
It becomes very time-consuming as the size of the data increases.

Conclusion

After understanding TPOT, you may be thinking it can replace machine learning engineers or data scientists. Well, AutoML packages can support them but can’t replace them completely. AutoML can return an efficient model, which data scientists can then combine with their domain expertise to solve a business problem. You should definitely try using the AutoML process in your current ML project.

I hope the article was useful and made the TPOT concepts clear. Feel free to share your feedback and ask any questions in the comment section below.

Thank You and Happy Learning!
