We all know the power of machine learning: it solves many business problems and is applied across many domains. But not everyone is a data scientist, and not everyone is familiar with concepts such as ML modelling, model selection, and hyperparameter tuning.
AutoML makes the benefits of machine learning available to people with little or no machine learning knowledge. It automates the task of applying machine learning models to real-world problems, covering every step from raw data to a trained, production-ready model.
What is AutoML?
AutoML, short for automated machine learning, is the process of automating the task of building a machine learning model. It enables anyone with minimal knowledge to train high-quality machine learning models in their area of interest.
How Does AutoML Work?
The goal of AutoML is to find the best machine learning model, with optimized hyperparameters, for your specific problem, which can be classification, regression, NLP, forecasting, computer vision, etc.
It creates several pipelines in parallel with different machine-learning models and parameters by iterating over different combinations of models and their hyperparameters, generating scores for each. The model with the highest score is considered to be the best fit for your problem statement.
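The search described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the toy dataset, the two "model families" (threshold_model and constant_model), and the search_space dictionary are all made up for this example, and candidates are scored on the training data rather than with cross-validation as a real AutoML tool would do.

```python
from itertools import product

# Toy 1-D dataset (illustrative only): the label is 1 when the feature exceeds 5.
X = [1, 2, 3, 4, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def threshold_model(threshold):
    """Hypothetical model family: predict 1 when the feature is above `threshold`."""
    return lambda x: int(x > threshold)


def constant_model(label):
    """Hypothetical model family: always predict the same label."""
    return lambda x: label


# Hypothetical search space: each model family maps to a grid of hyperparameters.
search_space = {
    threshold_model: {"threshold": [2, 5, 8]},
    constant_model: {"label": [0, 1]},
}


def accuracy(model, X, y):
    """Fraction of samples the model labels correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


def search(search_space, X, y):
    """Score every (model, hyperparameter) combination and return the best one."""
    results = []
    for family, grid in search_space.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            params = dict(zip(names, values))
            score = accuracy(family(**params), X, y)
            results.append((score, family.__name__, params))
    best = max(results, key=lambda r: r[0])
    return best, results


best, results = search(search_space, X, y)
print(best)  # (1.0, 'threshold_model', {'threshold': 5})
```

A real AutoML package replaces the exhaustive loop with a smarter search strategy and the toy models with full preprocessing and modelling pipelines, but the idea of scoring candidates and keeping the best one is the same.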
AutoML Packages
A few of the popular AutoML packages are listed below:
- AutoGluon
- H2O AutoML
- MLBoX
- TPOT
- TransformogrifAI
In this article, we will explore the TPOT AutoML package in detail and see how you can use it in your project.
TPOT
TPOT stands for Tree-based Pipeline Optimization Tool. It is a Python automated machine learning module that builds machine learning pipelines automatically, using genetic programming to search an optimized space of candidate pipelines. It is built on top of Python's scikit-learn library.
TPOT Installation
The below command can be used to install tpot library.
pip install tpot
TPOT Architecture
The below picture explains the tpot architecture in detail. It is taken from http://automl.info/tpot/
TPOT Implementation
Since TPOT is built on top of the scikit-learn library, its code is very similar. The process of building a TPOT pipeline, and most of the functions that can be applied to it, are the same as in scikit-learn.
TPOT can be applied to supervised machine learning problems. We will learn how to create a TPOT pipeline for both classification and regression, step by step.
TPOT Parameters
Let us understand some of the important TPOTClassifier parameters:
- generations: number of iterations of the pipeline optimization process
- population_size: number of individuals to retain in every generation
- offspring_size: number of offspring to generate in each generation (defaults to population_size)
- mutation_rate: value in the range [0, 1]
- crossover_rate: value in the range [0, 1]; mutation_rate + crossover_rate must not exceed 1
- scoring: metric used to evaluate the pipelines
- cv: cross-validation strategy
- n_jobs: number of jobs that can be run in parallel
- max_time_mins: maximum number of minutes TPOT is allowed to spend optimizing the pipeline
- max_eval_time_mins: maximum number of minutes TPOT is allowed to spend evaluating a single pipeline
- verbosity: how much information TPOT displays while running the pipelines (0: nothing, 1: minimal information, 2: more information along with a progress bar, 3: everything)
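To see how generations, population_size, offspring_size, mutation_rate and crossover_rate fit together, here is a toy evolutionary loop in plain Python. It is a sketch only and not TPOT's actual implementation: the evolve() function, the bit-string individuals, and the helper functions are all hypothetical, and the "fitness" is simply the number of ones in a bit string instead of a cross-validation score.

```python
import random


def evolve(fitness, new_individual, mutate, crossover,
           generations=5, population_size=20, offspring_size=None,
           mutation_rate=0.9, crossover_rate=0.1, seed=42):
    """Toy evolutionary loop (illustrative; NOT how TPOT is implemented)."""
    if mutation_rate + crossover_rate > 1:
        raise ValueError("mutation_rate + crossover_rate must not exceed 1")
    rng = random.Random(seed)
    offspring_size = offspring_size or population_size
    population = [new_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        offspring = []
        for _ in range(offspring_size):
            roll = rng.random()
            if roll < crossover_rate:
                # Combine two parents into one child.
                a, b = rng.sample(population, 2)
                child = crossover(a, b, rng)
            elif roll < crossover_rate + mutation_rate:
                # Randomly change a single parent.
                child = mutate(rng.choice(population), rng)
            else:
                # Copy a parent unchanged.
                child = rng.choice(population)
            offspring.append(child)
        # Retain the best `population_size` individuals among parents and offspring.
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        population = ranked[:population_size]
        history.append(fitness(population[0]))
    return population[0], history


# Toy problem: individuals are 8-bit tuples and fitness is the number of ones.
def new_individual(rng):
    return tuple(rng.randint(0, 1) for _ in range(8))


def mutate(ind, rng):
    i = rng.randrange(len(ind))
    return ind[:i] + (1 - ind[i],) + ind[i + 1:]


def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


best, history = evolve(sum, new_individual, mutate, crossover)
print(best, history)
```

Because parents compete with their offspring for a place in the next generation, the best score in history never decreases, which is the same pattern you see in the "Current best internal CV score" lines that TPOT prints.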
TPOT Functions
Below are the functions, along with their parameters, that can be called on a TPOT object:
- fit(X_train, y_train): runs the TPOT optimization process on the training data, where X holds the independent variables and y the dependent variable.
- predict(X_test): predicts the output for the testing data.
- score(X_test, y_test): compares the predicted output with the actual output and returns the scoring value. The scoring function can be customized.
- export(output_file_name): exports the optimized pipeline as Python code.
1. TPOT Classifier
Below is the TPOTClassifier function. All the parameters are given their default values which can be modified as per requirement.
TPOTClassifier(generations=100, population_size=100, offspring_size=None, mutation_rate=0.9, crossover_rate=0.1, scoring='accuracy', cv=5, subsample=1.0, n_jobs=1, max_time_mins=None, max_eval_time_mins=5, random_state=None, config_dict=None, template=None, warm_start=False, memory=None, use_dask=False, periodic_checkpoint_folder=None, early_stop=None, verbosity=0, disable_update_check=False, log_file=None)
TPOTClassifier Implementation
Let us see the implementation of TPOTClassifier using Python in a five-step process.
Step 1: Load Dataset
We will load data directly from sklearn datasets. The load_digits dataset has numerical digits as classes, therefore it is a multiclass classification problem.
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
(1797, 64)
Step 2: Split the data into train and test
The data is divided into train (75%) and test (25%) sets. The shape of the data is printed after the split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1347, 64), (450, 64), (1347,), (450,))
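The shapes above follow from simple arithmetic on the 1797 samples. The sketch below assumes that scikit-learn rounds the test fraction up and the train fraction down (and gives the train set the remainder when only test_size is passed); split_sizes() is a hypothetical helper written for this example, not a scikit-learn function.

```python
from math import ceil, floor


def split_sizes(n_samples, test_size=0.25, train_size=None):
    """Hypothetical helper: estimate train/test row counts for fractional sizes."""
    n_test = ceil(test_size * n_samples)
    if train_size is None:
        # Only test_size given: the train set takes whatever is left.
        n_train = n_samples - n_test
    else:
        n_train = floor(train_size * n_samples)
    return n_train, n_test


print(split_sizes(1797, test_size=0.25, train_size=0.75))  # (1347, 450)
```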
Step 3: Fit training data on TPOT
A TPOTClassifier object is created with the parameters explained above and fitted to the training data using the fit() function.
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
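It helps to estimate how much work a given setting implies before calling fit(). Per the TPOT documentation, TPOT evaluates population_size + generations × offspring_size pipelines in total, with offspring_size defaulting to population_size. The helper below is a hypothetical sketch of that arithmetic; the count of model fits additionally assumes that every pipeline is trained once per cross-validation fold.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Hypothetical helper: total pipelines TPOT evaluates for a given setting."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    return population_size + generations * offspring_size


n_pipelines = pipelines_evaluated(generations=5, population_size=20)
model_fits = n_pipelines * 5  # assumption: one fit per fold with cv=5

print(n_pipelines)  # 120
print(model_fits)   # 600
print(pipelines_evaluated(generations=100, population_size=100))  # 10100 with the defaults
```

This is why the example uses small values for generations and population_size: the default settings evaluate thousands of pipelines and can run for hours.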
Step 4: Predict on testing data
The output is predicted on the testing data using the predict() function.
pipeline_optimizer.predict(X_test)
Step 5: Calculate Score
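The score is calculated using the score() function, which applies the scoring function configured on the TPOT object (accuracy by default for TPOTClassifier) to the test data. The generation lines in the output below are printed by fit() because verbosity=2: since generations is 5, TPOT iterates 5 times and reports the best internal cross-validation score after each generation, followed by the best pipeline found for the given data. The final number is the test score returned by score().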
print(pipeline_optimizer.score(X_test, y_test))
Generation 1 – Current best internal CV score: 0.9806939281288723
Generation 2 – Current best internal CV score: 0.9806939281288723
Generation 3 – Current best internal CV score: 0.9806966818119236
Generation 4 – Current best internal CV score: 0.981434668869613
Generation 5 – Current best internal CV score: 0.9829189040341457
Best pipeline: KNeighborsClassifier(GaussianNB(input_matrix), n_neighbors=3, p=2, weights=uniform)
0.9888888888888889
Can We Use a Custom-Made Score Function?
We have already seen how to use the score() function to evaluate the model with the default metric. A custom scoring function can also be used to evaluate the model.
In the code below, we create a function my_custom_accuracy() and pass it to scikit-learn's make_scorer() function, which turns it into a scorer that TPOT can use. The parameter greater_is_better=True means that higher values are considered better.
If you want to use a scoring function of your own, create the TPOTClassifier object as shown below. All other steps remain the same as in the five-step process above.
from sklearn.metrics import make_scorer
from tpot import TPOTClassifier

def my_custom_accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return float(sum(y_pred == y_true)) / len(y_true)

my_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, scoring=my_custom_scorer)
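A custom scorer is most useful when plain accuracy hides a problem. As a sketch, the hypothetical balanced_accuracy() function below has the same (y_true, y_pred) signature as my_custom_accuracy() and could be wrapped with make_scorer() in the same way; it averages the recall of each class, so a model that ignores a rare class is penalized. The tiny label lists are made up for illustration, and scikit-learn also ships its own balanced_accuracy_score.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Made-up example: the model never predicts the rare class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

print(plain_accuracy(y_true, y_pred))     # 0.75
print(balanced_accuracy(y_true, y_pred))  # 0.5
```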
TPOT Configurations
- Default TPOT
- TPOT light
- TPOT MDR
- TPOT sparse
- TPOT NN
- TPOT cuML
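These are the built-in configurations that TPOT provides. Let us see how we can use them in the TPOT pipeline. A built-in configuration is selected by passing its name to the config_dict parameter, while a custom configuration is passed as a Python dictionary that maps each operator to the parameter values TPOT may try.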
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,config_dict='TPOT light')
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, config_dict=tpot_config)
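A custom configuration directly controls how large the search space is. The sketch below counts the parameter combinations each operator contributes in the dictionary above; count_settings() is a hypothetical helper for this article, not part of TPOT, and it counts single operators only (TPOT can also chain operators into longer pipelines).

```python
from math import prod

tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}


def count_settings(config):
    """Hypothetical helper: parameter combinations available per operator."""
    # An operator with no tunable parameters still counts as one setting.
    return {
        operator: prod(len(values) for values in params.values())
        for operator, params in config.items()
    }


counts = count_settings(tpot_config)
for operator, n in counts.items():
    print(operator, n)
print("total:", sum(counts.values()))  # total: 25
```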
2. TPOT Regressor
Below is the TPOTRegressor function. All the parameters are given their default values which can be modified as per requirement.
TPOTRegressor(generations=100, population_size=100, offspring_size=None, mutation_rate=0.9, crossover_rate=0.1, scoring='neg_mean_squared_error', cv=5, subsample=1.0, n_jobs=1, max_time_mins=None, max_eval_time_mins=5, random_state=None, config_dict=None, template=None, warm_start=False, memory=None, use_dask=False, periodic_checkpoint_folder=None, early_stop=None, verbosity=0, disable_update_check=False)
TPOTRegressor Implementation
Let us see the implementation of TPOTRegressor using Python in a five-step process. It is exactly the same as what we have seen in the TPOTClassifier pipeline.
Step 1: Load Dataset
We will work on the Boston house price prediction dataset which is loaded directly from a URL.
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Each record spans two lines in the source file, so the rows are re-joined below
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
X.shape, y.shape
Step 2: Split the data into train and test
Split the data into train (75%) and test (25%) sets using the train_test_split() function of the sklearn library.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((379, 13), (127, 13), (379,), (127,))
Step 3: Fit training data on TPOT
from tpot import TPOTRegressor
reg = TPOTRegressor(verbosity=2, population_size=50, generations=10, random_state=35)
reg.fit(X_train, y_train)
Step 4: Predict on testing data
reg.predict(X_test)
Step 5: Calculate Score
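The score() function is used to evaluate the model on the test data. The scoring function can be replaced with a custom-built one using the method we learnt above in this article. The generation lines below are printed by fit() because verbosity=2: as generations is 10, TPOT reports the best internal cross-validation score 10 times, followed by the best pipeline, which can be used to build your machine learning model. The final number is the test score returned by score(). Because the default scoring for TPOTRegressor is neg_mean_squared_error, the scores are negative, and values closer to zero are better.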
print(reg.score(X_test, y_test))
Generation 1 – Current best internal CV score: -11.179972234350528
Generation 2 – Current best internal CV score: -11.179972234350528
Generation 3 – Current best internal CV score: -11.179972234350528
Generation 4 – Current best internal CV score: -10.86819403913402
Generation 5 – Current best internal CV score: -9.774902464576105
Generation 6 – Current best internal CV score: -9.774902464576105
Generation 7 – Current best internal CV score: -9.774902464576105
Generation 8 – Current best internal CV score: -9.774902464576105
Generation 9 – Current best internal CV score: -9.774902464576105
Generation 10 – Current best internal CV score: -9.774902464576105
Best pipeline: XGBRegressor(ExtraTreesRegressor(input_matrix, bootstrap=False, max_features=0.55, min_samples_leaf=15, min_samples_split=5, n_estimators=100), learning_rate=0.1, max_depth=9, min_child_weight=3, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=1.0, verbosity=0)
-10.508523517471932
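A negative mean squared error is hard to read on its own, so it can help to convert it into a root mean squared error (RMSE), which is in the same units as the target. The sketch below does this for the scores printed above; to_rmse() is a hypothetical helper, and reading the result as thousands of dollars assumes the target is the median home value in $1000s, as in the standard Boston dataset.

```python
from math import sqrt


def to_rmse(neg_mse):
    """Hypothetical helper: convert a neg_mean_squared_error score to RMSE."""
    return sqrt(-neg_mse)


test_score = -10.508523517471932  # value returned by reg.score() above
best_cv_score = -9.774902464576105  # best internal CV score above

print(round(to_rmse(test_score), 2))     # 3.24
print(round(to_rmse(best_cv_score), 2))  # 3.13
```

Under that assumption, the pipeline's predictions on the test set are off by roughly 3.24 thousand dollars in the root-mean-square sense.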
The code used in this article is from TPOT documentation.
Advantages of AutoML
- It helps in finding the best machine-learning model for your problem with the least effort.
- It saves a huge amount of time and effort.
Disadvantages of AutoML
- It can recommend multiple good solutions for the same dataset, which can make the final choice confusing.
- It becomes very time-consuming as the size of the data increases.
Conclusion
After understanding TPOT, you may be thinking it can replace machine learning engineers or data scientists. Well, AutoML packages can support them but can't replace them completely. AutoML can return a well-performing model, which data scientists can then combine with their domain expertise to solve a business problem. You should definitely try using the AutoML process in your current ML project.
I hope the article was useful and made the TPOT concepts clear. Feel free to share your feedback and ask any questions in the comment section below.
Thank You and Happy Learning!