8

Hyperparameter Optimization

How a Kaggle solution performs is not determined simply by the type of learning algorithm you choose. Aside from the data and the features you use, it is also strongly determined by the algorithm's hyperparameters: the parameters that have to be fixed prior to training and cannot be learned during the training process. Choosing the right variables/data/features is most effective in tabular data competitions; hyperparameter optimization, however, is effective in competitions of any kind. In fact, given fixed data and an algorithm, hyperparameter optimization is the only sure way to enhance the predictive performance of the algorithm and climb the leaderboard. It also helps in ensembling, because an ensemble of tuned models generally performs better than an ensemble of untuned ones.

You may hear that tuning hyperparameters manually is possible if you know and understand the effects of your choices on the algorithm. Many Kaggle Grandmasters and Masters have declared that they often rely on directly tuning their models in competitions. They operate selectively on the most important hyperparameters, in a bisection style, exploring smaller and smaller intervals of a parameter's values until they find the value that produces the best result, then moving on to another parameter. This works perfectly well if there is a single minimum for each parameter and if the parameters are independent of each other; in that case, the search is mostly driven by experience and knowledge of the learning algorithms. In our experience, however, that is not the case with most tasks you will encounter on Kaggle. The sophistication of the problems and of the algorithms used requires a systematic approach that only a search algorithm can provide. Hence, we decided to write this chapter.

In this chapter, we will explore how to extend your cross-validation approach to find the best hyperparameters that can generalize to your test set. The idea is to cope with the pressure and the scarcity of time and resources that you experience in competitions. For this reason, we will concentrate on Bayesian optimization methods, which are a proven way to optimize complex models and data problems based on the resources you have available. We won't limit ourselves to searching for the best values of pre-defined hyperparameters; we will also delve into the problem of neural network architecture search.

We will cover the following topics:

  • Basic optimization techniques
  • Key parameters and how to use them
  • Bayesian optimization

Let’s start!

Basic optimization techniques

The core algorithms for hyperparameter optimization found in the Scikit-learn package are grid search and random search. Recently, the Scikit-learn contributors have also added a halving algorithm to improve the performance of both the grid search and random search strategies.

In this section, we will discuss all these basic techniques. By mastering them, not only will you have effective optimization tools for some specific problems (for instance, SVMs are usually optimized by grid search) but you will also be familiar with the basics of how hyperparameter optimization works.

To start with, it is crucial to figure out what the necessary ingredients are:

  • A model whose hyperparameters have to be optimized
  • A search space containing the boundaries of the values to search between for each hyperparameter
  • A cross-validation scheme
  • An evaluation metric and its score function

All these elements come together in the search method to determine the solution you are looking for. Let’s see how it works.

Grid search

Grid search is a method that searches through the hyperparameters exhaustively, and so it is not feasible in high-dimensional spaces. For every parameter, you pick a set of values you want to test, then test all the possible combinations of these sets. That is why it is exhaustive: you try everything. It is a very simple algorithm and it suffers from the curse of dimensionality but, on the positive side, it's embarrassingly parallel (see https://www.cs.iusb.edu/~danav/teach/b424/b424_23_embpar.html for a definition of this computer science term). This means you can obtain an optimal tuning very quickly, if you have enough processors to run the search on.

As an example, let's take a classification problem and support-vector machine classification (SVC). Support-vector machines (SVMs), for both classification and regression, are probably the machine learning algorithm you will apply grid search to most often. Using the make_classification function from Scikit-learn, we can quickly generate a classification dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10,
                           n_redundant=25, n_repeated=15,
                           n_clusters_per_class=5,
                           flip_y=0.05, class_sep=0.5,
                           random_state=0)

For our next step, we define a basic SVC algorithm and set the search space. Since the kernel function of the SVC (the internal function that transforms the input data in an SVM) determines the different hyperparameters to set, we provide a list containing two dictionaries of distinct search spaces for parameters to be used depending on the type of kernel chosen. We also set the evaluation metric (we use accuracy in this case, since the target is perfectly balanced):

from sklearn import svm
from sklearn import model_selection

svc = svm.SVC(probability=True, random_state=1)
search_grid = [
               {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
               {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
               'kernel': ['rbf']}
               ]
               
scorer = 'accuracy'

In our example, a linear kernel doesn’t require the tuning of the gamma parameter, though it is very important for a radial basis function kernel. Therefore, we provide two dictionaries: the first containing the parameters for the linear kernel, the second containing parameters for a radial basis function kernel. Each dictionary only contains a reference to the kernel it is relevant to and only the range of parameters that are relevant for that kernel.

It is important to note that the evaluation metric can be different from the cost function optimized by the algorithm. In fact, as discussed in Chapter 5, Competition Tasks and Metrics, you may encounter scenarios in which the evaluation metric for the competition is different, but you cannot modify the cost function of your algorithm. Under these circumstances, tuning the hyperparameters according to your evaluation metric can still help in obtaining a well-performing model. Though built around the algorithm’s cost function, the optimal set of hyperparameters found will be the one returning the best evaluation metric under such constraints. It probably won’t be the theoretically best result that you could obtain for the problem, but it may often not be far from it.

All the ingredients (model, search space, evaluation metric, cross-validation scheme) are combined into the GridSearchCV instance, and then the model is fit to the data:

search_func = model_selection.GridSearchCV(estimator=svc, 
                                           param_grid=search_grid,
                                           scoring=scorer, 
                                           n_jobs=-1,
                                           cv=5)
search_func.fit(X, y)
print (search_func.best_params_)
print (search_func.best_score_)

After a while, depending on the machine you are running the optimization on, you will obtain the best combination based on cross-validated results.

In conclusion, grid search is a very simple optimization algorithm that can leverage the availability of multi-core computers. It can work fine with machine learning algorithms that do not require tuning many hyperparameters (such as SVMs and ridge and lasso regression) but, in all other cases, its applicability is quite narrow. First, it is limited to optimizing hyperparameters by discrete choice (you need a limited set of values to cycle through). In addition, you cannot expect it to work effectively on algorithms requiring many hyperparameters to be tuned: the search space explodes in complexity and, since the search tries parameter values blindly, most of the computation is spent on combinations that do not work for the problem.

Random search

Random search, which simply samples the search space randomly, is feasible in high-dimensional spaces and is widely used in practice. The downside of random search, however, is that it doesn't use information from prior experiments to select the next setting (a problem shared by grid search, we should note). To find the best solution as fast as possible, there is nothing you can do except hope to be lucky enough to sample the right hyperparameters early.

Random search works incredibly well and it is simple to understand. Despite relying on randomness, it isn't just based on blind luck, though it may initially appear to be. In fact, it works like random sampling in statistics: the main point of the technique is that if you run enough random tests, you have a good chance of finding the right parameters without wasting energy on testing slightly different variations of similarly performing combinations.

Many AutoML systems rely on random search when there are too many parameters to set (see Golovin, D. et al. Google Vizier: A Service for Black-Box Optimization, 2017). As a rule of thumb, consider looking at random search when the dimensionality of your hyperparameter optimization problem is sufficiently high (for example, over 16).

Below, we run the previous example using random search:

import scipy.stats as stats
# loguniform is available in scipy.stats (it was previously also exposed
# as sklearn.utils.fixes.loguniform)
search_dict = {'kernel': ['linear', 'rbf'], 
               'C': stats.loguniform(1, 1000),
               'gamma': stats.loguniform(0.0001, 0.1)
               }
scorer = 'accuracy'
search_func = model_selection.RandomizedSearchCV(estimator=svc,
                                                 param_distributions=search_dict,
                                                 n_iter=6,
                                                 scoring=scorer,
                                                 n_jobs=-1,
                                                 cv=5)
search_func.fit(X, y)
print(search_func.best_params_)
print(search_func.best_score_)

Notice that, now, we don't care about running the search on separate spaces for the different kernels. Contrary to grid search, where every combination, even an ineffective one, is systematically tested at a computational cost, here the efficiency of the search is not affected by which hyperparameters happen to be sampled. The search doesn't depend on irrelevant parameters, but is guided by chance; any trial is useful, even if you are testing only one valid parameter among many for the chosen kernel.

Halving search

As we mentioned, both grid search and random search work in an uninformed way: if some tests find out that certain hyperparameters do not impact the result or that certain value intervals are ineffective, the information is not propagated to the following searches.

For this reason, Scikit-learn has recently introduced the HalvingGridSearchCV and HalvingRandomSearchCV estimators, which can be used to search a parameter space using successive halving applied to the grid search and random search tuning strategies.

In halving, a large number of hyperparameter combinations are evaluated in an initial round of tests but using a small amount of computational resources. This is achieved by running the tests on a subsample of a few cases from your training data. A smaller training set needs fewer computations to be tested, so fewer resources (namely time) are used at the cost of more imprecise performance estimations. This initial round allows the selection of a subset of candidate hyperparameter values, which have performed better on the problem, to be used for the second round, when the training set size is increased.

The following rounds proceed in a similar way, allocating larger and larger subsets of the training set to be searched as the range of tested values is restricted (testing now requires more time to execute, but returns a more precise performance estimation), while the number of candidates continues to be halved.

Here is an example applied to the previous problem:

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
search_func = HalvingRandomSearchCV(estimator=svc,
                                    param_distributions=search_dict,
                                    resource='n_samples',
                                    max_resources=100,
                                    aggressive_elimination=True,
                                    scoring=scorer,
                                    n_jobs=-1,
                                    cv=5,
                                    random_state=0)
search_func.fit(X, y)
print (search_func.best_params_)
print (search_func.best_score_)

In this way, halving provides information to the successive optimization steps via the selection of the candidates. In the next sections, we will discuss even smarter ways to achieve a more precise and efficient search through the space of hyperparameters.

Kazuki Onodera

https://www.kaggle.com/onodera

Let’s pause for an interview with another Kaggler. Kazuki Onodera is a Competitions Grandmaster and Discussions Master who has around 7 years of competition experience. He’s also a Senior Deep Learning Data Scientist at NVIDIA and a member of the NVIDIA KGMON (Kaggle Grandmasters of NVIDIA) team.

What’s your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

Instacart Market Basket Analysis. This competition proved quite challenging for the Kaggle community because of its use of anonymized data related to customer orders over time in order to predict which previously purchased products will be in a user’s next order. The reason why I like it is that I love feature engineering and I could come up with a bunch of good and interesting features everyone else couldn’t, which allowed me to get second place in the competition.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

I try to imagine how a model works, and delve into false negatives and false positives. Same as in my daily work.

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

Human Protein Atlas - Single Cell Classification. This competition was a kind of instance segmentation competition, but no masks were provided. So, it turned into being a weakly supervised multi-label classification problem. I created a two-stage pipeline for removing label noise.

Has Kaggle helped you in your career? If so, how?

Yes. I’m now working in the NVIDIA KGMON (Kaggle Grandmasters of NVIDIA) team. Kaggle launches many different machine learning competitions, different with regards to data type, tabular, image, natural language, and signal, as well as with regards to sector and domain: industry, finance, astronomy, pathology, sports, retail, and so on. I bet nobody can access and have experience with all these kinds of data except Kagglers.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

Target analysis. Also, seed averaging is quite overlooked: always simple but powerful.

What mistakes have you made in competitions in the past?

Target analysis. Top teams always analyze the target better than others, so if I couldn’t get a better place in a competition, I go and read about the top solutions, because they always describe to me the knowledge about the data that I missed during the competition.

Are there any particular tools or libraries that you would recommend using for data analysis or machine learning?

Just Python and Jupyter Notebooks.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

If you can learn from a defeat, you haven’t really lost.

Do you use other competition platforms? How do they compare to Kaggle?

KDD Cup and RecSys. Both meet the minimum requirements for being interesting and challenging.

Key parameters and how to use them

The next problem is using the right set of hyperparameters for each kind of model you use. In particular, in order to be efficient in your optimization, you need to know the values of each hyperparameter that it actually makes sense to test for each distinct algorithm.

In this section, we will examine the most common models used in Kaggle competitions, especially the tabular ones, and discuss the hyperparameters you need to tune in order to obtain the best results. We will distinguish between classical machine learning models and gradient boosting models (which are much more demanding in terms of their space of parameters) for generic tabular data problems.

As for neural networks, we can give you an idea about specific parameters to tune when we present the standard models (for instance, the TabNet neural model has some specific parameters to set so that it works properly). However, most of the optimization on deep neural networks in Kaggle competitions is not performed on standard models, but on custom ones. Consequently, apart from basic learning parameters such as the learning rate and the batch size, optimization in neural networks is based on the specific characteristics of the neural architecture of your model. You have to deal with the problem in an ad hoc way. Near the end of the chapter, we will discuss an example of neural architecture search (NAS) using KerasTuner (https://keras.io/keras_tuner/).

Linear models

The linear models that need to be tuned are usually linear regressions or logistic regressions with regularization:

  • C: The range you should search is np.logspace(-4, 4, 10); smaller values specify stronger regularization.
  • alpha: You should search the range np.logspace(-2, 2, 10); larger values specify stronger regularization. Also take note that higher values take more time to process when using lasso.
  • l1_ratio: You should pick from the list [.1, .5, .7, .9, .95, .99, 1]; it applies only to elastic net.

In Scikit-learn, depending on the algorithm, you find either the hyperparameter C (logistic regression) or alpha (lasso, ridge, elastic net).
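
As a minimal sketch (the estimator settings and scoring choices here are illustrative, not taken from a specific competition), these ranges translate directly into grid searches:

import numpy as np
from sklearn.linear_model import LogisticRegression, ElasticNet
from sklearn.model_selection import GridSearchCV

# Logistic regression: C is the inverse of the regularization strength
logit_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': np.logspace(-4, 4, 10)},
    scoring='accuracy', cv=5)

# Elastic net: alpha sets the regularization strength, l1_ratio the L1/L2 mix
enet_grid = GridSearchCV(
    estimator=ElasticNet(max_iter=10000),
    param_grid={'alpha': np.logspace(-2, 2, 10),
                'l1_ratio': [.1, .5, .7, .9, .95, .99, 1]},
    scoring='neg_mean_squared_error', cv=5)

Both objects are then fitted with .fit(X, y) and expose best_params_ and best_score_, as in the previous examples.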

Support-vector machines

SVMs are a family of powerful and advanced supervised learning techniques for classification and regression that can automatically fit linear and non-linear models. Scikit-learn offers an implementation based on LIBSVM, a complete library of SVM classification and regression implementations, and LIBLINEAR, a scalable library for linear classification ideal for large datasets, especially sparse text-based ones. In their optimization, SVMs strive to separate target classes in classification problems using a decision boundary characterized by the largest possible margin between classes.

Though SVMs work fine with default parameters, they are often not optimal, and you need to test various value combinations using cross-validation to find the best ones. Listed according to their importance, you have to set the following parameters:

  • C: The penalty value. Decreasing it makes the margin between classes larger, thus ignoring more noise but also making the model more generalizable. A best value can normally be found in the range np.logspace(-3, 3, 7).
  • kernel: This parameter will determine how non-linearity will be implemented in an SVM and it can be set to 'linear', 'poly', 'rbf', 'sigmoid', or a custom kernel. The most commonly used value is certainly rbf.
  • degree: Works with kernel='poly', setting the degree of the polynomial expansion. It is ignored by other kernels. Usually, values between 2 and 5 work best.
  • gamma: A coefficient for 'rbf', 'poly', and 'sigmoid'. High values tend to fit data in a better way, but can lead to some overfitting. Intuitively, we can imagine gamma as the influence that a single example exercises over the model. Low values make the influence of each example reach further. Since many points have to be considered, the SVM curve will tend to take a shape less influenced by local points and the result will be a smoother decision contour curve. High values of gamma, instead, mean the curve takes into account how points are arranged locally more and, as a result, you get a more irregular and wiggly decision curve. The suggested grid search range for this hyperparameter is np.logspace(-3, 3, 7).
  • nu: For regression and classification with nuSVR and nuSVC, this parameter sets a tolerance for the training points that are near to the margin and are not classified correctly. It helps in ignoring misclassified points just near or on the margin, hence it can render the classification decision curve smoother. It should be in the range [0,1] since it is a proportion relative to your training set. Ultimately, it acts like C, with high proportions enlarging the margin.
  • epsilon: This parameter specifies how much error SVR will accept, by defining an epsilon large range where no penalty is associated with an incorrect prediction of the example during the training of the algorithm. The suggested search range is np.logspace(-4, 2, 7).
  • penalty, loss, and dual: For LinearSVC, these parameters accept the ('l1', 'squared_hinge', False), ('l2', 'hinge', True), ('l2', 'squared_hinge', True), and ('l2', 'squared_hinge', False) combinations. The ('l2', 'hinge', True) combination is analogous to the SVC(kernel='linear') learner.

It may appear that an SVM has many hyperparameters to set, but many settings are specific only to implementations or to kernels, so you only have to select the relevant parameters.
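
As a sketch of how the most relevant parameters could be combined into a single randomized search (the values and ranges are illustrative; continuous distributions replace the discrete np.logspace grids suggested above):

from scipy.stats import loguniform
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

svc_search = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions={
        'C': loguniform(1e-3, 1e3),          # continuous version of np.logspace(-3, 3, 7)
        'gamma': loguniform(1e-3, 1e3),
        'kernel': ['rbf', 'poly', 'sigmoid'],
        'degree': [2, 3, 4, 5],              # only used when kernel='poly'
    },
    n_iter=30, scoring='accuracy', cv=5, n_jobs=-1, random_state=0)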

Random forests and extremely randomized trees

Leo Breiman and Adele Cutler originally devised the idea at the core of the random forest algorithm, and the name of the algorithm remains a trademark of theirs today (though the algorithm is open source). Random forests are implemented in Scikit-learn as RandomForestClassifier or RandomForestRegressor.

A random forest works in a similar way to bagging, also devised by Leo Breiman, but operates only using binary split decision trees, which are left to grow to their extremes. Moreover, it samples the cases to be used in each of its models using bootstrapping. As the tree is grown, at each split of a branch, the set of variables considered for the split is drawn randomly, too.

This is the secret at the heart of the algorithm: it ensembles trees that, due to different samples and variables considered at the splits, are very different from each other. As they are different, they are also uncorrelated. This is beneficial because when the results are ensembled, much variance is ruled out, as the extreme values on both sides of a distribution tend to balance out. In other words, bagging algorithms guarantee a certain level of diversity in the predictions, allowing them to develop rules that a single learner (such as a decision tree) might not come across. All this diversity is useful because it helps in building a distribution whose average is a better predictor than any of the individual trees in the ensemble.

Extra Trees (also known as extremely randomized trees), represented in Scikit-learn by the ExtraTreesClassifier/ExtraTreesRegressor classes, are a more randomized kind of random forest that produces a lower variance in the estimates at the cost of greater bias of the estimators. However, when it comes to CPU efficiency, Extra Trees can deliver a considerable speed-up compared to random forests, so they can be ideal when you are working with large datasets in terms of both examples and features. The reason for the resulting higher bias but better speed is the way splits are built in an Extra Tree. Random forests, after drawing a random set of features to be considered for splitting a branch of a tree, carefully search among them for the best values to assign to each branch. By contrast, in Extra Trees, both the set of candidate features for the split and the actual split value are decided completely randomly. So, there’s no need for much computation, though the randomly chosen split may not be the most effective one (hence the bias).

For both algorithms, the key hyperparameters that should be set are as follows:

  • max_features: This is the number of sampled features that are present at every split, which can determine the performance of the algorithm. The lower the number, the speedier, but with higher bias.
  • min_samples_leaf: This allows you to determine the depth of the trees. Large numbers diminish the variance and increase the bias.
  • bootstrap: This is a Boolean that allows bootstrapping.
  • n_estimators: This is the number of trees. Remember that the more trees the better, though there is a threshold beyond which we get diminishing returns depending on the data problem. Also, this comes at a computational cost that you have to take into account based on the resources you have available.

Extra Trees are a good alternative to random forests, especially when the data you have is particularly noisy. Since they trade some variance reduction for more bias given their random choice of splits, they tend to overfit less on important yet noisy features that would otherwise dominate the splits in a random forest.
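
As a hedged sketch, the four hyperparameters above can be expressed as a distribution dictionary to be fed, together with a RandomForest* or ExtraTrees* estimator, into RandomizedSearchCV (the values are illustrative):

from scipy.stats import randint

forest_distributions = {
    'max_features': ['sqrt', 'log2', 0.5, 0.8, 1.0],  # number or share of features per split
    'min_samples_leaf': randint(1, 101),              # larger values: less variance, more bias
    'bootstrap': [True, False],
    'n_estimators': [100, 300, 500, 1000],
}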

Gradient tree boosting

Gradient tree boosting or gradient boosting decision trees (GBDT) is an improved version of boosting (boosting works by fitting a sequence of weak learners on reweighted versions of the data). Like AdaBoost, GBDT is based on a gradient descent function. The algorithm has proven to be one of the most proficient ones from the family of models that are based on ensembles, though it is characterized by an increased variance of estimates, more sensitivity to noise in data (both problems can be mitigated by using subsampling), and significant computational costs due to non-parallel operations.

Apart from deep learning, gradient boosting is the most developed machine learning algorithm. Since AdaBoost and the initial gradient boosting implementation developed by Jerome Friedman, various other implementations have appeared, the most recent being XGBoost, LightGBM, and CatBoost.

LightGBM

The high-performance LightGBM algorithm (https://github.com/Microsoft/LightGBM) is capable of being distributed on multiple computers and handling large amounts of data quickly. It was developed by a team at Microsoft as an open-source project on GitHub (there is also an academic paper: https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html).

LightGBM is based on decision trees, like XGBoost, but it follows a different strategy. While XGBoost uses decision trees to split on a variable and explore different tree splits at that variable (the level-wise tree growth strategy), LightGBM concentrates on one split and goes on splitting from there in order to achieve a better fit (the leaf-wise tree growth strategy). This allows LightGBM to quickly reach a good fit of the data, and to generate alternative solutions compared to XGBoost (which is good, if you expect to blend the two solutions together in order to reduce the variance of the estimates). Algorithmically speaking, if we think of the structure of splits operated by a decision tree as a graph, XGBoost pursues a breadth-first search (BFS) and LightGBM a depth-first search (DFS).

Tuning LightGBM may appear daunting; it has more than a hundred parameters to tune that you can explore at this page: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst (also here: https://lightgbm.readthedocs.io/en/latest/Parameters.html).

As a rule of thumb, you should focus on the following hyperparameters, which usually have the most impact on the results:

  • n_estimators: An integer between 10 and 10,000 that sets the number of iterations.
  • learning_rate: A real number between 0.01 and 1.0, usually sampled from a log-uniform distribution. It represents the step size of the gradient descent procedure that computes the weights for the summed ensemble of all the iterations of the algorithm up to this point.
  • max_depth: An integer between 1 and 16, representing the maximum depth of each tree. Setting it to a number below 0 removes the limit on depth, usually risking overfitting to the data.
  • num_leaves: An integer between 2 and 2^max_depth, representing the number of final leaves each tree will have at most.
  • min_data_in_leaf: An integer between 0 and 300 that determines the minimum number of data points in one leaf.
  • min_gain_to_split: A float between 0 and 15; it sets the minimum gain of the algorithm for tree partitioning. By setting this parameter, you can avoid unnecessary tree splits and thus reduce overfitting (it corresponds to the gamma parameter in XGBoost).
  • max_bin: An integer between 32 and 512 that sets the maximum number of bins that feature values will be bucketed into. Having this parameter larger than the default value of 255 implies more risk of producing overfitting results.
  • subsample: A real number between 0.01 and 1.0, representing the portion of the sample to be used in training.
  • subsample_freq: An integer between 0 and 10 specifying the frequency, in terms of iterations, at which the algorithm will subsample the examples. Note that, if set to zero, the algorithm will ignore any value given to the subsample parameter; since it is zero by default, setting only the subsample parameter won't have any effect.
  • feature_fraction: A real number between 0.1 and 1.0 allowing you to specify the portion of features to be subsampled. Subsampling the features is another way to allow more randomization to play a role in the training, fighting noise and multicollinearity present in the features.
  • subsample_for_bin: An integer between 30 and the number of examples. This sets the number of examples that are sampled for the construction of histogram bins.
  • reg_lambda: A real number between 0 and 100.0 that sets the L2 regularization. Since it is more sensitive to the scale than to the exact number of the parameter, it is usually sampled from a log-uniform distribution.
  • reg_alpha: A real number between 0 and 100.0, usually sampled from a log-uniform distribution, which sets the L1 regularization.
  • scale_pos_weight: A real number between 1e-6 and 500, better sampled from the log-uniform distribution. The parameter weights the positive cases (thus effectively upsampling or downsampling) against the negative cases, which are kept to the value of 1.

Although the number of hyperparameters to tune when using LightGBM may appear daunting, in reality only a few of them matter a lot. Given a fixed number of iterations and learning rate, just a few are the most impactful (feature_fraction, num_leaves, subsample, reg_lambda, reg_alpha, min_data_in_leaf), as explained in this blog article by Kohei Ozaki, a Kaggle Grandmaster: https://medium.com/optuna/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization-8b7095e99258. Kohei Ozaki leverages this fact in order to create a fast-tuning procedure for Optuna (you’ll find more on the Optuna optimizer at the end of this chapter).
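
A hedged sketch of such a restricted search, using the scikit-learn wrapper's parameter names (colsample_bytree and min_child_samples are the wrapper's aliases for feature_fraction and min_data_in_leaf), with the number of iterations and the learning rate kept fixed; all ranges are illustrative:

from scipy.stats import loguniform, uniform, randint
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

lgb_search = RandomizedSearchCV(
    estimator=lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05,
                                random_state=0),
    param_distributions={
        'colsample_bytree': uniform(0.1, 0.9),   # feature_fraction
        'num_leaves': randint(2, 256),
        'subsample': uniform(0.1, 0.9),
        'subsample_freq': randint(1, 11),        # must be > 0 for subsample to apply
        'reg_lambda': loguniform(1e-9, 100.0),
        'reg_alpha': loguniform(1e-9, 100.0),
        'min_child_samples': randint(1, 301),    # min_data_in_leaf
    },
    n_iter=40, scoring='neg_mean_squared_error', cv=5,
    n_jobs=-1, random_state=0)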

XGBoost

XGBoost (https://github.com/dmlc/XGBoost) stands for eXtreme Gradient Boosting. It is an open-source project that is not part of Scikit-learn, though it has recently been expanded by a Scikit-learn wrapper interface that makes it easier to incorporate XGBoost into a Scikit-learn-style data pipeline.

The XGBoost algorithm gained momentum and popularity in 2015 data science competitions, such as those on Kaggle and the KDD Cup 2015. As the creators (Tianqi Chen, Tong He, and Carlos Guestrin) report in papers they wrote on the algorithm, out of 29 challenges held on Kaggle during 2015, 17 winning solutions used XGBoost as a standalone solution or as part of an ensemble of multiple different models. Since then, the algorithm has always retained a strong appeal among the community of data scientists, though it struggled to keep pace with the innovation brought about by other GBM implementations such as LightGBM and CatBoost.

Aside from good performance in terms of both accuracy and computational efficiency, XGBoost is also a scalable solution, making the best use of multi-core processors as well as distributed machines.

XGBoost represents a new generation of GBM algorithms thanks to important tweaks to the initial tree boost GBM algorithm:

  • Sparsity-awareness; it can leverage sparse matrices, saving both memory (no need for dense matrices) and computation time (zero values are handled in a special way).
  • Approximate tree learning (weighted quantile sketch), which produces similar results but in much less time compared to the classical complete explorations of possible branch cuts.
  • Parallel computing on a single machine (using multi-threading during the search for the best split) and, similarly, distributed computations on multiple machines.
  • Out-of-core computations on a single machine, leveraging a data storage solution called column block. This arranges data on a disk by columns, thus saving time by pulling data from the disk in the way the optimization algorithm (which works on column vectors) expects it.

XGBoost can also deal with missing data in an effective way. Other tree ensembles based on standard decision trees require missing data first to be imputed using an off-scale value, such as a negative number, in order to develop an appropriate branching of the tree to deal with missing values.

As for XGBoost’s parameters (https://xgboost.readthedocs.io/en/latest/parameter.html), we have decided to highlight a few key ones you will find across competitions and projects:

  • n_estimators: Usually an integer ranging from 10 to 5,000.
  • learning_rate: A real number ranging from 0.01 to 1.0, better sampled from the log-uniform distribution.
  • min_child_weight: Usually an integer between 1 and 10.
  • max_depth: Usually an integer between 1 and 50.
  • max_delta_step: Usually an integer sampled between 0 and 20, representing the maximum delta step we allow for each leaf output.
  • subsample: A real number from 0.1 to 1.0 indicating the proportion of examples to be subsampled.
  • colsample_bytree: A real number from 0.1 to 1.0 indicating the subsample ratio of columns by tree.
  • colsample_bylevel: A real number from 0.1 to 1.0 indicating the subsample ratio by level in trees.
  • reg_lambda: A real number between 1e-9 and 100.0, preferably sampled from the log-uniform distribution. This parameter controls the L2 regularization.
  • reg_alpha: A real number between 1e-9 and 100.0, preferably sampled from the log-uniform distribution. This parameter controls the L1 regularization.
  • gamma: Specifying the minimum loss reduction for tree partitioning, this parameter requires a real number between 1e-9 and 0.5, preferably sampled from the log-uniform distribution.
  • scale_pos_weight: A real number between 1e-6 and 500.0, preferably sampled from the log-uniform distribution, which represents a weight for the positive class.

Like LightGBM, XGBoost also has many similar hyperparameters to tune, hence all of the considerations previously made for LightGBM are also valid for XGBoost.
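
As a hedged sketch, the ranges above translate into a distribution dictionary that can be plugged into RandomizedSearchCV (or the BayesSearchCV optimizer covered later) together with an XGBRegressor or XGBClassifier; the values are illustrative:

from scipy.stats import loguniform, uniform, randint

xgb_distributions = {
    'learning_rate': loguniform(0.01, 1.0),
    'min_child_weight': randint(1, 11),        # integers from 1 to 10
    'max_depth': randint(1, 51),               # integers from 1 to 50
    'subsample': uniform(0.1, 0.9),            # from 0.1 to 1.0
    'colsample_bytree': uniform(0.1, 0.9),
    'colsample_bylevel': uniform(0.1, 0.9),
    'reg_lambda': loguniform(1e-9, 100.0),
    'reg_alpha': loguniform(1e-9, 100.0),
    'gamma': loguniform(1e-9, 0.5),
}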

CatBoost

In July 2017, Yandex, the Russian search engine, made another interesting GBM algorithm public, CatBoost (https://catboost.ai/), whose name comes from putting together the two words “Category” and “Boosting.” In fact, its strong point is its ability to handle categorical variables, which make up most of the information in most relational databases, by adopting a mixed strategy of one-hot encoding and target encoding. Target encoding is a way to express categorical levels by assigning them an appropriate numeric value for the problem at hand; more on this can be found in Chapter 7, Modeling for Tabular Competitions.

The idea used by CatBoost to encode categorical variables is not new; it is a kind of feature engineering that has been used before, mostly in data science competitions. Target encoding, also known as likelihood encoding, impact coding, or mean encoding, is simply a way to transform your labels into a number based on their association with the target variable. If you have a regression problem, you could transform labels based on the mean target value typical of that level; if it is a classification problem, you use the probability of your target conditional on each category value. It may appear a simple and smart feature engineering trick, but it has side effects, mostly in terms of overfitting, because you are taking information from the target into your predictors.
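
As a rough illustration of plain mean target encoding (not CatBoost's actual internal procedure, which uses ordered statistics to limit target leakage), each category level is replaced by the average of the target observed for that level; the toy data below is invented:

import pandas as pd

df = pd.DataFrame({'city': ['Rome', 'Rome', 'Milan', 'Turin', 'Milan'],
                   'target': [1, 0, 1, 0, 1]})
# Average target value for each level of the categorical feature
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)

Computing such statistics on the same data you train on leaks the target into the features, which is why out-of-fold or ordered schemes are used in practice.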

CatBoost has quite a few parameters (see https://catboost.ai/en/docs/references/training-parameters/). We have limited our discussion to the eight most important ones:

  • iterations: Usually an integer between 10 and 1,000, but it can increase based on the problem.
  • depth: An integer between 1 and 8; usually higher values require longer fitting times and do not produce better results.
  • learning_rate: A real value between 0.01 and 1.0, better sampled from the log-uniform distribution.
  • random_strength: A real number log-linearly sampled from the range 1e-9 to 10.0, which specifies the randomness level for scoring splits.
  • bagging_temperature: A real value between 0.0 and 1.0 that sets the Bayesian bootstrap.
  • border_count: An integer between 1 and 255 indicating the splits for numerical features.
  • l2_leaf_reg: An integer between 2 and 30; the value for L2 regularization.
  • scale_pos_weight: A real number between 0.01 and 10.0 representing the weight for the positive class.

Even if CatBoost may appear to be just another GBM implementation, it has quite a few differences (highlighted also by the different parameters being used) that may provide great help in a competition, both as a single-model solution and as a model integrated into a larger ensemble.
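
As a minimal sketch (the parameter values are illustrative), CatBoost's scikit-learn-style interface lets you pass the categorical columns directly and leave their encoding to the library:

from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=500,
                           depth=6,
                           learning_rate=0.05,
                           l2_leaf_reg=3,
                           random_seed=0,
                           verbose=0)
# categorical_columns would list the indices or names of the categorical features
# model.fit(X_train, y_train, cat_features=categorical_columns)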

HistGradientBoosting

Recently, Scikit-learn has introduced a new version of gradient boosting inspired by LightGBM's binned data and histograms (see this presentation at EuroPython by Olivier Grisel: https://www.youtube.com/watch?v=urVUlKbQfQ4). Either as a classifier (HistGradientBoostingClassifier) or a regressor (HistGradientBoostingRegressor), it can be used to enrich ensembles with different models, and it presents a much smaller, more essential set of hyperparameters to be tuned:

  • learning_rate: A real number between 0.01 and 1.0, usually sampled from a log-uniform distribution.
  • max_iter: An integer that can range from 10 to 10,000.
  • max_leaf_nodes: An integer from 2 to 500. It interacts with max_depth; it is advisable to set only one of the two and leave the other set to None.
  • max_depth: An integer between 2 and 12.
  • min_samples_leaf: An integer between 2 and 300.
  • l2_regularization: A float between 0.0 and 100.0.
  • max_bins: An integer between 32 and 255 (Scikit-learn caps the number of bins at 255).
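
A minimal sketch of an instantiation within these ranges (the values are illustrative; note that only one of max_leaf_nodes and max_depth is set, as advised above):

from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor(learning_rate=0.05,
                                    max_iter=500,
                                    max_leaf_nodes=64,
                                    max_depth=None,      # left to None since max_leaf_nodes is set
                                    min_samples_leaf=20,
                                    l2_regularization=1.0,
                                    max_bins=255,
                                    random_state=0)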

Even if Scikit-learn’s HistGradientBoosting is nothing too different from LightGBM or XGBoost, it does provide a different way to implement GBMs in a competition, and models built by HistGradientBoosting may provide a contribution when ensembling multiple predictions, such as in blending and stacking.

Having reached the end of this section, you should be more familiar with the most common machine learning algorithms (only deep learning solutions have not been discussed) and their most important hyperparameters to tune, which will help you in building an outstanding solution in a Kaggle competition. Knowing the basic optimization strategies, usable algorithms, and their key hyperparameters is just a starting point. In the next section, we will begin an in-depth discussion about how to tune them more optimally using Bayesian optimization.

Alberto Danese

https://www.kaggle.com/albedan

Our second interview of the chapter is with Alberto Danese, Head of Data Science at Nexi, an Italian credit card and digital payments company. A Competitions Grandmaster who joined the platform in 2015, he obtained most of his gold medals as a solo competitor.

What’s your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

I’ve always worked in the Financial Services industry, dealing mostly with structured data, and I do prefer competitions that belong to this category. I enjoy being able to have a practical grasp of what the data is all about and doing some smart feature engineering in order to squeeze every bit of information out of the data.

Technically speaking, I’ve got good experience with classical ML libraries and especially with Gradient Boosting Decision Trees: the most common libraries (XGBoost, LightGBM, CatBoost) are always my first choice.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

I always spend a lot of time just exploring the data and trying to figure out what the problem that the sponsor is actually trying to solve with machine learning is. Different from what newbies usually think about Kaggle, I don’t spend so much time on all the “tweaking” of the specific ML algorithm – and apparently this approach has paid off!

In my daily job, understanding the data is also extremely important, but there are some additional phases that are completely missing in a Kaggle competition. I’ve got to:

  • Define a business problem to be solved with ML (together with colleagues in the business departments)
  • Find the data, sometimes also from external data providers
  • And when the ML part is done, understand how to put it in production and manage the evolutions

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

I enjoyed the TalkingData AdTracking Fraud Detection Challenge, with which I became a Grandmaster. Besides being on an extremely interesting topic (fighting fraud from click-farms), it really pushed me to do efficient feature engineering, as the volumes were huge (more than 100M labeled rows) and cutting on computation times was key in order to test different approaches. It also forced me to understand how to exploit lag/lead features (and other window functions) in the best way, in order to create a sort of time series in an otherwise classical ML problem.

Has Kaggle helped you in your career? If so, how?

Definitely! Being able to achieve great objective and verifiable results is no doubt something that makes a resume stand out. When I was hired by Cerved (a marketing intelligence service company) in 2016, the hiring manager was perfectly aware of what Kaggle was – and having some real-world projects to talk about during an interview is something extremely valuable. For sure Kaggle had an important role in the evolution of my career.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

I think that everyone just starts coding, maybe forking a public kernel and just changing a few lines or parameters. This is perfectly fine at the beginning! But you do have to spend a decent amount of time not coding, but studying the data and understanding the problem.

What mistakes have you made in competitions in the past?

Not sure if it counts as a mistake, but I have often preferred to compete solo: on one hand it’s great as it forces you to handle every single aspect of a competition, and you’re able to manage your time as you wish. But I’ve really enjoyed collaborating with teammates on a couple of competitions as well: I probably should consider teaming up more often, as you can learn a lot from collaborating.

Are there any particular tools or libraries that you would recommend using for data analysis or machine learning?

Besides the usual ones, I’ve always been a great fan of data.table (starting from the R version): I think it’s not getting the credit it deserves! It’s really a great package when you want to deal with huge data on a local machine.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

Understand the problem and the data first: don’t start coding right away!

Bayesian optimization

Leaving behind grid search (feasible only when the space of experiments is limited), the usual choice for the practitioner is to apply random search optimization or try a Bayesian optimization (BO) technique, which requires a more complex setup.

Originally introduced in the paper Practical Bayesian optimization of machine learning algorithms by Snoek, J., Larochelle, H., and Adams, R. P. (http://export.arxiv.org/pdf/1206.2944), the key idea behind Bayesian optimization is that we optimize a proxy function (also called a surrogate function) rather than evaluating the true objective function directly (which is what grid search and random search do). We do this if there are no gradients, if testing the true objective function is costly (if it is not, then we simply go for random search), and if the search space is noisy and complex enough.

Bayesian search balances exploration with exploitation. At the start, it explores randomly, thus training the surrogate function as it goes. Based on that surrogate function, the search exploits its initial approximate knowledge of how the predictor works in order to sample more useful examples and minimize the cost function. As the Bayesian part of the name suggests, we are using priors in order to make smarter decisions about sampling during optimization. This way, we reach a minimization more quickly by limiting the number of evaluations we need to make.

Bayesian optimization uses an acquisition function to tell us how promising an observation will be. In fact, to manage the tradeoff between exploration and exploitation, the algorithm defines an acquisition function that provides a single measure of how useful it would be to try any given point.
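
For intuition, here is a schematic sketch (not the exact formulation used by any specific library) of the popular expected improvement acquisition function, given the surrogate's predicted mean and standard deviation at a set of candidate points:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Expected improvement for a minimization problem.
    mu, sigma: surrogate predictions (mean and std) at the candidate points
    best_so_far: lowest objective value observed so far
    xi: small offset trading off exploration against exploitation
    """
    sigma = np.maximum(sigma, 1e-12)        # avoid division by zero
    improvement = best_so_far - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# The next hyperparameter set to evaluate is the candidate with the
# highest acquisition value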

Usually, Bayesian optimization is powered by Gaussian processes. Gaussian processes perform better when the search space has a smooth and predictable response. An alternative when the search space is more complex is using tree algorithms (for instance, random forests), or a completely different approach called Tree Parzen Estimators or Tree-structured Parzen Estimators (TPEs).

Instead of directly building a model that estimates the success of a set of parameters, thus acting like an oracle, TPEs estimate the parameters of a multivariate distribution that defines the best-performing values of the parameters, based on successive approximations provided by the experiments. In this way, TPEs derive the best set of parameters by sampling them from a probabilistic distribution, and not directly from a machine learning model, as the Gaussian process approach does.

We will discuss each of these approaches, first by examining Scikit-optimize and KerasTuner, both based on Gaussian processes (Scikit-optimize can also use random forests and KerasTuner can use multi-armed bandits), and then Optuna, which is principally based on TPE (though it also offers different strategies: https://optuna.readthedocs.io/en/stable/reference/samplers.html).

Though Bayesian optimization is considered the state of the art for hyperparameter tuning, always keep in mind that for more complex parameter spaces, using Bayesian optimization provides no advantage in terms of time and computation spent over a solution simply found by random search. For instance, in Google Cloud Machine Learning Engine services, the usage of Bayesian optimization is limited to problems involving at most sixteen parameters. For larger numbers of parameters, it resorts to random sampling.

Using Scikit-optimize

Scikit-optimize (skopt) has been developed using the same API as Scikit-learn, as well as making extensive use of NumPy and SciPy functions. In addition, it was created by some of the contributors to the Scikit-learn project, such as Gilles Louppe.

Based on Gaussian process algorithms, the package is well maintained, though sometimes it has to catch up because of improvements on the Scikit-learn, NumPy, or SciPy sides. For instance, at the time of writing, in order to run it properly on Kaggle Notebooks you have to roll back to older versions of these packages, as explained in a GitHub issue (https://github.com/scikit-optimize/scikit-optimize/issues/981).

The package has an intuitive API and it is quite easy to hack it and use its functions in custom optimization strategies. Scikit-optimize is also renowned for its useful graphical representations. In fact, by visualizing the results of an optimization process (using Scikit-optimize’s plot_objective function), you can figure out whether you can re-define the search space for the problem and formulate an explanation of how optimization works for a problem.

In our worked example, we will refer to the work available in the Kaggle Notebooks referenced in this section. Our purpose here is to show you how to quickly handle an optimization problem for a competition such as 30 Days of ML, a recent competition that involved many Kagglers in learning new skills and applying them in a competition lasting 30 days. The goal of this competition is to predict the value of an insurance claim, so it is a regression problem. You can find out more about this initiative and download the data necessary for the example we are going to present (the materials remain available to the public) by visiting https://www.kaggle.com/thirty-days-of-ml.

If you cannot access the data because you have not taken part in the competition previously, you can use this Kaggle Dataset: https://www.kaggle.com/lucamassaron/30-days-of-ml.

The following code will present how to load the data for this problem and then set up a Bayesian optimization process that will improve the performance of a LightGBM model.

We start by loading the packages:

# Importing core libraries
import numpy as np
import pandas as pd
from time import time
import pprint
import joblib
from functools import partial
# Suppressing warnings because of skopt verbosity
import warnings
warnings.filterwarnings("ignore")
# Classifiers
import lightgbm as lgb
# Model selection
from sklearn.model_selection import KFold
# Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
# Skopt functions
from skopt import BayesSearchCV
from skopt.callbacks import DeadlineStopper, DeltaYStopper
from skopt.space import Real, Categorical, Integer

As a next step, we load the data. The data doesn’t need much processing, aside from turning some categorical features with alphabetical letters as levels into ordered numeric ones:

# Loading data 
X = pd.read_csv("../input/30-days-of-ml/train.csv")
X_test = pd.read_csv("../input/30-days-of-ml/test.csv")
# Preparing data as a tabular matrix
y = X.target
X = X.set_index('id').drop('target', axis='columns')
X_test = X_test.set_index('id')
# Dealing with categorical data
categoricals = [item for item in X.columns if 'cat' in item]
cat_values = np.unique(X[categoricals].values)
cat_dict = dict(zip(cat_values, range(len(cat_values))))
X[categoricals] = X[categoricals].replace(cat_dict).astype('category')
X_test[categoricals] = X_test[categoricals].replace(cat_dict).astype('category')

After making the data available, we define a reporting function that can be used by Scikit-optimize for various optimization tasks. The function takes the data and the optimizer as inputs. It can also handle callback functions, which are functions that perform actions such as reporting, early stopping based on having reached a certain threshold of time spent searching or performance not improving (for instance, not seeing improvements for a certain number of iterations), or saving the state of the processing after each optimization iteration:

# Reporting util for different optimizers
def report_perf(optimizer, X, y, title="model", callbacks=None):
    """
    A wrapper for measuring time and performance of optimizers
    optimizer = a sklearn or a skopt optimizer
    X = the training set 
    y = our target
    title = a string label for the experiment
    """
    start = time()
    
    if callbacks is not None:
        optimizer.fit(X, y, callback=callbacks)
    else:
        optimizer.fit(X, y)
        
    d=pd.DataFrame(optimizer.cv_results_)
    best_score = optimizer.best_score_
    best_score_std = d.iloc[optimizer.best_index_].std_test_score
    best_params = optimizer.best_params_
    
    print((title + " took %.2f seconds, candidates checked: %d, best CV score: %.3f" + u" \u00B1" + " %.3f") %
          (time() - start,
           len(optimizer.cv_results_['params']),
           best_score,
           best_score_std))
    print('Best parameters:')
    pprint.pprint(best_params)
    print()
    return best_params

We now have to prepare the scoring function (upon which the evaluation is based), the validation strategy (based on cross-validation), the model, and the search space. For the scoring function, which should be the root mean squared error metric, we follow the Scikit-learn convention whereby a score is always maximized (so when you need to minimize an error metric, you maximize its negative).

The make_scorer wrapper can easily replicate such practices:

# Setting the scoring function
scoring = make_scorer(partial(mean_squared_error, squared=False),
                      greater_is_better=False)
# Setting the validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Setting the basic regressor
reg = lgb.LGBMRegressor(boosting_type='gbdt',
                        metric='rmse',
                        objective='regression',
                        n_jobs=1, 
                        verbose=-1,
                        random_state=0)

Setting the search space requires the use of different functions from Scikit-optimize, such as Real, Integer, or Categorical, each one sampling from a different kind of distribution that you define as a parameter (usually the uniform distribution, but the log-uniform is also used when you are more interested in the scale effect of a parameter than in its exact value):

# Setting the search space
search_spaces = {
     
     # Boosting learning rate
    'learning_rate': Real(0.01, 1.0, 'log-uniform'),
     
     # Number of boosted trees to fit
    'n_estimators': Integer(30, 5000),
     
     # Maximum tree leaves for base learners
    'num_leaves': Integer(2, 512),
    
     # Maximum tree depth for base learners
    'max_depth': Integer(-1, 256),
     # Minimal number of data in one leaf
    'min_child_samples': Integer(1, 256),
     # Max number of bins buckets
    'max_bin': Integer(100, 1000),
     # Subsample ratio of the training instance 
    'subsample': Real(0.01, 1.0, 'uniform'),
     # Frequency of subsample 
    'subsample_freq': Integer(0, 10),
                
     # Subsample ratio of columns
    'colsample_bytree': Real(0.01, 1.0, 'uniform'), 
    
     # Minimum sum of instance weight
    'min_child_weight': Real(0.01, 10.0, 'uniform'),
   
     # L2 regularization
    'reg_lambda': Real(1e-9, 100.0, 'log-uniform'),
         
     # L1 regularization
    'reg_alpha': Real(1e-9, 100.0, 'log-uniform'),
   }

Once you have defined:

  • Your cross-validation strategy
  • Your evaluation metric
  • Your base model
  • Your hyperparameter search space

All that is left is to feed them into your optimization function, BayesSearchCV. Based on the CV scheme provided, this function will look for the hyperparameter values within the search space that optimize your scoring function. You can set the maximum number of iterations to perform, the kind of surrogate function (Gaussian processes (GP) work on most occasions), and the random seed for reproducibility:

# Wrapping everything up into the Bayesian optimizer
opt = BayesSearchCV(estimator=reg,
                    search_spaces=search_spaces,
                    scoring=scoring,
                    cv=kf,
                    n_iter=60,           # max number of trials
                    n_jobs=-1,           # number of jobs
                    iid=False,         
                    # if not iid it optimizes on the cv score
                    return_train_score=False,
                    refit=False,  
                    # Gaussian Processes (GP) 
                    optimizer_kwargs={'base_estimator': 'GP'},
                    # random state for replicability
                    random_state=0)

At this point, you can start the search using the reporting function we defined previously. After a while, the function will return the best parameters for the problem.

# Running the optimizer
# We stop if the gain of the optimization becomes too small
overdone_control = DeltaYStopper(delta=0.0001)
# We impose a time limit (6 hours)
time_limit_control = DeadlineStopper(total_time=60 * 60 * 6)
best_params = report_perf(opt, X, y, 'LightGBM_regression',
                          callbacks=[overdone_control, time_limit_control])

In the example, we limited the operations by specifying a maximum time allowed (6 hours) before stopping and reporting the best results. Since the Bayesian optimization approach blends exploration and exploitation of different combinations of hyperparameters, stopping at any time will always return the best solution found so far (though not necessarily the best one possible). This is because the acquisition function always prioritizes exploring the most promising parts of the search space, based on the estimated performances returned by the surrogate function and their uncertainty intervals.
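If you want to experiment with the acquisition function itself, BayesSearchCV forwards any extra keyword arguments in optimizer_kwargs to the underlying skopt Optimizer. A minimal sketch, assuming you want to try plain Expected Improvement (EI) instead of the default gp_hedge mix (this variant is not part of the original example):

# A sketch: same search as above, but with an explicit acquisition function
opt_ei = BayesSearchCV(estimator=reg,
                       search_spaces=search_spaces,
                       scoring=scoring,
                       cv=kf,
                       n_iter=60,
                       n_jobs=-1,
                       return_train_score=False,
                       refit=False,
                       optimizer_kwargs={'base_estimator': 'GP',
                                         'acq_func': 'EI'},
                       random_state=0)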

Customizing a Bayesian optimization search

The BayesSearchCV function offered by Scikit-optimize is certainly convenient, because it wraps and arranges all the elements of a hyperparameter search by itself, but it also has limitations. For instance, you may find it useful in a competition to:

  • Have more control over each search iteration, for instance by mixing random search and Bayesian search
  • Be able to apply early stopping during training
  • Customize your validation strategy further
  • Stop unpromising experiments early (for instance, by evaluating each cross-validation fold as soon as it is available, instead of waiting until all the folds have been averaged at the end)
  • Create clusters of hyperparameter sets that perform in a similar way (for instance, in order to create multiple models differing only in the hyperparameters used, to be used in a blending ensemble)

Each of these tasks would not be too complex if you could modify the BayesSearchCV internal procedure. Luckily, Scikit-optimize lets you do just this. In fact, behind BayesSearchCV, as well as behind other wrappers from the package, there are specific minimizing functions that you can use as standalone parts of your own search function:

  • gp_minimize: Bayesian optimization using Gaussian processes
  • forest_minimize: Bayesian optimization using random forests or extremely randomized trees
  • gbrt_minimize: Bayesian optimization using gradient boosting
  • dummy_minimize: Just random search

In the following example, we are going to modify the previous search using our own custom search function. The new custom function will accept early stopping during training and it will prune experiments if one of the fold validation results is not a top-performing one.

You can find the next example working in a Kaggle Notebook at https://www.kaggle.com/lucamassaron/hacking-bayesian-optimization.

As in the previous example, we start by importing the necessary packages.

# Importing core libraries
import numpy as np
import pandas as pd
from time import time
import pprint
import joblib
from functools import partial
# Suppressing warnings because of skopt verbosity
import warnings
warnings.filterwarnings("ignore")
# Classifier/Regressor
from xgboost import XGBRegressor
# Model selection
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
# Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
# Skopt functions
from skopt import BayesSearchCV
from skopt.callbacks import DeadlineStopper, DeltaYStopper
from skopt.space import Real, Categorical, Integer
from skopt import gp_minimize, forest_minimize
from skopt import gbrt_minimize, dummy_minimize
# Decorator to convert a list of parameters to named arguments
from skopt.utils import use_named_args 
# Data processing
from sklearn.preprocessing import OrdinalEncoder

In the same way as before, we upload the data from the 30 Days of ML competition:

# Loading data 
X_train = pd.read_csv("../input/30-days-of-ml/train.csv")
X_test = pd.read_csv("../input/30-days-of-ml/test.csv")
# Preparing data as a tabular matrix
y_train = X_train.target
X_train = X_train.set_index('id').drop('target', axis='columns')
X_test = X_test.set_index('id')
# Pointing out categorical features
categoricals = [item for item in X_train.columns if 'cat' in item]
# Dealing with categorical data using OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
X_train[categoricals] = ordinal_encoder.fit_transform(X_train[categoricals])
X_test[categoricals] = ordinal_encoder.transform(X_test[categoricals])

Now we set all the necessary elements for a hyperparameter search, that is, the scoring function, the validation strategy, the search space, and the machine learning model to be optimized. The scoring function and the validation strategy will later become the core elements constituting the objective function, the function the Bayesian optimization will strive to minimize.

# Setting the scoring function
scoring = partial(mean_squared_error, squared=False)
# Setting the cv strategy
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Setting the search space
space = [Real(0.01, 1.0, 'uniform', name='learning_rate'),
         Integer(1, 8, name='max_depth'),
         Real(0.1, 1.0, 'uniform', name='subsample'),
         # Subsample ratio of columns by tree
         Real(0.1, 1.0, 'uniform', name='colsample_bytree'),  
         # L2 regularization
         Real(0, 100., 'uniform', name='reg_lambda'),
         # L1 regularization
         Real(0, 100., 'uniform', name='reg_alpha'),
         # minimum sum of instance weight (hessian)  
         Real(1, 30, 'uniform', name='min_child_weight')
         ]
model = XGBRegressor(n_estimators=10_000, 
                     booster='gbtree', random_state=0)

Notice this time that we have not included the number of estimators (the n_estimators parameter) in the search space. Instead, we set it when instantiating the model and we enter a high value, since we expect to stop the model early based on a validation set.

As a next step, you now need to create the objective function. The objective function should accept only the parameters to be optimized as input and return the resulting score. However, it also needs access to the elements you have just prepared. You could simply refer to them from inside the function, but it is better practice to bring them into the function itself, enclosed in its own scope. This has its advantages: the elements are bound to the objective function and travel along with it (when it is pickled, or when you distribute the search across multiple processes). You can obtain this by writing a make function that takes in the elements and returns the objective function. With this simple structure, your objective function incorporates all the elements, such as the data and the model, and you only need to pass in the parameters to be tested.

Let’s start coding the function. We will stop along the way to discuss some relevant aspects:

# The objective function to be minimized
def make_objective(model, X, y, space, cv, scoring, validation=0.2):
    # This decorator converts your objective function
    # with named arguments into one that accepts a list as argument,
    # while doing the conversion automatically.
    @use_named_args(space)
    def objective(**params):
        model.set_params(**params)
        print("\nTesting: ", params)
        validation_scores = list()
        for k, (train_index, test_index) in enumerate(cv.split(X, y)):
            val_index = list()
            # Carving out a validation set for early stopping
            # from the tail of the training fold
            train_examples = len(train_index)
            train_examples = int(train_examples * (1 - validation))
            train_index, val_index = (train_index[:train_examples],
                                      train_index[train_examples:])

            start_time = time()
            model.fit(X.iloc[train_index, :], y[train_index],
                      early_stopping_rounds=50,
                      eval_set=[(X.iloc[val_index, :], y[val_index])],
                      verbose=0)
            end_time = time()

            rounds = model.best_iteration

            test_preds = model.predict(X.iloc[test_index, :])
            test_score = scoring(y[test_index], test_preds)
            print(f"CV Fold {k+1} rmse:{test_score:0.5f} - {rounds} rounds"
                  f" - it took {end_time-start_time:0.0f} secs")
            validation_scores.append(test_score)

In this first part of the function, you simply create an objective function that performs cross-validation and fits the data using early stopping. We have used an aggressive early stopping strategy to save time, but you could raise the number of patience rounds if you believe that might work better for your problem. Notice that the validation examples are taken sequentially from the end of each training fold (see how train_index and val_index are defined in the code), leaving the out-of-fold examples (test_index, produced by the cross-validation splitter) untouched for the final evaluation. This is important if you want to avoid adaptive overfitting on the data you use for early stopping.

In the next part of the function, before moving on to train and test the remaining cross-validation folds, you analyze the result that the current fold obtained on its out-of-fold set:

            
            if len(history[k]) >= 10:
                threshold = np.percentile(history[k], q=25)
                if test_score > threshold:
                    print("Early stopping for under-performing fold: "
                          f"threshold is {threshold:0.5f}")
                    return np.mean(validation_scores)

            history[k].append(test_score)
        return np.mean(validation_scores)
    return objective

Notice that we are keeping a global dictionary, history, containing the results obtained so far on each fold. Since the cross-validation is reproducible thanks to the random seed, results on the same fold are perfectly comparable across different experiments. If the result of the present fold is sub-par compared to the results previously obtained on the same fold in other experiments (using the 25th percentile of those past scores as the threshold), the idea is to stop and return the average of the folds tested so far. The rationale is that if one fold doesn't present acceptable results, then the whole cross-validation probably won't either. You can therefore just quit and move on to another, more promising, set of parameters. It is a kind of early stopping on cross-validation that should speed up your search and allow you to cover more experiments in less time.

Next, using our make_objective function, we put together all the elements (model, data, search space, validation strategy, and scoring function) into a single function, the objective function. As a result, we now have a function that only takes in the parameters to be optimized and returns a score, based on which the minimization engine of the optimization will decide the next experiments:

objective = make_objective(model,
                           X_train, y_train,
                           space=space,
                           cv=kf,
                           scoring=scoring)

Since we want to control each step of the optimization and save it for later use, we also prepare a callback function that will save a list of the experiments executed and their results, at every iteration of the minimization process. Simply by using these two pieces of information, the minimization engine can be halted at any time, and it can thereafter resume the optimization from the checkpoint:

def onstep(res):
    global counter
    x0 = res.x_iters   # List of input points
    y0 = res.func_vals # Evaluation of input points
    print('Last eval: ', x0[-1], 
          ' - Score ', y0[-1])
    print('Current iter: ', counter, 
          ' - Best Score ', res.fun, 
          ' - Best Args: ', res.x)
    # Saving a checkpoint to disk
    joblib.dump((x0, y0), 'checkpoint.pkl') 
    counter += 1

At this point, we are ready to start. Bayesian optimization needs some starting points to work properly. We create a number of experiments with random search (using the dummy_minimize function) and save their results:

counter = 0
history = {i:list() for i in range(5)}
used_time = 0
gp_round = dummy_minimize(func=objective,
                          dimensions=space,
                          n_calls=30,
                          callback=[onstep],
                          random_state=0)

We can then retrieve the saved experiments and inspect the sequence of hyperparameter sets that the search has tested so far, along with their results. In fact, the tested sets of parameters and their results are contained in the x0 and y0 lists:

x0, y0 = joblib.load('checkpoint.pkl')
print(len(x0))
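If you want to go beyond counting the trials, a minimal sketch (not in the original Notebook) pairs each tested configuration with its score and prints the best ones found so far:

# A sketch: x0 holds the tested parameter lists, y0 the corresponding scores
trials = sorted(zip(y0, x0), key=lambda pair: pair[0])
for score, params in trials[:5]:
    print(f"rmse: {score:0.5f} -> {params}")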

At this point, we can even resume the Bayesian optimization with some changes in the search space, the acquisition function, the number of calls, or the callbacks:

x0, y0 = joblib.load('checkpoint.pkl')
gp_round = gp_minimize(func=objective,
                       x0=x0,    # already examined values for x
                       y0=y0,    # observed values for x0
                       dimensions=space,
                       acq_func='gp_hedge',
                       n_calls=30,
                       n_initial_points=0,
                       callback=[onstep],
                       random_state=0)

Once we are satisfied that we don’t need to continue calling the optimization function, we can print both the best score obtained (based on our inputs and validation scheme) and the set of best hyperparameters:

x0, y0 = joblib.load('checkpoint.pkl')
print(f"Best score: {gp_round.fun:0.5f}")
print("Best hyperparameters:")
for sp, x in zip(gp_round.space, gp_round.x):
    print(f"{sp.name:25} : {x}")

Based on the best result, we can re-train our model for use in the competition.
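A minimal sketch of that final step (not part of the original Notebook): we refit using the best hyperparameters found by the search, keeping a small validation split for early stopping since n_estimators was deliberately left high:

# A sketch: refitting XGBoost with the best hyperparameters found
best_config = {param.name: value for param, value in zip(space, gp_round.x)}
x_tr, x_val, y_tr, y_val = train_test_split(X_train, y_train,
                                             test_size=0.2, random_state=0)
final_model = XGBRegressor(n_estimators=10_000, booster='gbtree',
                           random_state=0, **best_config)
final_model.fit(x_tr, y_tr, early_stopping_rounds=50,
                eval_set=[(x_val, y_val)], verbose=0)
test_preds = final_model.predict(X_test)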

Now that we have the set of parameters and their results (the x0 and y0 lists), we could also explore the different results and cluster together the ones that are similar in output but different in the parameters used (see the sketch after the following note). This will help us train a more diverse set of models with similar performances but different optimization strategies: the ideal situation for blending, that is, averaging multiple models in order to lower the variance of the estimates and obtain a better public and private leaderboard score.

Refer to Chapter 9, Ensembling with Blending and Stacking Solutions, for a discussion on blending.
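As a rough sketch of the clustering idea mentioned above (not from the original Notebook; the number of trials kept and the number of clusters are arbitrary choices), you could standardize the best tested configurations and group them with k-means:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# A sketch: keep the 20 best trials and cluster their hyperparameter sets
results = sorted(zip(y0, x0), key=lambda pair: pair[0])[:20]
scores = [score for score, params in results]
configs = np.array([params for score, params in results], dtype=float)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(
    StandardScaler().fit_transform(configs))
for label, score, params in zip(labels, scores, configs):
    print(f"cluster {label} - rmse {score:0.5f} - {params}")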

Extending Bayesian optimization to neural architecture search

Moving on to deep learning, neural networks also seem to have quite a few hyperparameters to fix:

  • Batch size
  • Learning rate
  • The kind of optimizer and its internal parameters

All these parameters influence how the network learns and they can make a big impact; just a slight difference in batch size or learning rate can determine whether a network can reduce its error beyond a certain threshold or not.

That being said, these learning parameters are not the only ones that you can optimize when working with deep neural networks (DNNs). How the network is organized in layers and the details of its architecture can make even more of a difference.

In fact, technically speaking, an architecture implies the representational capacity of the deep neural network, which means that, depending on the layers you use, the network will either be able to read and process all the information available in the data, or it will not. While you had a large but limited set of choices with other machine learning algorithms, with DNNs your choices seem unlimited, because the only apparent limit is your knowledge and experience in handling parts of neural networks and putting them together.

The common practices that strong deep learning practitioners rely on when assembling well-performing DNNs are mainly:

  • Relying on pre-trained models (so you have to be very knowledgeable about the solutions available, such as those found on Hugging Face (https://huggingface.co/models) or on GitHub)
  • Reading cutting-edge papers
  • Copying top Kaggle Notebooks from the same competition or previous ones
  • Trial and error
  • Ingenuity and luck

In a famous lesson, Professor Geoffrey Hinton states that you can achieve similar, and often better, results using automated methods such as Bayesian optimization. Bayesian optimization also keeps you from getting stuck when you cannot figure out the best combination of hyperparameters among the many possible ones.

For a recording of Prof. Geoffrey Hinton’s lesson, see https://www.youtube.com/watch?v=i0cKa0di_lo.

For the slides, see https://www.cs.toronto.edu/~hinton/coursera/lecture16/lec16.pdf.

As we mentioned before, even in the most sophisticated AutoML systems, when you have too many hyperparameters, random optimization may produce better results, or the same results, in the same amount of time as Bayesian optimization. In addition, here you also have to contend with an optimization landscape full of sharp turns and discontinuities; in DNN optimization, many of your parameters are not continuous but Boolean, and a single change can unexpectedly improve or worsen the performance of your network.

Our experience tells us that random optimization may not be suitable for a Kaggle competition because:

  • You have limited time and resources
  • You can leverage your previous optimization results in order to find better solutions

Bayesian optimization in this scenario is ideal: you can set it to work based on the time and computational resources you have, and proceed in stages, refining your settings over multiple sessions. Moreover, since tuning DNNs relies on GPUs, you are unlikely to be able to exploit much parallelism unless you have several very powerful machines at hand; by working sequentially, Bayesian optimization needs just one good machine to perform the task. Finally, even if finding optimal architectures by search is hard because of the optimization landscape, Bayesian optimization leverages information from previous experiments, especially at the beginning, and avoids combinations of parameters that clearly won't work. With random optimization, unless you change the search space along the way, all combinations always remain liable to be tested.

There are also drawbacks, however. Bayesian optimization models the hyperparameter space using a surrogate function built from previous trials, which is not an error-free process. It is not a remote possibility that the process ends up concentrating only on a part of the search space while ignoring other parts (which may instead contain the minimum you are looking for). The solution to this is to run a large number of experiments to be safe, or to alternate between random search and Bayesian optimization, challenging the Bayesian model with random trials that can force it to reshape its search model in a more optimal way.
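A minimal sketch of this alternating strategy, reusing the objective and space from the previous example (the batch sizes chosen here are arbitrary):

# A sketch: alternate small batches of random and Bayesian trials,
# accumulating every evaluated point in x0 (parameters) and y0 (scores)
x0, y0 = [], []
for batch in range(3):
    # A handful of purely random trials to challenge the surrogate model
    rnd_round = dummy_minimize(func=objective, dimensions=space,
                               n_calls=5, random_state=batch)
    x0 += rnd_round.x_iters
    y0 += list(rnd_round.func_vals)
    # A Bayesian batch seeded with everything evaluated so far
    gp_round = gp_minimize(func=objective, dimensions=space,
                           x0=x0, y0=y0, n_initial_points=0,
                           n_calls=10, random_state=0)
    x0 = gp_round.x_iters
    y0 = list(gp_round.func_vals)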

For our example, we use again the data from the 30 Days of ML initiative by Kaggle, a regression task. Our example is based on TensorFlow, but with small modifications it can run on other deep learning frameworks such as PyTorch or MXNet.

As before, you can find the example on Kaggle here: https://www.kaggle.com/lucamassaron/hacking-bayesian-optimization-for-dnns.

Let’s begin:

import tensorflow as tf
# shutil and gc are also needed later in this example
# (checkpoint cleanup and memory release)
import shutil
import gc

After importing the TensorFlow package, we leverage its Dataset function to create an iterable capable of feeding our neural network with batches of data:

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe),   
                                             labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
tf.keras.utils.get_custom_objects().update(
    {'leaky-relu': tf.keras.layers.Activation(
        tf.keras.layers.LeakyReLU(alpha=0.2))})

We have also made leaky ReLU activation a custom object for our model; it can be called by a string, and there is no need to directly use the function.

We proceed to code a function that creates our deep neural network model based on a set of hyperparameters:

def create_model(cat0_dim, cat1_dim, cat2_dim,
                 cat3_dim, cat4_dim, cat5_dim, 
                 cat6_dim, cat7_dim, cat8_dim, cat9_dim,
                 layers, layer_1, layer_2, layer_3, layer_4, layer_5, 
                 activation, dropout, batch_normalization, learning_rate, 
                 **others):
    
    dims = {'cat0': cat0_dim, 'cat1': cat1_dim, 'cat2': cat2_dim, 
            'cat3': cat3_dim, 'cat4': cat4_dim, 'cat5': cat5_dim,
            'cat6': cat6_dim, 'cat7': cat7_dim, 'cat8': cat8_dim, 
            'cat9': cat9_dim}
    
    # Each categorical column uses its own vocabulary of encoded values
    vocab = {h: X_train[h].unique().astype(int)
             for h in ['cat0', 'cat1', 'cat2', 'cat3',
                       'cat4', 'cat5', 'cat6', 'cat7',
                       'cat8', 'cat9']}
    
    layers = [layer_1, layer_2, layer_3, layer_4, layer_5][:layers]
    
    feature_columns = list()
    for header in ['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 
                   'cont6','cont7', 'cont8', 'cont9', 'cont10',
                   'cont11', 'cont12', 'cont13']:
         
        feature_columns.append(tf.feature_column.numeric_column(header))
    for header in ['cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 
                   'cat6', 'cat7', 'cat8', 'cat9']:
        feature_columns.append(
            tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_vocabulary_list(
            header, vocabulary_list=vocab[header]),  
            dimension=dims[header]))
    feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
    network_struct = [feature_layer]
    for nodes in layers:
        network_struct.append(
                 tf.keras.layers.Dense(nodes, activation=activation))
        if batch_normalization is True:
                   network_struct.append(
                   tf.keras.layers.BatchNormalization())
        if dropout > 0:
            network_struct.append(tf.keras.layers.Dropout(dropout))
    model = tf.keras.Sequential(network_struct + 
                                [tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(
                          learning_rate=learning_rate),
                  loss= tf.keras.losses.MeanSquaredError(),
                  metrics=['mean_squared_error'])
    
    return model

Internally, the code in the create_model function customizes the neural network architecture based on the inputs provided. For instance, as parameters for the function, you can provide the dimensions of the embeddings for each categorical variable, or define the structure and number of dense layers present in the network. All these parameters are related to the parameter space you want Bayesian optimization to explore, hence every input parameter of the model-creation function should correspond to a sampling function defined in the search space. All you have to do is place the sampling functions in a list, in the same order as the create_model function expects them:

# Setting the search space
    
space = [Integer(1, 2, name='cat0_dim'),
         Integer(1, 2, name='cat1_dim'),
         Integer(1, 2, name='cat2_dim'),
         Integer(1, 3, name='cat3_dim'),
         Integer(1, 3, name='cat4_dim'),
         Integer(1, 3, name='cat5_dim'),
         Integer(1, 4, name='cat6_dim'),
         Integer(1, 4, name='cat7_dim'),
         Integer(1, 6, name='cat8_dim'),
         Integer(1, 8, name='cat9_dim'),
         Integer(1, 5, name='layers'),
         Integer(2, 256, name='layer_1'),
         Integer(2, 256, name='layer_2'),
         Integer(2, 256, name='layer_3'),
         Integer(2, 256, name='layer_4'),
         Integer(2, 256, name='layer_5'),
         Categorical(['relu', 'leaky-relu'], name='activation'),
         Real(0.0, 0.5, 'uniform', name='dropout'),
         Categorical([True, False], name='batch_normalization'),
         Categorical([0.01, 0.005, 0.002, 0.001], name='learning_rate'),
         Integer(256, 1024, name='batch_size')
        ]

As previously illustrated, you now combine all the elements related to the search into an objective function to be created by a function incorporating your basic search elements, such as the data and the cross-validation strategy:

def make_objective(model_fn, X, space, cv, scoring, validation=0.2):
    # This decorator converts your objective function with named arguments
    # into one that accepts a list as argument, while doing the conversion
    # automatically.
    @use_named_args(space) 
    def objective(**params):
        
        print("
Testing: ", params)
        validation_scores = list()
        
        for k, (train_index, test_index) in enumerate(cv.split(X)):
            val_index = list()
            train_examples = len(train_index)
            train_examples = int(train_examples * (1 - validation))
            train_index, val_index = (train_index[:train_examples], 
                                      train_index[train_examples:])
            
            start_time = time()
            
            model = model_fn(**params)
            measure_to_monitor = 'val_mean_squared_error'
            modality='min'
            early_stopping = tf.keras.callbacks.EarlyStopping(
                                 monitor=measure_to_monitor,
                                 mode=modality,
                                 patience=5, 
                                 verbose=0)
            model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
                                   'best.model',
                                   monitor=measure_to_monitor, 
                                   mode=modality, 
                                   save_best_only=True, 
                                   verbose=0)
            run = model.fit(df_to_dataset(
                                X_train.iloc[train_index, :], 
                                batch_size=params['batch_size']),
                            validation_data=df_to_dataset(
                                X_train.iloc[val_index, :], 
                                batch_size=1024),
                            epochs=1_000,
                            callbacks=[model_checkpoint, 
                                       early_stopping],
                            verbose=0)
            
            end_time = time()
            
            rounds = np.argmin(
                     run.history['val_mean_squared_error']) + 1
            
            model = tf.keras.models.load_model('best.model')
            shutil.rmtree('best.model')
            
            test_preds = model.predict(df_to_dataset(
                             X.iloc[test_index, :], shuffle=False,
                             batch_size=1024)).flatten()
            test_score = scoring(X.iloc[test_index, :]['target'],
                                 test_preds)
            print(f"CV Fold {k+1} rmse:{test_score:0.5f} - {rounds} rounds"
                  f" - it took {end_time-start_time:0.0f} secs")
            validation_scores.append(test_score)
            
            if len(history[k]) >= 10:
                threshold = np.percentile(history[k], q=25)
                if test_score > threshold:
                    print(f"Early stopping for under-performing fold: 
                          threshold is {threshold:0.5f}")
                    return np.mean(validation_scores)
                
            history[k].append(test_score)
        return np.mean(validation_scores)
    return objective

The next step is to provide a sequence of random search runs (as a way to start building some feedback from the search space) and gather the results as a starting point. Then, we can feed them into a Bayesian optimization and proceed by using forest_minimize as a surrogate function:

counter = 0
history = {i:list() for i in range(5)}
used_time = 0
gp_round = dummy_minimize(func=objective,
                          dimensions=space,
                          n_calls=10,
                          callback=[onstep],
                          random_state=0)
gc.collect()
x0, y0 = joblib.load('checkpoint.pkl')
gp_round = forest_minimize(func=objective,
                           x0=x0,  # already examined values for x
                           y0=y0,  # observed values for x0
                           dimensions=space,
                           n_calls=30,
                           n_initial_points=0,
                           callback=[onstep],
                           random_state=0)
gc.collect()

Notice that, after the first ten rounds of random search, we proceed with our search using forest_minimize, which relies on a tree ensemble as the surrogate function. This tends to cope better with the discontinuous search spaces typical of DNNs than a Gaussian process, and is also faster to fit.

As before, in this process we have to strive to make the optimization feasible within the time and resources we have (for instance, by setting a low number of n_calls). Hence, we can proceed with batches of search iterations by saving the state of the optimization, checking the results obtained, and deciding thereafter to proceed or conclude the optimization process and not invest more time and energy into looking for a better solution.

Creating lighter and faster models with KerasTuner

If the previous section has puzzled you because of its complexity, KerasTuner offers a fast way to set up an optimization without much hassle. Though it also provides Bayesian optimization based on Gaussian processes, one of the key ideas behind KerasTuner is hyperband optimization. Hyperband uses a bandit approach to figure out the best parameters (see http://web.eecs.umich.edu/~mosharaf/Readings/HyperBand.pdf). This works quite well with neural networks, whose optimization landscape is quite irregular and discontinuous, and thus not always suitable for Gaussian processes.

Keep in mind that you cannot avoid building the function that builds a custom network using input hyperparameters; KerasTuner just makes it much easier to handle.

Let’s start from the beginning. KerasTuner (https://keras.io/keras_tuner/) was announced as a “flexible and efficient hyperparameter tuning for Keras models” by François Chollet, the creator of Keras.

The recipe proposed by Chollet for running KerasTuner is made up of simple steps, starting from your existing Keras model:

  1. Wrap your model in a function with hp as the first parameter.
  2. Define hyperparameters at the beginning of the function.
  3. Replace DNN static values with hyperparameters.
  4. Write the code that models a complex neural network from the given hyperparameters.
  5. If necessary, dynamically define hyperparameters as you build the network.

We’ll now explore how all these steps can work for you in a Kaggle competition by using an example. At the moment, KerasTuner is part of the stack offered by any Kaggle Notebook, hence you don’t need to install it. In addition, the TensorFlow add-ons are part of the Notebook’s pre-installed packages.

If you are not using a Kaggle Notebook and you need to try KerasTuner, you can easily install both using the following commands:

!pip install -U keras-tuner
!pip install -U tensorflow-addons

You can find this example already set up on a Kaggle Notebook here: https://www.kaggle.com/lucamassaron/kerastuner-for-imdb/.

Our first step is to import the necessary packages (creating shortcuts for some commands, such as for pad_sequences) and to upload the data we will be using directly from Keras:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import Activation
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
pad_sequences = keras.preprocessing.sequence.pad_sequences
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10000)
train_data, val_data, train_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.30,
    shuffle=True, random_state=0)

This time, we are using the IMDb dataset, which is available in the Keras package (https://keras.io/api/datasets/imdb/). The dataset has some interesting characteristics:

  • It is a dataset of 25,000 movie reviews from IMDb
  • The reviews are labeled by sentiment (positive/negative)
  • The target classes are balanced (hence accuracy works as a scoring measure)
  • Each review is encoded as a list of word indexes (integers)
  • For convenience, words are indexed by overall frequency

In addition, it has been successfully used in a popular Kaggle competition on word embeddings (https://www.kaggle.com/c/word2vec-nlp-tutorial/overview).

This example involves natural language processing. This type of problem is often solved by using recurrent neural networks (RNNs) based on LSTM or GRU layers. BERT, RoBERTa, and the other transformer-based models often achieve better results – being pre-trained models relying on large language corpora – but this is not necessarily true in all problems, and RNNs can prove a strong baseline to beat or a good addition to an ensemble of neural models. In our example, all words are already numerically indexed. We just add to the existing indices the numeric codes that denote padding (so we can easily normalize all the text to the phrase length), the start of the sentence, an unknown word, and an unused word:

# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
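Before the reviews can be fed to a model expecting a fixed input length, they presumably need to be padded to a common length. A minimal sketch of that step, assuming a length of 256 tokens (the same value used as the pad_length default later on):

# A sketch: normalizing all reviews to 256 tokens, padding shorter ones
# with the <PAD> index and truncating longer ones
pad_length = 256
train_data = pad_sequences(train_data, maxlen=pad_length,
                           value=word_index["<PAD>"], padding='post')
val_data = pad_sequences(val_data, maxlen=pad_length,
                         value=word_index["<PAD>"], padding='post')
test_data = pad_sequences(test_data, maxlen=pad_length,
                          value=word_index["<PAD>"], padding='post')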

The next step involves creating a custom layer for attention. Attention is the foundation of transformer models and it is one of the most innovative ideas in neural NLP of recent times.

For all the details of how these kinds of layers work, see the seminal paper on attention: Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems. 2017 (https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

The idea of attention can be conveyed easily. LSTM and GRU layers output processed sequences, but not all the elements in these sequences are equally important for your predictions. Instead of averaging the output sequence with a pooling layer, you can take a weighted average of its elements (and, during the training phase, learn the correct weights to use). This weighting process (attention) usually improves the results you pass on to the following layers. Of course, you can make this approach even more sophisticated by using multiple attention layers (multi-head attention), but in our example a single layer suffices, because we want to demonstrate that using attention is more effective for this problem than simply averaging or concatenating all the results together:

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Flatten, RepeatVector, dot, multiply, Permute, Lambda
K = keras.backend
def attention(layer):
    # --- Attention is all you need --- #
    _,_,units = layer.shape.as_list()
    attention = Dense(1, activation='tanh')(layer)
    attention = Flatten()(attention)
    attention = Activation('softmax')(attention)
    attention = RepeatVector(units)(attention)
    attention = Permute([2, 1])(attention)
    representation = multiply([layer, attention])
    representation = Lambda(lambda x: K.sum(x, axis=-2), 
                            output_shape=(units,))(representation)
    # ---------------------------------- #
    return representation

As a further variation in our experiments on the architecture of the DNNs for this problem, we also want to test the effectiveness of different kinds of optimizers, such as Rectified Adam (a variant of Adam with a rectified adaptive learning rate; read this post to learn more: https://lessw.medium.com/new-state-of-the-art-ai-optimizer-rectified-adam-radam-5d854730807b) or Stochastic Weight Averaging (SWA). SWA averages the weights traversed during optimization according to a modified learning rate schedule: if your model tends to overfit or overshoot, SWA helps it get near an optimal solution, and it has proven to work especially well on NLP problems.

def get_optimizer(option=0, learning_rate=0.001):
    if option==0:
        return tf.keras.optimizers.Adam(learning_rate)
    elif option==1:
        return tf.keras.optimizers.SGD(learning_rate, 
                                       momentum=0.9, nesterov=True)
    elif option==2:
        return tfa.optimizers.RectifiedAdam(learning_rate)
    elif option==3:
        return tfa.optimizers.Lookahead(
                   tf.optimizers.Adam(learning_rate), sync_period=3)
    elif option==4:
        return tfa.optimizers.SWA(tf.optimizers.Adam(learning_rate))
    elif option==5:
        return tfa.optimizers.SWA(
                   tf.keras.optimizers.SGD(learning_rate, 
                                       momentum=0.9, nesterov=True))
    else:
        return tf.keras.optimizers.Adam(learning_rate)

Having defined these two key functions, we now face the most important function to code: the one that provides different neural architectures given the parameters. This time we don't pass in all the parameters we want to connect to the different architectural choices; we only provide the hp parameter, which contains all the tunable parameters and is handled by KerasTuner at run time. Aside from hp, the function inputs fix the size of the vocabulary and the padding length (adding dummy values if the effective length is shorter, or cutting the phrase if it is longer):

layers = keras.layers
models = keras.models
    
def create_tunable_model(hp, vocab_size=10000, pad_length=256):
    # Instantiate model params
    embedding_size = hp.Int('embedding_size', min_value=8, 
                            max_value=512, step=8)
    spatial_dropout = hp.Float('spatial_dropout', min_value=0, 
                               max_value=0.5, step=0.05)
    conv_layers = hp.Int('conv_layers', min_value=1,
                         max_value=5, step=1)
    rnn_layers = hp.Int('rnn_layers', min_value=1,
                        max_value=5, step=1)
    dense_layers = hp.Int('dense_layers', min_value=1,
                          max_value=3, step=1)
    conv_filters = hp.Int('conv_filters', min_value=32, 
                          max_value=512, step=32)
    conv_kernel = hp.Int('conv_kernel', min_value=1,
                         max_value=8, step=1)
    concat_dropout = hp.Float('concat_dropout', min_value=0, 
                              max_value=0.5, step=0.05)
    dense_dropout = hp.Float('dense_dropout', min_value=0, 
                             max_value=0.5, step=0.05)

In the first part of the function, we simply recover all the settings from the hp parameter. We also make explicit the range of the search space for each of them. Contrary to the solutions we’ve seen so far, this part of the work is done inside the model function, not outside.

The function continues by defining the different layers using the parameters extracted from hp. In some cases, a parameter will switch on or off a part of the network performing certain data processing. For instance, in the code we inserted a branch of the graph (conv_filters and conv_kernel) that processes the sequence of words using convolutional layers, which, in their 1D form, can also prove useful for NLP problems, since they can catch local sequences of words and meanings that LSTMs may find harder to grasp.

Now we can define the actual model:

    inputs = layers.Input(name='inputs',shape=[pad_length])
    layer  = layers.Embedding(vocab_size, embedding_size, 
                              input_length=pad_length)(inputs)
    layer  = layers.SpatialDropout1D(spatial_dropout)(layer)
    for l in range(conv_layers):
        if l==0:
            conv = layers.Conv1D(filters=conv_filters, 
                       kernel_size=conv_kernel, padding='valid',
                       kernel_initializer='he_uniform')(layer)
        else:
            conv = layers.Conv1D(filters=conv_filters,  
                       kernel_size=conv_kernel, padding='valid', 
                       kernel_initializer='he_uniform')(conv) 
    avg_pool_conv = layers.GlobalAveragePooling1D()(conv)
    max_pool_conv = layers.GlobalMaxPooling1D()(conv)
    representations = list()
    for l in range(rnn_layers):
        
        use_bidirectional = hp.Choice(f'use_bidirectional_{l}',
                                      values=[0, 1])
        use_lstm = hp.Choice(f'use_lstm_{l}', values=[0, 1])
        units = hp.Int(f'units_{l}', min_value=8, max_value=512, step=8)
        if use_lstm == 1:
            rnl = layers.LSTM
        else:
            rnl = layers.GRU
        if use_bidirectional==1:
            layer = layers.Bidirectional(rnl(units, 
                              return_sequences=True))(layer)
        else:
            layer = rnl(units, return_sequences=True)(layer)
        representations.append(attention(layer))
    layer = layers.concatenate(representations + [avg_pool_conv, 
                                                  max_pool_conv])
    layer = layers.Dropout(concat_dropout)(layer)
    for l in range(dense_layers):
        dense_units = hp.Int(f'dense_units_{l}', min_value=8, 
                             max_value=512, step=8)
        layer = layers.Dense(dense_units)(layer)
        layer = layers.LeakyReLU()(layer)
        layer = layers.Dropout(dense_dropout)(layer)
    layer = layers.Dense(1, name='out_layer')(layer)
    outputs = layers.Activation('sigmoid')(layer)
    model = models.Model(inputs=inputs, outputs=outputs)

We start by defining the input layer and transform it with an embedding layer that encodes the sequence values into dense vectors. Some dropout regularization is applied to the process using SpatialDropout1D, a function that randomly drops entire columns of the output matrix (standard dropout drops random single elements instead). After these initial phases, we split the network into one pipeline based on convolutions (Conv1D) and another based on recurrent layers (GRU or LSTM). It is after the recurrent layers that we apply the attention layer. Finally, the outputs of these two pipelines are concatenated and, after a few more dense layers, they arrive at the final output node, a sigmoid, since we have to represent a probability bounded in the range 0 to 1.

After the model definition, we set the learning parameters and compile the model before returning it:

    hp_learning_rate = hp.Choice('learning_rate', 
                                 values=[0.002, 0.001, 0.0005])
    optimizer_type = hp.Choice('optimizer', values=list(range(6)))
    optimizer = get_optimizer(option=optimizer_type,  
                              learning_rate=hp_learning_rate)
    
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['acc'])
    
    return model

Note that we have built the model using the functional API of Keras, not the sequential one. We would advise you to avoid the sequential one, in fact; it is easier to set up, but severely restricts your potential architectures.

At this point, most of the work is already done. As a suggestion, having worked out many optimizations using KerasTuner ourselves, we prefer to first build a non-parametric model, using all the possible architectural features that we want to test, with the mutually exclusive parts of the network set to the most complex solutions. After we have set up the generative function and our model seems to be working properly, we can, for instance, represent its graph and have it successfully fit some examples as a test. After that, we start inserting the parametric variables into the architecture and setting up the hp parameter definitions.

In our experience, starting with a parametric function immediately will take more time and debugging effort. The idea behind KerasTuner is to let you think of your DNNs as a set of modular circuits and to help you optimize how the data flows inside them.

Now, we import KerasTuner. First, we set the tuner itself, and then we start the search:

import keras_tuner as kt
tuner = kt.BayesianOptimization(hypermodel=create_tunable_model,
                                objective='val_acc',
                                max_trials=100,
                                num_initial_points=3,
                                directory='storage',
                                project_name='imdb',
                                seed=42)
tuner.search(train_data, train_labels, 
             epochs=30,
             batch_size=64, 
             validation_data=(val_data, val_labels),
             shuffle=True,
             verbose=2,
             callbacks = [EarlyStopping('val_acc',
                                        patience=3,
                                        restore_best_weights=True)]
             )

As a tuner, we opt for the Bayesian optimization one, but you can also try the Hyperband tuner (https://keras.io/api/keras_tuner/tuners/hyperband/) and check if it works better for your problem. We provide our model function to the hypermodel parameter. Then, we set the objective using a string or a function, the maximum number of trials (KerasTuner will stop earlier if there is nothing more to be done), and the initial number of random trials – the more the better – in order to inform the Bayesian process. Early stopping is a standard and well-performing practice in modeling DNNs that you absolutely cannot ignore. Finally, but importantly, we set the directory where we want to save our search and a seed number for reproducibility of the optimization steps.

The search phase is run like a standard fit of a Keras model and – this is quite important – it accepts callbacks. Therefore, you can easily add early stopping to your model. In this case, the given epoch number should therefore be considered the maximum number of epochs. You may also want to optimize the batch size, which we haven’t done in our example. This still requires some extra work, but you can get an idea of how to achieve it by reading this GitHub closed issue: https://github.com/keras-team/keras-tuner/issues/122.
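If you would rather try the Hyperband tuner mentioned above, a minimal sketch could look like the following (max_epochs and factor are illustrative values, not taken from the original Notebook):

# A sketch: swapping in the Hyperband tuner; the search call is unchanged
hb_tuner = kt.Hyperband(hypermodel=create_tunable_model,
                        objective='val_acc',
                        max_epochs=30,   # upper bound of epochs per trial
                        factor=3,        # downsampling rate between brackets
                        directory='storage_hb',
                        project_name='imdb_hb',
                        seed=42)
hb_tuner.search(train_data, train_labels,
                batch_size=64,
                validation_data=(val_data, val_labels),
                callbacks=[EarlyStopping('val_acc', patience=3,
                                         restore_best_weights=True)])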

After the optimization is complete, you can extract the best parameters and save the best model without any need to retrain it:

best_hps = tuner.get_best_hyperparameters()[0]
print(best_hps.values)
# Rebuilding the architecture from the best hyperparameters (still untrained)
model = tuner.hypermodel.build(best_hps)
model.summary()
# Retrieving the best model already trained during the search and saving it
best_model = tuner.get_best_models(num_models=1)[0]
best_model.save("best_model.h5")

In this example, KerasTuner finds a solution that uses:

  • A larger embedding layer
  • Just plain GRU and LSTM layers (no bi-directional layers)
  • Stacking of multiple one-dimensional convolution layers (Conv1D)
  • More and larger dense layers

Interestingly, the solution is not only more effective, but also lighter and faster than our previous attempts based on intuition and experience with the problem.

Chollet himself suggests using KerasTuner not just to make your DNNs perform better but also to shrink them to a more manageable size, something that may make the difference in Code competitions. This allows you to put together more models that work together within the limited inference time provided by the sponsors of the competition.

If you would like to examine more examples of using KerasTuner, François Chollet has also created a series of Notebooks for Kaggle competitions that showcase the workings and functionalities of his tuner.

The TPE approach in Optuna

We complete our overview of Bayesian optimization with another interesting tool and approach to it. As we have discussed, Scikit-optimize uses Gaussian processes (as well as tree algorithms) and it directly models the surrogate function and the acquisition function.

As a reminder of these topics, the surrogate function helps the optimization process to model the potential performance result when you try a set of hyperparameters. The surrogate function is built using the previous experiments and their results; it is just a predictive model applied in order to forecast the behavior of a specific machine learning algorithm on a specific problem. For each parameter input provided to the surrogate function, you get an expected performance output. That’s intuitive and also quite hackable, as we have seen.

The acquisition function instead points out which set of hyperparameters should be tested next, either to improve the surrogate function's ability to predict the performance of the machine learning algorithm, or to actually verify whether a top-performing result can be reached where the surrogate forecasts one. These two objectives represent the exploration part (where you probe uncertain regions of the search space) and the exploitation part (where you test the combinations predicted to perform best) of a Bayesian optimization process.

Optimizers based on the Tree-structured Parzen Estimator (TPE) tackle the problem differently: they estimate the likelihood of success of the parameter values themselves. In other words, they model the distribution of successful parameter values through successive refinements, assigning a higher probability to more successful value combinations.

In this approach, the set of hyperparameters is divided into good and bad ones by these distributions, which take the role of the surrogate and acquisition functions in Bayesian optimization, since the distributions tell you where to sample to get better performances or explore where there is uncertainty.

To explore the technical details of TPE, we suggest reading Bergstra, J. et al. Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24, 2011 (https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf).

Therefore, TPE can model the search space and simultaneously suggest what the algorithm can try next, by sampling from the adjusted probability distribution of parameters.

For a long time, Hyperopt was the go-to option for those preferring TPE over Bayesian optimization based on Gaussian processes. In October 2018, however, Optuna was released as open source, and it has since become the preferred choice for Kagglers due to its versatility (it works out of the box for neural networks and even for ensembling), speed, and efficiency in finding better solutions compared to previous optimizers.

In this section, we will demonstrate just how easy it is to set up a search, which is called a study in Optuna terminology. All you need to do is write an objective function that takes the parameters to be tested by Optuna as input and returns an evaluation. Validation and other algorithmic aspects can be handled inside the objective function in a straightforward manner, also using references to variables external to the function itself (either global or local ones). Optuna also allows pruning, that is, signaling that a particular experiment is not going well and that Optuna can stop and forget about it. Optuna provides a list of functions that activate this callback (see https://optuna.readthedocs.io/en/stable/reference/integration.html); the algorithm will run everything efficiently for you after that, which will significantly reduce the time needed for the optimization.

All of this is in our next example. We return to optimizing for the 30 Days of ML competition. This time, we are trying to figure out what parameters make XGBoost work for this competition.

You can find the Notebook for this example at https://www.kaggle.com/lucamassaron/optuna-bayesian-optimization.

As a first step, we upload the libraries and the data, as before:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor
import optuna
from optuna.integration import XGBoostPruningCallback
# Loading data 
X_train = pd.read_csv("../input/30-days-of-ml/train.csv").iloc[:100_000, :]
X_test = pd.read_csv("../input/30-days-of-ml/test.csv")
# Preparing data as a tabular matrix
y_train = X_train.target
X_train = X_train.set_index('id').drop('target', axis='columns')
X_test = X_test.set_index('id')
# Pointing out categorical features
categoricals = [item for item in X_train.columns if 'cat' in item]
# Dealing with categorical data using OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
X_train[categoricals] = ordinal_encoder.fit_transform(X_train[categoricals])
X_test[categoricals] = ordinal_encoder.transform(X_test[categoricals])

When using Optuna, you just have to define an objective function containing the model, the cross-validation logic, the evaluation measure, and the search space.

Naturally, for data you can refer to objects outside the function itself, rendering the construction of the function much easier. As in KerasTuner, here you need a special input parameter based on a class from Optuna:

def objective(trial):
    
    params = {
            'learning_rate': trial.suggest_float("learning_rate", 
                                                 0.01, 1.0, log=True),
            'reg_lambda': trial.suggest_loguniform("reg_lambda", 
                                                   1e-9, 100.0),
            'reg_alpha': trial.suggest_loguniform("reg_alpha", 
                                                  1e-9, 100.0),
            'subsample': trial.suggest_float("subsample", 0.1, 1.0),
            'colsample_bytree': trial.suggest_float(
                                      "colsample_bytree", 0.1, 1.0),
            'max_depth': trial.suggest_int("max_depth", 1, 7),
            'min_child_weight': trial.suggest_int("min_child_weight", 
                                                  1, 7),
            'gamma': trial.suggest_float("gamma", 0.1, 1.0, step=0.1)
    }
    model = XGBRegressor(
        random_state=0,
        tree_method="gpu_hist",
        predictor="gpu_predictor",
        n_estimators=10_000,
        **params
    )
    
    model.fit(x, y, early_stopping_rounds=300, 
              eval_set=[(x_val, y_val)], verbose=1000,
              callbacks=[XGBoostPruningCallback(trial, 'validation_0-rmse')])
    preds = model.predict(x_test)
    rmse = mean_squared_error(y_test, preds, squared=False)
    return rmse

In this example, for performance reasons, we won't cross-validate but will use a single fixed split: one portion for training, one for validation (early stopping), and one for testing. We are using a GPU in this example, and we also subset the available data in order to fit the execution of the 100 trials into a reasonable length of time. If you don't want to use a GPU, just remove the tree_method and predictor parameters from the XGBRegressor instantiation. Also notice how we set a callback in the fit method in order to give Optuna feedback on how the model is performing, so the optimizer can stop an underperforming experiment early and give space to other attempts.

x, x_val, y, y_val = train_test_split(X_train, y_train, random_state=0,
                                      test_size=0.2)
x, x_test, y, y_test = train_test_split(x, y, random_state=0, test_size=0.25)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

Another notable aspect is that, when creating the study, you can decide whether to minimize or maximize the objective, depending on your problem (Scikit-optimize, by contrast, works only on minimization problems).
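As a quick illustration of this flexibility, here is a hedged sketch that is not part of the original Notebook: if you had an objective returning a score to maximize, such as AUC, you would create the study with direction="maximize". The names auc_objective and auc_study are hypothetical, and the explicit MedianPruner is an illustrative choice that simply delays the pruning driven by XGBoostPruningCallback for the first few reported steps of each trial:

# Hedged sketch: auc_objective is a hypothetical objective returning a
# score to maximize (e.g., AUC); MedianPruner skips pruning for the first
# five reported steps of each trial.
auc_study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
auc_study.optimize(auc_objective, n_trials=100)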

print(study.best_value)
print(study.best_params)

To complete the run, you just have to print or export the best test performance and the best parameters found by the optimization.
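Once the study is complete, a natural next step, sketched below as an assumption rather than taken from the original Notebook, is to refit the model with study.best_params, reusing the same early stopping arrangement, and to produce predictions for the competition test set:

# Hedged sketch (not in the original Notebook): refit XGBoost with the best
# hyperparameters found by the study and predict on the competition test set.
best_model = XGBRegressor(
    random_state=0,
    tree_method="gpu_hist",
    predictor="gpu_predictor",
    n_estimators=10_000,
    **study.best_params
)
best_model.fit(x, y, early_stopping_rounds=300,
               eval_set=[(x_val, y_val)], verbose=False)
test_preds = best_model.predict(X_test)

If you want a quick sense of which hyperparameters mattered most, optuna.visualization.plot_param_importances(study) produces an interactive chart of their relative importance.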

Ruchi Bhatia

https://www.kaggle.com/ruchi798

As a conclusion to this dense chapter, let’s look at one last interview. This time, we’re speaking to Ruchi Bhatia, a Grandmaster in Datasets and Notebooks. Ruchi is currently a graduate student at Carnegie Mellon University, a Data Scientist at OpenMined, and a Data Science Global Ambassador at Z by HP.

What’s your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

My favorite kinds of competitions are NLP and Analytics competitions. Being multilingual has played a significant role in my main focus and interest: Natural Language Processing.

As for Analytics competitions, I enjoy making sense out of complex data and backing my answers to questions with the support of data! Every competition on Kaggle is novel and requires different techniques. I mainly follow a data-driven approach to algorithm selection and have no set favorites.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

When a new competition is announced, my priority is to understand the problem statement in depth. Sometimes problem statements can be out of our comfort zone or domain, so it’s crucial to ensure we grasp them well before moving on to exploratory data analysis. While performing EDA, my goal is to understand data distribution and focus on getting to know the data at hand. During this, we are likely to come across patterns, and we should make an effort to understand those and form a hypothesis for outliers and exceptional cases.

After this, I spend time understanding the competition metrics. The creation of a leak-free cross-validation strategy is my next step. After this, I choose a baseline model and make my first submission. If the correlation between the local validation and the competition leaderboard is not satisfying, I iterate for as long as needed to understand possible discrepancies and account for them.

Then I move on to improve my modeling approach with time. Apart from this, tweaking parameters and trying new experiments help to gain an understanding of what works best with the data at hand (ensuring that I’m preventing overfitting during the whole process). Finally, in the last few weeks of the competition, I perform model ensembling and check the robustness of my solution.

As for my projects outside of Kaggle, most of my time is spent in data gathering, cleaning, and getting relevant value out of the data.

Has Kaggle helped you in your career? If so, how?

Kaggle has tremendously helped me accelerate my career. Not only did it help me find my passion for data science, but it also motivated me to contribute effectively and stay consistent. It’s the perfect place to try hands-on experiments with an enormous amount of data at our fingertips and showcase our work on a global scale. In addition, our work is easily accessible, so we can reach a broader audience as well.

I have used a majority of my Kaggle work on my portfolio to indicate the diversity of work I have done in my journey thus far. Kaggle competitions aim to solve novel and real-world problems, and I feel employers look for our ability and aptitude to solve such problems. I’ve also curated a broad range of datasets that helped me highlight my acumen in working with raw data. These projects helped me secure multiple job opportunities.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

In my experience, I’ve noticed that many Kagglers get disheartened when their ranking in competitions isn’t what they expected it to be. After weeks and months of hard work, I can see why they might give up early, but winning Kaggle competitions is no easy feat. There are several people of different educational backgrounds and work experience competing, and having the courage to try is all that’s important. We should focus on our individualistic growth and see how far we’ve come in our journey.

Are there any particular tools or libraries that you would recommend using for data analysis or machine learning?

Comprehensive exploratory data analysis combined with relevant visualizations helps us spot data trends and context that can improve our methodology. Since I believe in the power of visualizations, my favorite data science libraries would be Seaborn and TensorBoard: Seaborn for EDA, and TensorBoard for the visualizations needed during the machine learning workflow. I occasionally use Tableau too.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

When people enter a competition, I believe they should prepare themselves for deep diving into the problem statement and researching. Competitions on Kaggle are particularly challenging and help solve real-life problems in many cases. People should have a positive mindset and not get disheartened. Kaggle competitions provide the perfect opportunity to learn and grow!

Summary

In this chapter, we discussed hyperparameter optimization at length as a way to increase your model’s performance and score higher on the leaderboard. We started with the search functionality provided by Scikit-learn, namely grid search and random search, as well as the newer halving algorithms.

Then, we progressed to Bayesian optimization and explored Scikit-optimize, KerasTuner, and finally Optuna. We spent more time discussing the direct modeling of the surrogate function by Gaussian processes and how to hack it, because doing so gives you greater intuition and a more tailored solution. We also recognize that, at the moment, Optuna has become the gold standard among Kagglers, for tabular competitions as well as for deep neural network ones, because of its faster convergence to optimal parameters within the time allowed by a Kaggle Notebook.

However, if you want to stand out from the competition, you should make a point of testing solutions from other optimizers as well.

In the next chapter, we will move on to discuss another way to improve your performance in Kaggle competitions: ensembling models. By discovering the workings of averaging, blending, and stacking, we will illustrate how you can boost your results beyond what you can obtain by tuning hyperparameters alone.

Join our book’s Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:

https://packt.link/KaggleDiscord
