Selecting the best model

When a model is under-performing, it is often not clear how to make it better. Throughout this book, I have offered rules of thumb, such as how to select the number of layers in a neural network. Even worse, the answer is often counter-intuitive! For example, adding another layer to the network might make the results worse, and adding more training data might not change performance at all.

You can see why these issues are some of the most important aspects of machine learning. At the end of the day, the ability to determine what steps will or will not improve our model is what separates the successful machine learning practitioner from all others.

Let's have a look at a specific example. Remember Chapter 5, Using Decision Trees to Make a Medical Diagnosis, where we used decision trees in a regression task? We were fitting two different trees to a sin function—one with depth 2 and one with depth 5. As a reminder, the regression result looked like this:

It should be clear that neither of these fits is particularly good. However, the two decision trees fail in two different ways!

The decision tree with depth 2 (thick line in the preceding screenshot) attempts to fit four straight lines through the data. Because the data is intrinsically more complicated than a few straight lines, this model fails. We could train it as much as we wanted, on as many training samples as we could generate—it would never be able to describe this dataset well. Such a model is said to underfit the data. In other words, the model does not have enough complexity to account for all the features in the data. Hence, the model has a high bias.

The other decision tree (thin line, depth 5) makes a different mistake. This model has enough flexibility to account almost perfectly for the fine structures in the data. However, at some points, the model seems to follow the particular pattern of the noise we added to the sin function rather than the sin function itself. You can see this on the right-hand side of the graph, where the blue curve (thin line) jitters a lot. Such a model is said to overfit the data. In other words, the model is so complex that it ends up accounting for random errors in the data. Hence, the model has a high variance.
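If you would like to recreate a pair of fits like this, the following is a minimal sketch that trains two trees of depth 2 and depth 5 on a noisy sine wave using scikit-learn's DecisionTreeRegressor. The data, noise level, and random seed here are my own choices for illustration, not the exact values from Chapter 5:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeRegressor

    # Generate a noisy sine wave (illustrative data only)
    rng = np.random.RandomState(42)
    X = np.sort(5 * rng.rand(100, 1), axis=0)
    y = np.sin(X).ravel()
    y[::5] += 0.5 * rng.randn(20)  # add noise to every fifth sample

    # Fit one shallow (high-bias) and one deep (high-variance) tree
    shallow_tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    deep_tree = DecisionTreeRegressor(max_depth=5).fit(X, y)

    # Predict on a fine grid to visualize the two step functions
    X_grid = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
    plt.scatter(X, y, c='k', s=20, label='data')
    plt.plot(X_grid, shallow_tree.predict(X_grid), linewidth=4,
             label='max_depth=2 (underfits)')
    plt.plot(X_grid, deep_tree.predict(X_grid), linewidth=1,
             label='max_depth=5 (overfits)')
    plt.legend()
    plt.show()

The shallow tree produces a coarse step function (the thick line), while the deep tree chases individual noisy points (the thin line).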

Long story short—here's the secret sauce: fundamentally, selecting the right model comes down to finding a sweet spot in the trade-off between bias and variance.

The amount of flexibility a model has (also known as the model complexity) is mostly dictated by its hyperparameters. That is why it is so important to tune them!

Let's return to the kNN algorithm and the Iris dataset. If we repeated the procedure of fitting the model to the Iris data for all possible values of k and calculated both training and test scores, we would expect the result to look something like the following:

The preceding image shows model score as a function of model complexity. If there is one thing I would want you to remember from this chapter, it would be this diagram. Let's unpack it.

The diagram describes the model score (either training or test score) as a function of model complexity. As indicated in the preceding diagram, the model complexity of a neural network roughly grows with the number of neurons in the network. In the case of kNN, the opposite logic applies: the larger the value of k, the smoother the decision boundary, and hence, the lower the complexity. In other words, kNN with k=1 would sit all the way to the right in the preceding diagram, where the training score is perfect. No wonder we got 100% accuracy on the training set!
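If you would like to generate these curves yourself, the following is a minimal sketch that sweeps over k for a KNeighborsClassifier on the Iris data. Note that I am using scikit-learn's k-NN here to keep the sketch short, and the split ratio, random seed, and range of k values are arbitrary choices for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Load Iris and hold out a test set (split and seed are arbitrary)
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Record training and test scores for a range of k values
    train_scores, test_scores = [], []
    k_values = range(1, 31)
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        train_scores.append(knn.score(X_train, y_train))
        test_scores.append(knn.score(X_test, y_test))

Plotting train_scores and test_scores against k (with the x axis reversed, since a smaller k means a higher model complexity) should reproduce the shape of the preceding diagram: a perfect training score at k=1 and a test score that peaks at some intermediate value of k.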

From the preceding diagram, we can gather that there are three regimes in the model complexity landscape:

  • Very low model complexity (a high-bias model) underfits the training data. In this regime, the model achieves only low scores on both the training and test sets, no matter how long we train it for.
  • A model with very high complexity (a high-variance model) overfits the training data, which means that the model can predict the training data very well but fails on unseen data. In this regime, the model has started to learn intricacies or peculiarities that only appear in the training data. Since these peculiarities do not apply to unseen data, the test score gets lower and lower.
  • For some intermediate complexity, the test score is maximal. It is this intermediate regime that we are trying to find: the sweet spot in the trade-off between bias and variance!

This means that we can find the best algorithm for the task at hand by mapping out the model complexity landscape. Specifically, we can use the following indicators to know which regime we are currently in (a small code sketch after the list shows one way to encode them):

  • If both training and test scores are below our expectations, we are probably in the leftmost regime in the preceding diagram, where the model is underfitting the data. In this case, a good idea might be to increase the model complexity and try again.
  • If the training score is much higher than the test score, we are probably in the rightmost regime in the preceding diagram, where the model is overfitting the data. In this case, a good idea might be to decrease the model complexity and try again.
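As a purely illustrative aside, these two checks are easy to encode in a few lines. The helper below, its name, and its thresholds are all inventions of mine for the sake of the example; they are not part of any library:

    def diagnose_model(train_score, test_score, gap=0.1, target=0.9):
        # Rough heuristic for reading the model complexity landscape.
        # gap and target are arbitrary thresholds: the largest train/test
        # difference we tolerate, and the score we consider good enough.
        if train_score < target and test_score < target:
            return 'underfitting: try increasing model complexity'
        if train_score - test_score > gap:
            return 'overfitting: try decreasing model complexity'
        return 'reasonable bias-variance trade-off'

    # For example, reusing the kNN scores computed earlier:
    # print(diagnose_model(train_scores[0], test_scores[0]))  # k=1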

Although this procedure works in general, there are more sophisticated strategies for model evaluation that are more thorough than a simple train-test split, and we will talk about them in the following sections.
