D.2. How fit is fit?

With any machine learning model, one of the major challenges is keeping the model from doing its job too well. How can a model be "too good"? Given enough capacity, an algorithm can become very good at finding patterns in the particular dataset it's trained on. But you already know the label of every example in the training set (or it wouldn't be in the training set), so predicting those labels isn't particularly helpful. The real goal is to use the training examples to build a model that generalizes: one that can correctly label an example that, while similar to members of the training set, lies outside of it. Performance on new examples outside the training set is what you want to maximize.

A model that perfectly describes (and predicts) your training examples is overfit (see figure D.1). Such a model has captured the noise in the training set along with the signal, so it will have little or no power to describe new data. It isn't a general model that you can trust to do well when you give it an example that isn't in your training set.

Figure D.1. Overfit on training samples

Conversely, if your model gets many of the training predictions wrong and also does poorly on new examples, it’s underfit (see figure D.2). Neither of these kinds of models will be useful for making predictions in the real world. So let’s look at techniques to detect these issues and, more importantly, ways to avoid them.

Figure D.2. Underfit on training samples
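
The quickest way to detect both conditions is to hold out some labeled examples the model never trains on and compare training performance with held-out performance. Here is a minimal sketch of that check; the synthetic dataset from make_classification and the DecisionTreeClassifier are illustrative placeholders, not specific to any one model discussed here.

```python
# Sketch: diagnose overfitting/underfitting by comparing training accuracy
# against accuracy on data held out from training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labeled examples stand in for your real training data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 30% of the examples; the model never sees these during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize its training set,
# making it a handy example of a model prone to overfitting
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")

# Near-perfect training accuracy with much lower test accuracy
# suggests overfitting; low accuracy on both suggests underfitting.
```

A large gap between the two scores is the signature of an overfit model; two similarly low scores point to an underfit one.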
