Testing the data

In order to build a successful statistical or machine learning model, we need to follow a simple (but hard!) rule: make it as simple as possible (so it generalizes the phenomenon being modeled well), but not so simple that it loses its ability to predict. A visual example of how this manifests is shown in the following three charts (from http://bit.ly/2GpRybB):

The middle chart shows a good fit: the model line follows the true function well. The model line on the left chart oversimplifies the phenomenon and has virtually no predictive power (apart from a handful of points); this is a textbook example of underfitting. The model line on the right follows the training data almost perfectly, but if new data were presented, it would most likely misrepresent it; this is known as overfitting, that is, the model does not generalize well. As you can see from these three charts, the complexity of the model needs to be just right so that it models the phenomenon well.
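The charts themselves are not reproduced here, but the same effect can be seen numerically. The following is a minimal sketch (not part of the recipe itself) that fits polynomials of increasing degree to noisy samples of a smooth function; the target function, sample size, and degree values (1, 4, and 15) are illustrative choices, assumed to resemble the linked figure:

```python
import numpy as np

rng = np.random.RandomState(42)

def true_function(x):
    return np.cos(1.5 * np.pi * x)

# Training data: 30 noisy samples; test data: a dense, noise-free grid
x_train = np.sort(rng.rand(30))
y_train = true_function(x_train) + rng.normal(scale=0.1, size=30)
x_test = np.linspace(0, 1, 200)

# Degree 1 underfits, degree 4 fits well, degree 15 overfits
# (NumPy may warn that the degree-15 fit is poorly conditioned)
for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_function(x_test)) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The degree-1 model has a high error on both sets (underfitting), the degree-4 model has a low error on both, and the degree-15 model has a near-zero training error but a noticeably worse test error (overfitting).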

Some machine learning models have a tendency to overtrain. For example, any model that tries to find a mapping (a function) between the input data and the dependent variable (or a label) has a tendency to overfit; these include parametric regression models, such as linear or generalized regression models, as well as the recently (again!) popular neural networks (or deep learning models). On the other hand, some decision tree-based models, such as random forests, are less prone to overfitting even as their complexity grows, as the brief sketch below illustrates.
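As a hedged illustration of that claim (the dataset and parameters here are arbitrary choices, not from the recipe), we can compare a single, fully grown decision tree against a random forest built from the same kind of trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem with some uninformative features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [
    ("single tree  ", DecisionTreeClassifier(random_state=42)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.3f}  "
          f"test={model.score(X_test, y_test):.3f}")
```

Both models typically fit the training set perfectly, but the forest's test accuracy is noticeably higher, that is, the gap between training and test performance (the overfitting) is smaller.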

So, how do we get the model just right? There are four rules of thumb:

  • Select your features wisely
  • Do not overtrain, or select a model that is less prone to overfitting
  • Run multiple model estimations with randomly selected data from your dataset
  • Tune hyperparameters

In this recipe, we will focus on the first point; the remaining points will be covered in some of the recipes found in this chapter and the next two chapters.
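As a taste of the first point, here is a generic feature selection sketch (not necessarily the method used later in this recipe): a simple univariate filter that keeps the k features most strongly related to the label. The dataset and the choice of k=3 are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

# Score each feature independently with an F-test and keep the top 3
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

print("F-scores per feature:   ", selector.scores_.round(1))
print("Selected feature indices:", selector.get_support(indices=True))
```

The three informative features receive F-scores that are orders of magnitude larger than the rest, so the filter recovers them; dropping the uninformative features shrinks the hypothesis space and makes overfitting less likely.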
