How to use cross-validation for model selection

When several candidate models (that is, algorithms) are available for your use case, the act of choosing one of them is called the model selection problem. Model selection aims to identify the model that will produce the lowest prediction error given new data.
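
As an illustrative preview of how cross-validation supports this comparison, the following minimal sketch uses scikit-learn; the synthetic dataset and the two candidate models are placeholder assumptions, not prescriptions:

# Sketch: compare candidate models by their cross-validated accuracy.
# The dataset and candidates are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Estimate each candidate's out-of-sample performance and compare.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})')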

An unbiased estimate of this generalization error requires testing on data that was not used during model training. Hence, we use only part of the available data to train the model and set aside another part to test it. To keep the estimate unbiased, no information about the test set may leak into the training process.
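
A minimal sketch of such a split, assuming scikit-learn and a synthetic placeholder dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the data; the test set is touched only once, for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score on data the model has never seen to estimate the generalization error.
print(f'test accuracy: {model.score(X_test, y_test):.3f}')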

Several methods can be used to split the available data. They differ in the amount of data used for training, the variance of the resulting error estimates, their computational cost, and whether they take structural aspects of the data, such as class proportions or temporal order, into account when splitting.
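
For illustration, the sketch below contrasts three splitting strategies available in scikit-learn; the tiny dataset is a placeholder chosen so the fold indices are easy to read:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# KFold ignores structure and assigns each observation to one of k folds.
# StratifiedKFold preserves the class proportions of y in every fold.
# TimeSeriesSplit respects temporal order: training data always precedes test data.
for splitter in (KFold(n_splits=5), StratifiedKFold(n_splits=5), TimeSeriesSplit(n_splits=5)):
    print(type(splitter).__name__)
    for train_idx, test_idx in splitter.split(X, y):
        print('  train:', train_idx, 'test:', test_idx)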
