Combining grid search with cross-validation

One potential danger of the grid search we just implemented is that the outcome might be relatively sensitive to how exactly we split the data. After all, we might have accidentally chosen a split that put most of the easy-to-classify data points in the test set, resulting in an overly optimistic score. Although we would be happy at first, as soon as we tried the model on some new held-out data, we would find that the actual performance of the classifier is much lower than expected.

Instead, we can combine grid search with cross-validation. This way, the data is split multiple times into training and validation sets, and cross-validation is performed at every step of the grid search, so that every parameter combination is evaluated on several different splits rather than just one.

The entire process is illustrated in the following diagram:

As is evident from the diagram, the test set is created right from the outset and kept separate from the grid search. The remainder of the data is then passed to the grid search, where it is repeatedly split into training and validation sets, as we did before. Within the grid search box, we perform cross-validation on every possible combination of parameter values to find the best model. The selected parameters are then used to build a final model, which is evaluated on the test set.
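In code, the initial split from the diagram might look something like the following sketch, assuming X and y hold the features and labels from the earlier sections. The variable names X_trainval, y_trainval, X_test, and y_test are the ones used in the remainder of this section, and the random_state value is an arbitrary choice made only for reproducibility:

from sklearn.model_selection import train_test_split

# Set the test set aside once, before any hyperparameter tuning takes place.
# The remaining "trainval" portion is what the grid search will repeatedly
# split into training and validation folds.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, random_state=37)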

Because grid search with cross-validation is such a commonly used method for hyperparameter tuning, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator.

We can specify all of the parameters we want GridSearchCV to search over by using a dictionary. Every entry of the dictionary should be of the form {name: values}, where name is a string matching the parameter name that would normally be passed to the classifier, and values is a list of values to try.

For example, to search for the best value of the n_neighbors parameter of the KNeighborsClassifier class, we would design the parameter dictionary as follows:

In [12]: param_grid = {'n_neighbors': range(1, 20)}

Here, we are searching for the best k in the range [1, 19] (remember that range(1, 20) does not include the upper bound of 20).
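The same dictionary format extends to several hyperparameters at once, in which case the grid search tries every combination. As an illustrative sketch (the name param_grid_multi is ours, and weights is simply another valid KNeighborsClassifier parameter that we do not actually tune in this chapter):

# Hypothetical multi-parameter grid: every combination of n_neighbors and
# weights would be evaluated, 19 x 2 = 38 candidate models in total.
param_grid_multi = {'n_neighbors': range(1, 20),
                    'weights': ['uniform', 'distance']}

For the remainder of this section, we stick with the single-parameter grid defined in In [12].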

We then need to pass the parameter grid as well as the classifier (KNeighborsClassifier) to the GridSearchCV object:

In [13]: from sklearn.neighbors import KNeighborsClassifier
...      from sklearn.model_selection import GridSearchCV
...      grid_search = GridSearchCV(KNeighborsClassifier(), param_grid,
...                                 cv=5)

Then, we can train the classifier using the fit method. In return, scikit-learn will inform us about all of the parameters used in the grid search:

In [14]: grid_search.fit(X_trainval, y_trainval)
Out[14]: GridSearchCV(cv=5, error_score='raise',
                      estimator=KNeighborsClassifier(algorithm='auto',
                                                     leaf_size=30,
                                                     metric='minkowski',
                                                     metric_params=None,
                                                     n_jobs=1, n_neighbors=5,
                                                     p=2, weights='uniform'),
                      fit_params={}, iid=True, n_jobs=1,
                      param_grid={'n_neighbors': range(1, 20)},
                      pre_dispatch='2*n_jobs', refit=True,
                      return_train_score=True, scoring=None, verbose=0)

This will allow us to find the best validation score and the corresponding value for k:

In [15]: grid_search.best_score_, grid_search.best_params_
Out[15]: (0.9642857142857143, {'n_neighbors': 3})

We hence get a validation score of 96.4% for k = 3. Since grid search with cross-validation is more robust than our earlier procedure, we would expect the validation scores to be more realistic than the 100% accuracy we found before.
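If we want to look beyond the single best candidate, the fitted GridSearchCV object also exposes a cv_results_ dictionary containing the validation scores of every parameter setting. A minimal sketch for inspecting it, assuming pandas is available, could look like this:

import pandas as pd

# Each row of cv_results_ corresponds to one candidate parameter setting;
# mean_test_score is its mean validation accuracy across the five folds.
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_n_neighbors', 'mean_test_score', 'std_test_score']])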

However, from the previous section, we know that this score might still be overly optimistic, so we need to score the classifier on the test set instead:

In [16]: grid_search.score(X_test, y_test)
Out[16]: 0.97368421052631582

And to our surprise, the test score is even better.
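
Because refit=True by default (as shown in the Out[14] output), GridSearchCV retrains a model with the best parameters on all of X_trainval once the search finishes, and it is this refitted model that grid_search.score and grid_search.predict delegate to. If we want to work with it directly, a sketch might look like this:

# The refitted model (a KNeighborsClassifier with n_neighbors=3 in our case)
# is available as best_estimator_ and can be used like any other classifier.
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test)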
