K-fold cross-validation is a much better estimator of our model's performance than a single train-test split. Here's how it works:
We will take a number of equal slices of our data (usually 3, 5, or 10) and call this number k. For each of the k slices in turn, we train our model on the other k - 1 slices, test it on the held-out slice, and then average the k resulting scores.
Cross-validation is effectively multiple train-test splits performed on the same dataset. We do this for a few reasons, but mainly because cross-validation is the most honest estimate of our model's out-of-sample (OOS) error.
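To make this concrete, here is a minimal sketch of the procedure, assuming a generic scikit-learn classifier and NumPy arrays X and y (the helper name manual_cv_score is ours for illustration, not part of any library):

import numpy as np
from sklearn.model_selection import KFold

def manual_cv_score(model, X, y, k=5):
    # Run k train-test splits by hand and average the scores
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])  # train on the other k - 1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    return np.mean(scores)  # average the k out-of-sample scores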
To explain this visually, let's look at our mammal brain and body weight example for a second. The following code manually creates a five-fold cross-validation, wherein five different training and test sets are made from the same population:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_table('http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt',
                   sep='\s+', skiprows=33, names=['id', 'brain', 'body'])
df = df[(df.brain < 300) & (df.body < 500)]  # limit points for visibility

nfolds = 5
fig, axes = plt.subplots(1, nfolds, figsize=(14, 4))
for i, (training, validation) in enumerate(KFold(n_splits=nfolds, shuffle=True).split(df)):
    x, y = df.iloc[training]['body'], df.iloc[training]['brain']
    axes[i].plot(x, y, 'ro')  # training points in red
    x, y = df.iloc[validation]['body'], df.iloc[validation]['brain']
    axes[i].plot(x, y, 'bo')  # validation points in blue
plt.tight_layout()
Here, each graph shows the exact same population of mammals, but the dots are colored red if they belong to the training set of that fold and blue if they belong to the testing set. By doing this, we are obtaining five different instances of the same machine learning model in order to see if the performance remains consistent across the folds.
If you stare at the dots long enough, you will notice that each dot appears in a training set exactly four times (k - 1), while the same dot appears in a test set exactly once.
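If you would rather verify this than stare at dots, a quick count does the trick. A small sketch, reusing the df from the snippet above (the fold assignments will differ from the plot because of shuffling, but the counts do not):

import numpy as np
from sklearn.model_selection import KFold

train_counts = np.zeros(len(df), dtype=int)
test_counts = np.zeros(len(df), dtype=int)
for training, validation in KFold(n_splits=5, shuffle=True).split(df):
    train_counts[training] += 1  # this row was used for training in this fold
    test_counts[validation] += 1  # this row was held out in this fold

print(train_counts.min(), train_counts.max())  # both 4: in a training set k - 1 times
print(test_counts.min(), test_counts.max())  # both 1: in a test set exactly once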
This is the defining feature of k-fold cross-validation: every data point is used for both training and validation, so our estimate does not hinge on one lucky (or unlucky) split. Basically, whenever we wish to test a model on a set of data, whether we have just finished tuning parameters or engineering features, k-fold cross-validation is an excellent way to estimate the performance of our model.
Of course, sklearn comes with an easy-to-use cross-validation function, called cross_val_score, which automatically splits up our dataset for us, runs the model on each fold, and gives us a neat and tidy output of results:
# Using a training set and test set is important.
# Just as important is cross-validation. Remember, cross-validation
# uses several different train-test splits and
# averages the results!

## CROSS-VALIDATION

# check CV score for K=1
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores.mean()

0.95999999999
This is a much more reasonable accuracy than our previous score of 1.0. Remember that we are no longer getting 100% accuracy because each fold has a distinct training and test set; KNN has never seen the test points, so it cannot simply match them to themselves.
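Because cross_val_score returns one accuracy per fold, we can also check whether the performance stays consistent across the folds, as the earlier visualization suggested. A quick look, using the scores array from the previous snippet:

print(scores)  # one accuracy value per fold
print(scores.std())  # a small standard deviation means consistent performance across folds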
Let's try cross-validating KNN with K=5 (decreasing our model's complexity, since a larger neighborhood smooths out the decision boundary), as shown:
# check CV score for K=5
import numpy as np

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores
np.mean(scores)

0.97333333
Even better! So, how do we find the best K? The best K is the one that maximizes our cross-validated accuracy. Let's try a few values:
# search for an optimal value of K
k_range = range(1, 30, 2)  # [1, 3, 5, 7, ..., 27, 29]
errors = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)  # instantiate a KNN with k neighbors
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')  # get our five accuracy scores
    accuracy = np.mean(scores)  # average them together
    error = 1 - accuracy  # get our error, which is 1 minus the accuracy
    errors.append(error)  # keep track of a list of errors
We now have an error value (1 - accuracy) for each value of K (1, 3, 5, 7, ..., 29):
# plot the K values (x-axis) versus the 5-fold CV error (y-axis)
plt.figure()
plt.plot(k_range, errors)
plt.xlabel('K')
plt.ylabel('Error')
Compare this graph to the previous graph of model complexity and bias/variance, keeping in mind that for KNN, smaller values of K mean a more complex model. Toward the left, the model is overly complex, the high variance kicks in, and the error is inflated by overfitting. As K increases, the model gets simpler and the error term begins to go down, but after a while the model becomes too simple, the higher bias takes over, and the error goes back up as the model underfits.
It seems that the optimal value of K is between 6 and 10.
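Rather than eyeballing the plot, we can also read the winner off programmatically. A one-line sketch, assuming the k_range and errors values from the loop above:

best_k = list(k_range)[np.argmin(errors)]  # the K with the lowest cross-validated error
print(best_k)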