K-fold cross-validation

K-fold cross-validation generally gives a more reliable estimate of our model's performance than a single train-test split. Here's how it works:

  1. We will take a finite number of equal slices of our data (usually 3, 5, or 10). Assume that this number is called k.
  2. For each "fold" of the cross-validation, we will treat k-1 of the sections as the training set, and the remaining section as our test set.
  3. For the remaining folds, a different arrangement of k-1 sections is considered for our training set and a different section is our test set.
  4. We compute a set metric for each fold of the cross-validation.
  5. We average our scores at the end.
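
The five steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not how sklearn implements it: the helper names k_fold_indices and cross_validate, the toy target values, and the "model" (predict the training mean, scored by mean absolute error) are all hypothetical and chosen to keep the example self-contained.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # step 1: shuffle, then slice
    folds = [indices[i::k] for i in range(k)]  # k slices of near-equal size
    for i in range(k):
        test_idx = folds[i]                    # step 2: one slice is the test set
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx              # step 3: rotate the held-out slice


def cross_validate(y, k):
    """Score a 'predict the training mean' baseline with k-fold CV."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        prediction = mean(y[j] for j in train_idx)            # "fit" on k-1 slices
        mae = mean(abs(y[j] - prediction) for j in test_idx)  # step 4: per-fold metric
        fold_scores.append(mae)
    return fold_scores, mean(fold_scores)                     # step 5: average


# Toy target values (hypothetical, for illustration only)
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 7.5]
fold_scores, cv_score = cross_validate(y, k=5)
print(len(fold_scores), cv_score)
```

Any real model can stand in for the mean predictor; the only requirement is that it is fit on the training indices alone and scored on the held-out indices.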

Cross-validation is effectively multiple train-test splits performed on the same dataset. This is done for a few reasons, but mainly because cross-validation gives one of the most honest estimates of our model's out-of-sample (OOS) error.

To explain this visually, let's look at our mammal brain and body weight example for a second. The following code manually creates a five-fold cross-validation, wherein five different training and test sets are made from the same population:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

df = pd.read_table('http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt', sep=r'\s+', skiprows=33, names=['id','brain','body'])
df = df[(df.brain < 300) & (df.body < 500)]
# limit points for visibility

nfolds = 5
fig, axes = plt.subplots(1, nfolds, figsize=(14,4))
# split() yields one (training, validation) pair of row positions per fold
for i, fold in enumerate(KFold(n_splits=nfolds).split(df)):
    training, validation = fold
    x, y = df.iloc[training]['body'], df.iloc[training]['brain']
    axes[i].plot(x, y, 'ro')  # training points in red
    x, y = df.iloc[validation]['body'], df.iloc[validation]['brain']
    axes[i].plot(x, y, 'bo')  # validation points in blue
Five-fold cross-validation: red = training sets, blue = test sets

Here, each graph shows the exact same population of mammals, but the dots are colored red if they belong to the training set of that fold and blue if they belong to the testing set. By doing this, we are obtaining five different instances of the same machine learning model in order to see if performance remains consistent across the folds.

If you stare at the dots long enough, you will note that each dot appears in a training set exactly four times (k – 1), while the same dot appears in a test set exactly once.
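
Rather than staring at the dots, this property can be checked by counting. The following is a minimal sketch in plain Python, assuming a hypothetical dataset of 20 rows split into five folds by simple striding; it does not use sklearn, and the variable names are illustrative.

```python
from collections import Counter

k = 5
n_samples = 20  # hypothetical dataset size
indices = list(range(n_samples))
folds = [indices[i::k] for i in range(k)]  # k disjoint slices

train_counts, test_counts = Counter(), Counter()
for i, test_idx in enumerate(folds):
    # everything outside the held-out slice is training data for this fold
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    train_counts.update(train_idx)
    test_counts.update(test_idx)

# Every point is trained on k - 1 times and tested on exactly once
print(all(train_counts[j] == k - 1 for j in indices))  # True
print(all(test_counts[j] == 1 for j in indices))       # True
```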

Some features of K-fold cross-validation include the following:

  • It is a more accurate estimate of the OOS prediction error than a single train-test split because it takes several train-test splits and averages the results together.
  • It is a more efficient use of data than single train-test splits because the entire dataset is being used for multiple train-test splits instead of just one.
  • Each record in our dataset is used for both training and testing.
  • This method presents a clear tradeoff between the reliability of the estimate and computational expense. A 10-fold CV is roughly 10x more expensive computationally than a single train/test split.
  • This method can be used for parameter tuning and model selection.

Basically, whenever we wish to test a model on a set of data, whether we have just finished tuning some parameters or doing feature engineering, a k-fold cross-validation is an excellent way to estimate the performance of our model.

Of course, sklearn comes with an easier-to-use cross-validation function, called cross_val_score, which automatically splits up our dataset for us, runs the model on each fold, and gives us a neat and tidy output of results:

# Using a training set and test set is so important
# Just as important is cross validation. Remember cross validation
# is using several different train test splits and
# averaging your results!

# check CV score for K=1
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores.mean()  # average accuracy across the five folds

This is a much more reasonable accuracy than our previous score of 1. Remember that we are not getting 100% accuracy anymore because we have distinct training and test sets. KNN has never seen the test points and therefore cannot match them exactly to themselves.

Let's try cross-validating KNN with K=5 (a smoother, less flexible model than K=1), as shown:

# check CV score for K=5
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores.mean()  # average accuracy across the five folds

Even better! So, how do we find the best K? The best K is the one that maximizes our accuracy. Let's try a few:

# search for an optimal value of K
import numpy as np

k_range = range(1, 30, 2) # [1, 3, 5, 7, …, 27, 29]
errors = []
for k in k_range:
    # instantiate a KNN with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    # get our five accuracy scores
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    # average them together
    accuracy = np.mean(scores)
    # get our error, which is 1 minus the accuracy
    error = 1 - accuracy
    # keep track of a list of errors
    errors.append(error)

We now have an error value (1 – accuracy) for each value of K (1, 3, 5, 7, 9, ..., 29):

# plot the K values (x-axis) versus the 5-fold CV error (y-axis)
plt.plot(k_range, errors)

Graph of errors of KNN model against KNN's complexity, represented by the value of K

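Compare this graph to the previous graph of model complexity and bias/variance, keeping in mind that for KNN a smaller K means a more complex model, so complexity decreases from left to right. Toward the left (small K), the model is at its most flexible, and high variance (overfitting) pushes the error up. As K increases, the predictions are averaged over more neighbors and the error term begins to go down, but after a while the model becomes too simple, and high bias (underfitting) kicks in, making the error term go back up.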

It seems that the optimal value of K is between 6 and 10.
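
The minimum can also be read off programmatically instead of by eye. The following is a minimal sketch; the K values and error numbers are made up purely for illustration and are not the results from the plot above. In the running example, the k_range and errors variables would take their place.

```python
# Hypothetical (K, CV error) pairs; these numbers are made up for illustration
k_values = [1, 3, 5, 7, 9, 11]
cv_errors = [0.040, 0.033, 0.027, 0.020, 0.020, 0.027]

lowest = min(cv_errors)
# collect every K that achieves the lowest error, since ties are common
best_ks = [k for k, err in zip(k_values, cv_errors) if err == lowest]
print(best_ks)
```

When several values of K tie, as they do in this made-up data, the choice between them is a judgment call that the CV error alone does not settle.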

