K-fold cross-validation is a much better estimator of our model's performance than a single train-test split. Here's how it works:
We will take a number of equal slices of our data (usually 3, 5, or 10) and call this number k. For each of the k slices in turn, we train our model on the other k - 1 slices, test it on the held-out slice, and then average the k resulting scores.
Cross-validation is effectively multiple train-test splits performed on the same dataset. We do this for a few reasons, but mainly because cross-validation is the most honest estimate of our model's out-of-sample (OOS) error.
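To make this concrete, here is a minimal sketch of the procedure, assuming a generic scikit-learn classifier and NumPy arrays X and y (the helper name manual_cv_score is ours for illustration, not part of any library):

import numpy as np
from sklearn.model_selection import KFold

def manual_cv_score(model, X, y, k=5):
    # Run k train-test splits by hand and average the scores
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])  # train on the other k - 1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    return np.mean(scores)  # average the k out-of-sample scores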
To explain this visually, let's look at our mammal brain and body weight example for a second. The following code manually creates a five-fold cross-validation, wherein five different training and test sets are made from the same population:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_table('http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt',
                   sep='\s+', skiprows=33, names=['id', 'brain', 'body'])
df = df[(df.brain < 300) & (df.body < 500)]  # limit points for visibility

nfolds = 5
fig, axes = plt.subplots(1, nfolds, figsize=(14, 4))
for i, (training, validation) in enumerate(KFold(n_splits=nfolds, shuffle=True).split(df)):
    x, y = df.iloc[training]['body'], df.iloc[training]['brain']
    axes[i].plot(x, y, 'ro')  # training points in red
    x, y = df.iloc[validation]['body'], df.iloc[validation]['brain']
    axes[i].plot(x, y, 'bo')  # validation points in blue
plt.tight_layout()
Here, each graph shows the exact same population of mammals, but the dots are colored red if they belong to the training set of that fold and blue if they belong to the testing set. By doing this, we are obtaining five different instances of the same machine learning model in order to see if the performance remains consistent across the folds.
If you stare at the dots long enough, you will notice that each dot appears in a training set exactly four times (k - 1), while the same dot appears in a test set exactly once.
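If you would rather verify this than stare at dots, a quick count does the trick. A small sketch, reusing the df from the snippet above (the fold assignments will differ from the plot because of shuffling, but the counts do not):

import numpy as np
from sklearn.model_selection import KFold

train_counts = np.zeros(len(df), dtype=int)
test_counts = np.zeros(len(df), dtype=int)
for training, validation in KFold(n_splits=5, shuffle=True).split(df):
    train_counts[training] += 1  # this row was used for training in this fold
    test_counts[validation] += 1  # this row was held out in this fold

print(train_counts.min(), train_counts.max())  # both 4: in a training set k - 1 times
print(test_counts.min(), test_counts.max())  # both 1: in a test set exactly once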
This is the defining feature of k-fold cross-validation: every data point is used for both training and validation, so our estimate does not hinge on one lucky (or unlucky) split. Basically, whenever we wish to test a model on a set of data, whether we have just finished tuning parameters or engineering features, k-fold cross-validation is an excellent way to estimate the performance of our model.
Of course, sklearn comes with an easy-to-use cross-validation function, called cross_val_score, which automatically splits up our dataset for us, runs the model on each fold, and gives us a neat and tidy output of results:
# Using a training set and test set is important.
# Just as important is cross-validation. Remember, cross-validation
# uses several different train-test splits and
# averages the results!

## CROSS-VALIDATION

# check CV score for K=1
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores.mean()

0.95999999999
This is a much more reasonable accuracy than our previous score of 1.0. Remember that we are no longer getting 100% accuracy because each fold has a distinct training and test set; KNN has never seen the test points, so it cannot simply match them to themselves.
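Because cross_val_score returns one accuracy per fold, we can also check whether the performance stays consistent across the folds, as the earlier visualization suggested. A quick look, using the scores array from the previous snippet:

print(scores)  # one accuracy value per fold
print(scores.std())  # a small standard deviation means consistent performance across folds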
Let's try cross-validating KNN with K=5 (decreasing our model's complexity, since a larger neighborhood smooths out the decision boundary), as shown:
# check CV score for K=5
import numpy as np

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
scores
np.mean(scores)

0.97333333
Even better! So, how do we find the best K? The best K is the one that maximizes our cross-validated accuracy. Let's try a few values:
# search for an optimal value of K
k_range = range(1, 30, 2)  # [1, 3, 5, 7, ..., 27, 29]
errors = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)  # instantiate a KNN with k neighbors
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')  # get our five accuracy scores
    accuracy = np.mean(scores)  # average them together
    error = 1 - accuracy  # get our error, which is 1 minus the accuracy
    errors.append(error)  # keep track of a list of errors
We now have an error value (1 - accuracy) for each value of K (1, 3, 5, 7, ..., 29):
# plot the K values (x-axis) versus the 5-fold CV error (y-axis)
plt.figure()
plt.plot(k_range, errors)
plt.xlabel('K')
plt.ylabel('Error')
Compare this graph to the previous graph of model complexity and bias/variance, keeping in mind that for KNN, smaller values of K mean a more complex model. Toward the left, the model is overly complex, the high variance kicks in, and the error is inflated by overfitting. As K increases, the model gets simpler and the error term begins to go down, but after a while the model becomes too simple, the higher bias takes over, and the error goes back up as the model underfits.
It seems that the optimal value of K is between 6 and 10.
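Rather than eyeballing the plot, we can also read the winner off programmatically. A one-line sketch, assuming the k_range and errors values from the loop above:

best_k = list(k_range)[np.argmin(errors)]  # the K with the lowest cross-validated error
print(best_k)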