With the possible exception of data munging, evaluating is probably what machine learning scientists spend most of their time doing: staring at lists of numbers and graphs, watching hopefully as their models run, and trying earnestly to make sense of their output. Evaluation is a cyclical process; we run models, evaluate the results, and plug in new parameters, each time hoping for a performance gain. Our work becomes more enjoyable and productive as we increase the efficiency of each evaluation cycle, and there are some tools and techniques that can help us achieve this. This chapter will introduce some of these through the following topics:
Measuring a model's performance is an important machine learning task, and there are many varied parameters and heuristics for doing this. The importance of defining a scoring strategy should not be underestimated, and in Sklearn, there are basically three approaches:
We have seen examples of the estimator score() method, for example, clf.score(). In the case of a linear classifier, the score() method returns the mean accuracy. It is a quick and easy way to gauge an individual estimator's performance. However, this method is usually insufficient in itself for a number of reasons.
Recall that accuracy is the sum of the true positive and true negative cases divided by the number of samples. If we tested a number of patients to see if they had a particular disease, simply predicting that every patient was disease free would likely give us a high accuracy whenever the disease is rare, even though it finds none of the actual cases. Obviously, this is not what we want.
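The problem can be made concrete with a few lines of plain Python. This is only a sketch: the patient counts are hypothetical and chosen purely for illustration.

```python
# Hypothetical screening data: 1 = has the disease, 0 = disease free.
y_true = [1] * 5 + [0] * 95      # 5 sick patients out of 100
y_pred = [0] * 100               # a "model" that always predicts disease free

# Count the correct predictions of each kind.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
print("accuracy:", accuracy)             # 0.95
print("sick patients detected:", tp)     # 0
```

The accuracy looks good, yet not a single sick patient is found.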
A better way to measure performance is by using precision (P) and recall (R). If you remember from the table in Chapter 4, Models – Learning from Information, precision, or positive predictive value, is the proportion of predicted positive instances that are correct, that is, TP/(TP+FP). Recall, or sensitivity, is TP/(TP+FN). The F-measure is defined as 2*R*P/(R+P). None of these measures uses the true negative count, so they say nothing about how well a model handles negative cases.
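These formulas translate directly into code. The helper below is a hypothetical illustration (its name and the example counts are not from Sklearn or from the text), and it assumes the denominators are non-zero.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; no zero-division guard is included.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print("precision: %.3f, recall: %.3f, F-measure: %.3f" % (p, r, f))
```

The function takes no true negative count at all, which is why these measures cannot tell us how the negative cases are handled.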
Rather than use the score method of the estimator, it often makes sense to use specific scoring parameters such as those provided by the cross_val_score function. This has a cv parameter that controls how the data is split. It is usually set as an int, which determines how many consecutive folds the data is divided into; each fold serves once as the test set. This parameter can also be set to an iterable of train and test splits, or an object that can be used as a cross validation generator.
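The three ways of setting cv can be sketched as follows. This assumes a current Sklearn release, where cross_val_score lives in sklearn.model_selection; the synthetic dataset and the choice of LogisticRegression are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)

# 1. An int: that many consecutive folds (stratified for classifiers).
scores_int = cross_val_score(clf, X, y, cv=5)

# 2. A cross validation generator object, here shuffling before splitting.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
scores_obj = cross_val_score(clf, X, y, cv=splitter)

# 3. An iterable of (train_indices, test_indices) pairs.
scores_iter = cross_val_score(clf, X, y, cv=list(splitter.split(X)))

print(scores_int.mean(), scores_obj.mean(), scores_iter.mean())
```

The second and third calls use exactly the same splits, so they should produce the same scores.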
Also important in cross_val_score is the scoring parameter. This is usually set by a string indicating a scoring strategy. For classification, the default is accuracy, and some common values are f1, precision, recall, as well as the micro-averaged, macro-averaged, and weighted versions of these. For regression estimators, the scoring values are mean_absolute_error, mean_squared_error, median_absolute_error, and r2. In current Sklearn releases, the three error-based names carry a neg_ prefix (for example, neg_mean_squared_error), so that a larger score is always better.
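As a rough illustration of the scoring parameter, the sketch below scores one classifier and one regressor. It assumes a current Sklearn release, in which the error-based regression scorers are spelled with a neg_ prefix; the datasets and models are arbitrary choices.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Classification: macro-averaged F1 instead of the default accuracy.
Xc, yc = make_classification(n_samples=200, random_state=0)
f1 = cross_val_score(LogisticRegression(max_iter=1000), Xc, yc,
                     cv=5, scoring="f1_macro")

# Regression: negated mean squared error and R squared.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0,
                         random_state=0)
neg_mse = cross_val_score(LinearRegression(), Xr, yr,
                          cv=5, scoring="neg_mean_squared_error")
r2 = cross_val_score(LinearRegression(), Xr, yr, cv=5, scoring="r2")

print("f1_macro: %.2f" % f1.mean())
print("neg MSE:  %.2f" % neg_mse.mean())   # always <= 0; closer to 0 is better
print("r2:       %.2f" % r2.mean())
```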
The following code estimates the performance of three models on a dataset using 10 consecutive splits. Here, we print out the mean and standard deviation of each score, using several measures, for each of the three models. In a real-world situation, we will probably need to preprocess our data in one or more ways, and it is important to apply these data transformations to our test set as well as the training set. To make this easier, we can use the sklearn.pipeline module. This sequentially applies a list of transforms and a final estimator, and it allows us to assemble several steps that can be cross-validated together. Here, we also use the StandardScaler() class to scale the data. In this example, scaling is applied to the decision tree by wrapping it in a pipeline:
# Older Sklearn releases exposed these helpers in sklearn.cross_validation and
# sklearn.datasets.samples_generator; current releases use the imports below.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_informative=5,
                           n_redundant=0, random_state=42)
le = LabelEncoder()
y = le.fit_transform(y)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.5, random_state=1)

# cross_val_score clones and refits each estimator on every fold,
# so the models do not need to be fitted beforehand.
clf1 = DecisionTreeClassifier(max_depth=2, criterion='gini')
clf2 = SVC(kernel='linear', probability=True, random_state=0)
clf3 = LogisticRegression(C=0.001)  # L2 regularization is the default

# Scale the features inside a pipeline before they reach the decision tree
pipe1 = Pipeline([('sc', StandardScaler()), ('mod', clf1)])
mod_labels = ['Decision Tree', 'SVM', 'Logistic Regression']

print('10 fold cross validation:')
for mod, label in zip([pipe1, clf2, clf3], mod_labels):
    auc_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain,
                                 cv=10, scoring='roc_auc')
    p_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain,
                               cv=10, scoring='precision_macro')
    r_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain,
                               cv=10, scoring='recall_macro')
    f_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain,
                               cv=10, scoring='f1_macro')
    print(label)
    print("auc scores %.2f +/- %.2f" % (auc_scores.mean(), auc_scores.std()))
    print("precision %.2f +/- %.2f" % (p_scores.mean(), p_scores.std()))
    print("recall %.2f +/- %.2f" % (r_scores.mean(), r_scores.std()))
    print("f scores %.2f +/- %.2f" % (f_scores.mean(), f_scores.std()))
On execution, you will see the following output:
There are several variations on these techniques, most commonly what is known as k-fold cross validation. Each round leaves one fold out: the model is trained using k - 1 of the folds as training data, and the remaining fold is then used to compute the performance measure. This is repeated for each of the folds, and the performance is calculated as the average over all the folds. Leave-one-out cross validation is the special case in which k equals the number of samples.
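To make the procedure concrete, here is a minimal NumPy sketch of the k-fold loop. The helper k_fold_indices and the toy "model" (predicting the mean of the training targets) are assumptions made for illustration; this is not how Sklearn implements it internally.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy data and toy model: predict the training mean, score with the MSE.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=20)

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    prediction = y[train_idx].mean()              # "train" on k - 1 folds
    mse = np.mean((y[test_idx] - prediction) ** 2)  # evaluate on the held-out fold
    fold_scores.append(mse)

print("per-fold MSE:", np.round(fold_scores, 3))
print("average MSE: %.3f" % np.mean(fold_scores))
```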
Sklearn implements this using the KFold object (cross_validation.KFold in older releases, model_selection.KFold in current ones). In older releases, the important parameters are a required int, indicating the total number of elements, and an n_folds parameter, defaulting to 3, to indicate the number of folds. Current releases drop the element count and take a single n_splits parameter instead. It also takes optional shuffle and random_state parameters indicating whether to shuffle the data before splitting, and what seed to use for the random number generator. By default, random_state is None, which falls back on the global NumPy random number generator.
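A short sketch of the object in use, assuming a current Sklearn release, where the class is sklearn.model_selection.KFold and the number of folds is given as n_splits. The six-sample array is a toy input.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)   # six toy samples

# Without shuffling, the test folds are consecutive blocks of indices.
plain_tests = []
for train, test in KFold(n_splits=3).split(X):
    print("train:", train, "test:", test)
    plain_tests.append(test.tolist())

# With shuffling, a fixed random_state makes the split reproducible.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_tests = [test.tolist() for _, test in shuffled.split(X)]
print("shuffled test folds:", shuffled_tests)
```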
In the following snippet, we use the LassoCV object. This is a linear model trained with L1 regularization. The optimization function for regularized linear regression, if you remember, includes a constant (alpha) that multiplies the L1 regularization term. The LassoCV object automatically sets this alpha value, and to see how effective this is, we can compare the selected alpha and the score on each of the k-folds:
# Older Sklearn releases used cross_validation.KFold(len(X), 5);
# current releases use model_selection.KFold with n_splits and split().
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold

X, y = datasets.make_blobs(n_samples=80, centers=2, random_state=0,
                           cluster_std=2)

# Candidate alpha values for LassoCV to choose from
alphas = np.logspace(-4, -.5, 30)
lasso_cv = linear_model.LassoCV(alphas=alphas)
k_fold = KFold(n_splits=5)

# Refit on each training fold and report the selected alpha and test score
for k, (train, test) in enumerate(k_fold.split(X)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".format(
        k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
The output of the preceding commands is as follows:
Sometimes, it is necessary to preserve the percentages of the classes in each fold. This is done using stratified cross validation. It can be helpful when classes are unbalanced, that is, when there is a larger number of some classes and very few of others. Using the stratified cv object may help correct defects in models that might cause bias because a class is not represented in a fold in large enough numbers to make an accurate prediction. However, this may also cause an unwanted increase in variance.
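The effect of stratification is easy to see on a deliberately unbalanced label vector. The sketch below assumes a current Sklearn release (sklearn.model_selection); the labels and the helper positives_per_fold are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical unbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # the features play no role in the split itself

def positives_per_fold(splitter):
    """Count the positive samples that land in each test fold."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)        # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified)   # [2, 2, 2, 2, 2]
```

With plain consecutive folds, four of the five test sets contain no positive case at all, whereas the stratified split keeps the 10 percent positive rate in every fold.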
In the following example, we use stratified cross validation to test how significant the classification score is. This is done by repeating the classification procedure after randomizing the labels. The p value is the proportion of runs in which the score on the randomized labels is at least as good as the classification score obtained initially. This code snippet uses the permutation_test_score function (found in cross_validation in older releases and in model_selection in current ones), which takes the estimator, data, and labels as parameters. Here, we print out the initial test score, the p value, and the score on each permutation:
# Older Sklearn releases imported these from sklearn.cross_validation and
# built the splitter as StratifiedKFold(y, 2).
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import StratifiedKFold, permutation_test_score

X, y = datasets.make_classification(n_samples=100, n_features=5)
n_classes = np.unique(y).size
cls = linear_model.LogisticRegression()
cv = StratifiedKFold(n_splits=2)

# Score the real labels, then rescore on 10 random permutations of them
score, permutation_scores, pvalue = permutation_test_score(
    cls, X, y, scoring="f1", cv=cv, n_permutations=10, n_jobs=1)

print("Classification score %s (pvalue : %s)" % (score, pvalue))
print("Permutation scores %s" % (permutation_scores))
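To see where the p value comes from, it can be recomputed by hand from the permutation scores. This sketch assumes a current Sklearn release and fixes random_state values, which the listing above does not, so its numbers will differ from that listing. The variable manual_p is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(), X, y, scoring="f1",
    cv=StratifiedKFold(n_splits=2), n_permutations=20, random_state=0)

# Count the permutations that scored at least as well as the real labels.
# The +1 terms count the real labelling as one of the runs, so the
# estimate can never be exactly zero.
manual_p = (np.sum(perm_scores >= score) + 1.0) / (len(perm_scores) + 1.0)

print("score: %.3f" % score)
print("p value from Sklearn: %.4f, recomputed: %.4f" % (pvalue, manual_p))
```

A small p value suggests that the classifier has found a real dependency between the features and the labels rather than scoring well by chance.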