Model tuning and cross-validation

Having learned which metrics to use to measure a classification model, we'll now study how to measure it properly. We should avoid relying on the classification results from one fixed testing set, as we did in the previous experiments. Instead, we usually apply the k-fold cross-validation technique to assess how a model will generally perform in practice.

In the k-fold cross-validation setting, the original data is first randomly divided into k equal-sized subsets, in which the class proportions are often preserved. Each of these k subsets is then successively retained as the testing set for evaluating the model. During each trial, the remaining k - 1 subsets (excluding the held-out fold) form the training set used to fit the model. Finally, the performance across all k trials is averaged to generate an overall estimate.

Statistically, the averaged performance of k-fold cross-validation is a reliable estimate of how a model performs in general. Given different sets of parameters pertaining to a machine learning model and/or data preprocessing algorithms, or even two or more different models, the goal of model tuning and/or model selection is to pick the set of parameters of a classifier that achieves the best averaged performance. With these concepts in mind, we can now start to tweak our Naïve Bayes classifier, incorporating cross-validation and the AUC of the ROC as the measurement.

In k-fold cross-validation, k is usually set to 3, 5, or 10. If the training set is small, a large k (5 or 10) is recommended to ensure enough training samples in each fold. If the training set is large, a small value (such as 3 or 4) works fine, since a higher k will lead to an even higher computational cost of training on a large dataset.
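To make the averaging mechanics concrete, here is a minimal, self-contained sketch (not part of our spam project) that runs stratified 10-fold cross-validation on scikit-learn's built-in breast cancer dataset and averages the per-fold AUC scores; the dataset and classifier here are only stand-ins for illustration:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import cross_val_score, StratifiedKFold
>>> from sklearn.naive_bayes import MultinomialNB
>>> X_demo, y_demo = load_breast_cancer(return_X_y=True)
>>> cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
>>> # One AUC per fold; the mean is the cross-validated estimate
>>> auc_scores = cross_val_score(MultinomialNB(), X_demo, y_demo,
...                              scoring='roc_auc', cv=cv_demo)
>>> print(auc_scores.mean())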

We herein use the split() method from the StratifiedKFold class of scikit-learn to divide the data into chunks with preserved class fractions:

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> k = 10
>>> k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
>>> cleaned_emails_np = np.array(cleaned_emails)
>>> labels_np = np.array(labels)
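As a quick sanity check (purely illustrative, and assuming the labels are encoded as the integers 0 and 1 as in the preceding sections), we can confirm that each testing fold roughly preserves the overall proportion of the positive class:

>>> # The mean of a 0/1 label array is the fraction of positive samples
>>> print(labels_np.mean())
>>> for _, test_indices in k_fold.split(cleaned_emails_np, labels_np):
...     print(labels_np[test_indices].mean())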

After initializing the 10-fold generator, we choose to explore the following parameters:

  • max_features: This represents the n most frequent terms to use as the feature space
  • alpha: This represents the smoothing factor, that is, the initial count added to each term (see the brief sketch after this list)
  • fit_prior: This represents whether or not to use a prior tailored to the training data
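To see what the smoothing factor actually does, here is a toy illustration with made-up counts (not our email data): alpha adds a pseudo-count to every term, so a term never observed in a class still receives a non-zero likelihood, and larger values pull the estimates toward uniform:

>>> # Toy example: class 0 has term counts [3, 0], so the second term is
>>> # unseen in that class; smoothing keeps its likelihood above zero
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> X_toy = np.array([[3, 0], [0, 2]])
>>> y_toy = np.array([0, 1])
>>> for a in [0.5, 1.0, 4.0]:
...     nb = MultinomialNB(alpha=a).fit(X_toy, y_toy)
...     print(a, np.exp(nb.feature_log_prob_[0]))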

We start with the following options:

>>> max_features_option = [2000, 8000, None]
>>> smoothing_factor_option = [0.5, 1.0, 2.0, 4.0]
>>> fit_prior_option = [True, False]
>>> auc_record = {}

Then, for each fold generated by the split() method of the k_fold object, we repeat the process of term count feature extraction, classifier training, and prediction for each of the aforementioned combinations of parameters, and record the resulting AUCs:

>>> for train_indices, test_indices in k_fold.split(cleaned_emails_np, labels_np):
...     X_train, X_test = (cleaned_emails_np[train_indices],
...                        cleaned_emails_np[test_indices])
...     Y_train, Y_test = labels_np[train_indices], labels_np[test_indices]
...     for max_features in max_features_option:
...         if max_features not in auc_record:
...             auc_record[max_features] = {}
...         cv = CountVectorizer(stop_words="english",
...                              max_features=max_features,
...                              max_df=0.5, min_df=2)
...         term_docs_train = cv.fit_transform(X_train)
...         term_docs_test = cv.transform(X_test)
...         for alpha in smoothing_factor_option:
...             if alpha not in auc_record[max_features]:
...                 auc_record[max_features][alpha] = {}
...             for fit_prior in fit_prior_option:
...                 clf = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
...                 clf.fit(term_docs_train, Y_train)
...                 prediction_prob = clf.predict_proba(term_docs_test)
...                 pos_prob = prediction_prob[:, 1]
...                 auc = roc_auc_score(Y_test, pos_prob)
...                 # Sum the AUC over the k folds; we divide by k when printing
...                 auc_record[max_features][alpha][fit_prior] = (auc +
...                     auc_record[max_features][alpha].get(fit_prior, 0.0))

Finally, we present the results as follows:

>>> print('max features  smoothing  fit prior  auc')
>>> for max_features, max_feature_record in auc_record.items():
...     for smoothing, smoothing_record in max_feature_record.items():
...         for fit_prior, auc in smoothing_record.items():
...             print('    {0}      {1}      {2}    {3:.5f}'.format(
...                 max_features, smoothing, fit_prior, auc/k))
max features  smoothing  fit prior  auc
2000          0.5        False      0.97421
2000          1.0        True       0.97237
2000          1.0        False      0.97238
2000          2.0        True       0.97043
2000          2.0        False      0.97057
2000          4.0        True       0.96853
2000          4.0        False      0.96843
8000          0.5        True       0.98533
8000          0.5        False      0.98530
8000          1.0        True       0.98428
8000          1.0        False      0.98430
8000          2.0        True       0.98338
8000          2.0        False      0.98337
8000          4.0        True       0.98291
8000          4.0        False      0.98296
None          0.5        True       0.98890
None          0.5        False      0.98884
None          1.0        True       0.98899
None          1.0        False      0.98904
None          2.0        True       0.98906
None          2.0        False      0.98915
None          4.0        True       0.98965
None          4.0        False      0.98969

The (None, 4.0, False) combination achieves the best AUC, at 0.98969. Not limiting the maximal number of features outperforms capping it at 2000 or 8000, and with the full feature space the largest smoothing factor we tried, 4.0, gives the best result. Hence, we conduct a second round of tuning, with the following options covering greater values of the smoothing factor:

>>> max_features_option = [None]
>>> smoothing_factor_option = [4.0, 10, 16, 20, 32]
>>> fit_prior_option = [True, False]

Repeating the cross-validation process, we get the following results:

max features  smoothing  fit prior  auc
None          4.0        True       0.98965
None          4.0        False      0.98969
None          10         True       0.99208
None          10         False      0.99211
None          16         True       0.99329
None          16         False      0.99329
None          20         True       0.99362
None          20         False      0.99362
None          32         True       0.99307
None          32         False      0.99307

The (None, 20, False) set achieves the best AUC, at 0.99362!
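As an aside, the same kind of exhaustive search can be expressed more compactly with scikit-learn's Pipeline and GridSearchCV. The following is only a sketch, assuming the cleaned_emails_np and labels_np arrays from above; the step names 'vect' and 'nb' are arbitrary choices, and scoring='roc_auc' reproduces our AUC-based criterion:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> # Chain the vectorizer and the classifier so both are refit on each fold
>>> pipeline = Pipeline([
...     ('vect', CountVectorizer(stop_words="english", max_df=0.5, min_df=2)),
...     ('nb', MultinomialNB())
... ])
>>> param_grid = {
...     'vect__max_features': [None],
...     'nb__alpha': [4.0, 10, 16, 20, 32],
...     'nb__fit_prior': [True, False]
... }
>>> grid_search = GridSearchCV(pipeline, param_grid, cv=k_fold,
...                            scoring='roc_auc', n_jobs=-1)
>>> grid_search.fit(cleaned_emails_np, labels_np)
>>> print(grid_search.best_params_, grid_search.best_score_)

By default, GridSearchCV also refits the best parameter combination on all of the data it was given, so grid_search.best_estimator_ can then be used directly as the final model.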
