Model tuning and cross-validation

Having learned which metrics to use to measure a classification model, we'll now study how to measure it properly. We should avoid relying on the classification results from one fixed testing set, as we did in the previous experiments. Instead, we usually apply the k-fold cross-validation technique to assess how a model will generally perform in practice.

In the k-fold cross-validation setting, the original data is first randomly divided into k equal-sized subsets, in which the class proportions are often preserved. Each of these k subsets is then successively retained as the testing set for evaluating the model. During each trial, the remaining k - 1 subsets (excluding the held-out fold) form the training set used to fit the model. Finally, the performance across all k trials is averaged to generate an overall estimate.

Statistically, the averaged performance of k-fold cross-validation is a reliable estimate of how a model performs in general. Given different sets of parameters pertaining to a machine learning model and/or data preprocessing algorithms, or even two or more different models, the goal of model tuning and/or model selection is to pick the set of parameters of a classifier that achieves the best averaged performance. With these concepts in mind, we can now start to tweak our Naïve Bayes classifier, incorporating cross-validation and the AUC of the ROC as the measurement.

In k-fold cross-validation, k is usually set to 3, 5, or 10. If the training set is small, a large k (5 or 10) is recommended to ensure enough training samples in each fold. If the training set is large, a small value (such as 3 or 4) works fine, since a higher k will lead to an even higher computational cost of training on a large dataset.
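To make the averaging mechanics concrete, here is a minimal, self-contained sketch (not part of our spam project) that runs stratified 10-fold cross-validation on scikit-learn's built-in breast cancer dataset and averages the per-fold AUC scores; the dataset and classifier here are only stand-ins for illustration:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import cross_val_score, StratifiedKFold
>>> from sklearn.naive_bayes import MultinomialNB
>>> X_demo, y_demo = load_breast_cancer(return_X_y=True)
>>> cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
>>> # One AUC per fold; the mean is the cross-validated estimate
>>> auc_scores = cross_val_score(MultinomialNB(), X_demo, y_demo,
...                              scoring='roc_auc', cv=cv_demo)
>>> print(auc_scores.mean())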

We herein use the split() method from the StratifiedKFold class of scikit-learn to divide the data into chunks with preserved class fractions:

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> k = 10
>>> k_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
>>> cleaned_emails_np = np.array(cleaned_emails)
>>> labels_np = np.array(labels)
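As a quick sanity check (purely illustrative, and assuming the labels are encoded as the integers 0 and 1 as in the preceding sections), we can confirm that each testing fold roughly preserves the overall proportion of the positive class:

>>> # The mean of a 0/1 label array is the fraction of positive samples
>>> print(labels_np.mean())
>>> for _, test_indices in k_fold.split(cleaned_emails_np, labels_np):
...     print(labels_np[test_indices].mean())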

After initializing the 10-fold generator, we choose to explore the following parameters:

  • max_features: This represents the n most frequent terms to use as the feature space
  • alpha: This represents the smoothing factor, that is, the initial count added to each term (see the brief sketch after this list)
  • fit_prior: This represents whether or not to use a prior tailored to the training data
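To see what the smoothing factor actually does, here is a toy illustration with made-up counts (not our email data): alpha adds a pseudo-count to every term, so a term never observed in a class still receives a non-zero likelihood, and larger values pull the estimates toward uniform:

>>> # Toy example: class 0 has term counts [3, 0], so the second term is
>>> # unseen in that class; smoothing keeps its likelihood above zero
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> X_toy = np.array([[3, 0], [0, 2]])
>>> y_toy = np.array([0, 1])
>>> for a in [0.5, 1.0, 4.0]:
...     nb = MultinomialNB(alpha=a).fit(X_toy, y_toy)
...     print(a, np.exp(nb.feature_log_prob_[0]))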

We start with the following options:

>>> max_features_option = [2000, 8000, None]
>>> smoothing_factor_option = [0.5, 1.0, 2.0, 4.0]
>>> fit_prior_option = [True, False]
>>> auc_record = {}

Then, for each fold generated by the split() method of the k_fold object, we repeat the process of term count feature extraction, classifier training, and prediction for each of the aforementioned combinations of parameters, and record the resulting AUCs:

>>> for train_indices, test_indices in k_fold.split(cleaned_emails_np, labels_np):
...     X_train, X_test = (cleaned_emails_np[train_indices],
...                        cleaned_emails_np[test_indices])
...     Y_train, Y_test = labels_np[train_indices], labels_np[test_indices]
...     for max_features in max_features_option:
...         if max_features not in auc_record:
...             auc_record[max_features] = {}
...         cv = CountVectorizer(stop_words="english",
...                              max_features=max_features,
...                              max_df=0.5, min_df=2)
...         term_docs_train = cv.fit_transform(X_train)
...         term_docs_test = cv.transform(X_test)
...         for alpha in smoothing_factor_option:
...             if alpha not in auc_record[max_features]:
...                 auc_record[max_features][alpha] = {}
...             for fit_prior in fit_prior_option:
...                 clf = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
...                 clf.fit(term_docs_train, Y_train)
...                 prediction_prob = clf.predict_proba(term_docs_test)
...                 pos_prob = prediction_prob[:, 1]
...                 auc = roc_auc_score(Y_test, pos_prob)
...                 # Sum the AUC over the k folds; we divide by k when printing
...                 auc_record[max_features][alpha][fit_prior] = (auc +
...                     auc_record[max_features][alpha].get(fit_prior, 0.0))

Finally, we present the results as follows:

>>> print('max features  smoothing  fit prior  auc')
>>> for max_features, max_feature_record in auc_record.items():
...     for smoothing, smoothing_record in max_feature_record.items():
...         for fit_prior, auc in smoothing_record.items():
...             print('    {0}      {1}      {2}    {3:.5f}'.format(
...                 max_features, smoothing, fit_prior, auc/k))
max features  smoothing  fit prior  auc
2000          0.5        False      0.97421
2000          1.0        True       0.97237
2000          1.0        False      0.97238
2000          2.0        True       0.97043
2000          2.0        False      0.97057
2000          4.0        True       0.96853
2000          4.0        False      0.96843
8000          0.5        True       0.98533
8000          0.5        False      0.98530
8000          1.0        True       0.98428
8000          1.0        False      0.98430
8000          2.0        True       0.98338
8000          2.0        False      0.98337
8000          4.0        True       0.98291
8000          4.0        False      0.98296
None          0.5        True       0.98890
None          0.5        False      0.98884
None          1.0        True       0.98899
None          1.0        False      0.98904
None          2.0        True       0.98906
None          2.0        False      0.98915
None          4.0        True       0.98965
None          4.0        False      0.98969

The (None, 4.0, False) combination achieves the best AUC, at 0.98969. Not limiting the maximal number of features outperforms capping it at 2000 or 8000, and with the full feature space the largest smoothing factor we tried, 4.0, gives the best result. Hence, we conduct a second round of tuning, with the following options covering greater values of the smoothing factor:

>>> max_features_option = [None]
>>> smoothing_factor_option = [4.0, 10, 16, 20, 32]
>>> fit_prior_option = [True, False]

Repeating the cross-validation process, we get the following results:

max features  smoothing  fit prior  auc
None          4.0        True       0.98965
None          4.0        False      0.98969
None          10         True       0.99208
None          10         False      0.99211
None          16         True       0.99329
None          16         False      0.99329
None          20         True       0.99362
None          20         False      0.99362
None          32         True       0.99307
None          32         False      0.99307

The (None, 20, False) set achieves the best AUC, at 0.99362!
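As an aside, the same kind of exhaustive search can be expressed more compactly with scikit-learn's Pipeline and GridSearchCV. The following is only a sketch, assuming the cleaned_emails_np and labels_np arrays from above; the step names 'vect' and 'nb' are arbitrary choices, and scoring='roc_auc' reproduces our AUC-based criterion:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> # Chain the vectorizer and the classifier so both are refit on each fold
>>> pipeline = Pipeline([
...     ('vect', CountVectorizer(stop_words="english", max_df=0.5, min_df=2)),
...     ('nb', MultinomialNB())
... ])
>>> param_grid = {
...     'vect__max_features': [None],
...     'nb__alpha': [4.0, 10, 16, 20, 32],
...     'nb__fit_prior': [True, False]
... }
>>> grid_search = GridSearchCV(pipeline, param_grid, cv=k_fold,
...                            scoring='roc_auc', n_jobs=-1)
>>> grid_search.fit(cleaned_emails_np, labels_np)
>>> print(grid_search.best_params_, grid_search.best_score_)

By default, GridSearchCV also refits the best parameter combination on all of the data it was given, so grid_search.best_estimator_ can then be used directly as the final model.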
