One improvement is to use n-gram counts instead of plain word counts. So far, we have relied on what is known as a bag of words: we simply threw every word of an email into a bag and counted its occurrences. However, in real emails, the order in which words appear can carry a great deal of information!
This is exactly the information that n-gram counts capture. You can think of an n-gram as a phrase that is n words long. For example, the phrase Statistics has its moments contains the following 1-grams: Statistics, has, its, and moments. It also contains the following 2-grams: Statistics has, has its, and its moments. Finally, it contains two 3-grams (Statistics has its and has its moments) and only a single 4-gram (the phrase itself).
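The sliding-window idea behind n-grams can be sketched in a few lines of plain Python (the function name `ngrams` is just an illustration, not part of any library we use here):

```python
def ngrams(text, n):
    """Return all n-grams of a whitespace-tokenized text as strings."""
    words = text.split()
    # Slide a window of length n over the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "Statistics has its moments"
print(ngrams(phrase, 2))  # → ['Statistics has', 'has its', 'its moments']
print(ngrams(phrase, 3))  # → ['Statistics has its', 'has its moments']
print(ngrams(phrase, 4))  # → ['Statistics has its moments']
```

A text of m words thus yields m - n + 1 n-grams, which is why the counts above shrink as n grows.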
We can tell CountVectorizer to include n-grams of any order in the feature matrix by specifying a range for n:
In [20]: counts = feature_extraction.text.CountVectorizer(
... ngram_range=(1, 2)
... )
... X = counts.fit_transform(data['text'].values)
We then repeat the entire procedure of splitting the data and training the classifier:
In [21]: X_train, X_test, y_train, y_test = ms.train_test_split(
... X, y, test_size=0.2, random_state=42
... )
In [22]: model_naive = naive_bayes.MultinomialNB()
... model_naive.fit(X_train, y_train)
Out[22]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
You might have noticed that training takes much longer this time, since the feature matrix has grown considerably. To our delight, we find that the performance has improved significantly:
In [23]: model_naive.score(X_test, y_test)
Out[23]: 0.97062211981566815
However, n-gram counts are not perfect. They have the disadvantage of unfairly weighting longer documents, simply because longer documents contain more n-grams. To avoid this problem, we can use relative frequencies instead of raw occurrence counts. We have already encountered one way to do so, and it had a horribly complicated name. Do you remember what it was called?
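The core of the fix can be sketched directly with NumPy: dividing each document's counts by that document's total turns raw counts into relative frequencies, so a long document no longer dominates just by being long (the count values below are made up for illustration):

```python
import numpy as np

# Hypothetical raw counts for three terms in two documents:
# the second document has the same term proportions for the first
# two terms but many more words overall.
counts = np.array([
    [2, 1, 0],   # short document, 3 terms total
    [8, 4, 12],  # long document, 24 terms total
], dtype=float)

# Divide each row by its total to get per-document relative frequencies.
rel = counts / counts.sum(axis=1, keepdims=True)
print(rel)
```

After normalization, each row sums to 1, so documents of very different lengths become directly comparable.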