Using n-grams to improve the result

One thing we can do is use n-gram counts instead of plain word counts. So far, we have relied on what is known as a bag of words: we simply threw every word of an email into a bag and counted its occurrences. However, in real emails, the order in which words appear can carry a great deal of information!

This is exactly what n-gram counts aim to capture. You can think of an n-gram as a phrase that is n words long. For example, the phrase Statistics has its moments contains the following 1-grams: Statistics, has, its, and moments. It also has the following 2-grams: Statistics has, has its, and its moments. It also has two 3-grams (Statistics has its and has its moments) and only a single 4-gram.
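We can check this tokenization for ourselves. The following is a small sketch, separate from our running session, that uses CountVectorizer's build_analyzer method, which returns the callable the vectorizer uses internally to split a string into n-grams:

from sklearn.feature_extraction.text import CountVectorizer

# Extract the 2-grams of our example phrase. lowercase=False keeps the
# original capitalization (by default, CountVectorizer lowercases tokens).
bigrams = CountVectorizer(ngram_range=(2, 2), lowercase=False).build_analyzer()
print(bigrams('Statistics has its moments'))
# ['Statistics has', 'has its', 'its moments']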

We can tell CountVectorizer to include n-grams of any order in the feature matrix by specifying a range for n:

In [20]: counts = feature_extraction.text.CountVectorizer(
    ...      ngram_range=(1, 2)
    ...  )
    ...  X = counts.fit_transform(data['text'].values)
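As a quick sanity check, a sketch on top of the session above, we can confirm that the learned vocabulary now mixes single words with two-word phrases (on older scikit-learn versions, use get_feature_names instead of get_feature_names_out):

vocab = counts.get_feature_names_out()
print(len(vocab))   # total number of 1-gram and 2-gram features
print(vocab[:5])    # a few example entries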

We then repeat the entire procedure of splitting the data and training the classifier:

In [21]: X_train, X_test, y_train, y_test = ms.train_test_split(
    ...      X, y, test_size=0.2, random_state=42
    ...  )
In [22]: model_naive = naive_bayes.MultinomialNB()
    ...  model_naive.fit(X_train, y_train)
Out[22]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

You might have noticed that training takes much longer this time; including 2-grams makes the feature matrix considerably larger. To our delight, we find that the performance has increased significantly:

In [23]: model_naive.score(X_test, y_test)
Out[23]: 0.97062211981566815

However, n-gram counts are not perfect. They have the disadvantage of unfairly weighting longer documents, because longer documents contain more n-grams and thus produce larger counts. To avoid this problem, we can use relative frequencies instead of raw occurrence counts. We have already encountered one way to do so, and it had a horribly complicated name. Do you remember what it was called?
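If you need a hint, here it is in code form: the name in question is term frequency-inverse document frequency (tf-idf), and a minimal sketch, applied to the count matrix X we built above, looks like this:

from sklearn.feature_extraction.text import TfidfTransformer

# Rescale the raw n-gram counts so that long documents no longer
# receive unfairly large weights; X is the count matrix from above.
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)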
