Implementing Naïve Bayes from scratch

After a hand-calculating spam email detection example, as promised, we are going to code it through a genuine dataset, taken from the Enron email dataset http://www.aueb.gr/users/ion/data/enron-spam/. The specific dataset we are using can be directly downloaded via http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz. You can either unzip it using software, or run the following command line on your terminal:

tar -xvz enron1.tar.gz

The uncompressed folder includes a folder of ham, or non-spam, email text files, and a folder of spam email text files, as well as a summary description of the database:

enron1/
  ham/
    0001.1999-12-10.farmer.ham.txt
       0002.1999-12-13.farmer.ham.txt
       ……
       ……
    5172.2002-01-11.farmer.ham.txt
  spam/
    0006.2003-12-18.GP.spam.txt
       0008.2003-12-18.GP.spam.txt
       ……
       ……
       5171.2005-09-06.GP.spam.txt
  Summary.txt

Given a dataset for a classification problem, it is always good to keep in mind the number of samples per class and the proportion of samples from each class before applying any machine learning techniques. As written in the Summary.txt file, there are 3,672 ham (legitimate) emails and 1,500 spam emails so the spam: the legitimate-to-spam ratio is approximately 1:3 here. If such information was not given, you can also get the numbers by running the following commands:

ls -1 enron1/ham/*.txt | wc -l
3672
ls -1 enron1/spam/*.txt | wc -l
1500

Class imbalance is critical to classification performance. Imagine if most samples are from one class—the classifier tends to only learn from the dominant class and neglect the minorities. Hence, paying extra attention to class imbalance is always recommended. If it does occur, we need to either downsample the majority class, or upsample the minor class, in order to mitigate the disproportion.

Let's have a look at a legitimate and a spam email by running the following scripts from the same path where the unzipped folder is located:

>>> file_path = 'enron1/ham/0007.1999-12-14.farmer.ham.txt'
>>> with open(file_path, 'r') as infile:
...     ham_sample = infile.read()
>>> print(ham_sample)
Subject: mcmullen gas for 11 / 99
jackie ,
since the inlet to 3 river plant is shut in on 10 / 19 / 99 ( the 
last day of flow ) :
at what meter is the mcmullen gas being diverted to ?
at what meter is hpl buying the residue gas ? ( this is the gas 
from teco ,vastar , vintage , tejones , and swift )
i still see active deals at meter 3405 in path manager for teco , 
vastar ,vintage , tejones , and swift
i also see gas scheduled in pops at meter 3404 and 3405 .
please advice . we need to resolve this as soon as possible so 
settlement can send out payments .
thanks

Similarly, the spam sample is as follows:

>>> file_path = 'enron1/spam/0058.2003-12-21.GP.spam.txt'
>>> with open(file_path, 'r') as infile:
...    spam_sample = infile.read()
>>> print(spam_sample)
Subject: stacey automated system generating 8 k per week parallelogram
people are
getting rich using this system ! now it ' s your
turn !
we ' ve
cracked the code and will show you . . . .
this is the
only system that does everything for you , so you can make
money
. . . . . . . .
because your
success is . . . completely automated !
let me show
you how !
click
here
to opt out click here % random _ text

Next, we read all of the email text files and keep the ham/spam class information in the labels variable, where 1 represents spam emails, and 0 is for ham.

First, import the necessary modules, glob and os, in order to find all the .txt email files, and initialize the variables, keeping the text data and labels:

>>> import glob
>>> import os
>>> emails, labels = [], []

Then, to load the spam email files, run the following commands:

>>> file_path = 'enron1/spam/'
>>> for filename in glob.glob(os.path.join(file_path, '*.txt')):
...     with open(filename, 'r', encoding="ISO-8859-1") as infile:
...         emails.append(infile.read())
...         labels.append(1)

Load the legitimate email files by running the following commands:

>>> file_path = 'enron1/ham/'
>>> for filename in glob.glob(os.path.join(file_path, '*.txt')):
...     with open(filename, 'r', encoding="ISO-8859-1") as infile:
...         emails.append(infile.read())
...         labels.append(0)
>>> len(emails)
5172
>>> len(labels)
5172

The next step is to preprocess and clean the raw text data. To briefly recap, this includes the following:

Number and punctuation removal
Human name removal (optional)
Stop-word removal
Lemmatization

We herein reuse the code we developed in the previous two chapters:

>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer
>>> def is_letter_only(word):
...     return word.isalpha()
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()

Put together a function performing text cleaning as follows:

>>> def clean_text(docs):
...     docs_cleaned = []
...     for doc in docs:
...         doc = doc.lower()
...         doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
                    for word in doc.split() if is_letter_only(word)
                    and word not in all_names)
...         docs_cleaned.append(doc_cleaned)
...     return docs_cleaned
>>> emails_cleaned = clean_text(emails)

This leads to stop-word removal and term feature extraction, as follows:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(stop_words="english", max_features=1000,
                                             max_df=0.5, min_df=2)
>>> docs_cv = cv.fit_transform(emails_cleaned)

The max_features parameter is set to 1000, so it only considers the 1,000 most frequent terms, excluding those that are too common (50% max_df) and too rare (2 min_df). We can definitely tweak this parameter later on in order to achieve higher classification accuracy.

In case you forget what the resulting term vectors look like, let's take a peek:

>>> print(docs_cv[0])
 (0, 932) 1
 (0, 968) 1
 (0, 715) 1
 (0, 151) 1
 (0, 585) 1
 (0, 864) 1
 (0, 506) 1
 (0, 691) 1
 (0, 897) 1
 (0, 476) 1
 (0, 72)  1
 (0, 86)  2
 (0, 997) 1
 (0, 103) 1
 (0, 361) 2
 (0, 229) 1
 (0, 363) 2
 (0, 482) 2
 (0, 265) 2

The sparse vector is in the form of the following:

  (row index, term index) term_frequency

We can also see what the corresponding terms are, as follows:

>>> terms = cv.get_feature_names()
>>> print(terms[932])
unsubscribe
>>> print(terms[968])
website
>>> print(terms[715])
read

With the docs_cv feature matrix just generated, we can now develop and train our Naïve Bayes model, from scratch as always.

Starting with the prior, we first group the data by label and record the index of samples:

>>> def get_label_index(labels):
...     from collections import defaultdict
...     label_index = defaultdict(list)
...     for index, label in enumerate(labels):
...         label_index[label].append(index)
...     return label_index
>>> label_index = get_label_index(labels)

The resulting label_index looks like {0: [3000, 3001, 3002, 3003, …… 6670, 6671], 1: [0, 1, 2, 3, …., 2998, 2999]}, where training sample indices are grouped by class. With this, we calculate prior:

>>> def get_prior(label_index):
...     """
...     Compute prior based on training samples
...     @param label_index: grouped sample indices by class
...     @return: dictionary, with class label as key, corresponding
                                                  prior as the value
...     """
...     prior = {label: len(index) for label, index in
                                       label_index.items()}
...     total_count = sum(prior.values())
...     for label in prior:
...         prior[label] /= float(total_count)
...     return prior
>>> prior = get_prior(label_index)
>>> print('Prior:', prior)
Prior: {1: 0.2900232018561485, 0: 0.7099767981438515}

With prior calculated, we continue with likelihood:

>>> import numpy as np
>>> def get_likelihood(term_matrix, label_index, smoothing=0):
...     """
...     Compute likelihood based on training samples
...     @param term_matrix: sparse matrix of the term frequency features
...     @param label_index: grouped sample indices by class
...     @param smoothing: integer, additive Laplace smoothing parameter
...     @return: dictionary, with class as key, corresponding conditional      
                 probability P(feature|class) vector as value
...     """
...     likelihood = {}
...     for label, index in label_index.items():
...         likelihood[label] = term_matrix[index, :].sum(axis=0) +
                                                           smoothing
...         likelihood[label] = np.asarray(likelihood[label])[0]
...         total_count = likelihood[label].sum()
...         likelihood[label] = likelihood[label] /
                                            float(total_count)
...     return likelihood

We set the smoothing value to 1 here, which can also be 0 for no smoothing, and any other positive value, as long as high classification performance is achieved:

>>> smoothing = 1
>>> likelihood = get_likelihood(docs_cv, label_index, smoothing)
>>> len(likelihood[0])
1000

The likelihood[0] parameter is the conditional probability P(feature | legitimate) vector of length 1,000 (1,000 features) for the legitimate class. Probabilities P(feature | legitimate) for the first five features are as follows:

>>> likelihood[0][:5]
[0.00024653 0.00090705 0.00080007 0.00032096 0.00073495]

And you can probably guess that likelihood[1] would be for the spam class. Similarly, the first five conditional probabilities P(feature | spam) are:

>>> likelihood[1][:5]
[0.00063304 0.00078026 0.00101581 0.00022083 0.00326826]

If you ever find any of these confusing, feel free to check the following toy example to refresh (14 comes from 2 + 1 + 1 + 1 + 1 + 1 + 1 + smoothing 1 * 5, 16 comes from 1 + 1 + 2 + 1 + 3 + 2 + 1 + smoothing 1 * 5):

With prior and likelihood ready, we can now compute the posterior for the testing/new samples. There is a trick we use: instead of calculating the multiplication of hundreds of thousands of small value conditional probabilities of P(feature | class) (for example, 0.00024653, as we just saw), which may cause overflow error, we instead calculate the summation of their natural logarithms, then convert it back to its natural exponential value:

>>> def get_posterior(term_matrix, prior, likelihood):
...     """
...     Compute posterior of testing samples, based on prior and likelihood
...     @param term_matrix: sparse matrix of the term frequency features
...     @param prior: dictionary, with class label as key,
                                  corresponding prior as the value
...     @param likelihood: dictionary, with class label as key,
                     corresponding conditional probability vector as value
...     @return: dictionary, with class label as key, corresponding
                                                  posterior as value
...     """
...     num_docs = term_matrix.shape[0]
...     posteriors = []
...     for i in range(num_docs):
...         # posterior is proportional to prior * likelihood
...         # = exp(log(prior * likelihood))
...         # = exp(log(prior) + log(likelihood))
...         posterior = {key: np.log(prior_label) for key,
                                      prior_label in prior.items()}
...         for label, likelihood_label in likelihood.items():
...             term_document_vector = term_matrix.getrow(i)
...             counts = term_document_vector.data
...             indices = term_document_vector.indices
...             for count, index in zip(counts, indices):
...                 posterior[label] +=
                         np.log(likelihood_label[index]) * count
...         # exp(-1000):exp(-999) will cause zero division error,
...         # however it equates to exp(0):exp(1)
...         min_log_posterior = min(posterior.values())
...         for label in posterior:
...             try:
...                 posterior[label] = np.exp(
                               posterior[label] - min_log_posterior)
...             except:
...                 posterior[label] = float('inf')
...         # normalize so that all sums up to 1
...         sum_posterior = sum(posterior.values())
...         for label in posterior:
...             if posterior[label] == float('inf'):
...                 posterior[label] = 1.0
...             else:
...                 posterior[label] /= sum_posterior
...         posteriors.append(posterior.copy())
...     return posteriors

The prediction function is finished. Let's take one ham and one spam sample from another Enron email dataset to quickly verify our algorithm:

>>> emails_test = [
...     '''Subject: flat screens
...     hello ,
...     please call or contact regarding the other flat screens
...     requested .
...     trisha tlapek - eb 3132 b
...     michael sergeev - eb 3132 a
...     also the sun blocker that was taken away from eb 3131 a .
...     trisha should two monitors also michael .
...     thanks
...     kevin moore''',
...     '''Subject: let ' s stop the mlm insanity !
...     still believe you can earn $ 100 , 000 fast in mlm ? get real !
...     get emm , a brand new system that replaces mlm with something that works !
...     start earning 1 , 000 ' s now ! up to $ 10 , 000 per week doing simple 
...     online tasks .
...     free info - breakfree @ luxmail . com - type " send emm info " in the 
...     subject box .
...     this message is sent in compliance of the proposed bill section 301 . per 
...     section 301 , paragraph ( a ) ( 2 ) ( c ) of s . 1618 . further transmission 
...     to you by the sender of this e - mail may be stopped at no cost to you by        
...     sending a reply to : " email address " with the word remove in the subject 
...     line .''',
... ]

Go through the same cleaning and preprocessing steps as in training stage:

>>> emails_cleaned_test = clean_text(emails_test)
>>> term_docs_test = cv.transform(emails_cleaned_test)
>>> posterior = get_posterior(term_docs_test, prior, likelihood)
>>> print(posterior)
[{1: 5.958269329017321e-08, 0: 0.9999999404173067},
{1: 0.9999999999999948, 0: 5.2138625988879895e-15}]

For the first email, 99.5% legitimate; the second email nearly 100% spam. Both are predicted correctly.

Further, to comprehensively evaluate our classifier's performance, we can randomly split the original dataset into two sets, the training and testing sets, which simulate learning data and prediction data respectively. Generally, the proportion of the original dataset to include in the testing split can be 25%, 33.3%, or 40%. We use the train_test_split function from scikit-learn to do the random splitting and to preserve the percentage of samples for each class:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test =
              train_test_split(emails_cleaned, labels, test_size=0.33,
              random_state=42)

It is a good practice to assign a fixed random_state (for example, 42) during experiments and exploration in order to guarantee that the same training and testing sets are generated every time the program runs. This allows us to make sure that the classifier functions and performs well on a fixed dataset before we incorporate randomness and proceed further.

Check the training size and testing size as follows:

>>> len(X_train), len(Y_train)
(3465, 3465)
>>> len(X_test), len(Y_test)
(1707, 1707)

Retrain the term frequency CountVectorizer based on the training set and recompute prior and likelihood accordingly:

>>> term_docs_train = cv.fit_transform(X_train)  
>>> label_index = get_label_index(Y_train)
>>> prior = get_prior(label_index)
>>> likelihood = get_likelihood(term_docs_train, label_index, smoothing)

We then convert the testing documents into term matrix as follows:

>>> term_docs_test = cv.transform(X_test)

It is noted that we can't train CountVectorizer using both the training and testing sets. Otherwise, it will cause data leakage, as the testing set is supposed to be unknown beforehand to all feature extractors. Hence, the term pool and the term counter should be built solely on the training set.

Now, predict the posterior of the testing/new dataset as follows:

>>> posterior = get_posterior(term_docs_test, prior, likelihood)

Finally, we evaluate the model's performance with classification accuracy, which is the proportion of correct prediction:

>>> correct = 0.0
>>> for pred, actual in zip(posterior, Y_test):
...     if actual == 1:
...         if pred[1] >= 0.5:
...             correct += 1
...     elif pred[0] > 0.5:
...         correct += 1
>>> print('The accuracy on {0} testing samples is:
         {1:.1f}%'.format(len(Y_test), correct/len(Y_test)*100))
The accuracy on 1707 testing samples is: 93.0%

The Naïve Bayes classifier we just developed line by line correctly classifies 93% emails!

Table of Contents for Implementing Na&#xEF;ve Bayes from scratch

Create new playlist

Sign In

Sign Up

Table of Contents for
Implementing Naïve Bayes from scratch