After a hand-calculating spam email detection example, as promised, we are going to code it through a genuine dataset, taken from the Enron email dataset http://www.aueb.gr/users/ion/data/enron-spam/. The specific dataset we are using can be directly downloaded via http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz. You can either unzip it using software, or run the following command line on your terminal:
tar -xvz enron1.tar.gz
The uncompressed folder includes a folder of ham, or non-spam, email text files, and a folder of spam email text files, as well as a summary description of the database:
enron1/
ham/
0001.1999-12-10.farmer.ham.txt
0002.1999-12-13.farmer.ham.txt
……
……
5172.2002-01-11.farmer.ham.txt
spam/
0006.2003-12-18.GP.spam.txt
0008.2003-12-18.GP.spam.txt
……
……
5171.2005-09-06.GP.spam.txt
Summary.txt
Given a dataset for a classification problem, it is always good to keep in mind the number of samples per class and the proportion of samples from each class before applying any machine learning techniques. As written in the Summary.txt file, there are 3,672 ham (legitimate) emails and 1,500 spam emails so the spam: the legitimate-to-spam ratio is approximately 1:3 here. If such information was not given, you can also get the numbers by running the following commands:
ls -1 enron1/ham/*.txt | wc -l
3672
ls -1 enron1/spam/*.txt | wc -l
1500
Let's have a look at a legitimate and a spam email by running the following scripts from the same path where the unzipped folder is located:
>>> file_path = 'enron1/ham/0007.1999-12-14.farmer.ham.txt'
>>> with open(file_path, 'r') as infile:
... ham_sample = infile.read()
>>> print(ham_sample)
Subject: mcmullen gas for 11 / 99
jackie ,
since the inlet to 3 river plant is shut in on 10 / 19 / 99 ( the
last day of flow ) :
at what meter is the mcmullen gas being diverted to ?
at what meter is hpl buying the residue gas ? ( this is the gas
from teco ,vastar , vintage , tejones , and swift )
i still see active deals at meter 3405 in path manager for teco ,
vastar ,vintage , tejones , and swift
i also see gas scheduled in pops at meter 3404 and 3405 .
please advice . we need to resolve this as soon as possible so
settlement can send out payments .
thanks
Similarly, the spam sample is as follows:
>>> file_path = 'enron1/spam/0058.2003-12-21.GP.spam.txt'
>>> with open(file_path, 'r') as infile:
... spam_sample = infile.read()
>>> print(spam_sample)
Subject: stacey automated system generating 8 k per week parallelogram
people are
getting rich using this system ! now it ' s your
turn !
we ' ve
cracked the code and will show you . . . .
this is the
only system that does everything for you , so you can make
money
. . . . . . . .
because your
success is . . . completely automated !
let me show
you how !
click
here
to opt out click here % random _ text
Next, we read all of the email text files and keep the ham/spam class information in the labels variable, where 1 represents spam emails, and 0 is for ham.
First, import the necessary modules, glob and os, in order to find all the .txt email files, and initialize the variables, keeping the text data and labels:
>>> import glob
>>> import os
>>> emails, labels = [], []
Then, to load the spam email files, run the following commands:
>>> file_path = 'enron1/spam/'
>>> for filename in glob.glob(os.path.join(file_path, '*.txt')):
... with open(filename, 'r', encoding="ISO-8859-1") as infile:
... emails.append(infile.read())
... labels.append(1)
Load the legitimate email files by running the following commands:
>>> file_path = 'enron1/ham/'
>>> for filename in glob.glob(os.path.join(file_path, '*.txt')):
... with open(filename, 'r', encoding="ISO-8859-1") as infile:
... emails.append(infile.read())
... labels.append(0)
>>> len(emails)
5172
>>> len(labels)
5172
The next step is to preprocess and clean the raw text data. To briefly recap, this includes the following:
- Number and punctuation removal
- Human name removal (optional)
- Stop-word removal
- Lemmatization
We herein reuse the code we developed in the previous two chapters:
>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer
>>> def is_letter_only(word):
... return word.isalpha()
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()
Put together a function performing text cleaning as follows:
>>> def clean_text(docs):
... docs_cleaned = []
... for doc in docs:
... doc = doc.lower()
... doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
for word in doc.split() if is_letter_only(word)
and word not in all_names)
... docs_cleaned.append(doc_cleaned)
... return docs_cleaned
>>> emails_cleaned = clean_text(emails)
This leads to stop-word removal and term feature extraction, as follows:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(stop_words="english", max_features=1000,
max_df=0.5, min_df=2)
>>> docs_cv = cv.fit_transform(emails_cleaned)
The max_features parameter is set to 1000, so it only considers the 1,000 most frequent terms, excluding those that are too common (50% max_df) and too rare (2 min_df). We can definitely tweak this parameter later on in order to achieve higher classification accuracy.
In case you forget what the resulting term vectors look like, let's take a peek:
>>> print(docs_cv[0])
(0, 932) 1
(0, 968) 1
(0, 715) 1
(0, 151) 1
(0, 585) 1
(0, 864) 1
(0, 506) 1
(0, 691) 1
(0, 897) 1
(0, 476) 1
(0, 72) 1
(0, 86) 2
(0, 997) 1
(0, 103) 1
(0, 361) 2
(0, 229) 1
(0, 363) 2
(0, 482) 2
(0, 265) 2
The sparse vector is in the form of the following:
(row index, term index) term_frequency
We can also see what the corresponding terms are, as follows:
>>> terms = cv.get_feature_names()
>>> print(terms[932])
unsubscribe
>>> print(terms[968])
website
>>> print(terms[715])
read
With the docs_cv feature matrix just generated, we can now develop and train our Naïve Bayes model, from scratch as always.
Starting with the prior, we first group the data by label and record the index of samples:
>>> def get_label_index(labels):
... from collections import defaultdict
... label_index = defaultdict(list)
... for index, label in enumerate(labels):
... label_index[label].append(index)
... return label_index
>>> label_index = get_label_index(labels)
The resulting label_index looks like {0: [3000, 3001, 3002, 3003, …… 6670, 6671], 1: [0, 1, 2, 3, …., 2998, 2999]}, where training sample indices are grouped by class. With this, we calculate prior:
>>> def get_prior(label_index):
... """
... Compute prior based on training samples
... @param label_index: grouped sample indices by class
... @return: dictionary, with class label as key, corresponding
prior as the value
... """
... prior = {label: len(index) for label, index in
label_index.items()}
... total_count = sum(prior.values())
... for label in prior:
... prior[label] /= float(total_count)
... return prior
>>> prior = get_prior(label_index)
>>> print('Prior:', prior)
Prior: {1: 0.2900232018561485, 0: 0.7099767981438515}
With prior calculated, we continue with likelihood:
>>> import numpy as np
>>> def get_likelihood(term_matrix, label_index, smoothing=0):
... """
... Compute likelihood based on training samples
... @param term_matrix: sparse matrix of the term frequency features
... @param label_index: grouped sample indices by class
... @param smoothing: integer, additive Laplace smoothing parameter
... @return: dictionary, with class as key, corresponding conditional
probability P(feature|class) vector as value
... """
... likelihood = {}
... for label, index in label_index.items():
... likelihood[label] = term_matrix[index, :].sum(axis=0) +
smoothing
... likelihood[label] = np.asarray(likelihood[label])[0]
... total_count = likelihood[label].sum()
... likelihood[label] = likelihood[label] /
float(total_count)
... return likelihood
We set the smoothing value to 1 here, which can also be 0 for no smoothing, and any other positive value, as long as high classification performance is achieved:
>>> smoothing = 1
>>> likelihood = get_likelihood(docs_cv, label_index, smoothing)
>>> len(likelihood[0])
1000
The likelihood[0] parameter is the conditional probability P(feature | legitimate) vector of length 1,000 (1,000 features) for the legitimate class. Probabilities P(feature | legitimate) for the first five features are as follows:
>>> likelihood[0][:5]
[0.00024653 0.00090705 0.00080007 0.00032096 0.00073495]
And you can probably guess that likelihood[1] would be for the spam class. Similarly, the first five conditional probabilities P(feature | spam) are:
>>> likelihood[1][:5]
[0.00063304 0.00078026 0.00101581 0.00022083 0.00326826]
If you ever find any of these confusing, feel free to check the following toy example to refresh (14 comes from 2 + 1 + 1 + 1 + 1 + 1 + 1 + smoothing 1 * 5, 16 comes from 1 + 1 + 2 + 1 + 3 + 2 + 1 + smoothing 1 * 5):
With prior and likelihood ready, we can now compute the posterior for the testing/new samples. There is a trick we use: instead of calculating the multiplication of hundreds of thousands of small value conditional probabilities of P(feature | class) (for example, 0.00024653, as we just saw), which may cause overflow error, we instead calculate the summation of their natural logarithms, then convert it back to its natural exponential value:
>>> def get_posterior(term_matrix, prior, likelihood):
... """
... Compute posterior of testing samples, based on prior and likelihood
... @param term_matrix: sparse matrix of the term frequency features
... @param prior: dictionary, with class label as key,
corresponding prior as the value
... @param likelihood: dictionary, with class label as key,
corresponding conditional probability vector as value
... @return: dictionary, with class label as key, corresponding
posterior as value
... """
... num_docs = term_matrix.shape[0]
... posteriors = []
... for i in range(num_docs):
... # posterior is proportional to prior * likelihood
... # = exp(log(prior * likelihood))
... # = exp(log(prior) + log(likelihood))
... posterior = {key: np.log(prior_label) for key,
prior_label in prior.items()}
... for label, likelihood_label in likelihood.items():
... term_document_vector = term_matrix.getrow(i)
... counts = term_document_vector.data
... indices = term_document_vector.indices
... for count, index in zip(counts, indices):
... posterior[label] +=
np.log(likelihood_label[index]) * count
... # exp(-1000):exp(-999) will cause zero division error,
... # however it equates to exp(0):exp(1)
... min_log_posterior = min(posterior.values())
... for label in posterior:
... try:
... posterior[label] = np.exp(
posterior[label] - min_log_posterior)
... except:
... posterior[label] = float('inf')
... # normalize so that all sums up to 1
... sum_posterior = sum(posterior.values())
... for label in posterior:
... if posterior[label] == float('inf'):
... posterior[label] = 1.0
... else:
... posterior[label] /= sum_posterior
... posteriors.append(posterior.copy())
... return posteriors
The prediction function is finished. Let's take one ham and one spam sample from another Enron email dataset to quickly verify our algorithm:
>>> emails_test = [
... '''Subject: flat screens
... hello ,
... please call or contact regarding the other flat screens
... requested .
... trisha tlapek - eb 3132 b
... michael sergeev - eb 3132 a
... also the sun blocker that was taken away from eb 3131 a .
... trisha should two monitors also michael .
... thanks
... kevin moore''',
... '''Subject: let ' s stop the mlm insanity !
... still believe you can earn $ 100 , 000 fast in mlm ? get real !
... get emm , a brand new system that replaces mlm with something that works !
... start earning 1 , 000 ' s now ! up to $ 10 , 000 per week doing simple
... online tasks .
... free info - breakfree @ luxmail . com - type " send emm info " in the
... subject box .
... this message is sent in compliance of the proposed bill section 301 . per
... section 301 , paragraph ( a ) ( 2 ) ( c ) of s . 1618 . further transmission
... to you by the sender of this e - mail may be stopped at no cost to you by
... sending a reply to : " email address " with the word remove in the subject
... line .''',
... ]
Go through the same cleaning and preprocessing steps as in training stage:
>>> emails_cleaned_test = clean_text(emails_test)
>>> term_docs_test = cv.transform(emails_cleaned_test)
>>> posterior = get_posterior(term_docs_test, prior, likelihood)
>>> print(posterior)
[{1: 5.958269329017321e-08, 0: 0.9999999404173067},
{1: 0.9999999999999948, 0: 5.2138625988879895e-15}]
For the first email, 99.5% legitimate; the second email nearly 100% spam. Both are predicted correctly.
Further, to comprehensively evaluate our classifier's performance, we can randomly split the original dataset into two sets, the training and testing sets, which simulate learning data and prediction data respectively. Generally, the proportion of the original dataset to include in the testing split can be 25%, 33.3%, or 40%. We use the train_test_split function from scikit-learn to do the random splitting and to preserve the percentage of samples for each class:
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test =
train_test_split(emails_cleaned, labels, test_size=0.33,
random_state=42)
Check the training size and testing size as follows:
>>> len(X_train), len(Y_train)
(3465, 3465)
>>> len(X_test), len(Y_test)
(1707, 1707)
Retrain the term frequency CountVectorizer based on the training set and recompute prior and likelihood accordingly:
>>> term_docs_train = cv.fit_transform(X_train)
>>> label_index = get_label_index(Y_train)
>>> prior = get_prior(label_index)
>>> likelihood = get_likelihood(term_docs_train, label_index, smoothing)
We then convert the testing documents into term matrix as follows:
>>> term_docs_test = cv.transform(X_test)
Now, predict the posterior of the testing/new dataset as follows:
>>> posterior = get_posterior(term_docs_test, prior, likelihood)
Finally, we evaluate the model's performance with classification accuracy, which is the proportion of correct prediction:
>>> correct = 0.0
>>> for pred, actual in zip(posterior, Y_test):
... if actual == 1:
... if pred[1] >= 0.5:
... correct += 1
... elif pred[0] > 0.5:
... correct += 1
>>> print('The accuracy on {0} testing samples is:
{1:.1f}%'.format(len(Y_test), correct/len(Y_test)*100))
The accuracy on 1707 testing samples is: 93.0%
The Naïve Bayes classifier we just developed line by line correctly classifies 93% emails!