Chapter 11. Predictions Don't Grow on Trees – or Do They?

In this chapter, we will be looking at three types of machine learning algorithms. The first two are examples of supervised learning, while the final algorithm is an example of unsupervised learning.

Our goal in this chapter is to apply concepts from previous chapters to construct and use modern learning algorithms, so that we can glean insights and make predictions on real data sets. As we explore the following algorithms, we should always keep our metrics in mind.

Let's get to it!

Naïve Bayes classification

Let's begin with Naïve Bayes classification. This machine learning model relies heavily on results from previous chapters, specifically on Bayes theorem:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

Let's look a little closer at the specific features of this formula:

  • P(H) is the probability of the hypothesis before we observe the data, called the prior probability, or just prior
  • P(H|D) is what we want to compute, the probability of the hypothesis after we observe the data, called the posterior
  • P(D|H) is the probability of the data under the given hypothesis, called the likelihood
  • P(D) is the probability of the data under any hypothesis, called the normalizing constant
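To make the four terms concrete, here is a minimal sketch that plugs numbers into Bayes theorem. The function name posterior and all of the numbers are made up for illustration; they do not come from the spam data set used later.

```python
def posterior(prior, likelihood, evidence):
    """Bayes theorem: P(H|D) = P(D|H) * P(H) / P(D)."""
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only to illustrate the formula.
p_h = 0.2              # prior P(H)
p_d_given_h = 0.9      # likelihood P(D|H)
p_d_given_not_h = 0.1  # P(D|not H), needed to compute P(D)

# Normalizing constant via the law of total probability:
# P(D) = P(D|H) * P(H) + P(D|not H) * P(not H)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(round(p_h_given_d, 4))
```

With these made-up inputs, observing the data raises the probability of the hypothesis from the prior of 0.2 to a posterior of roughly 0.69.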

Naïve Bayes classification is a classification model, and therefore a supervised model. Given this, what kind of data do we need?

  • Labeled data
  • Unlabeled data

(Insert jeopardy music here)

If you answered labeled data, then you're well on your way to becoming a data scientist!

Suppose we have a data set with n features (x1, x2, …, xn) and a class label C. For example, let's take some data involving spam text classification. Our data would consist of rows of individual text samples and columns of both our features and our class labels. Our features would be the words and phrases contained within the text samples, and our class labels are simply spam or not spam. In this scenario, I will replace the class not spam with the easier-to-say word, ham:

import pandas as pd
import sklearn
# tab-separated file with no header row: a label column and a message column
df = pd.read_table('',
                   sep='\t', header=None, names=['label', 'msg'])

Here is a sample of the text data in row-column format:

[Figure: sample rows of the data, with the label and msg columns]

Let's do some preliminary statistics to see what we are dealing with. Let's see the difference in the number of ham and spam messages at our disposal:


This gives us a bar chart, as follows:

[Figure: bar chart of the number of ham and spam messages]

So we have way more ham messages than we do spam. Because this is a classification problem, it will be very useful to know our null accuracy rate, which is the percentage chance of predicting a single row correctly if we keep guessing the most common class, ham:

df.label.value_counts() / df.shape[0]

ham     0.865937
spam    0.134063

So if we blindly guessed ham, we would be correct about 87% of the time, but we can do better than that. If we have a set of classes, C, and features xi, then we can use Bayes theorem to predict the probability that a single row belongs to class C using the following formula:

$$P(\text{class } C \mid \{x_i\}) = \frac{P(\{x_i\} \mid \text{class } C)\,P(\text{class } C)}{P(\{x_i\})}$$

Let's look at this formula in a little more detail:

  • P(class C | {xi}): The posterior probability is the probability that the row belongs to class C given the features {xi}.
  • P({xi} | class C): This is the likelihood that we would observe these features given that the row was in class C.
  • P(class C): This is the prior probability. It is the probability that the data point belongs to class C before we see any data.
  • P({xi}): This is our normalization constant.

For example, imagine we have an e-mail with three words: send cash now. We'll use Naïve Bayes to classify the e-mail as either being spam or ham:

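$$P(\text{spam} \mid \text{send cash now}) = \frac{P(\text{send cash now} \mid \text{spam})\,P(\text{spam})}{P(\text{send cash now})}$$

$$P(\text{ham} \mid \text{send cash now}) = \frac{P(\text{send cash now} \mid \text{ham})\,P(\text{ham})}{P(\text{send cash now})}$$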

We are concerned with the difference of these two numbers. We can use the following criteria to classify any single text sample:

  • If P(spam | send cash now) is larger than P(ham | send cash now), then we will classify the text as spam
  • If P(ham | send cash now) is larger than P(spam | send cash now), then we will label the text as ham

Because both equations have P(send cash now) in the denominator, we can ignore it.

So now we are concerned with the following:

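$$P(\text{send cash now} \mid \text{spam})\,P(\text{spam}) \quad \text{versus} \quad P(\text{send cash now} \mid \text{ham})\,P(\text{ham})$$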

Let's figure out the numbers in this equation:

  • P(spam) = 0.134063
  • P(ham) = 0.865937
  • P(send cash now | spam)
  • P(send cash now | ham)

The final two likelihoods might seem like they would not be so difficult to calculate. All we have to do is count the number of spam messages that include the phrase send cash now and divide that by the total number of spam messages:

df.msg = df.msg.apply(lambda x: x.lower())
# make all strings lower case so we can search more easily

df[df.msg.str.contains('send cash now')].shape
(0, 2)

Oh no! There are none! There are literally 0 texts with the exact phrase send cash now. The hidden problem here is that this phrase is very specific, and we can't assume that we will ever have enough data to have seen this exact phrase many times before. Instead, we can make a naïve assumption in our Bayes theorem. If we assume that the features (words) are conditionally independent given the class (meaning that, within a class, the presence of one word does not affect the presence of another), then we can rewrite the formula:

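$$P(\text{send cash now} \mid \text{spam}) = P(\text{send} \mid \text{spam})\,P(\text{cash} \mid \text{spam})\,P(\text{now} \mid \text{spam})$$

Each factor is easy to estimate from the data:

spams = df[df.label == 'spam']
for word in ['send', 'cash', 'now']:
    # fraction of spam messages containing the word (substring match)
    print(word, spams[spams.msg.str.contains(word)].shape[0] / spams.shape[0])

This reveals the following:

  • P(send|spam) = 0.096
  • P(cash|spam) = 0.091
  • P(now|spam) = 0.280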

Meaning we can calculate the following:

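$$P(\text{spam})\,P(\text{send} \mid \text{spam})\,P(\text{cash} \mid \text{spam})\,P(\text{now} \mid \text{spam}) = 0.134063 \times 0.096 \times 0.091 \times 0.280 \approx 0.00032$$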

Repeating the same procedure for ham gives us the following:

  • P(send|ham) = 0.03
  • P(cash|ham) = 0.003
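  • P(now|ham) = 0.109

$$P(\text{ham})\,P(\text{send} \mid \text{ham})\,P(\text{cash} \mid \text{ham})\,P(\text{now} \mid \text{ham}) = 0.865937 \times 0.03 \times 0.003 \times 0.109 \approx 0.0000084$$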

The fact that these numbers are both very low is not as important as the fact that the spam value is much larger than the ham value. If we calculate .00032 / .0000084 = 38.1, we see that the send cash now probability for spam is 38 times higher than for ham.

Doing this means that we can classify send cash now as spam! Simple, right?
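The arithmetic above can be sketched in a few lines of plain Python. The priors and per-word likelihoods are the ones quoted in the text; the names priors, likelihoods, and naive_score are ours, and the final normalization assumes that spam and ham are the only two classes.

```python
from math import prod

# Priors and per-word likelihoods quoted in the text.
priors = {"spam": 0.134063, "ham": 0.865937}
likelihoods = {
    "spam": {"send": 0.096, "cash": 0.091, "now": 0.280},
    "ham": {"send": 0.03, "cash": 0.003, "now": 0.109},
}


def naive_score(label, words):
    """Prior times the product of per-word likelihoods (unnormalized posterior)."""
    return priors[label] * prod(likelihoods[label][w] for w in words)


words = ["send", "cash", "now"]
scores = {label: naive_score(label, words) for label in priors}
ratio = scores["spam"] / scores["ham"]

# Dividing by the sum of the scores restores the denominator we ignored earlier.
posterior_spam = scores["spam"] / sum(scores.values())

prediction = max(scores, key=scores.get)
print(scores, ratio, posterior_spam, prediction)
```

Working with the unrounded products, the ratio comes out a little above the 38.1 quoted in the text (which divides rounded figures), and the classification is the same: spam.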

Let's use Python to implement a Naïve Bayes classifier without having to do all of these calculations ourselves.

First, let's revisit the count vectorizer in scikit-learn that turns text into numerical data for us. Let's assume that we will train on three documents (sentences):

# simple count vectorizer example
from sklearn.feature_extraction.text import CountVectorizer

train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE 44!']

# learn the 'vocabulary' of the training data
vect = CountVectorizer()
train_simple_dtm = vect.fit_transform(train_simple)
# get_feature_names_out() replaces get_feature_names() in newer scikit-learn releases
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names_out())

[Figure: the resulting document-term matrix for the three training sentences]

Note that each row represents one of the three documents (sentences), each column represents one of the words present in the documents and each cell contains the number of times each word appears in each document.

We can then use the count vectorizer to transform new incoming test documents to conform with our training set (the three sentences):

# transform testing data into a document-term matrix (using the existing vocabulary; notice that don't is missing)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names_out())

[Figure: the document-term matrix for the test sentence, with no column for don't]

Note how in our test sentence we had a new word, namely don't. When we vectorized it, because we hadn't seen that word previously in our training data, the vectorizer simply ignored it. This is important, and incentivizes data scientists to obtain as much data as possible for their training sets.
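To see what the vectorizer is doing under the hood, here is a rough pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper names tokenize, fit_vocabulary, and transform are hypothetical, and the lowercasing and the "two or more characters" token rule are our assumptions about CountVectorizer's default settings.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(doc):
    """Lowercase the text and pull out word tokens."""
    return TOKEN.findall(doc.lower())


def fit_vocabulary(docs):
    """Sorted list of every distinct token seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def transform(docs, vocab):
    """One row of counts per document; tokens outside vocab are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocab])
    return rows


train_simple = ["call you tonight", "Call me a cab", "please call me... PLEASE 44!"]
vocab = fit_vocabulary(train_simple)
train_rows = transform(train_simple, vocab)
test_rows = transform(["please don't call me"], vocab)

print(vocab)
for row in train_rows:
    print(row)
print(test_rows)
```

The test sentence contributes counts only for call, me, and please. The pieces of don't never made it into the vocabulary, so they are silently dropped, which is exactly the behavior described above.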

Now let's do this for our actual data:

# split into training and testing sets
# (train_test_split lives in sklearn.model_selection in current scikit-learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)

# instantiate the vectorizer
vect = CountVectorizer()

# learn vocabulary and create document-term matrix in a single step
train_dtm = vect.fit_transform(X_train)

<4179x7456 sparse matrix of type '<type 'numpy.int64'>'
    with 55209 stored elements in Compressed Sparse Row format>

Note that the result is stored as a sparse matrix. Because the matrix is so large and so full of zeroes, a special format stores only the nonzero entries. Take a look at the number of columns.

7,456 words!!

This means that in our training set, there are 7,456 unique words to look at. We can now transform our test data to conform to our vocabulary:

# transform testing data into a document-term matrix
test_dtm = vect.transform(X_test)

<1393x7456 sparse matrix of type '<type 'numpy.int64'>'
    with 17604 stored elements in Compressed Sparse Row format>

Note that we have exactly the same number of columns, because the test set is transformed to conform to the vocabulary learned from the training set. No more, no less.
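The compressed sparse row format mentioned in the output can be illustrated with a small sketch. The function to_csr is a hypothetical helper, not SciPy's implementation, and the dense matrix is a toy one (the counts for the three example sentences, worked out by hand). The last lines use the shape and stored-element count reported for the training matrix.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR-style arrays.

    data    -> the nonzero values, row by row
    indices -> the column index of each nonzero value
    indptr  -> where each row starts and ends inside data/indices
    """
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


# Toy document-term matrix for the three example sentences.
dense = [
    [0, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 2, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data)
print(indices)
print(indptr)

# Fraction of nonzero cells in the training matrix reported above.
density = 55209 / (4179 * 7456)
print(f"{density:.4%}")
```

Fewer than 0.2% of the cells in the training matrix are nonzero, which is why storing only the nonzero values and their positions saves so much space.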

Now let's build a Naïve Bayes model (similar to the linear regression process):


# train a Naive Bayes model using train_dtm
from sklearn.naive_bayes import MultinomialNB
# import our model

nb = MultinomialNB()
# instantiate our model

nb.fit(train_dtm, y_train)
# fit it to our training set

Now the variable nb holds our fitted model. The training phase of the model involves computing the likelihood function, which is the conditional probability of each feature given each class. With the model fitted, we can make predictions on the test set:

# make predictions on test data using test_dtm
preds = nb.predict(test_dtm)


array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], 

The prediction phase of the model involves computing the posterior probability of each class given the observed features, and choosing the class with the highest probability.
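The two phases can be sketched from scratch in plain Python. This is a simplified illustration rather than scikit-learn's code: the function names and the tiny training set are made up, the add-one smoothing (alpha=1.0) is an assumption of this sketch that keeps an unseen word from zeroing out a product, and we add logarithms instead of multiplying probabilities so the numbers do not become vanishingly small.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Training phase: log priors and smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = sorted({tok for tokens in docs for tok in tokens})

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Prediction phase: pick the class with the highest log posterior."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped, as the vectorizer would do.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Tiny made-up training set, for illustration only.
docs = [
    ["send", "cash", "now"],
    ["win", "cash", "prize", "now"],
    ["call", "me", "tonight"],
    ["see", "you", "at", "lunch"],
    ["call", "you", "later"],
]
labels = ["spam", "spam", "ham", "ham", "ham"]

log_prior, log_likelihood = train(docs, labels)
print(predict(["send", "cash"], log_prior, log_likelihood))
print(predict(["call", "you", "tonight"], log_prior, log_likelihood))
```

On this toy data, the first message is labeled spam and the second ham. As in the text, the denominator is ignored because it is the same for every class.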

We will use sklearn's built-in accuracy score and confusion matrix to look at how well our Naïve Bayes model is performing:

# compare predictions to true labels
from sklearn import metrics
print(metrics.accuracy_score(y_test, preds))
print(metrics.confusion_matrix(y_test, preds))

accuracy == 0.988513998564
confusion matrix == 
[[1203    5]
 [  11  174]]

First off, our accuracy is great! Compared to our null accuracy, which was 87%, 99% is a fantastic improvement.

Now to our confusion matrix. From before, we know that each row represents actual values while columns represent predicted values so the top left value, 1,203, represents our true negatives. But what is negative and positive? We gave the model the strings spam and ham as our classes, not positive and negative.

We can check the order of the classes with the fitted model's classes_ attribute:

nb.classes_

array(['ham', 'spam'])

We can then line up the indices so that the 1,203 refers to true ham predictions and 174 refers to true spam predictions.

There were also five false spam classifications, meaning that five messages were predicted as spam, but were actually ham, as well as 11 false ham classifications.
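As a quick check, the accuracy can be recomputed from the four cells of the confusion matrix. This sketch hard-codes the matrix printed above; the variable names are ours.

```python
# Confusion matrix reported above: rows are actual classes, columns are
# predicted classes, both in the order ['ham', 'spam'].
confusion = [[1203, 5],
             [11, 174]]

(true_ham, false_spam), (false_ham, true_spam) = confusion
total = true_ham + false_spam + false_ham + true_spam

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (true_ham + true_spam) / total

# Null accuracy on the test set: always guess the majority class, ham.
null_accuracy = (true_ham + false_spam) / total

print(total, round(accuracy, 6), round(null_accuracy, 4))
```

The diagonal holds the 1,377 correct predictions out of 1,393 test messages, which reproduces the accuracy score reported by sklearn.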

In summary, Naïve Bayes classification uses Bayes theorem to estimate the posterior probability of each class, and then labels each data point with the most probable class.

