In this chapter, we will be looking at three types of machine learning algorithms. The first two are examples of supervised learning, while the final algorithm is an example of unsupervised learning.
Our goal in this chapter is to apply the concepts learned in previous chapters to construct and use modern learning algorithms that glean insights and make predictions on real data sets. While we explore the following algorithms, we should always keep our metrics in mind.
Let's get to it!
Let's begin with Naïve Bayes classification. This machine learning model relies heavily on results from previous chapters, specifically Bayes' theorem:
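Written with H standing for a hypothesis and D standing for the data we observe, the theorem reads:

P(H | D) = P(D | H) · P(H) / P(D)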
Let's look a little closer at the specific features of this formula:
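- P(H) is the prior probability, which is our probability for the hypothesis before we see any data
- P(D | H) is the likelihood, which is the probability of observing the data if the hypothesis were true
- P(D) is the probability of observing the data under any hypothesis, and acts as a normalizing constant
- P(H | D) is the posterior probability, which is our updated probability for the hypothesis after taking the data into account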
Naïve Bayes classification is a classification model, and therefore a supervised model. Given this, what kind of data do we need?
(Insert jeopardy music here)
If you answered labeled data then you're well on your way to becoming a data scientist!
Suppose we have a data set with n features, (x1, x2, …, xn), and a class label, C. For example, let's take some data involving spam text classification. Our data would consist of rows of individual text samples and columns of both our features and our class labels. Our features would be words and phrases that are contained within the text samples, and our class labels are simply spam or not spam. In this scenario, I will replace the class not spam with the easier-to-say word, ham:
import pandas as pd
import sklearn

df = pd.read_table('https://raw.githubusercontent.com/sinanuozdemir/sfdat22/master/data/sms.tsv', sep='\t', header=None, names=['label', 'msg'])
df
Here is a sample of the text data in a row/column format:
Let's do some preliminary statistics to see what we are dealing with. Let's see the difference in the number of ham and spam messages at our disposal:
df.label.value_counts().plot(kind="bar")
This gives us a bar chart, as follows:
So we have WAY more ham messages than we do spam. Because this is a classification problem, it will be very useful to know our null accuracy rate, which is the percentage chance of predicting a single row correctly if we keep guessing the most common class, ham:
df.label.value_counts() / df.shape[0]

ham     0.865937
spam    0.134063
So if we blindly guessed ham, we would be correct about 87% of the time, but we can do better than that. If we have a set of classes, C, and features, xi, then we can use Bayes' theorem to predict the probability that a single row belongs to class C using the following formula:
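This is just Bayes' theorem with the class in the role of the hypothesis and the features in the role of the data:

P(C | x1, x2, …, xn) = P(x1, x2, …, xn | C) · P(C) / P(x1, x2, …, xn)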
Let's look at this formula in a little more detail:
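- P(C) is the prior probability of the class, that is, how common the class is before we look at any features (for us, the overall proportion of spam or ham messages)
- P(x1, x2, …, xn | C) is the likelihood, that is, the probability of seeing this particular combination of features in a message of class C
- P(x1, x2, …, xn) is the probability of seeing this combination of features in any message, and acts as a normalizing constant
- P(C | x1, x2, …, xn) is the posterior probability, which is what we actually want: the probability that the message belongs to class C given its features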
For example, imagine we have an e-mail with three words: send cash now.
We'll use Naïve Bayes to classify the e-mail as either being spam or ham:
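Plugging the e-mail into Bayes' theorem gives us two quantities to compare:

P(spam | send cash now) = P(send cash now | spam) · P(spam) / P(send cash now)

P(ham | send cash now) = P(send cash now | ham) · P(ham) / P(send cash now)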
We are concerned with which of these two numbers is larger. We can use the following criteria to classify any single text sample:

- If P(spam | send cash now) is larger than P(ham | send cash now), then we will classify the text as spam
- If P(ham | send cash now) is larger than P(spam | send cash now), then we will label the text as ham

Because both equations have P(send cash now) in the denominator, we can ignore it.
So now we are concerned with the following:
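P(send cash now | spam) · P(spam)    versus    P(send cash now | ham) · P(ham)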
Let's figure out the numbers in this equation:
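The two priors come straight from the class proportions we computed earlier: P(spam) ≈ 0.134 and P(ham) ≈ 0.866. That leaves the two likelihoods, P(send cash now | spam) and P(send cash now | ham).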
The final two likelihoods might seem like they would not be so difficult to calculate. All we have to do is count the number of spam messages that include the phrase send cash now and divide that by the total number of spam messages:
# make all strings lowercase so we can search more easily
df.msg = df.msg.apply(lambda x: x.lower())

df[df.msg.str.contains('send cash now')].shape

(0, 2)
Oh no! There are none! There are literally 0 texts with the exact phrase send cash now. The hidden problem here is that this phrase is very specific, and we can't assume that we will have enough data in the world to have seen this exact phrase many times before. Instead, we can make a naïve assumption within Bayes' theorem. If we assume that the features (words) are conditionally independent (meaning that no word affects the existence of another word), then we can rewrite the formula:
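P(send cash now | spam) = P(send | spam) · P(cash | spam) · P(now | spam)

and similarly for ham. Each of these per-word likelihoods is easy to compute directly from our data: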
spams = df[df.label == 'spam']

for word in ['send', 'cash', 'now']:
    print word, spams[spams.msg.str.contains(word)].shape[0] / float(spams.shape[0])

This reveals the likelihood of each individual word appearing in a spam message.
Meaning we can calculate the following:
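P(send cash now | spam) · P(spam) = P(send | spam) · P(cash | spam) · P(now | spam) · P(spam) ≈ .00032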
Repeating the same procedure for ham gives us the following:
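P(send cash now | ham) · P(ham) = P(send | ham) · P(cash | ham) · P(now | ham) · P(ham) ≈ .0000084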
The fact that these numbers are both very low is not as important as the fact that the spam probability is much larger than the ham calculation. If we calculate .00032 / .0000084 = 38.1, we see that the send cash now probability for spam is about 38 times higher than it is for ham.
Doing this means that we can classify send cash now as spam! Simple, right?
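If you would like to see the whole by-hand comparison in one place, the following is a minimal sketch that reuses the df we loaded earlier; the hams frame and the naive_score helper are purely illustrative names and not part of any library:

# a rough, by-hand naive Bayes comparison for the phrase send cash now
hams = df[df.label == 'ham']
spams = df[df.label == 'spam']

def naive_score(subset, words):
    # prior: the fraction of all messages that belong to this class
    score = subset.shape[0] / float(df.shape[0])
    # multiply in each per-word likelihood (the naive independence assumption)
    for word in words:
        score *= subset[subset.msg.str.contains(word)].shape[0] / float(subset.shape[0])
    return score

words = ['send', 'cash', 'now']
spam_score = naive_score(spams, words)
ham_score = naive_score(hams, words)
prediction = 'spam' if spam_score > ham_score else 'ham'

This is exactly the comparison we just worked through by hand; scikit-learn's implementation does the same bookkeeping for us (with some extra smoothing) over every word in the vocabulary.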
Let's use Python to implement a Naïve Bayes classifier without having to do all of these calculations ourselves.
First, let's revisit the count vectorizer in scikit-learn that turns text into numerical data for us. Let's assume that we will train on three documents (sentences):
# simple count vectorizer example
from sklearn.feature_extraction.text import CountVectorizer

# start with a simple example
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE 44!']

# learn the 'vocabulary' of the training data
vect = CountVectorizer()
train_simple_dtm = vect.fit_transform(train_simple)
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names())
Note that each row represents one of the three documents (sentences), each column represents one of the words present in the documents, and each cell contains the number of times each word appears in each document.
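To make this concrete, with CountVectorizer's default settings (text is lowercased and only tokens of at least two characters are kept, so the single-letter word a gets dropped), the document-term matrix for our three sentences should look roughly like this:

   44  cab  call  me  please  tonight  you
0   0    0     1   0       0        1    1
1   0    1     1   1       0        0    0
2   1    0     1   1       2        0    0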
We can then use the count vectorizer to transform new incoming test documents to conform with our training set (the three sentences):
# transform testing data into a document-term matrix
# (using existing vocabulary, notice don't is missing)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)
test_simple_dtm.toarray()
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names())
Note that our test sentence contained a new word, namely don't. Because the vectorizer had not seen that word in our training data, it simply ignored it when vectorizing. This is important, and it incentivizes data scientists to obtain as much data as possible for their training sets.
Now let's do this for our actual data:
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)

# instantiate the vectorizer
vect = CountVectorizer()

# learn vocabulary and create document-term matrix in a single step
train_dtm = vect.fit_transform(X_train)
train_dtm

<4179x7456 sparse matrix of type '<type 'numpy.int64'>'
    with 55209 stored elements in Compressed Sparse Row format>
Note that the result is stored as a sparse matrix, meaning the matrix is so large and so full of zeroes that there exists a special format to store only the nonzero entries. Take a look at the number of columns.
7,456 words!!
This means that in our training set, there are 7,456 unique words to look at. We can now transform our test data to conform to our vocabulary:
# transform testing data into a document-term matrix
test_dtm = vect.transform(X_test)
test_dtm

<1393x7456 sparse matrix of type '<type 'numpy.int64'>'
    with 17604 stored elements in Compressed Sparse Row format>
Note that we have exactly the same number of columns as before because the test set is conformed to exactly the same vocabulary as our training set. No more, no less.
Now let's build a Naïve Bayes model (similar to the linear regression process):
## MODEL BUILDING WITH NAIVE BAYES

# train a Naive Bayes model using train_dtm
from sklearn.naive_bayes import MultinomialNB  # import our model
nb = MultinomialNB()                           # instantiate our model
nb.fit(train_dtm, y_train)                     # fit it to our training set
Now the variable nb holds our fitted model. The training phase of the model involves computing the likelihood function, which is the conditional probability of each feature given each class.
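If you are curious about what the fitting step actually stored, one quick way to peek (just a sketch for inspection, not something we need for making predictions) is the fitted model's feature_log_prob_ attribute, which holds the logarithm of these conditional probabilities, with one row per class and one column per word in the vocabulary:

import numpy as np

# one row per class (ordered as in nb.classes_), one column per word
nb.feature_log_prob_.shape        # (2, 7456)

# exponentiate to get back to plain conditional probabilities
np.exp(nb.feature_log_prob_)

With the model fit, we can make predictions on our held-out test set: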
# make predictions on test data using test_dtm
preds = nb.predict(test_dtm)
preds

array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'],
      dtype='|S4')
The prediction phase of the model involves computing the posterior probability of each class given the observed features, and choosing the class with the highest probability.
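We can look at those posterior probabilities directly. As a small sketch, the fitted model's predict_proba method returns one row per test message with a probability for each class (columns ordered as in nb.classes_), and predict simply picks the class with the larger value:

# posterior probability of each class for every test message
probs = nb.predict_proba(test_dtm)
probs.shape        # (1393, 2)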
We will use sklearn's built-in accuracy and confusion matrix to look at how well our Naïve Bayes model is performing:
# compare predictions to true labels
from sklearn import metrics
print metrics.accuracy_score(y_test, preds)
print metrics.confusion_matrix(y_test, preds)

accuracy == 0.988513998564
confusion matrix ==
[[1203    5]
 [  11  174]]
First off, our accuracy is great! Compared to our null accuracy, which was 87%, 99% is a fantastic improvement.
Now to our confusion matrix. From before, we know that each row represents actual values while columns represent predicted values, so the top-left value, 1,203, represents our true negatives. But what is negative and positive? We gave the model the strings spam and ham as our classes, not positive and negative.
We can use the following:
nb.classes_

array(['ham', 'spam'])
We can then line up the indices, so that 1,203 refers to true ham predictions and 174 refers to true spam predictions.
There were also five false spam classifications, meaning that five messages were predicted as spam but were actually ham, as well as 11 false ham classifications.
In summary, Naïve Bayes classification uses Bayes' theorem to fit posterior probabilities of classes so that data points are labeled as belonging to the most probable class.