Chapter 8. Building a Recommendation Engine

In the last chapter, we looked at ways of classifying data into categories. We took the difficult problem of language detection and solved it with a Naive Bayes approach. Along the way, we took a lot of unorganized text data, cleaned it, and converted it into numerical form. Accomplishing this goal didn't require much code either, thanks to Haskell's expressive syntax for recursion and list processing. Haskell's beauty comes from its emphasis on type correctness and functional paradigms.

In this chapter, we will cover the following:

  • Analyzing the frequency of words in tweets
  • Removing stop words from tweets
  • Creating multivariate datasets
  • Understanding eigenvalues and eigenvectors
  • Performing simple linear algebra in Haskell
  • Creating a covariance matrix in Haskell
  • Eigen-solving in Haskell
  • Creating a recommendation engine based on PCA

The problems that we have encountered so far looked at data that fit nicely into one or, at most, two features. In Chapter 4, Plotting, we looked at the share prices of various companies over time. In Chapter 5, Hypothesis Testing, we compared the runs per game in baseball matches played at home stadiums to the runs per game in matches played at away stadiums. In Chapter 6, Correlation and Regression Analysis, we compared the runs per game to the win percentage in baseball data. The data analyst's job is relatively simple when working with bivariate data: in each example, the primary task is to compare the independent (input) variable to the dependent (output) variable. Life would be much easier if we could reduce every problem to a single input variable and a single output variable. We alluded to this in the chapter on estimating the win percentage of baseball teams: there is more to winning a game than scoring runs. Some variables help us explain the output (such as the runs scored per game and the on-base percentage), while others (such as the temperature at the start of the game) probably won't. You should not immediately rule out these obscure variables, but you can fall into the trap of looking for correlations in everything. At some point, you need to decide which variables to include and which to ignore.

In Chapter 6, Correlation and Regression Analysis, we discussed the statistical measure of covariance. Covariance measures how two variables vary together. A positive covariance score indicates that the two variables are related. A negative score indicates that the two variables are inversely related. A covariance score close to 0 (whether positive or negative) indicates that the two variables are probably not related. We also discussed Pearson's r² correlation coefficient, which is the square of the covariance after it has been normalized by the standard deviations of the two variables. The Pearson's r² score allows us to quickly judge the strength with which two variables correlate: a score near 1 indicates that the two variables correlate strongly, and a score near 0 indicates that they barely correlate at all. When working with many variables, our goal should be to improve the results by selecting the subset of input variables that correlate most strongly with the output variable.
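
As a refresher, a minimal sketch of how covariance and Pearson's r² can be computed over two equal-length lists of Doubles is shown below. Chapter 6 develops its own versions of these functions; the definitions here are only illustrative:

-- Average of a list of Doubles.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Covariance: the average product of the deviations from each mean.
covariance :: [Double] -> [Double] -> Double
covariance xs ys = mean $ zipWith (\x y -> (x - mx) * (y - my)) xs ys
  where
    mx = mean xs
    my = mean ys

-- Pearson's r: covariance normalized by both standard deviations.
pearsonR :: [Double] -> [Double] -> Double
pearsonR xs ys = covariance xs ys / (stddev xs * stddev ys)
  where
    stddev zs = sqrt (covariance zs zs)

-- Pearson's r², used to judge the strength of a correlation.
pearsonRsqrd :: [Double] -> [Double] -> Double
pearsonRsqrd xs ys = pearsonR xs ys ^ (2 :: Int)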

In this chapter, we will look at recommendation engines. Websites such as Amazon recommend products for you to purchase based on your prior purchases. Netflix famously offered 1 million dollars to the team that could best improve its already excellent movie recommendation engine. Facebook recommends people as your friends, and Twitter recommends people who might have interests similar to yours. Plenty of variables go into a recommendation engine, but what comes out is a single list of items with a score associated with each item. A select number of top-ranked items are passed along to you as your top recommendations. Some of these input variables are directly relevant to the recommendations, and others aren't.

Our goal in this chapter is to build our own recommendation engine based on the Twitter data that was obtained using the methods described in Chapter 7, Naive Bayes Classification of Twitter Data. Since writing that chapter, I have continued collecting tweets. The database used in this chapter consists of just over 50,000 tweets, a little over half of which are in the English language. We will mine just the English tweets and develop a recommendation engine that finds like-minded users for any given user.

This problem is rather difficult to solve, and there are lots of ways to approach it. In this chapter, we will attempt to solve it by discovering pairs of users who use commonly used words at similar rates. The thinking behind this approach is that if two people tend to use the same words in their speech, they are likely to have even more in common. (I picked this approach because it allows us to reuse the dataset collected in the last chapter.) To accomplish this, we first need to discover the most frequently used words in the tweets that Twitter tagged with the en language code.

Before we proceed further, we need to discuss the necessary software for this chapter. You need to install LAPACK on your system. LAPACK is a collection of algorithms used to solve common matrix-related tasks, such as the eigenvalue decomposition problem. On Debian-based systems, you should be able to install it using apt-get, as follows:

$ sudo apt-get install liblapack-dev

You will also need to install hmatrix, the Haskell matrix library that wraps LAPACK, using cabal, as follows:

$ cabal install hmatrix

We will discuss the Numeric.LinearAlgebra libraries later in the chapter. Here are the necessary imports for this chapter's LearningDataAnalysis08 Haskell module:

module LearningDataAnalysis08 where
import Numeric.LinearAlgebra.Data        -- matrix and vector types and constructors
import Numeric.LinearAlgebra.Algorithms  -- eigenvalue and other matrix decompositions
import Numeric.LinearAlgebra.HMatrix     -- the LAPACK-backed linear algebra interface
import Data.List as L
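
Once the module compiles, it is worth making sure that hmatrix can actually reach LAPACK. A minimal sanity check from GHCi might look like the following (the exact module layout varies between hmatrix versions, so treat this as a sketch):

> import Numeric.LinearAlgebra
> let m = (2><2) [2, 1, 1, 2] :: Matrix Double
> eig m

If the call to eig returns the eigenvalues 3 and 1 (as complex numbers) along with a matrix of eigenvectors, LAPACK is linked correctly.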

Analyzing the frequency of words in tweets

We will begin by pulling English tweets from our database. From the Haskell command line, we will query the database in the following way:

> :l LearningDataAnalysis04 LearningDataAnalysis06 LearningDataAnalysis07 LearningDataAnalysis08

> :m LearningDataAnalysis04 LearningDataAnalysis06 LearningDataAnalysis07 LearningDataAnalysis08

> import Data.HashMap.Strict as HM
> import Data.List as L

> tweetsEnglish <- queryDatabase "tweets.sql" "SELECT message, user FROM tweets WHERE language='en'"
> let tweets = zip (readStringColumn tweetsEnglish 0) (readStringColumn tweetsEnglish 1)
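
The queryDatabase and readStringColumn functions were written back in Chapter 4. For reference, a minimal sketch of them (assuming the HDBC and HDBC-sqlite3 packages used earlier in the book) looks roughly like this:

import Database.HDBC (SqlValue, disconnect, fromSql, quickQuery')
import Database.HDBC.Sqlite3 (connectSqlite3)
import Data.List (genericIndex)

-- Run a SQL query against an SQLite database file and return the raw rows.
queryDatabase :: FilePath -> String -> IO [[SqlValue]]
queryDatabase databaseFile sqlQuery = do
  conn <- connectSqlite3 databaseFile
  result <- quickQuery' conn sqlQuery []
  disconnect conn
  return result

-- Pull a single column of Strings out of the result set by column index.
readStringColumn :: [[SqlValue]] -> Integer -> [String]
readStringColumn records index =
  map (\record -> fromSql (genericIndex record index) :: String) records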

Using the frequency function presented in the last chapter, we will compute the set of unique tweets, as follows:

> let freqTable = frequency tweets
> -- Number of unique tweets
> HM.size freqTable
27348
> let uniqueTweets = keys freqTable
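
As a reminder, the frequency function from the last chapter counts how often each distinct value appears in a list by folding the list into a HashMap. A minimal sketch (the real definition lives in the Chapter 7 module) looks something like this:

import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

-- Count the occurrences of each distinct value in a list.
frequency :: (Eq k, Hashable k) => [k] -> HashMap k Int
frequency = foldl' (\hm key -> HM.insertWith (+) key 1 hm) HM.empty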

Since writing the last chapter, I've collected over 27,000 unique English tweets. Just as we did then, we must clean our tweets:

> let cleanedTweets = L.map (\(message, user) -> (removePunctuation $ clean message, user)) uniqueTweets
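
The clean and removePunctuation helpers also come from Chapter 7. Roughly speaking, clean strips out links and @-mentions, and removePunctuation lowercases the text and drops everything that is not a letter, digit, or space. A minimal sketch of this behavior (the actual definitions are in the Chapter 7 module) might be:

import Data.Char (isAlphaNum, isSpace, toLower)
import Data.List (isInfixOf, isPrefixOf)

-- Drop words that look like links or @-mentions.
clean :: String -> String
clean = unwords . filter keep . words
  where
    keep w = not ("http://" `isInfixOf` w
               || "https://" `isInfixOf` w
               || "@" `isPrefixOf` w)

-- Lowercase the text and keep only letters, digits, and spaces.
removePunctuation :: String -> String
removePunctuation = map toLower . filter (\c -> isAlphaNum c || isSpace c)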

We can now build a frequency HashMap of our cleaned tweets in the following way:

> let wordsFrequency = frequency $ concatMap (words . fst) cleanedTweets

Removing stop words from tweets

Now, we must take our word frequency HashMap, convert it into a list, and sort that list from largest to smallest. We will then take the first 25 items from the list:

> let topWords = take 25 $ sortBy (\(_, c1) (_, c2) -> compare c2 c1) $ HM.toList wordsFrequency
> topWords
[("a",30028),("rt",12594),("to",9815),("the",7754),("i",7506),("you",7425),("and",5403),("for",5337),("of",5312),("in",5187),("is",4832),("on",2906),("it",2723),("me",2665),("my",2648),("with",2617),("be",2564),("have",2507),("that",2281),("this",2203),("if",1959),("your",1891),("just",1832),("at",1807),("like",1732)]

Looking through this list, you will find that it's not interesting at all. The top two words are a (which was the term used to gather these tweets) and rt (which is Twitter-speak for retweet). The rest of the list consists of common words known as function words: words that are ambiguous in meaning until they are used in a sentence to provide context. We will filter these words out in order to work with more interesting data. In the field of natural language processing, these words are called stop words. A quick search with your favorite search engine will turn up websites with lists of stop words. I made my own list and added it to my LearningDataAnalysis08 module. A few Twitter-specific terms have been added to this list, such as rt (for retweet), mt (for modified tweet), and u (which is short for "you"):

stopWords :: [String]
stopWords = ["a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount",  "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as",  "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "dont", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "got", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "im", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "just", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "need", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "want", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "youre", "yours", "yourself", "yourselves", "rt", "mt", "u"]

Once your stop list has been created, you can filter out these words in the following way:

> let notStopWords = L.filter (\(word, _) -> notElem word stopWords) (HM.toList wordsFrequency)
> let topWords = take 25 . L.map fst $ sortBy (\(_, c1) (_, c2) -> compare c2 c1) notStopWords
> topWords
["like", "amp", "day", "good", "new", "love", "time", "follow", "great", "today", "make", "lot", "people", "video", "know", "life", "happy", "look", "think", "girl", "win", "photo", "way", "little", "really"]

You might argue that some of the words in this list are still effectively stop words. Even so, I hope you agree that this list is more interesting than the previous one, which was dominated by stop words. We will use this set of 25 words to represent the 25 features of each person in our dataset.
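
To make that plan concrete, here is one way a per-user feature vector could be assembled from these 25 words, reusing the frequency helper from earlier. This is only an illustrative sketch with assumed names (such as userFeatures); the rest of the chapter builds up its own version step by step:

import qualified Data.HashMap.Strict as HM

-- For a given user, compute the usage rate of each top word across all of
-- that user's cleaned tweets: each rate is a word count divided by the
-- total number of words the user wrote.
userFeatures :: [String]            -- the 25 top words
             -> [(String, String)]  -- (cleaned message, user) pairs
             -> String              -- the user of interest
             -> [Double]
userFeatures top tweets user = map rate top
  where
    userWords = concatMap (words . fst) $ filter ((== user) . snd) tweets
    totalWords = fromIntegral (length userWords) :: Double
    counts = frequency userWords
    rate w
      | totalWords == 0 = 0
      | otherwise = fromIntegral (HM.lookupDefault 0 w counts) / totalWords

A vector like this, computed for every user, gives us a 25-column multivariate dataset that the rest of the chapter can analyze.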
