Chapter 3. Recommending relevant content

This chapter covers

  • Understanding recommendation engines based on users, items, and content
  • Finding recommendations about friends, articles, and news stories
  • Creating recommendations for sites similar to Netflix

In today’s world, we’re overwhelmed with choices, with a plethora of options available for nearly every aspect of our lives. We need to make decisions on a daily basis, from automobiles to home theatre systems, from finding Mr. or Ms. “Perfect” to selecting attorneys or accountants, from books and newspapers to wikis and blogs. In addition, we’re constantly being bombarded by information—and occasionally misinformation! Under these conditions, the ability to recommend a choice is valuable, and even more so if that choice is aligned with the person receiving the recommendation.

In the business of influencing your choice, no one is interested in good results more than advertising companies. The raison d’être of these entities is to convince you that you really need product X or service Y. If you have no interest in products like X or services like Y, they’ll be wasting their time, and you’ll be annoyed. The broadcasting approach of traditional advertising methods (such as billboards, TV ads, and radio ads) suffers from that problem. The goal of broadcasting is to alter your preferences by incessantly repeating the same message. An alternative, more pleasant, and more effective approach would be targeted to your preferences. It would entice you to select a product based on its relevance to your personal wants and desires. That’s where the online world and the intelligent advertising businesses on the internet distinguish themselves.

In this chapter, we’ll tell you everything you need to know about building a recommendation engine. You’ll learn about collaborative filtering and content-based recommendation engines. Throughout this chapter, we’ll use the example of recommending movies in an online movie store, and we’ll generalize it so that the proposed solutions are applicable to a variety of circumstances.

The online movie store is a simple example, but it’s concrete and detailed, making it easy to understand all the basic concepts involved in the process of writing a recommendation engine. We’ll look in detail at the definition of similarity between users and introduce the mechanisms by which this similarity metric can be used to perform recommendation. We’ll also further delve into a more complex technique known as singular value decomposition (SVD) that clusters the user and movie groups together via implicit relationships within the data. By the end of this chapter, you’ll be able to perform recommendation using a sizable dataset and understand the different mechanisms by which this is achieved.

Once we cover all the basic concepts in the online movie store, we'll make things a lot more interesting by presenting more-complicated cases. We'll cover the kinds of recommendation engines that are crucial to the online movie industry, online bookstores, and internet shopping carts.

3.1. Setting the scene: an online movie store

Let’s say that you have an online movie service that sells or makes money from movie downloads and streaming. Registered users log in to your application and can watch trailers of available movies. If a user likes a particular movie, she can add it to her shopping cart and purchase it later when she’s ready to check out from your store. Naturally, when users complete their purchase, or when they land on the pages of your hypothetical application, you want to suggest more movies to them. There are millions of movies and TV shows available, drawn from many different genres. In addition, many people are sensitive to the kind of shows and movies that they don’t like, so when you display content for a user, you want to target the genre the user likes and avoid those they don’t. If that sounds difficult, fear not. Recommendation engines are here to help you deliver the right content to your users.

A recommendation engine examines the selections that a user has made in the past and identifies the degree to which he would like a certain item that he hasn’t seen yet. It can be used to determine what types of shows your user prefers and the extent to which he does so, by comparing the similarity of his preferences with the characteristics of genres. For a more creative twist, you could help people establish a social network on that site based on the similarity of their movie and TV show tastes. It quickly becomes apparent that the crucial functional element of recommendation engines is the ability to define how similar to each other two (or more) users or two (or more) items are. That similarity can later be used to provide recommendations.

3.2. Distance and similarity

Let’s take some data and start exploring these concepts in detail. In the following sections, we’ll work with items, users, and ratings. In the context of recommendation engines, similarity is a measure that allows you to compare the proximity of two items in much the same way that the physical distance between two cities tells you how close they are to each other geographically. Distance and similarity differ only by the frame of reference or the coordinate space used. Consider that two cities may be geographically similar in that their distance in physical space is low, but culturally they may be very different, in that their distance as a measure of the interests and habits of the general population is very high. In general, the same distance calculation may be performed over different spaces to gain different measures of similarity.

For two cities, you’d use their longitude and latitude coordinates to calculate their geographical proximity. Think of the ratings as the coordinates in the space of items or users. Let’s look at these concepts in action. We’ll select three users and associate with them a list of movies (items) and their hypothetical rankings. As is typically the case, ratings will range between 1 and 5 (inclusive). The assignments for the first two users (Frank and Constantine) involve ratings that are either 4 or 5—these people really like all the movies we selected. But the third user’s ratings (Catherine’s) are between 1 and 3. So clearly, we expect the first two users to be similar to each other and be dissimilar to the third user. When we load the example data in the script, we have available the users, movies, and ratings shown in table 3.1.

Table 3.1. Ratings for the users show that Frank and Constantine agree more than Frank and Catherine. This data was modified, with permission, from the MovieLens database.[a]

a

GroupLens Research, MovieLens dataset, http://grouplens.org/datasets/movielens/.

User            Movie                            Rating
0: Frank        0: Toy Story                     5
                1: Jumanji                       4
                2: Grumpier Old Men              5
                3: Waiting to Exhale             4
                4: Father of the Bride Part II   5
                5: Heat                          4
                6: Sabrina                       5
1: Constantine  0: Toy Story                     5
                2: Grumpier Old Men              5
                4: Father of the Bride Part II   4
                5: Heat                          5
                7: Tom and Huck                  5
                8: Sudden Death                  4
                9: Goldeneye                     5
2: Catherine    0: Toy Story                     1
                2: Grumpier Old Men              2
                3: Waiting to Exhale             2
                6: Sabrina                       3
                9: Goldeneye                     2
                10: American President, The      1

Let’s begin by looking at the similarity between these users based on their ratings of different movies. The following listing provides the code. We use a cut-down and modified version of the MovieLens dataset, which contains the ratings of the 11 unique movies in table 3.1. These data files can be found in the resources associated with this book.

Listing 3.1. Calculating the similarity of users based on their ratings of movies

We’ve provided two definitions of similarity, which are invoked by providing a different value in the third argument of the similarity method introduced in listing 3.1.
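The body of listing 3.1 isn't reproduced here, but the data it operates on is exactly that of table 3.1. As a point of reference, here's a minimal sketch of those ratings held in a simple nested dictionary (user ID mapping to {movie ID: rating}); the exact structure used by the chapter's loading code is an assumption.

# Table 3.1 as a nested dictionary: user ID -> {movie ID: rating}
userdict = {
    0: {0: 5, 1: 4, 2: 5, 3: 4, 4: 5, 5: 4, 6: 5},    # Frank
    1: {0: 5, 2: 5, 4: 4, 5: 5, 7: 5, 8: 4, 9: 5},    # Constantine
    2: {0: 1, 2: 2, 3: 2, 6: 3, 9: 2, 10: 1},         # Catherine
}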

We’ll describe the detailed implementation of that code shortly, but first look at figure 3.1, which shows the results that we get if we compare among the three users considering the ratings only. As you can clearly see, Frank’s preferences in movies are more similar to Constantine’s than they are to Catherine’s.

Figure 3.1. The similarity between two users can be measured by evaluating the extent of overlap (if any) between the two lines in plots like this. Thus, Frank and Constantine (top) are more similar than Frank and Catherine (bottom).

The similarity between two users doesn’t depend on the order in which we pass the arguments in the similarity method. The similarity of Frank with himself is equal to 1.0, which we take to be the maximum value of similarity between any two entities. These properties stem from the fact that many similarity measures are based on distances, like the geometric distance between two points on a plane that you learned about in high school. In general, mathematical distances have the following four important properties:

  • All distances are greater than or equal to zero. In most cases, we constrain the similarities to be nonnegative like distances. In fact, we constrain the similarities within the interval [0,1].
  • The distance between any two points, say A and B, is zero if and only if A is the same point as B. In the example, and based on our implementation of similarity, this property is reflected in the fact that when two users have exactly the same ratings, the similarity between them will be equal to 1.0. You’ll see that this is true if you run listing 3.1, where we used the same user twice to show that the similarity is 1.0. Of course, you can create a fourth user and prove that the similarity will be equal to 1.0, provided the users have watched the same movies.
  • The third property of distances is symmetry—the distance between A and B is exactly the same as the distance between B and A. This means if Catherine’s movie taste is similar to Constantine’s movie taste, the reverse will also be true by exactly the same amount. Quite often we want the measure of similarity to preserve the symmetric property of distances, with respect to its arguments.
  • The fourth property of mathematical distances is the triangle inequality, because it relates the distances between three points. In mathematical terms, if d(A,B) denotes the distance between points A and B, then the triangle inequality states that d(A,B) <= d(A,C) + d(C,B) for any third point C. Looking at the output of listing 3.1, Frank is similar to Constantine by 0.391 and Constantine is similar to Catherine by 0.002, whereas Frank is similar to Catherine by 0.006, which is less than the sum of the first two similarities. Nevertheless, that property doesn’t hold, in general, for all similarities.

Relaxing the fourth fundamental property of distances when you pass on to similarities is fine; there’s no imperative to carry over the properties of distances to similarities. You should always be cautious to ensure that the mathematics involved is in agreement with what you consider reasonable, though. There’s a century-old counterexample to the triangle inequality when it comes to similarities that’s attributed to William James:[1] “A flame is similar to the moon because they are both luminous, and the moon is similar to a ball because they are both round, but in contradiction to the triangle inequality, a flame is not similar to a ball.” For an interesting account of similarities in relation to cognition, we recommend Classification and Cognition by W. K. Estes.[2]

1

William James, Principles of Psychology (Holt, 1890).

2

W. K. Estes, Classification and Cognition (Oxford University Press, 1994).

In figure 3.1, we show a visual representation of the similarity by plotting mutual ratings for our three example users. The closer the lines of the ratings, the more similar the users are; the farther apart the lines, the less the similarity. At the top of the figure, you can see that the lines for Frank and Constantine are close, depicting the similarity between them. At the bottom of the figure, where we show the ratings of Frank versus those of Catherine, the lines diverge and are far apart, which is in accordance with the low similarity value that we got during our calculation.

The plots of the ratings in figure 3.1 clearly display the somewhat reciprocal nature of distance and similarity. The greater the distance between the two curves, the smaller the similarity between the two users; the smaller the distance between the two curves, the greater the similarity between the two users. As you'll see in the next section, the evaluation of similarity often involves the evaluation of some kind of distance, although that's not always necessary. Of the two, distance is the more familiar concept; both distance and similarity can be viewed as special cases of the more general concept of a metric.

3.2.1. A closer look at distance and similarity

Now, let’s examine the code that helped us find the similarity between the users and look closely at how to calculate similarity. The code in the next listing shows the details of the similarity method, which accepts three arguments: the IDs of the two users we wish to compare and the kind of similarity we want to use.

Listing 3.2. Calculating the similarity between two users

We include two similarity formulas in the code to show that the notion of similarity is fairly flexible and extensible. Let’s examine the basic steps in the calculation of these similarity formulas. First, we take the differences between all the ratings of movies that the users have in common, square them, and add them together. The square root of that value is called the Euclidean distance; and as it stands, it’s not sufficient to provide a measure of similarity:

d(a,b) = sqrt( Σ_i (rating(a,i) - rating(b,i))^2 )

This equation states the definition of the Euclidean distance between user a and user b, d(a,b). It's expressed as the square root of the sum of the squared differences between the two users' ratings of the same films, where rating(a,i) denotes the rating that user a has given to film i and the sum runs over the films the two users have in common.

As we mentioned earlier, the concepts of distance and similarity are somewhat reciprocal, in the sense that the smaller the value of the Euclidean distance, the more similar the two users are. As a first naïve attempt, we could create a similarity score by adding 1 to the Euclidean distance and inverting the result. This would have the effect that large distances have small scores and vice versa. It would also ensure that a distance of zero returns a similarity score of 1.0.

At first sight, it appears that inverting the distance (after adding the constant value 1) might work. But this seemingly innocuous modification suffers from shortcomings. If two users have watched only one movie, one of them rated the movie 1, and the other gave a rating of 4, the sum of their differences squared is 9.0. In that case, the naïve similarity, based on the Euclidean distance, would result in a similarity value of 0.25. The same similarity value can be obtained in other cases. If the two users had watched three movies, and among these three movies their ratings differed by √3 (about 1.73 units) for each movie, their similarity would also be 0.25, according to the naïve similarity metric. Intuitively, we expect these two users to be more similar than the pair who watched a single movie and whose opinions differed by three units (out of five).
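To make the arithmetic concrete, a quick check (illustrative code only) shows that both cases produce the same naïve similarity of 0.25:

import math

def naive_similarity(squared_diffs):
    distance = math.sqrt(sum(squared_diffs))    # Euclidean distance
    return 1.0 / (1.0 + distance)               # add 1, then invert

print(round(naive_similarity([3 ** 2]), 3))                  # one movie, ratings differ by 3: 0.25
print(round(naive_similarity([math.sqrt(3) ** 2] * 3), 3))   # three movies, each differing by sqrt(3): 0.25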

The naïve similarity “squeezes” the similarity values for small distances (because we add 1) while leaving large distances (values of the distance much larger than 1.0) unaffected. What if we add another value? The general form of the naïve similarity is y = beta / (beta + x), where beta is our free parameter and x is the Euclidean distance. Figure 3.2 shows what the naïve similarity would look like for various values of the parameter beta between 1.0 and 2.0.

Figure 3.2. Naïve similarity curves as functions of the Euclidean distance

Keeping in mind the shortcomings of the naïve similarity, let’s look at the first (default) similarity definition between two users as shown in listing 3.2, where sim_type=0. If the users have some movies in common, we divide the sum of their squared differences by the number of common movies, take the positive square root, and pass on that value to a special function. This function is called the hyperbolic tangent function. We subtract the value of the hyperbolic tangent from 1.0, so that the final value of similarity ranges between 0.0 and 1.0, with 0.0 implying dissimilarity and 1.0 implying the highest similarity. Voilà! We’ve arrived at our first definition of similarity of users based on their ratings.

The second similarity definition that we present in listing 3.2, in the case where sim_type=1, improves on the first similarity by taking into account the ratio of the common items versus the number of all possible common items. That’s a heuristic that intuitively makes sense. If I’ve watched 30 movies and you’ve watched 20, we could have up to 20 common movies. Let’s say that we have only 5 movies in common, and we agree fairly well on these movies, which is nice; but why don’t we have more movies in common? Shouldn’t that somehow be reflected in our similarity? This is exactly the aspect of the problem that we’re trying to capture in the second similarity formula. In other words, the extent to which we watch the same movies should somehow affect the degree of our similarity as movie enthusiasts.
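The body of listing 3.2 isn't reproduced here, but the following sketch is consistent with the description above and with the userdict from the earlier sketch. The first definition reproduces the similarity values quoted in the text (0.391 and 0.006); the exact form of the second definition (the ratio-based adjustment) is an assumption based on the description above.

import math

def similarity(user_id_a, user_id_b, sim_type=0):
    # Assumes the userdict structure sketched earlier: {user ID: {item ID: rating}}
    ratings_a, ratings_b = userdict[user_id_a], userdict[user_id_b]
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    sum_sq = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common)
    # Average the squared differences over the common movies, take the square root,
    # and squash with tanh: identical ratings give 1.0, large disagreements tend to 0.0
    base = 1.0 - math.tanh(math.sqrt(sum_sq / float(len(common))))
    if sim_type == 0:
        return base
    # Second definition: scale by the fraction of the possible common movies the
    # two users actually share (our guess at the exact form used in the book)
    max_common = min(len(ratings_a), len(ratings_b))
    return base * (len(common) / float(max_common))

print(round(similarity(0, 1, 0), 3))    # Frank vs. Constantine: 0.391
print(round(similarity(0, 2, 0), 3))    # Frank vs. Catherine: 0.006
print(similarity(0, 0, 1))              # Frank vs. himself: 1.0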

3.2.2. Which is the best similarity formula?

It may be clear by now that you can use many formulas to establish the similarity between two users—or between two items, for that matter. In addition to the two similarities introduced earlier, we could’ve used a metric formula known as the Jaccard similarity between users, which is defined by the ratio of the intersection over the union of their item sets—or, in the case of item similarity, the ratio of the intersection over the union of the user sets. In other words, the Jaccard similarity between two sets, A and B, is defined by intersection(A,B) / union(A,B).
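A sketch of the Jaccard similarity over the users' item sets (again assuming the userdict structure used earlier) is short:

def jaccard_similarity(user_id_a, user_id_b):
    items_a = set(userdict[user_id_a])
    items_b = set(userdict[user_id_b])
    if not (items_a or items_b):
        return 0.0
    return len(items_a & items_b) / float(len(items_a | items_b))

# Frank and Constantine share 4 of the 10 distinct movies rated between them
print(jaccard_similarity(0, 1))    # 0.4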

Of course, you may naturally wonder, “Which similarity formula is more appropriate?” The answer, as always, is, “It depends.” In this case, it depends on your data. In one of the few large-scale comparisons of similarity metrics,[3] the simple Euclidean distance–based similarity showed the best empirical results among seven similarity metrics, despite the fact that other formulas were more elaborate and intuitively expected to perform better. Their measurements were based on 1,279,266 clicks on related community recommendations from September 22, 2004, through October 21, 2004 on the social networking website Orkut.

3

Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten, “Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network,” Proceedings of KDD 2005.

We don’t advise that you randomly choose your similarity metric, but if you’re in a hurry, consider the Euclidean or the Jaccard similarity first. It should give you decent results. You should try to understand the nature of your data and what it means for two users or two items to be similar. If you don’t understand the reasons why a particular similarity metric (formula) is good or bad, you’re setting yourself up for trouble. To stress this point, think of the common misconception “The shortest path between two points is a straight line that joins them.” That statement is true only for what we call flat geometries, such as a football field. To convince yourself, compare the distance of going over a tall but narrow hill versus going around the hill’s base. The straight line won’t be the shortest path for a wide range of hill sizes.

In summary, one of the cornerstones of recommendations is the ability to measure the similarity between any two users and the similarity between any two items. We’ve provided a number of similarity measures that you can use off the shelf, and the movie store exemplified the typical structure of the data you’d deal with in order to create recommendations. We’ll now move on to examine the types of recommendation engines and how they work.

3.3. How do recommendation engines work?

Now that you're armed with a good understanding of what similarity between two users or two items means, we can proceed with our description of recommendation engines. Generally speaking, there are two categories of recommendation engines. The first broad category is based on the analysis of content associated with the items, the users, or both. The main characteristic of this content-based approach is the accumulation and analysis of information related to both users and items. That information may be provided either by the software system or through external sources. The system can collect information about the users explicitly through their response to solicited questionnaires or implicitly through the mining of each user's profile or news-reading habits, emails, blogs, and so on. We won't cover this category of recommendation engines in detail because it requires a large degree of manual design and intervention and is rather application specific.

The second goes under the label collaborative filtering (CF). The first incarnation of CF appeared in an experimental mail system (circa 1992) developed at the Xerox Palo Alto Research Center (PARC).[4] CF relies on the breadcrumbs that a user leaves behind through interaction with a software system. Typically, these breadcrumbs are the user’s ratings, such as the movie ratings that we described in the previous section. Collaborative filtering isn’t limited to one-dimensional or discrete variables; its main characteristic is that it depends on past behavior rather than the content of each item in the collection of interest. CF requires neither domain knowledge nor preliminary gathering and analysis work to produce recommendations. It can be typically classified further into one of three types: item-based, user-based, and model-based.

4

David Goldberg et al., “Using Collaborative Filtering to Weave an Information Tapestry,” Communications of the ACM 35, no. 12 (December 1992), http://mng.bz/eZpM.

In item-based CF, we’re most interested in the similarity of items. For example, in the case of our movie recommendation example, we’d typically build a matrix of movies whereby matrix entries express some degree of similarity. Movies can then be recommended that are most similar to those that users have already seen or liked. Conversely, with user-based CF, the similarity of users is considered important. In this scenario, users may be recommended content if they’re similar to those who have seen or liked a particular item. In model-based CF, neither item-item similarity nor user-user similarity is calculated explicitly, but a model is introduced to capture the interaction between users and items.

In the following sections, we’ll provide an example of user-based CF, leaving the implementation of item-based CF as an exercise for you. We’ll then move on to an explicit implementation of model-based CF that uses singular value decomposition (SVD) in section 3.5.

Item-based CF vs. user-based CF

Depending on the data, there can be a distinct performance profile in the use of each of these methods of CF. If you consider the problem of movie recommendations, you’ll note that there are probably more users than there are movies; thus the user-user similarity matrix is much larger and therefore more expensive to calculate. If you take two users who have rated numerous films, you aren’t guaranteed much overlap in the films they’ve seen, because users probably watch only a small subset of available films. Conversely, the item-item matrix is cheaper to calculate, because it’s smaller. If you take two films with many ratings, it may (arguably, but again, depending on the data) be more likely that there is a reasonable degree of overlap in the users who have reviewed them, because there are likely more reviews for a film from individuals than a given individual has watched films. This makes the user-user similarity matrix harder to calculate and the item-item matrix (relatively) easier.

Despite the user-user similarity matrix being harder to calculate in a fully deployed movie recommendation system, we’ll be providing an implementation of user-based CF. In reality, the approach is identical, and the concept of user similarity with respect to the films users have rated is easier to reason about than the concept of item similarity with respect to the users who have rated them.

3.4. User-based collaborative filtering

An ancient Greek proverb (with similar variants in nearly every culture of the world) states, “Show me your friends, and I’ll tell you who you are.” Collaborative filtering based on neighborhoods of similar users is more or less an algorithmic incarnation of that proverb. In order to evaluate the rating of a particular user for a given item, you look for the ratings of similar users (neighbors or friends, if you prefer) on the same item. Then, you multiply the rating of each friend by a weight and add them—it’s that simple! The following listing shows the code to provide a recommendation based on user similarity. We use the same data as we did to demonstrate different measures of similarity—the data given in table 3.1.

Listing 3.3. User-based CF recommendation

Listing 3.3 provides the high-level code to perform user-based CF. We read in the data using the load method in recsys.datamodel.data. The recommend method is then used to provide recommendations of movies that the user should consider. More accurately, the method provides a prediction of the ratings that the user would give for particular movies. These are ordered, and the top 10 are returned. The recommend method predicts ratings for only those items that haven’t already been rated by the user. With the dataset used, there are fewer than 10 movies unseen for each user; hence, this limit is never hit. You might also come across the “None” rating. This occurs only where the user hasn’t rated the movie and no other users in the dataset have rated it, either. Thus, this represents not a zero rating but a lack of rating.
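The body of listing 3.3 isn't reproduced in full here; the following is a minimal sketch of the flow just described. The file name, separator, format mapping, the direct iteration over the Data object (which listing 3.7 also relies on), and the way userdict and itemdict are built are all assumptions rather than the book's exact code.

from recsys.datamodel.data import Data

# Load the cut-down ratings file described earlier (path and format assumed)
data = Data()
data.load('./ml-dataset-cut.dat', sep='\t',
          format={'col': 0, 'row': 1, 'value': 2, 'ids': int})

# Index the ratings into the dictionaries used by the helper functions:
# userdict maps each user to their ratings, itemdict maps each item to its raters
userdict, itemdict = {}, {}
for rating, item_id, user_id in data:
    userdict.setdefault(user_id, {})[item_id] = rating
    itemdict.setdefault(item_id, {})[user_id] = rating

# Top-10 predicted ratings for Constantine (user 1), using recommend from listing 3.4
print(recommend(user_id=1, top_n=10))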

To provide an example, if you run the code, you’ll see that Constantine is recommended Jumanji, Waiting to Exhale, and Sabrina, mainly because Frank has rated these films. Constantine is similar to Frank (as you saw previously), and Frank has rated them highly. It seems that our recommendation engine is working well. You’ll note that the recommendations for each user are only for films that they haven’t seen or rated themselves, and that recommendations appear to be made based on the tastes of similar users.

How did the recommend method arrive at these conclusions? How can it find the similar users (friends) for any given user? How can it recommend movies from the list of movies that a user has never watched? Let's go through the basic steps to understand what happens. Recommendation engines that are based on collaborative filtering proceed in two steps. First, they calculate the similarity between either users or items. Then they use a weighted average to calculate the rating that a user would give to a yet-unseen item. Let's delve further into the recommend method, shown in the following listing, to see this in action.

Listing 3.4. recommend method
def recommend(user_id, top_n):
    # Build (item, predicted rating) pairs for every item the user hasn't rated yet
    recommendations = []
    for i in itemdict.keys():
        if (int(i) not in items_reviewed(int(user_id), userdict)):
            recommendations.append((i, predict_rating(user_id, i)))
    # Sort by predicted rating, highest first, and keep the top_n entries
    recommendations.sort(key=lambda t: t[1], reverse=True)
    return recommendations[:top_n]

This listing demonstrates that the recommend method is relatively simple and relies heavily on the predict_rating method. recommend takes a user_id and the top number of recommendations that should be returned. It iterates through every item in the item dictionary, and if the user has not yet rated the item, predict_rating is called and the result is appended to the recommendations list. The recommendations list is then sorted and passed back to the caller. The next listing shows the inner workings of the predict_rating method.

Listing 3.5. Predicting the ratings of a user
def predict_rating(user_id, item_id):
    estimated_rating = None
    similarity_sum = 0
    weighted_rating_sum = 0

    # If the user has already rated this item, return the actual rating
    if (int(item_id) in items_reviewed(user_id, userdict)):
        return get_score_item_reviewed(user_id, item_id, userdict)
    else:
        # Otherwise, take a weighted average of the ratings given to this item
        # by the other users, weighted by each user's similarity to this one
        for u in userdict.keys():
            if (int(item_id) in items_reviewed(u, userdict)):
                item_rating = get_score_item_reviewed(u, item_id, userdict)
                user_similarity = similarity_matrix.get_user_similarity(user_id, u)
                weighted_rating = user_similarity * item_rating
                weighted_rating_sum += weighted_rating
                similarity_sum += user_similarity

        if (similarity_sum > 0.0):
            estimated_rating = weighted_rating_sum / similarity_sum

    return estimated_rating

In this listing, you see that predict_rating takes two parameters: the ID of the user of interest and the ID of the item of interest. It returns a prediction of the rating that this user may give to this item.

Stepping through the code line by line, you see that if the user has already rated the item, we don’t attempt to predict the rating; we merely return the actual rating given to the item by that user. If the user hasn’t rated the item, then we proceed as per user-based CF. That is, for each user in the system, we check whether that user has reviewed the item for which we wish to make a prediction. If they have, we get the score that they input, along with a calculation of how similar this user is to the one for whom we wish to make a prediction. We then calculate a weighted rating, which is the similarity of the two users multiplied by the rating given by this new user. In order to calculate a value that considers all the users in the system, we calculate the sum of such weighted_ratings along with the sum of the similarities. The latter is used to normalize the values back to the range [0,5]. Provided that there are some users in the dataset who are in some way similar to the user we’re attempting to make the recommendation for, and that they have rated the item we’re interested in, we’re guaranteed to be able to make an estimation regarding the rating with this code.

Although the code works perfectly well for our purposes, note that already there are some serious performance issues that we might want to look at. Considering listings 3.4 and 3.5 together, you can see that we're essentially looping through every item and every user in the system. We might wish to think about other data structures that could avoid this; for example, we might consider storing users in a dictionary keyed by the items they've rated (which would result in only the users who have rated an item being iterated through, rather than all users as in listing 3.5). This is just one of many performance improvements that might be considered for a production system with user-based CF.
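As an illustration of that last point (a sketch, not one of the book's listings), an item-to-users index could be built once from the userdict structure used above and consulted inside predict_rating:

# Index each item against the set of users who have rated it
users_by_item = {}
for user_id, ratings in userdict.items():
    for item_id in ratings:
        users_by_item.setdefault(item_id, set()).add(user_id)

# predict_rating's loop over userdict.keys() could then become a loop over
# users_by_item.get(item_id, set()), visiting only the users who rated the item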

So far, we’ve presented a broad outline of our user-based CF system, but we’re missing some specific details from the implementation in listing 3.5: the calculation of the similarity matrix. The next listing outlines the class responsible for this calculation.

Listing 3.6. SimilarityMatrix class

The SimilarityMatrix class encapsulates the storage, calculation, and access to the matrix that defines how similar each user is with respect to another. When it’s first instantiated, it calls the build method, which calculates the similarities of the users stored in the userdict dictionary object. There are two important things to note about this. First, we use the upper triangular form of the matrix. As we mentioned in section 3.2, it may be sensible for our similarities to have the property of symmetry. That is, if Catherine is similar to Frank, then Frank is similar to Catherine. We do indeed take this property for our implementation of user similarity, and because of this we can use the upper triangular form of the similarity matrix. Put simply, there’s no point in storing the similarity between a and b if we already have the similarity between b and a. This halves the size of the similarity matrix along with the amount of computation required to calculate it.
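The body of listing 3.6 isn't reproduced here, but a sketch of what the class might look like, based on the description above, follows. It stores the upper triangular similarities in a dictionary and computes each one with the RatingCountMatrix helper defined in listing 3.7; the agreement-over-total formula anticipates the discussion that follows that listing. Treat the details (key types, storage choice) as assumptions rather than the book's exact code.

class SimilarityMatrix:
    def __init__(self):
        self.similarities = {}                  # keyed by (low user ID, high user ID)
        self.build()

    def build(self):
        user_ids = sorted(userdict.keys())
        for pos, a in enumerate(user_ids):
            for b in user_ids[pos:]:            # upper triangular form: b >= a
                rcm = RatingCountMatrix(a, b)
                total = rcm.get_total_count()
                sim = rcm.get_agreement_count() / total if total > 0 else 0.0
                self.similarities[(a, b)] = sim

    def get_user_similarity(self, user_id_a, user_id_b):
        # Symmetry lets us store only one ordering of each pair
        key = (min(user_id_a, user_id_b), max(user_id_a, user_id_b))
        return self.similarities.get(key, 0.0)

similarity_matrix = SimilarityMatrix()          # the instance used by predict_rating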

The second thing you may note is the use of another helper class called RatingCountMatrix, the definition of which is presented in the following listing.

Listing 3.7. RatingCountMatrix class
class RatingCountMatrix:
    user_id_a = None
    user_id_b = None
    matrix = None

    def __init__(self, user_id_a, user_id_b):
        # The agreement matrix is square, with one row/column per possible rating
        # value (this assumes each tuple in data holds the rating in position 0)
        num_rating_values = max([x[0] for x in data])
        self.user_id_a = user_id_a
        self.user_id_b = user_id_b
        self.matrix = np.empty((num_rating_values, num_rating_values,))
        self.matrix[:] = 0
        self.calculate_matrix(user_id_a, user_id_b)

    def get_shape(self):
        a = self.matrix.shape
        return a

    def get_matrix(self):
        return self.matrix

    def calculate_matrix(self, user_id_a, user_id_b):
        # For every item rated by both users, increment the cell indexed by the
        # two ratings (shifted by -1 so that a rating of 1 maps to index 0)
        for item in items_reviewed(user_id_a, userdict):
            if int(item) in items_reviewed(user_id_b, userdict):
                i = get_score_item_reviewed(user_id_a, item, userdict) - 1
                j = get_score_item_reviewed(user_id_b, item, userdict) - 1
                self.matrix[i][j] += 1

    def get_total_count(self):
        # Number of items that both users have rated
        return self.matrix.sum()

    def get_agreement_count(self):
        # Number of common items on which both users gave exactly the same rating
        return np.trace(self.matrix)  # sum across the diagonal

The RatingCountMatrix class is used to calculate the degree of similarity between two users, based on the items that both have rated and on their level of agreement in those ratings. As you can see, it's not enough for two users to have both rated a movie; they also must broadly agree on its quality. RatingCountMatrix encapsulates this through a matrix that counts the co-occurrence of their ratings. For users who always agree on the ratings for the movies they've seen, we'd expect the entries along the diagonal of the matrix to be populated with higher numbers: that is, M[i][j] where i = j. For those who never agree or who have opposing views, we'd expect the entries farthest from the diagonal to be populated: that is, M[i][j] where i = min(rating) and j = max(rating), or where i = max(rating) and j = min(rating).

The calculation of similarity falls out of the matrix intuitively, as you saw in listing 3.6. We can use the total number of ratings that the users have agreed on, divided by the total number of items that both users have provided a rating for. Note again that our computational performance characteristics are likely far below optimal, because we recalculate the RatingCountMatrix every time we wish to determine the similarity score of two users. This is acceptable for our illustrative example but would be far from appropriate for a reasonably large system.
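For example, with the table 3.1 data loaded (and the chapter's helper functions in scope), Frank and Constantine share four rated movies and give identical ratings to two of them, so this similarity works out to 0.5:

rcm = RatingCountMatrix(0, 1)                               # Frank vs. Constantine
print(rcm.get_agreement_count() / rcm.get_total_count())    # 2.0 / 4.0 = 0.5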

Thus far, we’ve focused on a user-based CF system. You should easily be able to see how we might flip this system on its head (so that users are treated as items and items as users). Rather than provide comprehensive coverage of this here, we’ll leave the exercise of completing an item-based CF system to you and move on to another form of collaborative filtering known as model-based CF.

Specifically, the model we’ll use is singular value decomposition (SVD). Model-based approaches are powerful and in this case help us understand a deeper relationship between users and movies. For SVD, this model consists of a third, unobservable space that we’ll call a latent space. This captures that there is some unobservable reason why users make certain choices (affinity to certain genres of movies, for example). Rather than ignoring this as with user- and item-based CF, we attempt to capture it.

3.5. Model-based recommendation using singular value decomposition

In the previous example, we essentially clustered users in a way that allows for the most similar users to contribute to the predicted ratings for a user. Both in this case and in item-based CF, we don’t presume anything about the distributions between users and item similarities. Herein lies the principal difference between user-based and item-based CF when compared with model-based CF.

In model-based CF, an additional assumption or model is added to the approach. In this case, using SVD, we assume that items and users are related through some latent space, not captured by either users or items. If this sounds somewhat exotic, don't worry. All it means is that users map to items, and items map to users through some other, third space. You can think of this as a genre space if you like, although it's technically not quite correct. If a user's preference is determined by the algorithm as equal parts drama, romance, and action, and a film is similarly (and automatically) determined as such, then they may be given a recommendation for that film, provided they haven't already seen it. In reality, this latent space of genres isn't labeled as such and is determined automatically by the algorithm. This should give you a flavor of how SVD works as a recommender—but if not, never fear; we'll cover the details in this section.

3.5.1. Singular value decomposition

SVD is a method of factorizing a matrix. Factorizing just means splitting something up, such that if you multiply these elements together you get the original item that was factorized. For numbers, this is relatively simple (10 can factorize into 2 × 5, for example), but for matrix multiplication this can be a little harder to do in your head.

Wolfram MathWorld[5] tells us that a given matrix A, which has m rows and n columns, with m>n, can be written in the form

5

“Singular Value Decomposition,” Wolfram MathWorld, 2015, http://mng.bz/TxyY.

A = UDV^T

where U is m × m, D is m × n, and V is n × n. D has entries only along the diagonal, and both U and V have orthogonal columns (that is, U^TU = I = V^TV). This last point is important because it means the directions that make up the latent space are all orthogonal; that is, none of them can be expressed as a linear combination of the others. We'll return to this in a moment.

You might think this is all very well, but what does it have to do with recommender systems? Actually, quite a lot. Let's imagine that matrix A is m × n, where m is the number of users and n is the number of films. An entry in the matrix indicates the rating that a given user has given to a given film. Now, what if we could split that into U, an m × r matrix that provides information regarding each user's affinity to a new, smaller space of size r? Similarly, we do the same with V^T, such that we create an r × n matrix, which provides each movie's affinity to the r dimensions of that space. Finally, we try to understand how important each of the r dimensions is in predicting the scores users give to items. These are known as the singular values of the factored matrix and are placed on the diagonal of D in non-increasing order, with the columns and rows of the user (U) and item (V^T) matrices, respectively, being organized appropriately.

Note that in this example we use a latent space of size r. In general, you may choose to keep the top number of singular values—say, k—to capture the most important dimensions in the latent space while maintaining as compact a form as possible. You may notice a similarity to the concept of dimensionality reduction through eigenvalue decomposition from chapter 2. This isn't a coincidence! If you're interested, we refer you to Gerbrands[6] for more information regarding the relationship between the two concepts.

6

Jan J. Gerbrands, “On the Relationships between SVD, KLT and PCA,” Conference on Pattern Recognition (1980).

To return to the orthogonality of the latent space in U and V, this conceptually makes a lot of sense. Let's take an example with matrix U. If we pick any two columns, then the dot product of the two column vectors will equal zero. Latent features are selected such that the features are as far away as possible from each other with regard to viewership, thus being maximally descriptive. Another way to think about this is that it makes sense to select features that are far away from each other in user space. If sci-fi fans never watch romantic comedies and vice versa, then such a feature would be useful in providing a recommendation. Simultaneously, we know that any two columns in V, or equivalently any two rows in V^T, are also orthogonal. That is, latent features are selected such that the features are as far away as possible from each other with regard to film membership. To return to our imperfect genre parallel, genres are created such that they're far away from each other in film space. For example, sci-fi films are never comedies. It's important to stress that these features aren't actually genres, as we've used in the illustration, but rather some other property that encapsulates the relationship between films and viewers.

Working through an example, if we already have A factorized, it's trivial to see how an individual rating can be calculated. The user's position in this r-dimensional latent space is looked up (a bit like their percentage interest in r different orthogonal genres), and each of its r coordinates is multiplied by the global importance of that dimension (the corresponding singular value). This leaves us with a vector of length r. Finally, the coordinates in r dimensions are also looked up for the item whose rating we want to predict (a bit like its percentage affinity to r different orthogonal genres), and we take the dot product of the two vectors. This leaves us with a number, which is the predicted rating for that user and that movie. Note that even though we're using the concept of genres here, this is for illustration only. Although it might be possible to overlay semantic meaning onto these dimensions after training, they have no such explicit meaning in reality.
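As a toy illustration of that lookup (plain numpy rather than the recsys library, with unrated entries crudely left at 0 instead of being mean-centered), the following sketch factorizes a tiny ratings matrix, truncates it to k dimensions, and predicts one missing rating:

import numpy as np

# Rows are users, columns are movies; 0.0 stands in for "not rated"
A = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 5.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

U, d, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # keep only the top-k singular values
Uk, Dk, Vtk = U[:, :k], np.diag(d[:k]), Vt[:k, :]

# Predicted affinity of user 0 for movie 2 (unrated by user 0): the user's latent
# coordinates, scaled by the singular values, dotted with the movie's coordinates
prediction = Uk[0, :].dot(Dk).dot(Vtk[:, 2])
print(round(float(prediction), 2))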

Training a recommender system is equivalent to performing matrix factorization. Recommendations can be looked up directly from the factored matrix form, mirroring the example provided in the last paragraph. If this is a little surprising, consider that the factorization is trying to make the connections between the users, movies, and latent space that it thinks should be there. Essentially, it’s trying to rationalize the factorization and minimize its error in order to model the underlying relationship. If you think about it carefully, you’ll realize that recommendations spring from the error in the factorization. We’ll cover this in more detail in the next section, where you’ll see a recommendation system based on SVD.

3.5.2. Recommendation using SVD: choosing movies for a given user

In this section, we're going to use SVD to provide recommendations of movies to users. As before, we'll use the MovieLens dataset, but rather than select only a few movies and users to demonstrate the concept, we'll use all the data available to us. Recall that the MovieLens dataset contains numerous user-movie ratings. We'll use the power of SVD to build a latent-space matrix—that is, to understand whether there is some pattern between users given the movies that they rate highly.

We’ll also introduce a new package here that isn’t found in scikit-learn. The python-recsys package[7] is lightweight yet powerful enough for us to easily demonstrate the use of SVD for recommendation, so we’ll use this package throughout this section; see the following listing. Make sure you have it installed before running the code examples.

7

Oscar Celma, python-recsys, https://github.com/ocelma/python-recsys.

Listing 3.8. SVD for recommendation

This listing provides a high-level overview of how we use the recsys package to perform SVD on the freely available MovieLens ml-1m dataset, which you should download. This dataset contains 1 million user ratings, and we'll use all of them to build a recommender. We start by creating an instance of the SVD object and reading in both the users' movie ratings and further information regarding each movie (genres, year of release, and so on). Although the model doesn't use the additional data, we could use it to reason about how sensible the recommendations returned are. In the final line, we pass only the data file to the SVD object, specifying the format of the file.
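The body of listing 3.8 isn't reproduced in full here; the sketch below follows the python-recsys conventions for the ml-1m files. The file paths and the reading of movies.dat into Item objects are assumptions modeled on the package's documented MovieLens example rather than the book's exact code.

from recsys.algorithm.factorize import SVD
from recsys.datamodel.item import Item

svd = SVD()

# Movie metadata (title, genres), kept so we can inspect the recommendations later.
# Splitting on '::' leaves a trailing newline on the genre field, which would
# explain the '\n' visible in the output listings that follow.
items_full = {}
for line in open('./ml-1m/movies.dat'):
    movie_id, title, genres = line.split('::')
    item = Item(movie_id)
    item.add_data({'Title': title, 'Genres': genres})
    items_full[movie_id] = item

# ratings.dat rows look like user::movie::rating::timestamp
svd.load_data(filename='./ml-1m/ratings.dat', sep='::',
              format={'col': 0, 'row': 1, 'value': 2, 'ids': int})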

As in previous examples, we’ve yet to actually do any learning. Data has been passed to an object that will perform this learning, but the actual computation has been deferred. The next listing shows the code to train the model, or rather factorize the matrix that was created through the load method of the Data class.

Listing 3.9. Factorizing a matrix using SVD for recommendation
k = 100
svd.compute(k=k, min_values=10,
            pre_normalize=None, mean_center=True, post_normalize=True)

films = svd.recommend(10,only_unknowns=True, is_row=False)

We call svd.compute with five parameters. These parameters control the execution of matrix factorization. For completeness, we reproduce the documentation here[8] and discuss the impact of these options on factorization:

8

Oscar Celma, Python Recsys v1.0 documentation, http://mng.bz/UBiX.

  • min_values—Remove rows or columns that have fewer than min_values non-zeros.
  • pre_normalize—Normalize the input matrix. Possible values are these:

    • tfidf (default)—Treat the matrix as terms-by-documents. It’s important, then, how the data is loaded. Use the format param in svd.load_data() to determine the order of the fields of the input file.
    • rows—Rescale the rows of the input matrix so that they all have unit Euclidean magnitude.
    • cols—Rescale the columns of the input matrix so that they all have unit Euclidean magnitude.
    • all—Rescale the rows and columns of the input matrix, by dividing both the rows and the columns by the square root of their Euclidean norm.
  • mean_center—Center the input matrix (aka mean subtraction).
  • post_normalize—Normalize every row of UD to be a unit vector. Thus, row similarity (using cosine distance) returns [-1.0 .. 1.0].
  • savefile—Output file to store SVD transformation (U, D, VT vectors).

You see here that the majority of the options control data pre- or post-processing. With the min_values parameter, we have the option to remove movies or users with fewer than 10 ratings. This should reduce the matrix size significantly and concentrate on only those movies or users with enough data to provide recommendations for. The second and third parameters provide normalization and centering of the data. Normalization is important because it provides guarantees regarding the variance of the data. Centering modifies the point of reference for this variance. Both parameters can help with the stability of the algorithm performing SVD, which in this case is the Single-Vector Lanczos Method,[9] called via the SVDLIBC library.[10]

9

C. Lanczos, “An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators,” Journal of Research of the National Bureau of Standards 45, no. 4 (October 1950), http://mng.bz/cB2d.

10

Doug Rohde, SVDLIBC, http://tedlab.mit.edu/~dr/SVDLIBC/.

Let’s now continue with the example and perform recommendation with the model. In the last line of listing 3.9, we call svd.recommend(), passing in three parameters. The first parameter is the user we’re trying to determine recommendations for; in this case, we’ve randomly chosen user 10. We also pass in two additional parameters: only_unknowns=True and is_row=False. The first of these tells the method to return ratings only for items that haven’t been rated by user 10. The latter is a peculiarity of the library we’re using, because is_row=False tells the recommendation to occur over columns: that is, users, not items (which are along the rows). Because the model used here is that of matrix factorization, we could just as easily have provided an item ID (movie) and asserted is_row=True in order to determine a list of appropriate users for that movie. In contrast to the user-based and item-based methods discussed earlier in this chapter, this method can provide similar functionality to each without a huge amount of change to the code.

Enough of the details—what about the results of the recommendation? Let’s take a look. The next listing shows the films that have been recommended.

Listing 3.10. Output from model-based recommendation
[items_full[str(x[0])].get_data() for x in films]

[{'Genres': 'Action|Adventure\n', 'Title': 'Sanjuro (1962)'},
 {'Genres': 'Crime|Drama\n', 'Title': 'Pulp Fiction (1994)'},
 {'Genres': 'Crime|Thriller\n', 'Title': 'Usual Suspects, The (1995)'},
 {'Genres': 'Comedy|Crime\n', 'Title': 'Some Like It Hot (1959)'},
 {'Genres': 'Drama\n', 'Title': 'World of Apu, The (Apur Sansar) (1959)'},
 {'Genres': 'Documentary\n', 'Title': 'For All Mankind (1989)'},
 {'Genres': 'Animation|Comedy\n', 'Title': 'Creature Comforts (1990)'},
 {'Genres': 'Comedy|Drama|Romance\n', 'Title': 'City Lights (1931)'},
 {'Genres': 'Drama\n', 'Title': 'Pather Panchali (1955)'},
 {'Genres': 'Drama\n', 'Title': '400 Blows, The (Les Quatre cents coups) (1959)'}]

Here you can see that user 10 has been recommended some action/adventure, crime/drama, and crime/thriller. They seem to like a bit of comedy too, as well as some animation. Looks reasonable, but we don’t know anything about the user. The following listing digs in to the rating information for user 10 to see what they’ve watched and rated before.

Listing 3.11. Films rated by user 10
get_name_item_reviewed(10,user_full,items_full)

[(2622, "Midsummer Night's Dream, A (1999)", 'Comedy|Fantasy\n', 5.0),
 (3358, 'Defending Your Life (1991)', 'Comedy|Romance\n', 5.0),
 (1682, 'Truman Show, The (1998)', 'Drama\n', 5.0),
 (2125, 'Ever After: A Cinderella Story (1998)', 'Drama|Romance\n', 5.0),
 (1253, 'Day the Earth Stood Still, The (1951)', 'Drama|Sci-Fi\n', 5.0),
 (720, 'Wallace and Gromit: The Best of Aardman Animation (1996)', 'Animation\n', 5.0),
 (3500, 'Mr. Saturday Night (1992)', 'Comedy|Drama\n', 5.0),
 (1257, 'Better Off Dead... (1985)', 'Comedy\n', 5.0),
 (3501, "Murphy's Romance (1985)", 'Comedy|Romance\n', 5.0),
 (1831, 'Lost in Space (1998)', 'Action|Sci-Fi|Thriller\n', 5.0),
 (3363, 'American Graffiti (1973)', 'Comedy|Drama\n', 5.0),
 (587, 'Ghost (1990)', 'Comedy|Romance|Thriller\n', 5.0),
 (150, 'Apollo 13 (1995)', 'Drama\n', 5.0),
 (1, 'Toy Story (1995)', "Animation|Children's|Comedy\n", 5.0),
 (2, 'Jumanji (1995)', "Adventure|Children's|Fantasy\n", 5.0), ...

This is only a very small portion of 401 movies rated by this user, but we certainly find some animation/comedy coming through (both Wallace and Gromit and Creature Comforts are made by the same production company, Aardman Animations). We also see comedy and drama feature in both the rated movies and the recommended movies. Note, however, that the user gave 154 ratings of 5, so this small list isn’t completely indicative of this user’s interests. Let’s dig in a little further and look at the distribution of genres for user 10. Figure 3.3 provides the genres for which the user has rated higher than 3, and figure 3.4 provides the genres for which the user has rated lower than 3. We don’t report on films that have been rated with the average score of 3.

Figure 3.3. The number of films that have been rated by user 10 as 4 or higher. These are broken out by genre, so user 10 has rated more than 40 comedy films with a 4 or a 5. Only the top 19 genres are displayed here.

Figure 3.4. The number of films that have been ranked as lower than 3 by user 10, broken out by genre. In this case, the user has ranked three comedy films and two action/sci-fi films below 3 and has watched several other categories of films and ranked them with a score of 1.

This should now start to make more sense. Figure 3.3 shows that this user is clearly interested in comedy and drama films, because they account for the overwhelming majority of positive ratings provided by the user. If you return to our recommendation in listing 3.10, you'll see that this does correlate well. Seven of the 10 recommendations have either comedy or drama in their genre title. Romance, action/adventure/sci-fi, and animation also feature highly in the user's positive ratings and appear in the recommendations as well. Perhaps surprising is the inclusion of the thriller and crime categories, because these feature only slightly, or not at all, in the existing ratings. These can be explained through the collaborative filtering process at work.

Within the existing dataset, there will be co-occurrence of the user's highly rated films with other films of different genres. The algorithm has found these patterns and is thus returning new and unseen categories for the user to try. This isn't surprising, really, because a recommender algorithm should certainly try to provide recommendations outside the obvious. A little thought might reveal that this correlation isn't surprising—viewers of drama may well enjoy crime/thriller films from time to time.

Figure 3.4 provides a histogram of those genres containing films that received low ratings from user 10. At first glance, we see that comedy and action/sci-fi both received two or more low rankings, but we must be careful how we interpret this. Take a look at the scales on both figure 3.3 and figure 3.4. This user clearly has a bias to rate films highly, as well as an interest in comedy movies (and has rated more than 40 positively). There are techniques to tackle rating biases in model-based CF, but rather than discuss them here, we refer you to the literature.[11]

11

Arkadiusz Paterek, “Improving Regularized Singular Value Decomposition for Collaborative Filtering,” Proceedings of the ACM SIGKDD (2007), http://mng.bz/TQnD.

3.5.3. Recommendation using SVD: choosing users for a given movie

We've already provided an example of SVD for recommending movies to users, but because of the flexibility of model-based CF, it's equally easy to recommend users for a given movie: that is, to determine which users we should recommend it to. You could easily imagine that such an approach would be useful in the advertising world, should we need to decide which users to promote an item to.

Let's imagine that we're Lucasfilm and that we're promoting the then-new Star Wars film, Star Wars: Episode I - The Phantom Menace (1999). Using SVD and the existing dataset, how might we find the right users to promote it to? The next listing provides the answer.

Listing 3.12. Investigating user recommendations in model-based CF
> items_full[str(2628)].get_data()
{'Genres': 'Action|Adventure|Fantasy|Sci-Fi\n',
 'Title': 'Star Wars: Episode I - The Phantom Menace (1999)'}
> users_for_star_wars = svd.recommend(2628,only_unknowns=True)
> users_for_star_wars
[(446, 4.7741731815492816),
 (3324, 4.7601341930085157),
 (2339, 4.7352608789782398),
 (1131, 4.6541316195384743),
 (5069, 4.6479235651508217),
 (4755, 4.6444117760840502),
 (4634, 4.6308837065012067),
 (4649, 4.619985795550809),
 (1856, 4.5846499038453166),
 (4273, 4.5803152198983419)]

We first check to see whether we have the correct film for recommendation. In this case, item ID 2628 does indeed correspond to the Star Wars film we’re interested in. By calling svd.recommend() with this ID and without the option is_row=False, we can perform recommendations across the rows: that is to say, the users. What is returned are the IDs of the top 10 users and the scores that we predict they would give this movie.

Again, let’s dig a little further in the next listing to see if these recommendations make intuitive sense.

Listing 3.13. Analyzing the ratings of users recommended Star Wars: TPM
> movies_reviewed_by_sw_rec = [get_name_item_reviewed(x[0], user_full, items_full)
                               for x in users_for_star_wars]
> movies_flatten = [movie for movie_list in movies_reviewed_by_sw_rec
                    for movie in movie_list]
> movie_aggregate = movies_by_category(movies_flatten, 3)
> movies_sort = sorted(movie_aggregate, key=lambda x: x[1], reverse=True)
> movies_sort

[['Drama\n', 64],
 ['Comedy\n', 38],
 ['Drama|War\n', 22],
 ['Drama|Romance\n', 16],
 ['Action|Sci-Fi\n', 16],
 ['Action|Adventure|Sci-Fi\n', 14],
 ['Sci-Fi\n', 14],
 ['Action|Sci-Fi|Thriller\n', 13],
 ['Thriller\n', 12],
 ['Comedy|Romance\n', 12],
 ['Comedy|Drama\n', 11],
 ['Action|Drama|War\n', 11],
 ['Drama|Romance|War\n', 10],
 ['Drama|Thriller\n', 9],
 ['Horror|Sci-Fi\n', 9],
 ['Film-Noir|Thriller\n', 8],
 ['Western\n', 8],
 ['Comedy|Sci-Fi\n', 8],
 ['Drama|Sci-Fi\n', 7]...

We first get the movies rated by those in the recommended user set, before flattening this list. We need to flatten the list from a list of lists (where each entry contains the list of rated movies for that user) to a list containing all the ratings of the users together. The movies_by_category method builds a list of genres, each paired with a count of how many times a movie of that genre was rated by the set of users under consideration. The parameter to this method allows the movies to be filtered based on the value of their rating; in this case, we count only movies that have received a rating of 3 or higher. Finally, we sort the output list so that the genres with the largest number of films appear first.

Essentially, what we’ve done is to build a list of genres ranked most positively by the set of users to whom we’re going to recommend The Phantom Menace. Figure 3.5 presents this output in a graphical form.

Figure 3.5. Number of ratings >3, aggregated by genre, over all users to whom we’ll recommend Star Wars: The Phantom Menace

There are several things to note. First, there’s a high prevalence of sci-fi among the categories of films rated highly by this aggregate user set: no fewer than four of the top 10 categories include sci-fi somewhere in the genre title. Intuitively, this is as expected. The Star Wars series is sci-fi, so it makes sense that users interested in sci-fi would be recommended this new film. What might be slightly surprising is that the two most popular genres among these users are drama and comedy. At first glance this may seem erroneous, but remember that there are many films in these categories. Also remember that films in these genres may receive high ratings from all users in the dataset, providing us with another example of bias.

In the previous sections, we introduced you to the world of recommender systems and provided specific details of user-based and model-based recommendation engines. We used the MovieLens dataset, a non-commercial dataset consisting of 1 million ratings from 6,000 users over 4,000 movies, to choose both movies for a user and users for a movie.

User-based CF has the benefit of being simple to understand and easy to explain: you can tell a user why they were recommended a given movie (because they’re similar to other users who like it). It can suffer from performance issues as sparsity increases, however (that is, when there are many movies and users but little overlap between users in terms of their ratings). Conversely, with model-based CF it can be difficult to explain a particular recommendation to a user, but such methods can perform better in the presence of sparsity, thanks to the latent space through which the users and movies are connected. As always, it’s up to the machine-learning practitioner to choose the solution that best matches their requirements.

3.6. The Netflix Prize

It’s difficult to talk about recommender systems and not mention Netflix in the same breath. The Netflix Prize, which concluded in 2009, was a roaring success and saw more than 41,000 entries from 186 countries[12] compete to obtain the best root-mean-square error (RMSE) score in recommending relevant movies to users, given those users’ past ratings. Teams were submitting new entries for evaluation against a publicly known quiz set (as opposed to the privately held test set that Netflix used for final scoring) right up until the last minutes before closing. Exciting stuff! But the story behind the development of these algorithms is just as compelling.

12

In the end, the $1 million prize was awarded to team BellKor’s Pragmatic Chaos. This was a composite team from three of the top contenders in the competition: Team Pragmatic Theory, Team BellKor, and Team BigChaos. The algorithm provided a 10.06% improvement over Netflix’s homegrown Cinematch algorithm, thus securing the prize for the first team to breach the 10% mark. Rather than provide the specifics of the developed algorithm here, we’ll talk more generally about the approach and its impact. For those interested in the details, we strongly recommend the following papers: “The BellKor Solution to the Netflix Grand Prize,”[13] “The BellKor 2008 Solution to the Netflix Prize,”[14] and “The Pragmatic Theory Solution to the Netflix Grand Prize.”[15] Collectively these papers provide the full documentation for the prizewinning approach.

13

Yehuda Koren, “The BellKor Solution to the Netflix Grand Prize,” August 2009, http://mng.bz/TwOD.

14

Robert M. Bell, Yehuda Koren, and Chris Volinsky, “The BellKor 2008 Solution to the Netflix Prize,” http://mng.bz/mW25.

15

Martin Piotte and Martin Chabbert, “The Pragmatic Theory Solution to the Netflix Grand Prize,” August 2009, http://mng.bz/1et7.

The winning result blended hundreds of individual predictors, each capturing specific aspects of the interactions between users and items, to achieve performance in excess of any single predictor alone. What’s particularly interesting about this competition is that the most impressive results came only after these three teams pooled their predictors.

You may be surprised to learn that many of the concepts required to understand the winning algorithm have already been introduced in this or previous chapters. We’ll look at one of the predictors (from BellKor) and briefly discuss how this predictor was combined with others to reach a recommendation.

Team BellKor introduced a baseline predictor: that is, they modeled the item and user biases present in the dataset. To this, they added temporal dynamics, which helped their algorithm capture changes in the overall popularity of a film (for example, a re-released film may garner more attention) as well as changes in users’ rating biases (such as becoming more or less critical over time). They also added an SVD-like approach, as discussed previously. This modeled global patterns in the data, but such an approach can often miss local patterns. To counter this, they added a neighborhood method, similar to item-based and user-based CF, which helped capture local patterns among users, items, and their surrounding neighborhoods in the data. In the end, this was only one of 454 candidate predictors from which a set was selected and combined using gradient-boosted decision trees (GBDT).[16], [17]

16

Jerome H. Friedman, “Stochastic Gradient Boosting,” March 26, 1999, https://statweb.stanford.edu/~jhf/ftp/stobst.pdf.

17

Jerry Ye, Jyh-Herng Chow, et al., “Stochastic Gradient Boosted Distributed Decision Trees,” Proceedings of the 18th ACM CIKM (November 2009), http://mng.bz/WiMO.
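
To make the idea of a baseline predictor concrete, here’s a minimal sketch of the user/item-bias model (a prediction of mu + b_u + b_i) fitted by stochastic gradient descent. It illustrates the general technique only: it isn’t the BellKor implementation, it omits the temporal terms described above, and the hyperparameter values are assumptions.

import numpy as np

def fit_baseline(ratings, n_epochs=20, lr=0.005, reg=0.02):
    # ratings: list of (user_id, item_id, rating) tuples
    mu = np.mean([r for _, _, r in ratings])   # global mean rating
    b_user, b_item = {}, {}                    # per-user and per-item biases
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_user.get(u, 0.0) + b_item.get(i, 0.0))
            b_user[u] = b_user.get(u, 0.0) + lr * (err - reg * b_user.get(u, 0.0))
            b_item[i] = b_item.get(i, 0.0) + lr * (err - reg * b_item.get(i, 0.0))
    return mu, b_user, b_item

def predict_baseline(mu, b_user, b_item, u, i):
    return mu + b_user.get(u, 0.0) + b_item.get(i, 0.0)

Richer components, such as the SVD-like and neighborhood predictors mentioned above, are then typically trained on (or combined with) the residuals that remain after this baseline is subtracted.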

Although this was a truly impressive achievement, Netflix never actually implemented the final algorithm, choosing instead to continue with a simpler, albeit less accurate, one (yielding an 8.43% improvement versus the winning 10.06%). That simpler algorithm had won the first progress prize and was implemented even before the competition concluded. According to Netflix, the additional effort required to scale up the winning algorithm didn’t justify the increase in accuracy.

Intelligent-algorithm developers have an important lesson to learn here: we must focus not just on the accuracy of a proposed solution but also on its cost. There’s another reason Netflix didn’t implement the winning algorithm: by the time the competition closed, the business had begun to shift focus. Recommendations were crucial to Netflix back in its DVD rental days, because they allowed good suggestions to be served to users, so that Netflix could carefully select films for a limited number of mail-order rental slots. With the focus shifting to online streaming, this became less relevant, because the delivery mechanism was almost instantaneous: a user can watch whatever they want, whenever and wherever they want. Herein lies our second important lesson: a reasonably intelligent algorithm delivered quickly is often worth more than an extremely accurate one delivered at a later date.

3.7. Evaluating your recommender

Commercial recommendation systems operate under demanding conditions. The number of users is typically on the order of millions and the number of items on the order of hundreds of thousands. An additional requirement is the ability to provide recommendations in real time (typically, with sub-second response times) without sacrificing quality. This is usually achieved with a layer of caching, because the SVD matrix factorization itself is rarely a real-time operation: how long it takes depends on many factors, such as the amount of memory available and, more important, the implementation of SVD used. It might also depend on how long you’re willing to wait for the answer!

As you’ve seen, by accumulating ratings from each user, it’s possible to enhance the accuracy of predictions over time. But in real life, it’s imperative that we give excellent recommendations to new users for whom, by definition, we don’t have a lot of ratings.

Another stringent requirement for state-of-the-art recommendation systems is the ability to update predictions based on incoming ratings. In large commercial sites, thousands of ratings and purchases may take place in a few hours and perhaps tens of thousands (or more) in the course of a single day. The ability to update the recommendation system with that additional information is important and must happen online—without downtime.
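
One common trick for absorbing new ratings without refactorizing the entire matrix is known as folding in: a new user’s rating vector is projected into the existing latent space using the factors you already have. The sketch below assumes NumPy and a movies-as-rows factorization R ~= U * S * Vt; it illustrates the standard approximation rather than how any particular library performs its online updates.

import numpy as np

def fold_in_user(r, U, s):
    # r: length n_movies rating vector (zeros where unrated)
    # U: movie factors (n_movies x k); s: singular values (k,)
    # Since r ~= U * S * v for the user's latent profile v, we can recover
    # v ~= inv(S) * U.T * r without touching the factorization itself.
    return r.dot(U) / s

def predict_for_folded_user(v, U, s):
    # Predicted rating of every movie for the folded-in user
    return U.dot(np.diag(s)).dot(v)

Folding in degrades gracefully: the factors slowly go stale as new ratings accumulate, so a full refactorization can still be run periodically (nightly, say) while folded-in profiles serve the traffic in between.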

Let’s say that you wrote a recommender and you’re satisfied with its speed and the amount of data it can handle. Is this a good recommender? It’s not useful to have a fast and scalable recommender that produces bad recommendations. So, let’s talk about evaluating the accuracy of a recommendation system. If you search the related literature, you’ll find dozens of quantitative metrics and several qualitative methods for evaluating the results of recommendation systems. The plethora of metrics and methods reflects the challenges of conducting a meaningful, fair, and accurate evaluation for recommendations. The review article by Herlocker, Konstan, Terveen, and Riedl contains a wealth of information if you’re interested in this topic.[18]

18

Jonathan L. Herlocker, et al., “Evaluating Collaborative Filtering Recommender Systems,” ACM Transactions on Information Systems 22, no. 1 (January 2004).

We’ll demonstrate a measure known as the root-mean-square error (RMSE) of the predicted ratings over the MovieLens dataset. The RMSE is a simple but robust technique for evaluating the accuracy of recommendations. This metric has two main features: every prediction contributes a non-negative amount to the error (you don’t get kudos for predicting a rating accurately, so good predictions can’t offset bad ones); and because the differences are squared, large errors (>1) are amplified, while it doesn’t matter whether you undershoot or overshoot the rating.
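
Concretely, if $\hat{r}_i$ is the predicted rating and $r_i$ the actual rating for each of the $N$ ratings we evaluate, the metric is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_i - r_i\right)^2}$$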

We can argue that the RMSE is probably too naïve. Let’s consider two cases. In the first case, we recommend a movie with four stars to a user, and he really doesn’t like it (he’d rate it two stars); in the second case, we recommend a movie with three stars, but the user loves it (he’d rate it five stars). In both cases, the contribution to the RMSE is the same, but the user’s dissatisfaction would probably be greater in the first case than in the second. You can find the code that calculates the RMSE for our model-based CF solution in the following listing. As we’ve mentioned, other metrics are available, and we encourage you to try those as well.

Listing 3.14. Calculating the root-mean-square error for a recommender
from recsys.evaluation.prediction import RMSE

err = RMSE()
for rating, item_id, user_id in data.get():       # iterate over every known rating
    try:
        prediction = svd.predict(item_id, user_id)
        err.add(rating, prediction)               # accumulate (actual, predicted) pairs
    except KeyError:                              # skip items/users unknown to the model
        continue

print 'RMSE is ' + str(err.compute())

3.8. Summary

  • You’ve learned about the concepts of distance and similarity between users and items, and you saw that one size doesn’t fit all—you need to be careful in your selection of a similarity metric. Similarity formulas must produce results that are consistent with a few basic rules, but otherwise you’re free to choose the ones that produce the best results for your purposes.
  • There are two broad categories of techniques for creating recommendations: collaborative filtering and the content-based approach. Focusing on the former, we provided three different approaches that you can use to perform recommendation: user-based, item-based, and model-based.
  • We performed item and user recommendation, and in each instance we looked at the distribution of ratings over movie genres. Movie genres are an imperfect way to reason about model-based CF but can capture some of the properties of user preference and movie subject.
  • In the final pages of this chapter, we looked at the Netflix Prize. Although this competition concluded in 2009, it’s still synonymous with the concept of recommendation. We provided a broad overview of the winning approach, but we leave it to you to uncover all the intricacies of this algorithm.