In our final chapter, we'll tackle one of the most ubiquitous problems in the e-commerce world: making effective product recommendations to customers. Recommendation systems, also referred to as recommender systems, often rely on the notion of similarity between objects, in an approach known as collaborative filtering. Its basic premise is that two customers can be considered similar to each other if they have purchased largely the same products; equally, two items can be considered similar to each other if they share a large number of customers who purchased them.
There are a number of different ways to quantify this notion of similarity, and we will present some of the commonly used alternatives. Whether we want to recommend movies, books, hotels, or restaurants, building a recommender system often involves dealing with very large data sets. Consequently, we'll introduce a few ideas and options for working with Big Data using R to wrap up our exploration of building predictive models in R.
A recommendation system usually involves having a set of users U = {u_1, u_2, …, u_m} that have varying preferences on a set of items I = {i_1, i_2, …, i_n}. The number of users |U| = m is different from the number of items |I| = n in the general case. In addition, users can often express their preference by rating items on some scale. As an example, we can think of the users as being restaurant patrons in a city, and the items as the restaurants that they visit. Under this setup, the preferences of the users could be expressed as ratings on a five-star scale. Of course, our generalization does not require that the items be physical items or that the users be actual people; this is simply a commonly used abstraction for the recommender system problem.
As an illustration, think of a dating website in which users rate other users; here, the items that are being rated are the profiles of the users themselves. Let's return to our example of a restaurant recommender system and build some example data. A natural data structure that is popularly used for recommendation systems is the rating matrix. This is an m × n matrix where the rows represent the users and the columns represent the items. Each entry, e_{i,j}, of the matrix represents the rating made by user i for item j. What follows is a simple example:
> oliver <- c(1,1,2,5,7,8,9,7)
> thibault <- c(5,9,4,1,1,7,5,9)
> maria <- c(1,4,2,5,8,6,2,8)
> pedro <- c(2,6,7,2,6,1,8,9)
> ines <- c(1,3,2,4,8,9,7,7)
> gertrude <- c(1,6,5,7,3,2,5,5)
> ratingMatrix <- rbind(oliver, thibault, maria, pedro, ines, gertrude)
> colnames(ratingMatrix) <- c("Berny's", "La Traviata", "El Pollo Loco",
  "Joey's Pizza", "The Old West", "Jake and Jill", "Full Moon",
  "Acropolis")
> ratingMatrix
         Berny's La Traviata El Pollo Loco Joey's Pizza
oliver         1           1             2            5
thibault       5           9             4            1
maria          1           4             2            5
pedro          2           6             7            2
ines           1           3             2            4
gertrude       1           6             5            7
         The Old West Jake and Jill Full Moon Acropolis
oliver              7             8         9         7
thibault            1             7         5         9
maria               8             6         2         8
pedro               6             1         8         9
ines                8             9         7         7
gertrude            3             2         5         5
Here, we have used a 10-point scale as a rating system, where 10 is the highest rating and 1 is the lowest. An alternative rating scale is a binary rating scale where 1 indicates a positive rating and 0 indicates a negative rating. This second approach would yield a binary rating matrix. How might we be able to use this rating matrix in order to inform a simple recommender system for other users?
Concretely, suppose that a new user, Silvan, has rated a few restaurants and we would like to make a recommendation for a suitable restaurant to which he has not been. Alternatively, we might want to propose a list of top three restaurants or even predict whether Silvan will like a specific restaurant he is currently considering.
One way to think about this problem is to find users that have similar views as Silvan on the restaurants that he has already rated. Then, we could use their ratings on restaurants that Silvan has not yet rated in order to predict Silvan's rating for those restaurants. This seems promising, but we should first think about how we might quantify this notion of similarity between two users based on their item ratings.
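To make this idea concrete, here is a minimal sketch in R. Silvan's ratings and the choice of two nearest neighbors are hypothetical illustrations, not part of our data set: we find the users closest to Silvan (by Euclidean distance) on the restaurants he has rated, and average their ratings on the ones he has not.

```r
# Rebuild the rating matrix from earlier in the chapter
ratingMatrix <- rbind(
  oliver   = c(1, 1, 2, 5, 7, 8, 9, 7),
  thibault = c(5, 9, 4, 1, 1, 7, 5, 9),
  maria    = c(1, 4, 2, 5, 8, 6, 2, 8),
  pedro    = c(2, 6, 7, 2, 6, 1, 8, 9),
  ines     = c(1, 3, 2, 4, 8, 9, 7, 7),
  gertrude = c(1, 6, 5, 7, 3, 2, 5, 5))
colnames(ratingMatrix) <- c("Berny's", "La Traviata", "El Pollo Loco",
  "Joey's Pizza", "The Old West", "Jake and Jill", "Full Moon",
  "Acropolis")

# Hypothetical new user: Silvan has rated only the first five
# restaurants (NA = not yet rated)
silvan <- c(2, 2, 3, 4, 7, NA, NA, NA)
rated  <- !is.na(silvan)

# Euclidean distance from Silvan to each existing user, restricted
# to the restaurants that Silvan has actually rated
distances <- apply(ratingMatrix[, rated], 1, function(u)
  sqrt(sum((u - silvan[rated]) ^ 2)))

# Predict Silvan's missing ratings as the mean rating of his two
# nearest neighbors (an arbitrary, illustrative choice of k = 2)
neighbors <- names(sort(distances))[1:2]
predicted <- colMeans(ratingMatrix[neighbors, !rated])
predicted
```

With these made-up ratings, Silvan's nearest neighbors turn out to be Oliver and Ines, and the highest predicted rating among his unvisited restaurants goes to Jake and Jill.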
Even with a very large database of users, chances are that for a real-world recommender system it will be rare, if not massively unlikely, to find two people who have rated every item in our item set with exactly the same scores. That said, we can still say that some users are more similar than others based on how they rate different items. For example, in our restaurant rating matrix, we can see that Ines and Oliver rated the first four restaurants poorly and the last four restaurants highly, so their tastes can be considered far more similar than those of a pair like Thibault and Pedro, who sometimes agree and sometimes have completely opposite views on a particular restaurant.
By representing a user as their particular row in the rating matrix, we can think of a user as a vector in an n-dimensional space, where n is the number of items. Thus, we can use different distance measures appropriate for vectors in order to measure the similarity of two different users. Note that distance is inversely related to similarity, so we can use measures of distance as measures of similarity by interpreting a large distance between two vectors as analogous to a low similarity score.
The most familiar distance metric for two vectors a and b is the Euclidean distance:

$$d_{euclidean}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
We can use R's built-in dist() function to compute all the pair-wise distances in our rating matrix as follows:
> dist(ratingMatrix, method = 'euclidean')
            oliver  thibault     maria     pedro      ines
thibault 12.529964
maria     8.000000 11.000000
pedro    10.723805  9.899495 10.246951
ines      3.316625 11.224972  6.082763 10.583005
gertrude 10.488088 10.344080  8.717798  8.062258 10.440307
The result is a lower triangular matrix because the Euclidean distance is a symmetric function. Thus, the entry for (maria, pedro) is exactly the same as the entry for (pedro, maria), so we only need to display one of them. Here, we can explicitly see that Ines and Oliver are the two most similar users, as the distance between them is the smallest. Note that we can also talk about the distances between items in terms of the similarity of the ratings they received from different users. All we have to do to compute this is transpose the rating matrix:
> dist(t(ratingMatrix), method = 'euclidean')
                Berny's La Traviata El Pollo Loco Joey's Pizza
La Traviata    8.366600
El Pollo Loco  6.708204   5.744563
Joey's Pizza   9.643651   9.949874      7.745967
The Old West  13.038405  12.247449     10.535654     7.810250
Jake and Jill 12.000000  11.575837     12.449900     9.848858
Full Moon     12.369317  10.246951      8.717798     9.486833
Acropolis     14.212670   8.831761     10.723805    11.789826
              The Old West Jake and Jill Full Moon
Jake and Jill     8.246211
Full Moon         8.062258      9.110434
Acropolis         8.831761      9.273618  7.549834
As we can see, the two most dissimilar restaurants (that is to say, those with the largest difference between them) are the Acropolis and Berny's. Looking back at the rating matrix, we should easily convince ourselves why this is the case. The former restaurant has received largely positive reviews across our user base whereas the reviews have been poor for the latter.
A commonly used alternative to the Euclidean distance (or L2 norm, as it is also known) is the cosine distance. This metric is based on the cosine of the smallest angle between two vectors: the distance is one minus that cosine. If the vectors are parallel to each other, meaning that their angle is 0, then the cosine distance is 0 as well. If the two vectors are at a right angle to each other, then they have the largest distance according to this metric (assuming, as with ratings, that the vector components are nonnegative). The cosine distance is given by:

$$d_{cosine}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Here, the numerator is the dot product of the two vectors and the denominator is the product of their magnitudes (their L2 norms). The cosine distance isn't available as a method in the dist() function of R's base distribution, but we can install the proxy package, which extends this function with a number of new distance measures, in order to compute the cosine distances for our rating matrix:
> library("proxy")
> dist(ratingMatrix, method = 'cosine')
             oliver   thibault      maria      pedro       ines
thibault 0.28387670
maria    0.12450495 0.23879093
pedro    0.20947046 0.17687385 0.20854178
ines     0.02010805 0.22821528 0.06911870 0.20437426
gertrude 0.22600742 0.21481973 0.19156876 0.12227138 0.22459114
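Again, we can verify one entry by hand. Applying the cosine distance formula to Oliver and Ines:

```r
oliver <- c(1, 1, 2, 5, 7, 8, 9, 7)
ines   <- c(1, 3, 2, 4, 8, 9, 7, 7)

# Cosine distance computed manually: one minus the dot product
# divided by the product of the two L2 norms
cosineDistance <- 1 - sum(oliver * ines) /
  (sqrt(sum(oliver ^ 2)) * sqrt(sum(ines ^ 2)))
cosineDistance
# agrees with the 0.02010805 entry reported by dist(..., method = 'cosine')
```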
Suppose instead that our users rated restaurants on a binary scale. We can convert our rating matrix into a binary rating matrix by considering all ratings above 5 to be positive and assigning them a new score of 1. The remaining ratings are all converted to a score of 0. For two binary vectors, the Jaccard similarity is given by the cardinality of their logical intersection divided by the cardinality of their logical union; the Jaccard distance is then one minus this:

$$d_{jaccard}(a, b) = 1 - \frac{|a \wedge b|}{|a \vee b|}$$
In a nutshell, this computes one minus the ratio of the number of positions in which both vectors have a positive rating to the total number of positions in which either of the two vectors has a positive rating. Two binary vectors that share exactly the same set of positive positions thus have a distance of 0. Using the proxy package, we can show the Jaccard distance for our restaurant patrons as follows:
> binaryRatingMatrix <- ratingMatrix > 5
> dist(binaryRatingMatrix, method = 'jaccard')
            oliver  thibault     maria     pedro      ines
thibault 0.6000000
maria    0.2500000 0.5000000
pedro    0.5000000 0.6666667 0.6666667
ines     0.0000000 0.6000000 0.2500000 0.5000000
gertrude 1.0000000 0.7500000 1.0000000 0.8333333 1.0000000
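Once more, we can verify an entry manually. Oliver and Maria share three positively rated restaurants out of the four that either of them rated positively:

```r
# Two rows of the rating matrix from earlier in the chapter
ratingMatrix <- rbind(
  oliver = c(1, 1, 2, 5, 7, 8, 9, 7),
  maria  = c(1, 4, 2, 5, 8, 6, 2, 8))

# Binarize: ratings above 5 become TRUE (positive), the rest FALSE
binary <- ratingMatrix > 5
a <- binary["oliver", ]
b <- binary["maria", ]

# Jaccard distance: one minus intersection size over union size
jaccardDistance <- 1 - sum(a & b) / sum(a | b)
jaccardDistance
# agrees with the 0.25 entry reported by dist(..., method = 'jaccard')
```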