An overview of a recommendation engine

We will now focus on situations where users have provided rankings or ratings on items they have previously viewed or purchased. There are two primary categories of recommendation system design: collaborative filtering and content-based (Ansari, Essegaier, and Kohli, 2000). We will concentrate on the former, as it is the focus of the recommenderlab R package that we will be using.

In content-based approaches, the concept is to link user preferences with item attributes. These attributes may be things such as the genre, cast, and storyline for a movie or TV show recommendation. As such, recommendations are based entirely on what the user provides as ratings; there is no linkage to what anyone else recommends. This gives content-based methods an advantage over collaborative filtering: when a new item is added, it can be recommended to a user if it matches their profile, instead of relying on other users to rate it first (the so-called first-rater problem). However, content-based methods can suffer when limited content is available, either because of the domain or when a new user enters the system. This can result in non-unique, that is, poor, recommendations (Lops, Gemmis, and Semeraro, 2011).

In collaborative filtering, the recommendations are based on the many ratings provided by some or all of the individuals in the database. Essentially, it tries to capture the wisdom of the crowd.

For collaborative filtering, we will focus on the following four methods:

  • User-based Collaborative Filtering (UBCF)
  • Item-based Collaborative Filtering (IBCF)
  • Singular Value Decomposition (SVD)
  • Principal Components Analysis (PCA)

We will look at these methods briefly before moving on to the business case. It is also important to understand that recommenderlab was not designed as a real-world implementation tool, but rather as a laboratory tool for researching the algorithms provided in the package, as well as algorithms that you wish to experiment with on your own.
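To ground the workflow before the details, here is a minimal recommenderlab sketch; the bundled MovieLense dataset, the 500/2 user split, and the request for five recommendations are illustrative assumptions, not part of the business case that follows:

> # a minimal sketch of the recommenderlab workflow (assumes the package
> # and its bundled MovieLense dataset are installed)
> library(recommenderlab)
> data(MovieLense)
> rec = Recommender(MovieLense[1:500], method = "UBCF")  # train on 500 users
> pred = predict(rec, MovieLense[501:502], n = 5)        # top-5 lists for two held-out users
> as(pred, "list")                                       # inspect the recommendations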

User-based collaborative filtering

In UBCF, missing ratings for a user are predicted by first finding a neighborhood of similar users and then aggregating the ratings of these users to form a prediction (Hahsler, 2011). The neighborhood is determined either by selecting the k-nearest neighbors (k-NN) that are the most similar to the user we are making predictions for, or by a similarity measure with a minimum threshold. The two similarity measures available in recommenderlab are the Pearson correlation coefficient and cosine similarity. I will skip the formulas for these measures as they are readily available in the package documentation.

Once the neighborhood method is decided on, the algorithm identifies the neighbors by calculating the similarity measure between the individual of interest and their neighbors on only those items that were rated by both. Through some scoring scheme, say, a simple average, the ratings are aggregated in order to make a predicted score for the individual and item of interest.

Let's look at a simple example. In the following matrix, there are six individuals with ratings on four movies, with the exception of my rating for Mad Max. Using k=1, the nearest neighbor is Homer, with Bart a close second, even though Flanders hated the Avengers as much as I did. So, using Homer's rating for Mad Max, which is 4, the predicted rating for me would also be a 4:

(Figure: User-based collaborative filtering example rating matrix)
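To make the calculation concrete, here is a rough base-R sketch of the same prediction (not the recommenderlab implementation). The exMat object, the cosine() helper, and the decision to compare users only on the three movies everyone has rated are my own illustrative choices, with my missing Mad Max rating coded as NA:

> # rebuild the matrix from the figure, with Me's Mad Max rating as NA
> exMat = matrix(c(3,5,5,5,1,1, 5,2,5,1,1,5, 3,5,1,5,4,2, 4,3,4,2,1,NA),
    nrow = 6, dimnames = list(
      c("Homer", "Marge", "Bart", "Lisa", "Flanders", "Me"),
      c("Avengers", "American Sniper", "Les Miserable", "Mad Max")))
> # cosine similarity between two rating vectors
> cosine = function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
> rated = exMat[, 1:3]                          # movies rated by everyone
> sims = apply(rated[-6, ], 1, cosine, b = rated["Me", ])
> round(sims, 3)                                # Homer is closest, Bart second
> exMat[names(which.max(sims)), "Mad Max"]      # k=1 prediction for Me: 4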

There are a number of ways to weight the data and/or control for bias. For instance, Flanders is quite likely to give lower ratings than the other users, so it can help to normalize the data, where the new rating score is equal to the user's rating for an item minus that user's average rating across all the items.
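Continuing the sketch above, a one-line way to apply this normalization is to subtract each user's average rating from that user's row:

> # center each user's ratings on their own mean (ignoring the missing value)
> round(sweep(exMat, 1, rowMeans(exMat, na.rm = TRUE)), 2)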

The weakness of UBCF is that to calculate the similarity measure for all the possible users, the entire database must be kept in memory, which can be quite computationally expensive and time-consuming.

Item-based collaborative filtering

As you might have guessed, IBCF uses the similarity between items, not users, to make a recommendation. The assumption behind this approach is that users will prefer items that are similar to other items they like (Hahsler, 2011). The model is built by calculating a pairwise similarity matrix of all the items. Popular similarity measures are the Pearson correlation and cosine similarity. To reduce the size of the similarity matrix, one can specify that only the k-most similar items be retained. However, limiting the size of the neighborhood may significantly reduce accuracy, leading to poorer performance versus UBCF.

Continuing with our simplified example, if we examine the following matrix, with k=1 the item most similar to Mad Max is American Sniper, and we can thus take my rating for American Sniper as the prediction for Mad Max, as follows:

(Figure: Item-based collaborative filtering example rating matrix)
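Again as a rough base-R sketch (reusing exMat and cosine() from the UBCF example), the item-to-item version of the calculation can be written as follows, with similarities computed over the five users who have rated Mad Max:

> # item-item similarity between Mad Max and the other three movies
> rows = !is.na(exMat[, "Mad Max"])             # users who have rated Mad Max
> itemSims = apply(exMat[rows, 1:3], 2, cosine, b = exMat[rows, "Mad Max"])
> round(itemSims, 3)                            # American Sniper is most similar
> exMat["Me", names(which.max(itemSims))]       # my rating for it is the prediction: 5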

Singular value decomposition and principal components analysis

It is quite common to have a dataset where the number of users and items could number in the millions. Even if the rating matrix is not that large, it may be beneficial to reduce the dimensionality by creating a smaller (lower rank) matrix that captures most of the information in the higher dimension matrix. This may potentially allow you to capture important latent factors and their corresponding weights in the data. Such factors could lead to important insights such as the movie genre or book topics in the rating matrix. Even if you are unable to discern the meaningful factors, the techniques may filter out the noise in the data.

One issue with large datasets is that you will likely end up with a sparse matrix, with many ratings missing. One weakness of these methods is that they will not work on a matrix with missing values, so the missing values must be imputed. As with any data imputation task, there are a number of techniques that one can try and experiment with, such as using the mean or the median, or coding the missing values as zeroes. The default for recommenderlab is to use the median. You should also be aware of the R package bcv (Cross-Validation for the SVD), which has the impute.svd() function to impute the missing values in a matrix.
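As a simple illustration of one option, the hypothetical helper below (not part of recommenderlab) fills each missing rating with its column median; applied to the exMat sketch from earlier, my missing Mad Max rating would become 3:

> # hypothetical helper: replace each missing rating with its column's median
> imputeMedian = function(m) {
    apply(m, 2, function(column) {
      column[is.na(column)] = median(column, na.rm = TRUE)
      column
    })
  }
> imputeMedian(exMat)["Me", "Mad Max"]   # the median Mad Max rating is 3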

So what is SVD? It is simply a method of matrix factorization. Say that you have a matrix called A. This matrix will factor into three matrices, U, D, and V^T, such that A = U D V^T. Here, U is an orthogonal matrix, D is a non-negative diagonal matrix, and V^T is the transpose of an orthogonal matrix. Furthermore, the columns of U are the eigenvectors of A A^T and the columns of V are the eigenvectors of A^T A. Now, let's look at our rating matrix and walk through an example using R.

The first thing that we will do is recreate the rating matrix, as shown in the following code:

> ratings = c(3,5,5,5,1,1,5,2,5,1,1,5,3,5,1,5,4,2,4,3,4,2,1,4)

> ratingMat = matrix(ratings, nrow=6)

> rownames(ratingMat) = c("Homer","Marge","Bart","Lisa","Flanders","Me")

> colnames(ratingMat) = c("Avengers","American Sniper","Les Miserable","Mad Max")

> ratingMat
         Avengers American Sniper Les Miserable Mad Max
Homer        3               5             3       4
Marge        5               2             5       3
Bart         5               5             1       4
Lisa         5               1             5       2
Flanders     1               1             4       1
Me           1               5             2       4

Now, we will use the svd() function in base R to create the three matrices, which R calls $d, $u, and $v. You can think of the $u values as an individual's loadings on a given factor and the $v values as a movie's loadings on a given dimension; for example, Mad Max loads on dimension one at -0.473:

> svd=svd(ratingMat)

> svd
$d
[1] 16.1204848  6.1300650  3.3664409  0.4683445

$u
           [,1]       [,2]       [,3]        [,4]
[1,] -0.4630576  0.2731330  0.2010738 -0.27437700
[2,] -0.4678975 -0.3986762 -0.0789907  0.53908884
[3,] -0.4697552  0.3760415 -0.6172940 -0.31895450
[4,] -0.4075589 -0.5547074 -0.1547602 -0.04159102
[5,] -0.2142482 -0.3017006  0.5619506 -0.57340176
[6,] -0.3660235  0.4757362  0.4822227  0.44927622

$v
           [,1]       [,2]        [,3]       [,4]
[1,] -0.5394070 -0.3088509 -0.77465479 -0.1164526
[2,] -0.4994752  0.6477571  0.17205756 -0.5489367
[3,] -0.4854227 -0.6242687  0.60283871 -0.1060138
[4,] -0.4732118  0.3087241  0.08301592  0.8208949
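As a quick sanity check on the factorization, multiplying the three pieces back together reproduces the original rating matrix (up to rounding):

> # U %*% D %*% t(V) recovers ratingMat
> round(svd$u %*% diag(svd$d) %*% t(svd$v), 1)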

It is easy to explore how much variation is explained by reducing the dimensionality. Let's sum the singular values in $d, and then look at how much of the variation we can explain with just two factors, as follows:

> sum(svd$d)
[1] 26.08534

> var = sum(svd$d[1:2])

> var
[1] 22.25055

> var/sum(svd$d)
[1] 0.8529908

With two of the four factors, we are able to capture just over 85 percent of the total variation in the full matrix. You can see the scores that the reduced dimensions would produce. To do this, we will create a function. (Many thanks to the www.stackoverflow.com respondents who helped me put this function together.) This function will allow us to specify the number of factors to be included in a prediction. It calculates a rating value by summing, over the included factors, the outer product of the corresponding columns of $u and $v, each scaled by its singular value in $d:

> f1 = function(x) {
    # add up the rank-one pieces x$u[, i] %*% t(x$v[, i]), each weighted
    # by its singular value x$d[i], for the first n factors (n is set globally)
    score = 0
    for (i in 1:n) {
      score = score + x$u[, i] %*% t(x$v[, i]) * x$d[i]
    }
    return(score)
  }

By specifying n=4 and calling the function, we can recreate the original rating matrix:

> n=4

> f1(svd)
     [,1] [,2] [,3] [,4]
[1,]    3    5    3    4
[2,]    5    2    5    3
[3,]    5    5    1    4
[4,]    5    1    5    2
[5,]    1    1    4    1
[6,]    1    5    2    4

Alternatively, we can specify n=2 and examine the resulting matrix:

> n=2

> f1(svd)
            [,1]      [,2]     [,3]     [,4]
[1,] 3.509402 4.8129937 2.578313 4.049294
[2,] 4.823408 2.1843483 5.187072 2.814816
[3,] 3.372807 5.2755495 2.236913 4.295140
[4,] 4.594143 1.0789477 5.312009 2.059241
[5,] 2.434198 0.5270894 2.831096 1.063404
[6,] 2.282058 4.8361913 1.043674 3.692505
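Note that the same rank-2 approximation can be produced without the loop by keeping only the first two columns of $u and $v and the first two singular values; this is just a compact restatement of what f1() does:

> # equivalent rank-2 reconstruction
> round(svd$u[, 1:2] %*% diag(svd$d[1:2]) %*% t(svd$v[, 1:2]), 6)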

So, with SVD, you can reduce the dimensionality and possibly identify the meaningful latent factors.

If you went through the prior chapter, you can see the similarities with PCA. In fact, the two are closely related and often used interchangeably as they both utilize matrix factorization. So what is the difference? In short, PCA is based on the covariance (or correlation) matrix, which is symmetric. This means that you start with the data, compute the covariance matrix of the centered data, diagonalize it, and create the components. One can also derive the principal components using an SVD of the centered data. Let's apply a portion of the PCA code from the prior chapter to our data in order to see how the difference manifests itself:
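As a brief sketch of that connection, base R's prcomp() computes principal components through an SVD of the centered (and, with scale. = TRUE, standardized) data; its first two proportions of variance should line up with the psych output below:

> # PCA via SVD of the centered and scaled rating matrix
> prc = prcomp(ratingMat, center = TRUE, scale. = TRUE)
> round(prc$sdev^2 / sum(prc$sdev^2), 2)   # proportion of variance by component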

> library(psych)

> pca = principal(ratingMat, nfactors=2, rotate="none")

> pca
Principal Components Analysis
Call: principal(r = ratingMat, nfactors = 2, rotate =
"none")
Standardized loadings (pattern matrix) based upon correlation matrix
                  PC1   PC2   h2    u2
Avengers        -0.09  0.98 0.98 0.022
American Sniper  0.99 -0.01 0.99 0.015
Les Miserable   -0.90  0.18 0.85 0.150
Mad Max          0.92  0.29 0.93 0.071

                       PC1  PC2
SS loadings           2.65 1.09
Proportion Var        0.66 0.27
Cumulative Var        0.66 0.94
Proportion Explained  0.71 0.29
Cumulative Proportion 0.71 1.00

You can see that PCA is easier to interpret. Notice how American Sniper and Mad Max have high positive loadings on the first component (with Les Miserable loading strongly in the negative direction), and only Avengers has a high loading on the second component. Additionally, these two components account for 94 percent of the total variance in the data.

Having applied a simplistic rating matrix to the techniques of collaborative filtering, let's move on to a more complex example using real-world data.
