Building a recommendation engine

When we left the recommendation engine problem at the beginning of the chapter, we had created the wordsMatrix variable. We now have enough background on the benefits of the eigenvalue decomposition process to finish building our recommendation engine. We can use the principalComponentAnalysis function to produce our own dataset based on wordsMatrix, as follows:

> let pcaMatrix = principalComponentAnalysis wordsMatrix 5

The pcaMatrix variable is a compressed form of the wordsMatrix variable, focusing on the 5 component vectors with the highest variance.
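As a quick, illustrative sanity check (pcaMatrix is an hmatrix Matrix, as its use later in the chapter requires), the compressed matrix keeps one row per user but only the 5 requested columns:

> cols pcaMatrix
5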

Finding the nearest neighbors

Each row in the pcaMatrix variable represents one user. If we were to plot this dataset (which would be a challenge in five dimensions), we would see points scattered through the space, with each point gathered near other points of high similarity. What we are going to do next is use pcaMatrix and seek out the 5 nearest neighbors to a given user at a specified index.

To find the point nearest to another point, you need a distance measure. A commonly used distance measure is known as the Euclidean distance. The equation for this can be denoted as follows:

\mathrm{dist}(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}

This approach returns the true distance between two points, say a and b, but there is a way to optimize this function. When comparing two distances, we only want to know which is shorter; we don't need the distances themselves. We can still get the desired information by dropping the square root. Comparing Euclidean distances with or without the square root returns the same ordering, because the square root function is monotonically increasing on non-negative numbers (and the square root operation can be a costly calculation).
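We can convince ourselves of this with a quick, illustrative GHCi check: comparing two squared distances (9 and 16) yields the same ordering as comparing the distances themselves (3 and 4):

> compare (sqrt 9) (sqrt 16) == compare 9 16
True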

Because of this, we will use the Euclidean squared distance:

\mathrm{dist}^2(a, b) = \sum_{i=1}^{d} (a_i - b_i)^2

In order to perform this operation in code, we are going to isolate a single desired row in our matrix, repeat it n times (where n is the number of rows in the matrix), and then use matrix operations to return the Euclidean squared distance in the following way:

euclideanDistanceSqrdFromRow :: Matrix Double -> Integer -> Vector Double
euclideanDistanceSqrdFromRow records row =
    sumOfSquaresVector
  where
    d = cols records
    -- Repeat the selected row n times (n = number of rows in records)
    copyMatrix = repmat
      (subMatrix (fromIntegral row, 0) (1, d) records) (rows records) 1
    -- Element-wise difference, then element-wise square
    diffMatrix = records - copyMatrix
    diffSqrdMatrix = diffMatrix * diffMatrix
    -- Sum across the columns: one squared distance per row
    sumOfSquaresVector = sum $ toColumns diffSqrdMatrix

The result of this operation is a vector (not a matrix) of n elements, representing the squared distance from the selected row to every row in the matrix. The distance from the selected row to itself should be 0.
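As a quick illustration (a hypothetical 3x2 matrix, not from our Twitter database), row 0 should be at squared distance 0 from itself, 3^2+4^2=25 from row 1, and 1^2+1^2=2 from row 2:

> let m = (3><2) [0,0,3,4,1,1] :: Matrix Double
> euclideanDistanceSqrdFromRow m 0
fromList [0.0,25.0,2.0]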

Next, we need to come up with an ordering for the values returned by this function. Here's the approach I came up with. This algorithm assumes that the list of distances has been nub-sorted, meaning that the distances have been sorted and all duplicates have been removed. For each value in the nubSorted list, we return a list of all the index positions in the unsorted list that match that value. This is accomplished by using the elemIndices function, as follows:

-- requires: import qualified Data.List as L
order :: Eq a => [a] -> [a] -> [Integer]
order unsorted nubSorted =
  concatMap (\x -> L.map toInteger $ L.elemIndices x unsorted) nubSorted
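Here's an illustrative GHCi check of order on some made-up distances (this assumes Data.List's nub and sort are in scope). The positions of the smallest value come first:

> let distances = [5.0, 2.0, 9.0, 2.0]
> order distances (nub (sort distances))
[1,3,0,2]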

Finally, we put it all together into a function that will recommend users to us based on our usage of popular words. This function takes a matrix returned by the principalComponentAnalysis function and searches for the knn nearest index positions to a specified index position's row by using the Euclidean squared distance, as follows:

findClosestFeaturesToRow :: Matrix Double -> Integer -> Integer
  -> [Integer]
findClosestFeaturesToRow records row knn =
    take (fromIntegral knn) orderOfFeatures
  where
    -- Squared distance from the selected row to every row
    sumOfSquares = toList $ euclideanDistanceSqrdFromRow records row
    -- Index positions ordered from nearest to farthest
    orderOfFeatures = order sumOfSquares . nub $ sort sumOfSquares
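Continuing the illustrative 3x2 matrix from earlier (a made-up example, not from our Twitter database), the two nearest rows to row 0 should be row 0 itself (squared distance 0), followed by row 2 (squared distance 2):

> findClosestFeaturesToRow m 0 2
[0,2]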

Testing our recommendation engine

This aspect of the testing process will be different for everyone, since each of us will have downloaded our own database. I found a user in my database at index position 22951 who tweeted a lot. We are going to see whether we can recommend users to user #22951.

First, here's the feature vector for user #22951 representing the 25 most commonly used words in our database:

> genericIndex wordsMatrix 22951
[0.0,0.0,0.0,24.0,0.0,24.0,0.0,24.0,0.0,0.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

It seems that this user's word pattern focused on just 4 of the popular words. Let's use our recommendation engine to find the 5 users who are the most similar to #22951 using the following statement. Note that we purposely typed 6 here instead of 5: the closest match to #22951 is #22951 itself, which shouldn't surprise us at all. (We would hope that a recommendation engine is smart enough to notice someone identical to ourselves first.) Ignore the first result.

> findClosestFeaturesToRow pcaMatrix 22951 6
[22951,8020,13579,19982,11178,22227]

To see whether our engine works, the feature vector of each of these five users should be similar to #22951's feature vector. These users won't have exactly the same word usage pattern as #22951, but it should be similar enough to provide a recommendation. This can be seen using the following statements:

> genericIndex wordsMatrix 8020
[0.0,0.0,0.0,0.0,0.0,14.0,0.0,14.0,0.0,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

> genericIndex wordsMatrix 13579
[0.0,0.0,0.0,0.0,0.0,16.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

> genericIndex wordsMatrix 19982
[0.0,0.0,13.0,0.0,0.0,13.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

> genericIndex wordsMatrix 11178
[10.0,0.0,0.0,0.0,0.0,10.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

> genericIndex wordsMatrix 22227
[0.0,0.0,0.0,0.0,0.0,8.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]

Our PCA recommendation engine was able to find several users who shared a common word usage pattern with the user at row 22951 of our database, using only the eigenvectors associated with the top 5 eigenvalues. This was a small database of only 25,000 users with about 1 tweet each, and we were still able to come up with a short list of similar users.
