Case Study: Training a Recommender System in PySpark

To close this chapter, let us look at an example of how we might build a large-scale recommender system using dimensionality reduction. The dataset we will work with comes from a set of user transactions from an online store (Chen, Daqing, Sai Laing Sain, and Kun Guo. "Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining." Journal of Database Marketing & Customer Strategy Management 19.3 (2012): 197-208). In this model, we will input a matrix in which the rows are users and the columns represent items in the catalog of an e-commerce site. Items purchased by a user are indicated by a 1. Our goal is to factorize this matrix into 1 x k user factors (row components) and k x 1 item factors (column components) using k components. Then, presented with a new user and their purchase history, we can predict which items they are likely to buy in the future, and thus what we might recommend to them on a homepage. The steps to do so are as follows:

  1. Consider a user's prior purchase history as a vector p. We imagine this vector is the product of an unknown user factor u with the item factors we obtained through matrix factorization: each element of the vector p is then the dot product of this unknown user factor with the item factor for a given item. In other words, if I is the matrix whose columns are the item factors, we solve for the unknown user factor u in the equation:

     p = uI

  2. Given the item factors I and the purchase history p, solve this equation for u by matrix least squares.
  3. Using the resulting user factor u, take the dot product with each item factor and sort by the result to determine a list of the top-ranked items (a small sketch of these steps follows this list).
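
To make these steps concrete, the following is a minimal sketch of steps 1 to 3 in plain NumPy. The dimensions, the random placeholder data, and the variable names are illustrative assumptions; in practice, the item factors would come from the fitted factorization model rather than being generated randomly:

>>> import numpy as np
>>> k, n_items = 10, 1000 # illustrative dimensions
>>> I = np.random.rand(k, n_items) # placeholder k x n matrix of item factors
>>> p = np.random.randint(0, 2, n_items) # a new user's 0/1 purchase history
>>> u = np.linalg.lstsq(I.T, p, rcond=None)[0] # steps 1-2: least-squares solve of p = uI for u
>>> scores = u.dot(I) # step 3: predicted score for each item
>>> top_items = np.argsort(scores)[::-1][:10] # indices of the ten highest-scoring items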

Now that we have described what is happening under the hood in this example, we can begin to parse the data using the following commands. First, we create a parsing function to read the 2nd and 7th columns of the data, which contain the item ID and user ID, respectively:

>>> def parse_data(line):
…     try:
…         line_array = line.split(',')
…         return (line_array[6], line_array[1]) # (user ID, item ID) pairs
…     except:
…         return None
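
As a quick check, we can call this function on a line in the file's comma-separated layout (the row below is an illustrative example of the dataset's format: invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country):

>>> parse_data('536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom')
('17850', '85123A')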

Next, we read in the file and convert the user and item IDs, which are both strings, into numerical indices by incrementing a counter as we add unique users and items to their respective dictionaries:

>>> f = open('Online Retail.csv', encoding="Windows-1252")
>>> purchases = []
>>> users = {}
>>> items = {}
>>> user_index = 0
>>> item_index = 0
>>> for index, line in enumerate(f):
…     if index > 0: # skip header
…         purchase = parse_data(line)
…         if purchase is not None:
…             if users.get(purchase[0]) is not None:
…                 purchase_user = users.get(purchase[0])
…             else: # first time seeing this user: assign the next index
…                 users[purchase[0]] = user_index
…                 user_index += 1
…                 purchase_user = users.get(purchase[0])
…             if items.get(purchase[1]) is not None:
…                 purchase_item = items.get(purchase[1])
…             else: # first time seeing this item: assign the next index
…                 items[purchase[1]] = item_index
…                 item_index += 1
…                 purchase_item = items.get(purchase[1])
…             purchases.append((purchase_user, purchase_item))
>>> f.close()

Next, we convert the resulting array of purchases into an RDD and convert the entries into Rating objects: a (user, item, rating) tuple defined in pyspark.mllib.recommendation. Here, we will just indicate that a purchase occurred by giving a rating of 1.0 to all observed purchases, but we could just as well have a system where the ratings indicate user preference (such as movie ratings) and follow a numerical scale. Note that we import Rating (along with ALS and MatrixFactorizationModel, which we will use shortly) before constructing the RDD:

>>> from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
>>> purchasesRdd = sc.parallelize(purchases, 5).map(lambda x: Rating(x[0], x[1], 1.0))

Now we can fit the matrix factorization model using the following commands:

>>> k = 10
>>> iterations = 10
>>> mfModel = ALS.train(purchasesRdd, k, iterations)

The algorithm PySpark uses for matrix factorization is Alternating Least Squares (ALS). Its parameters include the number of row (and column) components, k, and a regularization parameter, λ, which we did not specify here. The latter functions similarly to its role in the regression algorithms we studied in Chapter 4, Connecting the Dots with Models – Regression Methods: it constrains the values in the row (column) vectors from becoming too large and potentially causing overfitting.
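
For example, a call that sets the regularization explicitly might look like the following (lambda_ is the keyword argument PySpark uses, since lambda is a reserved word in Python; the value 0.1 is an arbitrary illustration rather than a tuned setting):

>>> mfModel = ALS.train(purchasesRdd, k, iterations, lambda_=0.1)

Also note that because our ratings of 1.0 record implicit purchases rather than explicit preferences, the related ALS.trainImplicit method, which treats the input as confidence-weighted implicit feedback, is a natural variant to try.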

We could try several values of k and λ, and measure the mean squared error between the observed matrix and the predicted matrix (obtained by multiplying the row factors by the column factors) to determine the optimal values.
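
As a sketch of what such an evaluation might look like, the following measures, for simplicity, the reconstruction error on the training purchases themselves, following the usual Spark MLlib pattern:

>>> userProducts = purchasesRdd.map(lambda r: (r.user, r.product))
>>> predictions = mfModel.predictAll(userProducts).map(lambda r: ((r.user, r.product), r.rating))
>>> ratesAndPreds = purchasesRdd.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
>>> mse = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()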

Once we have obtained a good fit, we can use the predict and predictAll methods of the model object to obtain predictions for new users, and then persist the model to disk using the save method.
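
For instance, assuming a user index of 0 and an illustrative output path, the calls might look like the following; the model also offers a recommendProducts method that returns the top-scoring items for a user directly:

>>> mfModel.predict(0, 0) # predicted score for user 0 and item 0
>>> mfModel.recommendProducts(0, 10) # top ten recommended items for user 0
>>> mfModel.save(sc, 'mfModel') # persist the fitted model; the path is illustrative
>>> sameModel = MatrixFactorizationModel.load(sc, 'mfModel') # reload it later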
