Splitting into training and testing

At a high level, we split the dataset into training and testing data to obtain a principled estimate of the system's performance, just as we did in previous chapters: we reserve a fraction of the data points (here, 10 percent) for testing and use the rest for training.

However, because the data is structured differently in this context, the code is different. In some of the models we explore, setting aside 10 percent of the users would not work because we transpose the data. Instead, we hold out 10 percent of the individual ratings.

The first step is to load the data from the disk, for which we use the following function:

def load():
    import numpy as np
    from scipy import sparse

    # Each row of u.data is: user id, movie id, rating, timestamp
    data = np.loadtxt('data/ml-100k/u.data')
    ij = data[:, :2].astype(int)  # (user, movie) pairs as integer indices
    ij -= 1  # original data is in 1-based system
    values = data[:, 2]
    reviews = sparse.csc_matrix((values, ij.T)).astype(float)
    return reviews.toarray()
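To see what the sparse constructor does with the index pairs, here is a minimal sketch on a made-up set of ratings. The array `toy` and its values are hypothetical and are not part of the MovieLens file; only the pattern of shifting 1-based ids to 0-based indices mirrors `load()`.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: (user id, movie id, rating), with 1-based ids
toy = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 1, 1],
])

rows = toy[:, 0].astype(int) - 1   # users, shifted to 0-based
cols = toy[:, 1].astype(int) - 1   # movies, shifted to 0-based
values = toy[:, 2].astype(float)

# Entries that are never mentioned stay at zero, i.e. "no rating"
toy_reviews = sparse.csc_matrix((values, (rows, cols))).toarray()
print(toy_reviews)
```

The result is a dense 3 x 3 array in which row `i` is a user and column `j` is a movie.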

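Note that zero entries in this matrix represent missing ratings, so np.where gives us the user and movie indices of the ratings that are actually present:

import numpy as np

reviews = load()
# U[k], M[k] are the row (user) and column (movie) of the k-th rating
U, M = np.where(reviews)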
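As a quick illustration of how `np.where` behaves when called with a single argument, consider this small made-up matrix (the name `toy` and its values are assumptions for illustration only):

```python
import numpy as np

# Hypothetical 2 users x 3 movies matrix; 0 means "not rated"
toy = np.array([[5., 0., 3.],
                [0., 4., 0.]])

# With one argument, np.where returns the indices of the nonzero entries,
# one array per axis, in row-major order
U, M = np.where(toy)
print(U)  # [0 0 1]
print(M)  # [0 2 1]
```

Each position `k` therefore identifies one existing rating, `toy[U[k], M[k]]`, which is what lets us sample ratings by sampling positions.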

We now use the standard random module to choose the indices to test:

import random 
test_idxs = np.array(random.sample(range(len(U)), len(U)//10))
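If you want the same split on every run, one option is to seed the random module first. The sketch below is an assumption on our part rather than something the text above does; the seed value and the stand-in count `n_ratings` are arbitrary.

```python
import random
import numpy as np

n_ratings = 1000  # hypothetical stand-in for len(U)

random.seed(0)  # arbitrary seed, chosen only to make the sample repeatable
test_idxs = np.array(random.sample(range(n_ratings), n_ratings // 10))

# Re-seeding and sampling again yields the same indices
random.seed(0)
again = np.array(random.sample(range(n_ratings), n_ratings // 10))
```

Because `random.sample` draws without replacement, no rating can be selected for testing twice.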

Now we build the train matrix, which is like reviews, but with the testing entries set to zero:

train = reviews.copy() 
train[U[test_idxs], M[test_idxs]] = 0

Finally, the test matrix contains just the testing values:

test = np.zeros_like(reviews) 
test[U[test_idxs], M[test_idxs]] = reviews[U[test_idxs], M[test_idxs]]
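A few sanity checks can confirm that a split of this kind behaves as intended. The sketch below repeats the same steps on a small randomly generated matrix; `toy`, its size, and the seeds are hypothetical stand-ins for the real `reviews` array.

```python
import random
import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

# Hypothetical stand-in for reviews: 20 users x 15 movies,
# roughly 30% of entries rated 1-5, the rest 0 (missing)
ratings = rng.integers(1, 6, size=(20, 15))
mask = rng.random((20, 15)) < 0.3
toy = (ratings * mask).astype(float)

U, M = np.where(toy)
test_idxs = np.array(random.sample(range(len(U)), len(U) // 10))

train = toy.copy()
train[U[test_idxs], M[test_idxs]] = 0

test = np.zeros_like(toy)
test[U[test_idxs], M[test_idxs]] = toy[U[test_idxs], M[test_idxs]]

# No rating appears in both matrices
assert not np.any((train != 0) & (test != 0))
# Together they reconstruct the original matrix
assert np.array_equal(train + test, toy)
# The test matrix holds exactly the sampled ratings
assert np.count_nonzero(test) == len(test_idxs)
```

The three assertions check that the train and test entries are disjoint, that nothing is lost, and that the test set has the expected size.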

From now on, we will work with the training data and try to predict all the missing entries in the dataset. That is, we will write code that assigns each user–movie pair a recommendation.
