4.1 Introduction
Netflix, Hulu, and Amazon Prime all attempt to help you pick movies. In this chapter, we will create a database of movies, with fictional ratings. We will then create a set of viewers. We will then try to predict if a viewer would choose to watch a particular movie. We will use deep learning with MATLAB’s pattern recognition network, patternnet. You will see that we can achieve accuracies of up to 100% over our small set of movies. Guessing what a customer would like to buy is something that all manufacturers and retailers want to do as it lets them focus their efforts on products that are of the greatest interest to their customers. As we show in this chapter, deep learning can be a valuable tool.
4.2 Generating a Movie Database
4.2.1 Problem
We first need to generate a database of movies.
4.2.2 Solution
Write a MATLAB function, CreateMovieDatabase.m, to create a database of movies. The movies will have fields for genre, reviewer ratings (like IMDB), and viewer ratings.
4.2.3 How It Works
The movie database will have five characteristics.
Characteristic | Value |
Name | String |
Genre | Type of movie (Animated, Comedy, Dance, Drama, Fantasy, Romance, SciFi, War ) in a string |
Rating | Average viewer rating string, one to four stars, such as from IMBD |
Quality | Number of stars |
Length | Minute duration |
MPAA Rating | MPAA Rating (PG, PG-13, R) in a string |
MPAA stands for Motion Picture Association of America. It is an organization that rates movies. Other systems are possible, but this will be sufficient to test out our deep learning system. Three of the data points that will be used are strings and two are numbers. One number, length, is a continuum, while ratings have discrete values. The second number, quality, is based on the “stars” in the rating. Some movie databases, like IMDB, have fractional values because they average over all their users. We created our own MPAA ratings and genres based on our opinions. The real MPAA ratings may be different.
Length can be any duration. We’ll use randn to generate the lengths around a mean of 1.8 hours and a standard deviation of 0.15 hours. Length is a floating-point number. Stars are one to five and must be integers.
We created an Excel file with the names of 100 real movies, which is included with the book’s software. We assigned random genres and MPAA ratings (PG, R, and so forth) to them. We then saved the Excel file as tab-delimited text and search for tabs in each line. (There are other ways to import data from Excel and text files in MATLAB; this is just one example.) We then assign the data to the fields. The function will check to see if the maximum length or rating is zero, which it is for all the movies in this case, and then create random values. You can create a spreadsheet with rating values as an extension of this recipe. We use str2double since it is faster than str2num when you know that the value is a single number. fgetl reads in one line and ignores the end of line characters.
4.3 Generating a Viewer Database
4.3.1 Problem
We next need to generate a database of movie viewers for training and testing.
4.3.2 Solution
Write a MATLAB function, CreateMovieViewers.m, to create a series of watchers. We will use a probability model to select which of the movies each viewer has watched based on the movie’s genre, length, and ratings.
4.3.3 How It Works
Each watcher will have seen a fraction of the 100 movies in our database. This will be a random integer between 20 and 60. Each movie watcher will have a probability for each characteristic: the probability that they would watch a movie rated one or five stars, the probability that they would watch a movie in a given genre, etc. (Some viewers enjoy watching so-called “turkeys”!) We will combine the probabilities to determine the movies the viewer has watched. For mPAA, genre, and rating, the probabilities will be discrete. For the length, it will be a continuous distribution. You could argue that a watcher would always want the highest-rated movie, but remember this rating is based on an aggregate of other people’s opinions and so may not directly map onto the particular viewer. The only output of this function is a list of movie numbers for each user. The list is in a cell array.
4.4 Training and Testing
4.4.1 Problem
We want to test a deep learning algorithm to select new movies for the viewer, based on what the algorithm thinks a viewer would choose to watch.
4.4.2 Solution
Create a viewer database and train a pattern recognition neural net on the viewer’s movie selections. This is done in the script MovieNN.m. We will train a neural net for each viewer in the database.
4.4.3 How It Works
The next block displays the characteristics of the movies each viewer has watched. This is shown graphically in Figure 4.1. For the moment, there are only four viewers.
We use bar charts throughout. Notice how we make the x labels strings for the genre and so on. We also rotate them 90 degrees for clarity. The length is the number of movies longer than the number on the x-axis.
This data is based on our viewer model from a recipe in Section 4.3 which is based on joint probabilities. We will train the neural net on a subset of the movies. This is a classification problem. We just want to know if a given movie would be picked or not picked by the viewer.
The training window is shown in Figure 4.3. When we view the net, MATLAB opens the display in Figure 4.2. Each net has four inputs, for the movie’s rating, length, genre, and MPAA classification. The net’s single output is the classification of whether the viewer has watched the movie or not. The training window provides access to additional plots of the training and performance data. We train with 70% of the data.