© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
M. Paluszek et al.Practical MATLAB Deep Learninghttps://doi.org/10.1007/978-1-4842-7912-0_4

4. Classifying Movies

Michael Paluszek1  , Stephanie Thomas2   and Eric Ham2  
(1)
Plainsboro, NJ, USA
(2)
Princeton, NJ, USA
 

Abstract

Netflix, Hulu, and Amazon Prime all attempt to help you pick movies. In this chapter, we will create a database of movies, with fictional ratings. We will then create a set of viewers. We will then try to predict if a viewer would choose to watch a particular movie. We will use deep learning with MATLAB’s pattern recognition network, patternnet. You will see that we can achieve accuracies of up to 100% over our small set of movies. Guessing what a customer would like to buy is something that all manufacturers and retailers want to do as it lets them focus their efforts on products that are of the greatest interest to their customers. As we show in this chapter, deep learning can be a valuable tool.

4.1 Introduction

Netflix, Hulu, and Amazon Prime all attempt to help you pick movies. In this chapter, we will create a database of movies, with fictional ratings. We will then create a set of viewers. We will then try to predict if a viewer would choose to watch a particular movie. We will use deep learning with MATLAB’s pattern recognition network, patternnet. You will see that we can achieve accuracies of up to 100% over our small set of movies. Guessing what a customer would like to buy is something that all manufacturers and retailers want to do as it lets them focus their efforts on products that are of the greatest interest to their customers. As we show in this chapter, deep learning can be a valuable tool.

4.2 Generating a Movie Database

4.2.1 Problem

We first need to generate a database of movies.

4.2.2 Solution

Write a MATLAB function, CreateMovieDatabase.m, to create a database of movies. The movies will have fields for genre, reviewer ratings (like IMDB), and viewer ratings.

4.2.3 How It Works

We first need to come up with a method for characterizing movies. Table 4.1 gives our system.
Table 4.1

The movie database will have five characteristics.

Characteristic

Value

Name

String

Genre

Type of movie (Animated, Comedy, Dance, Drama, Fantasy, Romance, SciFi, War ) in a string

Rating

Average viewer rating string, one to four stars, such as from IMBD

Quality

Number of stars

Length

Minute duration

MPAA Rating

MPAA Rating (PG, PG-13, R) in a string

MPAA stands for Motion Picture Association of America. It is an organization that rates movies. Other systems are possible, but this will be sufficient to test out our deep learning system. Three of the data points that will be used are strings and two are numbers. One number, length, is a continuum, while ratings have discrete values. The second number, quality, is based on the “stars” in the rating. Some movie databases, like IMDB, have fractional values because they average over all their users. We created our own MPAA ratings and genres based on our opinions. The real MPAA ratings may be different.

Length can be any duration. We’ll use randn to generate the lengths around a mean of 1.8 hours and a standard deviation of 0.15 hours. Length is a floating-point number. Stars are one to five and must be integers.

We created an Excel file with the names of 100 real movies, which is included with the book’s software. We assigned random genres and MPAA ratings (PG, R, and so forth) to them. We then saved the Excel file as tab-delimited text and search for tabs in each line. (There are other ways to import data from Excel and text files in MATLAB; this is just one example.) We then assign the data to the fields. The function will check to see if the maximum length or rating is zero, which it is for all the movies in this case, and then create random values. You can create a spreadsheet with rating values as an extension of this recipe. We use str2double since it is faster than str2num when you know that the value is a single number. fgetl reads in one line and ignores the end of line characters.

You’ll notice that we check for NaN in the length and rating fields since
Running the function demo, shown in the following, creates a database of movies in a data structure. The length and ratings are added here.
The output is the following:
Here are the first few movies:

4.3 Generating a Viewer Database

4.3.1 Problem

We next need to generate a database of movie viewers for training and testing.

4.3.2 Solution

Write a MATLAB function, CreateMovieViewers.m, to create a series of watchers. We will use a probability model to select which of the movies each viewer has watched based on the movie’s genre, length, and ratings.

4.3.3 How It Works

Each watcher will have seen a fraction of the 100 movies in our database. This will be a random integer between 20 and 60. Each movie watcher will have a probability for each characteristic: the probability that they would watch a movie rated one or five stars, the probability that they would watch a movie in a given genre, etc. (Some viewers enjoy watching so-called “turkeys”!) We will combine the probabilities to determine the movies the viewer has watched. For mPAA, genre, and rating, the probabilities will be discrete. For the length, it will be a continuous distribution. You could argue that a watcher would always want the highest-rated movie, but remember this rating is based on an aggregate of other people’s opinions and so may not directly map onto the particular viewer. The only output of this function is a list of movie numbers for each user. The list is in a cell array.

We start by creating cell arrays of the categories. We then loop through the viewers and compute probabilities for each movie characteristic. We then loop through the movies and compute the combined probabilities. This results in a list of movies watched by each viewer.
This code computes the Gaussian or normal probability. The inputs include a standard deviation sigma and a mean mu.
The built-in function demo follows the Gaussian function. It is run automatically if the function is called with no inputs.
The output from the demo is as follows. This shows how many of the movies in the database each viewer has watched; the most is 57 movies and the fewest is 26.

4.4 Training and Testing

4.4.1 Problem

We want to test a deep learning algorithm to select new movies for the viewer, based on what the algorithm thinks a viewer would choose to watch.

4.4.2 Solution

Create a viewer database and train a pattern recognition neural net on the viewer’s movie selections. This is done in the script MovieNN.m. We will train a neural net for each viewer in the database.

4.4.3 How It Works

First, the movie data is loaded and displayed.
The viewer database is then created from the movie database:

The next block displays the characteristics of the movies each viewer has watched. This is shown graphically in Figure 4.1. For the moment, there are only four viewers.

Figure 4.1

The watched movie data for four viewers.

We use bar charts throughout. Notice how we make the x labels strings for the genre and so on. We also rotate them 90 degrees for clarity. The length is the number of movies longer than the number on the x-axis.

This data is based on our viewer model from a recipe in Section 4.3 which is based on joint probabilities. We will train the neural net on a subset of the movies. This is a classification problem. We just want to know if a given movie would be picked or not picked by the viewer.

We use patternnet to predict the movies watched. This is shown in the next code block. The input to patternnet is the sizes of the hidden layers, in this case, a single layer of size 40. We convert everything into integers. Note that you need to round the results since patternnet does not return integers, despite the label being an integer. patternnet has methods train and view.

The training window is shown in Figure 4.3. When we view the net, MATLAB opens the display in Figure 4.2. Each net has four inputs, for the movie’s rating, length, genre, and MPAA classification. The net’s single output is the classification of whether the viewer has watched the movie or not. The training window provides access to additional plots of the training and performance data. We train with 70% of the data.

The output of the script for patternnet(40) is shown in the following. The accuracy is the percentage of the movies in the test set (30% of the data available) that the net correctly predicted the viewer has watched. The accuracy is usually between 65% and 94% for this size hidden layer.
Figure 4.2

The patternnet network with four inputs and one output.

patternnet(40) returns good results, but we also tried greater and smaller numbers of layers and multiple hidden layers. For example, with a layer size of just 5, the accuracy ranges from 50 to 70%. With a size of 50, we reached over 90% of all viewers! Granted, this is a small number of movies. The results will vary with each run due to the random nature of the variables in the test. The predictions are probably as good as Netflix! It is important to note that the neural network did not know anything about the viewer model. Nonetheless, it does a good job of predicting movies that the viewer might like.
Figure 4.3

Patternnet training window.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset