Rating predictions and recommendations

If you have used any online shopping system in the last 10 years, you have probably seen recommendations. Some are like Amazon's "customers who bought X also bought Y" feature; these will be discussed in the Basket analysis section. Other recommendations are based on predicting the rating a user would give to a product, such as a movie.

The problem of learning recommendations from past product ratings was made famous by the Netflix Prize, a million-dollar public machine-learning challenge run by Netflix. Netflix is a movie-streaming company. One of the distinguishing features of the service is that it gives users the option to rate the films they have seen. Netflix then uses these ratings to recommend other films to its customers. In this machine-learning problem, you not only have information about which films each user saw, but also about how the user rated them.

In 2006, Netflix made a large number of customer ratings of films in its database available for a public challenge. The goal was to improve on its in-house algorithm for rating prediction: whoever could beat it by 10 percent or more would win 1 million dollars. In 2009, an international team named BellKor's Pragmatic Chaos was able to beat this mark and take the prize. They did so just 20 minutes before another team, The Ensemble, passed the 10 percent mark as well, an exciting photo finish for a competition that had lasted several years.

Machine learning in the real world:
Much has been written about the Netflix Prize, and you may learn a lot by reading up on it. The techniques that won were a mixture of advanced machine learning and a lot of work put into preprocessing the data. For example, some users like to rate everything very highly, while others are always more negative; if you do not account for this in preprocessing, your model will suffer. Other normalizations were also necessary for a good result, bearing in mind factors such as the film's age and how many ratings it received. Good algorithms are a good thing, but you always need to get your hands dirty and tune your methods to the properties of the data you have in front of you. Preprocessing and normalizing the data is often the most time-consuming part of the machine-learning process. However, this is also the place where one can have the biggest impact on the final performance of the system.
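To make the per-user bias concrete, here is a minimal sketch of one such normalization: subtracting each user's mean rating so that generous and harsh raters become comparable. The small ratings matrix is a made-up toy example (rows are users, columns are movies, 0 marks an unrated entry), not the Netflix data.

import numpy as np

# Toy ratings matrix: rows = users, columns = movies, 0 = not rated
ratings = np.array([
    [5, 4, 0, 5],   # a user who rates everything highly
    [2, 0, 1, 2],   # a user who rates everything low
    [3, 3, 4, 0],
], dtype=float)

rated = ratings > 0                                   # mask of observed ratings
user_means = ratings.sum(axis=1) / rated.sum(axis=1)  # mean over rated entries only

# Subtract each user's mean from their observed ratings
normalized = ratings.copy()
for u in range(ratings.shape[0]):
    normalized[u, rated[u]] -= user_means[u]

print(np.round(user_means, 2))   # [4.67 1.67 3.33]
print(np.round(normalized, 2))

After this step, a rating of 5 from a generous rater and a rating of 3 from a harsh one both show up as being above that user's own average, which is the signal a rating-prediction model actually needs.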

The first thing to note about the Netflix Prize is how hard it was. Roughly speaking, the internal system that Netflix used was about 10 percent better than having no recommendations at all (that is, assigning each movie just the average value for all users). The goal was to obtain just another 10 percent improvement on this. In total, the winning system was roughly just 20 percent better than no personalization. Yet it took a tremendous amount of time and effort to achieve this goal, and even though 20 percent does not seem like much, the result is a system that is useful in practice.
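The "no personalization" baseline mentioned above is easy to state as code. The sketch below, on toy arrays rather than the real dataset, predicts the per-movie average rating for every user and scores it with RMSE, the error measure the prize used; the array contents and function names are illustrative only.

import numpy as np

def movie_mean_baseline(train):
    """Return the mean rating of each movie, ignoring unrated (zero) entries."""
    rated = train > 0
    counts = rated.sum(axis=0)
    return np.where(counts > 0, train.sum(axis=0) / np.maximum(counts, 1), 0.0)

def rmse(predicted, actual):
    """Root mean squared error over the entries that were actually rated."""
    mask = actual > 0
    return np.sqrt(np.mean((predicted[mask] - actual[mask]) ** 2))

# Toy data: rows = users, columns = movies, 0 = not rated
train = np.array([[5, 4, 0], [3, 0, 2], [4, 5, 1]], dtype=float)
test  = np.array([[0, 0, 2], [0, 4, 0], [0, 0, 0]], dtype=float)

means = movie_mean_baseline(train)
predictions = np.tile(means, (test.shape[0], 1))   # same prediction for every user
print("baseline RMSE:", rmse(predictions, test))

Any personalized method we build later has to beat this very simple predictor to justify its added complexity.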

Unfortunately, for legal reasons, this dataset is no longer available. Although the dataset was anonymized, there were concerns that it might be possible to discover who the customers were and reveal private details of their movie rentals. However, we can use an academic dataset with similar characteristics. This data comes from GroupLens, a research laboratory at the University of Minnesota.
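As a preview of working with the GroupLens data, the following sketch loads the ratings into a user-by-movie matrix. The file name u.data and its tab-separated layout (user id, movie id, rating, timestamp) correspond to the MovieLens 100k release; the path is an assumption and should be adjusted to wherever you unpack the dataset you download.

import numpy as np

def load_movielens(path='ml-100k/u.data'):
    # Assumed file layout: one rating per line, whitespace-separated columns
    # user id, movie id, rating, timestamp (MovieLens 100k format)
    data = np.loadtxt(path, dtype=int)
    users, movies, ratings = data[:, 0], data[:, 1], data[:, 2]
    matrix = np.zeros((users.max(), movies.max()), dtype=float)
    matrix[users - 1, movies - 1] = ratings   # IDs are 1-based in the file
    return matrix

# reviews = load_movielens()
# print(reviews.shape)   # roughly (943, 1682) for the 100k release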

How can we solve a Netflix-style ratings prediction question? We will look at two different kinds of approach: neighborhood approaches and regression approaches. We will also see how to combine these methods to obtain a single prediction.
