The dataset

The Large Movie Review Dataset, originally published in the paper Learning Word Vectors for Sentiment Analysis by Andrew L. Maas et al., can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/.

The downloaded archive contains two folders labeled train and test. The train folder holds 12,500 positive reviews and 12,500 negative reviews that we will train a classifier on. The test folder contains the same number of positive and negative reviews, for a grand total of 50,000 labeled reviews across the two folders.
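As a minimal sketch of how one of these folders could be read into memory, the helper below walks a split and pairs each review with a numeric label. It assumes that each split contains pos and neg subfolders of .txt files, one review per file; that layout is not described above, so check it against your extracted archive. The function name load_reviews and the 1/0 label encoding are hypothetical choices made for illustration.

```python
from pathlib import Path


def load_reviews(root, split):
    """Return (texts, labels) for one split ("train" or "test").

    Assumed layout (verify against the extracted archive):
        <root>/<split>/pos/*.txt   positive reviews
        <root>/<split>/neg/*.txt   negative reviews
    Labels are encoded as 1 for positive and 0 for negative.
    """
    texts, labels = [], []
    for folder_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(root) / split / folder_name
        # Sort the file names so the ordering is reproducible across runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

Calling load_reviews("aclImdb", "train"), where aclImdb stands in for wherever you extracted the archive, should then give two lists of equal length: the review texts and their labels. If the counts above hold, each split yields 25,000 entries.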

Let's look at an example of one review to see what the data looks like:

"Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross-section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina, and Natella, our three "protagonists", for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"

The only things we have to work with are the raw text of the review and its sentiment label. We know nothing about the date the review was posted, who posted it, or any other metadata that might have been helpful to us.
