Sentiment classification of movie reviews

Sentiment analysis is the ability to identify the opinions expressed in written or spoken text. The main purpose of this technique is to determine the sentiment (or polarity) of an expression, which may have a neutral, positive, or negative connotation.

The problem we want to solve is the IMDB movie review sentiment classification problem. Each movie review is a variable-length sequence of words, and the sentiment (positive or negative) of each review must be classified.

This problem is challenging because the sequences vary in length and are drawn from a large vocabulary of input symbols.

The solution requires the model to learn long-term dependencies between symbols in the input sequence.

The IMDB dataset contains 25,000 highly polarized movie reviews (good or bad) for training and the same number again for testing. The data was collected by Stanford researchers and used in a 2011 paper that applied a 50/50 training/testing split and achieved an accuracy of 88.89%.

Now that we have defined our problem, we are ready to develop an LSTM model to classify the sentiment of movie reviews. We can quickly build an LSTM for the IMDB problem and achieve good accuracy.

Let's start off by importing the classes and functions required for this model, and initializing the random number generator to a constant value, to ensure we can easily reproduce the results:

import numpy 
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
numpy.random.seed(7)

We load the IMDB dataset, constraining it to the top 5,000 most frequent words, and split it into training (50%) and testing (50%) sets.

Keras provides built-in access to the IMDb dataset (http://www.imdb.com/interfaces). Alternatively, you can download the IMDB dataset from the Kaggle website at https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

The imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models. The words have been replaced by integers that indicate the frequency rank of each word in the dataset. Each review is therefore a sequence of integers.

Here's the code:

top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)
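
To get a feel for this encoding, we can map the integers back to words using the word index that Keras exposes. This is a minimal sketch, assuming the imdb.get_word_index() helper and the conventional offset of 3 applied by load_data() (indices 0, 1, and 2 are reserved for padding, start-of-sequence, and out-of-vocabulary markers):

# Build a reverse lookup from integer index to word (for inspection only)
word_index = imdb.get_word_index()
reverse_index = {index + 3: word for word, index in word_index.items()}
# Decode the first training review; reserved indices fall back to '?'
print(' '.join(reverse_index.get(i, '?') for i in X_train[0]))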

Next, we need to truncate and pad the input sequences so that they are all the same length for modeling. The model will learn that the zero values carry no information; although the reviews differ in length, Keras requires same-length vectors to perform the computation. We therefore constrain each review to 500 words, truncating longer reviews and padding shorter ones with zero values.

Let's see the code:

max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
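
As a quick sanity check, we can confirm that every review now has exactly 500 entries; the shapes in the comments assume the standard 25,000-review training and testing splits:

print(X_train.shape)  # expected: (25000, 500)
print(X_test.shape)   # expected: (25000, 500)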

We can now define, compile, and fit our LSTM model.

To solve the sentiment classification problem, we'll use the word embedding technique, which represents words in a continuous vector space, that is, a space in which semantically similar words are mapped to neighboring points. Word embedding is based on the distributional hypothesis: words that appear in similar contexts tend to share similar meanings. Each movie review will then be mapped into a real-valued vector space, where similarity between words in terms of meaning translates to closeness in the vector space. Keras provides a convenient way to convert positive integer representations of words into word embeddings via an Embedding layer.

Here, we define the length of the embedding vector and the model:

embedding_vector_length = 32 
model = Sequential()

The first layer is the Embedding layer, which uses 32-length vectors to represent each word:

model.add(Embedding(top_words,  
embedding_vector_length,
input_length=max_review_length))

The next layer is the LSTM layer, with 100 memory units. Finally, because this is a binary classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to output a value between 0 and 1 for the two classes (good and bad) in the problem:

model.add(LSTM(100)) 
model.add(Dense(1, activation='sigmoid'))
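
Before compiling, it can help to inspect the shape produced by each layer; the following sketch prints them, with the batch dimension shown as None:

# Embedding: (None, 500, 32), LSTM: (None, 100), Dense: (None, 1)
for layer in model.layers:
    print(layer.name, layer.output_shape)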

Because this is a binary classification problem, binary_crossentropy is used as the loss function, while the optimizer used here is the adam optimization algorithm (which we also encountered in a previous TensorFlow implementation):

model.compile(loss='binary_crossentropy', 
optimizer='adam',
metrics=['accuracy'])
print(model.summary())

We fit the model for only three epochs, because it quickly starts to overfit. A batch size of 64 reviews is used to space out weight updates:

model.fit(X_train, y_train,  
validation_data=(X_test, y_test),
nb_epoch=3,
batch_size=64)

Then, we estimate the model's performance on unseen reviews:

scores = model.evaluate(X_test, y_test, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Running this example produces the following output:

Epoch 1/3 
16750/16750 [==============================] - 107s - loss: 0.5570 - acc: 0.7149
Epoch 2/3
16750/16750 [==============================] - 107s - loss: 0.3530 - acc: 0.8577
Epoch 3/3
16750/16750 [==============================] - 107s - loss: 0.2559 - acc: 0.9019
Accuracy: 86.79%

You can see that this simple LSTM, with little tuning, achieves near state-of-the-art results on the IMDB problem. Importantly, this is a template that you can use to apply LSTM networks to your own sequence classification problems.
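
As an illustration of reusing the template, the following sketch shows one way to score a new, raw review with the trained model. The encode_review() helper, its simple lower-casing and punctuation stripping, and the sample review text are assumptions for demonstration only; the offset of 3 matches the encoding used by imdb.load_data():

def encode_review(text, word_index, top_words=5000, maxlen=500):
    # Keep only words in the IMDB index that fall within the top_words cutoff
    tokens = text.lower().replace('.', ' ').replace(',', ' ').split()
    encoded = [word_index[w] + 3 for w in tokens
               if w in word_index and word_index[w] + 3 < top_words]
    return sequence.pad_sequences([encoded], maxlen=maxlen)

word_index = imdb.get_word_index()
review = "this movie was a wonderful surprise with a brilliant cast"
probability = model.predict(encode_review(review, word_index))[0][0]
print("Positive" if probability > 0.5 else "Negative", probability)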
