An LSTM predictive model for sentiment analysis

Sentiment analysis is one of the most widely used tasks in NLP. An LSTM network can be used to classify short texts into desired categories; this is a classification problem. For example, a set of tweets can be categorized as either positive or negative. In this section, we will see such an example.

Network design

The implemented LSTM network will have three layers: an embedding layer, an RNN layer, and a softmax layer. A high-level view of this can be seen in the following diagram. Here, I summarize the functionalities of all of the layers:

  • Embedding layer: We will see an example in Chapter 8, Advanced TensorFlow Programming, that shows that text datasets cannot be fed to Deep Neural Networks (DNNs) directly, so an additional layer called an embedding layer is required. For this layer, we transform each input, which is a tensor of k words, into a tensor of k N-dimensional vectors. This is called word embedding, where N is the embedding size. Every word will be associated with a vector of weights that needs to be learned during the training process. You can gain more insight into word embedding in the Vector Representations of Words tutorial.
  • RNN layer: Once we have constructed the embedding layer, there will be a new layer called the RNN layer, which is made out of LSTM cells with a dropout wrapper. The LSTM weights need to be learned during the training process, as described in the previous sections. The RNN layer is unrolled dynamically (as shown in figure 4), taking k word embeddings as input and outputting k M-dimensional vectors, where M is the hidden size of the LSTM cells.
  • Softmax or sigmoid layer: The RNN layer's output is averaged across the k time steps, obtaining a single tensor of size M. Finally, a softmax layer, for example, is used to compute the classification probabilities (a shape-level sketch of this pipeline follows Figure 20).

    Figure 20: The high-level view of the LSTM network for sentiment analysis
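
To make the tensor shapes concrete before we build the real network, the following is a minimal sketch of the forward pass through the three layers. The toy sizes and names (k, N, M) are illustrative only; the actual model class we build shortly adds dynamic sequence lengths, dropout, and summaries:

import tensorflow as tf

batch_size, k = 32, 20            # k words per (padded) sample
vocab_size, N, M = 5000, 75, 75   # vocabulary size, embedding size N, hidden size M
n_classes = 2

x = tf.placeholder(tf.int32, [batch_size, k])                      # word IDs
embeddings = tf.get_variable('E', [vocab_size, N])                 # embedding matrix
embedded = tf.nn.embedding_lookup(embeddings, x)                   # [batch_size, k, N]
cell = tf.nn.rnn_cell.LSTMCell(M)
outputs, _ = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)   # [batch_size, k, M]
pooled = tf.reduce_mean(outputs, axis=1)                           # [batch_size, M]
logits = tf.layers.dense(pooled, n_classes)                        # [batch_size, n_classes]
probabilities = tf.nn.softmax(logits)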

We will see later how cross-entropy is used as the loss function and how RMSProp is used as the optimizer that minimizes it.
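
For reference, with a one-hot target vector y and a predicted probability vector ŷ over the classes, the cross-entropy loss of a single sample is the standard definition:

L(y, \hat{y}) = -\sum_{c=1}^{n_\text{classes}} y_c \log \hat{y}_c

The quantity actually minimized is the mean of this loss over a batch.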

LSTM model training

The UMICH SI650 sentiment classification dataset (with duplicates removed), which contains product and movie reviews donated by the University of Michigan, can be downloaded from https://inclass.kaggle.com/c/si650winter11/data. Unwanted and special characters were cleaned before the tokens were extracted (see the data.csv file).

The following script also removes stop words (see data_preparation.py). Some samples are shown below, labeled as either positive or negative (1 is positive and 0 is negative); a quick sketch for inspecting the raw file follows Table 1:

Sentiment | SentimentText
--------- | ----------------------------------------
1         | The Da Vinci Code book is just awesome.
1         | I liked the Da Vinci Code a lot.
0         | OMG, I HATE BROKEBACK MOUNTAIN.
0         | I hate Harry Potter.

Table 1: A sample of the sentiment dataset
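
If you want to verify that your copy of data.csv matches this layout before training, a quick check with pandas (assuming the file sits in the data/ directory configured below) could look like this:

import pandas as pd

# Peek at the dataset: two columns, Sentiment (0/1) and SentimentText
data = pd.read_csv('data/data.csv')
print(data.columns.tolist())             # expected: ['Sentiment', 'SentimentText']
print(data['Sentiment'].value_counts())  # class balance of the two labels
print(data.head(3))                      # first few samples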

Now, let's see a step-by-step example of training the LSTM network for this task. First, we import the necessary modules and packages (execute the train.py file):

from data_preparation import Preprocessing
from lstm_network import LSTM_RNN_Network
import tensorflow as tf
import pickle
import datetime
import time
import os
import matplotlib.pyplot as plt

In the preceding import declarations, data_preparation and lstm_network are two helper Python scripts that are used for dataset preparation and network design. We will see more details shortly. Now let's define the parameters for the LSTM:

data_dir = 'data/' # Data directory containing 'data.csv'
stopwords_file = 'data/stopwords.txt' # Path to stopwords file
n_samples = None # Set n_samples=None to use the whole dataset

# Directory where TensorFlow summaries will be stored
summaries_dir = 'logs/'
batch_size = 100 # Batch size
train_steps = 1000 # Number of training steps
hidden_size = 75 # Hidden size of LSTM layer
embedding_size = 75 # Size of embeddings layer
learning_rate = 0.01
test_size = 0.2
dropout_keep_prob = 0.5 # Dropout keep-probability
sequence_len = None # Maximum sequence length
validate_every = 100 # Step frequency to validate

I believe the preceding parameters are self-explanatory. The next task is to prepare the summaries to be used by TensorBoard:

summaries_dir = '{0}/{1}'.format(summaries_dir, datetime.datetime.now().strftime('%d_%b_%Y-%H_%M_%S'))
train_writer = tf.summary.FileWriter(summaries_dir + '/train')
validation_writer = tf.summary.FileWriter(summaries_dir + '/validation')

Now let's prepare the model directory:

checkpoints_root = 'checkpoints' # Root directory for model checkpoints
model_name = str(int(time.time()))
model_dir = '{0}/{1}'.format(checkpoints_root, model_name)
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

Next, let's prepare the data and build a TensorFlow graph (see the data_preparation.py file):

data_lstm = Preprocessing(data_dir=data_dir,
                 stopwords_file=stopwords_file,
                 sequence_len=sequence_len,
                 test_size=test_size,
                 val_samples=batch_size,
                 n_samples=n_samples,
                 random_state=100)

In the preceding code segment, Preprocessing is a class containing several functions and a constructor (see data_preparation.py for details) that help us pre-process the training and testing sets in order to train the LSTM network. Here, I have provided the code for each function and described its functionality.

The constructor of this class initializes the data pre-processor. This class provides an interface to load, pre-process, and split the data into training, validation, and testing sets. It takes the following parameters:

  • data_dir: A data directory containing the dataset file, data.csv, with columns called SentimentText and Sentiment.
  • stopwords_file: Optional. If provided, it discards each stop word from the original data.
  • sequence_len: Optional. If m is the maximum sequence length in the dataset, it's required that sequence_len >= m. If sequence_len is None, then it'll be automatically assigned to m.
  • n_samples: Optional. It's the number of samples to load from the dataset (which is useful for large datasets). If n_samples is None, then the whole dataset will be loaded (be careful; if the dataset is large it may take a while to pre-process every sample).
  • test_size: Optional, with 0 < test_size < 1. It represents the proportion of the dataset to include in the testing set (the default is 0.2).
  • val_samples: Optional. It represents the absolute number of validation samples (the default is 100).
  • random_state: This is an optional parameter for the random seed used for splitting data into training, testing, and validation sets (the default is 0).
  • ensure_preprocessed: Optional. If ensure_preprocessed=True, it ensures that the dataset is already pre-processed (the default is False).

The code for the constructor is as follows:

def __init__(self, data_dir, stopwords_file=None, sequence_len=None, n_samples=None, test_size=0.2, val_samples=100, random_state=0, ensure_preprocessed=False):
    # Note: data_preparation.py also imports numpy as np, pandas as pd,
    # and train_test_split from sklearn.model_selection
    self._stopwords_file = stopwords_file
    self._n_samples = n_samples
    self.sequence_len = sequence_len
    self._input_file = os.path.join(data_dir, 'data.csv')
    self._preprocessed_file = os.path.join(data_dir, 'preprocessed_' + str(n_samples) + '.npz')
    self._vocab_file = os.path.join(data_dir, 'vocab_' + str(n_samples) + '.pkl')
    self._tensors = None
    self._sentiments = None
    self._lengths = None
    self._vocab = None
    self.vocab_size = None

    # Prepare data
    if os.path.exists(self._preprocessed_file) and os.path.exists(self._vocab_file):
        print('Loading preprocessed files ...')
        self.__load_preprocessed()
    else:
        if ensure_preprocessed:
            raise ValueError('Unable to find preprocessed files.')
        print('Reading data ...')
        self.__preprocess()

    # Split data into train, validation and test sets
    indices = np.arange(len(self._sentiments))
    x_tv, self._x_test, y_tv, self._y_test, tv_indices, test_indices = train_test_split(
        self._tensors,
        self._sentiments,
        indices,
        test_size=test_size,
        random_state=random_state,
        stratify=self._sentiments[:, 0])
    self._x_train, self._x_val, self._y_train, self._y_val, train_indices, val_indices = train_test_split(
        x_tv, y_tv, tv_indices,
        test_size=val_samples,
        random_state=random_state,
        stratify=y_tv[:, 0])
    self._val_indices = val_indices
    self._test_indices = test_indices
    self._train_lengths = self._lengths[train_indices]
    self._val_lengths = self._lengths[val_indices]
    self._test_lengths = self._lengths[test_indices]
    self._current_index = 0
    self._epoch_completed = 0

Now let's look at the methods used above. We start with the __preprocess() method, which loads data from data_dir/data.csv, pre-processes each sample loaded, and stores intermediate files to avoid pre-processing them again later. The workflow is as follows:

  1. Load the data
  2. Clean the sample text
  3. Prepare the vocabulary dictionary
  4. Remove the most uncommon words (they are probably typos or misspellings), encode the samples into tensors, and pad each tensor with zeros according to sequence_len
  5. Save intermediate files
  6. Store sample lengths for future use

Now let's take a look at the following code block, which represents the preceding workflow:

def __preprocess(self):
    # Load data and build one-hot sentiment targets
    data = pd.read_csv(self._input_file, nrows=self._n_samples)
    self._sentiments = np.squeeze(data.as_matrix(columns=['Sentiment']))
    self._sentiments = np.eye(2)[self._sentiments]
    samples = data.as_matrix(columns=['SentimentText'])[:, 0]
    samples = self.__clean_samples(samples)

    # Prepare the vocabulary dictionary: word -> (encoding, count)
    vocab = dict()
    vocab[''] = (0, len(samples))  # add empty word
    for sample in samples:
        sample_words = sample.split()
        for word in list(set(sample_words)):  # distinct words
            value = vocab.get(word)
            if value is None:
                vocab[word] = (-1, 1)
            else:
                encoding, count = value
                vocab[word] = (-1, count + 1)

    # Remove the most uncommon words and encode the samples into tensors
    sample_lengths = []
    tensors = []
    word_count = 1
    for sample in samples:
        sample_words = sample.split()
        encoded_sample = []
        for word in list(set(sample_words)):  # distinct words
            value = vocab.get(word)
            if value is not None:
                encoding, count = value
                if count / len(samples) > 0.0001:
                    if encoding == -1:
                        encoding = word_count
                        vocab[word] = (encoding, count)
                        word_count += 1
                    encoded_sample += [encoding]
                else:
                    del vocab[word]
        tensors += [encoded_sample]
        sample_lengths += [len(encoded_sample)]

    # Save intermediate files and store sample lengths for future use
    self.vocab_size = len(vocab)
    self._vocab = vocab
    self._lengths = np.array(sample_lengths)
    self.sequence_len, self._tensors = self.__apply_to_zeros(tensors, self.sequence_len)
    with open(self._vocab_file, 'wb') as f:
        pickle.dump(self._vocab, f)
    np.savez(self._preprocessed_file, tensors=self._tensors, lengths=self._lengths, sentiments=self._sentiments)

Next, the __load_preprocessed() method loads the intermediate files so that we can avoid pre-processing the data again:

def __load_preprocessed(self):
    with open(self._vocab_file, 'rb') as f:
        self._vocab = pickle.load(f)
    self.vocab_size = len(self._vocab)
    load_dict = np.load(self._preprocessed_file)
    self._lengths = load_dict['lengths']
    self._tensors = load_dict['tensors']
    self._sentiments = load_dict['sentiments']
    self.sequence_len = len(self._tensors[0])

Once we have the pre-processed dataset, the next task is to clean the samples. The workflow is as follows:

  1. Prepare regex patterns.
  2. Clean each sample.
  3. Restore HTML characters.
  4. Remove @users and URLs.
  5. Transform to lowercase.
  6. Remove punctuation symbols.
  7. Replace CC(C+) (a character occurring more than twice in a row) with C.
  8. Remove stop words.

Now let's implement the above steps programmatically. For this, we have the following function:

def __clean_samples(self, samples):
    print('Cleaning samples ...')
    ret = []
    reg_punct = '[' + re.escape(''.join(string.punctuation)) + ']'
    stopwords = None
    if self._stopwords_file is not None:
        stopwords = self.__read_stopwords()
        sw_pattern = re.compile(r'(' + '|'.join(stopwords) + r')')
    for sample in samples:
        # Restore HTML characters and remove @users and URLs
        text = html.unescape(sample)
        words = text.split()
        words = [word for word in words if not word.startswith('@') and not word.startswith('http://')]
        text = ' '.join(words)
        # Lowercase, strip punctuation, and collapse repeated characters (CCC+ -> C)
        text = text.lower()
        text = re.sub(reg_punct, ' ', text)
        text = re.sub(r'([a-z])\1{2,}', r'\1', text)
        # Remove stop words
        if stopwords is not None:
            text = sw_pattern.sub('', text)
        ret += [text]
    return ret

The __apply_to_zeros() method returns the padding length used and a NumPy array of padded tensors. First, it finds the maximum length, m, in the dataset; if sequence_len was provided, it ensures that sequence_len >= m. Then it pads each list with zeros up to that length:

def __apply_to_zeros(self, lst, sequence_len=None):
    inner_max_len = max(map(len, lst))
    if sequence_len is not None:
        if inner_max_len > sequence_len:
            raise Exception('Error: Provided sequence length is not sufficient')
        else:
            inner_max_len = sequence_len
    result = np.zeros([len(lst), inner_max_len], np.int32)
    for i, row in enumerate(lst):
        for j, val in enumerate(row):
            result[i][j] = val
    return inner_max_len, result
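
To see what the padding does, here is a standalone toy illustration of the same logic (this snippet is for illustration only and is not part of data_preparation.py):

import numpy as np

# Two encoded samples of different lengths are padded with zeros to the longest one
encoded = [[1, 2, 3], [4, 5]]
max_len = max(map(len, encoded))  # 3
padded = np.zeros([len(encoded), max_len], np.int32)
for i, row in enumerate(encoded):
    padded[i, :len(row)] = row
print(max_len, padded)  # 3 [[1 2 3] [4 5 0]]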

The next task is to remove all the stop words (which are provided in the data/stopwords.txt file). This method returns the stop words list:

def __read_stopwords(self):
    if self._stopwords_file is None:
        return None
    with open(self._stopwords_file, mode='r') as f:
        stopwords = f.read().splitlines()
    return stopwords

The next_batch() method takes batch_size > 0 as the number of samples to include and returns the next batch of training samples (text_tensor, text_target, text_length); once an epoch is completed, it randomly reshuffles the training samples:

def next_batch(self, batch_size):
    start = self._current_index
    self._current_index += batch_size
    if self._current_index > len(self._y_train):
        self._epoch_completed += 1
        ind = np.arange(len(self._y_train))
        np.random.shuffle(ind)
        self._x_train = self._x_train[ind]
        self._y_train = self._y_train[ind]
        self._train_lengths = self._train_lengths[ind]
        start = 0
        self._current_index = batch_size
    end = self._current_index
    return self._x_train[start:end], self._y_train[start:end], self._train_lengths[start:end]
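
As a quick sanity check (a sketch, assuming the data_lstm object we create in train.py and batch_size = 100), the returned arrays have the following shapes:

x_batch, y_batch, len_batch = data_lstm.next_batch(batch_size)
print(x_batch.shape)    # (100, sequence_len): zero-padded word encodings
print(y_batch.shape)    # (100, 2): one-hot sentiment targets
print(len_batch.shape)  # (100,): real (pre-padding) sample lengths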

The next method, called get_val_data(), is used to get the validation set to be used during training. If original_text=True, it returns the raw validation texts along with the pre-processed data (original_samples, text_tensor, text_target, text_length); otherwise, it returns text_tensor, text_target, text_length:

def get_val_data(self, original_text=False):
    if original_text:
        data = pd.read_csv(self._input_file, nrows=self._n_samples)
        samples = data.as_matrix(columns=['SentimentText'])[:, 0]
        return samples[self._val_indices], self._x_val, self._y_val, self._val_lengths
    return self._x_val, self._y_val, self._val_lengths

Finally, we have an additional method called get_test_data(), which is used to prepare the testing set that will be used during the model evaluation period:

def get_test_data(self, original_text=False):
    if original_text:
        data = pd.read_csv(self._input_file, nrows=self._n_samples)
        samples = data.as_matrix(columns=['SentimentText'])[:, 0]
        return samples[self._test_indices], self._x_test, self._y_test, self._test_lengths
    return self._x_test, self._y_test, self._test_lengths

Now we prepare the data so that we can feed it to the LSTM network:

lstm_model = LSTM_RNN_Network(hidden_size=[hidden_size],
                              vocab_size=data_lstm.vocab_size,
                              embedding_size=embedding_size,
                              max_length=data_lstm.sequence_len,
                              learning_rate=learning_rate)

In the preceding code segment, LSTM_RNN_Network is a class containing several functions and a constructor that help us create the LSTM network. The upcoming constructor builds a TensorFlow LSTM model. It takes the following parameters:

  • hidden_size: An array holding the number of units in the LSTM cells of each RNN layer
  • vocab_size: The vocabulary size in the sample
  • embedding_size: Words will be encoded using a vector of this size
  • max_length: The maximum length of an input tensor
  • n_classes: The number of classification classes
  • learning_rate: The learning rate of the RMSProp algorithm
  • random_state: The random state for dropout

The code for the constructor is as follows:

def __init__(self, hidden_size, vocab_size, embedding_size, max_length, n_classes=2, learning_rate=0.01, random_state=None):
    # Build TensorFlow graph
    self.input = self.__input(max_length)
    self.seq_len = self.__seq_len()
    self.target = self.__target(n_classes)
    self.dropout_keep_prob = self.__dropout_keep_prob()
    self.word_embeddings = self.__word_embeddings(self.input, vocab_size, embedding_size, random_state)
    self.scores = self.__scores(self.word_embeddings, self.seq_len, hidden_size, n_classes, self.dropout_keep_prob, random_state)
    self.predict = self.__predict(self.scores)
    self.losses = self.__losses(self.scores, self.target)
    self.loss = self.__loss(self.losses)
    self.train_step = self.__train_step(learning_rate, self.loss)
    self.accuracy = self.__accuracy(self.predict, self.target)
    self.merged = tf.summary.merge_all()

The next function is __input(). It takes a parameter, max_length, which is the maximum length of an input tensor, and returns an input placeholder with the shape [batch_size, max_length] for the TensorFlow computation:

def __input(self, max_length):
    return tf.placeholder(tf.int32, [None, max_length], name='input')

Next, the __seq_len() function returns a sequence length placeholder with the shape [batch_size]. It holds each tensor's real length in a given batch, allowing a dynamic sequence length:

def __seq_len(self):
    return tf.placeholder(tf.int32, [None], name='lengths')

The next function is __target(). It takes a parameter, n_classes, which contains the number of classification classes. Finally, it returns the target placeholder with the shape [batch_size, n_classes]:

def __target(self, n_classes):
    return tf.placeholder(tf.float32, [None, n_classes], name='target')

__dropout_keep_prob() returns a placeholder holding the dropout keep probability, which is used to reduce overfitting:

def __dropout_keep_prob(self):
    return tf.placeholder(tf.float32, name='dropout_keep_prob')

The __cell() method is used to build an LSTM cell with a dropout wrapper. It takes the following parameters:

  • hidden_size: The number of units in the LSTM cell
  • dropout_keep_prob: The tensor holding the dropout keep probability
  • seed: An optional value that makes the random state of the dropout wrapper reproducible

Finally, it returns an LSTM cell with a dropout wrapper:

def __cell(self, hidden_size, dropout_keep_prob, seed=None):
    lstm_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
    dropout_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, input_keep_prob=dropout_keep_prob, output_keep_prob=dropout_keep_prob, seed=seed)
    return dropout_cell

Once we have created the LSTM cells, we can create the embeddings of the input tokens. For this, __word_embeddings() does the trick. It builds an embedding layer with the shape [vocab_size, embedding_size]. Its input parameters are x, the input with the shape [batch_size, max_length]; vocab_size, the vocabulary size, that is, the number of possible words that may appear in a sample; embedding_size, the size of the vector used to represent each word; and seed, which is optional but fixes the random state for the embedding initialization.

Finally, it returns the embedding lookup tensor with the shape [batch_size, max_length, embedding_size]:

def __word_embeddings(self, x, vocab_size, embedding_size, seed=None):
    with tf.name_scope('word_embeddings'):
        embeddings = tf.get_variable("embeddings",shape=[vocab_size, embedding_size], dtype=tf.float32, initializer=None, regularizer=None, trainable=True, collections=None)
        embedded_words = tf.nn.embedding_lookup(embeddings, x)
    return embedded_words

The __rnn_layer() method creates the LSTM layer. It takes several input parameters, which are described here:

  • hidden_size: This is the number of units in the LSTM cell
  • x: This is the input with the shape [batch_size, max_length, embedding_size]
  • seq_len: This is the sequence length tensor with the shape [batch_size]
  • dropout_keep_prob: This is the tensor holding the dropout keep probability
  • variable_scope: This is the name of the variable scope (the default is rnn_layer)
  • random_state: This is the random state for the dropout wrapper

Finally, it returns the outputs with the shape [batch_size, max_seq_len, hidden_size]:

def __rnn_layer(self, hidden_size, x, seq_len, dropout_keep_prob, variable_scope=None, random_state=None):
    with tf.variable_scope(variable_scope, default_name='rnn_layer'):
        lstm_cell = self.__cell(hidden_size, dropout_keep_prob, random_state)
        outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32, sequence_length=seq_len)
    return outputs

The __scores() method is used to compute the network output. It takes several input parameters, as follows:

  • embedded_words: This is the embedding lookup tensor with the shape [batch_size, max_length, embedding_size]
  • seq_len: This is the sequence length tensor with the shape [batch_size]
  • hidden_size: This is an array holding the number of units in the LSTM cell in each RNN layer
  • n_classes: This is the number of classification classes
  • dropout_keep_prob: This is the tensor holding the dropout keep probability
  • random_state: This is an optional parameter, but it can be used to ensure the random state for the dropout wrapper

Finally, the __scores() method returns the linear activation of each class with the shape [batch_size, n_classes]:

def __scores(self, embedded_words, seq_len, hidden_size, n_classes, dropout_keep_prob, random_state=None):
    outputs = embedded_words
    for h in hidden_size:
        outputs = self.__rnn_layer(h, outputs, seq_len, dropout_keep_prob)
    # Average the RNN outputs across the time steps
    outputs = tf.reduce_mean(outputs, axis=[1])
    with tf.name_scope('final_layer/weights'):
        w = tf.get_variable("w", shape=[hidden_size[-1], n_classes], dtype=tf.float32, initializer=None, regularizer=None, trainable=True, collections=None)
        self.variable_summaries(w, 'final_layer/weights')
    with tf.name_scope('final_layer/biases'):
        b = tf.get_variable("b", shape=[n_classes], dtype=tf.float32, initializer=None, regularizer=None, trainable=True, collections=None)
        self.variable_summaries(b, 'final_layer/biases')
    with tf.name_scope('final_layer/wx_plus_b'):
        scores = tf.nn.xw_plus_b(outputs, w, b, name='scores')
        tf.summary.histogram('final_layer/wx_plus_b', scores)
    return scores

The __predict() method takes scores, the linear activation of each class with the shape [batch_size, n_classes], and returns the softmax activations (which normalize the scores to the [0, 1] range) with the shape [batch_size, n_classes]:

def __predict(self, scores):
    with tf.name_scope('final_layer/softmax'):
        softmax = tf.nn.softmax(scores, name='predictions')
        tf.summary.histogram('final_layer/softmax', softmax)
    return softmax

The __losses() method returns the cross-entropy losses (since softmax is used as the activation function) with the shape [batch_size]. It takes two parameters: scores, the linear activation of each class with the shape [batch_size, n_classes], and target, the target tensor with the shape [batch_size, n_classes]:

def __losses(self, scores, target):
    with tf.name_scope('cross_entropy'):
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=target, name='cross_entropy')
    return cross_entropy

The __loss() function computes and returns the mean cross-entropy loss. It takes only one parameter, losses, which is the tensor of cross-entropy losses with the shape [batch_size] computed by the previous function:

def __loss(self, losses):
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(losses, name='loss')
        tf.summary.scalar('loss', loss)
    return loss

Now, __train_step() computes and returns the RMSProp training step operation. It takes two parameters: learning_rate, the learning rate for the RMSProp optimizer, and loss, the mean cross-entropy loss computed by the previous function (for reference, the RMSProp update rule is sketched after the code):

def __train_step(self, learning_rate, loss):
    return tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
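
For reference, RMSProp maintains a running average of the squared gradients and scales each parameter update by it. In the textbook formulation (the exact placement of the small constant \epsilon varies between implementations):

v_t = \rho v_{t-1} + (1 - \rho) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} g_t

Here, \eta is the learning_rate we pass in, g_t is the gradient of the loss, and \rho is the decay rate.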

When it is time for performance evaluation, the __accuracy() function computes the accuracy of the classification. It takes two parameters: predict, the softmax activations with the shape [batch_size, n_classes], and target, the target tensor with the shape [batch_size, n_classes]. It returns the mean accuracy obtained in the current batch:

def __accuracy(self, predict, target):
    with tf.name_scope('accuracy'):
        correct_pred = tf.equal(tf.argmax(predict, 1), tf.argmax(target, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
        tf.summary.scalar('accuracy', accuracy)
    return accuracy

The next function is called initialize_all_variables() and, as you may be able to guess, it initializes all variables:

def initialize_all_variables(self):
    return tf.global_variables_initializer()

Finally, we have a static method called variable_summaries(), which attaches several summaries to a tensor for TensorBoard visualization. It takes the following parameters:

  • var: The variable to summarize
  • name: The name used for the summaries

The signature is given below:

@staticmethod
def variable_summaries(var, name):
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev/' + name, stddev)
        tf.summary.scalar('max/' + name, tf.reduce_max(var))
        tf.summary.scalar('min/' + name, tf.reduce_min(var))
        tf.summary.histogram(name, var)

Now we need to create a TensorFlow session before we can train the model:

sess = tf.Session()

Let's initialize all the variables:

init_op = tf.global_variables_initializer()
sess.run(init_op)

Then we create a saver so that we can store the trained TensorFlow model for future use:

saver = tf.train.Saver()

Now let's prepare the validation set:

x_val, y_val, val_seq_len = data_lstm.get_val_data()

Now we write the TensorFlow computation graph to the summary log so that TensorBoard can render it:

train_writer.add_graph(lstm_model.input.graph)

Additionally, we can create some empty lists to hold the training loss, validation loss, and the steps so that we can see them graphically:

train_loss_list = []
val_loss_list = []
step_list = []
sub_step_list = []
step = 0

Now we start the training. In each step, we record the training error. Every validate_every steps, we also record the validation error:

for i in range(train_steps):
    x_train, y_train, train_seq_len = data_lstm.next_batch(batch_size)
    train_loss, _, summary = sess.run([lstm_model.loss, lstm_model.train_step, lstm_model.merged],
                                      feed_dict={lstm_model.input: x_train,
                                                 lstm_model.target: y_train,
                                                 lstm_model.seq_len: train_seq_len,
                                                 lstm_model.dropout_keep_prob:dropout_keep_prob})
    train_writer.add_summary(summary, i)  # Write train summary for step i (TensorBoard visualization)
    train_loss_list.append(train_loss)
    step_list.append(i)
    print('{0}/{1} train loss: {2:.4f}'.format(i + 1, train_steps, train_loss))
    if (i + 1) % validate_every == 0:
        val_loss, accuracy, summary = sess.run([lstm_model.loss, lstm_model.accuracy, lstm_model.merged],
                                               feed_dict={lstm_model.input: x_val,
                                                          lstm_model.target: y_val,
                                                          lstm_model.seq_len: val_seq_len,
                                                          lstm_model.dropout_keep_prob: 1})
        validation_writer.add_summary(summary, i)  
        print('   validation loss: {0:.4f} (accuracy {1:.4f})'.format(val_loss, accuracy))
        step = step + 1
        val_loss_list.append(val_loss)
        sub_step_list.append(step)

The following is the output of the preceding code:

>>>

1/1000 train loss: 0.6883
2/1000 train loss: 0.6879
3/1000 train loss: 0.6943

99/1000 train loss: 0.4870
100/1000 train loss: 0.5307
validation loss: 0.4018 (accuracy 0.9200)

199/1000 train loss: 0.1103
200/1000 train loss: 0.1032
validation loss: 0.0607 (accuracy 0.9800)

299/1000 train loss: 0.0292
300/1000 train loss: 0.0266
validation loss: 0.0417 (accuracy 0.9800)

998/1000 train loss: 0.0021
999/1000 train loss: 0.0007
1000/1000 train loss: 0.0004
validation loss: 0.0939 (accuracy 0.9700)

The preceding code prints the training and validation errors. When the training is over, the model is saved to a checkpoint directory with a unique ID:

checkpoint_file = '{}/model.ckpt'.format(model_dir)
save_path = saver.save(sess, checkpoint_file)
print('Model saved in: {0}'.format(model_dir))

The following is the output of the preceding code:

>>>
Model saved in: checkpoints/1517781236

The checkpoint directory will contain at least three files (a quick way to verify the checkpoint follows this list):

  • config.pkl contains parameters used to train the model.
  • model.ckpt contains the weights of the model.
  • model.ckpt.meta contains the TensorFlow graph definition.
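
If you want to double-check that the checkpoint is readable before moving on to evaluation, a small sketch using TensorFlow's checkpoint utilities (model_dir is the directory created earlier) is:

# Query the checkpoint state file written by saver.save()
latest = tf.train.latest_checkpoint(model_dir)
print(latest)  # e.g. checkpoints/1517781236/model.ckpt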

Let's see how the training went, that is, what were the training and the validation losses like:

# Plot loss over time
plt.plot(step_list, train_loss_list, 'r--', label='LSTM training loss per iteration', linewidth=4)
plt.title('LSTM training loss per iteration')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.legend(loc='upper right')
plt.show()

# Plot validation loss over time
plt.plot(sub_step_list, val_loss_list, 'r--', label='LSTM validation loss per validating interval', linewidth=4)
plt.title('LSTM validation loss per validation interval')
plt.xlabel('Validation interval')
plt.ylabel('Validation loss')
plt.legend(loc='upper left')
plt.show()

The preceding code produces the following plots:

Figure 21: a) LSTM training loss per iteration, b) LSTM validation loss per validation interval

If we examine the preceding graphs, it is clear that the training went pretty well in both the training phase and the validation phase with only 1,000 steps. However, readers should increase the number of training steps, tune the hyperparameters, and see how it goes.

Visualizing through TensorBoard

Now let's observe the TensorFlow computational graph on TensorBoard. Simply execute the following command, pointing --logdir at the summaries directory we configured earlier, and access TensorBoard at localhost:6006:

tensorboard --logdir logs/

The Graphs tab shows the execution graph, including the gradients used, loss_op, the accuracy, the final layer, the optimizer used (in our case, RMSProp), the LSTM layer (that is, the RNN layer), the embedding layer, and save_op:


Figure 22: The execution graph on TensorBoard

The execution graph shows that the computations we have done for this LSTM-based classifier for sentiment analysis are quite transparent. We can also observe the validation and training losses, the accuracies, and the operations in the layers:


Figure 23: Validation, training losses, accuracies and the operations in the layers on TensorBoard

LSTM model evaluation

We have trained and saved our LSTM model. We can easily restore the trained model and do some evaluation. We need to prepare the testing set and use the previously trained TensorFlow model to make predictions on it. Let's do this straight away. First, we load the required modules:

import tensorflow as tf
from data_preparation import Preprocessing
import pickle

Then we point to the checkpoint directory where the model was saved. In our case, it was checkpoints/1517781236.

Note

For this step, execute the predict.py script with the following command:

$ python3 predict.py --checkpoints_dir checkpoints/1517781236
# Change this path based on output by 'python3 train.py' 
checkpoints_dir = 'checkpoints/1517781236' 

if checkpoints_dir is None:
    raise ValueError('Please provide a valid checkpoints directory (--checkpoints_dir <dir name>)')

Now load the testing dataset and prepare it to evaluate the model:

data_lstm = Preprocessing(data_dir=data_dir,
                 stopwords_file=stopwords_file,
                 sequence_len=sequence_len,
                 n_samples=n_samples,
                 test_size=test_size,
                 val_samples=batch_size,
                 random_state=random_state,
                 ensure_preprocessed=True)

In the preceding code, we use the following parameters, exactly as we did in the training step:

data_dir = 'data/' # Data directory containing 'data.csv'
stopwords_file = 'data/stopwords.txt' # Path to stopwords file.
sequence_len = None # Maximum sequence length
n_samples= None # Set n_samples=None to use the whole dataset
test_size = 0.2
batch_size = 100 #Batch size
random_state = 0 # Random state used for data splitting. Default is 0

The workflow for this evaluation method is as follows:

  1. First, load the testing data prepared earlier
  2. Create the TensorFlow session for the computation
  3. Import the graph and restore its weights
  4. Recover the input/output tensors
  5. Perform the prediction
  6. Finally, print the accuracy and the results on sample texts from the testing set

This code performs steps 1 to 5:

original_text, x_test, y_test, test_seq_len = data_lstm.get_test_data(original_text=True)
graph = tf.Graph()
with graph.as_default():
    sess = tf.Session()    
    print('Restoring graph ...')
    saver = tf.train.import_meta_graph("{}/model.ckpt.meta".format(checkpoints_dir))
    saver.restore(sess, ("{}/model.ckpt".format(checkpoints_dir)))
    input = graph.get_operation_by_name('input').outputs[0]
    target = graph.get_operation_by_name('target').outputs[0]
    seq_len = graph.get_operation_by_name('lengths').outputs[0]
    dropout_keep_prob = graph.get_operation_by_name('dropout_keep_prob').outputs[0]
    predict = graph.get_operation_by_name('final_layer/softmax/predictions').outputs[0]
    accuracy = graph.get_operation_by_name('accuracy/accuracy').outputs[0]
    pred, acc = sess.run([predict, accuracy],
                         feed_dict={input: x_test,
                                    target: y_test,
                                    seq_len: test_seq_len,
                                    dropout_keep_prob: 1})
    print("Evaluation done.")

The following is the output of the preceding code:

>>>
Restoring graph ...
Evaluation done.

Well done! The evaluation is finished, so let's print the results:

print('\nAccuracy: {0:.4f}\n'.format(acc))
for i in range(100):
    print('Sample: {0}'.format(original_text[i]))
    print('Predicted sentiment: [{0:.4f}, {1:.4f}]'.format(pred[i, 0], pred[i, 1]))
    print('Real sentiment: {0}\n'.format(y_test[i]))

The following is the output of the preceding code:

>>>
Accuracy: 0.9858

Sample: I loved the Da Vinci code, but it raises many theological questions most of which are very absurd...
Predicted sentiment: [0.0000, 1.0000]
Real sentiment: [0. 1.]

Sample: I'm sorry I hate to read Harry Potter, but I love the movies!
Predicted sentiment: [1.0000, 0.0000]
Real sentiment: [1. 0.]

Sample: I LOVE Brokeback Mountain...
Predicted sentiment: [0.0002, 0.9998]
Real sentiment: [0. 1.]

Sample: We also went to see Brokeback Mountain which totally SUCKED!!!
Predicted sentiment: [1.0000, 0.0000]
Real sentiment: [1. 0.]

The accuracy is above 98%. This is brilliant! However, you could try training for more iterations with tuned hyperparameters, and you might get even higher accuracy. I leave this up to the readers.

In the next section, we will see how to develop a more advanced ML project using LSTM: human activity recognition, using a smartphones dataset. In short, our ML model will be able to classify human movement into six categories: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
