Sentiment analysis is one of the most widely used tasks in NLP. An LSTM network can be used to classify short texts into predefined categories, which is a classification problem. For example, a set of tweets can be categorized as either positive or negative. In this section, we will see such an example.
The implemented LSTM network will have three layers: an embedding layer, an RNN layer, and a softmax layer. A high-level view of this can be seen in the following diagram. Here, I summarize the functionalities of all of the layers:
We will see later how cross-entropy can be used as the loss function, and RMSProp
is the optimizer that minimizes it.
The UMICH SI650 sentiment classification dataset (with duplicates removed) contains product and movie reviews donated by the University of Michigan. It can be downloaded from https://inclass.kaggle.com/c/si650winter11/data. Unwanted and special characters have been cleaned before extracting the tokens (see the data.csv file).
The following script also removes stop words (see data_preparation.py). Some samples are given that are labeled as either negative or positive (1 is positive and 0 is negative):
Sentiment | SentimentText |
---|---|
1 | The Da Vinci Code book is just awesome. |
1 | I liked the Da Vinci Code a lot. |
0 | OMG, I HATE BROKEBACK MOUNTAIN. |
0 | I hate Harry Potter. |
Table 1: A sample of the sentiment dataset
Now, let's see a step-by-step example of training the LSTM network for this task. First, we import the necessary modules and packages (execute the train.py file):
from data_preparation import Preprocessing
from lstm_network import LSTM_RNN_Network
import tensorflow as tf
import pickle
import datetime
import time
import os
import matplotlib.pyplot as plt
In the preceding import declarations, data_preparation and lstm_network are two helper Python scripts that are used for dataset preparation and network design. We will see more details shortly. Now let's define the parameters for the LSTM:
data_dir = 'data/'  # Data directory containing 'data.csv'
stopwords_file = 'data/stopwords.txt'  # Path to stopwords file
n_samples = None  # Set n_samples=None to use the whole dataset
# Directory where TensorFlow summaries will be stored
summaries_dir = 'logs/'
batch_size = 100  # Batch size
train_steps = 1000  # Number of training steps
hidden_size = 75  # Hidden size of LSTM layer
embedding_size = 75  # Size of embeddings layer
learning_rate = 0.01
test_size = 0.2
dropout_keep_prob = 0.5  # Dropout keep-probability
sequence_len = None  # Maximum sequence length
validate_every = 100  # Step frequency to validate
I believe the preceding parameters are self-explanatory. The next task is to prepare summaries to be used by the TensorBoard:
summaries_dir = '{0}/{1}'.format(summaries_dir, datetime.datetime.now().strftime('%d_%b_%Y-%H_%M_%S'))
train_writer = tf.summary.FileWriter(summaries_dir + '/train')
validation_writer = tf.summary.FileWriter(summaries_dir + '/validation')
Now let's prepare the model directory:
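checkpoints_root = 'checkpoints'  # Root directory for saved models (not defined in the parameters above)
model_name = str(int(time.time()))
model_dir = '{0}/{1}'.format(checkpoints_root, model_name)
if not os.path.exists(model_dir):
    os.makedirs(model_dir)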
Next, let's prepare the data and build a TensorFlow graph (see the data_preparation.py
file):
data_lstm = Preprocessing(data_dir=data_dir,
                          stopwords_file=stopwords_file,
                          sequence_len=sequence_len,
                          test_size=test_size,
                          val_samples=batch_size,
                          n_samples=n_samples,
                          random_state=100)
In the preceding code segment, Preprocessing is a class containing (see data_preparation.py for details) a constructor and several functions that help us pre-process the training and testing sets in order to train the LSTM network. Here, I have provided the code for each function and its functionality.
The constructor of this class initializes the data pre-processor. This class provides an interface to load, pre-process, and split the data into training, validation, and testing sets. It takes the following parameters:
- data_dir: A data directory containing the dataset file, data.csv, with columns called SentimentText and Sentiment.
- stopwords_file: Optional. If provided, each stop word is discarded from the original data.
- sequence_len: Optional. If m is the maximum sequence length in the dataset, it's required that sequence_len >= m. If sequence_len is None, then it'll be automatically assigned to m.
- n_samples: Optional. It's the number of samples to load from the dataset (which is useful for large datasets). If n_samples is None, then the whole dataset will be loaded (be careful; if the dataset is large it may take a while to pre-process every sample).
- test_size: Optional. 0<test_size<1. It represents the proportion of the dataset to include in the testing set (the default is 0.2).
- val_samples: Optional. It represents the absolute number of validation samples (the default is 100).
- random_state: Optional. This is the random seed used for splitting data into training, testing, and validation sets (the default is 0).
- ensure_preprocessed: Optional. If ensure_preprocessed=True, it ensures that the dataset is already pre-processed (the default is False).

The code for the constructor is as follows:
def __init__(self, data_dir, stopwords_file=None, sequence_len=None, n_samples=None,
             test_size=0.2, val_samples=100, random_state=0, ensure_preprocessed=False):
    self._stopwords_file = stopwords_file
    self._n_samples = n_samples
    self.sequence_len = sequence_len
    self._input_file = os.path.join(data_dir, 'data.csv')
    self._preprocessed_file = os.path.join(data_dir, "preprocessed_" + str(n_samples) + ".npz")
    self._vocab_file = os.path.join(data_dir, "vocab_" + str(n_samples) + ".pkl")
    self._tensors = None
    self._sentiments = None
    self._lengths = None
    self._vocab = None
    self.vocab_size = None

    # Prepare data
    if os.path.exists(self._preprocessed_file) and os.path.exists(self._vocab_file):
        print('Loading preprocessed files ...')
        self.__load_preprocessed()
    else:
        if ensure_preprocessed:
            raise ValueError('Unable to find preprocessed files.')
        print('Reading data ...')
        self.__preprocess()

    # Split data in train, validation and test sets
    indices = np.arange(len(self._sentiments))
    x_tv, self._x_test, y_tv, self._y_test, tv_indices, test_indices = train_test_split(
        self._tensors, self._sentiments, indices,
        test_size=test_size, random_state=random_state,
        stratify=self._sentiments[:, 0])
    self._x_train, self._x_val, self._y_train, self._y_val, train_indices, val_indices = \
        train_test_split(x_tv, y_tv, tv_indices,
                         test_size=val_samples, random_state=random_state,
                         stratify=y_tv[:, 0])
    self._val_indices = val_indices
    self._test_indices = test_indices
    self._train_lengths = self._lengths[train_indices]
    self._val_lengths = self._lengths[val_indices]
    self._test_lengths = self._lengths[test_indices]
    self._current_index = 0
    self._epoch_completed = 0
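The two-stage split at the end of the constructor can be illustrated with a minimal, dependency-free sketch. It is a simplification under stated assumptions: split_indices is a hypothetical helper, it shuffles with Python's random module, and it does not stratify by label, whereas the constructor relies on train_test_split with the stratify argument.

```python
import random

def split_indices(n_samples, test_size=0.2, val_samples=100, seed=0):
    """Hypothetical helper: hold out a test fraction first, then a fixed
    number of validation rows from what remains (no stratification)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    test = indices[:n_test]
    val = indices[n_test:n_test + val_samples]
    train = indices[n_test + val_samples:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 700 100 200
```

Note that test_size is a proportion while val_samples is an absolute count, which is why the validation set keeps the same size regardless of how large the dataset is.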
Now let's look at the methods called by the preceding constructor. We start with the __preprocess() method, which loads data from data_dir/data.csv, pre-processes each sample loaded, and stores intermediate files so that the pre-processing doesn't have to be repeated later. The workflow is as follows:
sequence_len
Now let's take a look at the following code block, which represents the preceding workflow:
def __preprocess(self):
    # Load data
    data = pd.read_csv(self._input_file, nrows=self._n_samples)
    self._sentiments = np.squeeze(data.as_matrix(columns=['Sentiment']))
    self._sentiments = np.eye(2)[self._sentiments]  # one-hot encode the labels
    samples = data.as_matrix(columns=['SentimentText'])[:, 0]

    # Clean samples
    samples = self.__clean_samples(samples)

    # Count in how many samples each word appears
    vocab = dict()
    vocab[''] = (0, len(samples))  # add empty word
    for sample in samples:
        sample_words = sample.split()
        for word in list(set(sample_words)):  # distinct words
            value = vocab.get(word)
            if value is None:
                vocab[word] = (-1, 1)
            else:
                encoding, count = value
                vocab[word] = (-1, count + 1)

    # Encode the samples, dropping words that are too rare
    sample_lengths = []
    tensors = []
    word_count = 1
    for sample in samples:
        sample_words = sample.split()
        encoded_sample = []
        for word in list(set(sample_words)):  # distinct words
            value = vocab.get(word)
            if value is not None:
                encoding, count = value
                if count / len(samples) > 0.0001:
                    if encoding == -1:
                        encoding = word_count
                        vocab[word] = (encoding, count)
                        word_count += 1
                    encoded_sample += [encoding]
                else:
                    del vocab[word]
        tensors += [encoded_sample]
        sample_lengths += [len(encoded_sample)]

    self.vocab_size = len(vocab)
    self._vocab = vocab
    self._lengths = np.array(sample_lengths)

    # Pad each sample with zeros up to sequence_len
    self.sequence_len, self._tensors = self.__apply_to_zeros(tensors, self.sequence_len)

    # Save the intermediate files
    with open(self._vocab_file, 'wb') as f:
        pickle.dump(self._vocab, f)
    np.savez(self._preprocessed_file, tensors=self._tensors,
             lengths=self._lengths, sentiments=self._sentiments)
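The core idea of the listing above, assigning an integer code to every word that appears in a large enough fraction of the samples, can be sketched without pandas or NumPy. This is an illustrative simplification, not the book's implementation: build_vocab and encode are hypothetical helper names, and the sketch encodes words in their original order, which is an assumption of the example.

```python
def build_vocab(samples, min_doc_freq=0.0001):
    """Map each sufficiently frequent word to an integer code (0 is reserved for padding)."""
    # Count the number of samples in which each word occurs
    doc_count = {}
    for sample in samples:
        for word in set(sample.split()):
            doc_count[word] = doc_count.get(word, 0) + 1
    vocab = {'': 0}  # the empty word is used for padding
    for sample in samples:
        for word in sample.split():
            if word not in vocab and doc_count[word] / len(samples) > min_doc_freq:
                vocab[word] = len(vocab)
    return vocab

def encode(sample, vocab):
    """Replace each known word by its code and drop unknown words."""
    return [vocab[word] for word in sample.split() if word in vocab]

samples = ['i love harry potter', 'i hate harry potter']
vocab = build_vocab(samples)
print(encode('i love potter', vocab))  # [1, 2, 4]
```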
Next, the __load_preprocessed() method loads the intermediate files written by the preceding method, so that the data doesn't have to be pre-processed again:
def __load_preprocessed(self):
    with open(self._vocab_file, 'rb') as f:
        self._vocab = pickle.load(f)
    self.vocab_size = len(self._vocab)
    load_dict = np.load(self._preprocessed_file)
    self._lengths = load_dict['lengths']
    self._tensors = load_dict['tensors']
    self._sentiments = load_dict['sentiments']
    self.sequence_len = len(self._tensors[0])
Once we have the pre-processed dataset, the next task is to clean the samples. The workflow is as follows:
Now let's write the above steps programmatically. For this, we have the following function:
def __clean_samples(self, samples):
    print('Cleaning samples ...')
    ret = []
    reg_punct = '[' + re.escape(''.join(string.punctuation)) + ']'
    stopwords = None  # stays None when no stop words file is given
    if self._stopwords_file is not None:
        stopwords = self.__read_stopwords()
        sw_pattern = re.compile(r'\b(' + '|'.join(stopwords) + r')\b')
    for sample in samples:
        # Restore HTML characters
        text = html.unescape(sample)
        # Remove @users and URLs
        words = text.split()
        words = [word for word in words
                 if not word.startswith('@') and not word.startswith('http://')]
        text = ' '.join(words)
        # Transform to lowercase
        text = text.lower()
        # Remove punctuation symbols
        text = re.sub(reg_punct, ' ', text)
        # Collapse a letter repeated three or more times into a single one
        text = re.sub(r'([a-z])\1{2,}', r'\1', text)
        # Remove stop words
        if stopwords is not None:
            text = sw_pattern.sub('', text)
        ret += [text]
    return ret
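To see what these cleaning steps do to a single text, here is a small standalone sketch. It assumes a hypothetical clean_text helper and, unlike the listing above, removes stop words token by token and normalizes whitespace, so the exact spacing of its output differs from what the class method would produce.

```python
import html
import re
import string

PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')
REPEATS = re.compile(r'([a-z])\1{2,}')  # a letter repeated three or more times

def clean_text(text, stopwords=()):
    """Hypothetical single-sample version of the cleaning workflow."""
    text = html.unescape(text)                      # restore HTML characters
    words = [w for w in text.split()
             if not w.startswith('@') and not w.startswith('http://')]
    text = ' '.join(words).lower()                  # drop mentions/URLs, lowercase
    text = PUNCT.sub(' ', text)                     # punctuation becomes spaces
    text = REPEATS.sub(r'\1', text)                 # 'looooved' -> 'loved'
    return ' '.join(w for w in text.split() if w not in stopwords)

print(clean_text('@bob I looooved the Da Vinci Code &amp; more!!! http://x.y',
                 stopwords={'the'}))
# i loved da vinci code more
```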
The __apply_to_zeros() method returns the padding length used and a NumPy array of padded tensors. First, it finds the maximum length, m, and ensures that sequence_len >= m. Then it pads each row with zeros up to sequence_len:
def __apply_to_zeros(self, lst, sequence_len=None):
    # Find the maximum length m and ensure that sequence_len >= m
    inner_max_len = max(map(len, lst))
    if sequence_len is not None:
        if inner_max_len > sequence_len:
            raise Exception('Error: Provided sequence length is not sufficient')
        else:
            inner_max_len = sequence_len
    # Pad each row with zeros
    result = np.zeros([len(lst), inner_max_len], np.int32)
    for i, row in enumerate(lst):
        for j, val in enumerate(row):
            result[i][j] = val
    return inner_max_len, result
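The effect of the padding is easiest to see on a tiny input. The following sketch uses plain Python lists instead of a NumPy array, and pad_sequences is a hypothetical name chosen for the illustration.

```python
def pad_sequences(rows, sequence_len=None):
    """Right-pad every row with zeros up to sequence_len (or the longest row)."""
    longest = max(len(row) for row in rows)
    if sequence_len is None:
        sequence_len = longest
    elif longest > sequence_len:
        raise ValueError('sequence_len is smaller than the longest sequence')
    padded = [row + [0] * (sequence_len - len(row)) for row in rows]
    return sequence_len, padded

length, padded = pad_sequences([[4, 7], [3], [9, 1, 5]])
print(length, padded)  # 3 [[4, 7, 0], [3, 0, 0], [9, 1, 5]]
```

Because code 0 is reserved for the empty word in the vocabulary, the padding positions never collide with a real word.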
The next task is to read the stop words to be removed (which are provided in the data/stopwords.txt file). The __read_stopwords() method returns the stop words list:
def __read_stopwords(self):
    if self._stopwords_file is None:
        return None
    with open(self._stopwords_file, mode='r') as f:
        stopwords = f.read().splitlines()
    return stopwords
The next_batch() method takes batch_size>0 as the number of samples to include and returns a batch of that many samples (text_tensor, text_target, text_length). Once an epoch has been completed, it randomly shuffles the training samples:
def next_batch(self, batch_size):
    start = self._current_index
    self._current_index += batch_size
    if self._current_index > len(self._y_train):
        # Complete the epoch and randomly shuffle the training samples
        self._epoch_completed += 1
        ind = np.arange(len(self._y_train))
        np.random.shuffle(ind)
        self._x_train = self._x_train[ind]
        self._y_train = self._y_train[ind]
        self._train_lengths = self._train_lengths[ind]
        start = 0
        self._current_index = batch_size
    end = self._current_index
    return self._x_train[start:end], self._y_train[start:end], self._train_lengths[start:end]
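The batching logic can be tried in isolation with a minimal sketch. BatchIterator is a hypothetical class that keeps only the index bookkeeping of next_batch() and works on a plain list instead of the three NumPy arrays.

```python
import random

class BatchIterator:
    """Hypothetical stand-in that mirrors the index handling of next_batch()."""

    def __init__(self, samples, seed=0):
        self._samples = list(samples)
        self._rng = random.Random(seed)
        self._current_index = 0
        self.epochs_completed = 0

    def next_batch(self, batch_size):
        start = self._current_index
        self._current_index += batch_size
        if self._current_index > len(self._samples):
            # Not enough samples left: finish the epoch, reshuffle, start over
            self.epochs_completed += 1
            self._rng.shuffle(self._samples)
            start = 0
            self._current_index = batch_size
        return self._samples[start:self._current_index]

batches = BatchIterator(range(5))
print(batches.next_batch(2), batches.next_batch(2))  # [0, 1] [2, 3]
third = batches.next_batch(2)                        # triggers a reshuffle
print(len(third), batches.epochs_completed)          # 2 1
```

One consequence of this design is that a trailing partial batch is never returned: in the example, the fifth sample is skipped in the first epoch, so every batch has exactly batch_size samples.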
The next method, called get_val_data(), is then used to get the validation set to be used during the training period. It takes an optional original_text flag. By default (original_text=False), it returns text_tensor, text_target, and text_length. If original_text=True, it additionally returns the original samples, that is, original_samples, text_tensor, text_target, and text_length:
def get_val_data(self, original_text=False):
    if original_text:
        data = pd.read_csv(self._input_file, nrows=self._n_samples)
        samples = data.as_matrix(columns=['SentimentText'])[:, 0]
        return samples[self._val_indices], self._x_val, self._y_val, self._val_lengths
    return self._x_val, self._y_val, self._val_lengths
Finally, we have an additional method called get_test_data()
, which is used to prepare the testing set that will be used during the model evaluation period:
def get_test_data(self, original_text=False):
    if original_text:
        data = pd.read_csv(self._input_file, nrows=self._n_samples)
        samples = data.as_matrix(columns=['SentimentText'])[:, 0]
        return samples[self._test_indices], self._x_test, self._y_test, self._test_lengths
    return self._x_test, self._y_test, self._test_lengths
Now that the data is prepared, we build the LSTM network that it will be fed to:
lstm_model = LSTM_RNN_Network(hidden_size=[hidden_size],
                              vocab_size=data_lstm.vocab_size,
                              embedding_size=embedding_size,
                              max_length=data_lstm.sequence_len,
                              learning_rate=learning_rate)
In the preceding code segment, LSTM_RNN_Network is a class containing a constructor and several functions that help us create the LSTM network. The upcoming constructor builds a TensorFlow LSTM model. It takes the following parameters:
- hidden_size: An array holding the number of units in the LSTM cell of each RNN layer
- vocab_size: The vocabulary size in the sample
- embedding_size: Words will be encoded using a vector of this size
- max_length: The maximum length of an input tensor
- n_classes: The number of classification classes
- learning_rate: The learning rate of the RMSProp algorithm
- random_state: The random state for dropout

The code for the constructor is as follows:
def __init__(self, hidden_size, vocab_size, embedding_size, max_length,
             n_classes=2, learning_rate=0.01, random_state=None):
    # Build TensorFlow graph
    self.input = self.__input(max_length)
    self.seq_len = self.__seq_len()
    self.target = self.__target(n_classes)
    self.dropout_keep_prob = self.__dropout_keep_prob()
    self.word_embeddings = self.__word_embeddings(self.input, vocab_size, embedding_size, random_state)
    self.scores = self.__scores(self.word_embeddings, self.seq_len, hidden_size, n_classes,
                                self.dropout_keep_prob, random_state)
    self.predict = self.__predict(self.scores)
    self.losses = self.__losses(self.scores, self.target)
    self.loss = self.__loss(self.losses)
    self.train_step = self.__train_step(learning_rate, self.loss)
    self.accuracy = self.__accuracy(self.predict, self.target)
    self.merged = tf.summary.merge_all()
The next function is called __input(), and it takes a parameter called max_length, which is the maximum length of an input tensor. It then returns an input placeholder with the shape [batch_size, max_length] for the TensorFlow computation:
def __input(self, max_length):
    return tf.placeholder(tf.int32, [None, max_length], name='input')
Next, the __seq_len() function returns a sequence length placeholder with the shape [batch_size]. It holds each tensor's real length in a given batch, allowing a dynamic sequence length:
def __seq_len(self):
    return tf.placeholder(tf.int32, [None], name='lengths')
The next function is called __target(). It takes a parameter called n_classes, which contains the number of classification classes. It returns the target placeholder with the shape [batch_size, n_classes]:
def __target(self, n_classes):
    return tf.placeholder(tf.float32, [None, n_classes], name='target')
__dropout_keep_prob() returns a placeholder holding the dropout keep probability, which is used to reduce overfitting:
def __dropout_keep_prob(self):
    return tf.placeholder(tf.float32, name='dropout_keep_prob')
The __cell() method is used to build an LSTM cell with a dropout wrapper. It takes the following parameters:
- hidden_size: The number of units in the LSTM cell
- dropout_keep_prob: The tensor holding the dropout keep probability
- seed: An optional random state for the dropout wrapper, which makes the computation reproducible

Finally, it returns an LSTM cell with a dropout wrapper:
def __cell(self, hidden_size, dropout_keep_prob, seed=None):
    lstm_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, state_is_tuple=True)
    dropout_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell,
                                                 input_keep_prob=dropout_keep_prob,
                                                 output_keep_prob=dropout_keep_prob,
                                                 seed=seed)
    return dropout_cell
Once we have created the LSTM cells, we can create the embedding of the input tokens. For this, __word_embeddings() does the trick. It builds an embedding layer with the shape [vocab_size, embedding_size] and takes the following input parameters: x is the input with the shape [batch_size, max_length]; vocab_size is the vocabulary size, that is, the number of possible words that may appear in a sample; embedding_size is the size of the vector used to represent each word; and seed is an optional random state for the embedding initialization.
Finally, it returns the embedding lookup tensor with the shape [batch_size, max_length, embedding_size]:
def __word_embeddings(self, x, vocab_size, embedding_size, seed=None):
    with tf.name_scope('word_embeddings'):
        embeddings = tf.get_variable("embeddings", shape=[vocab_size, embedding_size],
                                     dtype=tf.float32, initializer=None, regularizer=None,
                                     trainable=True, collections=None)
        embedded_words = tf.nn.embedding_lookup(embeddings, x)
    return embedded_words
The __rnn_layer() method creates the LSTM layer. It takes several input parameters, which are described here:
- hidden_size: This is the number of units in the LSTM cell
- x: This is the input tensor
- seq_len: This is the sequence length tensor with the shape [batch_size]
- dropout_keep_prob: This is the tensor holding the dropout keep probability
- variable_scope: This is the name of the variable scope (the default is rnn_layer)
- random_state: This is the random state for the dropout wrapper

Finally, it returns the outputs with the shape [batch_size, max_seq_len, hidden_size]:
def __rnn_layer(self, hidden_size, x, seq_len, dropout_keep_prob,
                variable_scope=None, random_state=None):
    with tf.variable_scope(variable_scope, default_name='rnn_layer'):
        lstm_cell = self.__cell(hidden_size, dropout_keep_prob, random_state)
        outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32, sequence_length=seq_len)
    return outputs
The __scores() method is used to compute the network output. It takes several input parameters, as follows:
- embedded_words: This is the embedding lookup tensor with the shape [batch_size, max_length, embedding_size]
- seq_len: This is the sequence length tensor with the shape [batch_size]
- hidden_size: This is an array holding the number of units in the LSTM cell in each RNN layer
- n_classes: This is the number of classification classes
- dropout_keep_prob: This is the tensor holding the dropout keep probability
- random_state: This is an optional parameter, but it can be used to ensure the random state for the dropout wrapper

Finally, the __scores() method returns the linear activation of each class with the shape [batch_size, n_classes]:
def __scores(self, embedded_words, seq_len, hidden_size, n_classes,
             dropout_keep_prob, random_state=None):
    # Build the LSTM layers
    outputs = embedded_words
    for h in hidden_size:
        outputs = self.__rnn_layer(h, outputs, seq_len, dropout_keep_prob)
    # Average the outputs over the time axis
    outputs = tf.reduce_mean(outputs, axis=[1])
    # Build the final linear layer
    with tf.name_scope('final_layer/weights'):
        w = tf.get_variable("w", shape=[hidden_size[-1], n_classes], dtype=tf.float32,
                            initializer=None, regularizer=None, trainable=True, collections=None)
        self.variable_summaries(w, 'final_layer/weights')
    with tf.name_scope('final_layer/biases'):
        b = tf.get_variable("b", shape=[n_classes], dtype=tf.float32,
                            initializer=None, regularizer=None, trainable=True, collections=None)
        self.variable_summaries(b, 'final_layer/biases')
    with tf.name_scope('final_layer/wx_plus_b'):
        scores = tf.nn.xw_plus_b(outputs, w, b, name='scores')
        tf.summary.histogram('final_layer/wx_plus_b', scores)
    return scores
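The last two stages of this method, averaging the LSTM outputs over time and applying a linear layer, can be written out for a single sample in plain Python. This is only a sketch of the arithmetic: mean_over_time and linear are hypothetical helpers, and the weights are made-up numbers rather than learned values.

```python
def mean_over_time(outputs):
    """outputs: one sample as [max_length][hidden_size] -> [hidden_size]."""
    steps = len(outputs)
    return [sum(column) / steps for column in zip(*outputs)]

def linear(x, w, b):
    """Compute x.w + b, with w given as [hidden_size][n_classes]."""
    return [sum(xi * w[i][k] for i, xi in enumerate(x)) + b[k]
            for k in range(len(b))]

# Two real time steps followed by one padded step (all zeros)
outputs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
pooled = mean_over_time(outputs)
scores = linear(pooled, w=[[1.0, 0.0], [0.0, 1.0]], b=[0.5, -0.5])
print(pooled, scores)
```

The example also shows a property of averaging over the whole time axis: padded steps count toward the denominator, so the mean is taken over max_length steps rather than over the real length of each sample.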
The __predict() method takes scores, the linear activation of each class with the shape [batch_size, n_classes], and returns the softmax activations (which normalize the scores to the range [0, 1]) with the shape [batch_size, n_classes]:
def __predict(self, scores):
    with tf.name_scope('final_layer/softmax'):
        softmax = tf.nn.softmax(scores, name='predictions')
        tf.summary.histogram('final_layer/softmax', softmax)
    return softmax
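For reference, the softmax normalization applied to one row of scores can be sketched in a few lines of plain Python. The softmax function below is an illustration of the formula, not TensorFlow's implementation; subtracting the maximum score first is a common trick to avoid overflow.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)                      # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.0, 0.0]))   # [0.5, 0.5]
print(softmax([2.0, -1.0]))  # the first class gets most of the probability mass
```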
The __losses() method returns the cross-entropy losses (since softmax is used as the activation function) with the shape [batch_size]. It takes two parameters: scores, the linear activation of each class with the shape [batch_size, n_classes], and the target tensor with the shape [batch_size, n_classes]:
def __losses(self, scores, target):
    with tf.name_scope('cross_entropy'):
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=target,
                                                                   name='cross_entropy')
    return cross_entropy
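The per-sample quantity computed here is the cross-entropy between the one-hot target and the softmax of the scores. A minimal plain-Python sketch of that formula, with the hypothetical name softmax_cross_entropy, looks like this:

```python
import math

def softmax_cross_entropy(logits, labels):
    """Cross-entropy between one-hot labels and softmax(logits) for one sample."""
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # log softmax(z_k) = z_k - log_sum
    return -sum(y * (z - log_sum) for z, y in zip(logits, labels))

print(round(softmax_cross_entropy([0.0, 0.0], [0.0, 1.0]), 4))  # 0.6931
```

With two classes and uninformative scores, the loss equals ln 2 ≈ 0.6931, which is close to the training loss reported at the very first steps in the training output later in this section.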
The __loss() function computes and returns the mean cross-entropy loss. It takes only one parameter, called losses, which holds the cross-entropy losses with the shape [batch_size] computed by the previous function:
def __loss(self, losses):
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(losses, name='loss')
        tf.summary.scalar('loss', loss)
    return loss
Now, __train_step() computes and returns the RMSProp training step operation. It takes two parameters: learning_rate, which is the learning rate for the RMSProp optimizer, and loss, the mean cross-entropy loss computed by the previous function:
def __train_step(self, learning_rate, loss):
    return tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
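To make the optimizer less of a black box, the following sketch applies the textbook RMSProp update to a single scalar weight. It is an assumption-laden illustration: rmsprop_step is a hypothetical helper, momentum is omitted, and the decay and epsilon values are chosen for the example rather than taken from the training script.

```python
import math

def rmsprop_step(w, grad, ms, learning_rate=0.01, decay=0.9, epsilon=1e-10):
    """One RMSProp update: scale the step by a running RMS of past gradients."""
    ms = decay * ms + (1 - decay) * grad ** 2
    w = w - learning_rate * grad / math.sqrt(ms + epsilon)
    return w, ms

# Minimize f(w) = w^2, whose gradient is 2w
w, ms = 1.0, 0.0
for _ in range(500):
    w, ms = rmsprop_step(w, 2 * w, ms)
print(abs(w) < 0.1)  # True
```

Because the gradient is divided by its own running magnitude, the size of each step depends mostly on the learning rate and only weakly on the raw scale of the gradient.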
When it is time for performance evaluation, the __accuracy() function computes the accuracy of the classification. It takes two parameters: predict, which is the softmax activation with the shape [batch_size, n_classes], and the target tensor with the shape [batch_size, n_classes]. It returns the mean accuracy obtained in the current batch:
def __accuracy(self, predict, target):
    with tf.name_scope('accuracy'):
        correct_pred = tf.equal(tf.argmax(predict, 1), tf.argmax(target, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
        tf.summary.scalar('accuracy', accuracy)
    return accuracy
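The same computation, comparing the most probable class with the class marked in the one-hot target, can be sketched in plain Python. The accuracy function and the sample values below are made up for the illustration.

```python
def accuracy(predictions, targets):
    """Fraction of rows whose most probable class matches the one-hot target."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    correct = sum(argmax(p) == argmax(t) for p, t in zip(predictions, targets))
    return correct / len(targets)

preds = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]
targets = [[0, 1], [1, 0], [1, 0]]
print(accuracy(preds, targets))  # two of the three predictions are correct
```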
The next function is called initialize_all_variables() and, as you may be able to guess, it returns the operation that initializes all variables:
def initialize_all_variables(self):
    return tf.global_variables_initializer()
Finally, we have a static method called variable_summaries(), which attaches several summaries (mean, standard deviation, maximum, minimum, and a histogram) to a tensor for the TensorBoard visualization. It takes the following parameters:
- var: The variable to summarize
- name: The name of the summary

The code is given below:
@staticmethod
def variable_summaries(var, name):
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev/' + name, stddev)
        tf.summary.scalar('max/' + name, tf.reduce_max(var))
        tf.summary.scalar('min/' + name, tf.reduce_min(var))
        tf.summary.histogram(name, var)
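The scalar statistics recorded by this method are simple to reproduce by hand. The sketch below computes them for a flat list of numbers; summarize is a hypothetical helper, and, as in the listing, the standard deviation is the population form (dividing by n).

```python
import math

def summarize(values):
    """Mean, population standard deviation, maximum and minimum of a list."""
    mean = sum(values) / len(values)
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {'mean': mean, 'stddev': stddev, 'max': max(values), 'min': min(values)}

print(summarize([1.0, 2.0, 3.0, 4.0]))
```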
Now we need to create a TensorFlow session before we can train the model:
sess = tf.Session()
Let's initialize all the variables:
init_op = tf.global_variables_initializer()
sess.run(init_op)
Then we create a saver so that the TensorFlow model can be saved for future use:
saver = tf.train.Saver()
Now let's prepare the validation set:
x_val, y_val, val_seq_len = data_lstm.get_val_data()
Now we should write the logs of the TensorFlow graph computation:
train_writer.add_graph(lstm_model.input.graph)
Additionally, we can create some empty lists to hold the training loss, validation loss, and the steps so that we can see them graphically:
train_loss_list = []
val_loss_list = []
step_list = []
sub_step_list = []
step = 0
Now we start the training. In each step, we record the training error. The validation errors are recorded in each sub-step:
for i in range(train_steps):
    x_train, y_train, train_seq_len = data_lstm.next_batch(batch_size)
    train_loss, _, summary = sess.run(
        [lstm_model.loss, lstm_model.train_step, lstm_model.merged],
        feed_dict={lstm_model.input: x_train,
                   lstm_model.target: y_train,
                   lstm_model.seq_len: train_seq_len,
                   lstm_model.dropout_keep_prob: dropout_keep_prob})
    train_writer.add_summary(summary, i)  # Write train summary for step i (TensorBoard visualization)
    train_loss_list.append(train_loss)
    step_list.append(i)
    print('{0}/{1} train loss: {2:.4f}'.format(i + 1, train_steps, train_loss))
    if (i + 1) % validate_every == 0:
        # Dropout is disabled (keep probability 1) during validation
        val_loss, accuracy, summary = sess.run(
            [lstm_model.loss, lstm_model.accuracy, lstm_model.merged],
            feed_dict={lstm_model.input: x_val,
                       lstm_model.target: y_val,
                       lstm_model.seq_len: val_seq_len,
                       lstm_model.dropout_keep_prob: 1})
        validation_writer.add_summary(summary, i)
        print(' validation loss: {0:.4f} (accuracy {1:.4f})'.format(val_loss, accuracy))
        step = step + 1
        val_loss_list.append(val_loss)
        sub_step_list.append(step)
The following is the output of the preceding code:
>>>
1/1000 train loss: 0.6883
2/1000 train loss: 0.6879
3/1000 train loss: 0.6943
99/1000 train loss: 0.4870
100/1000 train loss: 0.5307
 validation loss: 0.4018 (accuracy 0.9200)
…
199/1000 train loss: 0.1103
200/1000 train loss: 0.1032
 validation loss: 0.0607 (accuracy 0.9800)
…
299/1000 train loss: 0.0292
300/1000 train loss: 0.0266
 validation loss: 0.0417 (accuracy 0.9800)
…
998/1000 train loss: 0.0021
999/1000 train loss: 0.0007
1000/1000 train loss: 0.0004
 validation loss: 0.0939 (accuracy 0.9700)
The preceding code prints the training and validation error. When the training is over, the model will be saved to the checkpoint directory that has a unique id:
checkpoint_file = '{}/model.ckpt'.format(model_dir)
save_path = saver.save(sess, checkpoint_file)
print('Model saved in: {0}'.format(model_dir))
The following is the output of the preceding code:
>>>
Model saved in: checkpoints/1517781236
The checkpoint directory will contain at least three files:
- config.pkl contains the parameters used to train the model.
- model.ckpt contains the weights of the model.
- model.ckpt.meta contains the TensorFlow graph definition.

Let's see how the training went, that is, what the training and validation losses were like:
# Plot training loss over time
plt.plot(step_list, train_loss_list, 'r--', label='LSTM training loss per iteration', linewidth=4)
plt.title('LSTM training loss per iteration')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.legend(loc='upper right')
plt.show()

# Plot validation loss over time
plt.plot(sub_step_list, val_loss_list, 'r--', label='LSTM validation loss per validating interval', linewidth=4)
plt.title('LSTM validation loss per validation interval')
plt.xlabel('Validation interval')
plt.ylabel('Validation loss')
plt.legend(loc='upper left')
plt.show()
The following is the output of the preceding code:
>>>
If we examine the preceding graphs, it is clear that both the training loss and the validation loss dropped quickly with only 1,000 steps. However, readers should increase the number of training steps, tune the hyperparameters, and see how it goes.
Now let's observe the TensorFlow computational graph on TensorBoard. Simply execute the following command and access TensorBoard at localhost:6006/
:
tensorboard --logdir /home/logs/
The graph tab shows the execution graph, including the gradients used, loss_op, the accuracy, the final layer, the optimizer used (in our case it's RMSProp), the LSTM layer (that is, the RNN layer), the embedding layer, and save_op:
The execution graph shows that the computations we have done for this LSTM-based classifier for sentiment analysis are quite transparent. We can also observe the validation, training losses, accuracies, and the operations in the layers:
We have trained and saved our LSTM model. We can easily restore the trained model and do some evaluation. We need to prepare the testing set and use the previously trained TensorFlow model to make predictions on it. Let's do this straight away. First, we load the required modules:
import tensorflow as tf
from data_preparation import Preprocessing
import pickle
Then we point to the checkpoint directory where the model was saved. In our case, it was checkpoints/1517781236.
# Change this path based on the output of 'python3 train.py'
checkpoints_dir = 'checkpoints/1517781236'
if checkpoints_dir is None:
    raise ValueError('Please, a valid checkpoints directory is required (--checkpoints_dir <file name>)')
Now load the testing dataset and prepare it to evaluate the model:
data_lstm = Preprocessing(data_dir=data_dir,
                          stopwords_file=stopwords_file,
                          sequence_len=sequence_len,
                          n_samples=n_samples,
                          test_size=test_size,
                          val_samples=batch_size,
                          random_state=random_state,
                          ensure_preprocessed=True)
In the preceding code, use the following parameters exactly as we did in the training step:
data_dir = 'data/'  # Data directory containing 'data.csv'
stopwords_file = 'data/stopwords.txt'  # Path to stopwords file
sequence_len = None  # Maximum sequence length
n_samples = None  # Set n_samples=None to use the whole dataset
test_size = 0.2
batch_size = 100  # Batch size
random_state = 100  # Random state used for data splitting; must match the value passed during training (100)
The workflow for this evaluation method is as follows:
Step 1 has already been completed previously. This code does steps 2 to 5:
original_text, x_test, y_test, test_seq_len = data_lstm.get_test_data(original_text=True)
graph = tf.Graph()
with graph.as_default():
    sess = tf.Session()
    print('Restoring graph ...')
    saver = tf.train.import_meta_graph("{}/model.ckpt.meta".format(checkpoints_dir))
    saver.restore(sess, "{}/model.ckpt".format(checkpoints_dir))
    # Look up the tensors by the names given when the graph was built
    input = graph.get_operation_by_name('input').outputs[0]
    target = graph.get_operation_by_name('target').outputs[0]
    seq_len = graph.get_operation_by_name('lengths').outputs[0]
    dropout_keep_prob = graph.get_operation_by_name('dropout_keep_prob').outputs[0]
    predict = graph.get_operation_by_name('final_layer/softmax/predictions').outputs[0]
    accuracy = graph.get_operation_by_name('accuracy/accuracy').outputs[0]
    # Dropout is disabled (keep probability 1) during evaluation
    pred, acc = sess.run([predict, accuracy],
                         feed_dict={input: x_test,
                                    target: y_test,
                                    seq_len: test_seq_len,
                                    dropout_keep_prob: 1})
    print("Evaluation done.")
The following is the output of the preceding code:
>>>
Restoring graph ...
Evaluation done.
Well done! The evaluation is finished, so let's print the results:
print(' Accuracy: {0:.4f} '.format(acc))
for i in range(100):
    print('Sample: {0}'.format(original_text[i]))
    print('Predicted sentiment: [{0:.4f}, {1:.4f}]'.format(pred[i, 0], pred[i, 1]))
    print('Real sentiment: {0} '.format(y_test[i]))
The following is the output of the preceding code:
>>>
Accuracy: 0.9858
Sample: I loved the Da Vinci code, but it raises many theological questions most of which are very absurd...
Predicted sentiment: [0.0000, 1.0000]
Real sentiment: [0. 1.]
…
Sample: I'm sorry I hate to read Harry Potter, but I love the movies!
Predicted sentiment: [1.0000, 0.0000]
Real sentiment: [1. 0.]
…
Sample: I LOVE Brokeback Mountain...
Predicted sentiment: [0.0002, 0.9998]
Real sentiment: [0. 1.]
…
Sample: We also went to see Brokeback Mountain which totally SUCKED!!!
Predicted sentiment: [1.0000, 0.0000]
Real sentiment: [1. 0.]
The accuracy is above 98%. This is brilliant! However, you could try training for more iterations with tuned hyperparameters, and you might get even higher accuracy. I leave this up to the readers.
In the next section, we will see how to develop a more advanced ML project using LSTM: human activity recognition using a smartphones dataset. In short, our ML model will be able to classify human movement into six categories: walking, walking upstairs, walking downstairs, sitting, standing, and laying.