Vanishing and exploding gradients rescued by LSTM

RNNs suffer from either vanishing or exploding gradients. As errors are propagated back through many time steps, the gradients used to update the weights either shrink toward zero or grow uncontrollably large, so any additional training has little or no effect. This limits the usefulness of the RNN, but fortunately the problem was addressed with Long Short-Term Memory (LSTM) blocks, as shown in this diagram:
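To get a rough numerical feel for why this happens (a toy sketch, not part of the example script), imagine the gradient being scaled by a similar factor at every one of many time steps as the error is propagated backward:

steps = 50
# Repeatedly scaling by a factor below 1 drives the value toward zero,
# while a factor above 1 blows it up -- the same fate a gradient meets
# when it is passed back through many time steps of an RNN.
print(0.9 ** steps)   # ~0.005 -> the gradient "vanishes"
print(1.1 ** steps)   # ~117.4 -> the gradient "explodes"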



Example of an LSTM block

LSTM blocks overcome the vanishing gradient problem using a few techniques. In the diagram, each x inside a circle denotes a gate controlled by an activation function; the activation functions used are σ (sigmoid) and tanh. These functions work much like the step or ReLU activations we could use in a regular network layer. For the most part, we will treat an LSTM as a black box; all you need to remember is that LSTMs overcome the gradient problem of RNNs and can remember long-term sequences.
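For the curious, here is a minimal sketch of the math a single LSTM step performs with its gates. This is a schematic only, not the actual Keras implementation; the weight names W_f, W_i, W_o, and W_c are made up for the sketch, and biases are omitted to keep it short:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_c):
    # Concatenate the previous hidden state with the current input
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)          # forget gate: what to keep from the old cell state
    i = sigmoid(W_i @ z)          # input gate: what new information to write
    o = sigmoid(W_o @ z)          # output gate: what to expose as the hidden state
    c_hat = np.tanh(W_c @ z)      # candidate cell values
    c = f * c_prev + i * c_hat    # new cell state (the "long-term memory")
    h = o * np.tanh(c)            # new hidden state
    return h, c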

Let's take a look at a working example to see how this comes together. Open up Chapter_2_4.py and follow these steps:

  1. We begin as per usual by importing the various Keras pieces we need, as shown:
This example was pulled from https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/. This is a site hosted by Dr. Jason Brownlee, who has plenty more excellent examples explaining the use of LSTM and recurrent networks.  
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
  2. This time we are importing two new classes, Sequential and LSTM. Of course, we know what LSTM is for, but what about Sequential? Sequential is a form of model that defines its layers in a sequence, one after another. We were less worried about this detail before, since our previous models were all sequential anyway; the sketch after this step shows an equivalent way to write one.
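As a quick illustration, and using the same Keras API as the rest of the chapter (the layer sizes here are arbitrary, chosen just for the sketch), these two snippets define the same model:

from keras.models import Sequential
from keras.layers import Dense

# Form 1: add layers one at a time
model = Sequential()
model.add(Dense(16, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))

# Form 2: pass the layers as a list, in order
model = Sequential([
    Dense(16, input_dim=4, activation='relu'),
    Dense(3, activation='softmax')
])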
  3. Next, we set the random seed to a known value. We do this so that our example is reproducible. You may have noticed in previous examples that not all runs perform the same. In many cases, we want our training to be consistent, and hence we set a known seed value by using this code:
numpy.random.seed(7)
  4. It is important to realize that this only sets the numpy random seed value. Other libraries may use different random number generators and require different seed settings (see the sketch after this step). We will try to identify these inconsistencies in the future when possible.
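For example, if you want the other common sources of randomness pinned down as well, something like the following sketch could be used; note that the TensorFlow call differs between versions, so treat these lines as assumptions about your particular setup:

import random
import numpy

random.seed(7)          # Python's built-in random module
numpy.random.seed(7)    # NumPy, which Keras uses for weight initialization

# TensorFlow keeps its own generator; the call depends on the version installed:
#   TensorFlow 1.x: tensorflow.set_random_seed(7)
#   TensorFlow 2.x: tensorflow.random.set_seed(7)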
  5. Next, we need to identify a sequence we will train on; in this case, we will just use the alphabet, as shown in this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

seq_length = 1
dataX = []
dataY = []

for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
  6. The preceding code maps each character to an integer (and back again), then builds the training pairs: seq_in is the input sequence and seq_out is the character that follows it. Since the sequence length is defined by seq_length = 1, each pair is just one letter of the alphabet and the letter that comes after it. You could, of course, use longer sequences. The print statement lets you confirm the pairs, as shown below.
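The print statement inside the loop shows each input/output pair as it is built; the first few lines of output look like this:

A -> B
B -> C
C -> D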
  7. With the sequence data built, it is time to shape the data and normalize it with this code:
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
  8. The first line in the preceding code reshapes the data into a tensor whose dimensions are the number of samples (the length of dataX), the number of time steps in each sequence (seq_length), and the number of features per step (1). We then normalize the data. Normalization comes in many forms, but in this case we scale the values to lie between 0 and 1 by dividing by the length of the alphabet. Finally, we one-hot encode the output for easier training.
One-hot encoding is where you set the value to 1 at the position where you have data or a response, and to 0 everywhere else. In this example, our model output is 26 neurons, which can also be represented by 26 zeros, one for each neuron, like so:
00000000000000000000000000

Each position represents the matching character in the alphabet. If we wanted to denote the character A, we would output the one-hot encoded value like this:
10000000000000000000000000
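To see this conversion in code, np_utils.to_categorical is what performs it in our script; here is a minimal sketch with three classes instead of 26, just for readability:

from keras.utils import np_utils

# 0, 1, 2 might stand for 'A', 'B', 'C'; to_categorical infers three classes here
print(np_utils.to_categorical([0, 1, 2]))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]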

 

  9. Then we construct the model, using a slightly different form of code than we have seen before, as shown here:
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
  10. The critical piece of the preceding code is the line that constructs the LSTM layer. We construct an LSTM layer by setting the number of units, in this case 32, a convenient round number comfortably larger than our 26-character alphabet. Then we set the input_shape to match the tensor X that we created to hold our training data: X.shape[1] is the number of time steps in each sequence (seq_length, here 1) and X.shape[2] is the number of features per step (also 1). You can confirm these shapes as shown below.
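If you want to sanity-check the shapes being discussed, you could print them just before building the model (a quick diagnostic, not part of the original script):

print(X.shape)   # (25, 1, 1): 25 samples, 1 time step, 1 feature per step
print(y.shape)   # (25, 26): one one-hot row of 26 values per sample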
  11. Finally, we output the model's predictions with the following code:
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
  12. Run the code as you normally would and examine the output. You will notice that the accuracy is around 80%. See whether you can improve the model's accuracy at predicting the next character in the alphabet; one possible starting point is sketched below.
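As a starting point for that experiment (a sketch only; the numbers are assumptions to tinker with rather than tuned values), you could give the network a longer window of context by increasing seq_length near the top of Chapter_2_4.py. The only lines that change are shown here:

# Use a window of three characters as input instead of one, e.g. 'ABC' -> 'D'
seq_length = 3

# The reshape already uses seq_length, so X simply becomes (samples, 3, 1)
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# Other knobs worth experimenting with: more LSTM units, more epochs,
# or a different batch_size in model.fit()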

This simple example demonstrated the basic use of an LSTM block for recognizing a simple sequence. In the next section, we look at a more complex example: using LSTM to play Rock, Paper, Scissors.
