LSTM

In the last section, we learned about basic RNNs. In theory, simple RNNs should be able to retain even long-term memories. However, in practice, this approach often falls short because of the vanishing gradients problem.

Over the course of many timesteps, the network has a hard time maintaining meaningful gradients. While this is not the focus of this chapter, a more detailed exploration of why this happens can be found in the 1994 paper, Learning long-term dependencies with gradient descent is difficult, by Yoshua Bengio, Patrice Simard, and Paolo Frasconi, available at https://ieeexplore.ieee.org/document/279181.

In direct response to the vanishing gradients problem of simple RNNs, the Long Short-Term Memory (LSTM) layer was invented. This layer performs much better at longer time series. Yet, if relevant observations are a few hundred steps behind in the series, then even LSTM will struggle. This is why we manually included some lagged observations.

Before we dive into details, let's look at a simple RNN that has been unrolled over time:

A rolled-out RNN

As you can see, this is the same RNN that we saw in Chapter 2, Applying Machine Learning to Structured Data, except that here it has been unrolled over time.

The carry

The central addition of an LSTM over an RNN is the carry. The carry is like a conveyor belt that runs along the RNN layer. At each time step, the carry is fed into the RNN layer. The new carry gets computed from the input, the RNN output, and the old carry, in a separate operation from the RNN layer itself:

The LSTM schematic

To understand what the Compute Carry operation does, we should first determine what should be added from the input and state:

$i_t = a(s_t \cdot U_i + in_t \cdot W_i + b_i)$

$k_t = a(s_t \cdot U_k + in_t \cdot W_k + b_k)$

In these formulas, $s_t$ is the state at time $t$ (the output of the simple RNN layer), $in_t$ is the input at time $t$, and $U_i$, $W_i$, $U_k$, and $W_k$ are the model parameters (matrices) that will be learned, together with the bias vectors $b_i$ and $b_k$. $a()$ is an activation function.

To determine what should be forgotten from the state and input, we need to use the following formula:

$f_t = a(s_t \cdot U_f + in_t \cdot W_f + b_f)$

The new carry is then computed as follows (the multiplications here are element-wise):

$c_{t+1} = c_t \cdot f_t + i_t \cdot k_t$
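
To make these formulas concrete, the following is a minimal NumPy sketch of a single carry update. The names, shapes, and the choice of sigmoid as the activation $a()$ are illustrative assumptions, and the parameters are random stand-ins for the matrices that would be learned during training:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

state_dim, input_dim = 4, 3              # hypothetical sizes
s_t = np.random.randn(state_dim)         # state at time t (the RNN output)
in_t = np.random.randn(input_dim)        # input at time t
c_t = np.random.randn(state_dim)         # the old carry

# Random stand-ins for the learned matrices and biases
Ui, Wi, bi = np.random.randn(state_dim, state_dim), np.random.randn(state_dim, input_dim), np.zeros(state_dim)
Uk, Wk, bk = np.random.randn(state_dim, state_dim), np.random.randn(state_dim, input_dim), np.zeros(state_dim)
Uf, Wf, bf = np.random.randn(state_dim, state_dim), np.random.randn(state_dim, input_dim), np.zeros(state_dim)

i_t = sigmoid(Ui @ s_t + Wi @ in_t + bi)  # what to add from state and input
k_t = sigmoid(Uk @ s_t + Wk @ in_t + bk)
f_t = sigmoid(Uf @ s_t + Wf @ in_t + bf)  # what to forget

c_next = c_t * f_t + i_t * k_t            # the new carry, element-wise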

While the standard theory claims that the LSTM layer learns what to add and what to forget, in practice, nobody knows what really happens inside an LSTM. However, LSTM models have been shown to be quite effective at learning long-term memory.

Note that LSTM layers do not need an extra activation function, as they already come with a tanh activation function out of the box.
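
In Keras terms, writing out this default changes nothing; to the best of my knowledge, the following two layer definitions are equivalent:

from keras.layers import LSTM

LSTM(16)                     # tanh is the default output activation
LSTM(16, activation='tanh')  # identical, just explicit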

LSTMs can be used in the same way as SimpleRNN:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(16, input_shape=(max_len, n_features)))  # 16 recurrent units
model.add(Dense(1))                                     # single regression output
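
As a quick sanity check, you can inspect the layer shapes and parameter counts. Because each of the internal computations we saw above has its own weight matrices, an LSTM layer has roughly four times as many parameters as a SimpleRNN with the same number of units:

model.summary()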

To stack recurrent layers, you also need to set return_sequences to True on every layer except the last, so that each layer passes its full output sequence to the next. Note that you can easily combine LSTM and SimpleRNN using the following code:

from keras.layers import SimpleRNN

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(max_len, n_features)))
model.add(SimpleRNN(16, return_sequences=True))
model.add(LSTM(16))
model.add(Dense(1))
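
If you are unsure where return_sequences is needed, it helps to trace the output shapes through this stack (batch denotes the batch size):

# (batch, max_len, n_features) -> LSTM(32, return_sequences=True) -> (batch, max_len, 32)
# (batch, max_len, 32)         -> SimpleRNN(16, return_sequences=True) -> (batch, max_len, 16)
# (batch, max_len, 16)         -> LSTM(16) -> (batch, 16)
# (batch, 16)                  -> Dense(1) -> (batch, 1)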

Note

If you are using a GPU and the TensorFlow backend with Keras, use CuDNNLSTM instead of LSTM. It's significantly faster while working in exactly the same way.
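
As a sketch, the swap for our first model would look as follows, assuming the same max_len and n_features as before:

from keras.layers import CuDNNLSTM

model = Sequential()
model.add(CuDNNLSTM(16, input_shape=(max_len, n_features)))  # GPU-accelerated LSTM
model.add(Dense(1))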

We'll now compile and run the model just as we did before:

model.compile(optimizer='adam', loss='mean_absolute_percentage_error')

model.fit_generator(train_gen,
                    epochs=20,
                    steps_per_epoch=n_train_samples // batch_size,
                    validation_data=val_gen,
                    validation_steps=n_val_samples // batch_size)

This time, the loss went as low as 88,735, which is several orders of magnitude better than our initial model.
