RNNs and vanishing gradients

RNNs themselves are an important architectural innovation, but they suffer from vanishing gradients. When gradient values become so small that the resulting weight updates are equally tiny, learning slows or even halts. The network can no longer connect an error at the current timestep to inputs it saw many steps earlier, so it effectively forgets long-range context and doesn't do what you want it to do. But is a neural network with a bad memory better than one with no memory at all?

Let's zoom in a bit and discuss what's actually going on when you run into this problem. Recall the formula for updating a given weight during gradient descent, using the gradient computed by backpropagation:

W = W - LR*G

Here, the new weight value equals the current weight minus the learning rate multiplied by the gradient. When the gradient is close to zero, so is the update, and the weight barely moves.
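To make the update rule concrete, here is a minimal sketch in plain Python. The function name `sgd_step` and the numeric values are illustrative assumptions, not taken from any particular library; the point is only to compare the step size for an ordinary gradient with the step size for a vanishingly small one.

```python
def sgd_step(weight, gradient, learning_rate):
    """Apply one update of the form W = W - LR * G and return the new weight."""
    return weight - learning_rate * gradient


w = 0.5    # hypothetical current weight
lr = 0.1   # hypothetical learning rate

# Same weight and learning rate, two very different gradients.
healthy = sgd_step(w, gradient=0.2, learning_rate=lr)
vanished = sgd_step(w, gradient=1e-9, learning_rate=lr)

print("step with a healthy gradient: ", w - healthy)
print("step with a vanished gradient:", w - vanished)
```

With the tiny gradient, the weight is left almost exactly where it started, which is what "learning halts" means in practice.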

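Your network is propagating error derivatives across layers and across timesteps. The longer your input sequences, the greater the number of timesteps, and so the greater the number of layers in the unrolled network, even though the same weights are shared across every timestep. At each step, the unrolled RNN contains an activation function that squashes its output into a narrow range: between 0 and 1 for the sigmoid, or between -1 and 1 for tanh. The derivative of such a function is small (at most 0.25 for the sigmoid) and approaches zero as the unit saturates.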
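The squashing behavior matters because backpropagation multiplies by the derivative of the activation, not by its output. The sketch below, using only the standard library, evaluates the derivatives of the sigmoid and tanh functions at a few points; the helper names `sigmoid_grad` and `tanh_grad` are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, which maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, largest at z = 0."""
    return 1.0 - math.tanh(z) ** 2


# The derivatives peak at z = 0 and fall toward zero as |z| grows.
for z in (0.0, 2.0, 5.0):
    print(f"z={z}: sigmoid'={sigmoid_grad(z):.4f}, tanh'={tanh_grad(z):.4f}")
```

Both derivatives are largest at zero and shrink quickly as the input moves away from it, so a saturated unit passes back almost no gradient.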

The repetition of these operations means the gradient is multiplied again and again by values that are typically smaller than one, so it shrinks roughly exponentially with the number of timesteps it travels back through. By the time the error signal reaches the early timesteps, it is very close to zero. If the changes to the parameter we are learning are too small to have an effect on the output of the network itself, then the network will fail to learn the value for that parameter.
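The effect of this repetition can be sketched with a toy recurrent network that has a single scalar hidden unit. This is an illustrative assumption, not a realistic model: the recurrence is taken to be h_t = sigmoid(w * h_{t-1} + x_t), and the function name, weight, and inputs are all hypothetical. The code tracks how the derivative of the current hidden state with respect to the initial hidden state changes as more timesteps are added.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, which maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_t / d h_0 after each timestep of a scalar RNN.

    The recurrence is h_t = sigmoid(w * h_{t-1} + x_t), so by the chain
    rule each timestep multiplies the gradient by w * h_t * (1 - h_t).
    """
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = sigmoid(w * h + x)
        grad *= w * h * (1.0 - h)  # one more chain-rule factor
        history.append(grad)
    return history


# Hypothetical setup: recurrent weight of 1.0 and 50 identical inputs.
grads = gradient_through_time(w=1.0, inputs=[0.5] * 50)
for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: {grads[t - 1]:.3e}")
```

Because each factor is at most 0.25 when the recurrent weight is 1.0, the gradient shrinks geometrically with the number of timesteps. A larger recurrent weight would slow the decay, and a large enough one can cause the opposite problem of exploding gradients.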

So, instead of using the entirety of the hidden state from the previous timestep, is there a way to make the network a bit smarter about what information it chooses to keep as we step it through time during training? The answer is yes! Let's consider these changes to the network architecture.
