The vanishing gradient problem

In the backpropagation algorithm, the weights are adjusted in proportion to the gradient of the error, and because of the way the gradients are computed, two situations can arise:

  • If the weights are small, the gradient signal shrinks as it is propagated backwards, until learning becomes very slow or stops working altogether. This is referred to as vanishing gradients.
  • If the weights are large, the gradient signal grows so much that it can cause learning to diverge. This is referred to as exploding gradients (both regimes are illustrated in the sketch after this list).

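As a minimal sketch of both regimes, the snippet below repeatedly scales a gradient signal, as happens layer by layer during backpropagation. The per-layer factors 0.5 and 1.5 and the depth of 30 layers are illustrative assumptions, not values from the text:

```python
def propagate_gradient(factor, n_layers=30, grad=1.0):
    """Multiply an initial gradient by `factor` once per layer,
    mimicking the repeated scaling applied during backpropagation."""
    for _ in range(n_layers):
        grad *= factor
    return grad

print(propagate_gradient(0.5))  # ~9.3e-10: the signal vanishes
print(propagate_gradient(1.5))  # ~1.9e+05: the signal explodes
```

After only 30 layers the two regimes differ by roughly 14 orders of magnitude, which is why learning either stalls or diverges.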
The vanishing-exploding gradient problem also afflicts RNNs. In fact, BPTT unrolls the RNN, creating a very deep feed-forward neural network. The inability of the basic RNN to keep a long-term context is due precisely to this phenomenon: if the gradient vanishes or explodes within a few time steps, the network cannot learn relationships between data that are far apart in time.
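The same effect can be reproduced on an unrolled recurrence. The sketch below is a simplification that assumes a linear recurrence, so the gradient is multiplied by the transposed recurrent weight matrix once per time step (in a real RNN the Jacobian also contains the activation derivative); the hidden size, weight scales, and sequence length are arbitrary choices for illustration:

```python
import numpy as np

np.random.seed(0)
hidden_size, timesteps = 16, 50

def backpropagated_norm(weight_scale):
    """Norm of the gradient after being pushed back through `timesteps`
    unrolled steps of a simplified (linear) recurrent layer."""
    W_hh = np.random.randn(hidden_size, hidden_size) * weight_scale
    grad = np.ones(hidden_size)        # dL/dh at the last time step
    for _ in range(timesteps):         # one matrix product per unrolled step
        grad = W_hh.T @ grad
    return np.linalg.norm(grad)

print(backpropagated_norm(0.1))  # small recurrent weights: the gradient vanishes
print(backpropagated_norm(0.5))  # larger recurrent weights: the gradient explodes
```

Whether the gradient shrinks or grows depends on whether the repeated matrix product contracts or expands the signal, which is exactly why relationships spanning many time steps are hard for a basic RNN to learn.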

The following figure shows schematically what happens: the backpropagated gradient tends to shrink (or grow) at each time step, and after a certain number of time steps it converges to zero (or explodes to infinity):

Vanishing gradient problem in RNNs

To overcome the vanishing-exploding problem, various extensions of the basic RNN model have been proposed; one of them is the Long Short-Term Memory (LSTM) network, which will be introduced in the next section.
