But what is a recurrent neural network, really?

How does the network keep track of the previous states? To put it in the context of text generation, think of our training data as a list of character sequences (tokenized words). For each word, starting from the first character, we predict the character that follows: for the word cat, for instance, given c we predict a, and given ca we predict t.

Formally, let's denote a sequence of t+1 characters as x = [x_0, x_1, x_2, ..., x_t], and let s_{-1} = 0.

For k = 0, 1, ..., t, we construct the following sequence:

$$ s_k = \tanh(U x_k + W s_{k-1}), \qquad o_k = \mathrm{softmax}(V s_k) $$

where U, W, and V are the weight matrices of the network, and o_k is the predicted distribution over the next character.

This is summarized in the following diagram: when input x is received, the internal state, s, of the network is modified and then used to generate an output, o:

Figure 1: Information flow for an RNN. The representations are equivalent; the unfolded representation hints at how backpropagation could work in this setup.
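To make the recurrence concrete, here is a minimal sketch of the forward pass in R. The dimensions, the random initialization, and the names (n_in, n_h, n_out, rnn_forward) are illustrative assumptions, not the author's implementation:

set.seed(42)
n_in <- 5; n_h <- 3; n_out <- 5   # toy vocabulary and hidden sizes

U <- matrix(rnorm(n_h * n_in, sd = 0.1), n_h, n_in)    # input-to-hidden weights
W <- matrix(rnorm(n_h * n_h, sd = 0.1), n_h, n_h)      # hidden-to-hidden weights
V <- matrix(rnorm(n_out * n_h, sd = 0.1), n_out, n_h)  # hidden-to-output weights

softmax <- function(z) { z <- exp(z - max(z)); z / sum(z) }

# x: one one-hot column per character; returns all states s and outputs o
rnn_forward <- function(x, U, W, V) {
  t_steps <- ncol(x)
  s <- matrix(0, nrow(W), t_steps + 1)   # s[, 1] plays the role of s_{-1} = 0
  o <- matrix(0, nrow(V), t_steps)
  for (k in 1:t_steps) {
    s[, k + 1] <- tanh(U %*% x[, k] + W %*% s[, k])
    o[, k] <- softmax(V %*% s[, k + 1])
  }
  list(s = s, o = o)
}

x <- diag(n_in)[, sample(n_in, 4, replace = TRUE)]  # a random one-hot sequence
out <- rnn_forward(x, U, W, V)
out$o   # one probability distribution over characters per time step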

The hidden state, s, need not be one-dimensional, as you might have imagined. Of course, it should not be too big either, otherwise you might experience overfitting. A good way to start is to choose a hidden layer one order of magnitude smaller than the input space: if your input vector lives in a thousand-dimensional space (not uncommon in text problems), then your hidden state should be in the hundreds.

A traditional deep neural network uses different parameters at each layer. In a recurrent neural network, the parameters are shared across all time steps (see the preceding Figure 1). This reflects the fact that the task is the same (in this case, reading), just with different inputs. It also helps to reduce the number of parameters we need to learn.

We choose the tanh function because it squashes its inputs into the range [-1, 1]. Later, we will want to interpret the outputs of the network as probabilities, so the output layer is the softmax function.
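As a quick numeric check of these two choices (a sketch, with arbitrary example values):

tanh(c(-10, 0, 10))                          # approx. -1, 0, 1: squashed into [-1, 1]
p <- exp(c(1, 2, 3)) / sum(exp(c(1, 2, 3)))  # softmax of a vector of scores
p                                            # non-negative weights...
sum(p)                                       # ...that sum to 1: valid probabilities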

The diagram shown in Figure 1 fully describes the forward propagation of our network: given the weight matrices, we can calculate the output of the network. But how should we train the network to find those weights?

As usual in neural networks, the idea is to use the difference between the predictions of the network and the training labels to adjust the weights in the right direction. For this, we need to specify a loss function. A common choice is the cross-entropy loss function:

$$ L(y, o) = -\frac{1}{N} \sum_{n=1}^{N} y_n \log o_n $$

where N denotes the number of training examples, y_n is the true label of example n, and o_n is the predicted probability. For each example, the (one-hot) true label picks out the log probability predicted for the correct class. Note that we get a loss of zero if we predict the correct class with probability 1, and a positive loss otherwise. This means that cross-entropy punishes us for not being confident. Notice that it also punishes us for being confident and incorrect (why is that?).
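As a sketch, the loss can be written in R as follows, assuming y holds one-hot true labels and o the predicted probabilities, with one column per example (the names are illustrative):

cross_entropy <- function(y, o) {
  N <- ncol(y)           # number of training examples
  -sum(y * log(o)) / N
}

y <- matrix(c(0, 1, 0), ncol = 1)
cross_entropy(y, matrix(c(0.01, 0.98, 0.01), ncol = 1))  # confident and correct: ~0.02
cross_entropy(y, matrix(c(0.98, 0.01, 0.01), ncol = 1))  # confident and wrong: ~4.6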

To update the weights, we need to compute the derivatives of the loss with respect to each of the weight matrices:

$$ \frac{\partial L}{\partial U}, \qquad \frac{\partial L}{\partial V}, \qquad \frac{\partial L}{\partial W} $$

We also need to update the weights in the direction that reduces the error, that is, the opposite direction of the gradient. This sounds no more complicated than vanilla backpropagation in standard neural networks. The situation is a bit more complicated in recurrent neural networks because of the hidden state, which links the current time step with the previous history. To see this, let's say that we want to compute the partial derivative of the loss at time t with respect to W. We need to calculate:

$$ \frac{\partial L_t}{\partial W} = \frac{\partial L_t}{\partial o_t}\,\frac{\partial o_t}{\partial s_t}\,\frac{\partial s_t}{\partial W} $$

Nothing strange here, simply using the chain rule. What's the trick? The catch is in the last term of the preceding expression. The internal state at time t depends on the internal state at the previous time, which itself depends on W. Hence, our computation is actually:

$$ \frac{\partial L_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial L_t}{\partial o_t}\,\frac{\partial o_t}{\partial s_t}\,\frac{\partial s_t}{\partial s_k}\,\frac{\partial s_k}{\partial W} $$

You can see how this computation proceeds from the preceding diagram, which shows the unfolded version of the network. The preceding computation is a variant of the backpropagation algorithm, called backpropagation through time.
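As an illustration, here is a rough sketch of this computation in R, for the gradient with respect to W only. It reuses the rnn_forward() helper from the earlier sketch; the function name bptt_dW, the indexing scheme, and the softmax-plus-cross-entropy shortcut (o - y) are assumptions for illustration, not the author's implementation:

# y: one-hot targets with the same shape as the outputs o
bptt_dW <- function(x, y, U, W, V) {
  fwd <- rnn_forward(x, U, W, V)
  s <- fwd$s
  o <- fwd$o
  dW <- matrix(0, nrow(W), ncol(W))
  for (k in 1:ncol(x)) {
    # for a softmax output with cross-entropy loss, dL_k/d(V s) = o_k - y_k
    delta <- (t(V) %*% (o[, k] - y[, k])) * (1 - s[, k + 1]^2)
    for (j in k:1) {   # walk back through time, adding each term of the sum
      dW <- dW + delta %*% t(s[, j])           # s[, j] is the state entering step j
      delta <- (t(W) %*% delta) * (1 - s[, j]^2)
    }
  }
  dW
}

# Toy usage with the x, U, W, V defined earlier:
y <- diag(n_out)[, sample(n_out, ncol(x), replace = TRUE)]
bptt_dW(x, y, U, W, V)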

An important consequence of the preceding is that recurrent neural networks are hard to train. Since the derivative of the tanh function is bounded by 1, the preceding computation involves many factors between 0 and 1 multiplying each other, so the resulting gradient can be small and might become numerically zero early in training. This is quite bad, because it prevents the network from learning long-term dependencies: the gradient contributions from far-away steps vanish since, by the chain rule, they are the product of many numbers less than 1. This problem is known as the vanishing gradient problem and was first identified by Sepp Hochreiter in 1991. It is by no means exclusive to recurrent neural networks; standard feed-forward neural networks have the same issue when they have many layers (we say they are deep when they have four or more layers). It is just that recurrent neural networks tend to be very deep, as deep as the sentence length, in our case.
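To see the effect numerically, here is a small sketch: we multiply 50 hypothetical tanh-derivative factors, each bounded by 1, and watch the product shrink toward zero (the pre-activations are drawn at random purely for illustration):

d_tanh <- function(a) 1 - tanh(a)^2   # derivative of tanh, bounded by 1
set.seed(1)
factors <- d_tanh(rnorm(50))          # one factor per step back in time
prod(factors)                         # vanishingly small after 50 steps
plot(cumprod(factors), type = "b", log = "y",
     xlab = "steps back in time", ylab = "accumulated gradient factor")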

There are a few ways to address the vanishing gradient problem. One is to use the ReLU activation function, which in R notation is:

relu <- function(x) ifelse(x > 0, x, 0)

This function has a gradient of either 0 or 1, hence it helps us avoid the vanishing gradient issue: relevant sections of the parameter space will be explored, even if they correspond to changes far back in the history. Another possibility is to use different mechanisms, such as LSTM networks, proposed by Hochreiter and Schmidhuber in 1997, or GRU networks, introduced by Cho and coauthors in 2014. We will describe these networks briefly in the next section.

Besides the vanishing gradient problem, you can imagine that, depending on the structure of the network and the activation functions you choose, you can have the opposite situation: exploding gradients. This can happen, for instance, when the Jacobian has very large values. This is somewhat less serious, because you will notice it quickly: you will get NaN values in your code. One way around it is to clip the gradients, which simply means adding a condition that caps values above a certain threshold. Of course, the right threshold will depend on your particular application, and you might want to experiment with different values.
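A minimal sketch of such clipping in R; the component-wise scheme and the default threshold are assumptions to be tuned per application:

clip_gradient <- function(grad, threshold = 5) {
  # cap each component to the interval [-threshold, threshold]
  pmax(pmin(grad, threshold), -threshold)
}
clip_gradient(c(-100, 0.3, 42))   # returns -5.0 0.3 5.0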

The main complication in implementing different types of neural networks is the computation of the gradients. In simple cases, this can be done by hand but, as you can see, things quickly get out of control.

Once the gradients are obtained, there are different optimization methods we can use, perhaps the most popular being stochastic gradient descent. Unfortunately, stochastic gradient descent is not plug and play: you need to spend some time tuning the step size (learning rate) hyperparameter to get it to work properly, which is inconvenient when working with neural networks. Other solutions, such as Adagrad, are used in practice. Adagrad is particularly simple to implement, and we will do so later in this chapter: the learning rate is adapted component-wise, dividing it by the square root of the sum of squares of the historical, component-wise gradients. An advantage of Adagrad is that it has a sound theory behind it, rather than being purely heuristic. Other optimizers include RMSProp and Adam.
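Here is a minimal sketch of the Adagrad update just described; the function name, the learning rate, and the epsilon smoothing term are illustrative assumptions:

adagrad_update <- function(w, grad, G, lr = 0.1, eps = 1e-8) {
  G <- G + grad^2                        # accumulate squared gradients, component-wise
  w <- w - lr * grad / (sqrt(G) + eps)   # divide the step by the historical magnitude
  list(w = w, G = G)
}

# One step on a toy quadratic loss f(w) = sum(w^2), whose gradient is 2 * w
state <- list(w = c(1, -2), G = c(0, 0))
state <- adagrad_update(state$w, 2 * state$w, state$G)
state$w   # c(0.9, -1.9): each component moved by the same adapted step size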

It is possible to optimize a recurrent neural network without relying on gradient methods at all, for instance, by using genetic or evolutionary algorithms. Evolutionary algorithms are good at finding global minima in some cases: since they do not follow the direction of the gradient, they can sometimes find better solutions. A drawback of such algorithms is that they can be quite computationally intensive.
