13

Making Predictions with Sequences Using Recurrent Neural Networks

In the previous chapter, we focused on convolutional neural networks (CNNs) and used them to deal with image-related tasks. In this chapter, we will explore recurrent neural networks (RNNs), which are suitable for sequential data and time-dependent data, such as daily temperature, DNA sequences, and customers' shopping transactions over time. You will learn how the recurrent architecture works and see variants of the model. We will then work on their applications, including sentiment analysis and text generation. Finally, as a bonus section, we will cover a recent state-of-the-art sequential learning model: the Transformer.

We will cover the following topics in this chapter:

  • Sequential learning by RNNs
  • Mechanisms and training of RNNs
  • Different types of RNNs
  • Long Short-Term Memory RNNs
  • RNNs for sentiment analysis
  • RNNs for text generation
  • Self-attention and the Transformer model

Introducing sequential learning

The machine learning problems we have solved so far in this book have been time-independent. For example, ad click-through doesn't depend on the user's historical ad clicks under our previous approach; in face classification, the model only takes in the current face image, not previous ones. However, there are many cases in life that depend on time. For example, in financial fraud detection, we can't just look at the present transaction; we should also consider previous transactions so that we can model based on their discrepancy. Another example is part-of-speech (PoS) tagging, where we assign a PoS (verb, noun, adverb, and so on) to a word. Instead of solely focusing on the given word, we must look at some previous words, and sometimes the next words too.

In time-dependent cases like those just mentioned, the current output is dependent on not only the current input, but also the previous inputs; note that the length of the previous inputs is not fixed. Using machine learning to solve such problems is called sequence learning, or sequence modeling. And obviously, the time-dependent event is called a sequence. Besides events that occur in disjoint time intervals (such as financial transactions, phone calls, and so on), text, speech, and video are also sequential data.

You may be wondering why we can't just model the sequential data in a regular fashion by feeding in the entire sequence. This can be quite limiting as we have to fix the input size. One problem is that we will lose information if an important event lies outside of the fixed window. But can we just use a very large time window? Note that the feature space grows along with the window size. The feature space will become excessive if we want to cover enough events in a certain time window. Hence, overfitting can be another problem.

I hope you now see why we need to model sequential data in a different way. In the next section, we will talk about the model used for modern sequence learning: RNNs.

Learning the RNN architecture by example

As you can imagine, RNNs stand out because of their recurrent mechanism. We will start with a detailed explanation of this in the next section. We will talk about different types of RNNs after that, along with some typical applications.

Recurrent mechanism

Recall that in feedforward networks (such as vanilla neural networks and CNNs), data moves one way, from the input layer to the output layer. In RNNs, the recurrent architecture allows information to loop back into the network, so data is not limited to a feedforward direction. Specifically, in a hidden layer of an RNN, the output from the previous time point becomes part of the input for the current time point. The following diagram illustrates how data flows in an RNN in general:

Figure 13.1: The general form of an RNN

Such a recurrent architecture makes RNNs work well with sequential data, including time series (such as daily temperatures, daily product sales, and clinical EEG recordings) and general consecutive data with order (such as words in a sentence, DNA sequences, and so on). Take a financial fraud detector as an example; the output features from the previous transaction go into the training for the current transaction. In the end, the prediction for one transaction depends on all of its previous transactions. Let me explain the recurrent mechanism in a mathematical and visual way.

Suppose we have some inputs, xt. Here, t represents a time step or a sequential order. In a feedforward neural network, we simply assume that inputs at different t are independent of each other. We denote the output of a hidden layer at a time step, t, as ht = f(xt), where f is the abstract of the hidden layer.

This is depicted in the following diagram:

Figure 13.2: General form of a feedforward neural network

On the contrary, the feedback loop in an RNN feeds the information of the previous state to the current state. The output of a hidden layer of an RNN at a time step, t, can be expressed as ht = f(ht−1, xt). This is depicted in the following diagram:

Figure 13.3: Unfolded recurrent layer over time steps

The same task, f, is performed on each element of the sequence, and the output, ht, is dependent on the output that's generated from previous computations, ht−1. The chain-like architecture captures the "memory" that has been calculated so far. This is what makes RNNs so successful in dealing with sequential data.
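
To make this chain-like computation concrete, here is a minimal Python sketch of the recursion; the function f is left abstract (a real RNN learns it, as we will see later), and the running-sum example is just a toy stand-in:

>>> def run_rnn(f, inputs, h0):
...     h = h0
...     states = []
...     for x_t in inputs:   # one element of the sequence per time step
...         h = f(h, x_t)    # h_t depends on h_t-1 and x_t
...         states.append(h)
...     return states
>>> # Toy usage: "remember" a running sum of the inputs
>>> print(run_rnn(lambda h, x: h + x, inputs=[1, 2, 3], h0=0))
[1, 3, 6]

Each state carries information from all the earlier steps, which is exactly the "memory" described above.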

Moreover, thanks to the recurrent architecture, RNNs also have great flexibility in dealing with different combinations of input sequences and/or output sequences. In the next section, we will talk about different categories of RNNs based on input and output, including the following:

  • Many-to-one
  • One-to-many
  • Many-to-many (synced)
  • Many-to-many (unsynced)

We will start by looking at many-to-one RNNs.

Many-to-one RNNs

The most intuitive type of RNN is probably many-to-one. A many-to-one RNN can have input sequences with as many time steps as you want, but it only produces one output after going through the entire sequence. The following diagram depicts the general structure of a many-to-one RNN:

Figure 13.4: General form of a many-to-one RNN

Here, f represents one or more recurrent hidden layers, where an individual layer takes in its own output from the previous time step. Here is an example of three hidden layers stacking up:

Figure 13.5: Example of three recurrent layers stacking up

Many-to-one RNNs are widely used for classifying sequential data. Sentiment analysis is a good example of this and is where the RNN reads the entire customer review, for instance, and assigns a sentiment score (positive, neutral, or negative sentiment). Similarly, we can also use RNNs of this kind in the topic classification of news articles. Identifying the genre of a song is another application as the model can read the entire audio stream. We can also use many-to-one RNNs to determine whether a patient is having a seizure based on an EEG trace.

One-to-many RNNs

One-to-many RNNs are the exact opposite of many-to-one RNNs. They take in only one input (not a sequence) and generate a sequence of outputs. A typical one-to-many RNN is presented in the following diagram:

Figure 13.6: General form of a one-to-many RNN

Again, f represents one or more recurrent hidden layers.

Note that "one" here doesn't mean that there is only one input feature. It means the input is from one time step, or it is time-independent.

One-to-many RNNs are commonly used as sequence generators. For example, we can generate a piece of music given a starting note or/and a genre. Similarly, we can write a movie script like a professional screenwriter using one-to-many RNNs with a starting word we specify. Image captioning is another interesting application: the RNN takes in an image and outputs the description (a sentence of words) of the image.

Many-to-many (synced) RNNs

The third type of RNN, many-to-many (synced), allows each element in the input sequence to have an output. Let's look at how data flows in the following many-to-many (synced) RNN:

Figure 13.7: General form of a many-to-many (synced) RNN

As you can see, each output is calculated based on its corresponding input and all the previous inputs, which are carried forward through the hidden states.

One common use case for this type of RNN is time series forecasting, where we want to perform rolling prediction at every time step based on the current and previously observed data. Here are some examples of time series forecasting where we can leverage synced many-to-many RNNs:

  • Product sales each day for a store
  • Daily closing price of a stock
  • Power consumption of a factory each hour

They are also widely used in solving NLP problems, including PoS tagging, named-entity recognition, and real-time speech recognition.

Many-to-many (unsynced) RNNs

Sometimes, we only want to generate the output sequence after we've processed the entire input sequence. This is the unsynced version of a many-to-many RNN.

Refer to the following diagram for the general structure of a many-to-many (unsynced) RNN:

Figure 13.8: General form of a many-to-many (unsynced) RNN

Note that the length of the output sequence (Ty in the preceding diagram) can be different from that of the input sequence (Tx in the preceding diagram). This provides us with some flexibility.

This type of RNN is the go-to model for machine translation. In French-English translation, for example, the model first reads a complete sentence in French and then produces a translated sentence in English. Multi-step ahead forecasting is another popular example: sometimes, we are asked to predict sales for multiple days in the future when given data from the past month.

You have now learned about four types of RNN based on the model's input and output.

Wait, what about one-to-one RNNs? There is no such thing. One-to-one is just a regular feedforward model.

We will be applying some of these types of RNN to solve projects, including sentiment analysis and word generation, later in this chapter. Now, let's figure out how an RNN model is trained.

Training an RNN model

To explain how we optimize the weights (parameters) of an RNN, we first annotate the weights and the data on the network, as follows:

  • U denotes the weights connecting the input layer and the hidden layer.
  • V denotes the weights between the hidden layer and the output layer. Note here that we use only one recurrent layer for simplicity.
  • W denotes the weights of the recurrent layer; that is, the feedback layer.
  • xt denotes the inputs at time step t.
  • st denotes the hidden state at time step t.
  • ht denotes the outputs at time step t.

Next, we unfold the simple RNN model over three time steps: t − 1, t, and t + 1, as follows:

Figure 13.9: Unfolding a recurrent layer

We describe the mathematical relationships between the layers as follows:

  • We let a denote the activation function for the hidden layer. In RNNs, we usually choose tanh or ReLU as the activation function for the hidden layers.
  • Given the current input, xt, and the previous hidden state, st−1, we compute the current hidden state, st, by st = a(Uxt + Wst−1). Feel free to read Chapter 8, Predicting Stock Prices with Artificial Neural Networks again to brush up on your knowledge of neural networks.
  • In a similar manner, we compute st−1 based on xt−1 and st−2, that is, st−1 = a(Uxt−1 + Wst−2). We repeat this until s1, which depends on x1 and s0. We usually set s0 to all zeros.
  • We let g denote the activation function for the output layer. It can be a sigmoid function if we want to perform binary classification, a softmax function for multi-class classification, and a simple linear function (that is, no activation) for regression.
  • Finally, we compute the output at time step t as ht = g(Vst).

With the dependency in hidden states over time steps (that is, st depends on st−1, st−1 depends on st−2, and so on), the recurrent layer brings memory to the network, which captures and retains information from all the previous time steps.
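
The following NumPy sketch runs this forward pass for a toy vanilla RNN; the dimensions and random weights are arbitrary choices for illustration, with tanh playing the role of a and softmax the role of g:

>>> import numpy as np
>>> np.random.seed(42)
>>> input_dim, hidden_dim, output_dim, T = 4, 3, 2, 5   # toy sizes
>>> U = np.random.randn(hidden_dim, input_dim) * 0.1    # input-to-hidden weights
>>> W = np.random.randn(hidden_dim, hidden_dim) * 0.1   # recurrent weights
>>> V = np.random.randn(output_dim, hidden_dim) * 0.1   # hidden-to-output weights
>>> def softmax(z):
...     e = np.exp(z - z.max())
...     return e / e.sum()
>>> x = np.random.randn(T, input_dim)   # a toy input sequence of T steps
>>> s = np.zeros(hidden_dim)            # s_0 is set to all zeros
>>> for t in range(T):
...     s = np.tanh(U @ x[t] + W @ s)   # s_t = a(Ux_t + Ws_t-1)
...     h = softmax(V @ s)              # h_t = g(Vs_t)
...     print(f'Step {t}: output {h}')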

As we did for traditional neural networks, we apply the backpropagation algorithm to optimize all the weights, U, V, and W, in RNNs. However, as you may have noticed, the output at a time step is indirectly dependent on all the previous time steps (ht depends on st, while st depends on all the previous states). Hence, we need to compute the loss over all the previous time steps, besides the current time step. Consequently, the gradients of the weights are accumulated across time steps. For example, if we want to compute the gradients at time step t = 4, we need to backpropagate through the previous four time steps (t = 3, t = 2, t = 1, t = 0) and sum up the gradients over these five time steps. This version of the backpropagation algorithm is called Backpropagation Through Time (BPTT).

The recurrent architecture enables RNNs to capture information from the very beginning of the input sequence. This advances the predictive capability of sequence learning. You may be wondering whether vanilla RNNs can handle long sequences. They can in theory, but not in practice due to the vanishing gradient problem. Vanishing gradient means the gradient will become vanishingly small over long time steps, which prevents the weight from updating. I will explain this in detail in the next section, as well as introducing a variant architecture, Long Short-Term Memory, that helps solve this issue.

Overcoming long-term dependencies with Long Short-Term Memory

Let's start with the vanishing gradient issue in vanilla RNNs. Where does it come from? Recall that during backpropagation, the gradient decays along with each time step in the RNN: the gradient that flows back to an early time step is a long product of terms involving W and the derivative of the activation, and this product quickly shrinks toward zero as the number of time steps grows. As a result, early elements in a long input sequence will have little contribution to the computation of the current gradient. This means that vanilla RNNs can only capture the temporal dependencies within a short time window. However, dependencies between time steps that are far away are sometimes critical signals for the prediction. RNN variants, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), are specifically designed to solve problems that require learning long-term dependencies.
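
To get an intuitive feel for the decay, here is a toy numerical illustration (not the actual BPTT computation): if each backward step scales the gradient by a factor slightly below 1, say 0.9, the contribution from distant time steps becomes negligible very quickly:

>>> factor = 0.9   # a made-up per-step scaling factor, just for illustration
>>> for steps in (5, 20, 50, 100):
...     print(steps, factor ** steps)   # roughly 0.59, 0.12, 0.0052, and 2.7e-05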

We will be focusing on LSTM in this book as it is a lot more popular than GRU. LSTM was introduced almost two decades earlier and is more mature than GRU. If you are interested in learning more about GRU and its applications, feel free to check out Chapter 6, Recurrent Neural Networks, of Hands-On Deep Learning Architectures with Python by Yuxi Hayden Liu (Packt Publishing).

In LSTM, we use a gating mechanism to handle long-term dependencies. Its magic comes from a memory unit and three information gates built on top of the recurrent cell. The word "gate" is taken from the logic gate in a circuit (https://en.wikipedia.org/wiki/Logic_gate). It is basically a sigmoid function whose output value ranges from 0 to 1: 0 represents the "off" logic, while 1 represents the "on" logic.

The LSTM version of the recurrent cell is depicted in the following diagram, right after the vanilla version for comparison:

Figure 13.10: Recurrent cell in vanilla RNNs versus LSTM RNNs

Let's look at the LSTM recurrent cell in detail from left to right:

  • ct is the memory unit. It memorizes information from the very beginning of the input sequence.
  • "f" stands for the forget gate. It determines how much information from the previous memory state, ct−1, to forget, or in other words, how much information to pass forward. Let Wf denote the weights between the forget gate and the previous hidden state, st−1, and Uf denote the weights between the forget gate and the current input, xt.
  • "i" represents the input gate. It controls how much information from the current input to put through. Wi and Ui are the weights connecting the input gate to the previous hidden state, st−1, and the current input, xt, respectively.
  • The "tanh" is simply the activation function for the hidden state. It acts as the "a" in the vanilla RNN. Its output is computed based on the current input, xt, along with the associated weights, Uc, the previous hidden state, st−1, and the corresponding weights, Wc.
  • "o" serves as the output gate. It defines how much information is extracted from the internal memory for the output of the entire recurrent cell. As always, Wo and Uo are the associated weights for the previous hidden state and current input, respectively.

We describe the relationship between these components as follows:

  • The output of the forget gate, f, at time step t is computed as ft = sigmoid(Wfst−1 + Ufxt).
  • The output of the input gate, i, at time step t is computed as it = sigmoid(Wist−1 + Uixt).
  • The output of the tanh activation, c', at time step t is computed as c't = tanh(Wcst−1 + Ucxt).
  • The output of the output gate, o, at time step t is computed as ot = sigmoid(Wost−1 + Uoxt).
  • The memory unit, ct, at time step t is updated using ct = ft .* ct−1 + it .* c't (here, the operator .* denotes element-wise multiplication). Again, the output of a sigmoid function has a value from 0 to 1. Hence, the forget gate, f, and input gate, i, control how much of the previous memory, ct−1, and the current memory input, c', to carry forward, respectively.
  • Finally, we update the hidden state, st, at time step t by st = ot .* tanh(ct). Here, the output gate, o, governs how much of the updated memory unit, ct, will be used as the output of the entire cell (see the code sketch right after this list).
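
To make these equations concrete, here is a minimal NumPy sketch of a single LSTM step; the weight shapes and values are placeholders, and real implementations, such as the Keras LSTM layer we will use shortly, also include bias terms and batch dimensions:

>>> import numpy as np
>>> def sigmoid(z):
...     return 1 / (1 + np.exp(-z))
>>> def lstm_step(x_t, s_prev, c_prev, Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo):
...     f = sigmoid(Wf @ s_prev + Uf @ x_t)        # forget gate
...     i = sigmoid(Wi @ s_prev + Ui @ x_t)        # input gate
...     c_hat = np.tanh(Wc @ s_prev + Uc @ x_t)    # candidate memory c'
...     o = sigmoid(Wo @ s_prev + Uo @ x_t)        # output gate
...     c_t = f * c_prev + i * c_hat               # element-wise (.*) update
...     s_t = o * np.tanh(c_t)                     # new hidden state
...     return s_t, c_t
>>> np.random.seed(42)
>>> hidden_dim, input_dim = 3, 4                   # toy sizes
>>> Ws = [np.random.randn(hidden_dim, hidden_dim) * 0.1 for _ in range(4)]
>>> Us = [np.random.randn(hidden_dim, input_dim) * 0.1 for _ in range(4)]
>>> s, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
>>> s, c = lstm_step(np.random.randn(input_dim), s, c,
...                  Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2], Ws[3], Us[3])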

As always, we apply the BPTT algorithm to train all the weights in LSTM RNNs, including four sets each of weights, U and W, associated with three gates and the tanh activation function. By learning these weights, the LSTM network explicitly models long-term dependencies in an efficient way. Hence, LSTM is the go-to or default RNN model in practice. Next, you will learn how to use LSTM RNNs to solve real-world problems. We will start by categorizing movie review sentiment.

Analyzing movie review sentiment with RNNs

So, here comes our first RNN project: movie review sentiment classification. We'll use the IMDb (https://www.imdb.com/) movie review dataset (https://ai.stanford.edu/~amaas/data/sentiment/) as an example. It contains 25,000 highly polar movie reviews for training, and another 25,000 for testing. Each review is labeled as 1 (positive) or 0 (negative). We'll build our RNN-based movie sentiment classifier in the following three sections: Analyzing and preprocessing the data, Building a simple LSTM network, and Stacking multiple LSTM layers.

Analyzing and preprocessing the data

We'll start with data analysis and preprocessing, as follows:

  1. We import all necessary modules from TensorFlow:
    >>> import tensorflow as tf
    >>> from tensorflow.keras.datasets import imdb
    >>> from tensorflow.keras import layers, models, losses, optimizers
    >>> from tensorflow.keras.preprocessing.sequence import pad_sequences
    
  2. Keras has a built-in IMDb dataset, so first, we load the dataset:
    >>> vocab_size = 5000
    >>> (X_train, y_train), (X_test, y_test) = 
                         imdb.load_data(num_words=vocab_size)
    

    Here, we set the vocabulary size and only keep this many most frequent words. In this example, this is the top 5,000 words that occur most frequently in the dataset. If num_words is None, all the words will be kept.

  3. Take a look at the training and testing data we just loaded:
    >>> print('Number of training samples:', len(y_train))
    Number of training samples: 25000
    >>> print('Number of positive samples', sum(y_train))
    Number of positive samples 12500
    >>> print('Number of test samples:', len(y_test))
    Number of test samples: 25000
    

    The training set is perfectly balanced, with the same number of positive and negative samples.

  4. Print a training sample, as follows:
    >>> print(X_train[0])
    [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
    

    As you can see, the raw text has already been tokenized, and each word is represented by an integer index. For convenience, the value of the integer reflects the word's frequency rank in the dataset. For instance, "1" represents the most frequent word ("the", as you can imagine), while "10" represents the 10th most frequent word. Can we find out what the words are? Let's see in the next step.

  5. We use the word dictionary to map the integer back to the word it represents:
    >>> word_index = imdb.get_word_index()
    >>> index_word = {index: word for word, index in word_index.items()}
    

    Take the first review as an example:

    >>> print([index_word.get(i, ' ') for i in X_train[0]])
    ['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'lets', 'loves', 'their', 'becomes', 'reaching', 'had', 'journalist', 'of', 'lot', 'from', 'anyone', 'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it', 'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and', 'help', 'moments', 'or', 'of', 'every', 'chest', 'visual', 'movie', 'except', 'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'current', 'film', 'as', 'you', 'of', 'mine', 'potentially', 'unfortunately', 'of', 'you', 'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was', 'camp', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but', 'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to', 'of', '70s', 'and', 'with', 'heart', 'had', 'shadows', 'they', 'of', 'here', 'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why', 'what', 'have', 'critics', 'they', 'is', 'you', 'that', "isn't", 'one', 'will', 'very', 'to', 'as', 'itself', 'with', 'other', 'and', 'in', 'of', 'seen', 'over', 'and', 'for', 'anyone', 'of', 'and', 'br', "show's", 'to', 'whether', 'from', 'than', 'out', 'themselves', 'history', 'he', 'name', 'half', 'some', 'br', 'of', 'and', 'odd', 'was', 'two', 'most', 'of', 'mean', 'for', '1', 'any', 'an', 'boat', 'she', 'he', 'should', 'is', 'thought', 'and', 'but', 'of', 'script', 'you', 'not', 'while', 'history', 'he', 'heart', 'to', 'real', 'at', 'and', 'but', 'when', 'from', 'one', 'bit', 'then', 'have', 'two', 'of', 'script', 'their', 'with', 'her', 'nobody', 'most', 'that', 'with', "wasn't", 'to', 'with', 'armed', 'acting', 'watch', 'an', 'for', 'with', 'and', 'film', 'want', 'an']
    
  6. Next, we analyze the length of each sample (the number of words in each review, for example). We do so because all the input sequences to an RNN model must be the same length:
    >>> review_lengths = [len(x) for x in X_train]
    

    Plot the distribution of these document lengths, as follows:

    >>> import matplotlib.pyplot as plt
    >>> plt.hist(review_lengths, bins=10)
    >>> plt.show()
    

    Refer to the following diagram for the distribution result:

    Figure 13.11: Review length distribution

  7. As you can see, the majority of the reviews are around 200 words long. Next, we set 200 as the universal sequence length by padding shorter reviews with zeros and truncating longer reviews. We use the pad_sequences function from Keras to accomplish this:
    >>> maxlen = 200
    >>> X_train = pad_sequences(X_train, maxlen=maxlen)
    >>> X_test = pad_sequences(X_test, maxlen=maxlen)
    

    Let's look at the shape of the input sequences after this:

    >>> print('X_train shape after padding:', X_train.shape)
    X_train shape after padding: (25000, 200)
    >>> print('X_test shape after padding:', X_test.shape)
    X_test shape after padding: (25000, 200)
    

Let's move on to building an LSTM network.

Building a simple LSTM network

Now that the training and testing datasets are ready, we can build our first RNN model:

  1. First, we fix the random seed and initiate a Keras Sequential model:
    >>> tf.random.set_seed(42)
    >>> model = models.Sequential()
    
  2. Since our input sequences are word indices that are equivalent to one-hot encoded vectors, we need to embed them in dense vectors using the Embedding layer from Keras:
    >>> embedding_size = 32
    >>> model.add(layers.Embedding(vocab_size, embedding_size))
    

    Here, we embed the input sequences that are made up of vocab_size=5000 unique word tokens into dense vectors of size 32.

    Feel free to reread Best practice 14 – Extracting features from text data using word embedding with neural networks from Chapter 11, Machine Learning Best Practices.

  3. Now here comes the recurrent layer, the LSTM layer specifically:
    >>> model.add(layers.LSTM(50))
    

    Here, we only use one recurrent layer with 50 nodes.

  4. After that, we add the output layer, along with a sigmoid activation function, since we are working on a binary classification problem:
    >>> model.add(layers.Dense(1, activation='sigmoid'))
    
  5. Display the model summary to double-check the layers:
    >>> print(model.summary())
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding (Embedding)        (None, None, 32)          160000    
    _________________________________________________________________
    lstm (LSTM)                  (None, 50)                16600     
    _________________________________________________________________
    dense (Dense)                (None, 1)                 51        
    =================================================================
    Total params: 176,651
    Trainable params: 176,651
    Non-trainable params: 0
    _________________________________________________________________
    
  6. Next, we compile the model with the Adam optimizer and use binary cross-entropy as the optimization target:
    >>> model.compile(loss='binary_crossentropy',
    ...               optimizer='adam',
    ...               metrics=['accuracy'])
    
  7. Finally, we train the model with batches of size 64 for three epochs:
    >>> batch_size = 64
    >>> n_epoch = 3
    >>> model.fit(X_train, y_train,
    ...           batch_size=batch_size,
    ...           epochs=n_epoch,
    ...           validation_data=(X_test, y_test))
    Train on 25000 samples, validate on 25000 samples
    Epoch 1/3
    391/391 [==============================] - 70s 178ms/step - loss: 0.4284 - accuracy: 0.7927 - val_loss: 0.3396 - val_accuracy: 0.8559
    Epoch 2/3
    391/391 [==============================] - 69s 176ms/step - loss: 0.2658 - accuracy: 0.8934 - val_loss: 0.3034 - val_accuracy: 0.8730
    Epoch 3/3
    391/391 [==============================] - 69s 177ms/step - loss: 0.2283 - accuracy: 0.9118 - val_loss: 0.3118 - val_accuracy: 0.8705 
    
  8. Using the trained model, we evaluate the classification accuracy on the testing set:
    >>> acc = model.evaluate(X_test, y_test, verbose = 0)[1]
    >>> print('Test accuracy:', acc)
    Test accuracy: 0.8705199956893921
    

    We obtained a test accuracy of 87.05%.
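
    As an optional sanity check, we can inspect a few individual predictions; this is a quick sketch (not part of the original pipeline) that scores the first three padded test reviews and prints the predicted probability of positive sentiment next to the true label:

    >>> probs = model.predict(X_test[:3])
    >>> for prob, label in zip(probs.flatten(), y_test[:3]):
    ...     print(f'Predicted positive probability: {prob:.4f}, true label: {label}')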

Stacking multiple LSTM layers

Now, let's try to stack two recurrent layers. The following diagram shows how two recurrent layers can be stacked:

Figure 13.12: Unfolding two stacked recurrent layers

Let's see whether we can beat the previous accuracy by following these steps to build a multi-layer RNN model:

  1. Initiate a new model and add an embedding layer, two LSTM layers, and an output layer:
    >>> model = models.Sequential()
    >>> model.add(layers.Embedding(vocab_size, embedding_size))
    >>> model.add(layers.LSTM(50, return_sequences=True, dropout=0.2))
    >>> model.add(layers.LSTM(50, dropout=0.2))
    >>> model.add(layers.Dense(1, activation='sigmoid'))
    

    Here, the first LSTM layer comes with return_sequences=True as we need to feed its entire output sequence to the second LSTM layer. We also add 20% dropout to both LSTM layers to reduce overfitting since we will have more parameters to train:

    >>> print(model.summary())
    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding_1 (Embedding)      (None, None, 32)          160000    
    _________________________________________________________________
    lstm_1 (LSTM)                (None, None, 50)          16600     
    _________________________________________________________________
    lstm_2 (LSTM)                (None, 50)                20200     
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 51        
    =================================================================
    Total params: 196,851
    Trainable params: 196,851
    Non-trainable params: 0
    _________________________________________________________________
    None
    
  2. Similarly, we compile the model with the Adam optimizer at a 0.003 learning rate:
    >>> optimizer = optimizers.Adam(lr=0.003)
    >>> model.compile(loss='binary_crossentropy',
    ...               optimizer=optimizer,
    ...               metrics=['accuracy'])
    
  3. Then, we train the stacked model for 7 epochs:
    >>> n_epoch = 7
    >>> model.fit(X_train, y_train,
    ...           batch_size=batch_size,
    ...           epochs=n_epoch,
    ...           validation_data=(X_test, y_test))
    Train on 25000 samples, validate on 25000 samples
    Epoch 1/7
    391/391 [==============================] - 139s 356ms/step - loss: 0.4755 - accuracy: 0.7692 - val_loss: 0.3438 - val_accuracy: 0.8511
    Epoch 2/7
    391/391 [==============================] - 140s 357ms/step - loss: 0.3272 - accuracy: 0.8631 - val_loss: 0.3407 - val_accuracy: 0.8573
    Epoch 3/7
    391/391 [==============================] - 137s 350ms/step - loss: 0.3042 - accuracy: 0.8782 - val_loss: 0.3436 - val_accuracy: 0.8580
    Epoch 4/7
    391/391 [==============================] - 136s 349ms/step - loss: 0.2468 - accuracy: 0.9028 - val_loss: 0.6771 - val_accuracy: 0.7860
    Epoch 5/7
    391/391 [==============================] - 137s 350ms/step - loss: 0.2201 - accuracy: 0.9117 - val_loss: 0.3273 - val_accuracy: 0.8684
    Epoch 6/7
    391/391 [==============================] - 137s 349ms/step - loss: 0.1867 - accuracy: 0.9278 - val_loss: 0.3352 - val_accuracy: 0.8736
    Epoch 7/7
    391/391 [==============================] - 138s 354ms/step - loss: 0.1586 - accuracy: 0.9398 - val_loss: 0.3335 - val_accuracy: 0.8756
    
  4. Finally, we verify the test accuracy:
    >>> acc = model.evaluate(X_test, y_test, verbose=0)[1]
    >>> print('Test accuracy with stacked LSTM:', acc)
    Test accuracy with stacked LSTM: 0.8755999803543091
    

    We obtained a better test accuracy of 87.56%.

With that, we've just finished the review sentiment classification project using RNNs. The RNNs were in the many-to-one structure. In the next project, we will develop an RNN under the many-to-many structure to write a "novel."

Writing your own War and Peace with RNNs

In this project, we'll work on an interesting language modeling problem–text generation.

An RNN-based text generator can write anything, depending on what text we feed it. The training text can be from a novel such as A Game of Thrones, a poem by Shakespeare, or the movie scripts for The Matrix. The artificial text that's generated should read similar (but not identical) to the original if the model is well trained. In this section, we are going to use RNNs to write our own War and Peace, a novel by the Russian author Leo Tolstoy. Feel free to train your own RNNs on any of your favorite books.

We will start with data acquisition and analysis before constructing the training set. After that, we will build and train an RNN model for text generation.

Acquiring and analyzing the training data

I recommend downloading text data for training from books that are not currently protected by copyright. Project Gutenberg (www.gutenberg.org) is a great place for this. It provides over 60,000 free eBooks whose copyright has expired.

The original work, War and Peace, can be downloaded from http://www.gutenberg.org/ebooks/2600, but the plain text UTF-8 file (http://www.gutenberg.org/files/2600/2600-0.txt) requires some cleanup, such as removing the extra beginning section "The Project Gutenberg EBook," the table of contents, and the extra appendix "End of the Project Gutenberg EBook of War and Peace." So, instead of doing this ourselves, we will download a cleaned text file directly from https://cs.stanford.edu/people/karpathy/char-rnn/warpeace_input.txt. Let's get started:

  1. First, we read the file and convert the text into lowercase:
    >>> training_file = 'warpeace_input.txt'
    >>> raw_text = open(training_file, 'r').read()
    >>> raw_text = raw_text.lower()
    
  2. Then, we take a quick look at the training text data by printing out the first 200 characters:
    >>> print(raw_text[:200])
    "well, prince, so genoa and lucca are now just family estates of the 
    buonapartes. but i warn you, if you don't tell me that this means war,
    if you still try to defend the infamies and horrors perpetr
    
  3. Next, we count the number of unique words:
    >>> all_words = raw_text.split()
    >>> unique_words = list(set(all_words))
    >>> print(f'Number of unique words: {len(unique_words)}')
    Number of unique words: 39830
    

    And then, we count the total number of characters:

    >>> n_chars = len(raw_text)
    >>> print(f'Total characters: {n_chars}')
    Total characters: 3196213
    
  4. From these 3 million characters, we obtain the unique characters, as follows:
    >>> chars = sorted(list(set(raw_text)))
    >>> n_vocab = len(chars)
    >>> print(f'Total vocabulary (unique characters): {n_vocab}')
    Total vocabulary (unique characters): 57
    >>> print(chars)
    ['\n', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'ä', 'é', 'ê', '\ufeff']
    

The raw training text is made up of 57 unique characters and close to 40,000 unique words. Generating a word, which requires computing 40,000 probabilities at each step, is far more difficult than generating a character, which requires computing only 57 probabilities at each step. Hence, we treat a character as a token, and the vocabulary here is composed of the 57 characters.

So, how can we feed the characters to the RNN model and generate output characters? Let's see in the next section.

Constructing the training set for the RNN text generator

Recall that in a synced "many-to-many" RNN, the network takes in a sequence and simultaneously produces a sequence; the model captures the relationships among the elements in a sequence and reproduces a new sequence based on the learned patterns. As for our text generator, we can feed in fixed-length sequences of characters and let it generate sequences of the same length, where each output sequence is one character shifted from its input sequence. The following example will help you understand this better:

Say that we have a raw text sample, "learning," and we want the sequence length to be 5. Here, we can have an input sequence, "learn," and an output sequence, "earni." We can put them into the network as follows:

Figure 13.13: Feeding a training set ("learn," "earni") to the RNN

We've just constructed a training sample ("learn", "earni"). Similarly, to construct training samples from the entire original text, first, we split the original text into fixed-length sequences, X; then, we shift the original text by one character and split it into sequences of the same length, Y. A sequence from X is the input of a training sample, while the corresponding sequence from Y is the output of the sample. Let's say we have a raw text sample, "machine learning by example," and we set the sequence length to 5. We will construct the following training samples:

Figure 13.14: Training samples constructed from "machine learning by example"

Here, □ denotes space. Note that the remaining subsequence, "le", is not long enough, so we simply ditch it.
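
The following toy sketch reproduces the split shown in Figure 13.14 for "machine learning by example" with a sequence length of 5; it only illustrates the one-character shift between each input and output sequence, and the real construction over the whole book follows shortly:

>>> text = 'machine learning by example'
>>> seq_length = 5
>>> for i in range(len(text) // seq_length):   # the leftover "le" is dropped
...     x_seq = text[i * seq_length : (i + 1) * seq_length]
...     y_seq = text[i * seq_length + 1 : (i + 1) * seq_length + 1]
...     print(repr(x_seq), '->', repr(y_seq))
'machi' -> 'achin'
'ne le' -> 'e lea'
'arnin' -> 'rning'
'g by ' -> ' by e'
'examp' -> 'xampl'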

We also need to one-hot encode the input and output characters since neural network models only take in numerical data. We simply map the 57 unique characters to indices from 0 to 56, as follows:

>>> index_to_char = dict((i, c) for i, c in enumerate(chars))
>>> char_to_index = dict((c, i) for i, c in enumerate(chars))
>>> print(char_to_index)
{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, '*': 7, ',': 8, '-': 9, '.': 10, '/': 11, '0': 12, '1': 13, '2': 14, '3': 15, '4': 16, '5': 17, '6': 18, '7': 19, '8': 20, '9': 21, ':': 22, ';': 23, '=': 24, '?': 25, 'a': 26, 'b': 27, 'c': 28, 'd': 29, 'e': 30, 'f': 31, 'g': 32, 'h': 33, 'i': 34, 'j': 35, 'k': 36, 'l': 37, 'm': 38, 'n': 39, 'o': 40, 'p': 41, 'q': 42, 'r': 43, 's': 44, 't': 45, 'u': 46, 'v': 47, 'w': 48, 'x': 49, 'y': 50, 'z': 51, 'à': 52, 'ä': 53, 'é': 54, 'ê': 55, '\ufeff': 56}

For instance, the character "c" becomes a vector of length 57 with "1" in index 28 and "0"s in all other indices; the character "h" becomes a vector of length 57 with "1" in index 33 and "0"s in all other indices.

Now that the character lookup dictionary is ready, we can construct the entire training set, as follows:

>>> import numpy as np
>>> seq_length = 160
>>> n_seq = int(n_chars / seq_length)

Here, we set the sequence length to 160 and obtain n_seq training samples. Next, we initialize the training inputs and outputs, which are both of the shape (number of samples, sequence length, feature dimension):

>>> X = np.zeros((n_seq, seq_length, n_vocab))
>>> Y = np.zeros((n_seq, seq_length, n_vocab))

RNN models in Keras require the input and output sequences to be of the shape (number of samples, sequence length, feature dimension).

Now, for each of the n_seq samples, we assign "1" to the indices of the input and output vectors where the corresponding characters exist:

>>> for i in range(n_seq):
...     x_sequence = raw_text[i * seq_length : 
                              (i + 1) * seq_length]
...     x_sequence_ohe = np.zeros((seq_length, n_vocab))
...     for j in range(seq_length):
...             char = x_sequence[j]
...             index = char_to_index[char]
...             x_sequence_ohe[j][index] = 1.
...     X[i] = x_sequence_ohe
...     y_sequence = raw_text[i * seq_length + 1 : (i + 1) * 
                                                 seq_length + 1]
...     y_sequence_ohe = np.zeros((seq_length, n_vocab))
...     for j in range(seq_length):
...             char = y_sequence[j]
...             index = char_to_index[char]
...             y_sequence_ohe[j][index] = 1.
...     Y[i] = y_sequence_ohe

Next, take a look at the shapes of the constructed input and output samples:

>>> X.shape
(19976, 160, 57)
>>> Y.shape
(19976, 160, 57)

Again, each sample (input or output sequence) is composed of 160 elements. Each element is a 57-dimension one-hot encoded vector.

We finally got the training set ready and it is time to build and fit the RNN model. Let's do this in the next two sections.

Building an RNN text generator

In this section, we will build an RNN with two stacked recurrent layers. This has more predictive power than an RNN with a single recurrent layer for complicated problems such as text generation. Let's get started:

  1. First, we import all the necessary modules and fix a random seed:
    >>> import tensorflow as tf
    >>> from tensorflow.keras import layers, models, losses, optimizers
    >>> tf.random.set_seed(42)
    
  2. Each recurrent layer contains 700 units, with a 0.4 dropout ratio and a tanh activation function:
    >>> hidden_units = 700
    >>> dropout = 0.4
    
  3. We specify other hyperparameters, including the batch size, 100, and the number of epochs, 300:
    >>> batch_size = 100
    >>> n_epoch= 300
    
  4. Now, we create the RNN model, as follows:
    >>> model = models.Sequential()
    >>> model.add(layers.LSTM(hidden_units, input_shape=(None, n_vocab), return_sequences=True, dropout=dropout))
    >>> model.add(layers.LSTM(hidden_units, return_sequences=True, dropout=dropout))
    >>> model.add(layers.TimeDistributed(layers.Dense(n_vocab, activation='softmax')))
    

    There are a few things worth looking into:

    • return_sequences=True for the first recurrent layer: The output of the first recurrent layer is a sequence so that we can stack the second recurrent layer on top.
    • return_sequences=True for the second recurrent layer: The output of the second recurrent layer is a sequence, which enables the many-to-many structure.
    • Dense(n_vocab, activation='softmax'): Each element of the output sequence is a one-hot encoded vector, so softmax activation is used to compute the probabilities for individual characters.
    • TimeDistributed: Since the output of the recurrent layers is a sequence and the Dense layer does not take in a sequential input, TimeDistributed is used as an adapter so that the Dense layer can be applied to every element of the input sequence.
  5. Next, we compile the network. As for the optimizer, we choose RMSprop with a learning rate of 0.001:
    >>> optimizer = optimizers.RMSprop(lr=0.001)
    >>> model.compile(loss="categorical_crossentropy", 
                      optimizer=optimizer)
    

    Here, the loss function is multiclass cross-entropy.

  6. Let's summarize the model we just built:
    >>> print(model.summary())  
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    lstm (LSTM)                  (None, None, 700)         2122400   
    _________________________________________________________________
    lstm_1 (LSTM)                (None, None, 700)         3922800   
    _________________________________________________________________
    time_distributed (TimeDistri (None, None, 57)          39957     
    =================================================================
    Total params: 6,085,157
    Trainable params: 6,085,157
    Non-trainable params: 0
    _________________________________________________________________
    

With that, we've just finished building and are ready to train the model. We'll do this in the next section.

Training the RNN text generator

As shown in the model summary, we have more than 6 million parameters to train. Hence, it is recommended to train the model on a GPU. If you don't have a GPU in-house, you can use the free GPU provided by Google Colab. You can set it up by following the tutorial at https://ml-book.now.sh/free-gpu-for-deep-learning/.

Also, for a deep learning model that requires long training, it is good practice to set up some callbacks in order to keep track of the internal states and performance of the model during training. In our project, we employ the following callbacks:

  • Model checkpoint: This saves the model after each epoch. If anything goes wrong unexpectedly during training, you don't have to retrain the model. You can simply load the saved model and resume training from there.
  • Early stopping: We covered this in Chapter 8, Predicting Stock Prices with Artificial Neural Networks.
  • Generating text with the latest model on a regular basis: By doing this, we can see how reasonable the generated text is on the fly.

We employ these three callbacks to train our RNN model as follows:

  1. First, we import the necessary modules:
    >>> from tensorflow.keras.callbacks import Callback, ModelCheckpoint, EarlyStopping
    
  2. Then, we define the model checkpoint callback:
    >>> file_path =  
            "weights/weights_epoch_{epoch:03d}_loss_{loss:.4f}.hdf5"
    >>> checkpoint = ModelCheckpoint(file_path, monitor='loss', 
                       verbose=1, save_best_only=True, mode='min')
    

    The model checkpoints will be saved with filenames made up of the epoch number and training loss.

  3. After that, we create an early stopping callback to halt the training if the loss doesn't decrease for 50 successive epochs:
    >>> early_stop = EarlyStopping(monitor='loss', min_delta=0, 
                                   patience=50, verbose=1, mode='min')
    
  4. Next, we develop a helper function that generates text of any length, given a model:
    >>> def generate_text(model, gen_length, n_vocab, index_to_char):
    ...     """
    ...     Generating text using the RNN model
    ...     @param model: current RNN model
    ...     @param gen_length: number of characters we want to generate
    ...     @param n_vocab: number of unique characters
    ...     @param index_to_char: index to character mapping
    ...     @return: string of text generated
    ...     """
    ...     # Start with a randomly picked character
    ...     index = np.random.randint(n_vocab)
    ...     y_char = [index_to_char[index]]
    ...     X = np.zeros((1, gen_length, n_vocab))
    ...     for i in range(gen_length):
    ...         X[0, i, index] = 1.
    ...         indices = np.argmax(model.predict(
                       X[:, max(0, i - seq_length -1):i + 1, :])[0], 1)
    ...         index = indices[-1]
    ...         y_char.append(index_to_char[index])
    ...     return ''.join(y_char)
    

    It starts with a randomly picked character. Then, the model predicts each of the remaining characters one at a time, conditioned on the characters that have been generated so far.

  5. Now, we define the callback class that generates text with the generate_text util function for every N epochs:
    >>> class ResultChecker(Callback):
    ...     def __init__(self, model, N, gen_length):
    ...         self.model = model
    ...         self.N = N
    ...         self.gen_length = gen_length
    ...
    ...     def on_epoch_end(self, epoch, logs={}):
    ...         if epoch % self.N == 0:
    ...             result = generate_text(self.model, 
                             self.gen_length, n_vocab, index_to_char)
    ...             print('\nMy War and Peace:\n' + result)
    

    Next, we initiate a text generation checker callback:

    >>> result_checker = ResultChecker(model, 10, 500) 
    

    The model will generate text of 500 characters for every 10 epochs.

  6. Now that all the callback components are ready, we can start training the model:
    >>> model.fit(X, Y, batch_size=batch_size, 
          verbose=1, epochs=n_epoch,callbacks=[
          result_checker, checkpoint, early_stop])
    

    I will only demonstrate the results for epochs 1, 51, 101, and 291 here:

    Epoch 1:

    Epoch 1/300
    200/200 [==============================] - 117s 584ms/step - loss: 2.8908
    My War and Peace:
    8 the tout to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to t
    Epoch 00001: loss improved from inf to 2.89075, saving model to weights/weights_epoch_001_loss_2.8908.hdf5
    

    Epoch 51:

    Epoch 51/300
    200/200 [==============================] - ETA: 0s - loss: 1.7430
    My War and Peace:
    re and the same time the same time the same time he had not yet seen the first time that he was always said to him that the countess was sitting in the same time and the same time that he was so saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was saying that he was sa
    Epoch 00051: loss improved from 1.74371 to 1.74298, saving model to weights/weights_epoch_051_loss_1.7430.hdf5
    200/200 [==============================] - 64s 321ms/step - loss: 1.7430
    

    Epoch 101:

    Epoch 101/300
    200/200 [==============================] - ETA: 0s - loss: 1.6892
    My War and Peace:
    's and the same time and the same sonse of his life and her face was already in her hand.
    "what is it?" asked natasha. "i have not the post and the same to
    her and will not be able to say something to her and went to
    the door.
    "what is it?" asked natasha. "i have not the post and the same to her
    and that i shall not be able to say something to her and
    went on to the door.
    "what a strange in the morning, i am so all to say something to her,"
    said prince andrew, "i have not the post and the
    same
    Epoch 00101: loss did not improve from 1.68711
    200/200 [==============================] - 64s 321ms/step - loss: 1.6892
     
    

    Epoch 291:

    Epoch 291/300
    200/200 [==============================] - ETA: 0s - loss: 1.6136
    My War and Peace:
    à to the countess, who was sitting in the same way the sound of a sound of company servants were standing in the middle of the road.
    "what are you doing?" said the officer, turning to the princess with
    a smile.
    "i don't know what to say and want to see you."
    "yes, yes," said prince andrew, "i have not been the first to see you
    and you will be a little better than you are and we will be
    married. what a sin i want to see you."
    "yes, yes," said prince andrew, "i have not been the first to see yo
    Epoch 00291: loss did not improve from 1.61188
    200/200 [==============================] - 65s 323ms/step - loss: 1.6136
    

Each epoch takes around 60 seconds on a Tesla K80 GPU. After a couple of hours of training, the RNN-based text generator can write a realistic and interesting version of War and Peace. With that, we've successfully used a many-to-many type of RNN to generate text.

An RNN with a many-to-many structure is a type of sequence-to-sequence (seq2seq) model that takes in a sequence and outputs another sequence. A typical example is machine translation, where a sequence of words from one language is transformed into a sequence in another language. The state-of-the-art seq2seq model is the Transformer model, and it was developed by Google Brain. We will briefly discuss it in the next section.

Advancing language understanding with the Transformer model

The Transformer model was first proposed in Attention Is All You Need (https://arxiv.org/abs/1706.03762). It can effectively handle long-term dependencies, which are still challenging in LSTM. In this section, we will go through the Transformer's architecture and building blocks, as well as its most crucial part: the self-attention layer.

Exploring the Transformer's architecture

We'll start by looking at the high-level architecture of the Transformer model (image taken from Attention Is All You Need):

Figure 13.15: Transformer architecture

As you can see, the Transformer consists of two parts: the encoder (the big rectangle on the left-hand side) and the decoder (the big rectangle on the right-hand side). The encoder encodes the input sequence. It has a multi-head attention layer (we will talk about this next) and a regular feedforward layer. On the other hand, the decoder generates the output sequence. It has a masked multi-head attention layer, along with a multi-head attention layer and a regular feedforward layer.

At step t, the Transformer model takes in the input sequence, x1, x2, …, xTx, and the output steps generated so far, y1, y2, …, yt−1. It then predicts yt. In this respect, it is similar to the many-to-many (unsynced) RNN model.

The multi-head attention layer is probably the only thing that looks strange to you, so we'll take a look at it in the next section.

Understanding self-attention

Let's discuss how the self-attention layer plays a key role in the Transformer in the following example:

"I read Python Machine Learning by Example and it is indeed a great book." Apparently, it refers to Python Machine Learning by Example. When the Transformer model processes this sentence, self-attention will associate it with Python Machine Learning by Example. Given a word in an input sequence, self-attention allows the model to look at the other words in the sequence at different attention levels, which boosts language understanding and learning in seq2seq tasks.

Now, let's see how we calculate the attention score.

As shown by the architecture diagram, there are three input vectors to the attention layer:

  • The query vector, Q, which represents the query word (that is, the current word) in the sequence
  • The key vector, K, which represents individual words in the sequence
  • The value vector, V, which also represents individual words in the sequence

The weight matrices used to produce these three vectors are learned during training.

The output of the attention layer is calculated as follows:

Attention(Q, K, V) = softmax(QKᵀ / √dk)V

Here, dk is the dimension of the key vector. Take the sequence python machine learning by example as an example; we take the following steps to calculate the self-attention for the first word, python:

  1. We calculate the dot products between the query vector of the word python and the key vector of each word in the sequence: q1 · k1, q1 · k2, q1 · k3, q1 · k4, and q1 · k5. Here, q1 is the query vector for the first word, and k1 to k5 are the key vectors for the five words, respectively.
  2. We normalize the resulting dot products by dividing each of them by √dk and then applying a softmax over the five scaled scores, which gives the attention weights a11, a12, a13, a14, and a15.
  3. Then, we multiply the attention weights by the value vectors, v1 to v5, and sum up the results: z1 = a11v1 + a12v2 + a13v3 + a14v4 + a15v5.

z1 is the self-attention output for the first word, python, in the sequence. We repeat this process for each remaining word in the sequence to obtain its attention output. In practice, the Transformer computes several such attention operations ("heads") in parallel, each with its own learned projections of Q, K, and V; this is why the layer is called multi-head attention, as each head can focus on a different aspect of the relationships between words.

The outputs of all the attention heads are then concatenated and fed into the downstream regular feedforward layer.
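
To make steps 1 to 3 concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a toy five-word sequence; the dimensions and the random Q, K, and V values are placeholders, whereas a real Transformer derives them by multiplying the word embeddings with learned weight matrices:

>>> import numpy as np
>>> np.random.seed(42)
>>> seq_len, d_k = 5, 4                      # five words, toy key dimension
>>> Q = np.random.randn(seq_len, d_k)        # query vectors q1..q5
>>> K = np.random.randn(seq_len, d_k)        # key vectors k1..k5
>>> V = np.random.randn(seq_len, d_k)        # value vectors v1..v5
>>> scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products (steps 1 and 2)
>>> weights = np.exp(scores)
>>> weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
>>> Z = weights @ V                          # weighted sum of value vectors (step 3)
>>> print(Z[0])                              # z1, the output for the first word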

In this section, we have covered the main concepts of the Transformer model. It has become the model of choice for many complicated problems in NLP, such as speech to text, text summarization, and question answering. With added attention mechanisms, the Transformer model can effectively handle long-term dependencies in sequential learning. Moreover, it allows parallelization during training since self-attention can be computed independently for individual steps.
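
If you want to experiment with attention in Keras, recent versions of TensorFlow (2.4 and later) ship a built-in MultiHeadAttention layer; the following sketch applies it to a batch of dummy embeddings with arbitrarily chosen sizes, and passing the same tensor as both query and value makes it self-attention:

>>> import tensorflow as tf
>>> from tensorflow.keras import layers
>>> dummy_embeddings = tf.random.normal((2, 10, 32))   # 2 sequences, 10 steps, size-32 embeddings
>>> mha = layers.MultiHeadAttention(num_heads=4, key_dim=8)
>>> outputs = mha(dummy_embeddings, dummy_embeddings)  # query and value are the same tensor
>>> print(outputs.shape)
(2, 10, 32)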

If you are interested in reading more, there are many recent developments built on the Transformer architecture that are worth exploring.

Summary

In this chapter, we worked on two NLP projects: sentiment analysis and text generation using RNNs. We started with a detailed explanation of the recurrent mechanism and different RNN structures for different forms of input and output sequences. You also learned how LSTM improves vanilla RNNs. Finally, as a bonus section, we covered the Transformer, a recent state-of-the-art sequential learning model.

In the next chapter, we will focus on the third type of machine learning problem: reinforcement learning. You will learn how the reinforcement learning model learns by interacting with the environment to reach the learning goal.

Exercises

  1. Use a bi-directional recurrent layer (it is easy enough to learn about it by yourself) and apply it to the sentiment analysis project. Can you beat what we achieved? Read https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional if you want to see an example.
  2. Feel free to fine-tune the hyperparameters, as we did in Chapter 8, Predicting Stock Prices with Artificial Neural Networks, and see whether you can improve the classification performance further.