Seq2seq models

In 2016, Google announced that it had replaced the entire Google Translate algorithm with a single neural network. The special thing about the Google Neural Machine Translation system is that it translates multiple languages "end-to-end" using only a single model. It works by encoding the semantics of a sentence and then decoding those semantics into the desired output language.

The fact that such a system is possible at all baffled many linguists and other researchers, as it shows that machine learning can create systems that accurately capture high-level meanings and semantics without being given any explicit rules.

These semantic meanings are represented as an encoding vector, and while we don't yet know quite how to interpret these vectors, they have plenty of useful applications. Translating from one language to another is one popular application, but we could use a similar approach to "translate" a report into a summary. Text summarization has made great strides, but the downside is that it requires a lot of computing power to deliver meaningful results, so we will focus on language translation.

Seq2seq architecture overview

If all phrases had the exact same length, we could simply use an LSTM (or multiple LSTMs). Remember that an LSTM can also return a full sequence of the same length as the input sequence. However, in many cases, sequences will not have the same length.

To deal with different lengths of phrases, we'll need to create an encoder that aims to capture the sentence's semantic meaning. We then create a decoder that has two inputs: the encoded semantics and the sequence that was already produced. The decoder then predicts the next item in the sequence. For our character-level translator, it looks like this:

Seq2seq architecture overview

Note how the output of the decoder is fed back into the decoder as its next input. This process only stops once the decoder produces a <STOP> tag, which indicates that the sequence is over.

Note

The data and code for this section can be found on Kaggle at https://www.kaggle.com/jannesklaas/a-simple-seq2seq-translat.

The data

We use a dataset of English phrases and their French translations. This dataset was obtained from the Tatoeba project, a translation database, and you can find the file attached to the code on Kaggle. We implement this model on a character level, which means that unlike in previous models, we won't tokenize words, but characters. This makes the task harder for our network because it now also has to learn how to spell words! On the other hand, there are far fewer characters than words, so we can simply one-hot encode characters instead of having to work with embeddings. This makes our model a bit simpler.

To get started, we have to set a few parameters:

batch_size = 64                #1
epochs = 100                   #2
latent_dim = 256               #3 
num_samples = 10000            #4
data_path = 'fra-eng/fra.txt'  #5

But what are the parameters that we've set up?

  1. The batch size for training.
  2. The number of epochs to train for.
  3. The dimensionality of the encoding vectors; that is, how many numbers we use to encode the meaning of a sentence.
  4. The number of samples to train on. The whole dataset has about 140,000 samples; however, we will train on fewer for memory and time reasons.
  5. The path to the data .txt file on disk.

Input (English) and target (French) are tab-delimited in the data file. Each row represents a new phrase, and the translation is separated from the input by a tab (escape character: \t). So, we loop over the lines and read out inputs and targets by splitting the lines at the tab symbol.
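To make the format concrete, here is a minimal sketch with a hypothetical line (the phrase itself is made up, but the structure matches the data file):

example_line = 'Hello.\tBonjour.'                    # hypothetical English TAB French pair
example_input, example_target = example_line.split('\t')
print(example_input)    # Hello.
print(example_target)   # Bonjour.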

To build up our tokenizer, we also need to know which characters are present in our dataset. So, for all of the characters, we need to check whether they are already in our set of seen characters, and if not, add them to it.

To do this, we must first set up the holding variables for texts and characters:

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

Then we loop over as many lines as we want samples and extract the texts and characters:

lines = open(data_path).read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    
    input_text, target_text = line.split('\t')          #1
    
    target_text = '\t' + target_text + '\n'              #2
    input_texts.append(input_text)
    target_texts.append(target_text)

    for char in input_text:                             #3
        if char not in input_characters:
            input_characters.add(char)
            
    for char in target_text:                            #4
        if char not in target_characters:
            target_characters.add(char)

Let's break this code down so that we can understand it in more detail:

  1. Input and target are split by tabs, English TAB French, so we split the lines by tabs to obtain input and target texts.
  2. We use the tab character ('\t') as the "start sequence" character for the targets, and the newline character ('\n') as the "end sequence" character. This way, we know when to stop decoding.
  3. We loop over the characters in the input text, adding all characters that we have not seen yet to our set of input characters.
  4. We loop over the characters in the output text, adding all characters that we have not seen yet to our set of output characters.

Encoding characters

We now need to create lists of alphabetically sorted input and output characters, which we can do by running:

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))

We're also going to count how many input and output characters we have. This is important since we need to know how many dimensions our one-hot encodings should have. We can find this by writing the following:

num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)

Instead of using the Keras tokenizer, we will build our own dictionary mapping characters to token numbers. We can do this by running the following:

input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

We can see how this works by printing the token numbers for all characters in a short sentence:

for c in 'the cat sits on the mat':
    print(input_token_index[c], end = ' ')
63 51 48 0 46 44 63 0 62 52 63 62 0 58 57 0 63 51 48 0 56 44 63

Next, we build up our model training data. Remember that our model has two inputs but only one output. While our model can handle sequences of any length, it is handy to prepare the data in NumPy, and to do that we need to know how long our longest sequences are:

max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
Max sequence length for inputs: 16
Max sequence length for outputs: 59

Now we prepare input and output data for our model. encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences:

encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens),dtype='float32')

decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) containing a one-hot vectorization of the French sentences:

decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens),dtype='float32')

decoder_target_data is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].

decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens),dtype='float32')
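These arrays still need to be filled with the one-hot encodings. A minimal sketch of the filling loop, following the standard Keras character-level seq2seq example, could look like this:

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        # One-hot encode each input character at its timestep
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_input_data includes the start character at t = 0
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data is ahead by one timestep and skips the start character
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.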

You can see that the input and output of the decoder are the same except that the output is one timestep ahead. This makes sense when you consider that we feed an unfinished sequence into the decoder and want it to predict the next character. We will use the functional API to create a model with two inputs.

You can see that the decoder also has two inputs: the decoder inputs and the encoded semantics. The encoded semantics, however, are not directly the outputs of the encoder LSTM but its states. In an LSTM, states are the hidden memory of the cells. What happens is that the first "memory" of our decoder is the encoded semantics. To give the decoder this first memory, we can initialize its states with the states of the encoder LSTM.

To return states, we have to set the return_state argument, configuring an RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. Once again, we are using CuDNNLSTM. If you do not have a GPU, replace it with LSTM, but note that training this model without a GPU can take a very long time to complete:

encoder_inputs = Input(shape=(None, num_encoder_tokens), name = 'encoder_inputs')               #1
encoder = CuDNNLSTM(latent_dim, return_state=True, name = 'encoder')                         #2
encoder_outputs, state_h, state_c = encoder(encoder_inputs)   #3

encoder_states = [state_h, state_c]                           #4

Let's look at the four key elements of the code:

  1. We create an input layer for our encoder.
  2. We create the LSTM encoder.
  3. We link the LSTM encoder to the input layer and get back the outputs and states.
  4. We discard encoder_outputs and only keep the states.

Now we define the decoder. The decoder uses the states of the encoder as initial states for its decoding LSTM.

You can think of it like this: imagine you were a translator translating English to French. When tasked with translating, you would first listen to the English speaker and form ideas about what the speaker wants to say in your head. You would then use these ideas to form a French sentence expressing the same idea.

It is important to understand that we are not just passing a variable, but a piece of the computational graph. This means that we can later backpropagate from the decoder to the encoder. In the case of our previous analogy, you might think that your French translation suffered from a poor understanding of the English sentence, so you might start changing your English comprehension based on the outcomes of your French translation, for example:

decoder_inputs = Input(shape=(None, num_decoder_tokens), name = 'decoder_inputs')                    #1
decoder_lstm = CuDNNLSTM(latent_dim, return_sequences=True, return_state=True, name = 'decoder_lstm')                     #2

decoder_outputs, _, _ = decoder_lstm(decoder_inputs,initial_state=encoder_states) #3

decoder_dense = Dense(num_decoder_tokens, activation='softmax', name = 'decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)                   #4

The preceding code is made up of four key elements:

  1. Set up the decoder inputs.
  2. We set up our decoder to return full output sequences and to return its internal states as well. We don't use the returned states in the training model, but we will use them for inference.
  3. Connect the decoder to the decoder inputs and specify the internal state. As mentioned previously, we don't use the internal states of the decoder for training, so we discard them here.
  4. Finally, we need to decide which character we want to use as the next character. This is a classification task, so we will use a simple Dense layer with a softmax activation function.

We now have the pieces we need to define our model with two inputs and one output:

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

If you have the graphviz library installed, you can visualize the model very nicely using the following code lines. Unfortunately, however, this code snippet won't work on Kaggle:

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model).create(prog='dot', format='svg'))
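If the inline SVG rendering fails, a possible alternative (a sketch, assuming graphviz and pydot are installed; the filename is arbitrary) is to write the diagram to an image file with plot_model:

from keras.utils import plot_model

# Save the architecture diagram to a file instead of rendering it inline
plot_model(model, to_file='seq2seq_model.png', show_shapes=True)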

The resulting visualization is shown in the following diagram:

Seq2seq visual

You can now compile and train the model. Since we have to choose between a number of possible characters to output next, this is basically a multi-class classification task. Therefore, we'll use a categorical cross-entropy loss:

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)

The training process takes about 7 minutes on a GPU. However, if we plot the model's progress, we can see that it's overfitting:

Seq2seq overfitting
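A plot like this can be produced from the history object returned by fit; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

# Compare training and validation loss across epochs
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()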

The reason it's overfitting is largely that we used only 10,000 sentence pairs of relatively short sentences. A real translation or summarization system would have to be trained on many more examples. To allow you to follow the example without owning a massive data center, we are using a smaller model to show what a seq2seq architecture can do.

Creating inference models

Overfitting or not, we would like to use our model now. Using a seq2seq model for inference, in this case for doing translations, requires us to build a separate inference model that uses the weights trained in the training model but does the routing a bit differently. More specifically, we will separate the encoder and decoder. This way, we can create the encoding once and then reuse it for decoding, instead of recreating it for every step.

The encoder model maps from the encoder inputs to the encoder states:

encoder_model = Model(encoder_inputs, encoder_states)

The decoder model then takes in the encoder memory plus its own memory from the last character as an input. It then spits out a prediction plus its own memory to be used for the next character:

#Inputs from the encoder
decoder_state_input_h = Input(shape=(latent_dim,))     #1
decoder_state_input_c = Input(shape=(latent_dim,))

#Create a combined memory to input into the decoder
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]                                 #2

#Decoder
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)                   #3

decoder_states = [state_h, state_c]                    #4

#Predict next char
decoder_outputs = decoder_dense(decoder_outputs)       #5

decoder_model = Model([decoder_inputs] + decoder_states_inputs,[decoder_outputs] + decoder_states)                #6

Let's look at the six elements of this code:

  1. The encoder memory consists of two states. We need to create one input for each of them.
  2. We then combine the two states into one memory representation.
  3. We then connect the decoder LSTM we trained earlier to the decoder inputs and the encoder memory.
  4. We combine the two states of the decoder LSTM into one memory representation.
  5. We reuse the dense layer of the decoder to predict the next character.
  6. Finally, we set up the decoder model to take in the character input as well as the state input and map it to the character output as well as the state output.
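As a quick sanity check (not part of the original code), we can run the encoder on a single phrase and confirm that it returns two state vectors of size latent_dim:

# Encode the first training phrase; the encoder model returns [state_h, state_c]
states = encoder_model.predict(encoder_input_data[:1])
print(states[0].shape, states[1].shape)   # both (1, 256) with latent_dim = 256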

Making translations

We can now start to use our model. To do this, we must first create an index that maps tokens to characters again:

reverse_input_char_index = {i: char for char, i in input_token_index.items()}
reverse_target_char_index = {i: char for char, i in target_token_index.items()}

When we translate a phrase, we must first encode the input. We'll then loop, feeding the decoder states back into the decoder until we receive a STOP; in our case, we use the newline character to signal STOP.

target_seq is a NumPy array representing the last character predicted by the decoder:

def decode_sequence(input_seq):
    
    states_value = encoder_model.predict(input_seq)      #1
    
    target_seq = np.zeros((1, 1, num_decoder_tokens))    #2
    
    target_seq[0, 0, target_token_index['\t']] = 1.      #3

    stop_condition = False                               #4
    decoded_sentence = ''
    
    while not stop_condition:                            #5
        
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)                           #6

        sampled_token_index = np.argmax(output_tokens[0, -1, :])   #7

        sampled_char = reverse_target_char_index[sampled_token_index]                                               #8
        
        decoded_sentence += sampled_char                           #9


        if (sampled_char == '\n' or                               #10
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = np.zeros((1, 1, num_decoder_tokens))         #11
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]                                     #12

    return decoded_sentence

For the final time in this chapter, let's break down the code:

  1. Encode the input as state vectors
  2. Generate an empty target sequence of length one
  3. Populate the first character of the target sequence with the start character
  4. Initialize the stop condition to False and the decoded sentence to an empty string
  5. Loop until we receive a stop sign
  6. Get output and internal states of the decoder
  7. Get the predicted token (the token with the highest probability)
  8. Get the character belonging to the token number
  9. Append a character to the output
  10. Exit condition: either hit max length or find stop character
  11. Update the target sequence (of length one)
  12. Update states

Now we can translate English into French! At least for some phrases, it works quite well. Given that we did not supply our model with any rules about French words or grammar, this is quite impressive. Translation systems such as Google Translate, of course, use much bigger datasets and models, but the underlying principles are the same.

To translate a text, we first create a placeholder array full of zeros:

my_text = 'Thanks!'
placeholder = np.zeros((1,len(my_text)+10,num_encoder_tokens))

We then one-hot encode all characters in the text by setting the element at the index of each character's token number to 1:

for i, char in enumerate(my_text):
    print(i,char, input_token_index[char])
    placeholder[0,i,input_token_index[char]] = 1

This will print out the characters' token numbers alongside the character and its position in the text:

0 T 38
1 h 51
2 a 44
3 n 57
4 k 54
5 s 62
6 ! 1

Now we can feed this placeholder into our decode_sequence function:

decode_sequence(placeholder)

And we get the translation back:

'Merci !\n'

Seq2seq models are useful not only for translating between languages. They can be trained on just about anything that takes a sequence as an input and also outputs a sequence.

Remember our forecasting task from the last chapter? The winning solution to the forecasting problem was a seq2seq model. Text summarization is another useful application. Seq2seq models can also be trained to output a series of actions, such as a sequence of trades that would minimize the impact of a large order.
