Other LSTM architectures

The bidirectional LSTM seemed like a good idea, right? What about a simpler network architecture?

Instead of a bidirectional LSTM, we can try a plain LSTM. Keeping the same preprocessing (that is, feeding the data in the same format), we replace the preceding model with a single LSTM layer:

# A plain (unidirectional) LSTM on top of the same embedding layer
model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = length(vocab),
                  output_dim = 128,
                  input_length = 100) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)
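
The actual training call is not reproduced here; a minimal sketch would look like the following, assuming x_train and y_train are the padded sequences and labels produced by the earlier preprocessing (the batch size, number of epochs, and validation split are illustrative values, not necessarily those used for the figure below):

history <- model %>% fit(
  x_train, y_train,           # assumed names from the earlier preprocessing
  batch_size = 32,            # illustrative value
  epochs = 10,                # illustrative value
  validation_split = 0.2      # hold out part of the training data for validation
)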

After training, we get the following results:

Training an LSTM on the IMDb data

So, as you can see, we experienced a significant loss in quality. This dataset is unfortunately too small for vanilla LSTMs to beat simpler configurations, such as tf-idf and logistic regression, or our GloVe + random forest experiment.

Notice also the different shape of the loss curve. An interesting, if not completely clear (at least to us), detail is that shape: a small bump instead of a smooth exponential decay. This is not unusual in LSTMs, as opposed to other architectures, such as feed-forward networks or convolutional neural networks.
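
If you stored the object returned by fit() in history, as in the sketch above, you can inspect the curve yourself; the keras R package provides a plot method for training histories:

plot(history)   # plots loss and accuracy per epoch for training and validation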

Actually, now that we mention CNNs, they can be used for language, too:

# Initialize the model
model <- keras_model_sequential()

model %>%
  layer_embedding(
    input_dim = length(vocab),
    output_dim = 128,
    input_length = 100
  ) %>%
  layer_dropout(0.25) %>%
  # 1D convolution: slide 64 filters of width 5 over the embedded sequence
  layer_conv_1d(
    filters = 64,
    kernel_size = 5,
    padding = "valid",
    activation = "relu",
    strides = 1
  ) %>%
  # Downsample the convolved sequence before feeding it to the LSTM
  layer_max_pooling_1d(pool_size = 4) %>%
  layer_lstm(units = 70) %>%
  layer_dense(units = 1) %>%
  layer_activation("sigmoid")

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)

The results are shown as follows:

Results for a combination of a CNN and an LSTM network

We get an almost too impressive performance. How do CNNs work on text? Similarly to how they work on images: a one-dimensional convolution slides a window (of length five in our example) over the input sequence and computes a filter response at each position. It is certainly easier to picture averaging colors or intensities over a window of an image, and we would not like to push an analogy that makes no sense. As the field is evolving so fast, perhaps soon we will have a more satisfactory explanation of the effectiveness of CNNs in sentiment analysis tasks. For the time being, let's be glad that they work well and can be trained reasonably fast.
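
To make the sliding-window picture concrete, here is the shape arithmetic for our configuration; summary() prints the same per-layer output shapes:

# 100 input positions, "valid" padding, kernel_size = 5, stride 1:
#   100 - 5 + 1 = 96 convolution outputs, each a vector of 64 filter responses
# max pooling with pool_size = 4 then leaves floor(96 / 4) = 24 time steps for the LSTM
summary(model)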

Beware that, for the sake of brevity, in most of the examples in this chapter we have omitted verification of the results on a held-out validation set (which is still the right thing to do, even when relying on Keras' built-in validation mechanism). So do not take the scores presented here at face value.
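
If you do set aside a proper test split, checking the scores takes one line; a sketch, assuming x_test and y_test come from the same preprocessing pipeline:

model %>% evaluate(x_test, y_test)   # loss and accuracy on data never seen during training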