Bi-directional LSTM networks

Well, our model did not work all that well before with simple feed-forward networks. In this section, we will try a different approach: bidirectional LSTM networks.

Recall that LSTM networks preserve parts of the previous information via the hidden state. However, this information is only about the past.

Bidirectional LSTMs run both ways: from past to future and from future to past. The LSTM that runs backwards preserves information from the future, so by combining the two hidden states you keep the context of both past and future. Clearly this would not make sense for stock price prediction! Their use was initially justified in the domain of speech recognition because, as you might know from experience, the context of the full phrase is often needed to understand the meaning of a word. This happens, for instance, when you are trying to simultaneously translate from one language to another.

OK, so how do we do this? Let's come back to keras, which, as we have seen before, makes the whole experience rather smooth.

For the bidirectional LSTM API, keras expects one document per row, encoded as a sequence of word indices that the network reads one at a time.

Let's begin with some familiar preprocessing steps:

library(purrr)
library(stringr)
library(tm)
library(keras)

df <- read.csv("./data/labeledTrainData.tsv", encoding = "utf-8", quote = "", sep = "\t", stringsAsFactors = F)

text <- df$review

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(removeNumbers))
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, sparse = 0.99)

X <- as.data.frame(as.matrix(dtm))
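As a quick sanity check (assuming the file loaded as above), you can look at the size of what we just built:

# Rows are reviews; columns are the terms kept after removeSparseTerms
dim(X)

The exact number of columns depends on the sparsity threshold passed to removeSparseTerms.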

Even after pruning sparse terms, this is a large document-term matrix that is mostly zeros. Now, we need to parse it to keep, for each document, only the indices of the terms with non-zero counts:

vocab <- names(X)
maxlen <- 100
dataset <- map(
  1:nrow(X),
  ~ list(review = which(X[.x, ] != 0))
)
dataset <- transpose(dataset)

And finally, we turn this into a fixed-size, zero-padded matrix:

X <- array(0, dim = c(length(dataset$review), maxlen))
y <- array(0, dim = c(length(dataset$review)))
for (i in 1:length(dataset$review)) {
  for (j in 1:maxlen) {
    if (length(dataset$review[[i]]) >= j) {
      X[i, j] <- dataset$review[[i]][j]
    } else {
      X[i, j] <- 0
    }
  }
  y[i] <- df[i, "sentiment"]
}

X <- as.matrix(X)
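As a side note, keras ships its own helper for this padding and truncation step, pad_sequences. A minimal equivalent sketch, assuming dataset$review as built above (the manual loop pads and truncates at the end of each review, hence "post"; X_alt is just an illustrative name):

# Turn the list of index vectors into a zero-padded integer matrix
X_alt <- pad_sequences(dataset$review,
                       maxlen = maxlen,
                       padding = "post",
                       truncating = "post")

Either way, we end up with one padded row of word indices per review.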

Which gives us:

> X[1,]
[1] 23 46 49 65 71 100 109 115 137 144 149 161 165 185 188 190 193 196 210 217 235 271
[23] 286 287 295 308 317 326 359 365 366 376 380 390 407 436 441 464 469 483 494 498 511 514
[45] 520 521 571 580 585 588 595 603 613 628 662 693 705 726 734 742 749 760 776 795 797 803
[67] 808 828 832 843 848 852 871 872 890 892 897 900 908 922 929 931 955 973 975 983 994 1008
[89] 1019 1044 1072 1127 1140 1144 1184 1205 1217 1315 1317 1321

On each row, we now have the indices of up to 100 vocabulary terms that appear in the review (in vocabulary order), padded with zeros at the end if fewer terms are present. Now, we are ready to define our network:

# Initialize model
model <- keras_model_sequential()
model %>%
  # Dense embedding layer; outputs a 3D tensor with shape
  # (batch_size, sequence_length, output_dim).
  # input_dim is length(vocab) + 1 because the word indices run
  # from 1 to length(vocab) and 0 is reserved for padding.
  layer_embedding(input_dim = length(vocab) + 1,
                  output_dim = 128,
                  input_length = maxlen) %>%
  bidirectional(layer_lstm(units = 64)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = 'sigmoid')

A few comments are in order here. First, note that the output layer is a single sigmoid unit rather than the two-unit output we used before; for a binary problem either choice works, and in a multi-class setup you would go back to a softmax output with code similar to what we used before. Next, observe the layer_embedding function, which maps each word index to a dense 128-dimensional vector, so the data reaches the LSTM as a sequence of vectors. Finally comes the bidirectional LSTM layer, followed by a dropout layer.

Wait, what is dropout? Dropout is a regularization technique that randomly ignores (drops) a fraction of the units during training. This might seem strange, but it is in fact a very efficient way of performing model averaging with neural networks: in much the same way that averaging many trees produces models that are less prone to over-fitting, averaging the many thinned networks that dropout implicitly trains results in a more robust model.
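To make the idea concrete, here is a hand-rolled sketch of (inverted) dropout applied to a vector of activations; layer_dropout does the equivalent for you, and only at training time (the names below are purely illustrative):

set.seed(1)                          # any seed, just for reproducibility
activations <- runif(10)             # pretend these are ten hidden-unit outputs
rate <- 0.5
keep <- rbinom(length(activations), size = 1, prob = 1 - rate)
dropped <- activations * keep / (1 - rate)  # survivors are scaled up so the expected activation is unchanged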

Now, we compile the model (note the binary_crossentropy loss, which matches the single sigmoid output):

# Compile: you can try different optimizers and metrics
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
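As mentioned above, if you prefer the two-column output we used earlier (or have more than two classes), you would swap the final layer and the loss function. A sketch of that variant, reusing vocab and maxlen from above (model2 and y_onehot are just illustrative names):

# Two-unit softmax output with one-hot labels instead of a single sigmoid
model2 <- keras_model_sequential()
model2 %>%
  layer_embedding(input_dim = length(vocab) + 1,
                  output_dim = 128,
                  input_length = maxlen) %>%
  bidirectional(layer_lstm(units = 64)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 2, activation = 'softmax')

model2 %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

y_onehot <- to_categorical(y, num_classes = 2)

For the rest of this section we stick with the single-output binary model.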

And we are now ready to call the fit method:

> history <- model %>% fit(
    X, y,
    batch_size = 128,
    epochs = 4,
    validation_split = 0.2
  )
Epoch 1/4
25000/25000 [==============================] - 155s - loss: 0.4428 - acc: 0.7886
Epoch 2/4
25000/25000 [==============================] - 161s - loss: 0.3162 - acc: 0.8714
Epoch 3/4
25000/25000 [==============================] - 166s - loss: 0.2983 - acc: 0.8770
Epoch 4/4
25000/25000 [==============================] - 176s - loss: 0.2855 - acc: 0.8825

> plot(history)

Not bad: with the bidirectional LSTM we reach around 88% training accuracy after four epochs, significantly above the feed-forward networks.
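As a final sanity check, you can ask the fitted model for predicted probabilities and compare them with the labels. A quick sketch on the first few training rows (probs is just an illustrative name; values close to 1 mean the model predicts a positive review):

probs <- model %>% predict(X[1:5, , drop = FALSE])
cbind(round(probs, 3), actual = y[1:5])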
