In this chapter we show how to work with sequences in TensorFlow, and in particular text. We begin by introducing recurrent neural networks (RNNs), a powerful class of deep learning algorithms particularly useful and popular in natural language processing (NLP). We show how to implement RNN models from scratch, introduce some important TensorFlow capabilities, and visualize the model with the interactive TensorBoard. We then explore how to use an RNN in a supervised text classification problem with word-embedding training. Finally, we show how to build a more advanced RNN model with long short-term memory (LSTM) networks and how to handle sequences of variable length.
We saw in the previous chapter that using the spatial structure of images can lead to advanced models with excellent results. As discussed in that chapter, exploiting structure is the key to success. As we will see shortly, an immensely important and useful type of structure is the sequential structure. Thinking in terms of data science, this fundamental structure appears in many datasets, across all domains. In computer vision, video is a sequence of visual content evolving over time. In speech we have audio signals, in genomics gene sequences; we have longitudinal medical records in healthcare, financial data in the stock market, and so on (see Figure 5-1).
A particularly important type of data with strong sequential structure is natural language—text data. Deep learning methods that exploit the sequential structure inherent in texts—characters, words, sentences, paragraphs, documents—are at the forefront of natural language understanding (NLU) systems, often leaving more traditional methods in the dust. There are a great many types of NLU tasks that are of interest to solve, ranging from document classification to building powerful language models, from answering questions automatically to generating human-level conversation agents. These tasks are fiendishly difficult, garnering the efforts and attention of the entire AI community in both academia and industry.
In this chapter, we focus on the basic building blocks and tasks, and show how to work with sequences—primarily of text—in TensorFlow. We take a detailed deep dive into the core elements of sequence models in TensorFlow, implementing some of them from scratch, to gain a thorough understanding. In the next chapter we show more advanced text modeling techniques with TensorFlow, and in Chapter 7 we use abstraction libraries that offer simpler, high-level ways to implement our models.
We begin with the most important and popular class of deep learning models for sequences (in particular, text): recurrent neural networks.
Recurrent neural networks are a powerful and widely used class of neural network architectures for modeling sequence data. The basic idea behind RNN models is that each new element in the sequence contributes some new information, which updates the current state of the model.
In the previous chapter, which explored computer vision with CNN models, we discussed how those architectures are inspired by the current scientific perceptions of the way the human brain processes visual information. These scientific perceptions are often rather close to our commonplace intuition from our day-to-day lives about how we process sequential information.
When we receive new information, clearly our “history” and “memory” are not wiped out, but instead “updated.” When we read a sentence in some text, with each new word, our current state of information is updated, and it is dependent not only on the new observed word but on the words that preceded it.
A fundamental mathematical construct in statistics and probability, often used as a building block for modeling sequential patterns via machine learning, is the Markov chain model. Figuratively speaking, we can view our data sequences as “chains,” with each node in the chain dependent in some way on the previous node, so that “history” is not erased but carried on.
RNN models are also based on this notion of chain structure, and vary in how exactly they maintain and update information. As their name implies, recurrent neural nets apply some form of “loop.” As seen in Figure 5-2, at some point in time $t$, the network observes an input $x_t$ (a word in a sentence) and updates its “state vector” to $h_t$ from the previous vector $h_{t-1}$. When we process new input (the next word), it will be done in some manner that is dependent on $h_t$ and thus on the history of the sequence (the previous words we’ve seen affect our understanding of the current word). As seen in the illustration, this recurrent structure can simply be viewed as one long unrolled chain, with each node in the chain performing the same kind of processing “step” based on the “message” it obtains from the output of the previous node. This, of course, is closely related to the Markov chain models discussed previously and their hidden Markov model (HMM) extensions, which are not discussed in this book.
In this section we implement a basic RNN from scratch, explore its inner workings, and gain insight into how TensorFlow can work with sequences. We introduce some powerful, fairly low-level tools that TensorFlow provides for working with sequence data, which you can use to implement your own systems.
In the next sections, we will show how to use higher-level TensorFlow RNN modules.
We begin with defining our basic model mathematically. This mainly consists of defining the recurrence structure—the RNN update step.
The update step for our simple vanilla RNN is
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$
where $W_h$, $W_x$, and $b$ are weight and bias variables we learn, tanh(·) is the hyperbolic tangent function, which takes values in the open interval (–1, 1) and is closely related to the sigmoid function used in previous chapters, and $x_t$ and $h_t$ are the input and state vectors as defined previously. Finally, the hidden state vector is multiplied by another set of weights, yielding the outputs that appear in Figure 5-2.
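To make the update rule concrete, here is a minimal NumPy sketch of a single step, applied repeatedly over a short sequence. It is not the chapter's TensorFlow code: the sizes, the random weights, and the five-element input sequence are assumptions chosen only for illustration. It uses the row-vector convention (inputs multiply the weights from the left), which is also how the TensorFlow implementation later in this section writes the products.

```python
import numpy as np

def rnn_step(h_prev, x, W_x, W_h, b):
    """One vanilla RNN update: h_t = tanh(x_t W_x + h_{t-1} W_h + b)."""
    return np.tanh(x @ W_x + h_prev @ W_h + b)

# Hypothetical sizes, chosen only for illustration
element_size, hidden_size = 4, 3
rng = np.random.RandomState(0)
W_x = rng.normal(scale=0.1, size=(element_size, hidden_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros((1, hidden_size))                   # initial state h_0
for x in rng.normal(size=(5, 1, element_size)):  # a sequence of 5 inputs
    h = rnn_step(h, x, W_x, W_h, b)              # the same weights are reused at every step
```

Note that the state keeps a fixed size no matter how long the sequence is, and that every entry stays strictly between -1 and 1 because of the tanh.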
To get a first taste of the power and general applicability of sequence models, in this section we implement our first RNN to solve the MNIST image classification task that you are by now familiar with. Later in this chapter we will focus on sequences of text, and see how neural sequence models can powerfully manipulate them and extract information to solve NLU tasks.
But, you may ask, what have images got to do with sequences?
As we saw in the previous chapter, the architecture of convolutional neural networks makes use of the spatial structure of images. While the structure of natural images is well suited for CNN models, it is revealing to look at the structure of images from different angles. In a trend in cutting-edge deep learning research, advanced models attempt to exploit various kinds of sequential structures in images, trying to capture in some sense the “generative process” that created each image. Intuitively, this all comes down to the notion that nearby areas in images are somehow related, and trying to model this structure.
Here, to introduce basic RNNs and how to work with sequences, we take a simple sequential view of images: we look at each image in our data as a sequence of rows (or columns). In our MNIST data, this just means that each 28×28-pixel image can be viewed as a sequence of length 28, each element in the sequence a vector of 28 pixels (see Figure 5-3). Then, the temporal dependencies in the RNN can be imagined as a scanner head, scanning the image from top to bottom (rows) or left to right (columns).
We start by loading data, defining some parameters, and creating placeholders for our data:
import tensorflow as tf

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Define some parameters
element_size = 28
time_steps = 28
num_classes = 10
batch_size = 128
hidden_layer_size = 128

# Where to save TensorBoard model summaries
LOG_DIR = "logs/RNN_with_summaries"

# Create placeholders for inputs, labels
_inputs = tf.placeholder(tf.float32,
                         shape=[None, time_steps, element_size],
                         name='inputs')
y = tf.placeholder(tf.float32, shape=[None, num_classes], name='labels')
element_size is the dimension of each vector in our sequence: in our case, a row/column of 28 pixels. time_steps is the number of such elements in a sequence.
As we saw in previous chapters, when we load data with the built-in MNIST data loader, it comes in unrolled form, as a vector of 784 pixels. When we load batches of data during training (we’ll get to that later in this section), we simply reshape each unrolled vector to [batch_size, time_steps, element_size]:
batch_x, batch_y = mnist.train.next_batch(batch_size)
# Reshape data to get 28 sequences of 28 pixels
batch_x = batch_x.reshape((batch_size, time_steps, element_size))
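The reshape relies on NumPy's default row-major ordering: consecutive runs of 28 values in the flat vector become consecutive rows of the image, so time step t of an image holds its flat entries 28t through 28t+27. The following sketch checks this on synthetic data; the counter values standing in for MNIST pixels and the batch of 3 are assumptions made for illustration.

```python
import numpy as np

time_steps, element_size = 28, 28
batch = 3  # hypothetical small batch for illustration

# Stand-in for mnist.train.next_batch: flat vectors of 784 values each
flat = np.arange(batch * time_steps * element_size, dtype=np.float32)
flat = flat.reshape(batch, time_steps * element_size)

# Same reshape as in the training code
seq = flat.reshape((batch, time_steps, element_size))

# Time step t of image n is the slice [28*t, 28*(t+1)) of its flat vector
n, t = 1, 5
row_from_flat = flat[n, t * element_size:(t + 1) * element_size]
same_row = np.array_equal(seq[n, t], row_from_flat)
```

Here same_row evaluates to True, confirming that each time step fed to the RNN is one image row.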
We set hidden_layer_size (arbitrarily) to 128, controlling the size of the hidden RNN state vector discussed earlier.
LOG_DIR is the directory to which we save model summaries for TensorBoard visualization. You will learn what this means as we go.
In this chapter, we will also briefly introduce TensorBoard visualizations. TensorBoard allows you to monitor and explore the model structure, weights, and training process, and requires some very simple additions to the code. More details are provided throughout this chapter and further along in the book.
Finally, our input and label placeholders are created with the suitable dimensions.
Let’s implement the mathematical model for the RNN step.
We first create a function used for logging summaries, which we will use later in TensorBoard to visualize our model and training process (it is not important to understand its technicalities at this stage):
# This helper function, taken from the official TensorFlow documentation,
# simply adds some ops that take care of logging summaries
def variable_summaries(var):
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
Next, we create the weight and bias variables used in the RNN step:
# Weights and bias for input and hidden layer
with tf.name_scope('rnn_weights'):
    with tf.name_scope("W_x"):
        Wx = tf.Variable(tf.zeros([element_size, hidden_layer_size]))
        variable_summaries(Wx)
    with tf.name_scope("W_h"):
        Wh = tf.Variable(tf.zeros([hidden_layer_size, hidden_layer_size]))
        variable_summaries(Wh)
    with tf.name_scope("Bias"):
        b_rnn = tf.Variable(tf.zeros([hidden_layer_size]))
        variable_summaries(b_rnn)
We now create a function that implements the vanilla RNN step we saw in the previous section using the variables we created. It should by now be straightforward to understand the TensorFlow code used here:
def rnn_step(previous_hidden_state, x):
    current_hidden_state = tf.tanh(
        tf.matmul(previous_hidden_state, Wh) +
        tf.matmul(x, Wx) + b_rnn)
    return current_hidden_state
Next, we apply this function across all 28 time steps:
# Processing inputs to work with scan function
# Current input shape: (batch_size, time_steps, element_size)
processed_input = tf.transpose(_inputs, perm=[1, 0, 2])
# Current input shape now: (time_steps, batch_size, element_size)

initial_hidden = tf.zeros([batch_size, hidden_layer_size])
# Getting all state vectors across time
all_hidden_states = tf.scan(rnn_step, processed_input,
                            initializer=initial_hidden, name='states')
In this small code block, there are some important elements to understand. First, we transpose the inputs from [batch_size, time_steps, element_size] to [time_steps, batch_size, element_size]. The perm argument to tf.transpose() tells TensorFlow which axes we want to switch around. Now that the first axis in our input Tensor represents the time axis, we can iterate across all time steps by using the built-in tf.scan() function, which repeatedly applies a callable (function) to a sequence of elements in order, as explained in the following note.
This important function was added to TensorFlow to allow us to introduce loops into the computation graph, instead of just “unrolling” the loops explicitly by adding more and more replications of the same operations. More technically, it is a higher-order function very similar to the reduce operator, but it returns all intermediate accumulator values over time. There are several advantages to this approach, chief among them the ability to have a dynamic rather than fixed number of iterations, computational speedups, and optimizations for graph construction.
To demonstrate the use of this function, consider the following simple example (which is separate from the overall RNN code in this section):
import numpy as np
import tensorflow as tf

elems = np.array(["T", "e", "n", "s", "o", "r", " ", "F", "l", "o", "w"])
scan_sum = tf.scan(lambda a, x: a + x, elems)

sess = tf.InteractiveSession()
sess.run(scan_sum)
Let’s see what we get:
array([b'T', b'Te', b'Ten', b'Tens', b'Tenso', b'Tensor', b'Tensor ', b'Tensor F', b'Tensor Fl', b'Tensor Flo', b'Tensor Flow'], dtype=object)
In this case, we use tf.scan() to sequentially concatenate characters to a string, in a manner analogous to the arithmetic cumulative sum.
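The same pattern is what drives our RNN: the accumulator is the hidden state, and each element is one time step of the batch. The following NumPy sketch mimics, under simplifying assumptions, what the transpose followed by tf.scan() computes. The helper name scan_states, the small sizes, and the random weights are illustrative and not part of the chapter's code.

```python
import numpy as np

def scan_states(step_fn, elems, initializer):
    """Apply step_fn along the first axis of elems, keeping every accumulator value."""
    state, states = initializer, []
    for x in elems:
        state = step_fn(state, x)
        states.append(state)
    return np.stack(states)

batch, time_steps, element_size, hidden = 2, 4, 3, 5  # illustrative sizes
rng = np.random.RandomState(0)
Wx = rng.normal(scale=0.1, size=(element_size, hidden))
Wh = rng.normal(scale=0.1, size=(hidden, hidden))
b = np.zeros(hidden)

def rnn_step(h_prev, x):
    return np.tanh(h_prev @ Wh + x @ Wx + b)

inputs = rng.normal(size=(batch, time_steps, element_size))
time_major = np.transpose(inputs, (1, 0, 2))   # (time_steps, batch, element_size)
all_states = scan_states(rnn_step, time_major, np.zeros((batch, hidden)))
```

The result has shape (time_steps, batch, hidden): one state per time step, for every example in the batch, which is the layout the rest of this section indexes into.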
As we saw earlier, in an RNN we get a state vector for each time step, multiply it by some weights, and get an output vector—our new representation of the data. Let’s implement this:
# Weights for output layers
with tf.name_scope('linear_layer_weights') as scope:
    with tf.name_scope("W_linear"):
        Wl = tf.Variable(tf.truncated_normal([hidden_layer_size, num_classes],
                                             mean=0, stddev=.01))
        variable_summaries(Wl)
    with tf.name_scope("Bias_linear"):
        bl = tf.Variable(tf.truncated_normal([num_classes],
                                             mean=0, stddev=.01))
        variable_summaries(bl)

# Apply linear layer to state vector
def get_linear_layer(hidden_state):
    return tf.matmul(hidden_state, Wl) + bl

with tf.name_scope('linear_layer_weights') as scope:
    # Iterate across time, apply linear layer to all RNN outputs
    all_outputs = tf.map_fn(get_linear_layer, all_hidden_states)
    # Get last output
    output = all_outputs[-1]
    tf.summary.histogram('outputs', output)
Our input to the RNN is sequential, and so is our output. In this sequence classification example, we take the last state vector and pass it through a fully connected linear layer to extract an output vector (which will later be passed through a softmax activation function to generate predictions). This is common practice in basic sequence classification, where we assume that the last state vector has “accumulated” information representing the entire sequence.
To implement this, we first define the linear layer’s weights and bias term variables, and create a helper function for this layer. Then we apply this layer to all outputs with tf.map_fn(), which is pretty much the same as the typical map function that applies functions to sequences/iterables in an element-wise manner, in this case on each element in our sequence.
Finally, we extract the last output for each instance in the batch, with negative indexing (similarly to ordinary Python). We will see some more ways to do this later and investigate outputs and states in some more depth.
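Because the linear layer acts on each time step independently, mapping it over all states and then keeping the last result gives the same answer as applying it to the last state alone. The NumPy sketch below illustrates this with made-up sizes and random values that stand in for all_hidden_states, Wl, and bl; it is an analogy for the TensorFlow ops, not a replacement for them.

```python
import numpy as np

time_steps, batch, hidden, num_classes = 4, 2, 5, 3  # illustrative sizes
rng = np.random.RandomState(1)
all_hidden_states = rng.normal(size=(time_steps, batch, hidden))
Wl = rng.normal(scale=0.01, size=(hidden, num_classes))
bl = rng.normal(scale=0.01, size=num_classes)

def get_linear_layer(hidden_state):
    return hidden_state @ Wl + bl

# Analogue of tf.map_fn: apply the layer to the state at every time step
all_outputs = np.stack([get_linear_layer(h) for h in all_hidden_states])
output = all_outputs[-1]                   # last time step: (batch, num_classes)

# Cheaper alternative: apply the layer to the last state only
output_direct = get_linear_layer(all_hidden_states[-1])
```

Mapping over every time step is useful when the per-step outputs are needed; for classification from the final state alone, the direct form does less work.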
We’re now ready to train a classifier, much in the same way we did in the previous chapters. We define the ops for loss function computation, optimization, and prediction, add some more summaries for TensorBoard, and merge all these summaries into one operation:
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y))
    tf.summary.scalar('cross_entropy', cross_entropy)

with tf.name_scope('train'):
    # Using RMSPropOptimizer
    train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(output, 1))
    accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32))) * 100
    tf.summary.scalar('accuracy', accuracy)

# Merge all the summaries
merged = tf.summary.merge_all()
By now you should be familiar with most of the components used for defining the loss function and optimization. Here, we used the RMSPropOptimizer, implementing a well-known and strong gradient descent algorithm, with some standard hyperparameters. Of course, we could have used any other optimizer (and do so throughout this book!).
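For intuition about the numbers these ops produce, the sketch below recomputes softmax cross-entropy and percentage accuracy with NumPy on made-up logits. It illustrates the formulas rather than TensorFlow's implementation, and the function names are hypothetical. One useful sanity check: with all-zero logits over 10 classes, every class gets probability 0.1, so the loss is ln 10 ≈ 2.3026, which is roughly what an untrained 10-class classifier should report.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def accuracy_percent(logits, labels):
    """Share of examples whose highest logit matches the label, times 100."""
    return 100.0 * np.mean(np.argmax(logits, axis=1) == np.argmax(labels, axis=1))

num_classes = 10
labels = np.eye(num_classes)[[3, 7]]           # two one-hot labels: classes 3 and 7
untrained = np.zeros((2, num_classes))         # all-zero logits: uniform prediction
loss = softmax_cross_entropy(untrained, labels).mean()

confident = 5.0 * np.eye(num_classes)[[3, 1]]  # predicts 3 (right) and 1 (wrong)
acc = accuracy_percent(confident, labels)
```

Here loss equals ln 10 and acc equals 50.0, since one of the two predictions is correct.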
We create a small test set with unseen MNIST images, and add some more technical ops and commands for logging summaries that we will use in TensorBoard.
Let’s run the model and check out the results:
# Get a small test set
test_data = mnist.test.images[:batch_size].reshape((-1, time_steps,
                                                    element_size))
test_label = mnist.test.labels[:batch_size]

with tf.Session() as sess:
    # Write summaries to LOG_DIR -- used by TensorBoard
    train_writer = tf.summary.FileWriter(LOG_DIR + '/train',
                                         graph=tf.get_default_graph())
    test_writer = tf.summary.FileWriter(LOG_DIR + '/test',
                                        graph=tf.get_default_graph())

    sess.run(tf.global_variables_initializer())

    for i in range(10000):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Reshape data to get 28 sequences of 28 pixels
        batch_x = batch_x.reshape((batch_size, time_steps, element_size))
        summary, _ = sess.run([merged, train_step],
                              feed_dict={_inputs: batch_x, y: batch_y})
        # Add to summaries
        train_writer.add_summary(summary, i)

        if i % 1000 == 0:
            acc, loss = sess.run([accuracy, cross_entropy],
                                 feed_dict={_inputs: batch_x, y: batch_y})
            print("Iter " + str(i) + ", Minibatch Loss= " +
                  "{:.6f}".format(loss) + ", Training Accuracy= " +
                  "{:.5f}".format(acc))
        if i % 10 == 0:
            # Every 10 iterations, calculate accuracy for 128 MNIST
            # test images and add to summaries
            summary, acc = sess.run([merged, accuracy],
                                    feed_dict={_inputs: test_data,
                                               y: test_label})
            test_writer.add_summary(summary, i)

    test_acc = sess.run(accuracy, feed_dict={_inputs: test_data,
                                             y: test_label})
    print("Testing Accuracy:", test_acc)
Finally, we print some training and testing accuracy results:
Iter 0, Minibatch Loss= 2.303386, Training Accuracy= 7.03125
Iter 1000, Minibatch Loss= 1.238117, Training Accuracy= 52.34375
Iter 2000, Minibatch Loss= 0.614925, Training Accuracy= 85.15625
Iter 3000, Minibatch Loss= 0.439684, Training Accuracy= 82.81250
Iter 4000, Minibatch Loss= 0.077756, Training Accuracy= 98.43750
Iter 5000, Minibatch Loss= 0.220726, Training Accuracy= 89.84375
Iter 6000, Minibatch Loss= 0.015013, Training Accuracy= 100.00000
Iter 7000, Minibatch Loss= 0.017689, Training Accuracy= 100.00000
Iter 8000, Minibatch Loss= 0.065443, Training Accuracy= 99.21875
Iter 9000, Minibatch Loss= 0.071438, Training Accuracy= 98.43750
Testing Accuracy: 97.6563
To summarize this section, we started off with the raw MNIST pixels and regarded them as sequential data—each column (or row) of 28 pixels as a time step. We then applied the vanilla RNN to extract outputs corresponding to each time-step and used the last output to perform classification of the entire sequence (image).
TensorBoard is an interactive browser-based tool that allows us to visualize the learning process, as well as explore our trained model.
To run TensorBoard, go to the command terminal and tell TensorBoard where the relevant summaries you logged are:
tensorboard --logdir=LOG_DIR
Here, LOG_DIR should be replaced with your log directory. If you are on Windows and this is not working, make sure you are running the terminal from the same drive where the log data is, and add a name to the log directory as follows in order to bypass a bug in the way TensorBoard parses the path:
tensorboard --logdir=rnn_demo:LOG_DIR
TensorBoard allows us to assign names to individual log directories by putting a colon between the name and the path, which may be useful when working with multiple log directories. In such a case, we pass a comma-separated list of log directories as follows:
tensorboard --logdir=rnn_demo1:LOG_DIR1,rnn_demo2:LOG_DIR2
In our example (with one log directory), once you have run the tensorboard command, you should get something like the following, telling you where to navigate in your browser:
Starting TensorBoard b'39' on port 6006
(You can navigate to http://10.100.102.4:6006)
If the address does not work, go to localhost:6006, which should always work.
TensorBoard recursively walks the directory tree rooted at LOG_DIR looking for subdirectories that contain tfevents log data. If you run this example multiple times, make sure to either delete the LOG_DIR folder you created after each run, or write the logs to separate subdirectories within LOG_DIR, such as LOG_DIR/run1/train, LOG_DIR/run2/train, and so forth, to avoid issues with overwriting log files, which may lead to some “funky” plots.
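One simple way to follow this advice is to build a fresh subdirectory name for every run and pass it to tf.summary.FileWriter in place of LOG_DIR + '/train'. The helper below is a hypothetical convenience function (make_run_dir is not part of TensorFlow), and the timestamp format is an arbitrary choice made for this sketch.

```python
import os
import time

def make_run_dir(log_dir, split="train", run_name=None):
    """Return a per-run log path such as LOG_DIR/run1/train."""
    if run_name is None:
        # Default to a timestamp so repeated runs do not collide
        run_name = time.strftime("run-%Y%m%d-%H%M%S")
    return os.path.join(log_dir, run_name, split)

LOG_DIR = "logs/RNN_with_summaries"
train_dir = make_run_dir(LOG_DIR, "train", run_name="run1")
test_dir = make_run_dir(LOG_DIR, "test", run_name="run1")
```

Pointing TensorBoard at LOG_DIR then shows each run as a separate curve, which also makes it easy to compare runs side by side.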
Let’s take a look at some of the visualizations we can get. In the next section, we will explore interactive visualization of high-dimensional data with TensorBoard—for now, we focus on plotting training process summaries and trained weights.
First, in your browser, go to the Scalars tab. Here TensorBoard shows us summaries of all scalars, including not only training and testing accuracy, which are usually most interesting, but also some summary statistics we logged about variables (see Figure 5-4). Hovering over the plots, we can see some numerical figures.
In the Graphs tab we can get an interactive visualization of our computation graph, from a high-level view down to the basic ops, by zooming in (see Figure 5-5).
Finally, in the Histograms tab we see histograms of our weights across the training process (see Figure 5-6). Of course, we had to explicitly add these histograms to our logging in order to view them, with tf.summary.histogram().
The preceding example taught us some of the fundamental and powerful ways we can work with sequences, by implementing our graph pretty much from scratch. In practice, it is of course a good idea to use built-in higher-level modules and functions. This not only makes the code shorter and easier to write, but exploits many low-level optimizations afforded by TensorFlow implementations.
In this section we first present a new, shorter version of the code in its entirety. Since most of the overall details have not changed, we focus on the main new elements, tf.contrib.rnn.BasicRNNCell and tf.nn.dynamic_rnn():
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

element_size = 28
time_steps = 28
num_classes = 10
batch_size = 128
hidden_layer_size = 128

_inputs = tf.placeholder(tf.float32,
                         shape=[None, time_steps, element_size],
                         name='inputs')
y = tf.placeholder(tf.float32, shape=[None, num_classes], name='labels')

# TensorFlow built-in functions
rnn_cell = tf.contrib.rnn.BasicRNNCell(hidden_layer_size)
outputs, _ = tf.nn.dynamic_rnn(rnn_cell, _inputs, dtype=tf.float32)

Wl = tf.Variable(tf.truncated_normal([hidden_layer_size, num_classes],
                                     mean=0, stddev=.01))
bl = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=.01))

def get_linear_layer(vector):
    return tf.matmul(vector, Wl) + bl

last_rnn_output = outputs[:, -1, :]
final_output = get_linear_layer(last_rnn_output)

softmax = tf.nn.softmax_cross_entropy_with_logits(logits=final_output,
                                                  labels=y)
cross_entropy = tf.reduce_mean(softmax)
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(final_output, 1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32))) * 100

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

test_data = mnist.test.images[:batch_size].reshape((-1, time_steps,
                                                    element_size))
test_label = mnist.test.labels[:batch_size]

for i in range(3001):
    batch_x, batch_y = mnist.train.next_batch(batch_size)
    batch_x = batch_x.reshape((batch_size, time_steps, element_size))
    sess.run(train_step, feed_dict={_inputs: batch_x, y: batch_y})
    if i % 1000 == 0:
        acc = sess.run(accuracy, feed_dict={_inputs: batch_x, y: batch_y})
        loss = sess.run(cross_entropy, feed_dict={_inputs: batch_x,
                                                  y: batch_y})
        print("Iter " + str(i) + ", Minibatch Loss= " +
              "{:.6f}".format(loss) + ", Training Accuracy= " +
              "{:.5f}".format(acc))

print("Testing Accuracy:",
      sess.run(accuracy, feed_dict={_inputs: test_data, y: test_label}))
TensorFlow’s RNN cells are abstractions that represent the basic operations each recurrent “cell” carries out (see Figure 5-2 at the start of this chapter for an illustration), and its associated state. They are, in general terms, a “replacement” of the rnn_step() function and the associated variables it required. Of course, there are many variants and types of cells, each with many methods and properties. We will see some more advanced cells toward the end of this chapter and later in the book.
Once we have created the rnn_cell, we feed it into tf.nn.dynamic_rnn(). This function replaces tf.scan() in our vanilla implementation and creates an RNN specified by rnn_cell.
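With the default batch-major layout, tf.nn.dynamic_rnn() returns outputs of shape [batch_size, time_steps, hidden_layer_size] together with the final state, which is why the code above takes outputs[:, -1, :]. The NumPy sketch below imitates that interface for a basic tanh cell, where the output at each step is simply the new state. The function simple_dynamic_rnn, the sizes, and the random weights are hypothetical stand-ins for illustration, not TensorFlow's implementation.

```python
import numpy as np

def simple_dynamic_rnn(inputs, Wx, Wh, b):
    """Imitate the (outputs, final_state) interface for a basic tanh cell.

    inputs is batch-major: (batch, time_steps, element_size).
    """
    batch, time_steps, _ = inputs.shape
    state = np.zeros((batch, Wh.shape[0]))
    outputs = []
    for t in range(time_steps):
        state = np.tanh(inputs[:, t, :] @ Wx + state @ Wh + b)
        outputs.append(state)              # for this cell, output equals new state
    return np.stack(outputs, axis=1), state

rng = np.random.RandomState(2)
inputs = rng.normal(size=(2, 4, 3))        # illustrative sizes
Wx = rng.normal(scale=0.1, size=(3, 5))
Wh = rng.normal(scale=0.1, size=(5, 5))
outputs, final_state = simple_dynamic_rnn(inputs, Wx, Wh, np.zeros(5))
last_rnn_output = outputs[:, -1, :]        # same slice as in the TensorFlow code
```

For fixed-length sequences like these, the last output and the final state coincide, so either can feed the linear layer.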
As of this writing, in early 2017, TensorFlow includes a static and a dynamic function for creating an RNN. What does this mean? The static version creates an unrolled graph (as in Figure 5-2) of fixed length. The dynamic version uses a tf.while_loop() to dynamically construct the graph at execution time, leading to faster graph creation, which can be significant. This dynamic construction can also be very useful in other ways, some of which we will touch on when we discuss variable-length sequences toward the end of this chapter.
Note that contrib refers to the fact that code in this library is contributed and still requires testing. We discuss the contrib library in much more detail in Chapter 7. BasicRNNCell was moved to contrib in TensorFlow 1.0 as part of ongoing development. In version 1.2, many of the RNN functions and classes were moved back to the core namespace with aliases kept in contrib for backward compatibility, meaning that the preceding code works for all versions 1.X as of this writing.
We began this chapter by learning how to implement RNN models in TensorFlow. For ease of exposition, we showed how to implement and use an RNN for a sequence made of pixels in MNIST images. We next show how to use these sequence models on text sequences.
Text data has some properties distinctly different from image data, which we will discuss here and later in this book. These properties can make it somewhat difficult to handle text data at first, and text data always requires at least some basic pre-processing steps for us to be able to work with it. To introduce working with text in TensorFlow, we will thus focus on the core components and create a minimal, contrived text dataset that will let us get straight to the action. In Chapter 7, we will apply RNN models to movie review sentiment classification.
Let’s get started, presenting our example data and discussing some key properties of text datasets as we go.
In the MNIST RNN example we saw earlier, each sequence was of fixed size—the width (or height) of an image. Each element in the sequence was a dense vector of 28 pixels. In NLP tasks and datasets, we have a different kind of “picture.”
Our sequences could be of words forming a sentence, of sentences forming a paragraph, or even of characters forming words or paragraphs forming whole documents.
Consider the following sentence: “Our company provides smart agriculture solutions for farms, with advanced AI, deep-learning.” Say we obtain this sentence from an online news blog, and wish to process it as part of our machine learning system.
Each of the words in this sentence would be represented with an ID—an integer, commonly referred to as a token ID in NLP. So, the word “agriculture” could, for instance, be mapped to the integer 3452, the word “farm” to 12, “AI” to 150, and “deep-learning” to 0. This representation in terms of integer identifiers is very different from the vector of pixels in image data, in multiple ways. We will elaborate on this important point shortly when we discuss word embeddings, and in Chapter 6.
To make things more concrete, let’s start by creating our simplified text data.
Our simulated data consists of two classes of very short “sentences,” one composed of odd digits and the other of even digits (with numbers written in English). We generate sentences built of words representing even and odd numbers. Our goal is to learn to classify each sentence as either odd or even in a supervised text-classification task.
Of course, we do not really need any machine learning for this simple task—we use this contrived example only for illustrative purposes.
First, we define some constants, which will be explained as we go:
import numpy as np
import tensorflow as tf

batch_size = 128
embedding_dimension = 64
num_classes = 2
hidden_layer_size = 32
times_steps = 6
element_size = 1
Next, we create sentences. We sample random digits and map them to the corresponding “words” (e.g., 1 is mapped to “One,” 7 to “Seven,” etc.).
Text sequences typically have variable lengths, which is of course the case for all real natural language data (such as in the sentences appearing on this page).
To make our simulated sentences have different lengths, we sample for each sentence a random length between 3 and 6 with np.random.choice(range(3, 7)); the lower bound is inclusive, and the upper bound is exclusive.
Now, to put all our input sentences in one tensor (per batch of data instances), we need them to somehow be of the same size—so we pad sentences with a length shorter than 6 with zeros (or PAD symbols) to make all sentences equally sized (artificially). This pre-processing step is known as zero-padding. The following code accomplishes all of this:
digit_to_word_map = {1: "One", 2: "Two", 3: "Three", 4: "Four", 5: "Five",
                     6: "Six", 7: "Seven", 8: "Eight", 9: "Nine"}
digit_to_word_map[0] = "PAD"

even_sentences = []
odd_sentences = []
seqlens = []
for i in range(10000):
    rand_seq_len = np.random.choice(range(3, 7))
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1, 10, 2), rand_seq_len)
    rand_even_ints = np.random.choice(range(2, 10, 2), rand_seq_len)

    # Padding
    if rand_seq_len < 6:
        rand_odd_ints = np.append(rand_odd_ints, [0] * (6 - rand_seq_len))
        rand_even_ints = np.append(rand_even_ints, [0] * (6 - rand_seq_len))

    # Even digits go to even_sentences, odd digits to odd_sentences
    even_sentences.append(" ".join([digit_to_word_map[r]
                                    for r in rand_even_ints]))
    odd_sentences.append(" ".join([digit_to_word_map[r]
                                   for r in rand_odd_ints]))

data = even_sentences + odd_sentences
# Same seq lengths for even, odd sentences
seqlens *= 2
Let’s take a look at our sentences, each padded to length 6:
even_sentences[0:6]
Out:
['Four Four Two Four Two PAD',
 'Eight Six Four PAD PAD PAD',
 'Eight Two Six Two PAD PAD',
 'Eight Four Four Eight PAD PAD',
 'Eight Eight Four PAD PAD PAD',
 'Two Two Eight Six Eight Four']

odd_sentences[0:6]
Out:
['One Seven Nine Three One PAD',
 'Three Nine One PAD PAD PAD',
 'Seven Five Three Three PAD PAD',
 'Five Five Three One PAD PAD',
 'Three Three Five PAD PAD PAD',
 'Nine Three Nine Five Five Three']
Notice that we add the PAD word (token) to our data and to the digit_to_word_map dictionary, and that we separately store the even and odd sentences and their original lengths (before padding).
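The padding step can also be factored into a small helper. The following is only a minimal sketch, not the code used in this chapter; the function name pad_to_length and its pad_id argument are our own illustrative choices:

```python
import numpy as np


def pad_to_length(ids, max_len, pad_id=0):
    """Right-pad a 1-D sequence of integer IDs with pad_id up to max_len."""
    ids = np.asarray(ids, dtype=np.int64)
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    padded = np.full(max_len, pad_id, dtype=np.int64)
    padded[:len(ids)] = ids
    return padded


# A five-digit "sentence" padded to the fixed length of 6
padded = pad_to_length([4, 4, 2, 4, 2], max_len=6)
print(padded)  # [4 4 2 4 2 0]
```

A sentence that already has the maximal length passes through unchanged, which is why the chapter's code only pads when the sampled length is below 6.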
Let’s take a look at the original sequence lengths for the sentences we printed:
seqlens[0:6]
Out:
[5, 3, 4, 4, 3, 6]
Why keep the original sentence lengths? By zero-padding, we solved one technical problem but created another: if we naively pass these padded sentences through our RNN model as they are, it will process useless PAD symbols. This would both harm model correctness by processing “noise” and increase computation time. We resolve this issue by first storing the original lengths in the seqlens array and then telling TensorFlow’s tf.nn.dynamic_rnn() where each sentence ends.
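To get a feel for what the stored lengths encode, here is a small NumPy sketch that turns a list of lengths into a Boolean mask over time steps. It is independent of TensorFlow and is not meant to describe how dynamic_rnn() works internally; the variable names are ours:

```python
import numpy as np

max_len = 6
seqlens = np.array([5, 3, 4, 4, 3, 6])  # original lengths, before padding

# mask[i, t] is True when time step t of sentence i holds a real word,
# and False when it holds a PAD symbol
mask = np.arange(max_len)[None, :] < seqlens[:, None]
print(mask.astype(int))
```

Each row of the mask contains exactly as many ones as the corresponding sentence has real words, so summing a row recovers the original length.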
In this chapter, our data is simulated—generated by us. In real applications, we would start off by getting a collection of documents (e.g., one-sentence tweets) and then mapping each word to an integer ID.
So, we now map words to indices—word identifiers—by simply creating a dictionary with words as keys and indices as values. We also create the inverse map. Note that there is no correspondence between the word IDs and the digits each word represents—the IDs carry no semantic meaning, just as in any NLP application with real data:
# Map from words to indices
word2index_map = {}
index = 0
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index += 1

# Inverse map
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)
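As a quick illustration of how the two maps are used, the following sketch builds them for a two-sentence toy corpus and round-trips a sentence through word IDs. The helper names encode and decode are hypothetical and do not appear elsewhere in this chapter:

```python
# Toy stand-in for the chapter's data; the maps are built the same way
data = ["Four Two PAD", "One Seven Nine"]

word2index_map = {}
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = len(word2index_map)
index2word_map = {index: word for word, index in word2index_map.items()}


def encode(sentence):
    """Map a sentence to a list of word IDs."""
    return [word2index_map[w] for w in sentence.lower().split()]


def decode(ids):
    """Map a list of word IDs back to a (lowercased) sentence."""
    return " ".join(index2word_map[i] for i in ids)


ids = encode("Nine Two PAD")
print(ids)          # [5, 1, 2]
print(decode(ids))  # nine two pad
```

Note that the IDs simply reflect the order in which words were first seen, and that decoding returns lowercased words because the maps are built from lowercased text.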
This is a supervised classification task—we need an array of labels in the one-hot format, train and test sets, a function to generate batches of instances, and placeholders, as usual.
First, we create the labels and split the data into train and test sets:
labels = [1] * 10000 + [0] * 10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0] * 2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding

data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]
labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]

train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]
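The same two ideas, one-hot encoding and shuffling several arrays in unison, can be written more compactly with NumPy. This is an alternative sketch on toy data rather than the code used in this chapter, and the variable names are ours:

```python
import numpy as np

class_ids = np.array([1, 1, 0, 0])  # 1 = even sentence, 0 = odd sentence

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(2, dtype=np.int64)[class_ids]
print(one_hot)

# Applying one permutation to every array keeps the rows aligned
rng = np.random.RandomState(0)
perm = rng.permutation(len(class_ids))
shuffled_ids = class_ids[perm]
shuffled_one_hot = one_hot[perm]
```

Because the same index array is applied to both arrays, every one-hot row still matches its class ID after shuffling, which is exactly what the data_indices array guarantees for the sentences, labels, and sequence lengths.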
Next, we create a function that generates batches of sentences. Each sentence in a batch is simply a list of integer IDs corresponding to words:
def get_sentence_batch(batch_size, data_x, data_y, data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()]
         for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x, y, seqlens
Finally, we create placeholders for data:
_inputs = tf.placeholder(tf.int32, shape=[batch_size, times_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])

# seqlens for dynamic calculation
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])
Note that we have created a placeholder for the original sequence lengths. We will see how to make use of these in our RNN shortly.
Our text data is now encoded as lists of word IDs—each sentence is a sequence of integers corresponding to words. This type of atomic representation, where each word is represented with an ID, is not scalable for training deep learning models with large vocabularies that occur in real problems. We could end up with millions of such word IDs, each encoded in one-hot (binary) categorical form, leading to great data sparsity and computational issues. We will discuss this in more depth in Chapter 6.
A powerful approach to work around this issue is to use word embeddings. The embedding is, in a nutshell, simply a mapping from high-dimensional one-hot vectors encoding words to lower-dimensional dense vectors. So, for example, if our vocabulary has size 100,000, each word in one-hot representation would be a vector of that same size, with a single nonzero entry. The corresponding word vector, or word embedding, would be of size 300, say. The high-dimensional one-hot vectors are thus “embedded” into a continuous vector space with a much lower dimensionality.
In Chapter 6 we dive deeper into word embeddings, exploring a popular method to train them in an “unsupervised” manner known as word2vec.
Here, our end goal is to solve a text classification problem, and we will train word vectors in a supervised framework, tuning the embedded word vectors to solve the downstream classification task.
It is helpful to think of word embeddings as basic hash tables or lookup tables, mapping words to their dense vector values. These vectors are optimized as part of the training process. Previously, we gave each word an integer index, and sentences are then represented as sequences of these indices. Now, to obtain a word’s vector, we use the built-in tf.nn.embedding_lookup() function, which efficiently retrieves the vectors for each word in a given sequence of word indices:
with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_dimension],
                          -1.0, 1.0),
        name='embedding')
    embed = tf.nn.embedding_lookup(embeddings, _inputs)
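To make the lookup-table view concrete, the following NumPy sketch (independent of TensorFlow, with toy sizes of our own choosing) shows that retrieving rows of the embedding matrix by word index gives the same result as multiplying explicit one-hot vectors by that matrix:

```python
import numpy as np

vocabulary_size, embedding_dimension = 10, 4
rng = np.random.RandomState(0)
embeddings = rng.uniform(-1.0, 1.0,
                         size=(vocabulary_size, embedding_dimension))

# A batch of 2 sentences with 3 word IDs each
word_ids = np.array([[3, 1, 0],
                     [7, 7, 2]])

# Lookup by integer indexing: one embedding row per word ID
looked_up = embeddings[word_ids]                # shape (2, 3, 4)

# The same result via one-hot vectors times the embedding matrix
one_hot = np.eye(vocabulary_size)[word_ids]     # shape (2, 3, 10)
via_matmul = one_hot @ embeddings               # shape (2, 3, 4)

print(looked_up.shape)                     # (2, 3, 4)
print(np.allclose(looked_up, via_matmul))  # True
```

The indexing version never materializes the large, sparse one-hot vectors, which is the practical reason to prefer a lookup over a matrix product.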
We will see examples and visualizations of our word vector representations shortly.
In the introductory RNN example with which we began, we implemented and used the basic vanilla RNN model. In practice, we often use slightly more advanced RNN models, which differ mainly by how they update their hidden state and propagate information through time. A very popular recurrent network is the long short-term memory (LSTM) network. It differs from vanilla RNN by having some special memory mechanisms that enable the recurrent cells to better store information for long periods of time, thus allowing them to capture long-term dependencies better than plain RNN.
There is nothing mysterious about these memory mechanisms; they simply consist of some more parameters added to each recurrent cell, enabling the RNN to overcome optimization issues and propagate information. These trainable parameters act as filters that select what information is worth “remembering” and passing on, and what is worth “forgetting.” They are trained in exactly the same way as any other parameter in a network, with gradient-descent algorithms and backpropagation. We don’t go into the more technical mathematical formulations here, but there are plenty of great resources out there delving into the details.
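For readers who would like a concrete picture of these filters, here is a minimal NumPy sketch of a single step of one common LSTM formulation. The gate ordering, the weight layout, and all names are our own illustrative assumptions and are not taken from TensorFlow's implementation:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b, forget_bias=1.0):
    """One LSTM step.

    x: (batch, input_size); h_prev, c_prev: (batch, hidden)
    W: (input_size + hidden, 4 * hidden); b: (4 * hidden,)
    """
    z = np.concatenate([x, h_prev], axis=1) @ W + b
    i, g, f, o = np.split(z, 4, axis=1)
    # The forget gate decides what to keep of the old memory,
    # the input gate decides how much of the new candidate to add
    c = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(g)
    # The output gate decides what part of the memory to expose
    h = sigmoid(o) * np.tanh(c)
    return h, c


rng = np.random.RandomState(0)
batch, input_size, hidden = 2, 3, 5
W = rng.normal(scale=0.1, size=(input_size + hidden, 4 * hidden))
b = np.zeros(4 * hidden)
h = np.zeros((batch, hidden))
c = np.zeros((batch, hidden))
for t in range(4):
    x_t = rng.normal(size=(batch, input_size))
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (2, 5) (2, 5)
```

The gates are ordinary sigmoid units whose weights live in W and b, so they are trained by backpropagation together with the rest of the network.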
We create an LSTM cell with tf.contrib.rnn.BasicLSTMCell() and feed it to tf.nn.dynamic_rnn(), just as we did at the start of this chapter. We also give dynamic_rnn() the length of each sequence in a batch of examples, using the _seqlens placeholder we created earlier. TensorFlow uses this to stop all RNN steps beyond the last real sequence element. It also returns all output vectors over time (in the outputs tensor), which are all zero-padded beyond the true end of the sequence. So, for example, if the length of our original sequence is 5 and we zero-pad it to a sequence of length 15, the output for all time steps beyond 5 will be zero:
with tf.variable_scope("lstm"):
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,
                                             forget_bias=1.0)
    outputs, states = tf.nn.dynamic_rnn(lstm_cell, embed,
                                        sequence_length=_seqlens,
                                        dtype=tf.float32)

weights = {
    'linear_layer': tf.Variable(tf.truncated_normal([hidden_layer_size,
                                                     num_classes],
                                                    mean=0, stddev=.01))
}
biases = {
    'linear_layer': tf.Variable(tf.truncated_normal([num_classes],
                                                    mean=0, stddev=.01))
}

# Extract the last relevant output and use in a linear layer
final_output = tf.matmul(states[1],
                         weights["linear_layer"]) + biases["linear_layer"]
softmax = tf.nn.softmax_cross_entropy_with_logits(logits=final_output,
                                                  labels=_labels)
cross_entropy = tf.reduce_mean(softmax)
We take the last valid output vector (in this case conveniently available for us in the states tensor returned by dynamic_rnn()) and pass it through a linear layer (and the softmax function), using it as our final prediction. We will explore the concepts of last relevant output and zero-padding further in the next section, when we look at some outputs generated by dynamic_rnn() for our example sentences.
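The linear layer and the softmax cross-entropy loss can be reproduced in a few lines of NumPy. The sketch below uses random numbers as a stand-in for the final LSTM state; the helper softmax_cross_entropy is our own and only mirrors the standard definition of the loss:

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    # Subtracting the row maximum leaves softmax unchanged but avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


rng = np.random.RandomState(0)
last_state = rng.normal(size=(4, 32))      # stand-in for the final state
W = rng.normal(scale=0.01, size=(32, 2))   # small initial weights
b = np.zeros(2)
labels = np.array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=np.float64)

logits = last_state @ W + b                # linear layer
loss = softmax_cross_entropy(logits, labels).mean()
print(loss)
```

With small initial weights the two classes receive nearly equal probability, so the loss starts close to log 2 (about 0.693) before any training takes place.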
We have all the pieces in the puzzle. Let’s put them together, and complete an end-to-end training of both word vectors and a classification model:
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(_labels, 1),
                              tf.argmax(final_output, 1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction,
                                   tf.float32))) * 100

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for step in range(1000):
        x_batch, y_batch, seqlen_batch = get_sentence_batch(batch_size,
                                                            train_x, train_y,
                                                            train_seqlens)
        sess.run(train_step, feed_dict={_inputs: x_batch,
                                        _labels: y_batch,
                                        _seqlens: seqlen_batch})

        if step % 100 == 0:
            acc = sess.run(accuracy, feed_dict={_inputs: x_batch,
                                                _labels: y_batch,
                                                _seqlens: seqlen_batch})
            print("Accuracy at %d: %.5f" % (step, acc))

    for test_batch in range(5):
        x_test, y_test, seqlen_test = get_sentence_batch(batch_size,
                                                         test_x, test_y,
                                                         test_seqlens)
        batch_pred, batch_acc = sess.run([tf.argmax(final_output, 1),
                                          accuracy],
                                         feed_dict={_inputs: x_test,
                                                    _labels: y_test,
                                                    _seqlens: seqlen_test})
        print("Test batch accuracy %d: %.5f" % (test_batch, batch_acc))

    output_example = sess.run([outputs], feed_dict={_inputs: x_test,
                                                    _labels: y_test,
                                                    _seqlens: seqlen_test})
    states_example = sess.run([states[1]], feed_dict={_inputs: x_test,
                                                      _labels: y_test,
                                                      _seqlens: seqlen_test})
As we can see, this is a pretty simple toy text classification problem:
Accuracy at 0: 32.81250
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000
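The accuracy values are percentages of sentences in a batch whose predicted class (the argmax of the logits) matches the true class (the argmax of the one-hot label). A tiny NumPy sketch with made-up logits shows the same computation:

```python
import numpy as np

# Made-up logits and one-hot labels for a batch of four sentences
logits = np.array([[2.0, -1.0],
                   [0.1, 0.3],
                   [-0.5, 0.2],
                   [1.5, 1.0]])
labels = np.array([[1, 0],
                   [0, 1],
                   [1, 0],
                   [1, 0]])

correct = np.argmax(labels, axis=1) == np.argmax(logits, axis=1)
accuracy = correct.astype(np.float32).mean() * 100
print(accuracy)  # 75.0
```

Here three of the four predictions agree with the labels, so the batch accuracy is 75 percent.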
We’ve also computed an example batch of outputs generated by dynamic_rnn(), to further illustrate the concepts of zero-padding and last relevant outputs discussed in the previous section.
Let’s take a look at one example of these outputs, for a sentence that was zero-padded (in your random batch of data you may see different output, of course; look for a sentence whose seqlen was lower than the maximal 6):
seqlen_test[1]
Out:
4

output_example[0][1].shape
Out:
(6, 32)
This output has, as expected, six time steps, each a vector of size 32. Let’s take a glimpse at its values (printing only the first few dimensions to avoid clutter):
output_example[0][1][:6, 0:3]
Out:
array([[-0.44493711, -0.51363373, -0.49310589],
       [-0.72036862, -0.68590945, -0.73340571],
       [-0.83176643, -0.78206956, -0.87831545],
       [-0.87982416, -0.82784462, -0.91132098],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]], dtype=float32)
We see that for this sentence, whose original length was 4, the last two time steps have zero vectors due to padding.
Finally, we look at the states vector returned by dynamic_rnn():
states_example[0][1][0:3]
Out:
array([-0.87982416, -0.82784462, -0.91132098], dtype=float32)
We can see that it conveniently stores for us the last relevant output vector—its values match the last relevant output vector before zero-padding.
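If only the outputs tensor were available, the last relevant output could still be recovered by indexing each sentence at time step seqlen - 1. The following NumPy sketch illustrates the idea on random data that we zero out beyond each length ourselves; it is an illustration, not part of the chapter's model code:

```python
import numpy as np

batch, max_len, hidden = 3, 6, 4
seqlens = np.array([4, 6, 3])
rng = np.random.RandomState(0)
outputs = rng.normal(size=(batch, max_len, hidden))

# Mimic zeroed outputs beyond each sentence's true length
mask = np.arange(max_len)[None, :] < seqlens[:, None]
outputs = outputs * mask[:, :, None]

# For every sentence, pick the output at its last real time step
last_relevant = outputs[np.arange(batch), seqlens - 1]
print(last_relevant.shape)  # (3, 4)
```

For the first sentence, whose length is 4, this selects the output at index 3, which is the last nonzero row, just as in the printed example.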
At this point, you may be wondering how to access and manipulate the word vectors and explore the trained representation. We show how to do so, including interactive embedding visualization, in the next chapter.
Earlier, we focused on a one-layer LSTM network for ease of exposition. Adding more layers is straightforward, using the MultiRNNCell() wrapper that combines multiple RNN cells into one multilayer cell.
Say, for example, we wanted to stack two LSTM layers in the preceding example. We can do this as follows:
num_LSTM_layers = 2
with tf.variable_scope("lstm"):
    lstm_cell_list = [tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,
                                                   forget_bias=1.0)
                      for ii in range(num_LSTM_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells=lstm_cell_list,
                                       state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(cell, embed,
                                        sequence_length=_seqlens,
                                        dtype=tf.float32)
We first define a list of LSTM cells, each created as before, and then feed the list into the tf.contrib.rnn.MultiRNNCell() wrapper.
Now our network has two layers of LSTM, so the states returned by dynamic_rnn() hold one state tuple per layer, and our earlier indexing no longer extracts the final state vector. To get the final state of the second layer, we simply adapt our indexing a bit:
# Extract the final state and use in a linear layer
final_output = tf.matmul(states[num_LSTM_layers - 1][1],
                         weights["linear_layer"]) + biases["linear_layer"]
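The double indexing is easier to read once the nesting is spelled out: the outer index picks a layer, and the inner index picks an element of that layer's state pair. The sketch below imitates this structure with plain Python; LayerState is a hypothetical stand-in that we define for illustration and is not a TensorFlow class:

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for a per-layer pair (cell state c, hidden state h)
LayerState = namedtuple("LayerState", ["c", "h"])

num_LSTM_layers = 2
batch, hidden = 3, 4

# One state pair per layer; values are filled with the layer number
states = tuple(
    LayerState(c=np.full((batch, hidden), layer + 0.5),
               h=np.full((batch, hidden), float(layer)))
    for layer in range(num_LSTM_layers)
)

# Outer index: last layer. Inner index 1: the hidden state h.
top_h = states[num_LSTM_layers - 1][1]
print(top_h.shape)  # (3, 4)
```

With a single layer there is no outer tuple, which is why the one-layer model indexes the state pair directly.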
In this chapter we introduced sequence models in TensorFlow. We saw how to implement a basic RNN model from scratch by using tf.scan() and built-in modules, as well as more advanced LSTM networks, for both text and image data. Finally, we trained an end-to-end text classification RNN with word embeddings, and showed how to handle sequences of variable length. In the next chapter, we dive deeper into word embeddings and word2vec. In Chapter 7, we will see some cool abstraction layers over TensorFlow, and how they can be used to train advanced text classification RNN models with considerably less effort.