Video question answering system

The following example focuses on building a video question answering model, and we will use Keras to define it.

In order to solve this problem, we will train the model using TensorFlow's high-level training facilities in a distributed setting.

Figure 6: Video Question Answering

As shown in figure 6, each video is sampled at 4 frames per second and is roughly 10 seconds long, so we have about 40 frames per video. We then ask questions about the video contents, like the ones shown in the figure.

So we are going to build a deep learning model that will take as an input:

  • Video: represented as an ordered sequence of frames, about 40 frames per video.
  • Question: A sequence of words asking about the video contents.

The model will output an answer to this question.
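
To make these inputs concrete, here is a minimal sketch of what a single training sample might look like (the 150 by 150 frame size is taken from the Keras code later in this section; the token values and the example answer are hypothetical):

import numpy as np

video_frames = np.random.rand(40, 150, 150, 3).astype('float32')  # 10 s at 4 fps -> about 40 RGB frames
question_tokens = [21, 7, 305, 12, 9]                              # word indices, e.g. "what is the dog doing"
answer = 'running'                                                 # one word from a predefined vocabulary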

This is a very interesting and challenging problem. If you took only a single frame and trained a CNN on it, you would model the visual information of that frame alone, which might not be representative of the whole video. By using the entire sequence of frames (or even a sample of them), the model can combine information across frames and take their order into account, which helps it understand the context and answer the question correctly.

This kind of problem was very difficult a few years ago and was not accessible to many researchers, but with TensorFlow as a platform and Keras as an API, a solution is accessible to anyone with basic Python scripting skills.

The following is the model that we are going to build and explain in detail:

Figure 7: General video question answering architecture

At a high level, we have two main branches in this proposed architecture. The first branch encodes the video frames into a single vector, and the other encodes the question, which is a sequence of words, into a single vector. So we have one vector encoding information about the entire video and another encoding information about the entire question; we then concatenate these two vectors into one single vector that encodes information about the entire problem.

What is really interesting about this deep learning architecture is that it takes the video as an input along with the semantic meaning represented in the question, maps both into a geometric space by turning them into vectors, and then lets deep learning learn interesting transformations of that space. We then take the vector that concatenates the encodings of the video and the question and pass it to a fully connected network, which ends with a softmax over a predefined vocabulary; we pick the word with the highest probability in this vocabulary as the answer to the question.

The following is more elaboration on the proposed architecture:

For the video branch, we start with the video as a sequence of frames, where each frame is just an RGB image. We pass each frame through a CNN to transform it into a single vector, using a pre-trained network as the CNN base. After passing all the video frames through the same CNN, we get the video encoded as a sequence of vectors. We then run this sequence of vectors through an LSTM (a recurrent type of network that can process sequences and takes order into account), which outputs a single vector representing the video.

For the question, we follow a very similar process: we represent the question as a sequence of integers, in which each integer stands for a word, and then map each word to a word vector via an embedding layer. So we get a sequence of vectors out of this sequence of words. Finally, we run it through a different LSTM that encodes the entire question as a single vector.
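
Roughly, the shapes flow through the two branches as follows (the specific dimensions, such as the 2048-dimensional CNN features and the 256/128 LSTM units, are assumptions taken from the Keras code shown later in this section):

Video branch:    frames (time, height, width, 3) → CNN per frame (time, 2048) → LSTM → video vector (256)
Question branch: word indices (100) → Embedding (100, 256) → LSTM → question vector (128)
Merge:           concatenate (384) → dense layers → softmax over the answer vocabulary (1000)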

So let's see the representation of the previous architecture in Keras:

Figure 8: Keras video question answering architecture

The Keras architecture is very similar to the previous one. For the video encoder, we start by representing the video as a 5D tensor: the first dimension/axis is the batch axis, then you have the time axis, and finally a 3D tensor (height, width, channels) which encodes each frame.

We then apply an InceptionV3 network, pre-trained on ImageNet, to every frame of the 5D tensor to extract one vector per frame. Out of this, we get a sequence of feature vectors, which is fed to an LSTM layer to produce a single vector for the video.

For the question part, we simply use an embedding layer to map the question words to vectors, and we also run this through an LSTM to produce a single vector.

At the top, we use a concatenation operation to bring these two vectors together, then we stack a bunch of dense layers and finally end up with a softmax over a predefined vocabulary. We train the model against a target answer word encoded as a one-hot vector.

So what does the implementation look like?

video = tf.keras.layers.Input(shape=(None, 150, 150, 3))
cnn = tf.keras.applications.InceptionV3(weights='imagenet',
                                        include_top=False,
                                        pooling='avg')
cnn.trainable = False
encoded_frames = tf.keras.layers.TimeDistributed(cnn)(video)
encoded_vid = tf.keras.layers.LSTM(256)(encoded_frames)

The previous code snippet represents the video encoding in just 5 lines of code.

In the first line, we specify the shape of the video input. It is a 5D tensor, but with the shape argument you do not explicitly mention the batch axis. The first axis, which is set to None, is the time axis; it is set to None so that we can encode videos with a variable number of frames. The rest of the shape argument describes each frame: a 150 by 150 RGB image.

In the second line, we instantiate an InceptionV3 network that automatically loads pre-trained weights (trained on ImageNet), and we configure this network to work as a feature extraction layer. We do not include the classifier part of the InceptionV3 network because we only want the convolutional base, and we apply average pooling on top of the bottleneck layer. The output of this line is a single vector per image/frame.
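
As a quick sanity check (a minimal sketch, assuming the same 150 by 150 RGB frames), you can pass a small batch of frames through this feature extractor on its own and inspect the output; with average pooling, InceptionV3 produces one 2048-dimensional vector per frame:

import numpy as np
import tensorflow as tf

cnn = tf.keras.applications.InceptionV3(weights='imagenet',
                                        include_top=False,
                                        pooling='avg')
frames = np.random.rand(2, 150, 150, 3).astype('float32')  # two dummy RGB frames
features = cnn.predict(frames)
print(features.shape)  # (2, 2048): one feature vector per frame

For real frames you would also normalize the pixel values with tf.keras.applications.inception_v3.preprocess_input before feeding them to the network.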

Someone may ask: why are we using a pre-trained InceptionV3 model? The reason is that we are dealing with a small dataset, and this dataset does not have enough data to allow the network to learn to extract interesting visual features from scratch.

So in order to get this network to actually work well, you really need to be leveraging these pre-trained weights.

In the third line, we set the CNN to be non-trainable, which means that during training we will not update its weights. It is a pre-trained model, and if we updated its weights while training on this new problem of question answering, we would likely destroy the representations that it has already learned on ImageNet.
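
A quick way to confirm that the freezing took effect (a minimal sketch reusing the cnn from the snippet above):

cnn.trainable = False
print(len(cnn.trainable_weights))      # 0: nothing will be updated during training
print(len(cnn.non_trainable_weights))  # all of InceptionV3's weights are frozen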

In the fourth line, we use a TimeDistributed layer to take this CNN and apply it to every step of the time axis of the video. What comes out of this is a 3D tensor representing the sequence of visual vectors extracted from the frames.
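
To see what TimeDistributed does in isolation, here is a small standalone sketch that wraps a toy Dense layer instead of the full InceptionV3 base; the same layer is applied independently at every timestep, and the time axis is preserved:

import tensorflow as tf

seq = tf.keras.layers.Input(shape=(40, 8))    # 40 timesteps, each an 8-dimensional vector
out = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(4))(seq)            # the same Dense(4) applied to every timestep
print(out.shape)                              # (batch, 40, 4): the time axis is preserved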

Finally, in the fifth line, we run this sequence tensor through an LSTM layer and this gives us one single vector encoding of the entire video.

As you can notice, when instantiating the Keras LSTM layer, you only need to specify one parameter: the number of units in the LSTM layer. So you do not have to go into the complex details of LSTMs. One principle of Keras is that best practices are included: every Keras layer has a well-optimized default configuration that takes these best practices into account, so you can rely on the Keras defaults being good ones.
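
For example, the following two instantiations behave identically, because the explicitly listed arguments are already the Keras defaults:

lstm = tf.keras.layers.LSTM(256)
lstm = tf.keras.layers.LSTM(256, activation='tanh',
                            use_bias=True, return_sequences=False)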

For the question part, we are going to encode the question in the following three lines of code:

question = tf.keras.layers.Input(shape=(100,), dtype='int32')
x = tf.keras.layers.Embedding(10000, 256, mask_zero=True)(question)
encoded_q = tf.keras.layers.LSTM(128)(x)

In the first line, we specify the input tensor of the question. Every question will be a sequence of 100 integers, and as a result we are limited to questions that are at most 100 words long.

In the second line, we embed every integer into a word vector via the embedding layer, and we enable masking on this embedding layer. This means that if a question is shorter than 100 words, it is padded with zeros up to length 100, and the mask tells the downstream layers to ignore those padded positions.
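
In practice the padding itself is done during preprocessing, for example with Keras's padding utility. A minimal sketch, where the token values are hypothetical:

import tensorflow as tf

questions = [[12, 7, 98, 4],                 # two tokenized questions of different lengths;
             [5, 33, 2, 41, 8, 19]]          # index 0 is reserved for padding
padded = tf.keras.preprocessing.sequence.pad_sequences(
    questions, maxlen=100, padding='post', value=0)
print(padded.shape)                          # (2, 100): every question is now 100 integers long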

In the third line, we propagate this through the LSTM layer to encode this sequence of word vectors into one vector.

And finally this is how you end up with the answer word:

x = tf.keras.layers.concatenate([encoded_vid, encoded_q])
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
outputs = tf.keras.layers.Dense(1000)(x)

In the first line, you take the video vector and the question vector and concatenate them with just a concatenate operation, and finally you apply a couple of dense layers, ending up with 1000 units. So we will have a vocabulary of just 1000 different words.

And here's the step at which you are specifying the training configuration:

model = tf.keras.models.Model([video, question], outputs)
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.softmax_cross_entropy)

Here you are just instantiating a model, which is a container for a graph of layers, by specifying what the inputs of the model are and what the outputs are, and you are telling the model that it should use the Adam optimizer during training and use a softmax cross-entropy loss computed on the logits.
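
Once the model is instantiated, you can inspect the resulting graph of layers and its inputs and outputs:

model.summary()          # layer-by-layer overview of the whole graph
print(model.inputs)      # the video tensor and the question tensor
print(model.outputs)     # the 1000-way logits tensor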

You can notice that when we specified our classification layer with 1000 units, we did not specify any activation, so it is actually a purely linear layer; the softmax activation is included with the loss.
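
This also means that at inference time you apply the softmax yourself to turn the logits into probabilities and then look the winning index up in the answer vocabulary. A minimal sketch, where answer_vocab (a list of the 1000 answer words), video_batch and question_batch are assumed to be prepared elsewhere:

import numpy as np

logits = model.predict([video_batch, question_batch])        # shape (batch_size, 1000), raw logits
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))  # softmax, applied outside the model
probs /= probs.sum(axis=-1, keepdims=True)
best = probs.argmax(axis=-1)                                 # index of the most likely answer word
answers = [answer_vocab[i] for i in best]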

To sum up, this is the entire code, which is about 15 lines, so it is very short. We are essentially turning this very complex architecture, including loading pre-trained weights, into just a few lines of code.

video = tf.keras.layers.Input(shape=(None, 150, 150, 3))
cnn = tf.keras.applications.InceptionV3(weights='imagenet',
                                        include_top=False,
                                        pooling='avg')

cnn.trainable = False
encoded_frames = tf.keras.layers.TimeDistributed(cnn)(video)
encoded_vid = tf.keras.layers.LSTM(256)(encoded_frames)

question = tf.keras.layers.Input(shape=(100,), dtype='int32')
x = tf.keras.layers.Embedding(10000, 256, mask_zero=True)(question)
encoded_q = tf.keras.layers.LSTM(128)(x)

x = tf.keras.layers.concatenate([encoded_vid, encoded_q])
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
outputs = tf.keras.layers.Dense(1000)(x)

model = tf.keras.models.Model([video, question], outputs)
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.softmax_cross_entropy)

As we mentioned, because this implementation of Keras is built from the ground up for TensorFlow, it is fully compatible with things like estimators and experiments. So in just a couple of lines, you can wrap the model as an estimator and instantiate a TensorFlow experiment, which gives you access to distributed training, training on Cloud ML, and so on.

So you can start running your experiment with your question answering model, reading your video, question, and answer data from a pandas data frame, and run it on a cluster of GPUs in just a few lines:

train_panda_dataframe = pandas.read_hdf(...)

train_inputs = tf.estimator.inputs.pandas_input_fn(
    train_panda_dataframe,
    batch_size=32,
    shuffle=True,
    target_column='answer')

eval_inputs = tf.estimator.inputs.pandas_input_fn(...)

# Wrap the Keras model as an estimator so it can be used with an Experiment.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)

exp = tf.contrib.learn.Experiment(
    estimator,
    train_input_fn=train_inputs,
    eval_input_fn=eval_inputs)

exp.train_and_evaluate()