Attention mechanism for image captioning

From the introduction so far, it should be clear that the attention mechanism works on a sequence of objects, assigning each element in the sequence a weight for a specific step of the required output. At every subsequent step, not only the element being attended to but also the attention weights themselves can change. So, attention-based architectures are essentially sequence networks, best implemented in deep learning using RNNs (or their variants).
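
To make this concrete, here is a minimal sketch of one attention step over a sequence, written in plain NumPy with illustrative function and variable names of our own (attend, softmax, and so on are not from any particular library): each element is scored against a query that summarizes what the current output step needs, the scores are normalized into weights with a softmax, and the weighted sum of the elements becomes the context for that step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(sequence, query):
    """One attention step over a sequence of element vectors.

    sequence: (n_elements, d) array, one vector (embedding) per element
    query:    (d,) vector summarizing what the current output step needs
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = sequence @ query     # dot-product relevance scores, shape (n_elements,)
    weights = softmax(scores)     # weights sum to 1 across the sequence
    context = weights @ sequence  # weighted sum of elements, shape (d,)
    return weights, context

# Illustrative usage: 5 sequence elements, 4-dimensional embeddings
seq = np.random.randn(5, 4)
q_step1 = np.random.randn(4)   # query for the first output step
q_step2 = np.random.randn(4)   # a different query for the next step
w1, _ = attend(seq, q_step1)
w2, _ = attend(seq, q_step2)
print(w1, w2)                  # the weight distribution shifts between steps
```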

The question now is: how do we implement sequence-based attention on a static image, especially one represented by a convolutional neural network (CNN)? Let's take an example that sits right in between text and images to understand this. Assume that we need to caption an image with respect to its contents.

We have some images with human-provided captions as training data; using these, we need to create a system that can provide a decent caption for any new image the model has not seen before. As before, let's take an example and see how we, as humans, would approach this task, and what the analogous process implemented with deep learning and CNNs would look like. Consider the following image and some plausible captions for it, ranked heuristically using human judgment:

Some probable captions, ordered from most likely to least likely, are:

  • Woman seeing dog in snow forest
  • Brown dog in snow
  • A person wearing cap in woods and white land
  • Dog, tree, snow, person, and sunshine

An important thing to note here is that, even though the woman is not at the center of the image and the dog is not the biggest object in it, the captions we judged most probable focus on them first and only then on their surroundings. This is because we consider them the important entities here (given no prior context). As humans, we reached this conclusion as follows: we first glanced at the whole image, then focused on the woman in high resolution while pushing everything else into the background (think of the bokeh effect on a dual-camera phone) and identified the caption fragment for her. Next, we focused on the dog in high resolution while keeping everything else in low resolution, and appended the corresponding caption fragment. Finally, we did the same for the surroundings and their caption fragments.

So, essentially, we viewed the image in the following sequence to arrive at the first caption:

 
Image 1: Glance at the whole image first
Image 2: Focus on the woman
Image 3: Focus on the dog
Image 4: Focus on the snow
Image 5: Focus on the forest

In terms of attention weight or focus, after glancing at the image, we focus on the most important object first: here, the woman. This is analogous to creating a mental frame in which we keep the part of the image containing the woman in high resolution and the remaining part of the image in low resolution.

In deep learning terms, at this step of the output sequence the attention weights will be highest for the vector (embedding) representing the concept of the woman. At the next step of the output sequence, the weight shifts more towards the vector representation of the dog, and so on.

To understand this intuitively, we convert the CNN representation of the image into a flattened vector (or a similar structure); we then create different splices of the image, that is, sequences with different parts at varying resolutions. Also, as we know from our discussion in Chapter 7, Object-Detection & Instance-Segmentation with CNN, for effective detection we must have the relevant portions at varying scales as well. The same concept applies here too: besides resolution, we also vary the scale. For now, though, we will keep things simple and ignore scale for the sake of intuition.
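
As a rough sketch of the first half of this idea, assume a hypothetical CNN encoder whose last convolutional block outputs a feature map of shape (H, W, C). That feature map can be flattened into a sequence of H x W region vectors of dimension C, and this sequence of regions then plays the same role a sequence of words did earlier. The shapes and the helper name below are assumptions for illustration only, not any library's API.

```python
import numpy as np

def feature_map_to_sequence(feature_map):
    """Flatten a CNN feature map (H, W, C) into a sequence of H*W region vectors.

    Each row of the result describes one spatial location of the image, so an
    attention mechanism can weight image regions the way it weights words.
    """
    h, w, c = feature_map.shape
    return feature_map.reshape(h * w, c)

# Illustrative usage: a hypothetical 14x14 feature map with 512 channels,
# as might come from the last convolutional block of a typical CNN encoder
fmap = np.random.randn(14, 14, 512)
regions = feature_map_to_sequence(fmap)
print(regions.shape)  # (196, 512): a "sequence" of 196 region vectors
```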

These splices, or sequences of image parts, now act like the sequence of words in our earlier example, and hence can be fed into an RNN/LSTM or a similar sequence-based architecture for attention. This is done to get the best-suited word as the output at every iteration. So the first iteration of the sequence produces woman (from the weights of a sequence representing the woman, as in Image 2) → the next iteration produces seeing (from a sequence identifying the back of the woman, as in Image 2) → dog (sequence as in Image 3) → in (from a sequence where everything is blurred, generating filler words that transition from entities to surroundings) → snow (sequence as in Image 4) → forest (sequence as in Image 5).
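
A compact, purely illustrative sketch of this generation loop is shown below. The weight matrices are random stand-ins for learned parameters and the vocabulary is the toy one from this example, so it only demonstrates the flow (score the regions against the decoder state, form a context vector, update the state, emit the best-suited word), not a trained captioner.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def generate_caption(regions, vocab, n_words, d_state=64, seed=0):
    """Toy attention-based decoder: one word per step over image region vectors.

    regions: (n_regions, d_feat) sequence produced from the CNN feature map
    vocab:   list of candidate words
    All weight matrices below are random stand-ins for learned parameters.
    """
    rng = np.random.default_rng(seed)
    n_regions, d_feat = regions.shape
    W_att = rng.standard_normal((d_state, d_feat)) * 0.1               # scores regions against the state
    W_state = rng.standard_normal((d_state, d_state + d_feat)) * 0.1   # simple RNN-style state update
    W_out = rng.standard_normal((len(vocab), d_state)) * 0.1           # state -> word scores

    state = np.zeros(d_state)
    caption = []
    for _ in range(n_words):
        scores = regions @ (W_att.T @ state)      # relevance of each region to this step
        alpha = softmax(scores)                   # attention weights over image regions
        context = alpha @ regions                 # focused summary of the image for this step
        state = np.tanh(W_state @ np.concatenate([state, context]))  # update decoder state
        word = vocab[int(np.argmax(W_out @ state))]                   # pick the best-suited word
        caption.append(word)
    return caption

# Illustrative usage with the toy vocabulary from this example
regions = np.random.randn(196, 512)
print(generate_caption(regions, ["woman", "seeing", "dog", "in", "snow", "forest"], n_words=6))
```

Note that the decoder starts from a zero state, so the attention weights begin uniform over all regions, mirroring the initial glance over the whole image; as the state evolves, the weights concentrate on different regions at different steps.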

Filler words such as in and action words such as seeing can also be learned automatically when the best mapping from image splices/sequences to human-generated captions is learned across many images. In a simpler version, a caption such as Woman, Dog, Snow, and Forest can also be a good depiction of the entities and surroundings in the image.
