In reality, in our recent image caption example, several more regions would be selected, but because of our training with the handwritten captions, those would never receive higher weights. The essential thing to understand, however, is how the system decides which pixels (or, more precisely, their CNN representations) to focus on when drawing these high-resolution views of different aspects of the image, and then how it chooses the next pixel to repeat the process.
In the preceding example, the points are sampled at random from a distribution and the process is repeated. Which pixels around each sampled point receive a higher resolution is decided inside the attention network. This type of attention is known as hard attention.
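To make the sampling step concrete, here is a minimal NumPy sketch of hard attention. The feature-map shape, the random scores standing in for the attention network's output, and all variable names are illustrative assumptions, not the book's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CNN feature map: a 14x14 spatial grid of 512-dim vectors,
# flattened to 196 locations (shapes are illustrative assumptions).
features = rng.standard_normal((196, 512))

# An attention network would score each location; we stand in for it
# with random scores and turn them into a probability distribution.
scores = rng.standard_normal(196)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over locations

# Hard attention: SAMPLE a single location from the distribution and
# use only that location's feature vector as the context.
idx = rng.choice(len(probs), p=probs)
context = features[idx]                          # shape (512,)
```

The key point is that `idx` comes from a random draw: the chosen location changes from run to run, which is exactly what causes the training difficulty discussed next.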
Hard attention suffers from what is called the differentiability problem. Let's spend some time understanding this. We know that in deep learning, networks have to be trained, and to train them we iterate over training batches in order to minimize the loss function. We minimize the loss function by updating the weights in the direction opposite to its gradient, which is obtained by differentiating the loss function with respect to the weights.
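As a toy illustration of that update rule, the following sketch minimizes the simple loss L(w) = (w - 3)^2 by repeatedly stepping against its gradient (the loss and learning rate are assumptions chosen for illustration):

```python
# Toy gradient descent: minimize L(w) = (w - 3)^2.
# The derivative is dL/dw = 2 * (w - 3), and each step moves the
# weight in the direction OPPOSITE to the gradient.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad

# w converges toward the minimum at w = 3.
```

Notice that the whole procedure relies on being able to differentiate the loss with respect to the weights; this is precisely what breaks down for hard attention.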
However, since the points in hard attention are chosen randomly at each iteration, and since such a random sampling mechanism is not a differentiable function, we cannot train this attention mechanism with gradient descent as described. This problem is overcome either by using Reinforcement Learning (RL) or by switching to soft attention.
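Soft attention sidesteps the problem by replacing the random draw with a weighted average over all locations, which is a smooth function of the attention scores. A minimal NumPy sketch, with the same illustrative shapes and names assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup: 196 CNN feature locations, 512-dim each.
features = rng.standard_normal((196, 512))
scores = rng.standard_normal(196)
alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights

# Soft attention: a weighted average over ALL locations. Every feature
# contributes in proportion to its weight, so the context vector is a
# differentiable function of the scores, and gradients can flow back
# through the softmax during training.
context = features.T @ alphas                    # shape (512,)
```

Because no sampling occurs, ordinary backpropagation suffices, which is why soft attention is the more common choice in practice.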