In reality, in our recent image caption example, several more regions would be selected, but because of our training with the handwritten captions, those would never receive higher weights. The essential thing to understand, however, is how the system decides which pixels (or, more precisely, their CNN representations) to focus on when drawing these high-resolution views of different aspects of the image, and then how it chooses the next pixel to repeat the process.
In the preceding example, the points are sampled at random from a distribution and the process is repeated. Which pixels around each sampled point receive a higher resolution is decided inside the attention network. This type of attention is known as hard attention.
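To make the sampling step concrete, here is a minimal NumPy sketch of hard attention. The feature-map shape, the random scores standing in for the attention network's output, and all variable names are illustrative assumptions, not the book's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CNN feature map: a 14x14 spatial grid of 512-dim vectors,
# flattened to 196 locations (shapes are illustrative assumptions).
features = rng.standard_normal((196, 512))

# An attention network would score each location; we stand in for it
# with random scores and turn them into a probability distribution.
scores = rng.standard_normal(196)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over locations

# Hard attention: SAMPLE a single location from the distribution and
# use only that location's feature vector as the context.
idx = rng.choice(len(probs), p=probs)
context = features[idx]                          # shape (512,)
```

The key point is that `idx` comes from a random draw: the chosen location changes from run to run, which is exactly what causes the training difficulty discussed next.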
Hard attention suffers from what is called the differentiability problem. Let's spend some time understanding this. We know that in deep learning, networks have to be trained, and to train them we iterate over training batches in order to minimize the loss function. We minimize the loss function by updating the weights in the direction opposite to its gradient, which is obtained by differentiating the loss function with respect to the weights.
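As a toy illustration of that update rule, the following sketch minimizes the simple loss L(w) = (w - 3)^2 by repeatedly stepping against its gradient (the loss and learning rate are assumptions chosen for illustration):

```python
# Toy gradient descent: minimize L(w) = (w - 3)^2.
# The derivative is dL/dw = 2 * (w - 3), and each step moves the
# weight in the direction OPPOSITE to the gradient.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad

# w converges toward the minimum at w = 3.
```

Notice that the whole procedure relies on being able to differentiate the loss with respect to the weights; this is precisely what breaks down for hard attention.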
However, since the points in hard attention are chosen randomly at each iteration, and since such a random sampling mechanism is not a differentiable function, we cannot train this attention mechanism with gradient descent as described. This problem is overcome either by using Reinforcement Learning (RL) or by switching to soft attention.
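Soft attention sidesteps the problem by replacing the random draw with a weighted average over all locations, which is a smooth function of the attention scores. A minimal NumPy sketch, with the same illustrative shapes and names assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup: 196 CNN feature locations, 512-dim each.
features = rng.standard_normal((196, 512))
scores = rng.standard_normal(196)
alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights

# Soft attention: a weighted average over ALL locations. Every feature
# contributes in proportion to its weight, so the context vector is a
# differentiable function of the scores, and gradients can flow back
# through the softmax during training.
context = features.T @ alphas                    # shape (512,)
```

Because no sampling occurs, ordinary backpropagation suffices, which is why soft attention is the more common choice in practice.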