Soft Attention

As introduced in the preceding sub-section, hard attention uses RL to progressively train the model and determine where to look next (a control problem).

There exist two major problems with using the combination of hard attention and RL to achieve the required objective:

  • Involving RL makes the training pipeline more complicated, since an RL agent and the RNN/deep network built around it have to be trained separately.
  • The variance in the gradient of the policy function is not only high (as in the A3C model), but the estimator also has a computational complexity of O(N), where N is the number of units in the network. This increases the computational load of such approaches massively. Moreover, the attention mechanism adds the most value for very long sequences (of words or image embedding slices), and training networks on longer sequences requires more memory and hence much deeper networks, so this approach is not computationally efficient.
The policy function in RL, usually written as π(a|s) (with its associated action-value function Q(s, a)), is used to determine the optimal action (a) that should be taken in any given state (s) to maximize the expected reward.
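To make the first point above concrete, the following is a minimal, hypothetical sketch (in PyTorch, not code from this book) of why a hard, discrete attention choice forces an RL-style estimator: the sampled location cannot be back-propagated through, so the gradient of the attention scores is estimated from the log-probability of the sampled action weighted by a reward (a REINFORCE-style surrogate), which is exactly the high-variance policy-gradient estimate discussed above.

    # Hypothetical sketch: hard attention as a discrete action that needs an
    # RL-style (score-function) gradient estimate. All names are illustrative.
    import torch

    scores = torch.randn(5, requires_grad=True)       # unnormalized attention scores
    policy = torch.distributions.Categorical(logits=scores)

    location = policy.sample()                         # discrete choice: non-differentiable
    reward = torch.tensor(1.0)                         # stand-in for the task reward

    # REINFORCE-style surrogate loss; its gradient is an unbiased but
    # high-variance estimate of the true policy gradient.
    loss = -policy.log_prob(location) * reward
    loss.backward()
    print(scores.grad)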

So what is the alternative? As we discussed, the problem arose because the mechanism we chose for attention led to a non-differentiable function, which is why we had to resort to RL. So let's take a different approach here. Continuing the analogy of our language modeling example (from the Attention Mechanism - Intuition section), assume that we have the vectors of the tokens for the objects/words present in the attention network. Also, in the same vector space (say, the embedding hyperspace), we bring in the tokens for the objects/words of the required query at the particular sequence step. With this approach, finding the right attention weights for the tokens in the attention network with respect to the tokens in the query space is as easy as computing a vector similarity between them; for example, a cosine distance, as sketched below. Fortunately, most vector distance and similarity functions are differentiable; hence, a loss function derived from such vector distance/similarity functions in this space is also differentiable, and back-propagation works in this scenario.
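As a minimal sketch of this idea (the array names, sizes, and random values below are hypothetical, not taken from the book), the attention weights can be obtained by computing the cosine similarity between the query vector and each token embedding, normalizing with a softmax, and forming a weighted context vector:

    # Minimal sketch, assuming a query vector and a matrix of token embeddings;
    # names and dimensions are illustrative only.
    import numpy as np

    def cosine_similarity(query, tokens):
        """Cosine similarity between the query and each row of tokens."""
        norms = np.linalg.norm(tokens, axis=1) * np.linalg.norm(query) + 1e-9
        return (tokens @ query) / norms

    tokens = np.random.randn(6, 4)    # 6 token embeddings of dimension 4
    query = np.random.randn(4)        # query embedding at the current step

    sims = cosine_similarity(query, tokens)        # one similarity score per token
    weights = np.exp(sims) / np.exp(sims).sum()    # softmax -> soft attention weights
    context = weights @ tokens                     # differentiable weighted sum

    print(weights.round(3), context.round(3))

Every operation here (dot products, norms, softmax, weighted sum) is differentiable, which is what allows gradients to flow back through the attention weights during training.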

The cosine distance between two vectors, say $\mathbf{A} = (A_1, A_2, A_3)$ and $\mathbf{B} = (B_1, B_2, B_3)$, in a multi-dimensional (three in this example) vector space is given as:

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{A_1 B_1 + A_2 B_2 + A_3 B_3}{\sqrt{A_1^2 + A_2^2 + A_3^2}\,\sqrt{B_1^2 + B_2^2 + B_3^2}}$$
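As a quick numerical illustration (an example chosen here, not one from the text), for $\mathbf{A} = (1, 2, 2)$ and $\mathbf{B} = (2, 1, 2)$:

$$\cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1^2 + 2^2 + 2^2}\,\sqrt{2^2 + 1^2 + 2^2}} = \frac{8}{3 \cdot 3} \approx 0.89$$

so the two vectors point in very similar directions and would receive a correspondingly high attention weight.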

This approach of using a differentiable loss function for training an attention network is known as soft attention.
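To show that back-propagation really does flow through such a layer end to end, here is a small, hypothetical PyTorch sketch (the dot-product scoring and all names are illustrative choices, not the book's code); a scalar loss on the attended context produces gradients for both the query and the token embeddings:

    # Hypothetical sketch of a differentiable (soft) attention step.
    import torch

    torch.manual_seed(0)
    keys = torch.randn(6, 4, requires_grad=True)   # token embeddings
    query = torch.randn(4, requires_grad=True)     # query at the current step

    scores = keys @ query                          # similarity of each token to the query
    weights = torch.softmax(scores, dim=0)         # soft attention weights
    context = weights @ keys                       # weighted sum of token embeddings

    loss = context.sum()                           # stand-in for a real training loss
    loss.backward()                                # gradients reach both keys and query
    print(query.grad, keys.grad.shape)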
