The policy

When the actions are discrete and limited in number, the most common approach is to create a parameterized policy that produces a numerical value for each action.

Note that, unlike in the Deep Q-Network (DQN) algorithm, the output values of the policy here aren't the Q(s,a) action values.
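As a concrete illustration (this is only a sketch, not the book's implementation), such a parameterized policy can be a small neural network whose last layer has one output per action and no activation, so that it returns a raw score (logit) for each action. The architecture and the CartPole environment used here are assumptions made for the example:

import gym
import tensorflow as tf

# Hypothetical example: a small MLP policy with one unnormalized output (logit) per action
env = gym.make('CartPole-v1')  # assumed environment with a discrete action space
n_actions = env.action_space.n

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=env.observation_space.shape),
    tf.keras.layers.Dense(n_actions)  # no activation: these are the raw scores fed to the softmax
])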

Then, each output value is converted to a probability. This operation is performed with the softmax function, which is given as follows:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The softmax values are normalized so that they sum to one, producing a probability distribution in which each value corresponds to the probability of selecting a given action.

The next two plots show an example of five action-value predictions before (the plot on the left) and after (the plot on the right) the softmax function is applied to them. Indeed, from the plot on the right, you can see that, after the softmax is computed, the new values sum to one and are all greater than zero:

The right plot indicates that actions 0, 1, 2, 3, and 4 will be selected with approximate probabilities of 0.64, 0.02, 0.09, 0.21, and 0.02, respectively.
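As a quick check, the following minimal sketch (with made-up logits, not the values used to generate the plots) applies the softmax formula given above and verifies that the result is a valid probability distribution:

import numpy as np

# Hypothetical action-value predictions (logits); any five numbers work
logits = np.array([2.0, -1.5, 0.1, 0.9, -1.4])

# Softmax: exponentiate and normalize so that the values sum to one
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs)        # each value is in (0, 1)
print(probs.sum())  # 1.0 (up to floating-point error)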

To use a softmax distribution on the action values returned by the parameterized policy, we can use the code given in the Computing the gradient section, with only one change, which is highlighted in the following snippet:

pi = policy(states) # unnormalized score (logit) for each action
onehot_action = tf.one_hot(actions, depth=env.action_space.n)

pi_log = tf.reduce_sum(onehot_action * tf.nn.log_softmax(pi), axis=1) # instead of tf.math.log(pi)

pi_loss = -tf.reduce_mean(pi_log * Q_function(states, actions))
gradients = tf.gradients(pi_loss, variables)

Here, we used tf.nn.log_softmax because it has been designed to be more numerically stable than calling tf.nn.softmax first and then tf.math.log.
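The difference is easy to see with extreme logits. The following sketch (assuming TensorFlow 2's eager execution; the logit values are hypothetical) compares the two approaches:

import tensorflow as tf

logits = tf.constant([[1000.0, 0.0, -1000.0]])

# Naive version: the softmax underflows to 0 for the small logits, so the log becomes -inf
print(tf.math.log(tf.nn.softmax(logits)))   # [[0., -inf, -inf]]

# Stable version: the log-softmax is computed directly, without the intermediate underflow
print(tf.nn.log_softmax(logits))            # [[0., -1000., -2000.]]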

An advantage of drawing actions from a stochastic distribution is the intrinsic randomness of the selected actions, which enables dynamic exploration of the environment. This may seem like a side effect, but it is very important to have a policy that can adapt its level of exploration by itself.

In the case of DQN, we had to use a hand-crafted ε (epsilon) variable to adjust the exploration throughout training, using linear ε-decay. Now that exploration is built into the policy, at most we have to add a term (the entropy) to the loss function in order to incentivize it.
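For illustration only, an entropy bonus could be added to the loss computed in the earlier snippet as follows. This is a sketch that reuses pi, pi_log, states, and actions from above, and the coefficient beta is a hypothetical hyperparameter that scales the amount of exploration:

# Entropy of the softmax policy, computed from the same logits (pi) as before
entropy = -tf.reduce_sum(tf.nn.softmax(pi) * tf.nn.log_softmax(pi), axis=1)

beta = 0.01  # hypothetical entropy coefficient

# Adding the entropy term inside the negated mean rewards the policy for staying stochastic
pi_loss = -tf.reduce_mean(pi_log * Q_function(states, actions) + beta * entropy)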
