Computing the gradient

As long as the policy is differentiable, its gradient can be computed easily by taking advantage of modern automatic differentiation software.

To do that in TensorFlow, we can define the computational graph and call tf.gradients(loss_function, variables) to calculate the gradient of the loss function (loss_function) with respect to the trainable parameters (variables). An alternative is to maximize the objective function directly with a stochastic gradient descent optimizer, for example, by calling tf.train.AdamOptimizer(lr).minimize(-objective_function).

The following snippet shows the steps required to compute the approximation in formula (6.5) for a policy with a discrete action space of dimension env.action_space.n:

pi = policy(states) # probability of each action, for each state
onehot_action = tf.one_hot(actions, depth=env.action_space.n)
pi_log = tf.reduce_sum(onehot_action * tf.math.log(pi), axis=1)

pi_loss = -tf.reduce_mean(pi_log * Q_function(states, actions))

# calculate the gradients of pi_loss with respect to the variables
gradients = tf.gradients(pi_loss, variables)

# or directly optimize pi_loss with Adam (or any other SGD optimizer)
# pi_opt = tf.train.AdamOptimizer(lr).minimize(pi_loss)

tf.one_hot produces a one-hot encoding of actions; that is, it produces a mask with a 1 in the position corresponding to the numerical value of each action and 0s everywhere else.
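
As a quick illustration, the following minimal sketch shows the mask produced by tf.one_hot and the log probabilities it selects; the probabilities for two states over three actions are made up purely for demonstration, and the snippet assumes TensorFlow 1.x:

import tensorflow as tf  # assuming TensorFlow 1.x

# made-up action probabilities for two states and three actions
pi = tf.constant([[0.2, 0.5, 0.3],
                  [0.7, 0.1, 0.2]])
actions = tf.constant([1, 0])  # action taken in each state

mask = tf.one_hot(actions, depth=3)                     # [[0,1,0], [1,0,0]]
pi_log = tf.reduce_sum(mask * tf.math.log(pi), axis=1)  # [log(0.5), log(0.7)]

with tf.Session() as sess:
    print(sess.run(mask))     # the one-hot mask
    print(sess.run(pi_log))   # approximately [-0.69, -0.36]

Only the probability of the action that was actually taken survives the sum along each row.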

Returning to the main snippet, its third line multiplies the mask by the logarithm of the action probabilities in order to obtain the log probability of the chosen actions. The fourth line then computes the loss as follows:

$-\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)\, Q(s_i, a_i)$

And finally, tf.gradients calculates the gradients of pi_loss with respect to variables, as in formula (6.5).
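
To put all of these calls in context, here is a self-contained sketch, assuming TensorFlow 1.x and a Gym environment with a discrete action space (CartPole-v1 is used here only as an example); the placeholder names, the two-layer network, and the randomly generated batch fed to the train operation are illustrative rather than a definitive implementation:

import gym
import numpy as np
import tensorflow as tf  # assuming TensorFlow 1.x

env = gym.make('CartPole-v1')
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n

states_ph = tf.placeholder(tf.float32, [None, obs_dim], name='states')
actions_ph = tf.placeholder(tf.int32, [None], name='actions')
q_values_ph = tf.placeholder(tf.float32, [None], name='q_values')  # action-value estimates

# a small policy network that outputs action probabilities
hidden = tf.layers.dense(states_ph, 32, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, act_dim)
pi = tf.nn.softmax(logits)

onehot_action = tf.one_hot(actions_ph, depth=act_dim)
pi_log = tf.reduce_sum(onehot_action * tf.math.log(pi), axis=1)
pi_loss = -tf.reduce_mean(pi_log * q_values_ph)

# option 1: compute the gradients explicitly, then apply them
variables = tf.trainable_variables()
gradients = tf.gradients(pi_loss, variables)
optimizer = tf.train.AdamOptimizer(1e-3)
train_op = optimizer.apply_gradients(zip(gradients, variables))

# option 2: let the optimizer compute and apply the gradients in one call
# train_op = tf.train.AdamOptimizer(1e-3).minimize(pi_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # dummy batch, only to show the call signature
    batch_states = np.random.randn(5, obs_dim)
    batch_actions = np.random.randint(0, act_dim, size=5)
    batch_q = np.random.randn(5)
    sess.run(train_op, feed_dict={states_ph: batch_states,
                                  actions_ph: batch_actions,
                                  q_values_ph: batch_q})

In practice, the values fed into q_values_ph would come from the returns or action-value estimates collected during the rollouts, not from random numbers.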
