The gradient of the policy

The objective of RL is to maximize the expected return (the total reward, discounted or undiscounted) of a trajectory. The objective function can then be expressed as:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] \qquad (6.1)

where θ represents the parameters of the policy, such as the trainable variables of a deep neural network.
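
To make equation (6.1) concrete, the following is a minimal sketch of a Monte Carlo estimate of J(θ). It assumes the trajectories have already been sampled with the current policy and are represented simply as lists of rewards; the function names are illustrative, not taken from any particular library.

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return of a single trajectory: the sum over t of gamma**t * r_t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

def estimate_objective(sampled_reward_sequences, gamma=0.99):
    """Monte Carlo estimate of J(theta): the average (discounted) return
    of trajectories sampled with the current policy pi_theta."""
    returns = [discounted_return(rews, gamma) for rews in sampled_reward_sequences]
    return np.mean(returns)

For example, estimate_objective([[1.0, 0.0, 1.0], [0.5, 0.5]]) averages the discounted returns of two sampled trajectories.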

In PG methods, the maximization of the objective function is carried out through its gradient, ∇θJ(θ). Using gradient ascent, we can improve J(θ) by moving the parameters in the direction of the gradient, since the gradient points in the direction in which the function increases.

We have to follow the same direction as the gradient (rather than the opposite direction, as in gradient descent) because we aim to maximize the objective function (6.1).
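
As a minimal sketch of this update rule, θ ← θ + α∇θJ(θ), under the assumption that the parameters and the gradient estimate are NumPy arrays (the function name and learning rate are illustrative):

import numpy as np

def gradient_ascent_step(theta, grad_j, learning_rate=0.01):
    """One gradient-ascent update of the policy parameters.

    theta  : current parameter vector
    grad_j : an estimate of the gradient of J(theta) with respect to theta

    Note the plus sign: we move *along* the gradient because we are
    maximizing J(theta), whereas gradient descent on a loss uses a minus sign.
    """
    return np.asarray(theta) + learning_rate * np.asarray(grad_j)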

Once the maximum is found, the policy, πθ, will produce trajectories with the highest possible return. On an intuitive level, policy gradient reinforces good behavior by increasing the probability of actions that lead to high returns, while discouraging bad behavior by reducing the probability of actions that lead to low returns.

Using equation (6.1), the gradient of the objective function is defined as follows:

\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
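
Writing the expectation in (6.1) explicitly as an integral over trajectories (a standard expansion, shown here only to make the quantity above concrete) gives:

\nabla_\theta J(\theta) = \nabla_\theta \int p(\tau \mid \theta)\, R(\tau)\, d\tau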

Relating this to the concepts from the previous chapters: in policy gradient methods, policy evaluation is the estimation of the return, R(τ), while policy improvement is the optimization step on the parameters, θ. Thus, policy gradient methods have to carry out both phases in tandem in order to improve the policy.
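
As a rough sketch of how the two phases interleave, one iteration of a generic policy gradient method might look like the following. The helpers collect_trajectories and estimate_gradient are hypothetical placeholders standing in for the environment-interaction and gradient-estimation code.

def policy_gradient_iteration(theta, collect_trajectories, estimate_gradient,
                              learning_rate=0.01):
    """One iteration of a generic policy gradient method.

    Policy evaluation: roll out the current policy to gather trajectories
    and estimate the returns / gradient of J(theta).
    Policy improvement: update theta along the estimated gradient.
    theta and the gradient estimate are assumed to be NumPy arrays.
    """
    trajectories = collect_trajectories(theta)       # policy evaluation (sampling)
    grad_j = estimate_gradient(theta, trajectories)  # estimate of grad J(theta)
    return theta + learning_rate * grad_j            # policy improvement (ascent step)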
