Policy gradient theorem

An initial problem is encountered when looking at equation (6.2): in its formulation, the gradient of the objective function depends on the distribution of the states induced by the policy; that is:

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[ Q^{\pi_\theta}(s, a) \right]$$
We would like to use a stochastic approximation of that expectation, but to compute the distribution of the states, $d^{\pi_\theta}(s)$, we would still need a complete model of the environment. Thus, this formulation isn't suitable for our purposes.

The policy gradient theorem comes to the rescue here. Its purpose is to provide an analytical formulation for the gradient of the objective function with respect to the parameters of the policy that doesn't involve the derivative of the state distribution. Formally, the policy gradient theorem enables us to express the gradient of the objective function as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a) \right]$$
The proof of the policy gradient theorem is beyond the scope of this book and thus isn't included. However, you can find it in the book by Sutton and Barto (http://incompleteideas.net/book/the-book-2nd.html) or in other online resources.

Now that the derivative of the objective doesn't involve the derivative of the state distribution, the expectation can be estimated by sampling from the policy. Thus, the derivative of the objective can be approximated as follows:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i|s_i)\, Q^{\pi_\theta}(s_i, a_i)$$

where the state-action pairs $(s_i, a_i)$ are collected by running the policy $\pi_\theta$ in the environment.
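To make the sampled estimate concrete, here is a minimal NumPy sketch (an illustration, not the book's code) of how the single-sample term $\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$ could be computed for a simple linear-softmax policy; the function names, the feature vector, and the q_value argument are assumptions made for the example.

import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def action_probs(theta, features):
    # pi_theta(.|s) for a linear-softmax policy;
    # theta: (n_actions, n_features), features: (n_features,)
    return softmax(theta @ features)

def grad_log_pi(theta, features, action):
    # Score function: gradient of log pi_theta(a|s) with respect to theta.
    # For a linear-softmax policy this is (one_hot(a) - pi(.|s)) outer features.
    probs = action_probs(theta, features)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, features)

def pg_sample_estimate(theta, features, action, q_value):
    # One term of the Monte Carlo estimate: grad log pi_theta(a|s) * Q(s, a).
    return grad_log_pi(theta, features, action) * q_value

Averaging pg_sample_estimate over many state-action pairs collected by running the policy yields the approximation above.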
This can be used to produce a stochastic update with gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^{\pi_\theta}(s_t, a_t) \qquad (6.5)$$
Note that, because the goal is to maximize the objective function, gradient ascent is used to move the parameters in the same direction as the gradient (contrary to gradient descent, which performs $\theta_{t+1} = \theta_t - \alpha\, \nabla_\theta J(\theta)$, moving the parameters in the opposite direction of the gradient).
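In practice, with an automatic differentiation framework, the same ascent step is usually implemented as gradient descent on the negative of the sampled objective. The following PyTorch sketch is only one possible way to do this under assumed placeholders (policy_net, q_value, and the sizes are illustrative, not taken from the book):

import torch
from torch.distributions import Categorical

# A minimal sketch of the stochastic ascent step of equation (6.5), written as
# descent on -log pi_theta(a|s) * Q(s, a). All names and sizes are placeholders.

n_features, n_actions = 4, 2
policy_net = torch.nn.Linear(n_features, n_actions)        # logits of pi_theta(.|s)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)

state = torch.randn(n_features)                             # sampled state s_t
dist = Categorical(logits=policy_net(state))                # pi_theta(.|s_t)
action = dist.sample()                                      # sampled action a_t
q_value = 1.0                                               # placeholder for Q(s_t, a_t)

# Descending on the negative objective moves theta in the direction of the
# gradient, that is, it performs the gradient ascent update above.
loss = -dist.log_prob(action) * q_value
optimizer.zero_grad()
loss.backward()
optimizer.step()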

The idea behind equation (6.5) is to increase the probability that good actions will be re-proposed in the future, while reducing the probability of bad actions. The quality of each action is conveyed by the usual scalar value $Q^{\pi_\theta}(s,a)$, which measures the quality of the state-action pair.
