Policy gradient methods

The algorithms studied so far are value-based: at their core, they learn a state-value function, V(s), or an action-value function, Q(s, a). A value function estimates the total reward that can be accumulated from a given state or state-action pair. An action can then be selected based on the estimated action (or state) values.

Therefore, a greedy policy can be defined as follows:

$\pi(s) = \arg\max_a Q(s, a)$
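As a concrete illustration, the following minimal sketch shows greedy action selection from a tabular action-value function; the Q array, its dimensions, and the greedy_policy helper are hypothetical, not taken from this chapter:

import numpy as np

# Hypothetical tabular action-value estimates, one row per state
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))

def greedy_policy(Q, state):
    # Pick the action with the highest estimated value in the given state
    return int(np.argmax(Q[state]))

action = greedy_policy(Q, state=2)

This works only because the argmax runs over a small, discrete set of actions, which is exactly the assumption that breaks down in the settings discussed next.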

Value-based methods, when combined with deep neural networks, can learn very sophisticated policies to control agents operating in high-dimensional spaces. Despite these strengths, they struggle with problems that have a large number of actions, or with continuous action spaces.

In such cases, the maximization over actions is not feasible. Policy gradient (PG) algorithms show great potential in these contexts, as they can be easily adapted to continuous action spaces.
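To see why the adaptation is natural, consider the following minimal sketch of a stochastic policy over a continuous action space; the linear-Gaussian parameterization and the names theta, log_std, and sample_action are illustrative assumptions, not code from this book:

import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 4, 2

# Illustrative parameterization: a linear mapping from the state to the mean
# of a Gaussian, plus a learnable log standard deviation
theta = rng.normal(scale=0.1, size=(obs_dim, act_dim))
log_std = np.zeros(act_dim)

def sample_action(state):
    # Sample a real-valued action from the Gaussian policy pi(a|s);
    # no maximization over actions is required
    mean = state @ theta
    std = np.exp(log_std)
    return mean + std * rng.normal(size=act_dim)

state = rng.normal(size=obs_dim)
action = sample_action(state)

The policy directly outputs the parameters of an action distribution, so selecting an action becomes a sampling step rather than a search over the action space.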

PG methods belong to the broader class of policy-based methods, which also includes evolution strategies, studied later in Chapter 11, Understanding Black-Box Optimization Algorithms. The distinctive feature of PG algorithms is their use of the gradient of the policy, hence the name policy gradient.
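In broad terms, a PG method parameterizes the policy as $\pi_\theta(a|s)$ and improves it by stochastic gradient ascent on the expected return $J(\theta)$:

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

where $\alpha$ is the learning rate. How the gradient $\nabla_\theta J(\theta)$ is estimated is what distinguishes the individual algorithms introduced in the following sections.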

A more concise categorization of RL algorithms, compared to the one reported in Chapter 3, Solving Problems with Dynamic Programming, is shown in the following diagram:

Examples of policy gradient methods are REINFORCE and actor-critic (AC), which will be introduced in the next sections.