Deterministic policy gradient

Designing an algorithm that is both off-policy and able to learn stable policies in high-dimensional action spaces is challenging. DQN already solves the problem of learning a stable deep neural network policy in off-policy settings. One approach to making DQN also suitable for continuous actions is to discretize the action space. For example, if an action takes values between 0 and 1, a solution could be to discretize it into 11 values (0, 0.1, 0.2, ..., 0.9, 1.0) and predict their Q-values with DQN. However, this solution does not scale to agents with many action dimensions, because the number of possible discrete actions grows exponentially with the degrees of freedom of the agent. Moreover, this technique isn't applicable to tasks that need fine-grained control. Thus, we need to find an alternative.
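
To get a feel for why this blows up, here is a minimal sketch (the bin count and the degrees of freedom are illustrative numbers, not taken from the text) that counts how many discrete actions a DQN head would have to output:

```python
# Number of discrete actions when each action dimension is split into `bins` values.
def num_discrete_actions(degrees_of_freedom: int, bins: int = 11) -> int:
    return bins ** degrees_of_freedom

# One dimension with 11 bins is manageable, but a hypothetical 7-DoF arm
# would already require about 19 million discrete actions.
for dof in (1, 3, 7):
    print(dof, "DoF ->", num_discrete_actions(dof), "discrete actions")
```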

A valuable idea is to learn a deterministic actor-critic, which has a close relationship with Q-learning. Recall that, in Q-learning, the best action is the one that maximizes the approximated Q-function among all of the possible actions:

$$a^* = \underset{a}{\operatorname{argmax}} \, Q_\phi(s, a)$$

The idea is to learn a deterministic policy, $\mu_\theta(s)$, that approximates $\operatorname{argmax}_a Q_\phi(s, a)$. This overcomes the problem of computing a global maximization at every step, and opens up the possibility of extending the approach to very high-dimensional and continuous actions. Deterministic policy gradient (DPG) applies this concept successfully to some simple problems, such as Mountain Car, Pendulum, and an octopus arm. After DPG, DDPG expands on the ideas of DPG, using deep neural networks as policies and adopting some more careful design choices in order to make the algorithm more stable. A further algorithm, TD3, addresses the high variance and the overestimation bias that are common in DPG and DDPG. Both DDPG and TD3 will be explained and developed in the following sections. When we construct a map that categorizes RL algorithms, we place DPG, DDPG, and TD3 at the intersection of policy gradient and Q-learning algorithms, as in the following diagram. For now, let's focus on the foundation of DPG and how it works:

Categorization of the model-free RL algorithms developed so far

The new DPG algorithms combine both Q-learning and policy gradient methods. A parametrized deterministic policy, $\mu_\theta$, outputs only deterministic values; in continuous contexts, these can be the means of the actions. The parameters of the policy can then be updated by solving the following optimization problem:

$$\theta_{k+1} = \underset{\theta}{\operatorname{argmax}} \; \mathbb{E}_{s \sim \rho^{\mu_{\theta_k}}} \big[ Q_\phi\big(s, \mu_\theta(s)\big) \big] \tag{8.1}$$

$Q_\phi$ is the parametrized action-value function. Note that deterministic approaches differ from stochastic approaches in that no additional noise is added to the actions: in PPO and TRPO, we were sampling from a normal distribution with a mean and a standard deviation, whereas here the policy outputs only a deterministic mean. Going back to the update (8.1), as always, the maximization is done with stochastic gradient ascent, which incrementally improves the policy with small updates. The gradient of the objective function can then be computed as follows:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}} \Big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a) \big|_{a = \mu_\theta(s)} \Big]$$

$\rho^{\mu}$ is the state distribution under the policy $\mu_\theta$. This formulation comes from the deterministic policy gradient theorem. It says that the gradient of the objective function is obtained, in expectation, by applying the chain rule to the Q-function with respect to the policy parameters. With automatic differentiation software such as TensorFlow, this is very easy to compute. In fact, the gradient is estimated by backpropagating from the Q-values all the way through the policy, while updating only the parameters of the latter, as shown here:

An illustration of the DPG theorem
The gradient is computed starting from the Q-values, but only the policy is updated.
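
As a concrete illustration, the following is a minimal TensorFlow 2 sketch of this update, under some assumptions of ours: the network sizes, the names `actor`, `critic`, and `actor_optimizer`, and the batch of states are illustrative, not code from this chapter. The loss is the negative Q-value of the actions chosen by the policy, and the gradient is taken only with respect to the actor's variables:

```python
import tensorflow as tf

state_dim, action_dim = 3, 1  # illustrative dimensions

# Deterministic policy mu_theta(s) and parametrized critic Q_phi(s, a)
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(action_dim, activation='tanh')])
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim + action_dim,)),
    tf.keras.layers.Dense(1)])
actor_optimizer = tf.keras.optimizers.Adam(1e-3)

def dpg_actor_update(states):
    # The gradient flows from the Q-values, through the actions, into the actor,
    # but only the actor's parameters are updated; the critic is left untouched.
    with tf.GradientTape() as tape:
        actions = actor(states)
        q_values = critic(tf.concat([states, actions], axis=1))
        loss = -tf.reduce_mean(q_values)  # ascending on Q is descending on -Q
    grads = tape.gradient(loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```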

This, however, is still a theoretical result. As we know, a purely deterministic policy doesn't explore the environment, and thus it won't find a good solution. To make DPG off-policy, we need to take a step further and define the gradient of the objective function so that the expectation follows the state distribution of a stochastic exploratory policy:

$$\nabla_\theta J_\beta(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\beta}} \Big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a) \big|_{a = \mu_\theta(s)} \Big]$$

$\beta$ is an exploratory policy, also called a behavior policy. This equation gives the off-policy deterministic policy gradient: it estimates the gradient with respect to the deterministic policy ($\mu_\theta$), while the trajectories are generated by following the behavior policy ($\beta$). Note that, in practice, the behavior policy is just the deterministic policy with additional noise.
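
In code, such a behavior policy can be as simple as the deterministic action plus Gaussian noise. Here is a minimal sketch that reuses the hypothetical `actor` network from the previous snippet; the noise scale and the clipping range are illustrative choices:

```python
import numpy as np

def behavior_policy(state, noise_std=0.1):
    # Deterministic action from the actor, perturbed with Gaussian noise
    # so that the agent keeps exploring the environment.
    deterministic_action = actor(state[np.newaxis, :]).numpy()[0]
    noisy_action = deterministic_action + np.random.normal(0.0, noise_std, size=deterministic_action.shape)
    return np.clip(noisy_action, -1.0, 1.0)  # keep the action in the tanh output range
```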

Though we have talked about a deterministic actor-critic previously, until now we have only shown how the policy learning takes place. In fact, we are learning both the actor, represented by the deterministic policy ($\mu_\theta$), and the critic, represented by the Q-function ($Q_\phi$). The differentiable action-value function ($Q_\phi$) can easily be learned with Bellman updates that minimize the Bellman error ($r + \gamma Q_\phi(s', \mu_\theta(s')) - Q_\phi(s, a)$), as done in Q-learning algorithms.
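
A minimal sketch of such a critic update, again reusing the hypothetical `actor` and `critic` networks from above with an illustrative optimizer and discount factor (the target networks that DDPG adds for stability are omitted here), could look as follows:

```python
critic_optimizer = tf.keras.optimizers.Adam(1e-3)
gamma = 0.99  # illustrative discount factor

def critic_update(states, actions, rewards, next_states, dones):
    # Bellman target: r + gamma * Q(s', mu(s')), zeroed for terminal transitions.
    next_actions = actor(next_states)
    next_q = tf.squeeze(critic(tf.concat([next_states, next_actions], axis=1)), axis=1)
    targets = rewards + gamma * (1.0 - dones) * next_q
    with tf.GradientTape() as tape:
        q_values = tf.squeeze(critic(tf.concat([states, actions], axis=1)), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_values))  # mean squared Bellman error
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))
```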
