The DDPG algorithm

DDPG uses two key ideas, both borrowed from DQN but adapted for the actor-critic case:

  • Replay buffer: All the transitions acquired during the lifetime of the agent are stored in a replay buffer, also called experience replay. Mini-batches sampled from it are then used to train both the actor and the critic.
  • Target network: Q-learning is unstable, since the network that is updated is also the one that is used for computing the target values. If you remember, DQN mitigates this problem by employing a target network that is updated every N iterations (copying the parameters of the online network into the target network). The DDPG paper shows that, in this context, a soft target update works better. With a soft update, the parameters of the target network, θ', are partially updated on each step with the parameters of the online network, θ: θ' ← τθ + (1 − τ)θ', with τ ≪ 1. Yes, it may slow the learning, as the target network is changed only partially, but this cost is outweighed by the benefit of the increased stability. The trick of using a target network is used for both the actor and the critic, so the parameters of the target critic are also updated following the soft update: φ' ← τφ + (1 − τ)φ'. A short sketch of this update follows the list.
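For concreteness, here is a minimal sketch of the soft update in PyTorch. The framework choice and the value of tau are my own assumptions for the example and are not prescribed by the algorithm itself:

import torch

def soft_update(online_net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005):
    # target <- tau * online + (1 - tau) * target, applied parameter by parameter
    with torch.no_grad():
        for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
            target_p.mul_(1.0 - tau).add_(tau * online_p)

The same function can be called after every training step, once for the actor's target network and once for the critic's.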

Note that, from now on, we'll refer to θ and φ as the parameters of the online actor and the online critic, and to θ' and φ' as the parameters of the target actor and the target critic.

A characteristic that DDPG inherits from DQN is the ability to update the actor and the critic for each step taken in the environment. This follows on from the fact that DDPG is off-policy, and learns from the mini-batches that were sampled from the replay buffer. DDPG doesn't have to wait until a sufficiently large batch is gathered from the environment, as would be the case in on-policy stochastic policy gradient methods. 
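To make the buffer mechanics concrete, here is a minimal replay buffer sketch in plain Python; the capacity and batch size are illustrative defaults, not values prescribed by DDPG:

import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples uniform mini-batches."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are dropped once full

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)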

Previously, we saw how DPG acts according to an exploratory behavior policy, despite the fact that it is still learning a deterministic policy. But how is this exploratory policy built? In DDPG, the behavior policy is constructed by adding noise sampled from a noise process, 𝒩:

β(s_t) = μ_θ(s_t) + 𝒩

The 𝒩 process will make sure that the environment is sufficiently explored.
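As an illustration, the sketch below builds the behavior policy by adding Gaussian noise to the deterministic action. The original DDPG paper uses an Ornstein-Uhlenbeck process for the noise, but uncorrelated Gaussian noise is a common, simpler alternative; the actor callable, the noise scale, and the action bounds are assumptions of this example:

import numpy as np

def behavior_action(actor, state, noise_std=0.1, action_low=-1.0, action_high=1.0):
    # beta(s) = mu_theta(s) + noise, clipped to the valid action range
    action = np.asarray(actor(state))   # actor: assumed callable returning the deterministic action
    noise = np.random.normal(0.0, noise_std, size=action.shape)
    return np.clip(action + noise, action_low, action_high)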

Wrapping up, DDPG learns by cyclically repeating these three steps until convergence occurs:

  • The behavior policy interacts with the environment, collecting observations and rewards and storing them in the replay buffer.
  • At each step, the actor and the critic are updated using the information held in the mini-batch sampled from the buffer. Specifically, the critic is updated by minimizing the mean squared error (MSE) loss between the values predicted by the online critic, Q_φ, and the target values computed using the target policy, μ_θ', and the target critic, Q_φ'. The actor, instead, is updated following formula (8.3). A sketch of both updates follows this list.
  • The target network parameters are updated following the soft update.
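To ground the second step of the cycle, here is a sketch of one critic and actor update in PyTorch; it corresponds to formulas (8.4), (8.5), and (8.6) in the pseudocode that follows. The network modules, their optimizers, and the discount factor are assumptions of this example, and the batch is expected to contain tensors:

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG gradient step on a mini-batch of (s, a, r, s', d) tensors."""
    states, actions, rewards, next_states, dones = batch
    rewards = rewards.view(-1, 1)   # column shape, so it broadcasts with the critic's (batch, 1) output
    dones = dones.view(-1, 1)

    # Target values: y_i = r_i + gamma * (1 - d_i) * Q_phi'(s'_i, mu_theta'(s'_i))
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, target_actor(next_states))

    # Critic update: minimize the MSE between the online critic's predictions and the targets
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the deterministic policy gradient by maximizing Q_phi(s, mu_theta(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()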

The whole algorithm is summarized in this pseudocode:

---------------------------------------------------------------------------------
DDPG Algorithm
---------------------------------------------------------------------------------

Initialize online networks Q_φ(s, a) and μ_θ(s)
Initialize target networks Q_φ'(s, a) and μ_θ'(s) with the same weights as the online networks
Initialize empty replay buffer D
Initialize environment s ← env.reset()

for episode = 1..M do
    > Run an episode
    while not d:
        a ← μ_θ(s) + 𝒩
        s', r, d ← env(a)

        > Store the transition in the buffer
        D ← D ∪ (s, a, r, s', d)
        s ← s'

        > Sample a minibatch b of N transitions
        b ~ D

        > Calculate the target value for every i in b
        y_i ← r_i + γ(1 − d_i) Q_φ'(s'_i, μ_θ'(s'_i))        (8.4)

        > Update the critic
        φ ← φ − α_Q ∇_φ (1/N) Σ_i (Q_φ(s_i, a_i) − y_i)²        (8.5)

        > Update the policy
        θ ← θ + α_π (1/N) Σ_i ∇_a Q_φ(s_i, a)|_{a=μ_θ(s_i)} ∇_θ μ_θ(s_i)        (8.6)

        > Targets update
        θ' ← τθ + (1 − τ)θ'
        φ' ← τφ + (1 − τ)φ'

    if d == True:
        s ← env.reset()
---------------------------------------------------------------------------------
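Tying the pseudocode to the earlier sketches, a bare-bones training loop might look roughly like the following. It assumes a classic Gym-style environment with reset()/step(), an actor/critic pair with their target copies and optimizers already built, and a policy_fn wrapper that maps a NumPy state to a NumPy action; all hyperparameters are illustrative:

import numpy as np
import torch

batch_size = 64
buffer = ReplayBuffer()
state = env.reset()

for step in range(100_000):
    # Act with the exploratory behavior policy and store the transition
    action = behavior_action(policy_fn, state)   # policy_fn: assumed NumPy wrapper around the online actor
    next_state, reward, done, _ = env.step(action)
    buffer.store(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

    # Once enough transitions are available, update the online networks and the targets
    if len(buffer) >= batch_size:
        batch = tuple(torch.as_tensor(np.asarray(x), dtype=torch.float32)
                      for x in buffer.sample(batch_size))
        ddpg_update(batch, actor, critic, target_actor, target_critic, actor_opt, critic_opt)
        soft_update(actor, target_actor)
        soft_update(critic, target_critic)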

With a clearer understanding of the algorithm, we can now start implementing it.
