The DDPG algorithm

DDPG uses two key ideas, both borrowed from DQN but adapted for the actor-critic case:

  • Replay buffer: All the transitions acquired during the lifetime of the agent are stored in a replay buffer, also called experience replay. Mini-batches sampled from it are then used to train both the actor and the critic.
  • Target network: Q-learning is unstable, since the network that is updated is also the one that is used for computing the target values. If you remember, DQN mitigates this problem by employing a target network that is updated every N iterations (copying the parameters of the online network into the target network). The DDPG paper shows that, in this context, a soft target update works better. With a soft update, the parameters of the target network, θ', are partially updated on each step with the parameters of the online network, θ: θ' ← τθ + (1 − τ)θ', with τ ≪ 1. Yes, it may slow the learning, as the target network is changed only partially, but this cost is outweighed by the benefit of the increased stability. The trick of using a target network is used for both the actor and the critic, so the parameters of the target critic are also updated following the soft update: φ' ← τφ + (1 − τ)φ'. A short sketch of this update follows the list.
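For concreteness, here is a minimal sketch of the soft update in PyTorch. The framework choice and the value of tau are my own assumptions for the example and are not prescribed by the algorithm itself:

import torch

def soft_update(online_net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005):
    # target <- tau * online + (1 - tau) * target, applied parameter by parameter
    with torch.no_grad():
        for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
            target_p.mul_(1.0 - tau).add_(tau * online_p)

The same function can be called after every training step, once for the actor's target network and once for the critic's.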

Note that, from now on, we'll refer to θ and φ as the parameters of the online actor and the online critic, and to θ' and φ' as the parameters of the target actor and the target critic.

A characteristic that DDPG inherits from DQN is the ability to update the actor and the critic for each step taken in the environment. This follows on from the fact that DDPG is off-policy, and learns from the mini-batches that were sampled from the replay buffer. DDPG doesn't have to wait until a sufficiently large batch is gathered from the environment, as would be the case in on-policy stochastic policy gradient methods. 
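To make the buffer mechanics concrete, here is a minimal replay buffer sketch in plain Python; the capacity and batch size are illustrative defaults, not values prescribed by DDPG:

import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples uniform mini-batches."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are dropped once full

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)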

Previously, we saw how DPG acts according to an exploratory behavior policy, despite the fact that it is still learning a deterministic policy. But how is this exploratory policy built? In DDPG, the behavior policy is constructed by adding noise sampled from a noise process, 𝒩:

β(s_t) = μ_θ(s_t) + 𝒩

The 𝒩 process will make sure that the environment is sufficiently explored.
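As an illustration, the sketch below builds the behavior policy by adding Gaussian noise to the deterministic action. The original DDPG paper uses an Ornstein-Uhlenbeck process for the noise, but uncorrelated Gaussian noise is a common, simpler alternative; the actor callable, the noise scale, and the action bounds are assumptions of this example:

import numpy as np

def behavior_action(actor, state, noise_std=0.1, action_low=-1.0, action_high=1.0):
    # beta(s) = mu_theta(s) + noise, clipped to the valid action range
    action = np.asarray(actor(state))   # actor: assumed callable returning the deterministic action
    noise = np.random.normal(0.0, noise_std, size=action.shape)
    return np.clip(action + noise, action_low, action_high)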

Wrapping up, DDPG learns by cyclically repeating these three steps until convergence occurs:

  • The behavior policy interacts with the environment, collecting observations and rewards and storing them in the replay buffer.
  • At each step, the actor and the critic are updated using the information held in the mini-batch sampled from the buffer. Specifically, the critic is updated by minimizing the mean squared error (MSE) loss between the values predicted by the online critic, Q_φ, and the target values computed using the target policy, μ_θ', and the target critic, Q_φ'. The actor, instead, is updated following formula (8.3). A sketch of both updates follows this list.
  • The target network parameters are updated following the soft update.
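To ground the second step of the cycle, here is a sketch of one critic and actor update in PyTorch; it corresponds to formulas (8.4), (8.5), and (8.6) in the pseudocode that follows. The network modules, their optimizers, and the discount factor are assumptions of this example, and the batch is expected to contain tensors:

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG gradient step on a mini-batch of (s, a, r, s', d) tensors."""
    states, actions, rewards, next_states, dones = batch
    rewards = rewards.view(-1, 1)   # column shape, so it broadcasts with the critic's (batch, 1) output
    dones = dones.view(-1, 1)

    # Target values: y_i = r_i + gamma * (1 - d_i) * Q_phi'(s'_i, mu_theta'(s'_i))
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, target_actor(next_states))

    # Critic update: minimize the MSE between the online critic's predictions and the targets
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the deterministic policy gradient by maximizing Q_phi(s, mu_theta(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()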

The whole algorithm is summarized in this pseudocode:

---------------------------------------------------------------------------------
DDPG Algorithm
---------------------------------------------------------------------------------

Initialize online networks Q_φ(s, a) and μ_θ(s)
Initialize target networks Q_φ'(s, a) and μ_θ'(s) with the same weights as the online networks
Initialize empty replay buffer D
Initialize environment s ← env.reset()

for episode = 1..M do
    > Run an episode
    while not d:
        a ← μ_θ(s) + 𝒩
        s', r, d ← env(a)

        > Store the transition in the buffer
        D ← D ∪ (s, a, r, s', d)
        s ← s'

        > Sample a minibatch b of N transitions
        b ~ D

        > Calculate the target value for every i in b
        y_i ← r_i + γ(1 − d_i) Q_φ'(s'_i, μ_θ'(s'_i))        (8.4)

        > Update the critic
        φ ← φ − α_Q ∇_φ (1/N) Σ_i (Q_φ(s_i, a_i) − y_i)²        (8.5)

        > Update the policy
        θ ← θ + α_π (1/N) Σ_i ∇_a Q_φ(s_i, a)|_{a=μ_θ(s_i)} ∇_θ μ_θ(s_i)        (8.6)

        > Targets update
        θ' ← τθ + (1 − τ)θ'
        φ' ← τφ + (1 − τ)φ'

    if d == True:
        s ← env.reset()
---------------------------------------------------------------------------------
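Tying the pseudocode to the earlier sketches, a bare-bones training loop might look roughly like the following. It assumes a classic Gym-style environment with reset()/step(), an actor/critic pair with their target copies and optimizers already built, and a policy_fn wrapper that maps a NumPy state to a NumPy action; all hyperparameters are illustrative:

import numpy as np
import torch

batch_size = 64
buffer = ReplayBuffer()
state = env.reset()

for step in range(100_000):
    # Act with the exploratory behavior policy and store the transition
    action = behavior_action(policy_fn, state)   # policy_fn: assumed NumPy wrapper around the online actor
    next_state, reward, done, _ = env.step(action)
    buffer.store(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

    # Once enough transitions are available, update the online networks and the targets
    if len(buffer) >= batch_size:
        batch = tuple(torch.as_tensor(np.asarray(x), dtype=torch.float32)
                      for x in buffer.sample(batch_size))
        ddpg_update(batch, actor, critic, target_actor, target_critic, actor_opt, critic_opt)
        soft_update(actor, target_actor)
        soft_update(critic, target_critic)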

With a clearer understanding of the algorithm, we can now start implementing it.
