Target regularization

Critics that are updated with deterministic target actions tend to overfit to narrow peaks in the value estimate, which increases the variance of the target. TD3 introduces a smoothing regularization technique that adds clipped noise in a small area around the target action:

a'(s') = clip(μ_θ'(s') + clip(ε, -c, c), a_low, a_high),   with ε ~ N(0, σ)

The regularization can be implemented in a function that takes a vector and a scalar as arguments:

def add_normal_noise(x, noise_scale):
    # Add zero-mean Gaussian noise, clipped to [-0.5, 0.5], to each element of x
    return x + np.clip(np.random.normal(loc=0.0, scale=noise_scale, size=x.shape), -0.5, 0.5)
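As a quick sanity check, here is a minimal usage sketch of the function we just defined (the action values and noise scale are made-up numbers, used only for illustration):

import numpy as np

example_actions = np.array([0.2, -0.9, 1.0])    # made-up target actions
noisy_actions = add_normal_noise(example_actions, noise_scale=0.2)
# The added noise is clipped to [-0.5, 0.5], so no element moves by more than 0.5
assert np.all(np.abs(noisy_actions - example_actions) <= 0.5)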

Then, add_normal_noise is called on the output of the target policy, as shown in the following lines of code, which replace the corresponding target computation in the DDPG implementation:

...
# Run the target policy on the next observations and smooth the resulting actions
double_actions = sess.run(p_tar, feed_dict={obs_ph: mb_obs2})
double_noisy_actions = np.clip(add_normal_noise(double_actions, target_noise), env.action_space.low, env.action_space.high)

# Clipped double Q-learning: take the element-wise minimum of the two target critics
q1_target_mb, q2_target_mb = sess.run([qa1_tar, qa2_tar], feed_dict={obs_ph: mb_obs2, act_ph: double_noisy_actions})
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)

# TD target: bootstrap only on non-terminal transitions
y_r = np.array(mb_rew) + discount * (1 - np.array(mb_done)) * q_target_mb
...

After adding the extra noise, we clip the actions to make sure they stay within the range set by the environment.
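To see the clipped double-Q target in isolation from the TensorFlow session, here is a minimal NumPy sketch of the same computation; the rewards, done flags, and target critic outputs below are made-up values, not taken from a real run:

import numpy as np

discount = 0.99
mb_rew = np.array([1.0, 0.5, -0.2])           # made-up minibatch rewards
mb_done = np.array([0.0, 0.0, 1.0])           # made-up done flags
q1_target_mb = np.array([10.0, 4.0, 2.0])     # made-up outputs of the first target critic
q2_target_mb = np.array([9.5, 4.5, 3.0])      # made-up outputs of the second target critic

# Element-wise minimum of the two target critics
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)

# TD target: bootstrap only on non-terminal transitions
y_r = mb_rew + discount * (1 - mb_done) * q_target_mb
print(y_r)   # [1.0 + 0.99*9.5, 0.5 + 0.99*4.0, -0.2]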

Putting everything together, we obtain the algorithm that is shown in the following pseudocode:

---------------------------------------------------------------------------------
TD3 Algorithm
---------------------------------------------------------------------------------

Initialize online networks μ_θ, Q_φ1 and Q_φ2
Initialize target networks μ_θ', Q_φ1' and Q_φ2' with the same weights as the online networks
Initialize empty replay buffer D
Initialize environment s <- env.reset()

for episode: 1..M do
    > Run an episode
    while not d:
        a <- μ_θ(s) + ε,  ε ~ N(0, σ)
        s', r, d <- env.step(a)

        > Store the transition in the buffer
        D <- D U (s, a, r, s', d)
        s <- s'

        > Sample a minibatch b from D

        > Calculate the target value for every i in b
        a'_i <- clip(μ_θ'(s'_i) + clip(ε, -c, c), a_low, a_high)
        y_i <- r_i + γ (1 - d_i) min(Q_φ1'(s'_i, a'_i), Q_φ2'(s'_i, a'_i))

        > Update the critics
        φ1 <- φ1 - α ∇_φ1 (1/|b|) Σ_i (Q_φ1(s_i, a_i) - y_i)^2
        φ2 <- φ2 - α ∇_φ2 (1/|b|) Σ_i (Q_φ2(s_i, a_i) - y_i)^2

        if iter % policy_update_frequency == 0:
            > Update the policy
            θ <- θ + β ∇_θ (1/|b|) Σ_i Q_φ1(s_i, μ_θ(s_i))

            > Targets update
            θ' <- τ θ + (1 - τ) θ'
            φ1' <- τ φ1 + (1 - τ) φ1'
            φ2' <- τ φ2 + (1 - τ) φ2'

    if d == True:
        s <- env.reset()

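To complement the pseudocode, the following is a minimal Python sketch of the two scheduling details that distinguish TD3's update loop from DDPG's: the delayed policy update and the soft (Polyak) target update. The weight vectors, tau, policy_update_frequency, and the placeholder gradient steps are illustrative assumptions, not the actual implementation:

import numpy as np

tau = 0.005                     # Polyak averaging coefficient (assumed value)
policy_update_frequency = 2     # delay between critic and policy updates (assumed value)

# Stand-ins for the network parameters (in practice, these are TensorFlow variables)
theta, theta_tar = np.zeros(4), np.zeros(4)
phi1, phi1_tar = np.zeros(4), np.zeros(4)
phi2, phi2_tar = np.zeros(4), np.zeros(4)

def polyak_update(online, target, tau):
    # Move the target weights a small step towards the online weights
    return tau * online + (1 - tau) * target

for it in range(10):
    # The critics are updated at every iteration
    phi1 += 0.01 * np.random.normal(size=phi1.shape)    # placeholder for a gradient step
    phi2 += 0.01 * np.random.normal(size=phi2.shape)    # placeholder for a gradient step

    if it % policy_update_frequency == 0:
        # The policy is updated only every policy_update_frequency iterations
        theta += 0.01 * np.random.normal(size=theta.shape)   # placeholder for a gradient step

        # Targets update
        theta_tar = polyak_update(theta, theta_tar, tau)
        phi1_tar = polyak_update(phi1, phi1_tar, tau)
        phi2_tar = polyak_update(phi2, phi2_tar, tau)
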
That's everything for the TD3 algorithm. Now, you have a clear understanding of all the deterministic and non-deterministic policy gradient methods. Almost all of the model-free algorithms are based on the principles that we explained in these chapters, and if you master them, you will be able to understand and implement all of them.
