Target regularization

Critics that are updated with deterministic target actions tend to overfit to narrow peaks in the value estimate, which increases the variance of the target. TD3 introduces a smoothing regularization technique that adds clipped noise in a small area around the target action:

a'(s') = clip(μ_θ'(s') + clip(ε, -c, c), a_low, a_high),   with ε ~ N(0, σ)

The regularization can be implemented in a function that takes a vector and a scalar as arguments:

def add_normal_noise(x, noise_scale):
    # Add zero-mean Gaussian noise, clipped to [-0.5, 0.5], to each element of x
    return x + np.clip(np.random.normal(loc=0.0, scale=noise_scale, size=x.shape), -0.5, 0.5)
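As a quick sanity check, here is a minimal usage sketch of the function we just defined (the action values and noise scale are made-up numbers, used only for illustration):

import numpy as np

example_actions = np.array([0.2, -0.9, 1.0])    # made-up target actions
noisy_actions = add_normal_noise(example_actions, noise_scale=0.2)
# The added noise is clipped to [-0.5, 0.5], so no element moves by more than 0.5
assert np.all(np.abs(noisy_actions - example_actions) <= 0.5)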

Then, add_normal_noise is called on the output of the target policy, as shown in the following lines of code, which replace the corresponding target computation in the DDPG implementation:

...
# Run the target policy on the next observations and smooth the resulting actions
double_actions = sess.run(p_tar, feed_dict={obs_ph: mb_obs2})
double_noisy_actions = np.clip(add_normal_noise(double_actions, target_noise), env.action_space.low, env.action_space.high)

# Clipped double Q-learning: take the element-wise minimum of the two target critics
q1_target_mb, q2_target_mb = sess.run([qa1_tar, qa2_tar], feed_dict={obs_ph: mb_obs2, act_ph: double_noisy_actions})
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)

# TD target: bootstrap only on non-terminal transitions
y_r = np.array(mb_rew) + discount * (1 - np.array(mb_done)) * q_target_mb
...

After adding the extra noise, we clip the actions to make sure they stay within the range set by the environment.
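To see the clipped double-Q target in isolation from the TensorFlow session, here is a minimal NumPy sketch of the same computation; the rewards, done flags, and target critic outputs below are made-up values, not taken from a real run:

import numpy as np

discount = 0.99
mb_rew = np.array([1.0, 0.5, -0.2])           # made-up minibatch rewards
mb_done = np.array([0.0, 0.0, 1.0])           # made-up done flags
q1_target_mb = np.array([10.0, 4.0, 2.0])     # made-up outputs of the first target critic
q2_target_mb = np.array([9.5, 4.5, 3.0])      # made-up outputs of the second target critic

# Element-wise minimum of the two target critics
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)

# TD target: bootstrap only on non-terminal transitions
y_r = mb_rew + discount * (1 - mb_done) * q_target_mb
print(y_r)   # [1.0 + 0.99*9.5, 0.5 + 0.99*4.0, -0.2]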

Putting everything together, we obtain the algorithm that is shown in the following pseudocode:

---------------------------------------------------------------------------------
TD3 Algorithm
---------------------------------------------------------------------------------

Initialize online networks μ_θ, Q_φ1 and Q_φ2
Initialize target networks μ_θ', Q_φ1' and Q_φ2' with the same weights as the online networks
Initialize empty replay buffer D
Initialize environment s <- env.reset()

for episode: 1..M do
    > Run an episode
    while not d:
        a <- μ_θ(s) + ε,  ε ~ N(0, σ)
        s', r, d <- env.step(a)

        > Store the transition in the buffer
        D <- D U (s, a, r, s', d)
        s <- s'

        > Sample a minibatch b from D

        > Calculate the target value for every i in b
        a'_i <- clip(μ_θ'(s'_i) + clip(ε, -c, c), a_low, a_high)
        y_i <- r_i + γ (1 - d_i) min(Q_φ1'(s'_i, a'_i), Q_φ2'(s'_i, a'_i))

        > Update the critics
        φ1 <- φ1 - α ∇_φ1 (1/|b|) Σ_i (Q_φ1(s_i, a_i) - y_i)^2
        φ2 <- φ2 - α ∇_φ2 (1/|b|) Σ_i (Q_φ2(s_i, a_i) - y_i)^2

        if iter % policy_update_frequency == 0:
            > Update the policy
            θ <- θ + β ∇_θ (1/|b|) Σ_i Q_φ1(s_i, μ_θ(s_i))

            > Targets update
            θ' <- τ θ + (1 - τ) θ'
            φ1' <- τ φ1 + (1 - τ) φ1'
            φ2' <- τ φ2 + (1 - τ) φ2'

    if d == True:
        s <- env.reset()

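To complement the pseudocode, the following is a minimal Python sketch of the two scheduling details that distinguish TD3's update loop from DDPG's: the delayed policy update and the soft (Polyak) target update. The weight vectors, tau, policy_update_frequency, and the placeholder gradient steps are illustrative assumptions, not the actual implementation:

import numpy as np

tau = 0.005                     # Polyak averaging coefficient (assumed value)
policy_update_frequency = 2     # delay between critic and policy updates (assumed value)

# Stand-ins for the network parameters (in practice, these are TensorFlow variables)
theta, theta_tar = np.zeros(4), np.zeros(4)
phi1, phi1_tar = np.zeros(4), np.zeros(4)
phi2, phi2_tar = np.zeros(4), np.zeros(4)

def polyak_update(online, target, tau):
    # Move the target weights a small step towards the online weights
    return tau * online + (1 - tau) * target

for it in range(10):
    # The critics are updated at every iteration
    phi1 += 0.01 * np.random.normal(size=phi1.shape)    # placeholder for a gradient step
    phi2 += 0.01 * np.random.normal(size=phi2.shape)    # placeholder for a gradient step

    if it % policy_update_frequency == 0:
        # The policy is updated only every policy_update_frequency iterations
        theta += 0.01 * np.random.normal(size=theta.shape)   # placeholder for a gradient step

        # Targets update
        theta_tar = polyak_update(theta, theta_tar, tau)
        phi1_tar = polyak_update(phi1, phi1_tar, tau)
        phi2_tar = polyak_update(phi2, phi2_tar, tau)
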
That's everything for the TD3 algorithm. Now, you have a clear understanding of all the deterministic and non-deterministic policy gradient methods. Almost all of the model-free algorithms are based on the principles that we explained in these chapters, and if you master them, you will be able to understand and implement all of them.
