Critics that are updated from deterministic actions tend to overfit to narrow peaks of the value estimate, which increases the variance of the target. TD3 counters this with a target policy smoothing regularization: it adds clipped noise to the target action, so that similar actions in a small area around it receive similar value estimates:
The regularization can be implemented in a function that takes a vector and a noise scale as arguments:

import numpy as np

def add_normal_noise(x, noise_scale):
    # Gaussian noise, clipped to [-0.5, 0.5] before being added, so the
    # perturbed action stays in a small area around the original one
    return x + np.clip(np.random.normal(loc=0.0, scale=noise_scale, size=x.shape), -0.5, 0.5)
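As a quick standalone check of the clipping behavior, the same operation can be inlined and verified: no component of the perturbed action ever moves by more than 0.5 from the original one, whatever the noise scale.

```python
import numpy as np

np.random.seed(0)
actions = np.array([[0.2, -0.8], [1.0, 0.0]])
# Same operation as add_normal_noise: a clipped Gaussian perturbation
noise = np.clip(np.random.normal(0.0, 0.2, size=actions.shape), -0.5, 0.5)
noisy = actions + noise
print(np.abs(noisy - actions).max() <= 0.5)  # True
```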
Then, add_normal_noise is called on the output of the target policy, as shown in the following lines of code (the changes with respect to the DDPG implementation are the noise injection and the use of two target critics):
...
double_actions = sess.run(p_tar, feed_dict={obs_ph: mb_obs2})
# Smooth the target actions with clipped noise, then keep them inside the
# action bounds of the environment
double_noisy_actions = np.clip(add_normal_noise(double_actions, target_noise), env.action_space.low, env.action_space.high)
# Clipped double Q-learning: bootstrap from the minimum of the two target critics
q1_target_mb, q2_target_mb = sess.run([qa1_tar, qa2_tar], feed_dict={obs_ph: mb_obs2, act_ph: double_noisy_actions})
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)
y_r = np.array(mb_rew) + discount * (1 - np.array(mb_done)) * q_target_mb
...
After adding the extra noise, we clip the actions to make sure that they don't exceed the range set by the environment.
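The target computation above can also be checked in isolation with plain NumPy, without a TensorFlow session. The critic outputs and mini-batch values below are hypothetical placeholders, standing in for what sess.run would return:

```python
import numpy as np

# Hypothetical mini-batch values, standing in for the critics' outputs
q1_target_mb = np.array([1.0, 2.0, 0.5])
q2_target_mb = np.array([0.8, 2.5, 0.7])
mb_rew = np.array([0.1, -0.2, 1.0])
mb_done = np.array([0, 0, 1])  # terminal flag: no bootstrap on the last transition
discount = 0.99

# Clipped double Q-learning: bootstrap from the smaller of the two critics
q_target_mb = np.minimum(q1_target_mb, q2_target_mb)
y_r = mb_rew + discount * (1 - mb_done) * q_target_mb
print(y_r)
```

Note how the terminal transition contributes only its reward, since the (1 - done) factor zeros out the bootstrapped value.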
Putting everything together, we obtain the algorithm that is shown in the following pseudocode:
---------------------------------------------------------------------------------
TD3 Algorithm
---------------------------------------------------------------------------------
Initialize online networks q1(θ1), q2(θ2), and the policy μ(φ)
Initialize target networks q1', q2', and μ' with the same weights as the online networks
Initialize an empty replay buffer D
Initialize the environment s <- env.reset()
for episode = 1..M do
    > Run an episode
    while not d:
        a <- μ(s) + exploration noise
        s', r, d <- env(a)
        > Store the transition in the buffer
        D <- D U (s, a, r, s', d)
        > Sample a minibatch b of transitions from D
        > Calculate the target value for every i in b:
            a' <- clip(μ'(s'_i) + clip(ε, -c, c), a_low, a_high)
            y_i = r_i + γ (1 - d_i) min(q1'(s'_i, a'), q2'(s'_i, a'))
        > Update the critics by minimizing Σ_i (q(s_i, a_i) - y_i)^2
        if iter % policy_update_frequency == 0:
            > Update the policy by maximizing q1(s_i, μ(s_i))
            > Targets update: θ' <- τ θ + (1 - τ) θ'
---------------------------------------------------------------------------------
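The delayed updates and the soft targets update in the pseudocode can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's implementation; polyak_update and the toy one-matrix "networks" are assumptions made for the example:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.1):
    # Soft update: target <- tau * online + (1 - tau) * target
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy example: one "weight matrix" per network
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
for step in range(1, 11):
    # Delayed updates: refresh the targets only every other critic update
    if step % 2 == 0:
        target = polyak_update(target, online, tau=0.1)

# After 5 soft updates, the target has moved a fraction 1 - 0.9**5 toward the online weights
print(target[0][0, 0])
```

A small tau makes the targets track the online networks slowly, which stabilizes the bootstrapped value estimates.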
That's everything for the TD3 algorithm. You now have a clear understanding of both the deterministic and the stochastic policy gradient methods. Almost all model-free algorithms are based on the principles that we explained in these chapters, and if you master them, you will be able to understand and implement all of them.