Delayed policy updates

Since the high variance is attributed to an inaccurate critic, TD3 proposes delaying the policy update until the critic's error is small enough. TD3 implements the delay empirically, updating the policy only once every fixed number of critic updates. This gives the critic time to learn and stabilize before the policy is optimized against it. In practice, the policy is held fixed for only a few iterations, typically between 1 and 6; a value of 1 recovers the same behavior as DDPG. The delayed policy updates can be implemented as follows:

    ...
    # Train both critics at every step
    q1_train_loss, q2_train_loss = sess.run([q1_opt, q2_opt], feed_dict={obs_ph: mb_obs, y_ph: y_r, act_ph: mb_act})

    # Update the policy and the target networks only every policy_update_freq steps
    if step_count % policy_update_freq == 0:
        sess.run(p_opt, feed_dict={obs_ph: mb_obs})
        sess.run(update_target_op)
    ...
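
To make the scheduling logic concrete outside of a TensorFlow session, the following is a minimal, framework-agnostic sketch of the same idea. The update_critics, update_policy, and update_targets callables are hypothetical placeholders standing in for the optimization ops shown above, not part of the original code.

    # A minimal sketch of the delayed-update schedule (placeholder callables,
    # not the book's TensorFlow ops).
    def train(num_steps, policy_update_freq=2,
              update_critics=lambda: None,
              update_policy=lambda: None,
              update_targets=lambda: None):
        """Update the critics at every step, but the policy (and the target
        networks) only every policy_update_freq steps. With
        policy_update_freq=1 this reduces to the DDPG schedule."""
        for step_count in range(1, num_steps + 1):
            update_critics()                        # critics trained every step
            if step_count % policy_update_freq == 0:
                update_policy()                     # delayed policy update
                update_targets()                    # targets follow the policy update

    if __name__ == '__main__':
        # Count how often each component is updated over 10 steps.
        counts = {'critic': 0, 'policy': 0}
        train(10, policy_update_freq=2,
              update_critics=lambda: counts.__setitem__('critic', counts['critic'] + 1),
              update_policy=lambda: counts.__setitem__('policy', counts['policy'] + 1))
        print(counts)  # {'critic': 10, 'policy': 5}

With policy_update_freq=2, the critics receive twice as many gradient steps as the policy, which is the ratio used in the TD3 paper.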