Implementation of TD3

To put this strategy into code, we have to create two critics with different initializations, compute the target action value as in (8.7), and optimize both critics.
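
As a reminder, the clipped target in (8.7) takes the minimum of the two target critics evaluated at the action chosen by the target policy. Written in the notation of the code that follows (this is a sketch of the idea, not the book's exact formula), it is:

y = r + \gamma (1 - d) \min_{i=1,2} Q_{\theta'_i}\big(s', \mu_{\phi'}(s')\big)

where d is the done flag, \mu_{\phi'} is the target policy, and Q_{\theta'_1}, Q_{\theta'_2} are the two target critics.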

TD3 is built on top of the DDPG implementation that we discussed in the previous section. The following snippets show only a portion of the additional code that is needed to implement TD3. The complete implementation is available in the GitHub repository of the book: https://github.com/PacktPublishing/Hands-On-Reinforcement-Learning-Algorithms-with-Python.

As for the double critic, you just have to create the networks by calling deterministic_actor_double_critic twice, once for the online networks and once for the target networks, as was done in DDPG. The function will look similar to this:

def deterministic_actor_double_critic(x, a, hidden_sizes, act_dim, max_act):
    # Deterministic policy (actor)
    with tf.variable_scope('p_mlp'):
        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh)

    # First critic
    with tf.variable_scope('q1_mlp'):
        q1_d = mlp(tf.concat([x, p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q1_mlp', reuse=True): # Use the weights of the MLP just defined
        q1_a = mlp(tf.concat([x, a], axis=-1), hidden_sizes, 1, last_activation=None)

    # Second critic
    with tf.variable_scope('q2_mlp'):
        q2_d = mlp(tf.concat([x, p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q2_mlp', reuse=True): # Reuse the weights of the second critic
        q2_a = mlp(tf.concat([x, a], axis=-1), hidden_sizes, 1, last_activation=None)

    return p_means, tf.squeeze(q1_d), tf.squeeze(q1_a), tf.squeeze(q2_d), tf.squeeze(q2_a)
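
The function is then called twice, as mentioned previously. The following is a minimal sketch of how that instantiation might look, assuming the same obs_ph and act_ph placeholders and the 'online' and 'target' variable scopes used in the DDPG implementation; apart from the names that appear in the snippets of this section (p_tar, qa1_tar, qa2_tar, and qd1_onl), the variable names are assumptions:

# Online actor and double critic
with tf.variable_scope('online'):
    p_onl, qd1_onl, qa1_onl, qd2_onl, qa2_onl = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim, max_act)

# Target actor and double critic (separate weights under a different scope)
with tf.variable_scope('target'):
    p_tar, _, qa1_tar, _, qa2_tar = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim, max_act)

As in DDPG, the target weights are kept separate from the online ones and are slowly updated toward them with a soft update, which is not shown here.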

The clipped target value of (8.7) is computed by first running the two target critics, which we called qa1_tar and qa2_tar, then taking the element-wise minimum of the two estimates, and finally using it to build the target values:

...
double_actions = sess.run(p_tar, feed_dict={obs_ph:mb_obs2})

q1_target_mb, q2_target_mb = sess.run([qa1_tar, qa2_tar], feed_dict={obs_ph:mb_obs2, act_ph:double_actions})
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)
y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb
...

Next, the critics can be optimized as usual:

...
q1_train_loss, q2_train_loss = sess.run([q1_opt, q2_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph:mb_act})
...
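
The construction of q1_opt and q2_opt is not shown in this excerpt. A minimal sketch, assuming the online action-value outputs are named qa1_onl and qa2_onl (as in the earlier sketch) and that cr_lr is the critic learning rate, could look like this:

# MSE of each online critic against the shared clipped target y_ph
q1_loss = tf.reduce_mean((qa1_onl - y_ph)**2)
q2_loss = tf.reduce_mean((qa2_onl - y_ph)**2)

# One optimizer per critic; qa1_onl and qa2_onl are computed from act_ph,
# not from the actor's output, so these updates do not touch the actor's weights
q1_opt = tf.train.AdamOptimizer(cr_lr).minimize(q1_loss)
q2_opt = tf.train.AdamOptimizer(cr_lr).minimize(q2_loss)

If the loss values are also needed, for example for logging, q1_loss and q2_loss can be fetched in the same sess.run call as the two optimization operations.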

An important observation is that the policy is optimized with respect to only one of the two approximated Q-functions, in our case, the first critic. In fact, if you look at the full code, you'll see that p_loss is defined as p_loss = -tf.reduce_mean(qd1_onl).
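
To make the dependencies explicit, here is a minimal sketch of how that policy objective might be wired up, assuming the actor's weights live under the 'online/p_mlp' scope used in the earlier sketch and that ac_lr is the actor learning rate:

p_loss = -tf.reduce_mean(qd1_onl)

# Only the actor's weights are updated: qd1_onl also depends on the first
# critic's weights, so restricting var_list prevents the policy step from
# modifying the critic
p_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'online/p_mlp')
p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=p_vars)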
