Implementation of TD3

To put this strategy into code, we have to create two critics with different initializations, compute the target action value as in (8.7), and optimize both critics.
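
As a reminder, the clipped target in (8.7) takes the minimum of the two target critics evaluated at the action chosen by the target policy. Written in the notation of the code that follows (this is a sketch of the idea, not the book's exact formula), it is:

y = r + \gamma (1 - d) \min_{i=1,2} Q_{\theta'_i}\big(s', \mu_{\phi'}(s')\big)

where d is the done flag, \mu_{\phi'} is the target policy, and Q_{\theta'_1}, Q_{\theta'_2} are the two target critics.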

TD3 is built on top of the DDPG implementation that we discussed in the previous section. The following snippets show only a portion of the additional code that is needed to implement TD3. The complete implementation is available in the GitHub repository of the book: https://github.com/PacktPublishing/Hands-On-Reinforcement-Learning-Algorithms-with-Python.

As for the double critic, you just have to create the networks by calling deterministic_actor_double_critic twice, once for the online networks and once for the target networks, as was done in DDPG. The function will look similar to this:

def deterministic_actor_double_critic(x, a, hidden_sizes, act_dim, max_act):
    # Deterministic policy (actor)
    with tf.variable_scope('p_mlp'):
        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh)

    # First critic
    with tf.variable_scope('q1_mlp'):
        q1_d = mlp(tf.concat([x, p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q1_mlp', reuse=True): # Use the weights of the MLP just defined
        q1_a = mlp(tf.concat([x, a], axis=-1), hidden_sizes, 1, last_activation=None)

    # Second critic
    with tf.variable_scope('q2_mlp'):
        q2_d = mlp(tf.concat([x, p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q2_mlp', reuse=True): # Reuse the weights of the second critic
        q2_a = mlp(tf.concat([x, a], axis=-1), hidden_sizes, 1, last_activation=None)

    return p_means, tf.squeeze(q1_d), tf.squeeze(q1_a), tf.squeeze(q2_d), tf.squeeze(q2_a)
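
The function is then called twice, as mentioned previously. The following is a minimal sketch of how that instantiation might look, assuming the same obs_ph and act_ph placeholders and the 'online' and 'target' variable scopes used in the DDPG implementation; apart from the names that appear in the snippets of this section (p_tar, qa1_tar, qa2_tar, and qd1_onl), the variable names are assumptions:

# Online actor and double critic
with tf.variable_scope('online'):
    p_onl, qd1_onl, qa1_onl, qd2_onl, qa2_onl = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim, max_act)

# Target actor and double critic (separate weights under a different scope)
with tf.variable_scope('target'):
    p_tar, _, qa1_tar, _, qa2_tar = deterministic_actor_double_critic(obs_ph, act_ph, hidden_sizes, act_dim, max_act)

As in DDPG, the target weights are kept separate from the online ones and are slowly updated toward them with a soft update, which is not shown here.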

The clipped target value of (8.7) is computed by first running the two target critics, which we called qa1_tar and qa2_tar, then taking the element-wise minimum of the two estimates, and finally using it to build the target values:

...
double_actions = sess.run(p_tar, feed_dict={obs_ph:mb_obs2})

q1_target_mb, q2_target_mb = sess.run([qa1_tar, qa2_tar], feed_dict={obs_ph:mb_obs2, act_ph:double_actions})
q_target_mb = np.min([q1_target_mb, q2_target_mb], axis=0)
y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb
...

Next, the critics can be optimized as usual:

...
q1_train_loss, q2_train_loss = sess.run([q1_opt, q2_opt], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph:mb_act})
...
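
The construction of q1_opt and q2_opt is not shown in this excerpt. A minimal sketch, assuming the online action-value outputs are named qa1_onl and qa2_onl (as in the earlier sketch) and that cr_lr is the critic learning rate, could look like this:

# MSE of each online critic against the shared clipped target y_ph
q1_loss = tf.reduce_mean((qa1_onl - y_ph)**2)
q2_loss = tf.reduce_mean((qa2_onl - y_ph)**2)

# One optimizer per critic; qa1_onl and qa2_onl are computed from act_ph,
# not from the actor's output, so these updates do not touch the actor's weights
q1_opt = tf.train.AdamOptimizer(cr_lr).minimize(q1_loss)
q2_opt = tf.train.AdamOptimizer(cr_lr).minimize(q2_loss)

If the loss values are also needed, for example for logging, q1_loss and q2_loss can be fetched in the same sess.run call as the two optimization operations.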

An important observation is that the policy is optimized with respect to only one of the two approximated Q-functions, in our case, the first critic. In fact, if you look at the full code, you'll see that p_loss is defined as p_loss = -tf.reduce_mean(qd1_onl).
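
To make the dependencies explicit, here is a minimal sketch of how that policy objective might be wired up, assuming the actor's weights live under the 'online/p_mlp' scope used in the earlier sketch and that ac_lr is the actor learning rate:

p_loss = -tf.reduce_mean(qd1_onl)

# Only the actor's weights are updated: qd1_onl also depends on the first
# critic's weights, so restricting var_list prevents the policy step from
# modifying the critic
p_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'online/p_mlp')
p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=p_vars)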
