DDPG implementation

The pseudocode that was given in the preceding section already provides a comprehensive view of the algorithm, but from an implementation standpoint, there are a few things that are worth looking at in more depth. Here, we'll show the more interesting features that could also recur in other algorithms. The full code is available in the GitHub repository of the book: https://github.com/PacktPublishing/Reinforcement-Learning-Algorithms-with-Python.

Specifically, we'll focus on a few main parts:

  • How to build a deterministic actor-critic
  • How to do soft updates
  • How to optimize a loss function, with respect to only some parameters
  • How to calculate the target values

We define a deterministic actor and a critic inside a function called deterministic_actor_critic. This function will be called twice, as we need to create both an online and a target actor-critic. The code is as follows:

def deterministic_actor_critic(x, a, hidden_sizes, act_dim, max_act):
    with tf.variable_scope('p_mlp'):
        p_means = max_act * mlp(x, hidden_sizes, act_dim, last_activation=tf.tanh)
    with tf.variable_scope('q_mlp'):
        q_d = mlp(tf.concat([x, p_means], axis=-1), hidden_sizes, 1, last_activation=None)
    with tf.variable_scope('q_mlp', reuse=True): # reuse the weights
        q_a = mlp(tf.concat([x, a], axis=-1), hidden_sizes, 1, last_activation=None)
    return p_means, tf.squeeze(q_d), tf.squeeze(q_a)

There are three interesting things happening inside this function. The first is that we are distinguishing between two types of input for the same critic: one takes the state together with p_means, the deterministic action returned by the policy, while the other takes the state and an arbitrary action. This distinction is needed because the former will be used to optimize the actor, while the latter will be used to optimize the critic. Nevertheless, despite having different inputs, the two critics are the same neural network, meaning that they share the same parameters. This is accomplished by defining both instances of the critic in the same variable scope and setting reuse=True on the second one, which ensures that the parameters are shared between the two definitions, in practice creating only one critic.
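If you want to convince yourself that reuse=True really shares the weights, you can inspect the variables that live under the critic's scope once the graph has been built. The following quick check is only an illustrative sketch, not part of the book's code:

# Illustrative sketch: list the variables created under the critic's scope.
# Even though the critic is "defined" twice inside deterministic_actor_critic
# (once for q_d and once for q_a), only one set of weights and biases per
# layer should show up for each copy of the actor-critic.
q_vars = [v for v in tf.global_variables() if 'q_mlp' in v.name]
for v in q_vars:
    print(v.name, v.shape)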

The second observation is that we are defining the actor inside a variable scope called p_mlp. This is because, later on, we'll need to retrieve only these parameters, and not those of the critic.

The third observation concerns the max_act factor. The policy has a tanh function as its final activation layer, which constrains the output to be between -1 and 1, but the environment may accept actions outside this range. We therefore multiply the output by max_act (this assumes that the minimum and maximum allowed values are opposite; that is, if the maximum allowed value is 3, the minimum is -3).
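The mlp helper used inside deterministic_actor_critic is not reproduced in this section. As a reference, a minimal sketch of what it could look like, assuming it simply stacks fully connected layers with a configurable activation on the last one, is the following:

import tensorflow as tf

def mlp(x, hidden_sizes, output_size, activation=tf.nn.relu, last_activation=None):
    # Stack of fully connected hidden layers
    for size in hidden_sizes:
        x = tf.layers.dense(x, units=size, activation=activation)
    # Output layer with the activation chosen by the caller (for example, tanh or None)
    return tf.layers.dense(x, units=output_size, activation=last_activation)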

Nice! Let's now have a look at the remainder of the computational graph, where we define the placeholders; create the online and target actors, as well as the online and target critics; define the losses; implement the optimizers; and update the target networks.

We'll start from the creation of the placeholders that we'll need for the observations, the actions, and the target values:

obs_dim = env.observation_space.shape
act_dim = env.action_space.shape
obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
act_ph = tf.placeholder(shape=(None, act_dim[0]), dtype=tf.float32, name='act')
y_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='y')

In the preceding code, y_ph is the placeholder for the target Q-values, obs_ph for the observations, and act_ph for the actions.

We then call the previously defined deterministic_actor_critic function inside an online and target variable scope, so as to differentiate the four neural networks:

with tf.variable_scope('online'):
    p_onl, qd_onl, qa_onl = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))

with tf.variable_scope('target'):
    _, qd_tar, _ = deterministic_actor_critic(obs_ph, act_ph, hidden_sizes, act_dim[0], np.max(env.action_space.high))

The loss of the critic is the MSE loss between the Q-values of the qa_onl online network, and the y_ph target action value:

q_loss = tf.reduce_mean((qa_onl - y_ph)**2)

This will be minimized with the Adam optimizer:

q_opt = tf.train.AdamOptimizer(cr_lr).minimize(q_loss)

With regard to the actor, its loss function is the online Q-values with the opposite sign. In this case, the online Q-network takes as input the actions chosen by the online deterministic actor (as in formula (8.6), which was defined in the pseudocode of The DDPG algorithm section). Thus, these Q-values are represented by qd_onl, and the policy loss function is written as follows:

p_loss = -tf.reduce_mean(qd_onl)

We took the opposite sign of the objective function, because we have to convert it to a loss function, considering that the optimizers need to minimize a loss function.

Now, the most important thing to remember here is that, despite computing the gradient from the p_loss loss function that depends on both the critic and the actor, we only need to update the actor. Indeed, from DPG, we know that $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)}\big]$, so the gradient flows through the critic, but only the actor's parameters $\theta$ are updated.

This is accomplished by passing p_loss to the minimize method of the optimizer, along with the var_list argument that specifies the variables to be updated. In this case, we need to update only the variables of the online actor, which were defined in the online/p_mlp variable scope:

p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss, var_list=variables_in_scope('online/p_mlp'))

In this way, the computation of the gradient will start from p_loss, go through the critic's network, and then the actor's network. By the end, only the parameters of the actor will be optimized.

Now, we have to define the variables_in_scope(scope) function, which returns the variables in the scope named scope:

def variables_in_scope(scope):
    return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope)

It's now time to look at how the target networks are updated. We can use variables_in_scope to get the target and online variables of both the actors and the critics, and use the TensorFlow assign function on the target variables to update them, following the soft update formula:

$$\theta_{\text{targ}} \leftarrow \tau\,\theta + (1 - \tau)\,\theta_{\text{targ}}$$

This is done in the following snippet of code:

update_target = [target_var.assign(tau*online_var + (1-tau)*target_var)
                 for target_var, online_var in zip(variables_in_scope('target'),
                                                   variables_in_scope('online'))]
update_target_op = tf.group(*update_target)

That's it! For the computational graph, that's everything. Pretty straightforward, right? Now we can take a quick look at the main cycle, where the parameters are updated by following the gradient estimated on a finite batch of samples. The interaction of the policy with the environment is standard, with the exception that the actions returned by the policy are now deterministic, so we have to add a certain amount of noise in order to adequately explore the environment. This part of the code is not provided here (a minimal sketch of the exploration noise is given below), but you can find the full implementation on GitHub.
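As a reference only, and not the book's exact implementation, additive Gaussian exploration noise could be applied to the deterministic action along these lines, where noise_scale is a hypothetical hyperparameter:

def get_noisy_action(obs, noise_scale):
    # Deterministic action from the online policy
    action = sess.run(p_onl, feed_dict={obs_ph: obs.reshape(1, -1)})[0]
    # Add Gaussian noise to encourage exploration
    action += noise_scale * np.random.randn(act_dim[0])
    # Clip the result so that it stays inside the allowed action range
    return np.clip(action, env.action_space.low, env.action_space.high)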

When a minimum amount of experience has been acquired, and the buffer has reached a certain threshold, the optimization of the policy and the critic starts. The steps that follow are those that are summarized in the DDPG pseudocode that was provided in The DDPG algorithm section. These are as follows:

  1. Sample a mini-batch from the buffer
  2. Calculate the target action values
  3. Optimize the critic
  4. Optimize the actor
  5. Update the target networks

All these operations are executed in just a few lines of code: 

    ...

    mb_obs, mb_rew, mb_act, mb_obs2, mb_done = buffer.sample_minibatch(batch_size)

    q_target_mb = sess.run(qd_tar, feed_dict={obs_ph:mb_obs2})
    y_r = np.array(mb_rew) + discount*(1-np.array(mb_done))*q_target_mb

    _, q_train_loss = sess.run([q_opt, q_loss], feed_dict={obs_ph:mb_obs, y_ph:y_r, act_ph:mb_act})

    _, p_train_loss = sess.run([p_opt, p_loss], feed_dict={obs_ph:mb_obs})

    sess.run(update_target_op)

    ...

The first line of code samples a mini-batch of size batch_size. The second and third lines compute the target action values, as defined in equation (8.4), by running the critic and actor target networks on mb_obs2, which contains the next states. The fourth line optimizes the critic by feeding the dictionary with the target action values that were just computed, as well as the observations and actions. The fifth line optimizes the actor, and the last one updates the target networks by running update_target_op.
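The buffer object and its sample_minibatch method are also defined in the full code on GitHub. As a rough sketch only, and not necessarily the book's implementation, a minimal replay buffer exposing that interface could look as follows:

import numpy as np

class Buffer:
    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.storage = []

    def store(self, obs, rew, act, obs2, done):
        # Drop the oldest transition once the buffer is full
        if len(self.storage) == self.maxlen:
            self.storage.pop(0)
        self.storage.append((obs, rew, act, obs2, done))

    def sample_minibatch(self, batch_size):
        # Sample transitions uniformly at random and return them column-wise
        idxs = np.random.randint(len(self.storage), size=batch_size)
        batch = [self.storage[i] for i in idxs]
        mb_obs, mb_rew, mb_act, mb_obs2, mb_done = map(np.array, zip(*batch))
        return mb_obs, mb_rew, mb_act, mb_obs2, mb_done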
