Implementation of the TRPO algorithm

In this implementation section of the TRPO algorithm, we'll concentrate on the computational graph and the steps that are required to optimize the policy. We'll leave out the other components that we looked at in the previous chapters (such as the loop that gathers trajectories from the environment, the conjugate gradient algorithm, and the line search algorithm), but make sure to check out the full code in this book's GitHub repository. The implementation targets continuous control.

First, let's create all the placeholders and the two deep neural networks for the policy (the actor) and the value function (the critic): 

act_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='act')
obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')
old_mu_ph = tf.placeholder(shape=(None, act_dim), dtype=tf.float32, name='old_mu')
old_log_std_ph = tf.placeholder(shape=(act_dim,), dtype=tf.float32, name='old_log_std')
p_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_ph')
# result of the conjugate gradient algorithm
cg_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='cg')

# Actor neural network
with tf.variable_scope('actor_nn'):
    p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
    log_std = tf.get_variable(name='log_std', initializer=np.ones(act_dim, dtype=np.float32))

# Critic neural network
with tf.variable_scope('critic_nn'):
    s_values = mlp(obs_ph, hidden_sizes, 1, tf.nn.relu, last_activation=None)
    s_values = tf.squeeze(s_values)

There are a few things to note here:

  1. The placeholders with the old_ prefix refer to the tensors of the old policy.
  2. The actor and the critic are defined in two separate variable scopes because we'll need to select their parameters separately later on.
  3. The action space follows a Gaussian distribution with a covariance matrix that is diagonal and independent of the state. The diagonal can therefore be stored as a vector with one element per action, and we work with the logarithm of this vector (log_std).
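
Both networks are built with the mlp helper from the repository, which isn't shown here. A minimal sketch, assuming it simply stacks fully connected layers with the given hidden sizes, hidden activation, and output activation, could look like this:

def mlp(x, hidden_layers, output_size, activation=tf.nn.relu, last_activation=None):
    # Stack the hidden fully connected layers
    for units in hidden_layers:
        x = tf.layers.dense(x, units=units, activation=activation)
    # Final layer with output_size units and the (possibly None) last activation
    return tf.layers.dense(x, units=output_size, activation=last_activation)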

Now, we can add normal noise to the predicted mean according to the standard deviation, clip the actions, and compute the Gaussian log likelihood, as follows: 

p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)

a_sampl = tf.clip_by_value(p_noisy, low_action_space, high_action_space)

p_log = gaussian_log_likelihood(act_ph, p_means, log_std)
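
The gaussian_log_likelihood helper isn't reported in this snippet. A possible sketch, assuming the standard closed-form log density of a diagonal Gaussian summed over the action dimensions, is the following:

def gaussian_log_likelihood(x, mean, log_std):
    # Per-dimension log density of a diagonal Gaussian with the given mean and log std
    log_p = -0.5 * (((x - mean) / (tf.exp(log_std) + 1e-9))**2 + 2*log_std + np.log(2*np.pi))
    # Sum over the action dimensions to get the log likelihood of the full action
    return tf.reduce_sum(log_p, axis=-1)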

We then have to compute the surrogate objective function (the ratio between the new and the old policy, weighted by the advantage), the MSE loss function of the critic, and create the optimizer for the critic, as follows:

# TRPO loss function
ratio_new_old = tf.exp(p_log - old_p_log_ph)
p_loss = - tf.reduce_mean(ratio_new_old * adv_ph)

# MSE loss function
v_loss = tf.reduce_mean((ret_ph - s_values)**2)

# Critic optimization
v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)

Then, the subsequent steps involve creating the graph for points (2), (3), and (4) of the preceding pseudocode. Actually, (2) and (3) aren't done in TensorFlow, so they aren't part of the computational graph. Nevertheless, there are a few related things that we have to take care of in the graph. The steps are as follows:

  1. Estimate the gradient of the policy loss function.
  2. Define a procedure to restore the policy parameters. This is needed because, in the line search algorithm, we'll optimize the policy and test the constraints; if the new policy doesn't satisfy them, we'll have to restore the policy parameters and try again with a smaller α coefficient.
  3. Compute the Fisher-vector product. This is an efficient way to compute the product Fx without having to form the full Fisher information matrix F.
  4. Compute the TRPO step.
  5. Update the policy.

Let's start from step 1, that is, estimating the gradient of the policy loss function:

def variables_in_scope(scope):
    return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)

# Gather and flatten the actor parameters
p_variables = variables_in_scope('actor_nn')
p_var_flatten = flatten_list(p_variables)

# Gradient of the policy loss with respect to the actor parameters
p_grads = tf.gradients(p_loss, p_variables)
p_grads_flatten = flatten_list(p_grads)

Since we are working with flat vectors of parameters, we have to flatten them using flatten_list. variables_in_scope returns the trainable variables in scope. This function is used to get the variables of the actor, since the gradients have to be computed with respect to these variables only.
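
flatten_list isn't shown either; a reasonable sketch, assuming it just reshapes every tensor in the list to one dimension and concatenates the results into a single flat vector, is:

def flatten_list(tensor_list):
    # Reshape each tensor to a 1-D vector and concatenate them all
    return tf.concat([tf.reshape(t, [-1]) for t in tensor_list], axis=0)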

Regarding step 2, the policy parameters are restored in this way:

p_old_variables = tf.placeholder(shape=(None,), dtype=tf.float32, name='p_old_variables')

# variable used as index for restoring the actor's parameters
it_v1 = tf.Variable(0, trainable=False)
restore_params = []

for p_v in p_variables:
    upd_rsh = tf.reshape(p_old_variables[it_v1 : it_v1+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
    restore_params.append(p_v.assign(upd_rsh))
    it_v1 += tf.reduce_prod(p_v.shape)

restore_params = tf.group(*restore_params)

This iterates over the variables of each layer and assigns the values of the old variables to the current ones.

The Fisher-vector product of step 3 is obtained by calculating the second derivative of the KL divergence with respect to the policy variables:

# gaussian KL divergence of the two policies 
dkl_diverg = gaussian_DKL(old_mu_ph, old_log_std_ph, p_means, log_std)

# Jacobian of the KL divergence (Needed for the Fisher matrix-vector product)
dkl_diverg_grad = tf.gradients(dkl_diverg, p_variables)
dkl_matrix_product = tf.reduce_sum(flatten_list(dkl_diverg_grad) * p_ph)

# Fisher vector product
Fx = flatten_list(tf.gradients(dkl_matrix_product, p_variables))
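
gaussian_DKL computes the KL divergence between the old and the new Gaussian policies. A sketch based on the closed-form KL divergence between two diagonal Gaussians, averaged over the batch, could be:

def gaussian_DKL(mu_q, log_std_q, mu_p, log_std_p):
    # Closed-form KL(q || p) between two diagonal Gaussians, per dimension
    var_q, var_p = tf.exp(2*log_std_q), tf.exp(2*log_std_p)
    kl = log_std_p - log_std_q + (var_q + (mu_q - mu_p)**2) / (2*var_p) - 0.5
    # Sum over the action dimensions and average over the batch
    return tf.reduce_mean(tf.reduce_sum(kl, axis=-1))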

Steps 4 and 5 involve the application of the updates to the policy, where beta_ph is β, which is calculated using formula (7.6), and alpha is the rescaling factor found by the line search:

# NPG update
beta_ph = tf.placeholder(shape=(), dtype=tf.float32, name='beta')
npg_update = beta_ph * cg_ph
alpha = tf.Variable(1., trainable=False)

# TRPO update
trpo_update = alpha * npg_update

# Apply the updates to the policy
it_v = tf.Variable(0, trainable=False)
p_opt = []
for p_v in p_variables:
    upd_rsh = tf.reshape(trpo_update[it_v : it_v+tf.reduce_prod(p_v.shape)], shape=p_v.shape)
    p_opt.append(p_v.assign_sub(upd_rsh))
    it_v += tf.reduce_prod(p_v.shape)

p_opt = tf.group(*p_opt)

Note how, without α, the update reduces to the NPG update.

The update is applied to each variable of the policy. The work is done by p_v.assign_sub(upd_rsh), which assigns the value p_v - upd_rsh to p_v; that is, it performs the update p_v ← p_v − upd_rsh. The subtraction is there because we converted the objective function into a loss function.

Now, let's briefly see how all the pieces we've implemented come together when we update the policy at every iteration of the algorithm. The snippets of code we'll present here should be added after the innermost loop, where the trajectories are sampled. But before digging into the code, let's recap what we have to do:

  1. Get the output, log probability, standard deviation, and parameters of the policy that we used to sample the trajectories. This policy is our old policy.
  2. Get the conjugate gradient.
  3. Compute the step length, β.
  4. Execute the backtracking line search to obtain α.
  5. Run the policy update.

The first point is achieved by running a few operations: 

...
old_p_log, old_p_means, old_log_std = sess.run([p_log, p_means, log_std], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
old_actor_params = sess.run(p_var_flatten)
old_p_loss = sess.run([p_loss], feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})

The conjugate gradient algorithm requires as input a function that returns the estimated Fisher-vector product, the gradient of the objective function, and the number of iterations (in TRPO, this is a value between 5 and 15):

def H_f(p):
    return sess.run(Fx, feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, p_ph:p, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})

g_f = sess.run(p_grads_flatten, feed_dict={old_mu_ph:old_p_means, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
conj_grad = conjugate_gradient(H_f, g_f, iters=conj_iters)
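
conjugate_gradient runs outside the computational graph, in NumPy. A minimal sketch of the standard algorithm, which solves Fx = g using only the Fisher-vector products returned by H_f, might look like this (the version in the repository may differ in details such as stopping criteria):

def conjugate_gradient(H_f, g, iters=10):
    # Iteratively solve F x = g using only Fisher-vector products H_f(p)
    x = np.zeros_like(g)
    r = g.copy()                # residual: g - F x (x starts at zero)
    p = r.copy()                # current search direction
    r_dot = np.dot(r, r)
    for _ in range(iters):
        Fp = H_f(p)
        a = r_dot / (np.dot(p, Fp) + 1e-8)
        x += a * p              # move along the search direction
        r -= a * Fp             # update the residual
        new_r_dot = np.dot(r, r)
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x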

We can then compute the step length, beta_np, and the maximum coefficient, α (best_alpha), that satisfies the constraint using the backtracking line search algorithm, and run the optimization by feeding all the values to the computational graph:

beta_np = np.sqrt(2*delta / np.sum(conj_grad * H_f(conj_grad)))

def DKL(alpha_v):
    sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:alpha_v, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})
    a_res = sess.run([dkl_diverg, p_loss], feed_dict={old_mu_ph:old_p_means, old_log_std_ph:old_log_std, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch, old_p_log_ph:old_p_log})
    sess.run(restore_params, feed_dict={p_old_variables: old_actor_params})
    return a_res

best_alpha = backtracking_line_search(DKL, delta, old_p_loss, p=0.8)
sess.run(p_opt, feed_dict={beta_ph:beta_np, alpha:best_alpha, cg_ph:conj_grad, obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, old_p_log_ph:old_p_log})

...

As you can see, backtracking_line_search takes a function, DKL, that returns the KL divergence between the old and the new policy, the δ coefficient (the constraint value), and the loss of the old policy. Starting from α = 1, backtracking_line_search incrementally decreases α until the following condition is satisfied: the KL divergence is less than δ and the new loss function has decreased.
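
A sketch of backtracking_line_search that is consistent with this description (and with how it's called above, assuming a hypothetical max_iters cap on the number of shrinking steps) could be:

def backtracking_line_search(DKL, delta, old_loss, p=0.8, max_iters=10):
    # Start from alpha = 1 and shrink it geometrically until the candidate step
    # both satisfies the KL constraint and decreases the surrogate loss
    alpha = 1.0
    for _ in range(max_iters):
        kl, new_loss = DKL(alpha)
        if kl <= delta and new_loss <= np.squeeze(old_loss):
            return alpha
        alpha *= p
    # No acceptable step was found: return 0 so that the policy is left unchanged
    return 0.0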

To wrap up, the hyperparameters that are unique to TRPO are as follows:

  • delta (δ), the maximum KL divergence between the old and the new policy.
  • The number of conjugate gradient iterations, conj_iters. Usually, this is a value between 5 and 15.

Congratulations for coming this far! That was tough.
