Implementation of PPO

Now that we have the basic ingredients of PPO, we can implement it using Python and TensorFlow.

The structure and implementation of PPO are very similar to those of the actor-critic algorithms, with only a few additional parts, all of which we'll explain here.

One such addition is generalized advantage estimation (7.11), which takes only a few lines of code thanks to the already implemented discounted_rewards function that computes (7.12):

def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    # Append the bootstrap value to the vector of state values
    vs = np.append(v, v_last)
    # TD residuals: delta_t = r_t + gamma*V(s_t+1) - V(s_t)
    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]
    # GAE: discounted sum of the TD residuals with discount factor gamma*lam
    gae_advantage = discounted_rewards(delta, 0, gamma*lam)
    return gae_advantage
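As a reminder, discounted_rewards computes the discounted cumulative sum of its first argument, bootstrapping from the value passed as its second argument. The following is a minimal sketch of such a helper, consistent with how it is called here; the version used in the repository may differ in the details:

import numpy as np

def discounted_rewards(rews, last_sv, gamma):
    # Discounted cumulative sum, bootstrapping from the value of the last state:
    # rtg[t] = rews[t] + gamma*rews[t+1] + ... + gamma^(T-t)*last_sv
    rtg = np.zeros_like(rews, dtype=np.float32)
    running_sum = last_sv
    for t in reversed(range(len(rews))):
        running_sum = rews[t] + gamma * running_sum
        rtg[t] = running_sum
    return rtg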

The GAE function is used in the store method of the Buffer class when a trajectory is stored: 

class Buffer():
    def __init__(self, gamma, lam):
        ...

    def store(self, temp_traj, last_sv):
        # temp_traj columns: [observation, reward, action, state value]
        if len(temp_traj) > 0:
            self.ob.extend(temp_traj[:,0])
            # Reward-to-go targets for the critic
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            # GAE advantages for the actor
            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))
            self.rtg.extend(rtg)
            self.ac.extend(temp_traj[:,2])

    def get_batch(self):
        return np.array(self.ob), np.array(self.ac), np.array(self.adv), np.array(self.rtg)

    def __len__(self):
        ...

Here, ... stands for lines of code that we haven't shown.
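To make the data flow concrete, here is an illustrative sketch (not the exact code from the repository) of how a trajectory could be collected and stored. Each row of temp_traj holds the observation, the reward, the action, and the predicted state value, and last_sv is the bootstrap value of the state where the trajectory was cut; env, sess, and steps_per_trajectory are assumed to be defined elsewhere:

buffer = Buffer(gamma=0.99, lam=0.95)
obs = env.reset()
temp_traj = []

for _ in range(steps_per_trajectory):
    # Sample an action and estimate the value of the current state
    act, val = sess.run([act_smp, s_values], feed_dict={obs_ph: [obs]})
    next_obs, rew, done, _ = env.step(np.squeeze(act))
    temp_traj.append([obs, rew, np.squeeze(act), np.squeeze(val)])
    obs = next_obs
    if done:
        break

# Bootstrap with 0 if the episode terminated, otherwise with the critic's estimate
last_sv = 0 if done else np.squeeze(sess.run(s_values, feed_dict={obs_ph: [obs]}))
buffer.store(np.array(temp_traj, dtype=object), last_sv)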

We can now define the clipped surrogate loss function (7.9):

def clipped_surrogate_obj(new_p, old_p, adv, eps):
    rt = tf.exp(new_p - old_p) # i.e. pi / old_pi
    return -tf.reduce_mean(tf.minimum(rt*adv, tf.clip_by_value(rt, 1-eps, 1+eps)*adv))

It is quite intuitive and it doesn't need further explanation.
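Still, to get a feel for what the clipping does, the following small NumPy check (not part of the training code) shows that once the probability ratio rt moves above 1+eps with a positive advantage, or below 1-eps with a negative advantage, the objective stops improving, so the incentive to push the new policy further away from the old one disappears:

import numpy as np

def clipped_term(ratio, adv, eps=0.2):
    # Per-sample surrogate term, before the negation and averaging done in the loss
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

print(clipped_term(1.5, adv=1.0))   # 1.2, not 1.5: gains above 1+eps are clipped
print(clipped_term(0.5, adv=-1.0))  # -0.8, not -0.5: the ratio is clipped at 1-eps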

The computational graph holds nothing new, but let's go through it quickly:

# Placeholders
act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')
obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')

# Actor
with tf.variable_scope('actor_nn'):
    p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
    log_std = tf.get_variable(name='log_std', initializer=np.ones(act_dim, dtype=np.float32))
    # Sample actions by adding Gaussian noise with standard deviation exp(log_std)
    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)
    act_smp = tf.clip_by_value(p_noisy, low_action_space, high_action_space)
    # Compute the gaussian log likelihood
    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)

# Critic
with tf.variable_scope('critic_nn'):
    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None))

# PPO loss function
p_loss = clipped_surrogate_obj(p_log, old_p_log_ph, adv_ph, eps)
# MSE loss function
v_loss = tf.reduce_mean((ret_ph - s_values)**2)

# Optimizers
p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)
v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)
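The graph relies on the mlp and gaussian_log_likelihood helpers that have already been used in the previous implementations. As a reference point, this is a sketch of a diagonal Gaussian log-likelihood consistent with how it is called above; the exact implementation in the repository may differ slightly:

import numpy as np
import tensorflow as tf

def gaussian_log_likelihood(x, mean, log_std):
    # Log-likelihood of x under a diagonal Gaussian with the given mean and log std,
    # summed over the action dimensions
    log_p = -0.5 * (((x - mean) / (tf.exp(log_std) + 1e-9))**2 + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(log_p, axis=-1)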

The code for interacting with the environment and collecting the experience is the same as in AC and TRPO. However, in the PPO implementation in this book's GitHub repository, you can find a simple variant that uses multiple agents.

Once N*T transitions (where N is the number of trajectories to run and T is the time horizon of each trajectory) are collected, we are ready to update the policy and the critic. In both cases, the optimization runs multiple times and is done on mini-batches. But before that, we have to run p_log on the full batch, because the clipped objective needs the action log probabilities of the old policy:

...
obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()
# Log probabilities of the actions under the current policy, which becomes the "old" policy for this update
old_p_log = sess.run(p_log, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
old_p_batch = np.array(old_p_log)

lb = len(buffer)
shuffled_batch = np.arange(lb)

# Policy optimization steps
for _ in range(actor_iter):
    # shuffle the batch on every iteration
    np.random.shuffle(shuffled_batch)

    for idx in range(0, lb, minibatch_size):
        minib = shuffled_batch[idx:min(idx+minibatch_size, lb)]
        sess.run(p_opt, feed_dict={obs_ph:obs_batch[minib], act_ph:act_batch[minib], adv_ph:adv_batch[minib], old_p_log_ph:old_p_batch[minib]})

# Value function optimization steps
for _ in range(critic_iter):
    # shuffle the batch on every iteration
    np.random.shuffle(shuffled_batch)

    for idx in range(0, lb, minibatch_size):
        minib = shuffled_batch[idx:min(idx+minibatch_size, lb)]
        sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})
...

On each optimization iteration, we shuffle the batch so that every mini-batch is different from the others. 

That's everything for the PPO implementation, but keep in mind that before and after every iteration, we also run the summaries that we will later use with TensorBoard to analyze the results and debug the algorithm. Again, we don't show that code here as it is always the same and quite long, but you can go through it in full in this book's repository. It is fundamental that you understand what each plot displays if you want to master these RL algorithms.
