Implementation of PPO

Now that we have the basic ingredients of PPO, we can implement it using Python and TensorFlow.

The structure and implementation of PPO are very similar to those of the actor-critic algorithms, with only a few additional parts, all of which we'll explain here.

One such addition is generalized advantage estimation (7.11), which takes only a few lines of code thanks to the already implemented discounted_rewards function that computes (7.12):

def GAE(rews, v, v_last, gamma=0.99, lam=0.95):
    # Append the bootstrap value to the vector of state values
    vs = np.append(v, v_last)
    # TD residuals: delta_t = r_t + gamma*V(s_t+1) - V(s_t)
    delta = np.array(rews) + gamma*vs[1:] - vs[:-1]
    # GAE: discounted sum of the TD residuals with discount factor gamma*lam
    gae_advantage = discounted_rewards(delta, 0, gamma*lam)
    return gae_advantage
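As a reminder, discounted_rewards computes the discounted cumulative sum of its first argument, bootstrapping from the value passed as its second argument. The following is a minimal sketch of such a helper, consistent with how it is called here; the version used in the repository may differ in the details:

import numpy as np

def discounted_rewards(rews, last_sv, gamma):
    # Discounted cumulative sum, bootstrapping from the value of the last state:
    # rtg[t] = rews[t] + gamma*rews[t+1] + ... + gamma^(T-t)*last_sv
    rtg = np.zeros_like(rews, dtype=np.float32)
    running_sum = last_sv
    for t in reversed(range(len(rews))):
        running_sum = rews[t] + gamma * running_sum
        rtg[t] = running_sum
    return rtg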

The GAE function is used in the store method of the Buffer class when a trajectory is stored: 

class Buffer():
    def __init__(self, gamma, lam):
        ...

    def store(self, temp_traj, last_sv):
        # temp_traj columns: [observation, reward, action, state value]
        if len(temp_traj) > 0:
            self.ob.extend(temp_traj[:,0])
            # Reward-to-go targets for the critic
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            # GAE advantages for the actor
            self.adv.extend(GAE(temp_traj[:,1], temp_traj[:,3], last_sv, self.gamma, self.lam))
            self.rtg.extend(rtg)
            self.ac.extend(temp_traj[:,2])

    def get_batch(self):
        return np.array(self.ob), np.array(self.ac), np.array(self.adv), np.array(self.rtg)

    def __len__(self):
        ...

Here, ... stands for lines of code that we haven't shown.
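To make the data flow concrete, here is an illustrative sketch (not the exact code from the repository) of how a trajectory could be collected and stored. Each row of temp_traj holds the observation, the reward, the action, and the predicted state value, and last_sv is the bootstrap value of the state where the trajectory was cut; env, sess, and steps_per_trajectory are assumed to be defined elsewhere:

buffer = Buffer(gamma=0.99, lam=0.95)
obs = env.reset()
temp_traj = []

for _ in range(steps_per_trajectory):
    # Sample an action and estimate the value of the current state
    act, val = sess.run([act_smp, s_values], feed_dict={obs_ph: [obs]})
    next_obs, rew, done, _ = env.step(np.squeeze(act))
    temp_traj.append([obs, rew, np.squeeze(act), np.squeeze(val)])
    obs = next_obs
    if done:
        break

# Bootstrap with 0 if the episode terminated, otherwise with the critic's estimate
last_sv = 0 if done else np.squeeze(sess.run(s_values, feed_dict={obs_ph: [obs]}))
buffer.store(np.array(temp_traj, dtype=object), last_sv)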

We can now define the clipped surrogate loss function (7.9):

def clipped_surrogate_obj(new_p, old_p, adv, eps):
    rt = tf.exp(new_p - old_p) # i.e. pi / old_pi
    return -tf.reduce_mean(tf.minimum(rt*adv, tf.clip_by_value(rt, 1-eps, 1+eps)*adv))

It is quite intuitive and it doesn't need further explanation.
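Still, to get a feel for what the clipping does, the following small NumPy check (not part of the training code) shows that once the probability ratio rt moves above 1+eps with a positive advantage, or below 1-eps with a negative advantage, the objective stops improving, so the incentive to push the new policy further away from the old one disappears:

import numpy as np

def clipped_term(ratio, adv, eps=0.2):
    # Per-sample surrogate term, before the negation and averaging done in the loss
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

print(clipped_term(1.5, adv=1.0))   # 1.2, not 1.5: gains above 1+eps are clipped
print(clipped_term(0.5, adv=-1.0))  # -0.8, not -0.5: the ratio is clipped at 1-eps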

The computational graph holds nothing new, but let's go through it quickly:

# Placeholders
act_ph = tf.placeholder(shape=(None,act_dim), dtype=tf.float32, name='act')
obs_ph = tf.placeholder(shape=(None, obs_dim[0]), dtype=tf.float32, name='obs')
ret_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='ret')
adv_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='adv')
old_p_log_ph = tf.placeholder(shape=(None,), dtype=tf.float32, name='old_p_log')

# Actor
with tf.variable_scope('actor_nn'):
    p_means = mlp(obs_ph, hidden_sizes, act_dim, tf.tanh, last_activation=tf.tanh)
    log_std = tf.get_variable(name='log_std', initializer=np.ones(act_dim, dtype=np.float32))
    # Sample actions by adding Gaussian noise with standard deviation exp(log_std)
    p_noisy = p_means + tf.random_normal(tf.shape(p_means), 0, 1) * tf.exp(log_std)
    act_smp = tf.clip_by_value(p_noisy, low_action_space, high_action_space)
    # Compute the gaussian log likelihood
    p_log = gaussian_log_likelihood(act_ph, p_means, log_std)

# Critic
with tf.variable_scope('critic_nn'):
    s_values = tf.squeeze(mlp(obs_ph, hidden_sizes, 1, tf.tanh, last_activation=None))

# PPO loss function
p_loss = clipped_surrogate_obj(p_log, old_p_log_ph, adv_ph, eps)
# MSE loss function
v_loss = tf.reduce_mean((ret_ph - s_values)**2)

# Optimizers
p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)
v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)
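The graph relies on the mlp and gaussian_log_likelihood helpers that have already been used in the previous implementations. As a reference point, this is a sketch of a diagonal Gaussian log-likelihood consistent with how it is called above; the exact implementation in the repository may differ slightly:

import numpy as np
import tensorflow as tf

def gaussian_log_likelihood(x, mean, log_std):
    # Log-likelihood of x under a diagonal Gaussian with the given mean and log std,
    # summed over the action dimensions
    log_p = -0.5 * (((x - mean) / (tf.exp(log_std) + 1e-9))**2 + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(log_p, axis=-1)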

The code for interacting with the environment and collecting the experience is the same as in AC and TRPO. However, in the PPO implementation in this book's GitHub repository, you can find a simple variant that uses multiple agents.

Once N*T transitions (where N is the number of trajectories to run and T is the time horizon of each trajectory) are collected, we are ready to update the policy and the critic. In both cases, the optimization runs multiple times and is done on mini-batches. But before that, we have to run p_log on the full batch, because the clipped objective needs the action log probabilities of the old policy:

...
obs_batch, act_batch, adv_batch, rtg_batch = buffer.get_batch()
# Log probabilities of the actions under the current policy, which becomes the "old" policy for this update
old_p_log = sess.run(p_log, feed_dict={obs_ph:obs_batch, act_ph:act_batch, adv_ph:adv_batch, ret_ph:rtg_batch})
old_p_batch = np.array(old_p_log)

lb = len(buffer)
shuffled_batch = np.arange(lb)

# Policy optimization steps
for _ in range(actor_iter):
    # shuffle the batch on every iteration
    np.random.shuffle(shuffled_batch)

    for idx in range(0, lb, minibatch_size):
        minib = shuffled_batch[idx:min(idx+minibatch_size, lb)]
        sess.run(p_opt, feed_dict={obs_ph:obs_batch[minib], act_ph:act_batch[minib], adv_ph:adv_batch[minib], old_p_log_ph:old_p_batch[minib]})

# Value function optimization steps
for _ in range(critic_iter):
    # shuffle the batch on every iteration
    np.random.shuffle(shuffled_batch)

    for idx in range(0, lb, minibatch_size):
        minib = shuffled_batch[idx:min(idx+minibatch_size, lb)]
        sess.run(v_opt, feed_dict={obs_ph:obs_batch[minib], ret_ph:rtg_batch[minib]})
...

On each optimization iteration, we shuffle the batch so that every mini-batch is different from the others. 

That's everything for the PPO implementation, but keep in mind that before and after every iteration, we also run the summaries that we will later use with TensorBoard to analyze the results and debug the algorithm. Again, we don't show that code here as it is always the same and quite long, but you can go through it in full in this book's repository. It is fundamental that you understand what each plot displays if you want to master these RL algorithms.
