The AC implementation

Overall, as we have seen so far, the AC algorithm is very similar to the REINFORCE algorithm with the state-value function used as a baseline. But, to provide a recap, the algorithm is summarized in the following pseudocode:

Initialize the actor π_θ and the critic V_ω with random weights
Initialize environment s ← env.reset()
for episode 1..M do
    Initialize empty buffer D

    > Generate a few episodes
    for step 1..MaxSteps do
        > Collect experience by acting on the environment
        a ← π_θ(s)
        s', r, d ← env(a)
        s ← s'

        if d == True:

            > Compute the n-step reward to go
            G_t = r_t + γ*G_(t+1)    # for each t
            > Compute the advantage values
            A_t = G_t - V_ω(s_t)     # for each t
            > Store the episode in the buffer
            D ← D ∪ {(s_(1:T), a_(1:T), G_(1:T), A_(1:T))}    # where T is the length of the episode

    > Actor update step using all the experience in D
    θ ← θ + α * (1/|D|) * Σ_t A_t ∇_θ log π_θ(a_t|s_t)
    > Critic update using all the experience in D
    ω ← ω - α_c * ∇_ω (1/|D|) * Σ_t (G_t - V_ω(s_t))²

The only differences from REINFORCE are the calculation of the n-step reward to go, the calculation of the advantage function, and a few adjustments to the main function.

Let's first look at the new implementation of the discounted reward. Unlike before, the estimated value of the last state, last_sv, is now passed as an input and is used to bootstrap, as given in the following implementation:

import numpy as np

def discounted_rewards(rews, last_sv, gamma):
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1] + gamma*last_sv  # Bootstrap with the estimated next state value
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg
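To see the effect of the bootstrap value, here is a quick made-up example (the rewards, discount factor, and state value are chosen purely for illustration): when the trajectory is truncated, the estimated value of the next state is folded into every reward to go; when the trajectory ends in a terminal state, we pass 0 instead.

# Hypothetical rewards, chosen only for illustration
rews = [1.0, 2.0, 3.0]

# Incomplete trajectory: bootstrap with the critic's estimate of the next state (4.0 here)
print(discounted_rewards(rews, last_sv=4.0, gamma=0.5))   # [3.25 4.5  5.  ]

# Trajectory ending in a terminal state: bootstrap with 0
print(discounted_rewards(rews, last_sv=0.0, gamma=0.5))   # [2.75 3.5  3.  ]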

The computational graph doesn't change, but in the main cycle, we have to take care of a few small, but very important, changes. 

Obviously, the name of the function is changed to AC, and the learning rate of the critic, cr_lr, is added as an argument.
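Although the graph itself is carried over unchanged, the following minimal sketch shows where the two learning rates end up. Everything except cr_lr and the names already used in the snippets below (obs_ph, act_ph, ret_ph, rtg_ph, s_values, act_multn, p_opt, v_opt) is an illustrative assumption, not the book's exact code:

import tensorflow as tf

# Sketch of the actor-critic graph, assuming a small discrete action space.
# obs_dim, act_dim, ac_lr, p_logits, p_log, p_loss, and v_loss are assumed names/values.
obs_dim, act_dim, ac_lr, cr_lr = 4, 2, 5e-3, 8e-3

obs_ph = tf.placeholder(tf.float32, shape=(None, obs_dim), name='obs')
act_ph = tf.placeholder(tf.int32, shape=(None,), name='act')
ret_ph = tf.placeholder(tf.float32, shape=(None,), name='ret')   # advantage values
rtg_ph = tf.placeholder(tf.float32, shape=(None,), name='rtg')   # reward-to-go targets

# Actor: a small MLP producing the action logits
p_logits = tf.layers.dense(tf.layers.dense(obs_ph, 32, tf.nn.tanh), act_dim)
act_multn = tf.squeeze(tf.multinomial(p_logits, 1))               # sample an action
# Critic: a small MLP producing the state value
s_values = tf.squeeze(tf.layers.dense(tf.layers.dense(obs_ph, 32, tf.nn.tanh), 1))

p_log = tf.reduce_sum(tf.one_hot(act_ph, depth=act_dim) * tf.nn.log_softmax(p_logits), axis=1)
p_loss = -tf.reduce_mean(p_log * ret_ph)                  # actor loss
v_loss = tf.reduce_mean((rtg_ph - s_values)**2)           # critic MSE loss

p_opt = tf.train.AdamOptimizer(ac_lr).minimize(p_loss)    # actor update
v_opt = tf.train.AdamOptimizer(cr_lr).minimize(v_loss)    # critic update uses the new cr_lr

The only structural novelty with respect to REINFORCE with baseline is that the critic optimizer now has its own learning rate, cr_lr.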

The first actual change involves the way in which the environment is reset. Whereas in REINFORCE the environment was reset on every iteration of the main cycle, in AC we have to resume the environment from where we left off in the previous iteration, resetting it only when it reaches its final state.

The second change involves the way in which the action-value function is bootstrapped, and how the reward to go is calculated. Remember that Q(s_t, a_t) = r_t + γV(s_(t+1)) for every state-action pair, except when s_(t+1) is a final state. In this case, Q(s_t, a_t) = r_t. Thus, we have to bootstrap with a value of 0 whenever the last state is a terminal state, and bootstrap with V(s_(t+1)) in all the other cases. With these changes, the code is as follows:

obs = env.reset()
ep_rews = []

for ep in range(num_epochs):
    buffer = Buffer(gamma)
    env_buf = []

    for _ in range(steps_per_env):
        # Sample an action and estimate the value of the current state
        act, val = sess.run([act_multn, s_values], feed_dict={obs_ph:[obs]})
        # Take a step in the environment
        obs2, rew, done, _ = env.step(np.squeeze(act))

        # Store the observation, reward, action, and estimated state value
        env_buf.append([obs.copy(), rew, act, np.squeeze(val)])
        obs = obs2.copy()
        step_count += 1
        last_test_step += 1
        ep_rews.append(rew)

        if done:
            # The episode reached a terminal state: bootstrap with a value of 0
            buffer.store(np.array(env_buf), 0)
            env_buf = []
            # Keep some statistics and reset the environment
            train_rewards.append(np.sum(ep_rews))
            train_ep_len.append(len(ep_rews))
            obs = env.reset()
            ep_rews = []

    # The last trajectory is incomplete: bootstrap with the estimated value of the next state
    if len(env_buf) > 0:
        last_sv = sess.run(s_values, feed_dict={obs_ph:[obs]})
        buffer.store(np.array(env_buf), last_sv)

    # Retrieve the collected experience and run the actor and critic update steps
    obs_batch, act_batch, ret_batch, rtg_batch = buffer.get_batch()
    sess.run([p_opt, v_opt], feed_dict={obs_ph:obs_batch, act_ph:act_batch, ret_ph:ret_batch, rtg_ph:rtg_batch})
    ...

The third change is in the store method of the Buffer class, since we now also have to deal with incomplete trajectories. In the previous snippet, we saw that the estimated state value of the last state is passed to store alongside the trajectory. Indeed, we use it to bootstrap and to compute the reward to go. In the new version of store, we call the variable that holds this state value last_sv and pass it as an input to the discounted_rewards function, as follows:

    def store(self, temp_traj, last_sv):
        # Store only if the temporary trajectory contains at least one transition
        if len(temp_traj) > 0:
            self.obs.extend(temp_traj[:,0])
            # Compute the reward to go, bootstrapping with last_sv
            rtg = discounted_rewards(temp_traj[:,1], last_sv, self.gamma)
            # Advantage = reward to go - estimated state value
            self.ret.extend(rtg - temp_traj[:,3])
            self.rtg.extend(rtg)
            self.act.extend(temp_traj[:,2])
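The rest of the Buffer class is not reproduced here, but from the attributes used in store and the get_batch call in the main cycle, a minimal version might look like the following sketch (the actual implementation may differ):

import numpy as np

class Buffer():
    def __init__(self, gamma):
        self.gamma = gamma
        self.obs = []   # observations
        self.act = []   # actions
        self.ret = []   # advantage values (used by the actor update)
        self.rtg = []   # reward to go (used as the critic's target)

    # store(self, temp_traj, last_sv) as defined above

    def get_batch(self):
        # Return everything collected during the epoch as numpy arrays
        return np.array(self.obs), np.array(self.act), np.array(self.ret), np.array(self.rtg)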
