Understanding the REINFORCE algorithm

The core of policy gradient algorithms has already been covered, but there is one more important concept to explain: we have yet to look at how the action values are computed.

We already saw, with formula (6.4):

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \; Q^{\pi_\theta}(s_{i,t}, a_{i,t}) \qquad (6.4)$$

that we are able to estimate the gradient of the objective function by sampling directly from the experience that is collected while following the policy.

The only two terms that are involved are the value of $Q^{\pi_\theta}(s,a)$ and the derivative of the logarithm of the policy, $\nabla_\theta \log \pi_\theta(a \mid s)$, which can be obtained through modern deep learning frameworks (such as TensorFlow and PyTorch). While we defined $Q^{\pi_\theta}(s,a)$, we haven't explained how to estimate the action-value function yet.
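To make these two terms concrete, here is a minimal PyTorch sketch of how the log-probability of an action, and hence its gradient, can be obtained through automatic differentiation. The small network, the dummy observation, and the placeholder action value are illustrative assumptions, not part of this chapter's implementation:

import torch
import torch.nn as nn

# Hypothetical policy network: maps a 4-dimensional observation to 2 action logits
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

obs = torch.rand(1, 4)                     # a dummy observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()                     # a ~ pi_theta(.|s)
log_prob = dist.log_prob(action)           # log pi_theta(a|s)

# Weighting log pi_theta(a|s) by an action-value estimate and calling
# backward() lets autodiff produce the gradient term of formula (6.4)
# for this single sample.
q_estimate = torch.tensor(2.0)             # placeholder for Q(s, a)
loss = -(log_prob * q_estimate).sum()      # minus sign: optimizers minimize
loss.backward()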

The simplest way, introduced for the first time in the REINFORCE algorithm by Williams, is to estimate the return using Monte Carlo (MC) returns. For this reason, REINFORCE is considered an MC algorithm. If you remember, MC returns are the return values of sampled trajectories run with a given policy. Thus, we can rewrite equation (6.4), replacing the action-value function, $Q^{\pi_\theta}(s,a)$, with the MC return, $G_t$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \; G_{i,t} \qquad (6.5)$$
The return is computed from a complete trajectory, implying that the PG update is available only after $T$ steps, where $T$ is the total number of steps in the trajectory. Another consequence is that the MC return is well defined only in episodic problems, where there is an upper bound on the maximum number of steps (the same conclusion we reached with the other MC algorithms we studied previously).

To get more practical, the discounted return at time $t$, which can also be called the reward to go, as it uses only future rewards, is as follows:

$$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{\,T-1-t} r_{T-1}$$

This can be rewritten recursively, as follows:

$$G_t = r_t + \gamma G_{t+1}, \qquad \text{with } G_{T-1} = r_{T-1}$$
This function can be implemented by proceeding in reverse order, starting from the last reward, as shown here:

import numpy as np

def discounted_rewards(rews, gamma):
    # rtg[i] holds the reward to go (discounted return) from step i onward
    rtg = np.zeros_like(rews, dtype=np.float32)
    # at the last step, the return is just the last reward: G_{T-1} = r_{T-1}
    rtg[-1] = rews[-1]
    # proceed backward, applying the recursion G_t = r_t + gamma * G_{t+1}
    for i in reversed(range(len(rews)-1)):
        rtg[i] = rews[i] + gamma*rtg[i+1]
    return rtg

Here, in the first place, a NumPy array is created, and the value of the last reward is assigned to the last element of rtg. This is done because, at time $T-1$, the return is simply $G_{T-1} = r_{T-1}$. Then, the algorithm computes each rtg[i] backward, using the subsequent value, rtg[i+1].
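As a quick sanity check, here is the function applied to a short, made-up reward sequence (the numbers are purely illustrative):

rews = [1.0, 0.0, 2.0, 1.0]
print(discounted_rewards(rews, gamma=0.99))
# output (approximately): [3.93 2.96 2.99 1.  ]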

The main cycle of the REINFORCE algorithm involves running a few epochs until it has gathered enough experience, and then optimizing the policy parameters. To be effective, the algorithm has to complete at least one epoch before performing the update step (it needs at least one full trajectory to compute the reward to go, $G_t$). REINFORCE is summarized in the following pseudocode:

Initialize π_θ with random weights

for episode 1..M do
    Initialize environment
    Initialize empty buffer D

    > Generate a few episodes
    for step 1..MaxSteps do
        > Collect experience by acting on the environment
        a_t ← π_θ(s_t)
        s_{t+1}, r_t, d_t ← Env(a_t)
        s_t ← s_{t+1}

        if d_t == True:
            > Compute the reward to go
            G_t ← Σ_{k=t..T-1} γ^(k-t) r_k               # for each t
            > Store the episode in the buffer
            D ← D ∪ (s_{1:T}, a_{1:T}, G_{1:T})          # where T is the length of the episode
    > REINFORCE update step using all the experience in D, following formula (6.5)
    θ ← θ + α (1/|D|) Σ_{(s,a,G)∈D} ∇_θ log π_θ(a|s) G
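To give a feel for how this pseudocode maps onto code, the following is a minimal, self-contained sketch of the REINFORCE cycle. The toy environment, the network, and all the names and hyperparameters are illustrative assumptions standing in for a real Gym-style environment and for the chapter's actual implementation:

import numpy as np
import torch
import torch.nn as nn

class ToyEnv:
    """Stand-in environment: random 4-dim observations, 2 actions, fixed-length episodes."""
    def __init__(self, horizon=20):
        self.horizon = horizon
    def reset(self):
        self.t = 0
        return np.random.randn(4).astype(np.float32)
    def step(self, action):
        self.t += 1
        obs = np.random.randn(4).astype(np.float32)
        rew = 1.0 if action == 0 else 0.0          # dummy reward signal
        done = self.t >= self.horizon
        return obs, rew, done

def discounted_rewards(rews, gamma):
    # reward to go, as defined earlier in this section
    rtg = np.zeros_like(rews, dtype=np.float32)
    rtg[-1] = rews[-1]
    for i in reversed(range(len(rews) - 1)):
        rtg[i] = rews[i] + gamma * rtg[i + 1]
    return rtg

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
env, gamma = ToyEnv(), 0.99

for epoch in range(50):
    obs_buf, act_buf, rtg_buf = [], [], []
    # > Generate a few episodes until enough experience is gathered
    while len(obs_buf) < 200:
        obs, done, rews = env.reset(), False, []
        while not done:
            # > Collect experience by acting on the environment
            logits = policy(torch.as_tensor(obs))
            action = torch.distributions.Categorical(logits=logits).sample().item()
            next_obs, rew, done = env.step(action)
            obs_buf.append(obs); act_buf.append(action); rews.append(rew)
            obs = next_obs
        # > Compute the reward to go and store the episode in the buffer
        rtg_buf.extend(discounted_rewards(rews, gamma))
    # > REINFORCE update step using all the experience, following formula (6.5)
    obs_t = torch.as_tensor(np.array(obs_buf))
    act_t = torch.as_tensor(act_buf)
    rtg_t = torch.as_tensor(np.array(rtg_buf))
    log_probs = torch.distributions.Categorical(logits=policy(obs_t)).log_prob(act_t)
    loss = -(log_probs * rtg_t).mean()             # minus sign: gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()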