REINFORCE with baseline

REINFORCE has the nice property of being unbiased, because it uses the MC return, which is the true return of a full trajectory. However, this unbiasedness comes at the expense of the variance, which increases with the length of the trajectory. Why? This effect is due to the stochasticity of the policy. By executing a full trajectory, you know its actual return. However, the value that is assigned to each state-action pair may not be accurate, since the policy is stochastic, and executing it again may lead to different states and, consequently, different rewards. Moreover, the more actions there are in a trajectory, the more stochasticity is introduced into the system, and therefore, the higher the variance will be.

Luckily, it is possible to introduce a baseline, $b$, in the estimation of the return, thereby decreasing the variance and improving the stability and performance of the algorithm. The algorithm that adopts this strategy is called REINFORCE with baseline, and the gradient of its objective function is as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b\big)\right]$$

This trick of introducing a baseline is possible because the baseline term has zero expectation, so the gradient estimator remains unbiased:

$$\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,b\big] = b\sum_{a}\pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s) = b\,\nabla_\theta\sum_{a}\pi_\theta(a \mid s) = b\,\nabla_\theta 1 = 0$$

At the same time, for this equation to be true, the baseline must be a constant with respect to the actions.
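
To see this property concretely, the following small sketch (not taken from the book's code) estimates the policy gradient of a toy two-action softmax policy with and without a constant baseline. The mean of the estimator stays approximately the same, while its variance shrinks considerably:

    import numpy as np

    np.random.seed(0)
    theta = 0.5                     # single policy parameter
    rewards = np.array([1.0, 3.0])  # deterministic reward of each action

    def policy_probs(theta):
        # two-action softmax policy with logits [theta, 0]
        logits = np.array([theta, 0.0])
        exps = np.exp(logits - logits.max())
        return exps / exps.sum()

    def grad_log_pi(theta, action):
        # d/dtheta of log pi(action) for the softmax above
        p0 = policy_probs(theta)[0]
        return 1.0 - p0 if action == 0 else -p0

    def gradient_estimates(baseline, n_samples=50000):
        p = policy_probs(theta)
        actions = np.random.choice(2, size=n_samples, p=p)
        g = np.array([grad_log_pi(theta, a) * (rewards[a] - baseline)
                      for a in actions])
        return g.mean(), g.var()

    for b in (0.0, rewards.mean()):
        mean, var = gradient_estimates(b)
        print(f"baseline={b:.1f}  gradient mean={mean:+.4f}  variance={var:.4f}")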

Our job now is to find a good baseline. The simplest choice is to subtract the average of the returns collected in the batch:

$$b = \frac{1}{|D|}\sum_{t \in D} G_t$$

If you would like to implement this in the REINFORCE code, the only change is in the get_batch() function of the Buffer class:

    def get_batch(self):
        # subtract the average return of the batch as a baseline
        b_ret = self.ret - np.mean(self.ret)
        return self.obs, self.act, b_ret
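
Note that the rest of the training loop stays the same: the centered returns, b_ret, weight the log probabilities of the policy exactly where the raw returns were used before.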

Although this baseline decreases the variance, it's not the best strategy. Since the baseline can be conditioned on the state, a better idea is to use an estimate of the value function:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - V^{\pi_\theta}(s_t)\big)\right]$$

Remember that the value function is the expected return that is obtained by following the policy from a given state.

This variation introduces more complexity into the system, as we have to design an approximation of the value function, but it's very commonly used, and it considerably increases the performance of the algorithm.
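
As a quick illustration, the following minimal sketch (not part of the book's Buffer class) shows how the learned values could replace the constant baseline when building a batch. The critic object and its predict() method are hypothetical stand-ins for whatever value function approximator you use:

    import numpy as np

    def get_batch_with_value_baseline(obs, act, ret, critic):
        # critic.predict(obs) is assumed to return the estimated values V(s_t)
        # for the stored observations (hypothetical helper, not the book's API)
        values = np.asarray(critic.predict(obs))
        # G_t - V(s_t) replaces G_t - b as the weight of each log probability
        weights = np.asarray(ret) - values
        return obs, act, weights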

To learn $V^{\pi_\theta}_w$, the best solution is to fit a neural network with MC estimates:

$$V^{\pi_\theta}_w(s_t) \approx G_t$$

In the preceding equation, $w$ represents the parameters of the neural network to be learned.

In order not to overload the notation, from now on, we'll omit the policy, so that $V^{\pi_\theta}_w$ will become $V_w$.

The neural network is trained on the same trajectories' data that is used for learning $\pi_\theta$, without requiring additional interaction with the environment. Once computed, for example, with discounted_rewards(rews, gamma), the MC estimates become the target values, and the neural network is optimized to minimize the mean square error (MSE) loss, just as you'd do in a supervised learning task:

$$L(w) = \frac{1}{|D|}\sum_{t \in D}\big(V_w(s_t) - G_t\big)^2$$

Here, $w$ denotes the weights of the value function neural network, and each element of the dataset, $D$, contains a state, $s_t$, and the target value, $G_t$.
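
The following is a minimal sketch of this supervised fit, written with PyTorch rather than the framework used in the rest of the book's code; obs_batch and ret_batch are placeholder tensors standing in for the states and the discounted returns stored in the buffer:

    import torch
    import torch.nn as nn

    # placeholder data standing in for the states s_t and the MC returns G_t
    obs_dim = 4
    obs_batch = torch.randn(256, obs_dim)
    ret_batch = torch.randn(256)  # e.g. computed with discounted_rewards()

    # a small value network V_w(s) with a single scalar output
    value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    # a few optimization steps of supervised regression on the MC targets
    for _ in range(40):
        v_pred = value_net(obs_batch).squeeze(-1)    # V_w(s_t)
        v_loss = ((v_pred - ret_batch) ** 2).mean()  # MSE against G_t
        optimizer.zero_grad()
        v_loss.backward()
        optimizer.step()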
