The n-step AC model

As we already saw with TD learning, a fully online algorithm has low variance but high bias, the opposite of MC learning. In practice, a middle ground between fully online and MC methods is usually preferred. To balance this trade-off, the one-step return of online algorithms can be replaced by an n-step return.

If you remember, we already implemented n-step learning in the DQN algorithm. The only difference is that DQN is an off-policy algorithm and, in theory, n-step returns are correct only for on-policy algorithms. Nevertheless, we showed that with a small $n$, the performance still increased.

AC algorithms are on-policy; therefore, as far as the performance increase goes, it's possible to use arbitrarily large $n$ values. The integration of n-step returns in AC is pretty straightforward: the one-step return is replaced by the n-step return $G_t^n$, and the value function is taken at the state $s_{t+n}$:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t^n - V_w(s_t)\big)$$

Here, $G_t^n = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_w(s_{t+n})$. Pay attention here to how, if $s_{t+n}$ is a final state, $V(s_{t+n}) = 0$.
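To make this concrete, here is a minimal NumPy sketch of the n-step return computation. It is not taken from the book's code; the names n_step_returns, last_value, and done are illustrative. It works backward over a rollout, bootstrapping from the critic's value of the last state and setting it to zero when that state is terminal:

```python
import numpy as np

def n_step_returns(rewards, last_value, done, gamma=0.99):
    # Returns for a rollout: the first state gets the full n-step return
    # G_t^n; later states bootstrap at the same final state with shorter
    # horizons. `rewards` holds r_t, ..., r_{t+n-1}; `last_value` is
    # V_w(s_{t+n}); `done` is True if s_{t+n} is a final state, so V = 0.
    g = 0.0 if done else last_value
    returns = np.zeros(len(rewards), dtype=np.float32)
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * g
        returns[i] = g
    return returns
```

For example, `n_step_returns([1.0, 1.0, 1.0], last_value=0.5, done=False, gamma=0.9)` gives approximately `[3.07, 2.31, 1.45]`.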

Besides reducing the bias, the n-step return propagates reward information to earlier states faster, making the learning much more efficient.

Interestingly, the quantity $G_t^n - V_w(s_t)$ can be seen as an estimate of the advantage function. In fact, the advantage function is defined as follows:

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$$

Because $G_t^n$ is an estimate of $Q_\pi(s_t, a_t)$ and $V_w(s_t)$ approximates $V_\pi(s_t)$, we obtain an estimate of the advantage function. Usually, this function is easier to learn, as it only expresses the preference for one particular action over the others in a particular state; it doesn't have to learn the value of that state.
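As a sketch of how this estimate is typically used (an illustrative TensorFlow snippet, not the book's implementation), the actor's loss weights the log probability of each action by the advantage estimate $G_t^n - V_w(s_t)$, treated as a constant with respect to the actor's weights:

```python
import tensorflow as tf

def actor_loss(log_probs, returns, values):
    # `log_probs` = log pi_theta(a_t|s_t), `returns` = G_t^n,
    # `values` = V_w(s_t). The advantage estimate is not differentiated
    # through, so this loss only updates the actor.
    advantages = tf.stop_gradient(returns - values)  # ~ A(s_t, a_t)
    return -tf.reduce_mean(log_probs * advantages)
```

Minimizing this loss with SGD follows the policy gradient written above.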

Regarding the weights of the critic, they are optimized with one of the well-known SGD optimization methods, minimizing the MSE loss:

$$L(w) = \frac{1}{N} \sum_t \big(y_t - V_w(s_t)\big)^2$$

In the previous equation, the target values are computed as follows: $y_t = G_t^n = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_w(s_{t+n})$.
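A possible TensorFlow 2 sketch of one critic update under these assumptions follows; the names critic, optimizer, and critic_update are hypothetical, and the book's actual code may differ:

```python
import tensorflow as tf

def critic_update(critic, optimizer, states, targets):
    # One SGD step minimizing the MSE between V_w(s_t) and the
    # n-step targets y_t = G_t^n passed in as `targets`.
    with tf.GradientTape() as tape:
        values = tf.squeeze(critic(states), axis=-1)  # V_w(s_t)
        loss = tf.reduce_mean(tf.square(targets - values))
    grads = tape.gradient(loss, critic.trainable_variables)
    optimizer.apply_gradients(zip(grads, critic.trainable_variables))
    return loss
```

Here, `critic` is any Keras model mapping a batch of states to a scalar value estimate, and `optimizer` is any `tf.keras.optimizers` instance.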
