N-step DQN

The idea behind n-step DQN is old and comes from the middle ground between temporal difference (TD) learning and Monte Carlo (MC) learning. These algorithms, which were introduced in Chapter 4, Q-Learning and SARSA Applications, sit at opposite extremes of a common spectrum: TD learning learns from a single step, while MC learning learns from the complete trajectory. TD learning exhibits minimal variance but maximal bias, whereas MC exhibits high variance but minimal bias. This bias-variance trade-off can be balanced using an n-step return, that is, a return computed after n steps. TD learning can be viewed as a one-step return, while MC can be viewed as an infinite-step return.

With the n-step return, we can update the target value, as follows:

$y_t = r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q_{\theta'}(s_{t+n}, a')$ (5.9)

Here, $n$ is the number of steps.
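
To make formula (5.9) concrete, here is a minimal sketch of how the target could be computed once the n rewards and the bootstrap value from the target network are available. The function name n_step_target and its arguments are illustrative assumptions, not code from this chapter.

```python
def n_step_target(rewards, bootstrap_value, done, gamma=0.99):
    """Compute the n-step target of formula (5.9).

    rewards: the n rewards r_t, ..., r_{t+n-1}
    bootstrap_value: max_a' Q_target(s_{t+n}, a') from the target network
    done: True if the episode terminated within these n steps
    """
    # Discounted sum of the collected rewards: r_t + gamma*r_{t+1} + ...
    target = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Bootstrap with gamma^n * max_a' Q_target(s_{t+n}, a') unless the episode ended
    if not done:
        target += gamma ** len(rewards) * bootstrap_value
    return target


# For example, with n = 3, rewards [1.0, 0.0, 1.0], and a bootstrap value of 2.5:
# y_t = 1.0 + 0.99*0.0 + 0.99**2 * 1.0 + 0.99**3 * 2.5
y = n_step_target([1.0, 0.0, 1.0], bootstrap_value=2.5, done=False)
```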

An n-step return is like looking ahead n steps but, since it's impossible to actually look into the future, in practice it is done the opposite way: the target value is computed for the transition that occurred n steps earlier. As a consequence, the target for time t is only available at time t + n, which delays the learning process.

The main advantage of this approach is that the target values are less biased, which can lead to faster learning. An important problem, however, is that the target values computed in this way are correct only when the learning is on-policy (DQN is off-policy). This is because formula (5.9) assumes that the policy the agent follows for the next n steps is the same policy that collected the experience. There are ways to correct for the off-policy case, but they are generally complicated to implement, and the best general practice is simply to keep n small and ignore the problem.
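
In practice, the delay described above is handled by keeping a small rolling buffer of the last n transitions in front of the usual replay buffer: once the n-th transition arrives, the first state and action are emitted together with the accumulated discounted reward and the state reached n steps later. The following sketch illustrates the idea; the class name NStepTransitionBuffer, its interface, and the default n=3 are assumptions, not code from this chapter.

```python
from collections import deque


class NStepTransitionBuffer:
    """Keep the last n transitions and, once enough have accumulated, emit
    (s_t, a_t, n-step reward, s_{t+n}, done) ready for the replay buffer."""

    def __init__(self, n=3, gamma=0.99):
        self.n = n
        self.gamma = gamma
        self.steps = deque(maxlen=n)

    def push(self, obs, action, reward, next_obs, done):
        self.steps.append((obs, action, reward, next_obs, done))
        if len(self.steps) < self.n and not done:
            return None  # not enough steps accumulated yet
        # Discounted sum of the buffered rewards: r_t + gamma*r_{t+1} + ...
        n_step_reward = sum(self.gamma ** i * s[2] for i, s in enumerate(self.steps))
        first_obs, first_action = self.steps[0][0], self.steps[0][1]
        last_next_obs, last_done = self.steps[-1][3], self.steps[-1][4]
        if done:
            # For simplicity, the remaining shorter returns at the end of an
            # episode are dropped here; a full implementation would flush them.
            self.steps.clear()
        return first_obs, first_action, n_step_reward, last_next_obs, last_done
```

The tuple returned by push() would then be stored in the regular replay buffer, and the training step would bootstrap with gamma**n, exactly as in formula (5.9).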
