Double DQN

The over-estimation of Q-values in Q-learning algorithms is a well-known problem. The cause is the max operator, which over-estimates the true maximum value whenever the estimates are noisy. To understand this problem, let's assume that we have noisy estimates with a mean of 0 but a variance different from 0, as shown in the following illustration. Although, asymptotically, the average of the estimates is 0, the max of the estimates will, in expectation, be greater than 0:

Figure 5.7. Six values sampled from a normal distribution with a mean of 0
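A quick numerical check makes this bias concrete. The following sketch draws several sets of noisy, zero-mean "Q-value estimates" and averages the maximum over many trials; the sample sizes and variable names here are illustrative assumptions, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 6       # number of noisy estimates per state (as in the figure)
n_trials = 100_000  # number of repetitions to average over

# Each row is one set of noisy estimates whose true value is 0 for every action.
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print("mean of the estimates:    ", noisy_q.mean())             # close to 0
print("mean of the max estimate: ", noisy_q.max(axis=1).mean())  # clearly > 0
```

Even though every individual estimate is unbiased, the average of the per-row maximum comes out well above 0, which is exactly the over-estimation introduced by the max operator.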

In Q-learning, this over-estimation is not a real problem as long as it is uniform across states and actions. If, however, the over-estimation is not uniform and the error differs across states and actions, it negatively affects the DQN algorithm and degrades the resulting policy.

To address this problem, in the paper Deep Reinforcement Learning with Double Q-learning, the authors suggest using two different estimators (that is, two neural networks): one for action selection and one for Q-value estimation. However, instead of using two separate networks and increasing the complexity, the paper proposes using the online network to choose the best action (the argmax operation) and the target network to compute its Q-value. With this solution, the target value, $y$, changes from the standard Q-learning form:

$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$

to the following:

$y = r + \gamma Q_{\theta'}\big(s', \operatorname{argmax}_{a'} Q_{\theta}(s', a')\big)$   (5.7)

Here, $\theta$ denotes the parameters of the online network and $\theta'$ denotes the parameters of the target network.

This decoupled version significantly reduces the over-estimation problem and improves the stability of the algorithm.
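To see how the two targets differ in practice, here is a minimal sketch of both computations, assuming the networks' outputs for the next states are already available as NumPy arrays of shape (batch_size, n_actions). The function and array names (dqn_target, q_online_next, q_target_next) and the hyperparameters are illustrative assumptions, not the book's code:

```python
import numpy as np

def dqn_target(rewards, dones, q_target_next, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the action.
    max_q = q_target_next.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_q

def double_dqn_target(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the action (argmax), while the
    # target network evaluates it, decoupling selection from evaluation.
    best_actions = q_online_next.argmax(axis=1)
    q_eval = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * q_eval
```

The only change with respect to standard DQN is which network's output drives the argmax; everything else in the training loop stays the same.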
