Deep Q-learning instabilities

With the loss function and the optimization technique we just presented, you might expect to be able to develop a working deep Q-learning algorithm. The reality is more subtle: a naive implementation probably won't work. Why? Once we introduce neural networks, we can no longer guarantee that each update improves the value estimates. Whereas tabular Q-learning is guaranteed to converge under mild conditions, its neural network counterpart is not.

Sutton and Barto, in Reinforcement Learning: An Introduction, describe a problem called the deadly triad, which arises when the following three factors are combined:

  • Function approximation
  • Bootstrapping (that is, updating an estimate from other estimates)
  • Off-policy learning (Q-learning is an off-policy algorithm, since its update is independent of the policy that's being used to gather experience)

But these are exactly the three main ingredients of the deep Q-learning algorithm. As the authors note, we cannot give up bootstrapping without hurting computational cost or data efficiency. Off-policy learning is important for creating more capable agents that can learn from experience generated by other policies. And without deep neural networks, we would lose the generalization power needed to handle large state spaces. Therefore, it is very important to design algorithms that preserve these three components while mitigating the deadly triad problem.

Moreover, from equations (5.2) and (5.3), the problem may look similar to supervised regression, but it is not. In supervised learning, when performing SGD, the mini-batches are sampled randomly from a fixed dataset, so that the samples are (approximately) independent and identically distributed (i.i.d.). In RL, it is the policy itself that gathers the experience. Because consecutive states are sequential and strongly correlated with each other, the i.i.d. assumption is immediately lost, causing severe instabilities when performing SGD.
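To make the contrast concrete, consider the difference between training on consecutive transitions as they arrive and training on transitions sampled uniformly from a store of past experience. The following is a minimal sketch of the second option; the ReplayBuffer class, its capacity, and the batch size are illustrative assumptions, not the implementation presented later:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns randomly sampled mini-batches,
    breaking the temporal correlation between consecutive states."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling: each batch mixes transitions collected
        # at very different times, which is much closer to the i.i.d.
        # setting assumed by SGD than feeding consecutive transitions.
        return random.sample(self.buffer, batch_size)
```

Sampling from such a buffer is one way to recover (approximate) independence between the samples in a mini-batch while still letting the policy generate the data.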

Another cause of instability is the non-stationarity of the Q-learning targets. From equations (5.2) and (5.3), you can see that the same neural network that is being updated is also the one that computes the target values, $y = r + \gamma \max_{a'} Q_\theta(s', a')$. This is dangerous, because the targets themselves shift at every training step. It's like shooting at a moving target without taking its movement into account. This behavior is due to the generalization performed by the neural network: updating the value of one state also changes the values of other, similar states, including those used in the targets. In the tabular case, where each entry is updated independently, this is not a problem.
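The sketch below shows where the moving target comes from in a naive update: the same network appears on both sides of the loss, so every gradient step also moves the target. The small PyTorch network, its dimensions, and the hyperparameters are illustrative assumptions only:

```python
import torch

# Hypothetical Q-network for 4-dimensional states and 2 actions,
# used only to illustrate the moving-target problem.
q_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def naive_update(state, action, reward, next_state, done):
    # The SAME network computes both the prediction and the target:
    # every optimizer step changes q_net, so the target y moves as well.
    q_value = q_net(state)[action]
    with torch.no_grad():
        y = reward + gamma * (1.0 - done) * q_net(next_state).max()
    loss = (q_value - y) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Keeping the target fixed for a while (for example, by computing it with a periodically updated copy of the network) is the kind of trick discussed next.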

Deep Q-learning is still poorly understood theoretically, but, as we'll soon see, there is an algorithm that uses a few tricks to make the data more i.i.d. and to alleviate the moving target problem. These tricks make the algorithm much more stable and flexible.
