Q-learning with neural networks

In Q-learning, a deep neural network learns a set of weights to approximate the Q-value function. Thus, the Q-value function is parametrized by $\theta$ (the weights of the network) and written as follows:

$$Q_\theta(s,a)$$
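As a concrete illustration, here is a minimal sketch of such a parametrized Q-function, written in PyTorch (the framework choice, layer sizes, and class name are assumptions for illustration only): a small fully connected network that takes a state and outputs one Q-value per discrete action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q_theta(s, a): a small MLP that outputs one Q-value per discrete action.
    The two-layer architecture and hidden size of 64 are illustrative assumptions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # obs: tensor of shape (batch, obs_dim) -> Q-values of shape (batch, n_actions)
        return self.net(obs)
```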
To adapt Q-learning to deep neural networks (this combination is known as deep Q-learning), we have to come up with a loss function (or objective) to minimize.

As you may recall, the tabular Q-learning update is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Here, $s'$ is the state at the next step. This update is performed online on each sample collected by the behavior policy.
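For reference, a minimal sketch of this tabular update in Python (the function name and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def tabular_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One online Q-learning update on a table Q of shape (n_states, n_actions)."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the single entry Q(s, a) toward the target
    return Q
```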

Compared to the previous chapters, to simplify the notation, here we refer to $s$ and $a$ as the state and action at the present step, while $s'$ and $a'$ refer to the state and action at the next step.

With the neural network, our objective is to optimize the weights, $\theta$, so that $Q_\theta(s,a)$ resembles the optimal Q-value function. But since we don't have the optimal Q-function, we can only take small steps toward it by minimizing the one-step Bellman error, $r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a)$. This step is similar to what we did in tabular Q-learning. However, in deep Q-learning, we don't update the single value, $Q(s,a)$. Instead, we take the gradient of the Q-function with respect to the parameters, $\theta$:

$$\theta \leftarrow \theta + \alpha \left[ r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a) \right] \nabla_\theta Q_\theta(s,a)$$

Here, $\nabla_\theta Q_\theta(s,a)$ is the partial derivative of $Q_\theta(s,a)$ with respect to $\theta$, and $\alpha$ is the learning rate, which is the size of the step taken in the direction of the gradient.
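The same per-sample semi-gradient step can be sketched in PyTorch as follows (an illustrative sketch only, assuming q_net is a network like the one above and optimizer is plain SGD with learning rate $\alpha$):

```python
import torch

def online_q_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One online semi-gradient step: theta <- theta + alpha * delta * grad Q_theta(s, a).
    s, s_next: 1-D float tensors; a: int; r, done: floats."""
    q_sa = q_net(s.unsqueeze(0))[0, a]                 # Q_theta(s, a)
    with torch.no_grad():                              # the target is not differentiated
        q_next = q_net(s_next.unsqueeze(0)).max()
    delta = r + gamma * (1.0 - done) * q_next - q_sa.detach()  # one-step Bellman error
    loss = -delta * q_sa                               # its gradient is -delta * grad Q_theta(s, a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # SGD with lr=alpha performs the update above
```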

In reality, the smooth transition that we just saw from tabular Q-learning to deep Q-learning doesn't yield a good approximation. The first fix involves using the Mean Squared Error (MSE) as the loss function (instead of the Bellman error). The second fix is to migrate from online Q-iteration to batch Q-iteration. This means that the parameters of the neural network are updated using multiple transitions at once (as when using a mini-batch of size greater than 1 in supervised settings). These changes produce the following loss function:

$$L(\theta) = \frac{1}{N} \sum_{i} \left( y_i - Q_\theta(s_i, a_i) \right)^2$$
Here, $y_i$ isn't the true action-value function, since we don't have access to it. Instead, it is the Q-target value:

$$y_i = r_i + \gamma \max_{a'_i} Q_\theta(s'_i, a'_i)$$
Then, the network parameters, $\theta$, are updated by gradient descent on the MSE loss function, $L(\theta)$:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$
It's very important to note that $y_i$ is treated as a constant and that the gradient of the loss function isn't propagated through it.
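Putting the last three formulas together, here is a minimal sketch of one batch update in PyTorch (again an illustrative sketch under the same assumptions, with terminal states handled through the dones flags); note how the target $y_i$ is computed inside torch.no_grad() so that it is treated as a constant:

```python
import torch
import torch.nn.functional as F

def dqn_batch_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient-descent step on the MSE loss over a mini-batch of transitions."""
    states, actions, rewards, next_states, dones = batch   # tensors of batch size N

    # Q_theta(s_i, a_i) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q-target y_i = r_i + gamma * max_a' Q_theta(s'_i, a'); kept constant (no gradient)
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_sa, y)   # L(theta) = 1/N * sum_i (y_i - Q_theta(s_i, a_i))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()             # theta <- theta - alpha * grad L(theta)
    return loss.item()
```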

Since we introduced MC algorithms in the previous chapter, we want to highlight that these algorithms can also be adapted to work with neural networks. In this case, the target $y_i$ will be the return, $G_t$. Since the MC update isn't biased, it is asymptotically better than TD, but in practice the latter usually achieves better results.
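As a rough sketch of how such MC targets can be computed (an illustrative helper, not a prescribed implementation), the returns $G_t$ are accumulated backward over a completed episode:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every step of a finished episode (MC targets)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns  # y_i = G_t, used in place of the bootstrapped Q-target
```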