In Q-learning, a deep neural network learns a set of weights to approximate the Q-value function. The Q-value function is therefore parametrized by $\theta$ (the weights of the network) and written as follows:

$$Q_\theta(s, a)$$
To adapt Q-learning to deep neural networks (this combination is known as deep Q-learning), we have to come up with a loss function (or objective) to minimize.
As you may recall, the tabular Q-learning update is as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s'$ is the state at the next step. This update is done online on each sample collected by the behavior policy.
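The tabular update above can be sketched in a few lines of Python. The Q-table shape, the transition, and the hyperparameter values here are purely illustrative:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One online tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, all Q-values initialized to zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

Note that only the single entry $Q(s, a)$ changes; every other entry of the table is untouched, which is exactly the property that deep Q-learning gives up.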
With the neural network, our objective is to optimize the weights, $\theta$, so that $Q_\theta(s, a)$ resembles the optimal Q-value function. But since we don't have the optimal Q-function, we can only make small steps toward it by minimizing the one-step Bellman error, $r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)$. This step is similar to what we did in tabular Q-learning. However, in deep Q-learning, we don't update the single value, $Q(s, a)$. Instead, we take the gradient of the Q-function with respect to the parameters, $\theta$:

$$\theta \leftarrow \theta + \alpha \left[ r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right] \nabla_\theta Q_\theta(s, a)$$
Here, $\nabla_\theta Q_\theta(s, a)$ is the partial derivative of $Q_\theta(s, a)$ with respect to $\theta$, and $\alpha$ is the learning rate, that is, the size of the step taken in the direction of the gradient.
In reality, the smooth transition that we just saw from tabular Q-learning to deep Q-learning doesn't yield a good approximation. The first fix is to use the mean squared error (MSE) as the loss function (instead of the Bellman error). The second fix is to migrate from online Q-iteration to batch Q-iteration, meaning that the parameters of the neural network are updated using multiple transitions at once (akin to using a mini-batch of size greater than 1 in supervised settings). These changes produce the following loss function:

$$L(\theta) = \frac{1}{N} \sum_i \left( Q_\theta(s_i, a_i) - y_i \right)^2$$
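As a quick sketch of this batch loss, with hypothetical network predictions and target values standing in for the real quantities:

```python
import numpy as np

# Hypothetical mini-batch of N = 3 transitions: q_pred holds the network's
# Q(s_i, a_i) outputs and y holds the Q-target values (assumed given here).
q_pred = np.array([0.5, 1.2, -0.3])
y      = np.array([1.0, 1.0,  0.0])

# MSE loss averaged over the batch: L = (1/N) * sum_i (Q(s_i,a_i) - y_i)^2
mse_loss = np.mean((q_pred - y) ** 2)
print(mse_loss)  # ≈ 0.1267
```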
Here, $y_i$ isn't the true action-value function, since we don't have access to it. Instead, it is the Q-target value:

$$y_i = r_i + \gamma \max_{a'_i} Q_\theta(s'_i, a'_i)$$
Then, the network parameters, $\theta$, are updated by gradient descent on the MSE loss function, $L(\theta)$:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$
It's very important to note that $y_i$ is treated as a constant, so the gradient of the loss function isn't propagated through it.
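The whole update, with the Q-target held constant, can be sketched by swapping in a linear function approximator as a stand-in for the neural network; the feature vectors and the transition values below are hypothetical, and the gradient is written out by hand instead of using autodiff:

```python
import numpy as np

# Linear stand-in for the network: Q_theta(s,a) = theta @ phi(s,a),
# where phi is a hypothetical feature map for state-action pairs.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
gamma, alpha = 0.99, 0.01

# One hypothetical transition: features of (s,a), the reward, and the
# features of every action in s' (needed for the max in the Q-target).
phi_sa = rng.normal(size=4)
r = 1.0
phi_next = rng.normal(size=(2, 4))  # one row per next action a'

# Q-target: treated as a constant, so theta inside it is NOT differentiated.
y = r + gamma * np.max(phi_next @ theta)

# Single-sample MSE loss L = (Q_theta(s,a) - y)^2 and its gradient;
# y contributes nothing to the gradient because it is held fixed.
q_pred = theta @ phi_sa
grad = 2.0 * (q_pred - y) * phi_sa
theta = theta - alpha * grad  # gradient descent step

loss_after = (theta @ phi_sa - y) ** 2
print(loss_after <= (q_pred - y) ** 2)  # the step reduced this sample's loss
```

In a real deep Q-learning implementation, an autodiff framework computes $\nabla_\theta L(\theta)$, and the target is held constant by explicitly stopping the gradient through $y_i$.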