Now that all the components of DQN have been explained, we can put the pieces together and show you the pseudocode version of the algorithm to clear up any remaining uncertainties (don't worry if some remain – in the next section, you'll implement it and everything will become clearer).
The DQN algorithm involves three main parts:
- Data collection and storage. The data is collected by following a behavior policy (for example, ε-greedy).
- Neural network optimization (performing SGD on mini-batches that have been sampled from the buffer).
- Target update.
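The ε-greedy behavior policy from the first part can be sketched in a few lines. This is a minimal illustration, not the book's implementation; `q_values` is a placeholder for the network's output for the current state:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(eps_greedy(np.array([0.1, 0.9, 0.3]), eps=0.0))  # eps=0 is purely greedy: prints 1
```

With `eps=1.0`, the same function degenerates into a uniform random policy, which is often used to prefill the replay buffer before learning starts.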
The pseudocode of DQN is as follows:
Initialize the Q function with random weights θ
Initialize the target Q' function with random weights θ' = θ
Initialize an empty replay memory D
for episode = 1..M do
    Initialize the environment
    for t = 1..T do
        > Collect an observation from the env:
        a_t ← ε-greedy(φ(s_t))
        s_{t+1}, r_t, d_t ← env(a_t)
        > Store the transition in the replay buffer:
        D ← D ∪ (φ(s_t), a_t, r_t, φ(s_{t+1}), d_t)
        > Update the model using (5.4):
        Sample a random minibatch (φ_j, a_j, r_j, φ_{j+1}, d_j) from D
        y_j = r_j if d_j = True, else y_j = r_j + γ max_{a'} Q'(φ_{j+1}, a'; θ')
        Perform a step of GD on (y_j − Q(φ_j, a_j; θ))² with respect to θ
        > Update the target network:
        Every C steps: θ' ← θ
    end for
end for
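The whole loop can be condensed into a runnable sketch. This is only an illustration of the algorithm's structure, under simplifying assumptions: a made-up `ToyEnv` with random transitions stands in for Atari, and a linear Q function over one-hot states stands in for the convolutional network (the hyperparameter values here are arbitrary):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, EPS, LR, C = 0.99, 0.1, 0.01, 50

class ToyEnv:
    """Tiny stand-in environment: random transitions, episode ends randomly."""
    def reset(self):
        self.s = int(rng.integers(N_STATES))
        return self.s
    def step(self, a):
        self.s = int(rng.integers(N_STATES))
        return self.s, float(rng.random()), bool(rng.random() < 0.1)

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

theta = rng.normal(size=(N_STATES, N_ACTIONS))   # Q weights
theta_target = theta.copy()                      # target-network weights
buffer = deque(maxlen=1000)                      # replay memory D

env, step_count = ToyEnv(), 0
for episode in range(20):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behavior policy
        q = one_hot(s) @ theta
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(np.argmax(q))
        s2, r, done = env.step(a)
        buffer.append((s, a, r, s2, done))       # store the transition
        s = s2
        step_count += 1
        if len(buffer) >= 32:
            # sample a random minibatch and do one gradient step on the TD error
            batch = [buffer[int(i)] for i in rng.choice(len(buffer), 32)]
            for (sj, aj, rj, sj2, dj) in batch:
                yj = rj if dj else rj + GAMMA * np.max(one_hot(sj2) @ theta_target)
                td = yj - (one_hot(sj) @ theta)[aj]
                theta[sj, aj] += LR * td         # gradient step for a linear Q
        if step_count % C == 0:
            theta_target = theta.copy()          # periodic target update

print(len(buffer))
```

Note how the two networks never diverge for more than C steps, and how the buffer caps itself at its maximum length.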
Here, d is a flag returned by the environment that signals whether the environment has reached its final state. If d=True, that is, the episode has ended, the environment has to be reset.
φ is a preprocessing step that reduces the dimensionality of the images (it converts them to grayscale and resizes them to smaller images) and stacks the last n frames with the current frame. Usually, n is a value between 2 and 4. The preprocessing part will be explained in more detail in the next section, where we'll implement DQN.
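A dependency-free sketch of this preprocessing, using plain NumPy: grayscaling by averaging the channels and crude downsampling by striding (the paper resizes to 84×84 with proper interpolation; striding is only a stand-in here), followed by stacking the last n frames:

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Grayscale an RGB frame and crudely halve its spatial resolution."""
    gray = frame.mean(axis=2)           # (H, W, 3) -> (H, W)
    return gray[::2, ::2]               # keep every other row and column

n = 4
stack = deque(maxlen=n)                 # holds the last n processed frames
for _ in range(n):
    frame = np.zeros((210, 160, 3), dtype=np.uint8)   # dummy Atari-sized frame
    stack.append(preprocess(frame))

state = np.stack(stack, axis=0)         # the stacked state fed to the network
print(state.shape)                      # prints (4, 105, 80)
```

Stacking frames gives the network a notion of motion (velocity and direction) that a single still frame cannot convey.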
In DQN, the experience replay, D, is a dynamic buffer that stores a limited number of transitions. In the paper, the buffer contains the last 1 million transitions; when it exceeds this size, it discards the oldest experiences.
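This discard-the-oldest behavior is exactly what a bounded deque gives you for free, as this minimal sketch shows (a tiny capacity stands in for the paper's 1 million):

```python
from collections import deque

capacity = 5                       # stand-in for the paper's 1 million
buffer = deque(maxlen=capacity)    # a full deque drops its oldest item on append

for t in range(8):                 # pretend each transition is just an integer
    buffer.append(t)

print(list(buffer))                # prints [3, 4, 5, 6, 7]: only the newest remain
```

In a real implementation, each element would be a `(φ(s), a, r, φ(s'), d)` tuple, and preallocated NumPy arrays are often used instead for memory efficiency.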
All the other parts have already been described. If you are wondering why the target value, y_j, takes the value r_j when d_j=True, it is because there won't be any further interactions with the environment after a terminal state, so r_j is the actual unbiased Q-value.
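The two-case target can be written as a one-line function, sketched here with NumPy (the names are illustrative):

```python
import numpy as np

gamma = 0.99

def td_target(r, q_next, done):
    """y = r on terminal transitions, else r + gamma * max_a' Q'(s', a')."""
    return r if done else r + gamma * np.max(q_next)

q_next = np.array([1.0, 2.0])               # target network's values for s'
print(td_target(0.5, q_next, done=True))    # prints 0.5: no future after terminal
print(td_target(0.5, q_next, done=False))   # prints 2.48: 0.5 + 0.99 * 2.0
```

Getting this terminal case wrong is a classic DQN bug: bootstrapping past the end of an episode injects phantom future reward into the targets.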