Now that all the components of DQN have been explained, we can put the pieces together and show you the pseudocode version of the algorithm to clear up any remaining uncertainties (don't worry if some remain – in the next section, you'll implement it and everything will become clearer).
The DQN algorithm involves three main parts:
- Data collection and storage. The data is collected by following a behavior policy (for example, ε-greedy).
- Neural network optimization (performing SGD on mini-batches that have been sampled from the buffer).
- Target update.
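The ε-greedy behavior policy from the first part can be sketched in a few lines. This is a minimal illustration, not the book's implementation; `q_values` is a placeholder for the network's output for the current state:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(eps_greedy(np.array([0.1, 0.9, 0.3]), eps=0.0))  # eps=0 is purely greedy: prints 1
```

With `eps=1.0`, the same function degenerates into a uniform random policy, which is often used to prefill the replay buffer before learning starts.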
The pseudocode of DQN is as follows:
Initialize the Q function with random weights θ
Initialize the target Q' function with random weights θ' = θ
Initialize an empty replay memory D
for episode = 1..M do
    Initialize the environment
    for t = 1..T do
        > Collect an observation from the env:
        a_t ← ε-greedy(φ(s_t))
        s_{t+1}, r_t, d_t ← env(a_t)
        > Store the transition in the replay buffer:
        D ← D ∪ (φ(s_t), a_t, r_t, φ(s_{t+1}), d_t)
        > Update the model using (5.4):
        Sample a random minibatch (φ_j, a_j, r_j, φ_{j+1}, d_j) from D
        y_j = r_j if d_j = True, else y_j = r_j + γ max_{a'} Q'(φ_{j+1}, a'; θ')
        Perform a step of GD on (y_j − Q(φ_j, a_j; θ))² with respect to θ
        > Update the target network:
        Every C steps: θ' ← θ
    end for
end for
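The whole loop can be condensed into a runnable sketch. This is only an illustration of the algorithm's structure, under simplifying assumptions: a made-up `ToyEnv` with random transitions stands in for Atari, and a linear Q function over one-hot states stands in for the convolutional network (the hyperparameter values here are arbitrary):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, EPS, LR, C = 0.99, 0.1, 0.01, 50

class ToyEnv:
    """Tiny stand-in environment: random transitions, episode ends randomly."""
    def reset(self):
        self.s = int(rng.integers(N_STATES))
        return self.s
    def step(self, a):
        self.s = int(rng.integers(N_STATES))
        return self.s, float(rng.random()), bool(rng.random() < 0.1)

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

theta = rng.normal(size=(N_STATES, N_ACTIONS))   # Q weights
theta_target = theta.copy()                      # target-network weights
buffer = deque(maxlen=1000)                      # replay memory D

env, step_count = ToyEnv(), 0
for episode in range(20):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behavior policy
        q = one_hot(s) @ theta
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(np.argmax(q))
        s2, r, done = env.step(a)
        buffer.append((s, a, r, s2, done))       # store the transition
        s = s2
        step_count += 1
        if len(buffer) >= 32:
            # sample a random minibatch and do one gradient step on the TD error
            batch = [buffer[int(i)] for i in rng.choice(len(buffer), 32)]
            for (sj, aj, rj, sj2, dj) in batch:
                yj = rj if dj else rj + GAMMA * np.max(one_hot(sj2) @ theta_target)
                td = yj - (one_hot(sj) @ theta)[aj]
                theta[sj, aj] += LR * td         # gradient step for a linear Q
        if step_count % C == 0:
            theta_target = theta.copy()          # periodic target update

print(len(buffer))
```

Note how the two networks never diverge for more than C steps, and how the buffer caps itself at its maximum length.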
Here, d is a flag returned by the environment that signals whether the environment has reached its final state. If d=True, that is, the episode has ended, the environment has to be reset.
φ is a preprocessing step that reduces the dimensionality of the images (it converts them to grayscale and resizes them to smaller images) and stacks the last n frames with the current frame. Usually, n is a value between 2 and 4. The preprocessing part will be explained in more detail in the next section, where we'll implement DQN.
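A dependency-free sketch of this preprocessing, using plain NumPy: grayscaling by averaging the channels and crude downsampling by striding (the paper resizes to 84×84 with proper interpolation; striding is only a stand-in here), followed by stacking the last n frames:

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Grayscale an RGB frame and crudely halve its spatial resolution."""
    gray = frame.mean(axis=2)           # (H, W, 3) -> (H, W)
    return gray[::2, ::2]               # keep every other row and column

n = 4
stack = deque(maxlen=n)                 # holds the last n processed frames
for _ in range(n):
    frame = np.zeros((210, 160, 3), dtype=np.uint8)   # dummy Atari-sized frame
    stack.append(preprocess(frame))

state = np.stack(stack, axis=0)         # the stacked state fed to the network
print(state.shape)                      # prints (4, 105, 80)
```

Stacking frames gives the network a notion of motion (velocity and direction) that a single still frame cannot convey.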
In DQN, the experience replay, D, is a dynamic buffer that stores a limited number of transitions. In the paper, the buffer contains the last 1 million transitions; when it exceeds this size, it discards the oldest experiences.
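This discard-the-oldest behavior is exactly what a bounded deque gives you for free, as this minimal sketch shows (a tiny capacity stands in for the paper's 1 million):

```python
from collections import deque

capacity = 5                       # stand-in for the paper's 1 million
buffer = deque(maxlen=capacity)    # a full deque drops its oldest item on append

for t in range(8):                 # pretend each transition is just an integer
    buffer.append(t)

print(list(buffer))                # prints [3, 4, 5, 6, 7]: only the newest remain
```

In a real implementation, each element would be a `(φ(s), a, r, φ(s'), d)` tuple, and preallocated NumPy arrays are often used instead for memory efficiency.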
All the other parts have already been described. If you are wondering why the target value, y_j, takes the value r_j when d_j=True, it is because there won't be any further interactions with the environment after a terminal state, so r_j is the actual unbiased Q-value.
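The two-case target can be written as a one-line function, sketched here with NumPy (the names are illustrative):

```python
import numpy as np

gamma = 0.99

def td_target(r, q_next, done):
    """y = r on terminal transitions, else r + gamma * max_a' Q'(s', a')."""
    return r if done else r + gamma * np.max(q_next)

q_next = np.array([1.0, 2.0])               # target network's values for s'
print(td_target(0.5, q_next, done=True))    # prints 0.5: no future after terminal
print(td_target(0.5, q_next, done=False))   # prints 2.48: 0.5 + 0.99 * 2.0
```

Getting this terminal case wrong is a classic DQN bug: bootstrapping past the end of an episode injects phantom future reward into the targets.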