Results

Evaluating the progress of an RL algorithm is very challenging. The most obvious way to do this is to keep track of its end goal; that is, to monitor the total reward accumulated during training. This is a good metric. However, during training, the average reward can be very noisy, because small changes in the weights can lead to large changes in the distribution of states that the policy visits.

For these reasons, we evaluated the algorithm on 10 test games every 20 training epochs and kept track of the average total (non-discounted) reward accumulated throughout those games. Moreover, because of the determinism of the environment, we tested the agent with an ε-greedy policy (using a small ε) so as to obtain a more robust evaluation. The resulting scalar summary is called test_rew. You can view it in TensorBoard by accessing the directory where the logs have been saved and executing the following command:

tensorboard --logdir .
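
To make the evaluation procedure more concrete, the following is a minimal sketch of such a test loop. It assumes a classic Gym-style environment and a greedy_action(obs) function that returns the online network's greedy action; the function names, the ε value of 0.05, and the use of the TensorFlow 2 summary API are assumptions made for illustration, not the book's exact code.

import numpy as np
import tensorflow as tf

def test_agent(env, greedy_action, n_games=10, eps=0.05):
    '''Play n_games with an eps-greedy policy and return the mean
    total (non-discounted) reward across the games.'''
    total_rewards = []
    for _ in range(n_games):
        obs, done, game_rew = env.reset(), False, 0.0
        while not done:
            # With a small probability, pick a random action so that the
            # evaluation doesn't rely on the environment's determinism.
            if np.random.rand() < eps:
                action = env.action_space.sample()
            else:
                action = greedy_action(obs)
            obs, rew, done, _ = env.step(action)
            game_rew += rew
        total_rewards.append(game_rew)
    return np.mean(total_rewards)

# Hypothetical use inside the training loop: log the result as the
# 'test_rew' scalar that appears in TensorBoard (TF2 summary API).
writer = tf.summary.create_file_writer('log_dir')
# every 20 training epochs:
#     mean_rew = test_agent(env, greedy_action)
#     with writer.as_default():
#         tf.summary.scalar('test_rew', mean_rew, step=step_count)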

The plot, which should be similar to yours if you run the DQN code, is shown in the following diagram. The x axis represents the number of steps. You can see that the score increases roughly linearly for the first 250,000 steps, grows more rapidly over the next 300,000 steps, and then settles at a steady value:

Figure 5.4. A plot of the mean total reward across 10 games. The x axis represents the number of steps

Pong is a relatively simple task to solve. In fact, our algorithm was trained for only around 1.1 million steps, whereas in the DQN paper, the algorithms were trained for 200 million steps.

An alternative way to evaluate the algorithm involves the estimated action-values. These are a valuable metric because they measure the agent's belief about the quality of each state-action pair. Unfortunately, this option is not ideal, since some algorithms tend to overestimate the Q-values, as we will soon learn. Despite this, we tracked them during training. The plot is shown in the following diagram and, as expected, the Q-values increase throughout training, in a similar way to the plot in the preceding diagram:

Figure 5.5. A plot of the estimated training Q-values. The x axis represents the number of steps
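
One simple way to track this metric is to log the mean of the maximum predicted Q-value over each training mini-batch. The snippet below is only a sketch of this idea; online_q_net, mb_obs, writer, and step_count are hypothetical names carried over from the previous sketch.

# Sketch: track the average estimated Q-value on the current mini-batch.
# online_q_net maps a batch of observations to a (batch, n_actions)
# tensor of Q-value estimates (assumed name, not the book's variable).
q_batch = online_q_net(mb_obs)
mean_max_q = tf.reduce_mean(tf.reduce_max(q_batch, axis=1))
with writer.as_default():
    tf.summary.scalar('q_values', mean_max_q, step=step_count)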

Another important plot, shown in the following diagram, shows the loss function over time. It's not as informative as in supervised learning, because the target values aren't the ground truth, but it can still provide good insight into the quality of the model:

Figure 5.6. A plot of the loss function
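
For reference, the quantity being plotted is the error between the online network's estimate of Q(s, a) and the bootstrapped target r + γ max Q_target(s', a'). The following is a sketch of how that loss can be computed; the tensor and network names, the discount factor, and the squared-error form (rather than the Huber loss) are assumptions, not the book's exact implementation.

# Sketch of the DQN loss whose value is plotted above.
# mb_obs, mb_act, mb_rew, mb_obs2, mb_done are mini-batch tensors;
# online_q_net and target_q_net are the online and target networks;
# n_actions is the size of the action space (all names assumed).
gamma = 0.99  # discount factor (typical value, assumed)

# Bootstrapped target: r + gamma * max_a' Q_target(s', a'), with the
# bootstrap term zeroed out for terminal transitions.
max_q_next = tf.reduce_max(target_q_net(mb_obs2), axis=1)
y = mb_rew + gamma * max_q_next * (1.0 - mb_done)

# Q-value of the action actually taken in each transition.
q_sa = tf.reduce_sum(
    online_q_net(mb_obs) * tf.one_hot(mb_act, n_actions), axis=1)

# The targets are treated as constants, hence tf.stop_gradient.
loss = tf.reduce_mean(tf.square(tf.stop_gradient(y) - q_sa))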