Analyzing the results

Throughout training, we monitored many quantities, including p_loss (the policy loss), old_p_loss (the policy loss computed before the optimization phase), the total rewards, and the episode lengths, in order to better understand the algorithm and to properly tune the hyperparameters. We also summarized some of these quantities as histograms. Look at the code in the book's repository to learn more about the TensorBoard summaries!
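As a rough idea of what such monitoring looks like, here is a minimal sketch of logging these quantities as TensorBoard scalar summaries, assuming TensorFlow 1.x-style code; the placeholder names (p_loss_ph, and so on), the log directory, and the dummy values are ours for illustration, and the actual summary code is in the book's repository:

```python
import tensorflow as tf

# Hypothetical placeholders for the monitored quantities
p_loss_ph = tf.placeholder(tf.float32, shape=(), name='p_loss')
old_p_loss_ph = tf.placeholder(tf.float32, shape=(), name='old_p_loss')
tot_rew_ph = tf.placeholder(tf.float32, shape=(), name='tot_rew')
ep_len_ph = tf.placeholder(tf.float32, shape=(), name='ep_len')

# One merged op containing a scalar summary per quantity
summaries = tf.summary.merge([
    tf.summary.scalar('p_loss', p_loss_ph),
    tf.summary.scalar('old_p_loss', old_p_loss_ph),
    tf.summary.scalar('tot_rew', tot_rew_ph),
    tf.summary.scalar('ep_len', ep_len_ph),
])

writer = tf.summary.FileWriter('log_dir')

with tf.Session() as sess:
    # In the real training loop, these values come from the optimization
    # phase and the completed trajectories; here they are dummy numbers.
    summary = sess.run(summaries, feed_dict={
        p_loss_ph: 0.02, old_p_loss_ph: 0.03,
        tot_rew_ph: 150.0, ep_len_ph: 150.0})
    writer.add_summary(summary, global_step=0)
    writer.flush()
```

Running TensorBoard on the log directory then shows each scalar as a curve over the training steps, which is how the plots discussed next were produced.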

In the following figure, we have plotted the mean of the total rewards of the full trajectories that were obtained during training:

From this plot, we can see that the agent reaches a mean score of 200, or slightly below, in about 500,000 steps; that is, it needs roughly 1,000 full trajectories before it is able to master the game.

When plotting the training performance, remember that the algorithm is likely still exploring. To check whether this is true, monitor the entropy of the action distribution. If it is well above 0, the policy is still uncertain about which action to select and will keep exploring, sampling the other actions according to their probabilities. In this case, after 500,000 steps, the agent is still exploring the environment, as shown in the following plot:
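To make this concrete, here is a minimal NumPy sketch of how the entropy of a discrete (categorical) policy can be computed from its logits; the function name and the example values are ours for illustration, and in practice this value would be logged as another scalar summary like the ones above:

```python
import numpy as np

def categorical_entropy(logits):
    """Entropy of the categorical action distribution defined by `logits`.

    Entropy close to log(n_actions) means the policy still spreads
    probability over several actions (it is still exploring); entropy
    close to 0 means it is almost deterministic.
    """
    # Numerically stable softmax over the last axis
    logits = logits - np.max(logits, axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= np.sum(probs, axis=-1, keepdims=True)
    return -np.sum(probs * np.log(probs + 1e-8), axis=-1)

# A policy that strongly prefers one of four actions -> low entropy
print(categorical_entropy(np.array([4.0, 0.1, 0.1, 0.1])))
# A uniform policy over four actions -> maximum entropy, log(4) ~= 1.386
print(categorical_entropy(np.array([0.0, 0.0, 0.0, 0.0])))
```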
