Monitoring training with TensorBoard

Training an agent with RL, or any DL model for that matter, is enjoyable but rarely simple, and it requires some attention to detail. Fortunately, TensorFlow ships with a set of graphing tools called TensorBoard that we can use to monitor training progress. Follow these steps to run TensorBoard:

  1. Open an Anaconda or Python window. Activate the ml-agents virtual environment. Don't shut down the window running the trainer; we need to keep that going.
  2. Navigate to the ML-Agents/ml-agents folder and run the following command:
tensorboard --logdir=summaries
  3. This runs TensorBoard with its own built-in web server. You can load the page by using the URL that is shown after you run the previous command.
  4. Enter the URL for TensorBoard as shown in the window, or use localhost:6006 or machinename:6006 in your browser. After an hour or so, you should see something similar to the following:
The TensorBoard graph window
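A quick aside before we dig into the graphs: if port 6006 is already taken on your machine, TensorBoard's standard --port option (a TensorBoard flag, not something specific to ML-Agents) lets you serve the dashboard on another port, for example:

tensorboard --logdir=summaries --port=6007

You would then browse to localhost:6007 instead.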
  5. In the preceding screenshot, you can see the various graphs, each denoting an aspect of training. Understanding these graphs is important to understanding how your agent is training, so we will break down the output from each section:
    • Environment: This section shows how the agent is performing overall in the environment. A closer look at each of the graphs is shown in the following screenshot, along with their preferred trends:

Closer look at the Environment section plots
  • Cumulative Reward: This is the total reward the agent is maximizing. You generally want to see this trend upward, but there are reasons why it may fall. It is best to keep rewards in the range of -1 to 1; if you see rewards outside this range on the graph, you will want to correct that as well (a simple way to scale and clamp a reward is sketched just after this list).
  • Episode Length: It is usually a better sign if this value decreases. After all, shorter episodes mean more episodes, and therefore more training, in the same amount of time. However, keep in mind that the episode length may legitimately need to increase, so this one can go either way.
  • Lesson: This represents which lesson the agent is on and is intended for Curriculum Learning. We will learn more about Curriculum Learning in Chapter 9, Rewards and Reinforcement Learning.
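The following is a minimal Python sketch of the scale-and-clamp idea mentioned under Cumulative Reward. It is purely illustrative: the max_expected divisor is a made-up value, and in a real ML-Agents project you would apply the same arithmetic in your agent's reward code rather than in a standalone script.

# Illustrative only: squash a raw game score into the [-1, 1] band
# before handing it to the agent as a reward.
def shape_reward(raw_score, max_expected=100.0):
    scaled = raw_score / max_expected          # scale by a rough expected maximum
    return max(-1.0, min(1.0, scaled))         # clamp to the [-1, 1] range

print(shape_reward(250.0))    # prints 1.0 (clamped)
print(shape_reward(-37.5))    # prints -0.375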
  • Losses: This section shows graphs that represent the calculated loss or cost of the policy and the value function. Of course, we haven't spent much time explaining PPO and how it uses a policy, so, at this point, just note the preferred direction of each graph during training. A screenshot of this section is shown next, again with arrows showing the preferred trends:
Losses and preferred training direction
  • Policy Loss: This measures how much the policy is changing over time. The policy is the piece that decides on actions, and in general this graph should show a downward trend, indicating that the policy is getting better at making decisions.
  • Value Loss: This is the mean loss of the value function. It essentially measures how well the agent is predicting the value of its next state. Initially, this value should increase, and then, after the reward stabilizes, it should decrease (a toy calculation of this loss follows).
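To make the idea of a value loss a little more concrete, here is a toy Python calculation: the mean squared error between a critic's value estimates and the discounted returns actually observed. The numbers are invented, and this is not the ML-Agents PPO code, just the general shape of the computation.

import numpy as np

# Invented rewards from one short episode and the values the critic predicted.
gamma = 0.99
rewards = np.array([0.1, 0.0, 0.0, 1.0])
value_estimates = np.array([0.8, 0.7, 0.9, 0.95])

# Discounted return G_t for each step, computed backward through the episode.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# The "value loss" here is simply the mean squared prediction error.
value_loss = np.mean((returns - value_estimates) ** 2)
print(value_loss)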
  • Policy: PPO uses the concept of a policy rather than a model to determine the quality of actions. Again, we will spend more time on this in Chapter 8, Understanding PPO, where we will uncover further details about PPO. The next screenshot shows the policy graphs and their preferred trends:

Policy graphs and preferred trends
  • Entropy: This represents how much the agent is exploring. You want this value to decrease as the agent learns more about its surroundings and needs to explore less.
  • Learning Rate: Currently, this value is set to decrease linearly over time (a small sketch of such a linear schedule follows this list).
  • Value Estimate: This is the mean value estimate across all states visited by the agent. It should increase as the agent's knowledge grows, and then stabilize.
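Below is a minimal Python sketch of a linearly decaying learning rate, which is the behavior the Learning Rate graph reflects. The starting value and step count are placeholders chosen for illustration, not values read from trainer_config.yaml.

# Placeholder values; real runs take these from the trainer configuration.
initial_lr = 3.0e-4
max_steps = 50000

def linear_lr(step, initial=initial_lr, total=max_steps):
    # Decay linearly from the initial rate down to zero at the final step.
    return initial * max(0.0, 1.0 - step / total)

for step in (0, 25000, 50000):
    print(step, linear_lr(step))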
These graphs are all designed to work with the PPO implementation that Unity's trainers are built on. Don't worry too much about understanding these new terms just yet. We will explore the foundations of PPO in Chapter 7, Agent and the Environment.
  6. Let the agent run to completion and keep TensorBoard running.
  7. Go back to the Anaconda/Python window that was training the brain and run this command:
mlagents-learn config/trainer_config.yaml --run-id=secondRun --train
  8. You will again be prompted to press Play in the editor; be sure to do so. Let the agent start the training and run for a few sessions. As you do so, monitor the TensorBoard window and note how the secondRun results are shown on the graphs. Feel free to let this agent run to completion as well, but you can stop it now if you want to.
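If you stop a run and want to pick up where it left off rather than start from scratch, the version of ML-Agents used in this chapter supports a --load flag that reloads the previously saved model for the same run-id (later releases renamed this option; check mlagents-learn --help if the flag is not recognized):

mlagents-learn config/trainer_config.yaml --run-id=secondRun --train --load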

In previous versions of ML-Agents, you needed to build a Unity executable first as a game-training environment and run that. The external Python brain would still run the same. This method made it very difficult to debug any code issues or problems with your game. All of these difficulties were corrected with the current method; however, we may need to use the old executable method later for some custom training.

Now that we have seen how easy it is to set up and train an agent, in the next section we will see how that agent can be run directly in Unity, without an external Python brain.
