Reverting to the basics

Often, when you get stuck on a problem, it helps to go back to the beginning and confirm that your understanding of everything works as expected. To be fair, we have yet to explore the internals of ML-Agents and really understand DRL, so we never actually started at the beginning; but, for the purposes of this example, we will take a step back and look at the Hallway example in more detail. Jump back into the editor and follow this exercise:

  1. Open the Hallway sample scene in the editor. Remember, the scene is located in the Assets | ML-Agents | Examples | Hallway | Scenes folder.
  2. This example is configured to use several concurrent training environments. We are able to train multiple concurrent environments with the same brain because Proximal Policy Optimization (PPO), the RL algorithm powering this agent, learns a policy rather than a model (see the short sketch after this list for the distinction). We will cover the fundamentals of policy-based and model-based learning when we get to the internals of PPO in Chapter 8, Understanding PPO. For our purposes, and for simplicity, we will disable these additional environments for now.
  3. Press Shift and then select all the numbered HallwayArea (1-15) objects in the Hierarchy window.
  4. With all the extra HallwayArea objects selected, disable them all by clicking the Active checkbox, as shown in the following screenshot:

Disabling all the extra training hallways
  5. Open the remaining active HallwayArea in the Hierarchy window and select its Agent object.
  6. Set the agent's Brain to use the HallwayLearning brain. It may be set to the player brain by default.
  7. Select the Academy object back in the Hierarchy window, and make sure the Hallway Academy component has its brain set to HallwayLearning and that the Control checkbox is enabled.
  8. Open a Python or Anaconda window in the ML-Agents/ml-agents folder. Make sure your ML-Agents virtual environment is active, and then run the following command:
mlagents-learn config/trainer_config.yaml --run-id=hallway --train
  9. Let the trainer start up and prompt you to click Play in the editor. Watch the agent run and compare its performance to the VisualHallway example.
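
To make the policy-versus-model distinction from step 2 a little more concrete, here is a minimal, purely illustrative Python sketch. The function names, toy state, and toy actions are our own inventions and are not part of ML-Agents: a policy maps the current state directly to action probabilities, whereas a model predicts how the environment responds to an action. PPO learns the former.

import random

def policy(state):
    # A policy maps a state directly to a distribution over actions.
    # Here we simply return a fixed distribution over two toy actions.
    return {"turn_left": 0.5, "turn_right": 0.5}

def model(state, action):
    # A model, by contrast, predicts what the environment does next:
    # the next state and the reward for taking 'action' in 'state'.
    next_state = state + (1 if action == "turn_right" else -1)
    reward = -0.01  # small per-step penalty, as in many ML-Agents samples
    return next_state, reward

# A policy-based learner such as PPO only ever needs the policy;
# it does not have to learn or query an explicit environment model.
state = 0
probs = policy(state)
action = random.choices(list(probs), weights=probs.values())[0]
print(action)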

Generally, you will notice some amount of training activity from the agent before 50,000 iterations, but this may vary. By training activity, we mean the agent is reporting a Mean Reward greater than -1.0 and a standard deviation of reward (Std of Reward) that is not equal to zero. Even if you let the example run to completion, that is, the full 500,000 iterations, it is unlikely that the sample will train to a positive Mean Reward. We generally want our rewards to range from -1.0 to +1.0, with some amount of variation to show learning activity. If you recall from the VisualHallway example, the agent showed no learning activity for the duration of the training. We could have extended the training iterations, but it is unlikely we would have seen any stable training emerge. The reason for this has to do with the increased state space and the handling of rewards. We will expand our understanding of state and how it pertains to RL in the next section.
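
If it helps to see what those two statistics represent, the following minimal Python sketch computes them from a handful of made-up episode rewards. The values are invented purely for illustration; the trainer computes its own figures from the episodes it actually collects:

import statistics

# Invented episode rewards, for illustration only. A Hallway agent that
# never reaches the correct goal tends to finish episodes near -1.0.
episode_rewards = [-1.0, -1.0, -0.9, -1.0, -0.8]

mean_reward = statistics.mean(episode_rewards)   # mean of recent episode rewards (Mean Reward)
std_reward = statistics.pstdev(episode_rewards)  # their spread (Std of Reward)

print(f"Mean Reward: {mean_reward:.3f}, Std of Reward: {std_reward:.3f}")

# A Mean Reward stuck at -1.0 with a Std of Reward of 0.0 means every
# episode ends the same way -- no sign of learning activity yet.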
