Multiple agent policy

In this section, we are going to look at how policy-based, or off-model, methods such as PPO can be improved by introducing multiple agents that train against the same policy. The example you use in this section is completely up to you, and should be one that you are familiar with and/or interested in. For our purposes, we will explore a sample that we have looked at extensively: the Hallway/VisualHallway. If you have been following most of the exercises in this book, you should be more than capable of adapting this example. Note, however, that for this exercise we want to use a sample that is set up to use multiple agents for training.

Previously, we avoided discussing multiple agents because it would have complicated the discussion of on-model versus off-model methods. Now that you understand the differences and the reasons for using a policy-based method, you can better appreciate that, because our agents use a policy-based method, we can train multiple agents against the same policy simultaneously. However, as you may well imagine, this can have repercussions for other training parameters and configuration.

Open up the Unity editor to the Hallway/VisualHallway example scene, or one of your choosing, and complete the following exercise:

  1. Open up a Python or Anaconda console window and get it ready to train.
  2. Select and enable the additional HallwayArea objects, HallwayArea (1) through HallwayArea (19), so that they become active and visible in the scene. 
  3. Select the Agent object in each HallwayArea, and make sure that the Hallway Agent's Brain is set to HallwayLearning and not HallwayPlayer. This turns on all of the additional training areas.
  4. Depending on your previous progress, you may or may not want to revert the sample back to the original. Recall that, in an earlier exercise, we modified the HallwayAgent script to scan only a smaller range of angles. If you keep that change, you may also need to alter the brain parameters to match.
  5. After you have the scene set up, save it and the project.
  6. Run the scene in training using a unique run-id (a baseline command sketch follows this list) and wait for a number of training iterations. This sample may train substantially slower, or even faster, than previous samples, depending on your hardware. 
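
If you are unsure what this command should look like, the following is a minimal sketch of a baseline run; the run-id value hallway_baseline is only a placeholder chosen for this example, so feel free to substitute your own:

mlagents-learn config/trainer_config.yaml --run-id=hallway_baseline --train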

Now that we have established a new baseline for the Hallway environment, we can determine what effect modifying some hyperparameters has on a discrete action sample. The two parameters we will revisit are num_epoch (the number of training epochs per update) and batch_size (the number of experiences used for each gradient descent update), which we looked at earlier with the continuous action (control) sample. In the documentation, we noted that a larger batch size was preferred when training control agents. 

Before we continue, let's open the trainer_config.yaml file and inspect the HallwayLearning configuration section as follows:

HallwayLearning:
    use_recurrent: true
    sequence_length: 64
    num_layers: 2
    hidden_units: 128
    memory_size: 256
    beta: 1.0e-2
    gamma: 0.99
    num_epoch: 3
    buffer_size: 1024
    batch_size: 128
    max_steps: 5.0e5
    summary_freq: 1000
    time_horizon: 64
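
As a rough way to read these numbers (assuming the trainer splits the collected buffer evenly into mini-batches of batch_size, which is how PPO-style updates are typically performed), the settings above imply the following amount of work per collected buffer:

buffer_size / batch_size = 1024 / 128 = 8 mini-batches per epoch
8 mini-batches x 3 epochs (num_epoch) = roughly 24 gradient updates per full buffer of experiences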

The Unity documentation specifically mentions increasing the number of epochs only when increasing the batch size, in order to account for the additional training experiences. We learned that control examples generally benefit from a larger batch size and, consequently, a larger number of epochs. One last thing we want to determine, however, is the effect of altering the batch_size and num_epoch parameters in a discrete action example with multiple agents feeding into, and learning from, the same policy.

For the purposes of this exercise, we are only going to modify the batch_size and num_epoch values, as follows:

  1. Update the HallwayLearning configuration (or the configuration of the brain you are using) to the following parameters:
HallwayLearning:
    use_recurrent: true
    sequence_length: 64
    num_layers: 2
    hidden_units: 128
    memory_size: 256
    beta: 1.0e-2
    gamma: 0.99
    num_epoch: 10
    buffer_size: 1024
    batch_size: 1000
    max_steps: 5.0e5
    summary_freq: 1000
    time_horizon: 64
  2. We set num_epoch to 10 and batch_size to 1000. These settings are typical for a control sample, as we have previously seen, but now we want to see their effect in a discrete action example with multiple agents training the same policy.
  3. Prepare the sample for training, and get the Python console ready and open.
  4. Run the training session with the following command:
mlagents-learn config/trainer_config.yaml --run-id=hallway_e10b1000 --train
  5. Notice how we have set run-id using a descriptive suffix to name the iteration: e10 indicates that the num_epoch parameter is set to 10, and b1000 indicates a batch_size of 1000. This type of naming scheme makes runs easy to compare (see the sketch after this list), and is one we will continue using throughout this book.
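
Because every run gets its own run-id, you can compare the baseline and this new configuration side by side in TensorBoard. The following is a minimal sketch, assuming the default summaries output folder that this version of ML-Agents writes to:

tensorboard --logdir=summaries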

As the agent trains, try and answer the following questions:

  • Does the agent train better or worse than you expected? 
  • Why do you think that is?

It will be up to you to run the sample in order to learn the answer to those questions. In the next section, we will look at helpful exercises you can do on your own to help your understanding of these complex topics.
