Training self-play environments

Training these self-play environments opens up possibilities not only for enhanced training but also for entertaining gaming environments. In some ways, these training environments can be just as much fun to watch, as we will see at the end of this chapter.

For now, though, we are going to jump back and continue setting up the configuration we need to train our SoccerTwos multi-agent environment in the next exercise:

  1. Open the ML-Agents/ml-agents/config/trainer_config.yaml file and inspect the StrikerLearning and GoalieLearning config sections, as shown:
StrikerLearning:
    max_steps: 5.0e5
    learning_rate: 1e-3
    batch_size: 128
    num_epoch: 3
    buffer_size: 2000
    beta: 1.0e-2
    hidden_units: 256
    summary_freq: 2000
    time_horizon: 128
    num_layers: 2
    normalize: false

GoalieLearning:
    max_steps: 5.0e5
    learning_rate: 1e-3
    batch_size: 320
    num_epoch: 3
    buffer_size: 2000
    beta: 1.0e-2
    hidden_units: 256
    summary_freq: 2000
    time_horizon: 128
    num_layers: 2
    normalize: false
  2. The obvious assumption is that both brains should share the same configuration, and that is a reasonable starting point. Note, however, that even in this example the batch_size parameter is set differently for each brain, as the quick check after this step confirms.
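If you would rather confirm programmatically which hyperparameters differ between the two brain sections than compare them by eye, a minimal sketch like the following works. It assumes you run it from the ML-Agents/ml-agents folder and that PyYAML is available in your virtual environment (ML-Agents itself uses it to read this file):

import yaml  # PyYAML, normally installed alongside ML-Agents

# Load the same trainer configuration file that mlagents-learn reads
with open("config/trainer_config.yaml") as f:
    config = yaml.safe_load(f)

striker = config["StrikerLearning"]
goalie = config["GoalieLearning"]

# Print every hyperparameter that is set differently for the two brains
for key in sorted(set(striker) | set(goalie)):
    if striker.get(key) != goalie.get(key):
        print(key, "-> StrikerLearning:", striker.get(key),
              "GoalieLearning:", goalie.get(key))

With the configuration shown above, the only difference this reports is batch_size (128 versus 320); every other hyperparameter is shared.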
  3. Open a Python/Anaconda window, switch to your ML-Agents virtual environment, and then launch the following command from the ML-Agents/ml-agents folder:
mlagents-learn config/trainer_config.yaml --run-id=soccer --train
  4. Press Play when prompted, and you should see the following training session running:

The SoccerTwos scene running in training mode
  5. As mentioned earlier, this is a very entertaining sample to watch, and it trains surprisingly quickly.
  6. After some training has elapsed, check the Python/Anaconda console and note that you are now getting statistics for two brains, StrikerLearning and GoalieLearning, as shown in the following screenshot:

Console output showing stats from two brains
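The same statistics are also written as TensorBoard summaries while training runs; in this version of ML-Agents they land under the summaries folder next to where you launched mlagents-learn. If you prefer to pull them into Python rather than read the console, here is a minimal sketch using TensorBoard's EventAccumulator. The summaries path pattern and the exact scalar tag names vary between ML-Agents versions, so treat both as assumptions and adjust to whatever the script prints:

import glob
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Assumption: mlagents-learn created per-brain folders under summaries/
# whose names start with the run-id we used above (soccer)
for run_dir in glob.glob("summaries/soccer*"):
    acc = EventAccumulator(run_dir)
    acc.Reload()
    print(run_dir)
    for tag in acc.Tags()["scalars"]:
        events = acc.Scalars(tag)
        if events:
            # Show the most recent value recorded for each statistic
            print(f"  {tag}: step={events[-1].step} value={events[-1].value:.3f}")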
  7. Note how StrikerLearning and GoalieLearning report rewards that are the opposite of each other. Because the game is adversarial, a gain for one brain is a loss for the other, so as the two brains improve against each other their mean rewards balance out. As the agents train, you will notice both mean rewards converging toward 0, the optimum for this example.
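To see why the optimum settles at 0, here is a toy illustration of the zero-sum bookkeeping. The episode outcomes are made-up numbers purely for illustration; the point is only that whatever reward one brain earns, the opposing brain receives the negative of it, so once neither side dominates, both means sit near 0:

from statistics import mean

# Toy illustration: the striker's reward for six hypothetical episodes;
# in this adversarial game the opposing goalie receives the negative of it.
striker_rewards = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
goalie_rewards = [-r for r in striker_rewards]

print("Striker mean reward:", mean(striker_rewards))  # 0.0 when evenly matched
print("Goalie mean reward:", mean(goalie_rewards))    # always the negative of the striker's mean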
  8. Let the sample run to completion. You can easily get lost watching these environments, so you may not even notice the time go by.

This example showed how we can harness the power of multi-agent training through self-play to teach two brains how to both compete and cooperate at the same time. In the next section, we look at multiple agents competing against one another in self-play.
