Adversarial self-play

In the previous example, we saw both cooperative and competitive self-play, with multiple agents functioning almost symbiotically. While instructive, that setup still tied one brain's behavior to another's through their reward functions, which is why we observed the agents in a nearly reward-opposite scenario. Now we want to look at an environment that trains a brain across multiple agents using pure adversarial self-play. ML-Agents provides such an environment, called Banana, in which several agents wander the scene and collect bananas. Each agent also carries a laser pointer that lets it disable an opposing agent for several seconds with a hit. This is the scene we will look at in the next exercise:

  1. Open the Banana scene from the Assets | ML-Agents | Examples | BananaCollectors | Scenes folder.
  2. Select and disable the additional training areas RLArea(1) to RLArea(3).
  3. Select the five agents (Agent, Agent(1), Agent(2), Agent(3), Agent(4)) in the RLArea.
  4. Swap the Banana Agent | Brain from BananaPlayer to BananaLearning.
  5. Select the Academy and set the Banana Academy | Brains | Control property to Enabled.
  6. Select the Banana Agent component (Script) in the editor, and open it in your code editor of choice. If you scroll down to the bottom, you can see the OnCollisionEnter method as shown:
void OnCollisionEnter(Collision collision)
{
    // Reward +1 for eating a good banana
    if (collision.gameObject.CompareTag("banana"))
    {
        Satiate();
        collision.gameObject.GetComponent<BananaLogic>().OnEaten();
        AddReward(1f);
        bananas += 1;
        if (contribute)
        {
            myAcademy.totalScore += 1;
        }
    }
    // Penalty -1 for eating a bad banana
    if (collision.gameObject.CompareTag("badBanana"))
    {
        Poison();
        collision.gameObject.GetComponent<BananaLogic>().OnEaten();
        AddReward(-1f);
        if (contribute)
        {
            myAcademy.totalScore -= 1;
        }
    }
}
  7. Reading the preceding code, we can summarize our reward function as follows:

reward = +1 when the agent eats a banana
reward = -1 when the agent eats a bad banana

This simply means the agents only receive a reward for eating bananas. Interestingly, there is no reward for disabling an opponent with the laser, nor any penalty for being disabled.
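
To make this concrete, the following is a minimal sketch of how such a no-reward freeze mechanic could be wired up. The names used here (frozen, freezeTime, FreezeAgent) are hypothetical illustrations, not the actual BananaCollectors source; the key point is that AddReward is never called, so any use of the laser must be learned purely through its indirect effect on banana collection:

bool frozen;        // whether this agent is currently disabled
float freezeTime;   // the moment the current freeze began

// Hypothetical sketch: called on the opponent that a laser hits
public void FreezeAgent()
{
    frozen = true;              // stop moving and shooting
    freezeTime = Time.time;     // record when the hit happened
    // Note: no AddReward() call; the laser itself is unrewarded
}

void Update()
{
    // Automatically recover roughly four seconds after being hit
    if (frozen && Time.time > freezeTime + 4f)
    {
        frozen = false;
    }
}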

  8. Save the scene and the project.
  9. Open a prepared Python/Anaconda console and start training with the following command:
mlagents-learn config/trainer_config.yaml --run-id=banana --train
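The trainer_config.yaml file holds the hyperparameters for each learning brain. As a rough illustration, the entry for the BananaLearning brain might look something like the following sketch; the exact keys and values depend on your ML-Agents version, and these numbers are illustrative rather than the values shipped with the toolkit:

BananaLearning:
    trainer: ppo          # train this brain with the PPO algorithm
    batch_size: 1024      # experiences consumed per gradient update
    buffer_size: 10240    # experiences gathered before each update
    beta: 5.0e-3          # entropy regularization strength
    max_steps: 1.0e5      # total simulation steps to train for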
  10. Press Play in the editor when prompted, and watch the action unfold, as shown in the next screenshot:

The Banana Collector agents doing their work
  11. Let the scene run for as long as you like.

This scene is an excellent example of how agents learn to use a secondary game mechanic that returns no reward: the laser earns nothing by itself, yet the agents still learn to use it to immobilize adversarial collectors and obtain more bananas, all while being rewarded only for eating bananas. This shows some of the true power of RL and its ability to discover secondary strategies in order to solve problems. While this is entertaining and fun to watch in a game, consider the grander implications. RL with adversarial self-play has been shown to optimize everything from networking to recommender systems, and it will be interesting to see what this method of learning is capable of accomplishing in the near future.
