Sparsity of rewards

We call the situation in which an agent receives too few, or no, positive rewards a sparsity of rewards. The simplest way to show how reward sparsity can arise is by example, and fortunately, GridWorld can easily demonstrate this for us. Open the GridWorld example in the editor and follow this exercise:

  1. Open the GridWorld sample scene from where we left it in the last exercise. For the purposes of this exercise, it is also helpful to have trained the original sample to completion. GridWorld is a nice, compact example that trains quickly, and it is an excellent place to test basic concepts or even hyperparameters.
  2. Select the GridAcademy and change the Grid Academy | Reset Parameters | gridSize to 25, as shown in the following screen excerpt:

Setting the GridAcademy gridSize parameter
  3. Save the scene and the project.
  4. Launch the sample into training with the following command from your Python/Anaconda window:
mlagents-learn config/trainer_config.yaml --run-id=grid25x25 --train
  5. This will launch the sample and, assuming you still have the agentCam set as the main camera, you should see the following in the Game window:

The GridWorld with a grid size of 25x25
  6. We have extended the gameplay space from a 5x5 grid to a 25x25 grid, making the goal (+) symbol much more difficult for the agent to find at random.
  7. What you will quickly notice after a few reported iterations is how poorly the agent is performing, in some cases even reporting less than a -1 mean reward. What's more, the agent could continue training like this for a long time; in fact, it is possible the agent never discovers a reward within 100, 200, 1,000, or more iterations. Now, this may appear to be a problem of state, and, in some ways, you may think of it that way. However, remember that the input state into our agent is the same camera view, an 84x84 pixel image, and we have not changed that. So, for the purposes of this example, think of the state in the policy RL algorithm as remaining fixed. Therefore, our best course of action to fix the problem is to increase the rewards (see the short sketch that follows this exercise).
  8. Stop the training example from the Python/Anaconda window by typing Ctrl + C. In order to be fair, we will increase the number of goals and deaths (obstacles) equally.
  9. Back in the editor, select the GridAcademy and increase the numObstacles and numGoals on the Grid Academy | Reset Parameters component properties, as shown in the following excerpt:

Updating the number of Obstacles and Goals
  10. Save the scene and the project.
  11. Launch the training session with the following code:
mlagents-learn config/trainer_config.yaml --run-id=grid25x25x5 --train
  12. The run-id denotes that we are running the sample with five times the number of obstacles and goals.
  13. Let the agent train for 25,000 iterations and notice the performance increase. Then let the agent train to completion and compare the results to our first run.
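
To make the sparsity concrete, here is a minimal sketch of the kind of reward function a GridWorld-style task uses. The function name and values are illustrative assumptions, not the sample's actual code, but they capture the structure: the agent only receives a meaningful signal when it lands on a goal or an obstacle, and every other step returns at most a small penalty.

# Hypothetical sketch of a GridWorld-style sparse reward (illustrative values).
# On a 25x25 grid with a single goal, almost every step falls into the last
# branch, so the agent rarely sees anything but the small step penalty.
def grid_reward(agent_cell, goal_cells, obstacle_cells):
    """Return (reward, done) for the cell the agent just moved onto."""
    if agent_cell in goal_cells:        # reached a goal (+): positive reward
        return 1.0, True
    if agent_cell in obstacle_cells:    # hit an obstacle (x): negative reward
        return -1.0, True
    return -0.01, False                 # any other step: tiny penalty only

# Example: one goal and one obstacle somewhere on a 25x25 grid.
reward, done = grid_reward(agent_cell=(12, 7),
                           goal_cells={(3, 21)},
                           obstacle_cells={(18, 2)})
print(reward, done)   # -0.01 False, the typical, uninformative outcome

Because the informative branches fire so rarely on the large grid, the agent can go through long stretches of training without ever seeing the positive signal it needs to learn from.
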
The problem of sparse rewards is generally encountered more frequently in discrete action tasks, such as GridWorld, Hallway, and so on, because the reward function is often absolute. In continuous learning tasks, the reward function is usually more gradual, typically measuring progress toward a goal rather than only the goal itself.
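
The difference between an absolute and a progress-based reward can be sketched in a few lines. The shaping below is a generic illustration, not ML-Agents code; the function names and values are assumptions made for this example.

import math

# Absolute reward, typical of a discrete task such as GridWorld:
# no signal at all until the goal is actually reached.
def absolute_reward(reached_goal):
    return 1.0 if reached_goal else 0.0

# Progress-based reward, typical of a continuous task (illustrative shaping):
# pay the agent for any reduction in distance to the target on this step.
def progress_reward(prev_pos, new_pos, target):
    prev_dist = math.dist(prev_pos, target)   # math.dist requires Python 3.8+
    new_dist = math.dist(new_pos, target)
    return prev_dist - new_dist               # positive when the agent moves closer

print(absolute_reward(False))                   # 0.0, no signal at all
print(progress_reward((0, 0), (1, 0), (5, 0)))  # 1.0, moved one unit closer

With the progress-based version, almost every action produces some feedback, which is why continuous tasks tend to suffer less from reward sparsity.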

By increasing the number of obstacles and goals (the negative and positive rewards), we are able to train the agent much more quickly, although you will likely see very erratic cycles of training, and the agent never truly gets as good as the original; in fact, training may diverge at some point later on. The reason for this is partly the agent's limited vision and partly that we have only partially corrected the sparse-rewards problem. We can, of course, correct it further in this example by simply increasing the number of goals and obstacles again: go back and try a value of 25 for the number of obstacles and goals, and you will see much more stable, long-term results.
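
A quick back-of-envelope calculation shows why adding goals helps and why a value of 25 behaves so much better than 5. Assuming the goals are scattered uniformly over the grid on each reset, and that the original sample uses a single goal on a 5x5 grid (its default reset parameters), the fraction of cells holding a positive reward is a rough proxy for how often a randomly exploring agent stumbles onto one:

# Rough goal density: fraction of grid cells that hold a positive reward.
def goal_density(grid_size, num_goals):
    return num_goals / (grid_size * grid_size)

print(goal_density(5, 1))     # 0.04    original 5x5 grid with 1 goal
print(goal_density(25, 1))    # 0.0016  25x25 grid with 1 goal: 25x sparser
print(goal_density(25, 5))    # 0.008   5 goals: better, but still 5x sparser
print(goal_density(25, 25))   # 0.04    25 goals restores the original density

With 25 goals, the chance of randomly bumping into a positive reward is back to roughly what it was on the original 5x5 grid, which helps explain the more stable, long-term results at that setting.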

Of course, in many RL problems, increasing the number of rewards is not an option, and we need to look at cleverer methods. Fortunately, a number of methods have arisen in a very short time to address the problem of sparse or difficult rewards, and Unity has quickly implemented several of them. The first of these, Curriculum Learning, is the subject of the next section.
