Rewards and reward functions

We often approach reward-based learning or training with the preconceived notion that it consists of a single action being completed, followed by a reward, be it good or bad. While that notion works perfectly well for a single-action task, such as the multi-armed bandit problem we looked at earlier, or teaching a dog a trick, recall that reinforcement learning is really about an agent learning the value of actions by anticipating future rewards across a series of actions. At each step, when the agent is not exploring, it will choose its next action based on which action it perceives as having the best expected reward. What is not always so clear is what those rewards should represent numerically, and to what extent that matters. It is therefore often helpful to map out a simple set of reward functions that describe the learning behavior we want our agent to train on.
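To make this concrete, here is a minimal sketch, in Python and purely for illustration, of what such a reward mapping might look like for a grid-world-style task. The event names and reward values (+1 for the goal, -1 for the pit, a small per-step penalty) are assumptions for this sketch, not values taken from the GridWorld sample; we will read the real values out of the code shortly.

# Illustrative only: a tiny reward-function mapping for a grid-world-style task.
# The event names and reward values are assumptions, not the GridWorld sample's.
REWARD_MAP = {
    "reached_goal": 1.0,   # episode ends with a positive reward
    "hit_pit": -1.0,       # episode ends with a negative reward
    "step": -0.01,         # small penalty per step, encouraging short paths
}

def reward_for(event):
    """Look up the reward associated with a single event."""
    return REWARD_MAP.get(event, 0.0)

# A hypothetical episode: four ordinary steps, then the goal is reached.
episode_events = ["step", "step", "step", "step", "reached_goal"]
total = sum(reward_for(e) for e in episode_events)
print(round(total, 2))     # 0.96 -- the per-step penalties trim the final reward

Writing the mapping out like this, before touching any engine code, makes it much easier to reason about exactly which behavior the agent is being paid to learn.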

Let's open the GridWorld example in the Unity editor and learn how to create a set of reward functions and mappings that describe the training, as follows:

  1. Open up the GridWorld example from the Assets | ML-Agents | Examples | GridWorld | Scenes folder.
  2. Select the trueAgent object in the Hierarchy and then switch the agent's brain, at Grid Agent | Brain, to GridWorldLearning.
  3. Select the GridAcademy and set the Grid Academy | Brains | Control option to enabled.
  4. Select and disable the Main Camera in the scene. This will make the agent's camera the primary camera, and the one we can view the scene with.
  5. Open up and prepare a Python or Anaconda window for training. Check previous chapters or the Unity documentation if you need to remember how to do this.
  6. Save the scene and project.
  7. Launch the sample into training using the following command at the Python/Anaconda window:
mlagents-learn config/trainer_config.yaml --run-id=gridworld --train
  8. One of the first things you will appreciate about this sample is how quickly it trains. Remember that the primary reason the agent trains so quickly is that the state space is so small: just 5x5 in this example (a rough count of the possible configurations follows after this list). An example of the simulation running is shown in the following screenshot:
GridWorld example running on a 5x5 grid
  9. Run the sample until completion. It does not take long to run, even on older systems.
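To see why such a small grid trains so quickly, a rough count of the distinct board configurations helps. The sketch below assumes a single agent, a single goal, and a single obstacle, each on its own cell; the real sample observes the grid through a camera, so treat this purely as a back-of-the-envelope estimate.

# Back-of-the-envelope count of distinct 5x5 board configurations,
# assuming one agent, one goal, and one obstacle on separate cells.
grid_cells = 5 * 5                 # 25 cells
configurations = grid_cells * (grid_cells - 1) * (grid_cells - 2)
print(configurations)              # 13800 -- tiny by RL standards

Even a few thousand configurations is a very small space for a reinforcement learning agent to cover, which is why the mean reward climbs after only a short amount of training.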

Notice how the agent quickly goes from a negative mean reward to a positive one as it learns to place the cube over the green +. Did you notice, however, that training starts from a negative mean reward? The agent begins each episode with a reward of zero, so let's examine where that negative reward is coming from. In the next section, we look at how the reward functions are built by examining the code.
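Before we open that code, it is worth sketching how a small per-step penalty could produce exactly this pattern. The values below (-0.01 per step, +1 for reaching the goal, -1 for hitting the pit) are assumptions for illustration only; the actual values are the ones we will confirm in the agent's code.

# Why an untrained agent's mean reward starts out negative (illustrative values).
step_penalty, goal_reward, pit_reward = -0.01, 1.0, -1.0

# Three hypothetical early episodes: one times out, one hits the pit, one succeeds.
episodes = [
    {"steps": 100, "outcome": 0.0},          # wandered until the episode reset
    {"steps": 12,  "outcome": pit_reward},   # stumbled into the pit
    {"steps": 30,  "outcome": goal_reward},  # found the goal by luck
]

rewards = [ep["steps"] * step_penalty + ep["outcome"] for ep in episodes]
print([round(r, 2) for r in rewards])        # [-1.0, -1.12, 0.7]
print(round(sum(rewards) / len(rewards), 2)) # -0.47 -- a negative mean reward

As the agent learns to head straight for the goal, the per-step penalties shrink and the positive goal reward dominates, which is exactly the climb from negative to positive mean reward we see in the training output.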
