Building reward functions

Building reward functions can be quite simple, as this one will be, or extremely complex, as you may well imagine. While this step is optional for training these examples, it is almost mandatory when you go to build your own environments. It can also help you identify problems in your training, as well as ways of enhancing or easing training.

Open up the Unity editor and follow this exercise to build these sample reward functions:

  1. Select the trueAgent object in the Hierarchy window and then click the target icon beside the Grid Agent component.
  2. Select Edit Script from the context menu.
  3. After the script opens in your editor, scroll down to the AgentAction method as follows:
public override void AgentAction(float[] vectorAction, string textAction)
{
    // Step penalty: every action costs the agent a small negative reward
    AddReward(-0.01f);
    int action = Mathf.FloorToInt(vectorAction[0]);

    ... // omitted for brevity

    // targetPos is the candidate grid position computed from action in the omitted code.
    // Check whether that position is blocked by a wall (the Where calls require System.Linq).
    Collider[] blockTest = Physics.OverlapBox(targetPos, new Vector3(0.3f, 0.3f, 0.3f));
    if (blockTest.Where(col => col.gameObject.CompareTag("wall")).ToArray().Length == 0)
    {
        transform.position = targetPos;
        if (blockTest.Where(col => col.gameObject.CompareTag("goal")).ToArray().Length == 1)
        {
            // Reached the goal: end the episode with the maximum reward
            Done();
            SetReward(1f);
        }
        if (blockTest.Where(col => col.gameObject.CompareTag("pit")).ToArray().Length == 1)
        {
            // Fell into the pit: end the episode with the maximum penalty
            Done();
            SetReward(-1f);
        }
    }
}
  4. We want to focus on the reward lines, AddReward and SetReward:
    • AddReward(-0.01f): This first line applies a step reward. Every step the agent takes costs it a small negative reward, which is why we see the agent's cumulative reward stay negative until it finds the positive reward.
    • SetReward(1f): This is the final positive reward the agent receives when it reaches the goal, set to the maximum value of 1. In these types of training scenarios, we prefer to keep rewards in the range of -1 to +1.
    • SetReward(-1f): This is the pit-of-death reward, a final negative reward.
  5. Using each of the previous statements, we can map these to reward functions, as shown in the sketch after this list:
    • AddReward(-0.01f) = a step reward of -0.01, applied on every action
    • SetReward(1f) = a terminal reward of +1 for reaching the goal
    • SetReward(-1f) = a terminal reward of -1 for falling into the pit
  6. One thing to notice here is that AddReward adds an incremental reward, while SetReward sets the final value. So, the agent only ever sees a positive reward by reaching the final goal.
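
To make this mapping concrete, here is a minimal sketch, not part of the ML-Agents scripts, that expresses the same reward function as a single standalone method; the StepOutcome enum and GridReward class names are invented purely for illustration:

// Hypothetical helper that mirrors the rewards used in AgentAction above.
public enum StepOutcome { Moved, ReachedGoal, FellInPit }

public static class GridReward
{
    // Reward for a single step, matching the AddReward/SetReward calls:
    // SetReward replaces the step's reward, so terminal steps return exactly +1 or -1.
    public static float Reward(StepOutcome outcome)
    {
        switch (outcome)
        {
            case StepOutcome.ReachedGoal:
                return 1f;     // SetReward(1f)
            case StepOutcome.FellInPit:
                return -1f;    // SetReward(-1f)
            default:
                return -0.01f; // AddReward(-0.01f) step penalty
        }
    }
}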

By mapping these reward functions, we can see that the only way an agent can earn a positive reward is by finding its way to the goal. This is why the agent begins with a negative cumulative reward: at first, it essentially only learns to avoid wasting moves, until it randomly encounters the goal. From there, the agent can quickly assign value to states based on the positive rewards it has previously received. The catch is that the agent first needs to stumble onto a positive reward before any useful learning can begin. We discuss this particular problem in the next section.
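
To see why the agent's cumulative reward starts out negative, it helps to tally the return for a few hypothetical episodes. The episode lengths below are invented purely for illustration, and we assume, as noted above, that SetReward replaces the step penalty on the terminal step:

// Back-of-the-envelope episode returns for the GridWorld rewards.
using System;

public static class ReturnExample
{
    public static void Main()
    {
        const float stepPenalty = -0.01f;

        // 19 wandering steps at -0.01 each, then SetReward(1f) on the goal step.
        float goalEpisode = 19 * stepPenalty + 1f;   // about +0.81

        // 19 wandering steps, then SetReward(-1f) on the pit step.
        float pitEpisode = 19 * stepPenalty - 1f;    // about -1.19

        // 50 steps with no terminal state: only penalties accumulate.
        float lostEpisode = 50 * stepPenalty;        // -0.50

        Console.WriteLine($"Goal after 20 steps: {goalEpisode:F2}");
        Console.WriteLine($"Pit after 20 steps:  {pitEpisode:F2}");
        Console.WriteLine($"No terminal state:   {lostEpisode:F2}");
    }
}

Until the agent reaches the goal at least once, every episode looks like the last case, which is exactly the problem described above.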
