Generalized advantage estimate

The area of RL is seeing explosive growth due to constant research pushing the envelope of what is possible. Every little advancement brings additional hyperparameters and small tweaks that can be applied to stabilize and/or improve training performance. Unity has recently added a new parameter called lambd (lambda), and the definition taken from the documentation is as follows:

  • lambda: This corresponds to the lambda parameter used when calculating the Generalized Advantage Estimate (GAE) https://arxiv.org/abs/1506.02438. This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to more reliance on the current value estimate (which can be high bias), and high values correspond to more reliance on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process:
    • Typical range: 0.9 – 0.95
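
Before we adjust this parameter, it helps to see roughly what lambda does. The following is a minimal sketch, not the ML-Agents implementation, of how a GAE as described in the paper can be computed from a trajectory of rewards and value estimates; the compute_gae function and the sample numbers are purely illustrative:

import numpy as np

def compute_gae(rewards, values, gamma=0.995, lambd=0.95):
    """Compute Generalized Advantage Estimates for one episode.

    rewards: rewards r_0 .. r_T-1
    values:  value estimates V(s_0) .. V(s_T), including a bootstrap value
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    # Work backwards so each advantage can reuse the next one:
    #   delta_t = r_t + gamma * V(s_t+1) - V(s_t)
    #   A_t     = delta_t + gamma * lambd * A_t+1
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lambd * gae
        advantages[t] = gae
    return advantages

# lambd near 0 collapses each advantage to a single TD error (low variance,
# high bias); lambd near 1 approaches the full discounted return minus V(s_t)
# (low bias, high variance).
rewards = [0.1] * 5                       # small, constant rewards, as in Walker
values = [1.0, 1.1, 1.2, 1.1, 1.0, 0.9]   # made-up value estimates
print(compute_gae(rewards, values, lambd=0.95))
print(compute_gae(rewards, values, lambd=0.99))

The only thing lambd changes is how quickly the weight on later TD errors decays: a higher value keeps more of the raw reward signal, and its variance, in the estimate.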

The GAE paper describes a parameter, lambda, that can be used to shape the advantage estimation function, as the previous sketch illustrates, and it is best suited to control or marathon RL tasks. We won't go too far into the details, and interested readers should certainly pull down the paper and review it on their own. However, we will explore how altering this parameter affects a control sample such as the Walker scene in the next exercise:

  1. Open the Unity editor to the Walker example scene.
  2. Select the Academy object in the Hierarchy and confirm that the scene is still set for training/learning. If it is, you won't have to do anything else. If the scene isn't set up to learn, you know what to do.
  3. Open the trainer_config.yaml file and modify the WalkerLearning section as follows:
WalkerLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    gamma: 0.995
    max_steps: 2e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
    lambd: .99
  4. Notice how we are setting the lambd parameter, and make sure that num_layers and hidden_units are reset to their original values. In the paper, the authors describe optimum values from .95 to .99, but this differs from the Unity documentation.
  5. Save the file when you are done editing.
  6. Open a Python console set up for training and run the trainer with the following command:
mlagents-learn config/trainer_config.yaml --run-id=walker_lambd --train
  7. Make sure that you let the sample run for as long as you did previously so that you can make a good comparison.
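
If you want to compare this run against your earlier Walker run side by side, you can point TensorBoard at the training summaries. The folder name below assumes the default layout used by the ML-Agents version in this book; newer releases may write results to a different directory:

tensorboard --logdir=summaries

Then open http://localhost:6006 in a browser and look for the walker_lambd run alongside your previous run ID.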

One thing you will notice after a lot of training is that the agent does indeed train almost 25% slower on this example. What this result tells us is that, by increasing lambda, we are telling the agent to put more weight on the actual rewards it receives. This may seem counter-intuitive, but in this type of environment, the agent receives a constant stream of small positive rewards, and weighting them more heavily adds variance to the advantage estimates, which, as we can see, skews training and impedes the agent's progress. It may be an interesting exercise for interested readers to play with the lambda parameter in the Hallway environment, where the agent only receives a single positive reward per episode.
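
To see why dense and sparse rewards respond so differently to lambda, here is a rough, illustrative experiment that reuses the GAE recursion from the earlier sketch. It assumes an untrained critic (all value estimates set to zero), and the reward streams are made up rather than taken from the actual Walker or Hallway scenes:

import numpy as np

def gae(rewards, values, gamma=0.995, lambd=0.95):
    # Same GAE recursion as in the earlier sketch.
    advantages, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lambd * running
        advantages[t] = running
    return advantages

T = 200
values = np.zeros(T + 1)                  # untrained critic: every estimate is 0
dense = np.full(T, 0.01)                  # Walker-style constant small rewards
sparse = np.zeros(T)
sparse[-1] = 1.0                          # Hallway-style single episode reward

for lam in (0.95, 0.99):
    print(lam,
          round(gae(dense, values, lambd=lam)[0], 3),   # first-step advantage
          round(gae(sparse, values, lambd=lam)[0], 3))

With the dense stream, raising lambd inflates the first-step advantage because more of the raw reward sum is folded in; with the sparse stream, raising lambd is what lets the single terminal reward reach the early timesteps at all.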

The RL advantage function comes in many forms, and these functions are in place to address many of the issues with model-free, policy-driven algorithms such as PPO. In the next section, we round off the chapter by modifying and creating a new sample control/marathon learning environment of our own.
