Expanding network architecture

Actor-Critic architectures take on more complex problems, and thus need larger, more complex networks to solve them. This is really no different from what we saw earlier with PilotNet, the multilayer CNN architecture that Nvidia used for self-driving cars.

What we want to see is the immediate effect that changing the size of our network has on a complex example such as Walker. Open the Walker example in Unity and complete the following exercise:

  1. Open trainer_config.yaml from where it is normally located.
  2. Modify the WalkerLearning configuration, as shown in the following code:
WalkerLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    gamma: 0.995
    max_steps: 2e6
    summary_freq: 3000
    num_layers: 1
    hidden_units: 128
  3. Set num_layers: 1 and hidden_units: 128. These are typical values that we would use for discrete action space problems. You can confirm this by looking at another discrete sample, such as the VisualHallwayLearning configuration, as follows:
VisualHallwayLearning:
    use_recurrent: false
    sequence_length: 64
    num_layers: 1
    hidden_units: 128
    memory_size: 256
    beta: 1.0e-2
    gamma: 0.99
    num_epoch: 3
    buffer_size: 1024
    batch_size: 64
    max_steps: 5.0e5
    summary_freq: 1000
    time_horizon: 64
  4. This sample uses the same network settings that we just applied to our continuous action problem.
  5. When you are done editing, save everything and get ready for training.
  6. Launch a training session with a new run-id parameter (see the example command after this list). Remember to get in the habit of changing the run-id parameter with every run so that it is easier to discern each run in TensorBoard.
  7. As always, let the sample run for as long as you ran the earlier, unaltered example so that you can make a good comparison.
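
If you need a refresher on the launch command, the following is a minimal sketch. The exact syntax and paths depend on your ML-Agents version, and the run-id value (walker_small_net) is just an illustrative name:

    # Run from the ml-agents folder; older toolkit versions also require the --train flag
    mlagents-learn config/trainer_config.yaml --run-id=walker_small_net --train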

One of the first things you may notice when running this sample is how stable the training is. The second thing you may notice is that while training stability increases, performance slightly decreases. Remember that a smaller network has fewer weights and will generally train more stably and more quickly. However, in this problem, while training on the smaller network is more stable and promises to be faster, you may notice that it eventually hits a wall. The agent, now limited by network size, is able to optimize the smaller network faster, but without the fine control we have seen before. In fact, this agent will never be as good as the one from the first, unaltered run, since it is now limited by a smaller network. This is another one of those trade-offs you need to balance when building DRL agents for games/simulations.
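
To compare this run against the unaltered one, you can point TensorBoard at the training summaries. This is a sketch that assumes the default summaries folder used by older ML-Agents versions; adjust the path for your setup:

    # Plot both run-ids on the same reward graphs
    tensorboard --logdir=summaries

With both runs plotted together, the earlier plateau of the smaller network should be easy to spot against the original run.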

In the next section, we take a further look at what we call advantage functions, such as those used in Actor-Critic, and we will first explore TRPO and then, of course, PPO.
