Actor-Critic and continuous action spaces

Another complexity we introduced when looking at marathon RL, or control learning, was the continuous action space. A continuous action space represents an infinite set of possible actions an agent could take. Where our agent could previously pick a discrete action, such as yes or no, it now has to select a point within an infinite range of values as the action for each joint. Mapping a state to a point in an infinite action space is not an easy problem to solve; however, we do have neural networks at our disposal, and these provide us with an excellent solution using an architecture not unlike the GANs we looked at in Chapter 3, GAN for Games.
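To make the difference concrete, the following minimal sketch (plain Python with NumPy; the action meanings and sizes are made up for illustration) shows how an agent might pick an action in a discrete space versus a continuous one:

import numpy as np

# Discrete space: the network outputs one score (logit) per action,
# and the agent simply picks the highest-scoring action.
logits = np.array([0.2, 1.7, -0.4])        # e.g. scores for left, forward, right
discrete_action = int(np.argmax(logits))   # a single integer index

# Continuous space: the network outputs a mean (and spread) for every joint,
# and the agent samples a real-valued setting, such as a torque, for each one.
means = np.array([0.1, -0.3, 0.8])         # one value per joint
stds = np.array([0.2, 0.2, 0.2])
continuous_action = np.random.normal(means, stds)
continuous_action = np.clip(continuous_action, -1.0, 1.0)  # keep within the valid range

print(discrete_action)      # e.g. 1
print(continuous_action)    # e.g. [ 0.15 -0.41  0.77]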

As we discovered in the chapter on GANs, we can propose an architecture composed of two networks that force each other to learn by competing for the best solution, in that case mapping a random space into a convincing forgery. A similar concept can be applied here as well, and it is called the Actor-Critic model. A diagram of this model is as follows:



Actor-Critic architecture

What happens here is that the Actor selects an action from the policy, given a state. The Critic evaluates that state and estimates how good the chosen action is, producing an error signal that tells the Actor how much better or worse the action was than expected. More simply put, the Critic criticizes each action the Actor takes based on the current state, and the Actor uses that feedback to choose better actions over time.

This method of action selection was first explored in an algorithm called dueling double Q networks (DDQN). It is now the basis for most advanced RL algorithms.

Actor-Critic was essentially required to solve the continuous action space problem, but, given its performance, the method has been incorporated into some advanced discrete-action algorithms as well. ML-Agents uses an Actor-Critic model for continuous action spaces, but does not use one for discrete action spaces.
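To make the two-network idea concrete, here is a minimal PyTorch sketch; the class name, shared body, and layer sizes are assumptions for illustration, not the exact network ML-Agents builds. The Actor head outputs a continuous action distribution (one value per joint), and the Critic head outputs a value estimate for the state:

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal Actor-Critic sketch: a shared encoder with two heads."""

    def __init__(self, obs_size, action_size, hidden_units=128):
        super().__init__()
        # Shared layers that encode the observation (state).
        self.body = nn.Sequential(
            nn.Linear(obs_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, hidden_units), nn.ReLU(),
        )
        # Actor head: mean of a Gaussian over each continuous action.
        self.actor_mean = nn.Linear(hidden_units, action_size)
        self.log_std = nn.Parameter(torch.zeros(action_size))
        # Critic head: a single scalar value estimate for the state.
        self.critic = nn.Linear(hidden_units, 1)

    def forward(self, obs):
        features = self.body(obs)
        return self.actor_mean(features), self.log_std.exp(), self.critic(features)

# Usage: sample a continuous action and read the Critic's value estimate.
model = ActorCritic(obs_size=64, action_size=8)   # example sizes, not Walker's
obs = torch.randn(1, 64)
mean, std, value = model(obs)
action = torch.distributions.Normal(mean, std).sample()

During training, the Critic's value estimate supplies the error signal that is used to improve the Actor's policy.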

Using Actor-Critic requires, or at least works best with, additional layers and neurons in our network, which is something we can configure in ML-Agents. The hyperparameter definitions for these are taken from the ML-Agents documentation and are as follows (a short sketch of how they translate into a network follows the list):

  • num_layers: This corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems:
    • Typical range: 1 – 3
  • hidden_units: These correspond to how many units are in each fully-connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger:
    • Typical range: 32 – 512
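The following is a small, hypothetical PyTorch sketch (not ML-Agents internals) of how num_layers and hidden_units typically translate into the fully-connected stack that sits after the observation input:

import torch.nn as nn

def build_hidden_stack(input_size, num_layers=3, hidden_units=512):
    """Build num_layers fully-connected layers of hidden_units each."""
    layers = []
    in_size = input_size
    for _ in range(num_layers):
        layers.append(nn.Linear(in_size, hidden_units))
        layers.append(nn.ReLU())
        in_size = hidden_units
    return nn.Sequential(*layers)

# With the Walker settings shown later (num_layers: 3, hidden_units: 512),
# this produces three 512-unit hidden layers after the observation input.
encoder = build_hidden_stack(input_size=128, num_layers=3, hidden_units=512)
print(encoder)

More layers and units give the network more capacity for complex control problems such as Walker, at the cost of slower training.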

Let's open up a new ML-Agents marathon or control sample and see what effect modifying these parameters has on training. Follow this exercise to understand the effect of adding layers and neurons (units) to a control problem:

  1. Open the Walker scene from the Assets/ML-Agents/Examples/Walker/Scenes folder. This example features a walking humanoid animation.
  2. Locate and select the WalkerAgent object in the Hierarchy window, and then look to the Inspector window and examine the Agent and Brain settings, as shown in the following screenshot:
The WalkerAgent and WalkerLearning properties
  3. Select WalkerAcademy in the Hierarchy window and make sure the Control option is enabled for the Brains parameter.
  4. Open the trainer_config.yaml file located in the ML-Agents/ml-agents/config folder and scroll down to the WalkerLearning section as follows:
WalkerLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2048
    buffer_size: 20480
    gamma: 0.995
    max_steps: 2e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
  5. Notice how many layers and units this example is using. Is it more or fewer than what we used for the discrete action problems?
  6. Save everything and set the sample up for training.
  7. Launch a training session from your Python console with the following command:
mlagents-learn config/trainer_config.yaml --run-id=walker --train
  8. This agent may take considerably longer to train, but try to wait for about 100,000 iterations in order to get a good sense of its training progress.
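While the agent trains, you can monitor its progress in TensorBoard. Assuming this version of ML-Agents writes its training statistics to a summaries folder (check your installation if the folder name differs), run the following from the ml-agents folder:

tensorboard --logdir=summaries

Then open http://localhost:6006 in a browser and watch the cumulative reward curve to judge how well the agent is learning.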

Now that we have a better understanding of Actor-Critic and how it is used in continuous action spaces, we can move on to exploring what effect changing the network size has on training these more complex networks in the next section.
