Actor-critic methods

Approaches to reinforcement learning can be divided into three broad categories:

  • Value-based learning: This tries to learn the expected reward (value) of being in a state. The desirability of getting into different states can then be evaluated based on their relative values. Q-learning is an example of value-based learning.
  • Policy-based learning: Here, no attempt is made to evaluate the state; instead, different control policies are tried out and evaluated based on the actual reward received from the environment. Policy gradients are an example of this approach.
  • Model-based learning: In this approach, which will be discussed in more detail later in the chapter, the agent attempts to model the behavior of the environment and chooses its actions by simulating, with that model, the results of the actions it might take.

Actor-critic methods all revolve around the idea of using two neural networks for training. The first, the critic, uses value-based learning to learn a value function for a given state: the expected reward the agent will achieve from it. The second, the actor, uses policy-based learning to maximize the value function produced by the critic. The actor still learns using policy gradients, but its target has changed: rather than the actual reward received by playing, it uses the critic's estimate of that reward.
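
To make this division of labor concrete, here is a minimal sketch of the two networks, assuming PyTorch; the layer sizes, variable names, and dummy state are purely illustrative and are not taken from the book's example code.

# Minimal actor-critic sketch, assuming PyTorch; sizes and names are illustrative.
import torch
import torch.nn as nn

state_size, n_actions, hidden = 4, 2, 64   # illustrative dimensions

# The critic: value-based learning, mapping a state to a scalar value V(s).
critic = nn.Sequential(
    nn.Linear(state_size, hidden), nn.ReLU(),
    nn.Linear(hidden, 1))

# The actor: policy-based learning, mapping a state to a distribution pi(a | s).
actor = nn.Sequential(
    nn.Linear(state_size, hidden), nn.ReLU(),
    nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

state = torch.randn(1, state_size)               # dummy state
dist = torch.distributions.Categorical(actor(state))
action = dist.sample()

# Policy-gradient update, but the target is the critic's estimate of the
# reward rather than the actual reward received from the environment.
target = critic(state).detach().squeeze(-1)
actor_loss = -dist.log_prob(action) * target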

One of the big problems with Q-learning is that, in complex cases, it can be very hard for the algorithm to ever converge. As re-evaluations of the Q-function change which actions are selected, the actual rewards received can vary massively. For example, imagine a simple maze-walking robot. At the first T-junction it encounters in the maze, it initially moves left. Successive iterations of Q-learning eventually lead it to determine that right is the preferable way to move. But because its path is now completely different, every other state evaluation must be recalculated; the previously learned knowledge is of little value. Q-learning suffers from high variance because small shifts in policy can have huge impacts on reward.

In actor-critic methods, what the critic does is very similar to Q-learning, but with a key difference: instead of learning the value of the hypothetical best action for a given state, it learns the expected reward under the likely sub-optimal policy that the actor is currently following.

Conversely, policy gradients have the inverse problem. Because they explore the maze stochastically, a move that is in fact quite good may end up being evaluated as bad, simply because other, bad moves were selected in the same rollout. Here the policy itself is more stable, but the evaluation of that policy has high variance.

This is where actor-critic aims to solve both problems at once. The value-based learning now has lower variance because the policy is more stable and predictable, while the policy-gradient learning is also more stable because it gets its gradients from a much lower-variance value function.

Baseline for variance reduction

There are a few different variants of actor-critic methods; the first one we will look at is the baseline actor-critic. Here, the critic tries to learn the average performance of the agent from a given position, so its loss function would be this:

L_{\text{critic}} = \sum_{t} \left( r_t - V(s_t) \right)^2

Here, V(s_t) is the output of the critic network for the state at time step t, and r_t is the cumulative discounted reward from time step t. The actor can then be trained using the target:

r_t - V(s_t)

Because the baseline is the average performance from this state, subtracting it massively reduces the variance of training. If we run the cart-pole task once using policy gradients and once using baselines, both without batch normalization, the baseline version performs much better; once batch normalization is added, the results are not much different. For tasks more complex than cart-pole, where the reward varies a lot more with the state, the baseline approach may improve things a lot more. An example of this can be found in actor_critic_baseline_cart_pole.py.
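
The shape of this update might look like the following sketch (again assuming PyTorch; the helper names, signatures, and hyperparameters are invented for illustration and are not taken from that file):

# Baseline actor-critic update, illustrative sketch assuming PyTorch.
import torch
import torch.nn.functional as F

def discounted_returns(rewards, gamma=0.99):
    """Cumulative discounted reward r_t from each time step to the end of the rollout."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return torch.tensor(returns)

# states: (T, state_size) tensor of visited states; log_probs: (T,) tensor of
# log pi(a_t | s_t) for the actions actually taken, both collected during a rollout.
def baseline_losses(states, log_probs, rewards, critic):
    r_t = discounted_returns(rewards)          # critic targets
    v_t = critic(states).squeeze(-1)           # baseline V(s_t)
    critic_loss = F.mse_loss(v_t, r_t)         # (r_t - V(s_t))^2, averaged over t
    advantage = r_t - v_t.detach()             # actor target: r_t - V(s_t)
    actor_loss = -(log_probs * advantage).mean()
    return actor_loss, critic_loss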

Generalized advantage estimator

The baseline approach does a great job of reducing variance, but it is not a true actor-critic approach, because the actor is not learning from the gradient of the critic, merely using it to normalize the reward. The generalized advantage estimator goes a step further and incorporates the critic's gradients into the actor's objective.

In order to do this, we need to learn not just the value of the states the agent is in, but also the value of the state-action pairs it takes. If V(s_t) is the value of the state and Q(s_t, a_t) is the value of the state-action pair, we can define an advantage function like this:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t)

This gives us the difference between how well the action a_t did in state s_t and the average action the agent takes in that position. Moving in the direction of the gradient of this function should lead us to maximize our reward. Also, we don't need another network to estimate Q(s_t, a_t): we already have the value function for the state reached at time step t + 1, and the definition of the Q-function is as follows:

Q(s_t, a_t) = r_t + \gamma V(s_{t+1})

Here, r_t is now the reward for that single time step, not the cumulative reward as in the baseline equation, and γ is the future reward discount factor. We can now substitute that in to give us our advantage function purely in terms of V:

A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)

Again, this gives us a measure of whether the critic thinks a given action improved or hurt the value of the position. We replace the cumulative reward in the actor's loss function with the result of the advantage function. The full code for this is in actor_critic_advantage_cart_pole.py. Applied to the cart-pole challenge, this approach can solve it, but it may take longer than simply using policy gradients with batch normalization. For more complex tasks, however, such as learning computer games, advantage actor-critic can perform best.
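
A sketch of that substitution, under the same assumptions as the earlier snippets (PyTorch, invented helper names and tensor layout), might look like this:

# Advantage computation for the actor's loss, illustrative sketch assuming PyTorch.
import torch

def advantages(states, next_states, rewards, dones, critic, gamma=0.99):
    """A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards holds the per-step reward r_t (not the cumulative return), and
    dones is a float tensor of 0/1 flags so V(s_{t+1}) is zeroed at terminal states.
    """
    v_t = critic(states).squeeze(-1)
    v_next = critic(next_states).squeeze(-1)
    q_estimate = rewards + gamma * v_next * (1.0 - dones)   # r_t + gamma * V(s_{t+1})
    return q_estimate.detach() - v_t.detach()

# The actor's loss then uses this advantage in place of the cumulative reward:
#   actor_loss = -(log_probs * advantages(...)).mean()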
