Approaches to reinforcement learning can be divided into three broad categories: value-based methods, policy-based methods, and actor-critic methods, which combine the two.
Actor-critic methods all revolve around the idea of using two neural networks for training. The first, the critic, uses value-based learning to learn a value function for a given state: the expected reward the agent will achieve from that state. The actor network then uses policy-based learning to maximize the value function from the critic. The actor still learns using policy gradients, but its target has changed: rather than the actual reward received by playing, it uses the critic's estimate of that reward.
One of the big problems with Q-learning is that in complex cases, it can be very hard for the algorithm to ever converge. As re-evaluations of the Q-function change which actions are selected, the actual rewards received can vary massively. For example, imagine a simple maze-walking robot. At the first T-junction it encounters in the maze, it initially moves left. Successive iterations of Q-learning eventually lead it to determine that moving right is preferable. But because its path is now completely different, every other state evaluation must be recalculated; the previously learned knowledge is of little value. Q-learning suffers from high variance because small shifts in policy can have huge impacts on reward.
In actor-critic, what the critic is doing is very similar to Q-learning, but there is a key difference: instead of learning the hypothetical best action for a given state, it learns the expected reward under the (most likely suboptimal) policy that the actor is currently following.
Conversely, policy gradients have the inverse problem. Because a policy gradient agent explores the maze stochastically, a move that is in fact quite good may end up evaluated as bad, simply because other bad moves were selected in the same rollout. Here the policy itself is more stable, but the estimate of its value has high variance.
This is where actor-critic aims to solve both problems at once. The value-based learning now has lower variance, because the policy is more stable and predictable, while the policy gradient learning is also more stable, because it now gets its gradients from a much lower-variance value function.
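To make the division of labor concrete, here is a minimal NumPy sketch of the two updates running together on a toy two-armed bandit (a single-state problem; the arm payouts, learning rates, and step count are invented for illustration). The critic tracks the expected reward, and the actor takes policy-gradient steps weighted by how much each reward beat that estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 0 pays ~1.0, arm 1 pays ~0.2
true_rewards = np.array([1.0, 0.2])

theta = np.zeros(2)  # actor: softmax preferences over the two arms
v = 0.0              # critic: estimated expected reward of the (single) state

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)
    r = true_rewards[a] + rng.normal(0.0, 0.1)  # noisy observed reward

    advantage = r - v            # critic's estimate centres the learning signal
    v += 0.05 * (r - v)          # critic update: move towards observed reward
    grad_log = -probs            # gradient of log softmax probability...
    grad_log[a] += 1.0           # ...is one_hot(a) - probs
    theta += 0.1 * advantage * grad_log  # actor update: policy-gradient step

probs = np.exp(theta) / np.exp(theta).sum()
print(probs)  # the actor should strongly prefer arm 0
```

Because the reward is compared against the critic's running estimate rather than used raw, the actor's updates shrink towards zero once the policy stabilizes, which is exactly the variance reduction described above.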
There are a few different variants of actor-critic methods; the first one we will look at is the baseline actor-critic. Here, the critic tries to learn the average performance of the agent from a given position, so its loss function would be this:

loss = Σt (V(st) - rt)²
Here, V(st) is the output of the critic network for the state at time step t, and rt is the cumulative discounted reward from time step t. The actor can then be trained using the target:

rt - V(st)
Because the baseline is the average performance from this state, this has the effect of massively reducing the variance of training. If we run the cart-pole task once using plain policy gradients and once using a baseline, without batch normalization in either case, the baseline version performs much better. If we add batch normalization, the difference largely disappears. For tasks more complex than cart-pole, where the reward varies much more with the state, the baseline approach may improve things a lot more. An example of this can be found in actor_critic_baseline_cart_pole.py.
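The two baseline quantities are simple to compute once a rollout has been collected. Here is a small NumPy sketch (the returns and value predictions are invented numbers, standing in for one recorded episode):

```python
import numpy as np

# Hypothetical rollout: cumulative discounted returns rt at four time steps
returns = np.array([9.5, 8.7, 7.6, 6.2])
# The critic's current predictions V(st) for the same four states
values = np.array([8.0, 8.0, 8.0, 8.0])

# Critic loss: squared error between predictions and observed returns
critic_loss = np.mean((values - returns) ** 2)

# Actor target: the baseline-adjusted reward rt - V(st)
actor_targets = returns - values

print(critic_loss)    # ~1.535
print(actor_targets)  # roughly [ 1.5  0.7 -0.4 -1.8]
```

Note that the actor targets are centred around zero: steps that did better than the critic's average expectation get a positive weight, steps that did worse get a negative one, which is what cuts the variance of the policy-gradient updates.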
The baseline approach does a great job of reducing variance, but it is not a true actor-critic approach, because the actor is not learning from the gradient of the critic; it is simply using the critic to normalize the reward. The generalized advantage estimator goes a step further and incorporates the critic's gradients into the actor's objective.
In order to do this, we need to learn not just the value of the states the agent is in, but also the value of the state-action pairs it takes. If V(st) is the value of the state, and Q(st, at) is the value of the state-action pair, we can define an advantage function like this:

A(st, at) = Q(st, at) - V(st)
This will give us the difference between how well the action at did in state st and the average action the agent takes in this position. Moving along the gradient of this function should lead us to maximize our reward. Also, we don't need another network to estimate Q(st, at), because we already have the value function for the state reached at st+1, and the definition of a Q-function is as follows:

Q(st, at) = rt + γV(st+1)
Here, rt is now the reward for that single time step, not the cumulative reward as in the baseline equation, and γ is the future reward discount factor. We can now substitute that in to give us our advantage function purely in terms of V:

A(st, at) = rt + γV(st+1) - V(st)
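The substituted advantage function is a one-line computation over a rollout. A minimal NumPy sketch, with invented per-step rewards and critic estimates for a three-step episode:

```python
import numpy as np

gamma = 0.99  # future reward discount factor

# Hypothetical three-step rollout: per-step rewards and critic estimates
rewards     = np.array([1.0, 1.0, 1.0])
values      = np.array([2.5, 2.0, 1.0])  # V(st)
next_values = np.array([2.0, 1.0, 0.0])  # V(st+1); zero at the terminal state

# A(st, at) = rt + gamma * V(st+1) - V(st)
advantages = rewards + gamma * next_values - values
print(advantages)
```

A positive entry means the step worked out better than the critic expected; a negative entry means it worked out worse. No Monte Carlo return is needed, only one reward and two value estimates per step.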
Again, this gives us a measure of whether the critic thinks a given action improved or hurt the value of the position. We replace the cumulative reward in our actor's loss function with the result of the advantage function. The full code for this is in actor_critic_advantage_cart_pole.py. Applied to the cart-pole challenge, this approach can solve it, though it may take longer than simply using policy gradients with batch normalization. But for more complex tasks, such as learning to play computer games, advantage actor-critic can perform best.
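The resulting actor loss is the advantage-weighted negative log-likelihood of the actions taken. A short sketch of just that final step, with invented log-probabilities and advantages standing in for one batch:

```python
import numpy as np

# Hypothetical batch: log pi(at|st) for the actions actually taken,
# and the advantages computed for those same steps
log_probs  = np.array([-0.3, -1.2, -0.7])
advantages = np.array([0.48, -0.01, 0.0])

# Advantage actor-critic policy loss: negative advantage-weighted
# log-likelihood; descending this loss ascends the expected advantage
actor_loss = -np.mean(log_probs * advantages)
print(actor_loss)
```

In a real implementation the advantages are treated as constants here (no gradient flows through the critic's values in this term), so only the policy network is pushed towards actions with positive advantage.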