So far, we've developed value-based reinforcement learning algorithms. These algorithms learn a value function in order to find a good policy. Although they perform well, their applicability is limited by constraints inherent in how they work. In this chapter, we'll introduce a new class of algorithms called policy gradient methods, which overcome the limitations of value-based methods by approaching the RL problem from a different perspective.
Policy gradient methods select actions according to a learned, parametrized policy, instead of relying on a value function. In this chapter, we will also elaborate on the theory and intuition behind these methods and, with this background, develop the most basic policy gradient algorithm, called REINFORCE.
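To make the idea of a parametrized policy concrete, here is a minimal sketch (not code from this book) of a linear-softmax policy π(a|s; θ): the parameters θ map state features to action probabilities, and an action is sampled from that distribution rather than chosen by comparing value estimates. The class and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class ParametrizedPolicy:
    """A linear-softmax policy pi(a|s; theta).

    theta maps state features to per-action scores; softmax turns the
    scores into a probability distribution over actions.
    """
    def __init__(self, n_features, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.normal(scale=0.1, size=(n_features, n_actions))

    def action_probs(self, state):
        # state: feature vector of shape (n_features,)
        return softmax(state @ self.theta)

    def sample(self, state):
        # Sample an action index from pi(.|s; theta).
        probs = self.action_probs(state)
        return self.rng.choice(len(probs), p=probs)

policy = ParametrizedPolicy(n_features=4, n_actions=2)
state = np.array([0.1, -0.2, 0.3, 0.05])
probs = policy.action_probs(state)
action = policy.sample(state)
```

Policy gradient methods improve such a policy by adjusting θ directly in the direction that increases expected return, which is exactly what the rest of this chapter develops.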
REINFORCE has some deficiencies due to its simplicity, but these can be mitigated with only a small amount of additional effort. Thus, we'll present two improved versions of REINFORCE: REINFORCE with baseline and the actor-critic (AC) method.
The following topics will be covered in this chapter:
- Policy gradient methods
- Understanding the REINFORCE algorithm
- REINFORCE with a baseline
- Learning the AC algorithm