Learning Stochastic and PG Optimization

So far, we've developed value-based reinforcement learning algorithms. These algorithms learn a value function in order to find a good policy. Although they perform well, their applicability is limited by constraints inherent to how they work. In this chapter, we'll introduce a new class of algorithms, called policy gradient methods, that overcomes the constraints of value-based methods by approaching the RL problem from a different perspective.

Policy gradient methods select actions according to a learned, parameterized policy, instead of relying on a value function. In this chapter, we'll also explain the theory and intuition behind these methods and, with this background, develop the most basic policy gradient algorithm, called REINFORCE.
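To make the idea of a parameterized policy concrete, here is a minimal sketch (not taken from the chapter's code) of a linear-softmax policy: the parameters theta map an observation directly to action probabilities, and an action is sampled from that distribution rather than chosen via a value function. The class, parameter names, and dimensions are purely illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over action preferences
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxPolicy:
    """A linear-softmax parameterized policy: theta maps an observation to
    action preferences, and actions are sampled from the resulting
    probability distribution pi(a | s; theta)."""

    def __init__(self, obs_dim, n_actions):
        # theta are the policy parameters that a policy gradient method would adjust
        self.theta = np.zeros((obs_dim, n_actions))

    def action_probs(self, obs):
        # pi(a | s; theta): probability of each action given the observation
        return softmax(obs @ self.theta)

    def sample_action(self, obs):
        # Select an action by sampling from pi(. | s; theta) -- no value function involved
        probs = self.action_probs(obs)
        return np.random.choice(len(probs), p=probs)

# Illustrative usage: a 4-dimensional observation and 2 possible actions
policy = LinearSoftmaxPolicy(obs_dim=4, n_actions=2)
obs = np.array([0.1, -0.2, 0.05, 0.0])
action = policy.sample_action(obs)
print(action, policy.action_probs(obs))
```

In practice, the linear mapping would typically be replaced by a neural network, but the key point is the same: the policy itself is the learned, parameterized object, and learning means adjusting theta to make good actions more probable.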

Due to its simplicity, REINFORCE has some deficiencies, but they can be mitigated with little additional effort. We'll therefore present two improved versions of it: REINFORCE with baseline and the actor-critic (AC) model.

The following topics will be covered in this chapter:

  • Policy gradient methods
  • Understanding the REINFORCE algorithm
  • REINFORCE with a baseline
  • Learning the AC algorithm