So far, we've developed value-based reinforcement learning algorithms. These algorithms learn a value function in order to find a good policy. Although they perform well, their applicability is limited by constraints inherent in how they work. In this chapter, we'll introduce a new class of algorithms called policy gradient methods, which overcome the limitations of value-based methods by approaching the RL problem from a different perspective.
Policy gradient methods select actions according to a learned, parametrized policy, instead of relying on a value function. In this chapter, we will also elaborate on the theory and intuition behind these methods and, with this background, develop the most basic policy gradient algorithm, called REINFORCE.
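To make the idea of a parametrized policy concrete, here is a minimal sketch (not code from this book) of a linear-softmax policy π(a|s; θ): the parameters θ map state features to action probabilities, and an action is sampled from that distribution rather than chosen by comparing value estimates. The class and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class ParametrizedPolicy:
    """A linear-softmax policy pi(a|s; theta).

    theta maps state features to per-action scores; softmax turns the
    scores into a probability distribution over actions.
    """
    def __init__(self, n_features, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.normal(scale=0.1, size=(n_features, n_actions))

    def action_probs(self, state):
        # state: feature vector of shape (n_features,)
        return softmax(state @ self.theta)

    def sample(self, state):
        # Sample an action index from pi(.|s; theta).
        probs = self.action_probs(state)
        return self.rng.choice(len(probs), p=probs)

policy = ParametrizedPolicy(n_features=4, n_actions=2)
state = np.array([0.1, -0.2, 0.3, 0.05])
probs = policy.action_probs(state)
action = policy.sample(state)
```

Policy gradient methods improve such a policy by adjusting θ directly in the direction that increases expected return, which is exactly what the rest of this chapter develops.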
REINFORCE has some deficiencies due to its simplicity, but these can be mitigated with only a small amount of additional effort. Thus, we'll present two improved versions of REINFORCE: REINFORCE with baseline and the actor-critic (AC) method.
The following topics will be covered in this chapter:
- Policy gradient methods
- Understanding the REINFORCE algorithm
- REINFORCE with a baseline
- Learning the AC algorithm