Summary

In this chapter, we learned about a new class of reinforcement learning algorithms called policy gradients. They approach the RL problem in a different way compared to the value function methods studied in the previous chapters.

The simplest version of PG methods is called REINFORCE, which we studied, implemented, and tested over the course of this chapter. We then added a baseline to REINFORCE in order to reduce the variance of the gradient estimate and improve the convergence of the algorithm. Actor-critic (AC) algorithms use a critic to remove the need for a full trajectory, so we then solved the same problem using the AC model.
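To recap the idea behind the baseline, it enters the policy gradient estimate as a term subtracted from the return. Written with a generic return $G_t$ and a state-dependent baseline $b(s_t)$ (the exact symbols may differ from those used earlier in the chapter), the gradient becomes:

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

Because the baseline does not depend on the action, subtracting it leaves the expectation unbiased while lowering the variance of the estimate; the critic in AC methods plays a similar role by providing a learned value estimate in place of the full sampled return.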

With a solid foundation in the classic policy gradient algorithms, we can now go further. In the next chapter, we'll look at some more complex, state-of-the-art policy gradient algorithms; namely, Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). These two algorithms build on the material that we have covered in this chapter, but they also propose a new objective function that improves the stability and efficiency of PG algorithms.
