TRPO and PPO Implementation

In the previous chapter, we looked at policy gradient algorithms. What distinguishes them is how they approach a reinforcement learning (RL) problem: policy gradient algorithms take a step in the direction of the steepest increase of the expected reward. The simplest version of this algorithm (REINFORCE) has a straightforward implementation that, on its own, achieves good results; nevertheless, it is slow and has high variance. For this reason, we introduced a value function with a double goal: to critique the actor and to provide a baseline. Despite their great potential, these actor-critic algorithms can suffer from unwanted rapid changes in the action distribution, which may drastically alter the states that are visited and lead to a sharp drop in performance from which the agent may never recover.
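As a brief reminder of that idea (a sketch in common notation only; the symbols \(\pi_\theta\), \(V_w\), and \(G_t\) are used here purely for illustration), the actor-critic gradient subtracts the critic's value estimate from the return, which lowers the variance of the gradient estimate without biasing it:

\[
\nabla_\theta J(\theta) \approx \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - V_w(s_t)\big) \right]
\]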

In this chapter, we will address this problem by showing how introducing a trust region, or a clipped objective, can mitigate it. We'll present two practical algorithms, namely TRPO and PPO, which have been used to control simulated walking, hopping, and swimming robots, as well as to play Atari games. We'll also cover a new set of environments for continuous control and show how policy gradient algorithms can be adapted to work in a continuous action space. By applying TRPO and PPO to these new environments, you'll be able to train an agent to run, jump, and walk.
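To make the clipped objective concrete before we dive into the details, here is a minimal NumPy sketch of a PPO-style clipped surrogate loss. It is only an illustration under assumed inputs: the function name, the log-probability arrays, and the advantage estimates are placeholders, not code from this book.

import numpy as np

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    # Probability ratio between the new and old policies, r_t(theta).
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps] so the update gains nothing
    # from pushing the policy too far from the one that collected the data.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The objective is maximized, so take the elementwise minimum
    # (a pessimistic lower bound on the unclipped objective).
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers.
new_lp = np.log(np.array([0.4, 0.7, 0.2]))
old_lp = np.log(np.array([0.5, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate_loss(new_lp, old_lp, adv))

Maximizing this pessimistic lower bound keeps the new policy close to the one that collected the data, playing a role similar to the trust-region constraint used by TRPO.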

The following topics will be covered in this chapter:

  • Roboschool
  • Natural policy gradient
  • Trust region policy optimization
  • Proximal policy optimization