TRPO and PPO Implementation

In the previous chapter, we looked at policy gradient algorithms. What distinguishes them is how they approach a reinforcement learning (RL) problem: policy gradient algorithms take a step in the direction of the steepest increase of the expected reward. The simplest version of this algorithm (REINFORCE) has a straightforward implementation that, on its own, achieves good results; nevertheless, it is slow and has high variance. For this reason, we introduced a value function with a double goal: to critique the actor and to provide a baseline. Despite their great potential, these actor-critic algorithms can suffer from unwanted rapid changes in the action distribution, which may drastically alter the states that are visited and lead to a sharp drop in performance from which the agent may never recover.
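As a brief reminder of that idea (a sketch in common notation only; the symbols \(\pi_\theta\), \(V_w\), and \(G_t\) are used here purely for illustration), the actor-critic gradient subtracts the critic's value estimate from the return, which lowers the variance of the gradient estimate without biasing it:

\[
\nabla_\theta J(\theta) \approx \mathbb{E}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - V_w(s_t)\big) \right]
\]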

In this chapter, we will address this problem by showing how introducing a trust region, or a clipped objective, can mitigate it. We'll present two practical algorithms, namely TRPO and PPO, which have been used to control simulated walking, hopping, and swimming robots, as well as to play Atari games. We'll also cover a new set of environments for continuous control and show how policy gradient algorithms can be adapted to work in a continuous action space. By applying TRPO and PPO to these new environments, you'll be able to train an agent to run, jump, and walk.
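To make the clipped objective concrete before we dive into the details, here is a minimal NumPy sketch of a PPO-style clipped surrogate loss. It is only an illustration under assumed inputs: the function name, the log-probability arrays, and the advantage estimates are placeholders, not code from this book.

import numpy as np

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    # Probability ratio between the new and old policies, r_t(theta).
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps] so the update gains nothing
    # from pushing the policy too far from the one that collected the data.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The objective is maximized, so take the elementwise minimum
    # (a pessimistic lower bound on the unclipped objective).
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers.
new_lp = np.log(np.array([0.4, 0.7, 0.2]))
old_lp = np.log(np.array([0.5, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate_loss(new_lp, old_lp, adv))

Maximizing this pessimistic lower bound keeps the new policy close to the one that collected the data, playing a role similar to the trust-region constraint used by TRPO.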

The following topics will be covered in this chapter:

  • Roboschool
  • Natural policy gradient
  • Trust region policy optimization
  • Proximal policy optimization