Trust region policy optimization

Trust region policy optimization (TRPO) is the first successful algorithm that uses several approximations to compute the natural gradient, with the goal of training a deep neural network policy in a more controlled and stable way. We saw with NPG that computing and inverting the FIM isn't feasible for nonlinear functions with many parameters. TRPO overcomes these difficulties by building on top of NPG: it introduces a surrogate objective function and makes a series of approximations, and as a result it succeeds in learning complex policies for walking, hopping, or playing Atari games from raw pixels.
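To make the idea of a surrogate objective concrete, the constrained problem that TRPO solves at each policy update can be sketched as follows (this is the standard formulation from the TRPO paper, where $\theta_{old}$ denotes the current policy parameters, $\hat{A}$ the estimated advantage, and $\delta$ the size of the trust region):

\[
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \, \hat{A}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s} \left[ D_{KL}\!\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s)\big) \right] \le \delta
\]

In words, the surrogate objective measures how much better the new policy is expected to perform relative to the old one, while the KL divergence constraint keeps the update inside a trust region so that the policy doesn't change too abruptly.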

TRPO is one of the most complex model-free algorithms, and although we have already learned the underlying principles of the natural gradient, some difficult parts remain. In this chapter, we'll only give an intuitive level of detail about the algorithm and provide the main equations. If you want to dig into the algorithm in more detail, check the original paper (https://arxiv.org/abs/1502.05477) for a complete explanation and the proofs of the theorems.

We'll also implement the algorithm and apply it to a Roboschool environment. Nonetheless, we won't discuss every component of the implementation here. For the complete implementation, check the GitHub repository of this book. 
