ME-TRPO applied to an inverted pendulum

There are many variants of the vanilla combined model-based and model-free algorithm introduced in the pseudocode in the A useful combination section, and pretty much all of them propose different ways to deal with the imperfections of the learned model of the environment.

This is a key problem to address in order to reach the same performance as model-free methods. Models learned from complex environments will always have some inaccuracies. So, the main challenge is to estimate or control the uncertainty of the model to stabilize and accelerate the learning process. 

ME-TRPO proposes the use of an ensemble of models to maintain the model uncertainty and regularize the learning process. The models are deep neural networks with different weight initializations and different training data. Together, they provide a more robust general model of the environment that is less prone to being exploited in regions where insufficient data is available.
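
To make the idea concrete, here is a minimal sketch of such an ensemble, assuming a PyTorch setup. Each model gets its own random weight initialization and its own bootstrapped resample of the real-environment data, so the members disagree most where data is scarce. Names such as DynamicsModel, train_ensemble, and the (obs, act, next_obs) buffer layout are illustrative assumptions, not the book's actual code:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def train_ensemble(buffer, obs_dim, act_dim, ensemble_size=5, epochs=50):
    """Fit each model on a different bootstrap resample of the real data."""
    models = [DynamicsModel(obs_dim, act_dim) for _ in range(ensemble_size)]
    for model in models:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        # Bootstrap: resample transitions with replacement for this model only.
        n = len(buffer["obs"])
        idx = torch.randint(0, n, (n,))
        obs, act, next_obs = (buffer["obs"][idx], buffer["act"][idx],
                              buffer["next_obs"][idx])
        for _ in range(epochs):
            loss = ((model(obs, act) - next_obs) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return models
```

Because every member sees a different initialization and a different slice of the data, the spread of their predictions acts as a rough proxy for model uncertainty.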

Then, the policy is learned from trajectories simulated with the ensemble. In particular, the algorithm chosen to learn the policy is trust region policy optimization (TRPO), which was explained in Chapter 7, TRPO and PPO Implementation.
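
The sketch below, continuing the assumptions above, shows how imagined trajectories can be generated by drawing a model at random from the ensemble at each step and then fed to a TRPO policy update. Here, policy, reward_fn, sample_initial_state, and trpo_update are hypothetical placeholders rather than the book's implementation:

```python
import random
import torch

def simulate_trajectory(models, policy, reward_fn, sample_initial_state,
                        horizon=200):
    """Roll out the policy in imagination, using a random ensemble member
    at every step so the policy cannot exploit any single model's quirks."""
    obs = sample_initial_state()
    trajectory = []
    for _ in range(horizon):
        act = policy(obs)
        model = random.choice(models)
        with torch.no_grad():
            next_obs = model(obs, act)
        trajectory.append((obs, act, reward_fn(obs, act), next_obs))
        obs = next_obs
    return trajectory

def improve_policy(models, policy, reward_fn, sample_initial_state,
                   trpo_update, n_trajectories=50):
    """Collect a batch of simulated trajectories and run one TRPO step."""
    batch = [simulate_trajectory(models, policy, reward_fn,
                                 sample_initial_state)
             for _ in range(n_trajectories)]
    trpo_update(policy, batch)
```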
