ME-TRPO applied to an inverted pendulum

There are many variants of the vanilla combined model-based and model-free algorithm introduced in the pseudocode in the A useful combination section, and pretty much all of them propose different ways to deal with the imperfections of the learned model of the environment.

This is a key problem to address in order to reach the same performance as model-free methods. Models learned from complex environments will always have some inaccuracies. So, the main challenge is to estimate or control the uncertainty of the model to stabilize and accelerate the learning process. 

ME-TRPO proposes the use of an ensemble of models to maintain the model uncertainty and regularize the learning process. The models are deep neural networks with different weight initializations and different training data. Together, they provide a more robust general model of the environment that is less prone to being exploited in regions where insufficient data is available.
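
To make the idea concrete, here is a minimal sketch of such an ensemble, assuming a PyTorch setup. Each model gets its own random weight initialization and its own bootstrapped resample of the real-environment data, so the members disagree most where data is scarce. Names such as DynamicsModel, train_ensemble, and the (obs, act, next_obs) buffer layout are illustrative assumptions, not the book's actual code:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def train_ensemble(buffer, obs_dim, act_dim, ensemble_size=5, epochs=50):
    """Fit each model on a different bootstrap resample of the real data."""
    models = [DynamicsModel(obs_dim, act_dim) for _ in range(ensemble_size)]
    for model in models:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        # Bootstrap: resample transitions with replacement for this model only.
        n = len(buffer["obs"])
        idx = torch.randint(0, n, (n,))
        obs, act, next_obs = (buffer["obs"][idx], buffer["act"][idx],
                              buffer["next_obs"][idx])
        for _ in range(epochs):
            loss = ((model(obs, act) - next_obs) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return models
```

Because every member sees a different initialization and a different slice of the data, the spread of their predictions acts as a rough proxy for model uncertainty.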

Then, the policy is learned from trajectories simulated with the ensemble. In particular, the algorithm chosen to learn the policy is trust region policy optimization (TRPO), which was explained in Chapter 7, TRPO and PPO Implementation.
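
The sketch below, continuing the assumptions above, shows how imagined trajectories can be generated by drawing a model at random from the ensemble at each step and then fed to a TRPO policy update. Here, policy, reward_fn, sample_initial_state, and trpo_update are hypothetical placeholders rather than the book's implementation:

```python
import random
import torch

def simulate_trajectory(models, policy, reward_fn, sample_initial_state,
                        horizon=200):
    """Roll out the policy in imagination, using a random ensemble member
    at every step so the policy cannot exploit any single model's quirks."""
    obs = sample_initial_state()
    trajectory = []
    for _ in range(horizon):
        act = policy(obs)
        model = random.choice(models)
        with torch.no_grad():
            next_obs = model(obs, act)
        trajectory.append((obs, act, reward_fn(obs, act), next_obs))
        obs = next_obs
    return trajectory

def improve_policy(models, policy, reward_fn, sample_initial_state,
                   trpo_update, n_trajectories=50):
    """Collect a batch of simulated trajectories and run one TRPO step."""
    batch = [simulate_trajectory(models, policy, reward_fn,
                                 sample_initial_state)
             for _ in range(n_trajectories)]
    trpo_update(policy, batch)
```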
