Understanding ME-TRPO

In the first part of ME-TRPO, the dynamics of the environment (that is, the ensemble of models) are learned. The algorithm starts by interacting with the environment with a random policy, π_θ, to collect a dataset of transitions, D. This dataset is then used to train all the dynamics models, f_φ1, ..., f_φK, in a supervised fashion. The models, f_φi, are initialized with different random weights and are trained on different mini-batches. To avoid overfitting, a validation set is created from the dataset. Also, an early stopping mechanism (a regularization technique widely used in machine learning) interrupts the training process whenever the loss on the validation set stops improving.
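To make this first step more concrete, the following is a minimal sketch, in PyTorch, of how an ensemble of dynamics models could be trained on a transition dataset with a validation split and early stopping. The network architecture, mini-batch size, patience value, and the make_model and train_ensemble names are illustrative assumptions, not the actual ME-TRPO implementation:

```python
import copy
import torch
import torch.nn as nn

def make_model(obs_dim, act_dim, hidden=64):
    # Simple feedforward dynamics model: (state, action) -> predicted next state.
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, obs_dim))

def train_ensemble(models, states, actions, next_states,
                   valid_frac=0.2, steps=2000, batch_size=256,
                   patience=5, lr=1e-3):
    # Hold out part of the dataset as a validation set.
    n_valid = int(len(states) * valid_frac)
    perm = torch.randperm(len(states))
    train_idx, valid_idx = perm[n_valid:], perm[:n_valid]
    x_valid = torch.cat([states[valid_idx], actions[valid_idx]], dim=-1)

    for model in models:  # each model is trained on its own random mini-batches
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        best_loss, best_state, wait = float('inf'), None, 0
        for _ in range(steps):
            batch = train_idx[torch.randint(len(train_idx), (batch_size,))]
            x = torch.cat([states[batch], actions[batch]], dim=-1)
            loss = ((model(x) - next_states[batch]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

            # Early stopping: stop when the validation loss stops improving.
            with torch.no_grad():
                v_loss = ((model(x_valid) - next_states[valid_idx]) ** 2).mean().item()
            if v_loss < best_loss:
                best_loss, best_state, wait = v_loss, copy.deepcopy(model.state_dict()), 0
            else:
                wait += 1
                if wait >= patience:
                    break
        model.load_state_dict(best_state)
    return models
```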

In the second part of the algorithm, the policy is learned with TRPO. Specifically, the policy is trained on data gathered from the learned models, which we'll also call the simulated environment, instead of the real environment. To prevent the policy from exploiting inaccurate regions of a single learned model, the policy, π_θ, is trained using the predicted transitions from the whole ensemble of models, f_φ1, ..., f_φK. In particular, the policy is trained on a simulated dataset composed of transitions acquired from models, f_φi, randomly chosen from the ensemble. During training, the policy is monitored constantly, and the process stops as soon as its performance stops improving.
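The following sketch shows how simulated transitions could be generated from the ensemble, drawing a random model at every step so that the inaccuracies of any single model cannot be exploited. The simulated_rollout name, the fixed horizon, the policy callable, and the reward_fn argument (ME-TRPO assumes the reward function is known, as noted at the end of this section) are assumptions made for illustration:

```python
import random
import torch

def simulated_rollout(models, policy, init_state, reward_fn, horizon=100):
    # Roll out the policy in the simulated environment (the learned models).
    # At every step, the next state is predicted by a model drawn at random
    # from the ensemble, so the policy is trained against all the models.
    trajectory = []
    state = init_state
    for _ in range(horizon):
        action = policy(state)                 # policy: state tensor -> action tensor
        model = random.choice(models)          # pick one model of the ensemble
        with torch.no_grad():
            next_state = model(torch.cat([state, action], dim=-1))
        reward = reward_fn(state, action, next_state)  # reward function assumed known
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```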

Finally, the cycle formed by these two parts is repeated until convergence. However, at each new iteration, the data from the real environment is collected by running the newly learned policy, π_θ, and the newly collected data is aggregated with the dataset from the previous iterations. The ME-TRPO algorithm is briefly summarized in the following pseudocode:

Initialize randomly policy π_θ and models f_φ1, ..., f_φK
Initialize empty buffer D

while not done:
    > populate buffer D with transitions (s, a, s') from the real environment using policy π_θ (or a random policy)
    > learn the models f_φ1, ..., f_φK that minimize the prediction error ||f_φi(s, a) - s'||² in a supervised way using the data in D

    until convergence:
        > sample an initial state s_0
        > simulate transitions using the models f_φ1, ..., f_φK and the policy π_θ
        > take a TRPO update to optimize the policy π_θ

An important note to make here is that, unlike most model-based algorithms, the reward is not embedded in the model of the environment. Therefore, ME-TRPO assumes that the reward function is known.
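For example, in a hypothetical reach-the-goal task, a known reward function computed analytically from the simulated transition could look like the following (the goal position and the action penalty are made-up details used only to illustrate the idea):

```python
import torch

def reward_fn(state, action, next_state, goal=torch.zeros(2)):
    # Hypothetical known reward: negative distance of the next state to a
    # fixed goal, plus a small penalty on the action magnitude.
    return -torch.norm(next_state[..., :2] - goal) - 0.01 * torch.norm(action)
```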
