A useful combination

As you know, model-free learning has good asymptotic performance but poor sample complexity. On the other hand, model-based learning is data-efficient, but it struggles with more complex tasks. By combining model-based and model-free approaches, it is possible to reach a sweet spot where sample complexity is considerably reduced while the high performance of model-free algorithms is retained.

There are many ways to integrate the two worlds, and the algorithms that do so differ considerably from one another. For example, when the model is given (as it is in the games of Go and chess), tree-search and value-based algorithms can help each other produce a better action-value estimate.

Another example is to combine the learning of the environment dynamics and of the policy directly in a deep neural network architecture, so that the learned dynamics can contribute to planning the policy. A further strategy, used by a fair number of algorithms, is to use a learned model of the environment to generate additional samples with which to optimize the policy.

To put it another way, the policy is trained by playing simulated games inside the learned model. This can be done in multiple ways, but the main recipe is shown in the pseudocode that follows:

while not done:
> collect transitions from the real environment using a policy
> add the transitions to the buffer
> learn a model f(s, a) that minimizes the MSE loss in a supervised way using the data in the buffer
> (optionally learn a reward model)

> repeat K times:
> > sample an initial state
> > simulate transitions from the model using the policy
> > update the policy using a model-free RL algorithm

This blueprint involves two cycles. The outermost cycle collects data from the real environment to train the model, while, in the innermost cycle, the model generates simulated samples that are used to optimize the policy using model-free algorithms. Usually, the dynamics model is trained to minimize the MSE loss in a supervised fashion. The more precise the predictions made by the model, the more accurate the policy can be.
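To make the blueprint concrete, here is a minimal, self-contained sketch in NumPy. The toy 1D environment, the linear dynamics model fit by least squares (an MSE minimizer), and the REINFORCE-style policy update are illustrative assumptions that stand in for whatever environment, model class, and model-free algorithm you actually use.

import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy 1D environment (assumption): s' = s + a + noise, reward = -s'^2."""
    s_next = s + a + rng.normal(scale=0.05)
    return s_next, -s_next ** 2

theta, sigma = 0.0, 0.3          # linear-Gaussian policy: a ~ N(theta * s, sigma^2)

def policy(s):
    return theta * s + rng.normal(scale=sigma)

buffer = []                      # real transitions (s, a, s')

for _ in range(20):              # outermost cycle: interact with the real environment
    # collect transitions from the real environment using the current policy
    s = rng.normal()
    for _ in range(50):
        a = policy(s)
        s_next, _ = env_step(s, a)
        buffer.append((s, a, s_next))        # add the transitions to the buffer
        s = s_next

    # learn the model by minimizing the MSE loss (here, plain least squares)
    X = np.array([[si, ai] for si, ai, _ in buffer])
    y = np.array([sn for _, _, sn in buffer])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def model_step(s, a):                    # learned dynamics f(s, a)
        return w[0] * s + w[1] * a

    # innermost cycle: simulate fixed-length rollouts inside the learned model
    for _ in range(10):                      # repeat K times
        s0, _, _ = buffer[rng.integers(len(buffer))]   # start from a real state
        s, grad, ret = s0, 0.0, 0.0
        for _ in range(5):                   # short, fixed-length rollout
            a = policy(s)
            grad += (a - theta * s) * s / sigma ** 2   # grad of log pi(a|s) w.r.t. theta
            s = model_step(s, a)
            ret += -s ** 2                   # reward function assumed known here
        theta += 1e-3 * grad * ret           # REINFORCE-style model-free update

print("learned policy parameter theta:", round(theta, 3))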

In the innermost cycle, either full or fixed-length trajectories can be simulated. In practice, the latter option can be adopted to mitigate the imperfections of the model. Furthermore, the trajectories can start either from a random state sampled from the buffer of real transitions or from an initial state. The former option is preferred in situations where the model is inaccurate, because it prevents the simulated trajectories from diverging too much from the real ones. To illustrate this situation, take the following diagram. The trajectories that have been collected in the real environment are colored black, while those simulated are colored blue:

You can see that the trajectories that start from an initial state are longer, and thus diverge more as the errors of the inaccurate model propagate through all the subsequent predictions.
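As a small sketch of the rollout choices just discussed, the helper below simulates a fixed-length trajectory inside a learned model. The names model_step, policy, and buffer are assumed to exist (for example, as in the previous sketch), and the horizon length is an arbitrary illustrative choice.

def simulate_rollout(model_step, policy, s0, horizon=5):
    """Fixed-length rollout inside the learned model, starting from state s0."""
    trajectory = []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s_next = model_step(s, a)
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory

# Option 1: start from a random state already seen in the real environment
# (short rollouts; preferred when the model is inaccurate):
#   s0, _, _ = buffer[rng.integers(len(buffer))]
# Option 2: start from the environment's initial state (longer rollouts,
# which let the model's prediction errors compound):
#   s0 = initial_state()   # hypothetical reset function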

Note that you could do only a single iteration of the main cycle and gather all the data required to learn a decent approximate model of the environment. However, for the reasons outlined previously, it's better to use iterative data-aggregation methods that cyclically retrain the model with transitions coming from the newer policy.
