Unknown model

What should you do when the model of the environment is unknown? Learn it! Almost everything we have seen so far involves learning, so is this the best approach? Well, if you actually want to use a model-based approach, the answer is yes, and soon we'll see how to do it. However, this isn't always the best way to proceed.

In reinforcement learning, the end goal is to learn an optimal policy for a given task. Previously in this chapter, we said that the model-based approach is primarily used to reduce the number of interactions with the environment, but is this always true? Imagine your goal is to prepare an omelet. Knowing the exact breaking point of the egg isn't useful at all; you just need to know approximately how to break it. Thus, in this situation, a model-free algorithm that doesn't deal with the exact structure of the egg is more appropriate.

However, this shouldn't lead you to think that model-based algorithms aren't worthwhile. For example, model-based approaches outperform model-free approaches in situations where the model is much easier to learn than the policy.

The only way to learn a model is (unfortunately) through interactions with the environment. This is an obligatory step, as it allows us to build a dataset about the environment. Usually, the learning process takes place in a supervised fashion, where a function approximator (such as a deep neural network) is trained to minimize a loss function, such as the mean squared error between the transitions obtained from the environment and the model's predictions. An example of this is shown in the following diagram, where a deep neural network is trained to model the environment by predicting the next state, s', and the reward, r, from a state, s, and an action, a:
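The supervised fitting step described above can be sketched in a few lines. The following is a minimal example, assuming a hypothetical one-dimensional toy environment (the dynamics and feature map are chosen purely for illustration); a linear model over simple features stands in for the deep neural network, and the closed-form least-squares solution stands in for gradient descent on the MSE loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: s' = s + a, r = -s^2
# (dynamics chosen only so the fit can be checked by eye).
def env_step(s, a):
    return s + a, -s**2

# Collect a dataset of transitions by random interaction with the environment.
S = rng.uniform(-1, 1, size=(500, 1))
A = rng.uniform(-1, 1, size=(500, 1))
targets = np.hstack([S + A, -S**2])             # targets: [s', r]
X = np.hstack([S, A, S**2, np.ones((500, 1))])  # simple feature map

# Fit a linear model on these features by minimizing the mean squared error
# (least squares here; a neural network would be trained by SGD instead).
W, *_ = np.linalg.lstsq(X, targets, rcond=None)

# The learned model now predicts the next state and reward from (s, a).
s, a = 0.3, 0.2
pred = np.array([s, a, s**2, 1.0]) @ W
print(pred)  # close to [0.5, -0.09]
```

Because the toy dynamics are exactly linear in the chosen features, the fit is essentially perfect; with a real environment and a neural network, the model's accuracy depends on how well the data covers the state-action space.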

There are other options besides neural networks, such as Gaussian processes and Gaussian mixture models. In particular, Gaussian processes naturally account for the uncertainty of the model and are regarded as being very data efficient. In fact, until the advent of deep neural networks, they were the most popular choice.

However, the main drawback of Gaussian processes is that they are slow with large datasets, so for learning more complex environments (which require bigger datasets), deep neural networks are preferred. Furthermore, deep neural networks can learn models of environments that have images as observations.

There are two main ways to learn a model of the environment; one in which the model is learned once and then kept fixed, and one in which the model is learned at the beginning but retrained once the plan or policy has changed. The two options are illustrated in the following diagram: 

In the top half of the diagram, a sequential model-based algorithm is shown, where the agent interacts with the environment only before learning the model. In the bottom half, a cyclic approach to model-based learning is shown, where the model is refined with additional data from a different policy.

To understand how an algorithm can benefit from the second option, we have to define a key concept. In order to collect the dataset for learning the dynamics of the environment, you need a policy that lets you navigate it. But in the beginning, the policy may be deterministic or completely random. Thus, with a limited number of interactions, the space explored will be very restricted.

This precludes the model from learning those parts of the environment that are needed to plan or learn optimal trajectories. But if the model is retrained with new interactions coming from a newer and better policy, it will iteratively adapt to the new policy and capture the parts of the environment (from a policy perspective) that hadn't been visited yet. This is called data aggregation.
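The cyclic loop above (collect data with the current policy, refit the model on the aggregated dataset, improve the policy, repeat) can be sketched as follows. This is only a toy illustration, assuming a hypothetical 1-D environment and a naive model-based "planner"; in practice, the model would be a neural network and the planner a proper algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D environment: the agent should steer toward the goal s = 2.
def env_step(s, a):
    return s + a, -abs(s - 2.0)

def collect(policy, n=200):
    """Roll the current policy in the real environment and record transitions."""
    data, s = [], 0.0
    for _ in range(n):
        a = policy(s) + rng.normal(0, 0.1)   # exploration noise
        s2, _ = env_step(s, a)
        data.append((s, a, s2))
        s = s2 if abs(s2) < 5 else 0.0       # crude episode reset
    return data

def fit_model(data):
    """Fit s' ~ w0*s + w1*a by least squares (the learned dynamics model)."""
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([s2 for *_, s2 in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Data aggregation: keep ALL transitions and refit the model each time the
# policy changes, so the model also covers the states the new policy visits.
dataset, policy = [], lambda s: 0.0          # start from a trivial policy
for _ in range(3):
    dataset += collect(policy)
    w = fit_model(dataset)
    # "Plan" with the model: pick the action whose predicted s' equals 2.
    policy = lambda s, w=w: (2.0 - w[0] * s) / w[1]

print(np.round(w, 2))  # recovers s' = s + a, i.e. about [1., 1.]
```

The key line is `dataset += collect(policy)`: new transitions are aggregated with the old ones rather than replacing them, which is what lets the model keep its knowledge of previously visited regions while extending to the new ones.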

In practice, in most cases, the model is unknown and is learned using data aggregation methods to adapt to the new policy produced. However, learning a model can be challenging, and the potential problems are the following:

  • Overfitting the model: The learned model overfits on a local region of the environment, missing its global structure.
  • Inaccurate model: Planning or learning a policy on top of an imperfect model may induce a cascade of errors with potentially catastrophic consequences.

Good model-based algorithms that learn a model have to deal with these problems. A potential solution is to use algorithms that estimate the uncertainty of the model, such as Bayesian neural networks, or to use an ensemble of models.
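The ensemble idea can be illustrated concretely: train several models on bootstrap resamples of the same dataset and use their disagreement as an uncertainty estimate. The following is a minimal sketch with hypothetical toy dynamics and polynomial models standing in for neural networks; the point is that the members agree inside the training distribution and disagree outside it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data only covers s in [-1, 1]; outside that region the ensemble
# should disagree. Toy dynamics (illustrative only): s' = sin(s) + a.
S = rng.uniform(-1, 1, 300)
A = rng.uniform(-1, 1, 300)
Y = np.sin(S) + A + rng.normal(0, 0.01, 300)

def fit(idx, degree=4):
    """Fit one ensemble member on a bootstrap resample of the data."""
    X = np.column_stack([S[idx]**k for k in range(degree + 1)] + [A[idx]])
    w, *_ = np.linalg.lstsq(X, Y[idx], rcond=None)
    return w

# Each member sees a different resample, so the fits differ slightly.
ensemble = [fit(rng.integers(0, 300, 300)) for _ in range(10)]

def predict_all(s, a, degree=4):
    """Predictions of every ensemble member for one (s, a) pair."""
    x = np.array([s**k for k in range(degree + 1)] + [a])
    return np.array([w @ x for w in ensemble])

# The standard deviation across members is a cheap uncertainty estimate.
in_dist = predict_all(0.5, 0.0).std()    # inside the training region
out_dist = predict_all(4.0, 0.0).std()   # far outside it
print(in_dist < out_dist)  # True: much higher uncertainty off-distribution
```

A planner can then penalize or avoid state-action regions where the ensemble disagrees, which mitigates the cascading-error problem of planning with an imperfect model.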
