The IL structure

Now that we have covered all the ingredients of imitation learning, we can look at the algorithms and approaches that can be combined to design a complete imitation learning method.

The most straightforward way to tackle the imitation problem can be summarized in two main steps:

  • An expert collects data from the environment.
  • A policy is learned through supervised learning on the dataset.
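To make the pipeline concrete, here is a minimal sketch of this passive approach (often called behavioral cloning). The expert dataset is synthetic and the logistic-regression policy is a stand-in for whatever function approximator you would actually use:

```python
import numpy as np

# Hypothetical expert dataset: states (feature vectors) paired with the
# discrete actions the expert took in those states.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))                       # 1,000 states, 4 features each
actions = (states[:, 0] + states[:, 1] > 0).astype(int)   # the expert's (hidden) rule

# Passive imitation: fit a classifier mapping states to expert actions.
# Here, a logistic-regression policy trained by gradient descent.
w, b, lr = np.zeros(4), 0.0, 0.1
for _ in range(200):
    logits = states @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))   # P(action = 1 | state)
    grad = probs - actions                  # gradient of the cross-entropy loss
    w -= lr * states.T @ grad / len(states)
    b -= lr * grad.mean()

policy = lambda s: int(s @ w + b > 0)       # greedy learned policy
accuracy = np.mean([policy(s) == a for s, a in zip(states, actions)])
print(f"training accuracy: {accuracy:.2f}")
```

This is just supervised learning on (state, action) pairs; nothing about the code knows it is inside a sequential decision problem, which is exactly where the trouble starts.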

Unfortunately, despite supervised learning being the imitation algorithm par excellence, most of the time it doesn't work.

To understand why the supervised learning approach isn't a good alternative, we have to recall the foundations of supervised learning. We are mostly interested in two basic principles: the training and test sets should belong to the same distribution, and the data should be independent and identically distributed (i.i.d.). However, a learned policy has to tolerate trajectories that differ from the expert's and be robust to potential distribution shifts.

If an agent is trained to drive a car using only a supervised learning approach, then whenever it deviates even slightly from the expert trajectories, it ends up in a state it has never seen before, and that creates a distribution mismatch. In this new state, the agent is uncertain about the next action to take. In a usual supervised learning problem, this wouldn't matter much: a missed prediction has no influence on the next one. In an imitation learning problem, however, the algorithm is learning a policy, and the i.i.d. property no longer holds, because consecutive actions are strictly correlated with each other. Each mistake therefore has consequences that compound over all the actions that follow.

In our example of the self-driving car, once the distribution has shifted away from the expert's, the correct path becomes very difficult to recover, since bad actions accumulate and lead to dramatic consequences. The longer the trajectory, the worse the compounding effect. To clarify, supervised learning problems with i.i.d. data can be seen as having trajectories of length 1, where an error has no consequences on any subsequent action. The paradigm we have just presented is what we referred to previously as passive learning.
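A toy simulation makes the compounding effect tangible. Assume, purely for illustration, that the cloned policy errs with probability ε = 0.01 on every in-distribution step and, in the worst case, never recovers once it leaves the expert's distribution; the probability of failing somewhere within a trajectory of length T is then 1 − (1 − ε)^T:

```python
import numpy as np

# Toy model of compounding error: the cloned policy deviates from the
# expert with probability EPS at every step; once off-distribution it
# never recovers (the pessimistic, worst-case assumption).
EPS = 0.01
rng = np.random.default_rng(1)

def failure_rate(horizon, episodes=50_000):
    # Empirical probability of at least one mistake within `horizon` steps.
    mistakes = rng.random((episodes, horizon)) < EPS
    return mistakes.any(axis=1).mean()

for T in (1, 10, 100):
    print(f"T={T:>3}  empirical={failure_rate(T):.3f}  "
          f"analytic={1 - (1 - EPS) ** T:.3f}")
```

Even a 1% per-step error rate fails on roughly two-thirds of trajectories of length 100, which is why long horizons are so punishing for passive imitation.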

To overcome the distributional shift that can have catastrophic effects on policies learned using passive imitation, different techniques can be adopted. Some are hacks, while others are more algorithmic variations. Two of these strategies that work well are the following:

  • Learning a model that generalizes very well on the data without overfitting
  • Using an active imitation in addition to the passive one

Because the first is a broad machine learning challenge in its own right, we will concentrate on the second strategy.
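The active strategy amounts to letting the expert label the states that the learner itself visits, so that the training distribution tracks the learner's own trajectories rather than the expert's. The best-known algorithm in this family is DAgger (dataset aggregation). The following is a minimal sketch under stated assumptions: `env`, `expert_action`, and `train_classifier` are hypothetical stand-ins for an environment whose `step` returns the next state and a done flag, a queryable expert, and a supervised learner:

```python
import numpy as np

def dagger(env, expert_action, train_classifier, iterations=10, steps=200):
    states, actions = [], []
    policy = None
    for _ in range(iterations):
        s = env.reset()
        for _ in range(steps):
            # Act with the learner's own policy (the expert on the first
            # pass), so we collect the states the *learner* actually visits.
            a = expert_action(s) if policy is None else policy(s)
            # The expert labels every visited state, even off-expert ones.
            states.append(s)
            actions.append(expert_action(s))
            s, done = env.step(a)
            if done:
                s = env.reset()
        # Retrain on the aggregated dataset of all states visited so far.
        policy = train_classifier(np.array(states), np.array(actions))
    return policy
```

The key design choice is that the dataset grows with states drawn from the learner's distribution, with expert labels attached, which directly attacks the distribution mismatch described above.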
