Before showing the results of the imitation learning approach, we want to provide some numbers so that you can compare them with those of a reinforcement learning algorithm. We know that this is not a fair comparison (the two algorithms operate under very different conditions), but the numbers nevertheless underline why imitation learning can be rewarding when an expert is available.
The expert has been trained with proximal policy optimization for about 2 million steps and, after about 400,000 steps, reached a plateau score of about 138.
We tested DAgger on Flappy Bird with the following hyperparameters:
| Hyperparameter | Variable name | Value |
| --- | --- | --- |
| Learner hidden layers | hidden_sizes | 16,16 |
| DAgger iterations | dagger_iterations | 8 |
| Learning rate | p_lr | 1e-4 |
| Number of steps per DAgger iteration | step_iterations | 100 |
| Mini-batch size | batch_size | 50 |
| Training epochs | train_epochs | 2000 |
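To make the role of these hyperparameters concrete, here is a minimal, self-contained sketch of the DAgger loop they control. This is not the book's Flappy Bird code: the environment, the expert, and the learner are deliberately toy stand-ins (a 1-D state, a threshold expert, and a linear classifier instead of the 16,16 MLP), so only the loop structure and the variable names mirror the table.

```python
import numpy as np

# Toy DAgger sketch wired to the hyperparameters from the table above.
# The expert, environment, and learner are placeholders, not the book's code.
rng = np.random.default_rng(0)

dagger_iterations = 8    # DAgger iterations
step_iterations = 100    # steps collected per iteration
p_lr = 1e-4              # learning rate
train_epochs = 2000      # training epochs per iteration
# (full-batch updates here for brevity, instead of mini-batches of 50)

def expert_policy(s):
    # Stand-in expert: action 1 when the state is positive, else 0
    return (s > 0).astype(float)

# Linear learner (logistic regression) in place of the 16,16 MLP
w, b = 0.0, 0.0

def learner_policy(s):
    return (w * s + b > 0).astype(float)

states = np.empty(0)
labels = np.empty(0)

for it in range(dagger_iterations):
    # 1) Collect states; a random walk stands in for running the learner in the env
    s = rng.normal(size=step_iterations)
    # 2) Label every visited state with the expert's action (the key DAgger step)
    a = expert_policy(s)
    # 3) Aggregate the new data into the full dataset
    states = np.concatenate([states, s])
    labels = np.concatenate([labels, a])
    # 4) Retrain the learner on the aggregated dataset
    for _ in range(train_epochs):
        logits = w * states + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels                 # gradient of the cross-entropy loss
        w -= p_lr * np.mean(grad * states)
        b -= p_lr * np.mean(grad)

accuracy = np.mean(learner_policy(states) == labels)
print(f"imitation accuracy: {accuracy:.2f}")
```

The essential point the sketch preserves is step 2: unlike plain behavioral cloning, DAgger queries the expert on the states the *learner* actually visits, so the aggregated dataset covers the learner's own state distribution.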
The following plot shows how the performance of DAgger evolves with the number of steps taken:
The horizontal line represents the average performance reached by the expert. From the results, we can see that a few hundred steps are sufficient to match the expert's performance. Compared with the experience PPO required to train the expert, this represents roughly a 100-fold gain in sample efficiency.
Again, this is not a fair comparison, as the methods operate in different contexts, but it highlights that whenever an expert is available, an imitation learning approach is usually worth trying, if only to learn an initial policy.