Analyzing the results on Flappy Bird

Before showing the results of the imitation learning approach, we want to provide some reference numbers so that you can compare them with those of a reinforcement learning algorithm. We know that this is not a fair comparison (the two algorithms operate under very different conditions), but it nevertheless underlines why imitation learning can be rewarding when an expert is available.

The expert was trained with proximal policy optimization (PPO) for about 2 million steps; after roughly 400,000 steps, it reached a plateau score of about 138.
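To give an idea of what training such an expert involves, the following is a minimal sketch using Stable-Baselines3's PPO as a stand-in for the book's own PPO implementation. The environment ID "FlappyBird-v0" is an assumption and depends on which Flappy Bird Gym wrapper you have installed; the step budget mirrors the one quoted above.

    import gym
    from stable_baselines3 import PPO

    # Assumed Gym-registered Flappy Bird environment (the exact ID depends on
    # the wrapper you use).
    env = gym.make("FlappyBird-v0")

    # Train a PPO agent with default hyperparameters for roughly the same
    # number of steps used for the expert in this chapter.
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=2_000_000)
    model.save("flappy_bird_expert")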

We tested DAgger on Flappy Bird with the following hyperparameters:

Hyperparameter                              Variable name        Value
Learner hidden layers                       hidden_sizes         16, 16
DAgger iterations                           dagger_iterations    8
Learning rate                               p_lr                 1e-4
Number of steps per DAgger iteration        step_iterations      100
Mini-batch size                             batch_size           50
Training epochs                             train_epochs         2000
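
To make the role of each hyperparameter concrete, here is a compact DAgger sketch written in PyTorch rather than the book's TensorFlow code. The env and expert_policy arguments are assumptions: a classic Gym-style environment with a discrete action space and a callable that returns the expert's action for a given observation. train_epochs is treated here as the number of mini-batch updates per DAgger iteration.

    import numpy as np
    import torch
    import torch.nn as nn

    def dagger(env, expert_policy, hidden_sizes=(16, 16), dagger_iterations=8,
               p_lr=1e-4, step_iterations=100, batch_size=50, train_epochs=2000):
        obs_dim = env.observation_space.shape[0]
        n_actions = env.action_space.n

        # The learner is a small MLP classifier over the discrete actions.
        layers, in_dim = [], obs_dim
        for h in hidden_sizes:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        layers += [nn.Linear(in_dim, n_actions)]
        learner = nn.Sequential(*layers)
        optimizer = torch.optim.Adam(learner.parameters(), lr=p_lr)
        loss_fn = nn.CrossEntropyLoss()

        dataset_obs, dataset_act = [], []
        obs = env.reset()

        for it in range(dagger_iterations):
            # 1) Collect step_iterations transitions. On the first iteration the
            #    expert drives; afterwards the learner drives, but every visited
            #    state is labeled with the expert's action (the core DAgger idea).
            for _ in range(step_iterations):
                expert_act = expert_policy(obs)
                if it == 0:
                    act = expert_act
                else:
                    logits = learner(torch.as_tensor(obs, dtype=torch.float32))
                    act = int(torch.argmax(logits))
                dataset_obs.append(obs)
                dataset_act.append(expert_act)
                obs, _, done, _ = env.step(act)
                if done:
                    obs = env.reset()

            # 2) Retrain the learner on the whole aggregated dataset with
            #    train_epochs mini-batch updates of size batch_size.
            obs_t = torch.as_tensor(np.array(dataset_obs), dtype=torch.float32)
            act_t = torch.as_tensor(np.array(dataset_act), dtype=torch.long)
            for _ in range(train_epochs):
                idx = torch.randint(0, len(dataset_obs), (batch_size,))
                loss = loss_fn(learner(obs_t[idx]), act_t[idx])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        return learner

This is only a sketch under the stated assumptions, not the chapter's implementation, but it shows where every hyperparameter in the preceding table enters the algorithm.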

The plot in the following screenshot shows how the performance of DAgger evolves with the number of steps taken:

The horizontal line represents the average performance reached by the expert. From the results, we can see that a few hundred steps are sufficient to reach the performance of the expert. Compared with the experience PPO needed to train the expert, this represents roughly a 100-fold improvement in sample efficiency.

Again, this is not a fair comparison, as the two methods operate in different settings, but it highlights that whenever an expert is available, an imitation learning approach is worth adopting, if only to learn a starting policy.
