Applying scalable ES to LunarLander

How well will the scalable version of evolution strategies perform in the LunarLander environment? Let's find out!

As you may recall, we already used LunarLander with A2C and REINFORCE in Chapter 6, Learning Stochastic and PG Optimization. The task consists of landing a lander on the moon using continuous actions. We chose this environment for its medium difficulty and so that we can compare the ES results with those obtained with A2C.

The hyperparameters that performed the best in this environment are as follows:

Hyperparameter                         Variable name        Value
Neural network size                    hidden_sizes         [32, 32]
Training iterations (or generations)   number_iter          200
Number of workers                      num_workers          4
Adam learning rate                     lr                   0.02
Individuals per worker                 indiv_per_worker     12
Noise standard deviation               std_noise            0.05

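To make the configuration concrete, here is a minimal sketch of how these hyperparameters could be passed to the training function. The module name (scalable_es), the entry-point name (ES), and its exact signature are assumptions rather than the exact API of this chapter's code, and the environment ID (LunarLanderContinuous-v2) is likewise assumed from the continuous-action setup described above; adapt all of them to your own implementation.

# Hedged sketch: launch a scalable ES run with the hyperparameters listed above.
# The module and function names are assumptions; adapt them to your own code.
from scalable_es import ES  # assumed module/function implementing this chapter's ES

if __name__ == '__main__':
    ES(
        'LunarLanderContinuous-v2',  # assumed Gym environment ID (continuous actions)
        hidden_sizes=[32, 32],       # neural network size
        number_iter=200,             # training iterations (generations)
        num_workers=4,               # number of parallel workers
        lr=0.02,                     # Adam learning rate
        indiv_per_worker=12,         # individuals evaluated by each worker
        std_noise=0.05               # standard deviation of the perturbation noise
    )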

The results are shown in the following graph. What immediately catches the eye is how stable and smooth the curve is. Also notice that it reaches an average score of about 200 after 2.5-3 million steps. Comparing these results with those obtained with A2C (in Figure 6.7), you can see that the evolution strategy needed roughly two to three times as many steps as A2C and REINFORCE.

As demonstrated in the paper, with massive parallelization (at least hundreds of CPUs), you should be able to obtain very good policies in just a few minutes. Unfortunately, we don't have that much computational power at our disposal. However, if you do, you may want to try it for yourself:

Figure 11.5 The performance of scalable evolution strategies
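
If you do have access to a machine (or cluster node) with many cores, the simplest knob to turn is the number of workers. The following is a hedged sketch of how you might size it from the available CPUs; in the setup above, each of the 4 workers evaluates 12 individuals, so scaling num_workers directly scales the population evaluated at every generation.

import multiprocessing

# Hedged sketch: match the number of workers to the available CPU cores.
# With indiv_per_worker fixed at 12, more workers means a larger population
# evaluated in parallel at each generation.
num_workers = multiprocessing.cpu_count()
population_size = num_workers * 12
print(f'{num_workers} workers -> {population_size} individuals per generation')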

Overall, the results are great and show that ES is a viable solution for very long-horizon problems and for tasks with very sparse rewards.
