How well will the scalable version of evolution strategies perform in the LunarLander environment? Let's find out!
As you may recall, we already used the LunarLander environment with A2C and REINFORCE in Chapter 6, Learning Stochastic and PG Optimization. The task consists of landing a lander on the moon using continuous actions. We chose this environment for its medium difficulty, and so that we can compare the ES results against those obtained with A2C.
The hyperparameters that performed the best in this environment are as follows:
| Hyperparameter | Variable name | Value |
| --- | --- | --- |
| Neural network size | hidden_sizes | [32, 32] |
| Training iterations (or generations) | number_iter | 200 |
| Number of workers | num_workers | 4 |
| Adam learning rate | lr | 0.02 |
| Individuals per worker | indiv_per_worker | 12 |
| Standard deviation | std_noise | 0.05 |
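To see how these hyperparameters come together, here is a minimal sketch of a single ES generation with NumPy. It is not the book's actual implementation: `evaluate` is a hypothetical stand-in for an episode rollout that returns the total reward of a flattened parameter vector, and the population size of 48 comes from num_workers * indiv_per_worker (4 * 12). The rank normalization of returns follows the standard scalable-ES recipe.

```python
import numpy as np

def es_update(theta, evaluate, lr=0.02, std_noise=0.05, pop_size=48):
    """One evolution-strategies generation on a flat parameter vector."""
    # One Gaussian perturbation per individual in the population
    eps = np.random.randn(pop_size, theta.size)
    # Evaluate every perturbed candidate (evaluate is a stand-in
    # for running a full episode with the perturbed policy)
    returns = np.array([evaluate(theta + std_noise * e) for e in eps])
    # Rank-normalize returns to [-0.5, 0.5] to reduce outlier influence
    ranks = returns.argsort().argsort()
    norm = ranks / (pop_size - 1) - 0.5
    # Stochastic estimate of the gradient of the expected return
    grad = (norm @ eps) / (pop_size * std_noise)
    # Plain gradient ascent here; the book's version feeds grad to Adam
    return theta + lr * grad

# Toy usage: maximize -||theta||^2, whose optimum is the origin
np.random.seed(0)
theta = np.ones(4)
for _ in range(300):
    theta = es_update(theta, lambda w: -np.sum(w ** 2))
```

In practice, the candidate evaluations inside the loop are exactly what gets spread across the workers, since each rollout is independent of the others.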
The results are shown in the following graph. What immediately catches your eye is that the curve is very stable and smooth. Also notice that it reaches an average score of about 200 after 2.5-3 million steps. Comparing these results with those obtained with A2C (in Figure 6.7), you can see that the evolution strategy needed roughly two to three times as many steps as A2C and REINFORCE.
As demonstrated in the paper, by using massive parallelization (with at least hundreds of CPUs), you should be able to obtain very good policies in just a few minutes. Unfortunately, we don't have that much computational power at our disposal. However, if you do, you may want to try it for yourself.
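The key to that level of parallelization is keeping communication cheap: workers exchange only random seeds and scalar returns, and each side regenerates the full noise vectors locally from the seeds. The following is a simplified, single-process sketch of that idea (the function names and the toy `evaluate` lambda are illustrative, not from the book's code).

```python
import numpy as np

def noise_from_seed(seed, dim):
    """Deterministically rebuild a perturbation from its seed."""
    return np.random.RandomState(seed).randn(dim)

def worker_eval(theta, seed, evaluate, std_noise=0.05):
    """Worker side: evaluate one candidate, report only (seed, return)."""
    eps = noise_from_seed(seed, theta.size)
    return seed, evaluate(theta + std_noise * eps)

# "Master" side: collect (seed, return) pairs from the workers, then
# rebuild the exact same perturbations from the seeds alone
theta = np.zeros(3)
results = [worker_eval(theta, s, lambda w: -np.sum(w ** 2))
           for s in range(8)]
recovered = [noise_from_seed(s, theta.size) for s, _ in results]
```

Because a seed is a single integer, the bandwidth per candidate is constant regardless of how many parameters the network has, which is what lets the method scale to hundreds of CPUs.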
Overall, the results are great and show that ES is a viable solution for very long-horizon problems and tasks with very sparse rewards.