PPO application

PPO and TRPO are very similar algorithms, so we compare them by testing PPO in the same environment we used for TRPO, namely RoboschoolWalker2d. To make the comparison fairer, we devoted the same computational resources to tuning both algorithms. The hyperparameters of TRPO are the same as those listed in the previous section, while the hyperparameters of PPO are shown in the following table:

Hyperparameter                  Value
Neural network                  64, tanh, 64, tanh
Policy learning rate            3e-4
Number of actor iterations      10
Number of agents                1
Time horizon                    5,000
Mini-batch size                 256
Clipping coefficient            0.2
Delta (for GAE)                 0.95
Gamma (for GAE)                 0.99
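
For reference, the same values can be collected in a plain Python dictionary. This is only an illustrative sketch; the key names are ours and don't come from any specific library or from the code discussed in this chapter:

```python
# Minimal sketch of the PPO hyperparameters from the table above.
# Key names are illustrative, not taken from any particular implementation.
ppo_hyperparams = {
    'hidden_layers': [64, 64],   # two hidden layers of 64 units
    'activation': 'tanh',        # tanh activation after each hidden layer
    'policy_lr': 3e-4,           # policy (actor) learning rate
    'actor_iterations': 10,      # optimization epochs per batch of experience
    'num_agents': 1,             # a single agent collects the experience
    'time_horizon': 5000,        # environment steps collected per update
    'minibatch_size': 256,       # mini-batch size for each gradient step
    'clip_eps': 0.2,             # PPO clipping coefficient
    'gae_smoothing': 0.95,       # GAE smoothing parameter (Delta in the table)
    'gae_gamma': 0.99,           # discount factor used in GAE
}
```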

 

A comparison between PPO and TRPO is shown in the following diagram. PPO needs more experience before it takes off, but once it does, it improves rapidly and outpaces TRPO. With these particular settings, PPO also outperforms TRPO in terms of final performance. Keep in mind that further tuning of the hyperparameters could produce different, and possibly better, results:

Figure 7.9. Comparison of performance between PPO and TRPO
A few personal observations: we found PPO harder to tune than TRPO. One reason for this is the higher number of hyperparameters in PPO. Moreover, the actor learning rate is one of the most important coefficients to tune, and if it isn't set properly, it can greatly affect the final results. A strong point in favor of TRPO is that it has no learning rate and depends on only a few hyperparameters that are easy to tune. On the other hand, an advantage of PPO is that it's faster and has been shown to work in a wider variety of environments.
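
To make the role of the clipping coefficient and the actor learning rate more concrete, here is a minimal NumPy sketch of PPO's clipped surrogate objective, the quantity the actor is trained to maximize. The function and argument names are illustrative and don't correspond to any particular implementation:

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to move
    # the new policy too far from the old one (clip_eps = 0.2 in the table).
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum makes the objective a pessimistic lower bound.
    return np.mean(np.minimum(unclipped, clipped))

# Example on a small random batch, just to show the shapes involved.
rng = np.random.default_rng(0)
adv = rng.standard_normal(256)                          # advantage estimates (e.g., from GAE)
old_logp = rng.standard_normal(256) - 1.0               # log-probs under the old policy
new_logp = old_logp + 0.05 * rng.standard_normal(256)   # log-probs under a slightly updated policy
print(ppo_clipped_objective(new_logp, old_logp, adv))
```

In practice, this objective is maximized with a gradient-based optimizer whose step size is the policy learning rate from the table, which is why that coefficient is so sensitive; TRPO sidesteps it by constraining the update with the KL divergence instead.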