PPO and TRPO are very similar algorithms, and we compare them by testing PPO in the same environment we used for TRPO, namely RoboschoolWalker2d. We devoted the same computational resources to tuning both algorithms so that the comparison is fairer. The hyperparameters of TRPO are the same as those listed in the previous section, while the hyperparameters of PPO are shown in the following table:
| Hyperparameter | Value |
| --- | --- |
| Neural network | 64, tanh, 64, tanh |
| Policy learning rate | 3e-4 |
| Number of actor iterations | 10 |
| Number of agents | 1 |
| Time horizon | 5,000 |
| Mini-batch size | 256 |
| Clipping coefficient | 0.2 |
| Lambda (for GAE) | 0.95 |
| Gamma (for GAE) | 0.99 |
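To make the role of these values more concrete, here is a minimal, self-contained Python sketch that gathers them in a dictionary and implements the clipped surrogate objective that the clipping coefficient controls. The names (`ppo_hyperparams`, `clipped_surrogate_loss`) are illustrative only and are not taken from any particular codebase:

```python
import numpy as np

# Illustrative container mirroring the hyperparameters in the preceding table.
ppo_hyperparams = {
    "hidden_layers": [64, 64],   # two hidden layers of 64 units with tanh activations
    "policy_lr": 3e-4,           # policy learning rate
    "actor_iterations": 10,      # optimization passes over each batch of experience
    "num_agents": 1,
    "time_horizon": 5000,        # environment steps collected before each update
    "minibatch_size": 256,
    "clip_eps": 0.2,             # PPO clipping coefficient
    "gae_lambda": 0.95,
    "gamma": 0.99,
}

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss (negative of the objective)."""
    ratio = np.exp(new_logp - old_logp)                       # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # keep the ratio near 1
    # The element-wise minimum removes any incentive to push the ratio
    # outside the clipping range in the direction that raises the objective.
    objective = np.minimum(ratio * advantages, clipped * advantages)
    return -np.mean(objective)

# Toy usage with random log-probabilities and advantages of mini-batch size:
rng = np.random.default_rng(0)
batch = ppo_hyperparams["minibatch_size"]
new_logp = rng.normal(size=batch)
old_logp = rng.normal(size=batch)
adv = rng.normal(size=batch)
print(clipped_surrogate_loss(new_logp, old_logp, adv, ppo_hyperparams["clip_eps"]))
```

In a full implementation, this loss would be minimized for the given number of actor iterations on mini-batches drawn from the most recent batch of experience, with the advantages computed by GAE using the lambda and gamma values above.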
A comparison between PPO and TRPO is shown in the following diagram. PPO needs more experience before its performance takes off, but once it does, it improves rapidly and outpaces TRPO. In this particular setting, PPO also outperforms TRPO in terms of final performance. Keep in mind that further hyperparameter tuning could yield different, and possibly better, results: