The PPO algorithm

The practical algorithm introduced in the PPO paper uses a truncated version of Generalized Advantage Estimation (GAE), an idea first introduced in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation. GAE computes the advantage as follows:

$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \dots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$  (7.11)

This is used instead of the more common advantage estimator:

$\hat{A}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T) - V(s_t)$  (7.12)
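
As a minimal sketch of how the truncated GAE of equation (7.11) can be computed in practice (not the book's implementation), the advantages of a single trajectory segment can be accumulated backwards, since each advantage reuses the one from the following step. The function name, the NumPy-based signature, and the omission of terminal-state handling are assumptions made for illustration:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE (equation 7.11) for one trajectory segment of length T.

    rewards    -- array of shape (T,): rewards r_t collected by one actor
    values     -- array of shape (T,): critic estimates V(s_t)
    last_value -- scalar: V(s_T), used to bootstrap the final step
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)   # V(s_0), ..., V(s_T)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    # Work backwards: A_t = delta_t + (gamma * lambda) * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```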

Continuing with the PPO algorithm, on each iteration N trajectories with time horizon T are collected from multiple parallel actors, and the policy is updated K times using mini-batches. In the same way, the critic can also be updated multiple times using mini-batches. The following table lists standard values for the PPO hyperparameters and coefficients. Although every problem requires its own tuning, it is useful to have an idea of their typical ranges (reported in the third column of the table):

| Hyperparameter | Symbol | Range |
| --- | --- | --- |
| Policy learning rate | - | [1e-5, 1e-3] |
| Number of policy iterations | K | [3, 15] |
| Number of trajectories (equivalent to the number of parallel actors) | N | [1, 20] |
| Time horizon | T | [64, 5120] |
| Mini-batch size | - | [64, 5120] |
| Clipping coefficient | ε | 0.1 or 0.2 |
| Lambda (for GAE) | λ | [0.9, 0.97] |
| Gamma (for GAE) | γ | [0.8, 0.995] |
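
To make the schedule above concrete, here is a minimal sketch (under stated assumptions, not the book's code) of the update phase of one PPO iteration: after the N actors have collected T steps each, the policy and the critic are optimized for K passes over shuffled mini-batches, with K and the mini-batch size chosen from the ranges in the table. `policy_update` and `critic_update` are hypothetical callables standing in for the clipped-surrogate and value-function steps, and `dataset` is assumed to be a dict of equally sized NumPy arrays:

```python
import numpy as np

def ppo_update_schedule(dataset, policy_update, critic_update,
                        K=10, minibatch_size=256, rng=None):
    """Run K epochs of mini-batch updates over the N * T collected transitions.

    dataset        -- dict of equally sized arrays (observations, actions,
                      advantages, returns, old log-probabilities, ...)
    policy_update  -- hypothetical callable applying one clipped-surrogate step
    critic_update  -- hypothetical callable applying one value-function step
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(next(iter(dataset.values())))        # n = N * T transitions
    for _ in range(K):                           # K passes over the data
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, minibatch_size):
            idx = order[start:start + minibatch_size]
            minibatch = {key: arr[idx] for key, arr in dataset.items()}
            policy_update(minibatch)             # policy step
            critic_update(minibatch)             # critic step
```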