A quick overview

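The main idea behind PPO is to clip the surrogate objective function when the new policy moves too far from the old one, instead of imposing a constraint on it as TRPO does. This prevents the policy from making updates that are too large. The main objective is as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

Here, $\hat{A}_t$ is the estimated advantage at timestep $t$, and $r_t(\theta)$ is defined as follows:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$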

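What the objective is saying is that if the probability ratio, $r_t(\theta)$, between the new and the old policy is higher than $1+\epsilon$ or lower than $1-\epsilon$, where $\epsilon$ is a small constant, then the ratio is clipped and the minimum of the clipped and unclipped terms is taken. This removes the incentive for $r_t(\theta)$ to move outside the interval $[1-\epsilon, 1+\epsilon]$. The value of 1 is taken as the reference point, since $r_t(\theta_{old}) = 1$.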
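To make the clipping behavior concrete, here is a minimal NumPy sketch of the clipped objective for a batch of samples. It is an illustration only, not the book's implementation: the function name `clipped_surrogate`, its argument names, and the default `epsilon=0.2` are assumptions chosen for this example, and the policies are represented simply by arrays of log-probabilities for the actions taken.

```python
import numpy as np


def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    All names and the default epsilon are illustrative assumptions.
    """
    # Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t),
    # computed from log-probabilities.
    ratio = np.exp(new_log_probs - old_log_probs)
    # The same ratio restricted to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Element-wise minimum of the unclipped and clipped terms, then the batch mean.
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))


# Two samples: the new policy raised the probability of the first action
# (ratio 1.5) and lowered the probability of the second one (ratio 0.5).
old_log_probs = np.log(np.array([0.5, 0.5]))
new_log_probs = np.log(np.array([0.75, 0.25]))
advantages = np.array([1.0, -1.0])

objective = clipped_surrogate(new_log_probs, old_log_probs, advantages)
print(objective)
```

In this example both ratios fall outside the interval, so both terms are clipped: the first contributes 1.2 rather than 1.5, and the second contributes -0.8 rather than -0.5. Note that the minimum only limits changes that would improve the objective. With a ratio of 1.5 and a negative advantage, the unclipped (more pessimistic) term would be kept.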
