The main idea behind PPO is to clip the surrogate objective function when the new policy moves too far from the old one, instead of imposing a hard constraint on the update as TRPO does. This prevents the policy from making updates that are too large. The main objective is as follows:
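$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{7.9}$$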
Here, the probability ratio $r_t(\theta)$ is defined as follows:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{7.10}$$
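What the objective is saying is that if the probability ratio, $r_t(\theta)$, between the new and the old policy moves more than a constant, $\epsilon$, above or below 1, the ratio is clipped, and the minimum of the clipped and unclipped terms is taken. This discourages $r_t(\theta)$ from moving outside the interval $[1-\epsilon, 1+\epsilon]$. The value of $r_t(\theta)$ at the old policy is taken as the reference point, $r_t(\theta_{old}) = 1$.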
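To make the clipping concrete, the following is a minimal NumPy sketch of the clipped objective for a batch of samples. It is an illustration rather than a full PPO implementation: the function name `clipped_surrogate`, its arguments, and the default `epsilon=0.2` are assumptions made for this example, and the advantage estimates are assumed to have been computed elsewhere.

```python
import numpy as np


def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Batch average of the clipped surrogate objective (hypothetical helper).

    new_log_probs, old_log_probs: log-probabilities of the taken actions
    under the new and old policies. epsilon=0.2 is an assumed default.
    """
    new_log_probs = np.asarray(new_log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio r_t, computed from log-probabilities for stability
    ratio = np.exp(new_log_probs - old_log_probs)

    # Unclipped term and the term with the ratio clipped to [1 - eps, 1 + eps]
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Element-wise minimum of the two terms, then the empirical mean
    return np.mean(np.minimum(unclipped, clipped))


# Ratios of 1.5, 0.5, and 1.0 paired with advantages 1.0, -1.0, and 2.0
objective = clipped_surrogate(
    new_log_probs=np.log([1.5, 0.5, 1.0]),
    old_log_probs=np.zeros(3),
    advantages=[1.0, -1.0, 2.0],
)
print(objective)
```

In this example, the first sample has a positive advantage and a ratio above $1+\epsilon$, so its contribution is capped at $1.2 \times 1.0$. The second has a negative advantage and a ratio below $1-\epsilon$, so the minimum selects the clipped term, $0.8 \times (-1.0)$. The third sits at the reference point and is unaffected. Because the objective is maximized, a training loop would typically minimize its negative.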