Understanding TRPO and PPO

There are many variations of the policy-based, model-free algorithms that have become popular for solving RL problems by optimizing predictions of future rewards. As we have seen, many of these algorithms, such as Actor-Critic, use an advantage function, where two sides of the problem work to converge on the optimum solution. In this case, the advantage function is trying to find the maximum expected discounted rewards. TRPO and PPO do this by using an optimization method called the Minorize-Maximization (MM) algorithm. An example of how the MM algorithm solves a problem is shown in the following diagram:



Using the MM algorithm

This diagram was extracted from a series of blog posts by Jonathan Hui that elegantly describe the MM algorithm, along with the TRPO and PPO methods, in much greater detail. See the following link for the source: https://medium.com/@jonathan_hui/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12.

Essentially, the MM algorithm finds the optimum function by iteratively building and maximizing a lower-bound approximation of it until it arrives at a converged solution. In the diagram, the red line denotes the function we are looking to approximate, and the blue line denotes the converging lower-bound function. You can see the progression as the algorithm picks successive approximations that close in on the solution.
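In more formal terms (this is the standard statement of the MM idea, not anything specific to the Unity implementation), at each iteration the algorithm builds a surrogate function g that never rises above the true objective f and touches it at the current parameters, and then maximizes that surrogate:

g(\theta \mid \theta_i) \le f(\theta) \quad \text{for all } \theta
g(\theta_i \mid \theta_i) = f(\theta_i)
\theta_{i+1} = \arg\max_{\theta} \, g(\theta \mid \theta_i)

Because the surrogate never overestimates the true objective, each maximization step cannot decrease it, which is what produces the converging behavior shown in the diagram.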

The problem we encounter when using MM is that the function approximation can sometimes fall off a cliff, or down into a valley. In order to understand this better, let's consider it as the problem of climbing an uneven hill using only straight lines. An example of such a scenario is seen here:



Attempting to climb a hill using linear methods

You can see that using only straight, fixed paths to navigate this quite treacherous ridge would, in fact, be dangerous. The danger for our algorithm is not physical, of course, but using linear methods to solve MM poses much the same problem as hiking up a steep ridge along a straight, fixed path: one long step in the wrong direction can send you off the edge.

TRPO avoids the problem with linear methods by using a quadratic (second-order) method, and by limiting the number of steps each iteration can take to a form of trust region. That is, the algorithm makes sure that every update stays within a region around the current policy that it knows to be safe. If we consider our hill-climbing example again, we may think of TRPO as placing a path or region of trust up the hill, like in the following photo:



A trust region path up the hill

In the preceding photo, the path is shown for example purposes only as a connected set of circles or regions; the real trust path may or may not be closer to the actual peak or ridge. Regardless, this has the effect of allowing the agent to learn at a more gradual and progressive pace. With TRPO, the size of the trust region can be made bigger or smaller to coincide with our preferred rate of policy convergence. The problem with TRPO is that it is quite complex to implement, since it requires computing second-order derivatives of some complex equations.
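For reference, the trust region idea is usually written (in standard textbook notation, not the exact form used inside ML-Agents) as maximizing a surrogate objective built from the advantage, subject to a constraint on how far the new policy may move away from the old one:

\max_{\theta} \; \mathbb{E}\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A(s, a) \right]
\quad \text{subject to} \quad
\mathbb{E}\left[ \mathrm{KL}\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta

Here, A(s, a) is the advantage we discussed earlier, and delta is the size of the trust region. Enforcing this constraint on the divergence between the old and new policies is what requires the expensive second-order machinery, and it is exactly this divergence measure that PPO handles more cheaply, as we will see next.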

PPO addresses this issue by limiting or clipping the Kullback-Leibler (KL) divergence between the old and new policies at each iteration. KL divergence measures the difference between two probability distributions and can be described through the following diagram:



Understanding KL divergence

In the preceding diagram, p(x) and q(x) each represent a different policy, and the KL divergence measures how far apart they are. The algorithm, in turn, uses this measure of divergence to limit or clip the amount of policy change that may occur in an iteration. ML-Agents exposes two hyperparameters that allow you to control how much the policy is allowed to change from one iteration to the next. The following are the definitions for the beta and epsilon parameters, as described in the Unity documentation:

  • Beta: This corresponds to the strength of the entropy regularization, which makes the policy more random. This ensures that agents properly explore the action space during training. Increasing this will ensure that more random actions are taken. This should be adjusted so that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase beta. If entropy drops too slowly, decrease beta:
  •    Typical range: 1e-4 – 1e-2
  • Epsilon: This corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value to be small will result in more stable updates, but will also slow the training process:
  •    Typical range: 0.1 – 0.3
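To make these parameters more concrete, here is a minimal NumPy sketch (purely illustrative, not the actual ML-Agents code; the distributions and advantage value are made up) showing how the KL divergence between two discrete policies p(x) and q(x) is computed, how epsilon clips the probability ratio in a PPO-style objective, and how beta scales an entropy bonus:

import numpy as np

# Two action-probability distributions over the same action space,
# standing in for the old policy p(x) and the updated policy q(x).
p = np.array([0.1, 0.2, 0.3, 0.4])      # old policy
q = np.array([0.15, 0.25, 0.35, 0.25])  # new policy

# KL divergence KL(p || q): how far the new policy has drifted from the old one.
kl = np.sum(p * np.log(p / q))

# PPO-style clipped objective for a single action with an advantage estimate.
epsilon = 0.2        # clipping threshold (the default mentioned below)
advantage = 1.5      # made-up advantage value for illustration
ratio = q[0] / p[0]  # probability ratio of the new policy to the old policy
clipped_ratio = np.clip(ratio, 1 - epsilon, 1 + epsilon)
surrogate = min(ratio * advantage, clipped_ratio * advantage)

# Entropy bonus scaled by beta encourages exploration.
beta = 1.0e-2        # entropy regularization strength (the default mentioned below)
entropy = -np.sum(q * np.log(q))
objective = surrogate + beta * entropy

print(f"KL(p||q) = {kl:.4f}")
print(f"clipped surrogate = {surrogate:.4f}, with entropy bonus = {objective:.4f}")

Increasing epsilon widens the band in which the probability ratio is left unclipped, allowing larger policy changes per update, while increasing beta puts more weight on the entropy term and keeps the policy's actions more random.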

The key thing to remember about these parameters is that they control how quickly the policy changes from one iteration to the next. If you notice an agent training erratically, it may be beneficial to tune these parameters to smaller values. The default value for epsilon is 0.2 and for beta is 1.0e-2, but, of course, we will want to explore how these values may affect training, either positively or negatively. In the next exercise, we will modify these policy change parameters and see what effect they have on training:

  1. For this example, we will open up the CrawlerDynamic scene from the Assets/ML-Agents/Examples/Crawler/Scenes folder.
  2. Open the trainer_config.yaml file located in the ML-Agents/ml-agents/config folder. Since we have already evaluated the performance of this sample, we will revert the training configuration and make some modifications to the beta and epsilon parameters.
  3. Scroll down to the CrawlerDynamicLearning configuration section and modify it as follows:
CrawlerDynamicLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 1024
    buffer_size: 20240
    gamma: 0.995
    max_steps: 1e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
    epsilon: .1    # lowered from the default of 0.2
    beta: .1       # raised from the default of 1.0e-2
  4. We lowered the epsilon parameter below its default and raised the beta parameter well above its default, which means that the training will be less stable. If you recall, however, these marathon examples generally train in a more stable manner.
  5. Open up a properly configured Python console and run the following command to launch training:
mlagents-learn config/trainer_config.yaml --run-id=crawler_policy --train
  6. As usual, wait for a number of training sessions for a good comparison from one example to the next.
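To compare this run with your earlier runs, you can point TensorBoard at the training summaries while training proceeds (assuming the default summaries folder used by this version of ML-Agents):

tensorboard --logdir=summaries

The crawler_policy run ID will appear alongside any previous runs, making it easy to compare the reward and entropy curves mentioned earlier.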

What you may find unexpected is that the agent appears to start regressing, and in fact, it is. This is happening because we made the trust region too large (a large beta); even though we allowed the rate of change to be lower (an epsilon of .1), we can see that training is more sensitive to the beta value.

Keep in mind that the Unity ML-Agents implementation uses a number of these features in tandem, which together comprise a powerful RL framework. In the next section, we will take another quick look at an optimization parameter that Unity has only recently added.
