The TRPO algorithm

From a broad perspective, TRPO can be seen as a continuation of the NPG algorithm for nonlinear function approximation. The biggest improvement introduced in TRPO is the use of a constraint on the KL divergence between the new and the old policy, which forms a trust region. This allows the network to take larger steps while always staying inside the trust region. The resulting constrained optimization problem is formulated as follows:

$$\max_{\theta} \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}(\theta_{old}, \theta) \le \delta \qquad (7.2)$$

Here, $L_{\theta_{old}}(\theta)$ is the surrogate objective function that we'll see soon, $\bar{D}_{KL}(\theta_{old}, \theta)$ is the KL divergence between the old policy with the $\theta_{old}$ parameters and the new policy with the $\theta$ parameters, and $\delta$ is the coefficient of the constraint (the size of the trust region).

The surrogate objective function is designed in such a way that it is maximized with respect to the new policy parameters while using the state distribution of the old policy. This is done using importance sampling, which estimates an expectation under the distribution of the new policy (the desired one) while only having samples from the distribution of the old policy (the known distribution). Importance sampling is required because the trajectories were sampled with the old policy, but what we actually care about is the distribution of the new one. Using importance sampling, the surrogate objective function is defined as follows:

$$L_{\theta_{old}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \, A_{\theta_{old}}(s,a) \right] \qquad (7.3)$$

Here, $A_{\theta_{old}}(s,a)$ is the advantage function of the old policy. Thus, the constrained optimization problem is equivalent to the following:

$$\max_{\theta} \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \, A_{\theta_{old}}(s,a) \right] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}} \Big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s)\big) \Big] \le \delta \qquad (7.4)$$

Here, $\pi(\cdot|s)$ indicates the action distribution conditioned on the state, $s$.

What we are left to do is replace the expectations with empirical averages over a batch of samples and substitute $A_{\theta_{old}}$ with an empirical estimate of the advantage.
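As a minimal sketch of this empirical estimate (assuming logp_new, logp_old, and advantages are hypothetical NumPy arrays holding the log-probabilities of the batch actions under the two policies and the estimated advantages), the surrogate objective of equation (7.3) can be computed like this:

import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    # Importance-sampling ratio pi_new(a|s) / pi_old(a|s),
    # computed from log-probabilities for numerical stability.
    ratio = np.exp(logp_new - logp_old)
    # The empirical average over the batch replaces the expectation in (7.3).
    return np.mean(ratio * advantages)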

Constrained problems are difficult to solve, and in TRPO, the optimization problem in equation (7.4) is approximately solved by using a linear approximation of the objective function and a quadratic approximation of the constraint, so that the solution becomes similar to the NPG update:

$$\theta_{new} = \theta_{old} + \beta F^{-1} g$$

Here, $g = \nabla_{\theta} L_{\theta_{old}}(\theta)\big|_{\theta=\theta_{old}}$ is the gradient of the surrogate objective and $F$ is the Fisher information matrix, which is the Hessian of the KL divergence constraint.

The approximation of the original optimization problem can now be solved using the Conjugate Gradient (CG) method, an iterative method for solving linear systems. When we talked about NPG, we emphasized that computing $F^{-1}$ is computationally very expensive with a large number of parameters. However, CG can approximately solve a linear system without forming the full matrix, $F$. Thus, using CG, we can compute the search direction, $s$, as follows:

$$s \approx F^{-1} g \qquad (7.5)$$
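To make this concrete, here is a minimal NumPy sketch of the conjugate gradient routine, assuming fvp is a function that returns the matrix-vector product $Fv$ for any vector $v$ (one possible way to compute it is shown later in this section):

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-8):
    # Approximately solve F x = g using only the product fvp(v) = F v.
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x, with x = 0
    p = r.copy()              # search direction
    r_dot = r @ r
    for _ in range(iters):
        fvp_p = fvp(p)
        alpha = r_dot / (p @ fvp_p + 1e-10)
        x += alpha * p
        r -= alpha * fvp_p
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

A small, fixed number of iterations is usually enough to obtain a good search direction.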

TRPO also gives us a way of estimating the step size: 

$$\beta = \sqrt{\frac{2\delta}{s^{T} F s}} \qquad (7.6)$$

Therefore, the update becomes as follows:

$$\theta_{new} = \theta_{old} + \beta s \qquad (7.7)$$
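As a small sketch of equations (7.6) and (7.7), reusing the hypothetical fvp function and a delta trust-region size, the resulting step could be computed like this:

import numpy as np

def natural_step(theta_old, s, fvp, delta):
    # beta = sqrt(2*delta / (s^T F s)), as in equation (7.6).
    sFs = s @ fvp(s)
    beta = np.sqrt(2.0 * delta / (sFs + 1e-10))
    # Natural-gradient-style update of equation (7.7).
    return theta_old + beta * s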

So far, we have created a special case of the natural policy gradient step, but to complete the TRPO update, we are missing a key ingredient. Remember that we approximated the problem with a linear objective function and a quadratic constraint. Thus, we are optimizing only a local approximation of the expected return. With the introduction of these approximations, we cannot be certain that the KL divergence constraint is still satisfied. To enforce the nonlinear constraint while improving the nonlinear objective, TRPO performs a line search to find the largest value, $\alpha$, that satisfies the constraint. The TRPO update with the line search becomes the following:


$$\theta_{new} = \theta_{old} + \alpha \beta s \qquad (7.8)$$

It may seem that the line search is a negligible part of the algorithm but, as demonstrated in the paper, it has a fundamental role. Without it, the algorithm may compute large steps, causing catastrophic degradation in performance.
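A possible sketch of the backtracking line search follows; eval_fn is a hypothetical helper that returns the empirical surrogate objective and KL divergence for a candidate parameter vector, and full_step is the product $\beta s$ computed earlier:

import numpy as np

def backtracking_line_search(theta_old, full_step, eval_fn, delta,
                             max_backtracks=10, decay=0.8):
    obj_old, _ = eval_fn(theta_old)
    alpha = 1.0
    for _ in range(max_backtracks):
        theta_new = theta_old + alpha * full_step
        obj_new, kl = eval_fn(theta_new)
        # Accept the largest alpha that improves the surrogate objective
        # while keeping the KL divergence inside the trust region.
        if obj_new > obj_old and kl <= delta:
            return theta_new
        alpha *= decay
    # If no scaled step satisfies the constraint, keep the old parameters.
    return theta_old

Because alpha decays exponentially from 1, the first accepted step is also the largest one tried, which matches the goal of taking the biggest step that stays inside the trust region.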

To summarize, the TRPO algorithm computes a search direction with the conjugate gradient algorithm to solve the approximated objective function and constraint. It then uses a line search to find the maximal step coefficient, $\alpha$, so that the constraint on the KL divergence is satisfied and the objective is improved. To further increase the speed of the algorithm, the conjugate gradient algorithm also makes use of an efficient Fisher-vector product (to learn more about it, check out the paper at https://arxiv.org/abs/1502.05477).
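As a rough illustration of this trick (not the exact implementation from the paper), the product $Fv$ can be obtained with two backward passes through the KL divergence, for example with PyTorch-style automatic differentiation; kl_fn is assumed to be a callable that recomputes the mean KL divergence between the old and current policy on the batch:

import torch

def fisher_vector_product(kl_fn, params, vector, damping=1e-2):
    kl = kl_fn()
    # First derivative of the sampled KL divergence w.r.t. the policy
    # parameters, keeping the graph so we can differentiate a second time.
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Differentiating (grad_KL . v) yields the Hessian-vector product F v
    # without ever building the full matrix F.
    grad_vector_dot = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_vector_dot, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp]).detach()
    # A small damping term keeps the linear system well conditioned.
    return flat_hvp + damping * vector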

TRPO can be integrated into an actor-critic (AC) architecture, where the critic provides additional support to the policy (the actor) in learning the task. A high-level implementation of such an algorithm (that is, TRPO combined with a critic), written in pseudocode, is as follows:

Initialize the policy π_θ and the critic V_φ with random weights
Initialize environment env
for episode 1..M do
    Initialize empty buffer D

    > Generate a few trajectories
    for step 1..TimeHorizon do
        > Collect experience by acting on the environment
        a_t ← π_θ(s_t)
        s_{t+1}, r_t, d_t ← env(a_t)
        s_t ← s_{t+1}

        if d_t == True:
            > Store the episode in the buffer
            D ← D ∪ (s_{1:T}, a_{1:T}, r_{1:T}, d_{1:T})    # where T is the length of the episode
            s_t ← env.reset()

    Compute the advantage values A_{1:T} and the n-step reward to go G_{1:T}

    > Estimate the gradient of the objective function
    g ← ∇_θ L_{θ_old}(θ)                                    (1)
    > Compute s using conjugate gradient
    s ← CG(g)                                               (2)
    > Compute the step length
    β ← sqrt(2δ / (sᵀ F s))                                 (3)

    > Update the policy using all the experience in D
    Backtracking line search to find the maximum α that satisfies the constraint
    θ ← θ_old + α β s                                       (4)

    > Critic update using all the experience in D
    φ ← φ − lr · (1/|D|) Σ_t ∇_φ (V_φ(s_t) − G_t)²

After this high-level overview of TRPO, we can finally start implementing it. 
