Application of TRPO

The efficiency and stability of TRPO allowed us to test it on new and more complex environments. We applied it to Roboschool. Roboschool and its MuJoCo counterpart are often used as testbeds for algorithms, such as TRPO, that are able to control complex agents with continuous actions. Specifically, we tested TRPO on RoboschoolWalker2d, where the task of the agent is to learn to walk as fast as possible. This environment is shown in the following figure. An episode terminates whenever the agent falls or when more than 1,000 timesteps have passed since the start. The state is encoded in a Box class of size 22, and the agent is controlled with 6 float values in the range [-1, 1]:

Figure 7.6. Render of the RoboschoolWalker2d environment
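To get a feel for these spaces, the following snippet creates the environment and prints its observation and action spaces. This is only a minimal sketch; it assumes the roboschool package is installed, which registers RoboschoolWalker2d-v1 with OpenAI Gym when imported:

```python
import gym
import roboschool  # importing roboschool registers its environments with Gym

env = gym.make('RoboschoolWalker2d-v1')

print(env.observation_space)   # Box(22,): the 22-dimensional state
print(env.action_space)        # Box(6,): 6 continuous actions
print(env.action_space.low)    # lower bound of each action (-1)
print(env.action_space.high)   # upper bound of each action (+1)
```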

In TRPO, the number of steps that are collected from the environment for each policy update is called the time horizon. This number also determines the batch size. Moreover, it can be beneficial to run multiple agents in parallel so as to collect data that is more representative of the environment. In that case, the batch size is equal to the time horizon multiplied by the number of agents. Although our implementation is not designed to run multiple agents in parallel, the same goal can be achieved by using a time horizon that is longer than the maximum number of steps allowed in each episode. For example, since in RoboschoolWalker2d an agent has a maximum of 1,000 timesteps per episode, using a time horizon of 6,000 ensures that at least six full trajectories are collected.
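The following is a minimal sketch of such a collection loop. It assumes a hypothetical policy callable that maps an observation to an action; episodes that terminate early are simply reset, so one batch of time_horizon steps can contain several full trajectories:

```python
import numpy as np

def collect_batch(env, policy, time_horizon=6000):
    # Roll out the policy until `time_horizon` steps have been gathered.
    # `policy` is a hypothetical callable: observation -> action.
    obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
    obs = env.reset()
    for _ in range(time_horizon):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        obs_buf.append(obs)
        act_buf.append(action)
        rew_buf.append(reward)
        done_buf.append(done)
        # start a new trajectory whenever the current one terminates
        obs = env.reset() if done else next_obs
    return (np.array(obs_buf), np.array(act_buf),
            np.array(rew_buf), np.array(done_buf))
```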

We ran TRPO with the hyperparameters reported in the following table. The third column also shows the standard range for each hyperparameter:

Hyperparameter                                  For RoboschoolWalker2d    Range
Conjugate iterations                            10                        [7-10]
Delta (δ)                                       0.01                      [0.005-0.03]
Batch size (time horizon * number of agents)    6,000                     [500-20,000]
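As a rough sketch, these settings could be gathered in a simple configuration dictionary; the key names below are hypothetical and only mirror the table above:

```python
# Hypothetical configuration mirroring the table above (single agent)
trpo_hyperparams = {
    'conjugate_iterations': 10,   # iterations of the conjugate gradient solver
    'delta': 0.01,                # KL-divergence constraint defining the trust region
    'time_horizon': 6000,         # steps per batch; batch size = time_horizon * n_agents
    'number_of_agents': 1,
}
```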


The progress of TRPO (and PPO, as we'll see in the next section) can be monitored by looking specifically at the total reward accumulated in each game and the state values predicted by the critic.
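One way to track these two quantities is to write them as TensorBoard scalars. The sketch below assumes TensorFlow 1.x and a tf.summary.FileWriter created elsewhere; the function name and tags are placeholders:

```python
import tensorflow as tf

def log_progress(file_writer, step, total_reward, mean_state_value):
    # Write the episode reward and the critic's mean predicted value
    # as scalar summaries so they can be inspected in TensorBoard.
    summary = tf.Summary(value=[
        tf.Summary.Value(tag='total_reward', simple_value=total_reward),
        tf.Summary.Value(tag='mean_state_value', simple_value=mean_state_value),
    ])
    file_writer.add_summary(summary, step)
    file_writer.flush()
```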

We trained for 6 million steps, and the resulting performance is shown in the following diagram. Within 2 million steps, the agent is able to reach a good score of 1,300 and to walk fluently at a moderate speed. In the first phase of training, we can note a transition period in which the score decreases slightly, probably because the policy gets stuck in a local optimum. After that, the agent recovers and improves until it reaches a score of 1,250:

Figure 7.7. Learning curve of TRPO on RoboschoolWalker2d

The predicted state value also offers an important metric for studying the results. It is generally more stable than the total reward and easier to analyze. The plot is shown in the following diagram. Indeed, it confirms our hypothesis, showing a smoother curve overall, despite a few spikes around 4 million and 4.5 million steps:

Figure 7.8. State values predicted by the critic of TRPO on RoboschoolWalker2d

From this plot, it is also easier to see that, after the first 3 million steps, the agent continues to learn, even if at a very slow rate.

As you saw, TRPO is a fairly complex algorithm with many moving parts. Nonetheless, it is proof of the effectiveness of constraining the policy inside a trust region so that it does not deviate too much from the current distribution.

But can we design a simpler and more general algorithm that uses the same underlying approach?
