Applying DDPG to BipedalWalker-v2

Let's now apply DDPG to BipedalWalker-v2, a continuous control task and one of the Gym environments built on Box2D, a 2D physics engine. The goal is to make the agent walk as fast as possible over rough terrain. A score of 300+ is awarded for reaching the far end of the terrain, but every application of the motors costs a small amount, so the more efficiently the agent moves, the less it pays. Furthermore, if the agent falls, it receives a reward of -100. The state consists of 24 floats that represent the speeds and positions of the joints and the hull, along with LIDAR rangefinder measurements. The agent is controlled by four continuous actions, each in the range [-1, 1]. The following is a screenshot of the BipedalWalker-v2 environment:

Screenshot of the BipedalWalker-v2 environment
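Before training, it can be useful to inspect the observation and action spaces directly. The following is a minimal sketch (not the book's exact code) that assumes the classic Gym API used by the -v2 environments:

```python
# Minimal sketch: inspect BipedalWalker-v2's spaces with the classic Gym API.
import gym

env = gym.make('BipedalWalker-v2')

print(env.observation_space)  # Box(24,): joint/hull speeds and positions, plus LIDAR readings
print(env.action_space)       # Box(4,): four continuous torques in the range [-1, 1]

obs = env.reset()
# Take one random step just to see the transition signature
obs, reward, done, info = env.step(env.action_space.sample())
```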

We run DDPG with the hyperparameters given in the following table. The first column lists the hyperparameters needed to run DDPG, while the second column lists the values used in this particular case:

Hyperparameter          Value
Actor learning rate     3e-4
Critic learning rate    4e-4
DNN architecture        [64, ReLU, 64, ReLU]
Buffer size             200000
Batch size              64
Tau                     0.003
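One convenient way to keep these settings together is a plain dictionary passed to the training function. The key names below are illustrative, not the variable names used in the book's code:

```python
# Hypothetical hyperparameter dictionary mirroring the table above.
hyperparams = {
    'actor_lr': 3e-4,           # actor learning rate
    'critic_lr': 4e-4,          # critic learning rate
    'hidden_layers': [64, 64],  # two hidden layers, each followed by a ReLU
    'buffer_size': 200000,      # replay buffer capacity
    'batch_size': 64,           # minibatch size sampled from the buffer
    'tau': 0.003,               # soft-update coefficient for the target networks
}
```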

During training, we added extra noise to the actions predicted by the policy; however, to measure the performance of the algorithm, we ran 10 games with the purely deterministic policy (without extra noise) every 10 episodes. The cumulative reward, averaged across the 10 games, is plotted as a function of the timesteps in the following diagram:

Performance of the DDPG algorithm on BipedalWalker-v2
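The evaluation procedure can be sketched as follows. This is not the book's exact code; it assumes a `policy(obs)` callable that returns the deterministic action for an observation:

```python
import numpy as np

def evaluate(env, policy, n_games=10):
    """Play n_games with the deterministic policy (no exploration noise)
    and return the average cumulative reward."""
    returns = []
    for _ in range(n_games):
        obs, done, total_reward = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                       # deterministic action, no added noise
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        returns.append(total_reward)
    return np.mean(returns)
```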

From the results, we can see that the performance is quite unstable, swinging from around 250 to less than -100 within only a few thousand steps. DDPG is known to be unstable and very sensitive to its hyperparameters, but with more careful fine-tuning, the results could be smoother. Nonetheless, we can see that the performance increases during the first 300k steps, reaching an average score of about 100, with peaks of up to 300.

Additionally, BipedalWalker-v2 is a notoriously difficult environment to solve. Indeed, it is considered solved when the agent obtains an average reward of at least 300 points over 100 consecutive episodes. With DDPG, we weren't able to reach that level of performance, but we still obtained a good policy that makes the agent run fairly fast.

In our implementation, we used a constant exploratory factor. With a more sophisticated noise process, you could probably reach higher performance in fewer iterations. For example, the DDPG paper uses an Ornstein-Uhlenbeck process to generate temporally correlated exploration noise. You can start from that process if you wish.
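As a starting point, here is a minimal sketch of Ornstein-Uhlenbeck noise. The parameters theta=0.15 and sigma=0.2 are the values commonly used with DDPG; the class name and defaults are illustrative rather than taken from the book's code:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.reset()

    def reset(self):
        # Restart the process from the mean at the beginning of each episode
        self.x = np.ones(self.size) * self.mu

    def sample(self):
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size)
        return self.x

# Usage sketch: add the correlated noise to the deterministic action and clip to [-1, 1]
# noise = OrnsteinUhlenbeckNoise(size=4)
# action = np.clip(policy(obs) + noise.sample(), -1, 1)
```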

DDPG is a beautiful example of how a deterministic policy can be used in place of a stochastic one. However, because it was the first of its kind to deal with complex problems, there are many further adjustments that can be applied to it. The next algorithm proposed in this chapter takes DDPG one step further.
