Let's now apply DDPG to a continuous task called BipedalWalker-v2, one of the environments provided by Gym that uses Box2D, a 2D physics engine. The goal is to make the agent walk as fast as possible over rough terrain. A score of 300+ is awarded for reaching the far end, but every application of the motors costs a small amount, so the more efficiently the agent moves, the less it spends. Furthermore, if the agent falls, it receives a reward of -100. The state consists of 24 floats that represent the speeds and positions of the joints and the hull, along with lidar rangefinder measurements. The agent is controlled by four continuous actions, each in the range [-1,1]. The following is a screenshot of the BipedalWalker-v2 environment:
We run DDPG with the hyperparameters given in the following table. The first row lists the hyperparameters needed to run DDPG, while the second row lists the values used in this particular case:
| Hyperparameter | Actor Learning Rate | Critic Learning Rate | DNN Architecture | Buffer Size | Batch Size | Tau |
|---|---|---|---|---|---|---|
| Value | 3e-4 | 4e-4 | [64, relu, 64, relu] | 200,000 | 64 | 0.003 |
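The tau hyperparameter controls the soft (Polyak) update of the target networks: at every step, the target weights are moved a small fraction toward the online weights rather than copied outright. A minimal NumPy sketch of this update follows; the function name `soft_update` and the toy one-matrix networks are illustrative, not taken from the chapter's code.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.003):
    """Polyak-average the online parameters into the target parameters:
    target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]

# Toy example: one weight matrix per network
online = [np.ones((2, 2))]   # online network weights
target = [np.zeros((2, 2))]  # target network weights, initially zero
target = soft_update(target, online, tau=0.003)
print(target[0][0, 0])  # 0.003 -> each target weight moved 0.3% toward the online value
```

With tau = 0.003, the target networks track the online networks slowly, which stabilizes the bootstrapped critic targets.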
During training, we added extra noise to the actions predicted by the policy; however, to measure the performance of the algorithm, we ran 10 games with a purely deterministic policy (without extra noise) every 10 episodes. The cumulative reward, averaged across the 10 games, is plotted as a function of the timesteps in the following diagram:
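The distinction between the noisy training policy and the deterministic evaluation policy can be sketched as follows. This is a minimal illustration assuming simple Gaussian exploration noise and the environment's [-1,1] action range; the noise scale and the example action values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_action(deterministic_action, noise_std=0.1):
    """Exploration: perturb the deterministic policy output with Gaussian
    noise, then clip back into the valid action range [-1, 1]."""
    noise = rng.normal(0.0, noise_std, size=deterministic_action.shape)
    return np.clip(deterministic_action + noise, -1.0, 1.0)

a = np.array([0.95, -0.2, 0.0, 1.0])  # hypothetical policy output (4 motors)
print(training_action(a))             # training action: noisy, still inside [-1, 1]
print(a)                              # evaluation action: the deterministic output itself
```

Clipping after adding noise matters here: actions near the boundary (such as the 1.0 above) would otherwise be pushed outside the range that the environment accepts.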
From the results, we can see that the performance is quite unstable, swinging from 250 to less than -100 within only a few thousand steps. DDPG is known to be unstable and very sensitive to its hyperparameters, but with more careful fine-tuning, the results might be smoother. Nonetheless, we can see that the performance increases over the first 300k steps, reaching an average score of about 100, with peaks of up to 300.
Additionally, BipedalWalker-v2 is a notoriously difficult environment to solve. Indeed, it is considered solved when the agent obtains an average reward of at least 300 points over 100 consecutive episodes. With DDPG, we weren't able to reach that performance, but we still obtained a good policy that is able to make the agent run fairly fast.
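The "solved" criterion above can be checked with a simple running window over episode rewards. A minimal sketch, assuming we track episode returns in a fixed-size buffer (the helper name `is_solved` is ours):

```python
from collections import deque

def is_solved(rewards, threshold=300.0, window=100):
    """The environment counts as solved when the average reward over the
    last `window` consecutive episodes is at least `threshold`."""
    if len(rewards) < window:
        return False  # not enough episodes yet to evaluate the criterion
    return sum(list(rewards)[-window:]) / window >= threshold

history = deque(maxlen=100)  # keeps only the 100 most recent episode returns
for episode_return in [310.0] * 100:
    history.append(episode_return)
print(is_solved(history))  # True: 100 episodes averaging 310 >= 300
```

Using a `deque` with `maxlen=100` means old episodes are dropped automatically, so the check always reflects the most recent 100 episodes.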
DDPG is a beautiful example of how a deterministic policy can be used in contrast to stochastic policies. However, because it was the first of its kind to tackle complex problems, there are many further adjustments that can be applied to it. The next algorithm proposed in this chapter takes DDPG one step further.