For a direct comparison of TD3 and DDPG, we tested TD3 in the same environment that we used for DDPG: BipedalWalker-v2.
The best hyperparameters we found for TD3 in this environment are listed in the following table:
Hyperparameter | Actor l.r. | Critic l.r. | DNN Architecture | Buffer Size | Batch Size | Tau | Policy Update Freq | Sigma |
Value | 4e-4 | 4e-4 | [64,relu,64,relu] | 200000 | 64 | 0.005 | 2 | 0.2 |
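To make the role of the last three columns more concrete, here is a minimal, framework-free sketch of how Tau, Sigma, and the policy update frequency typically enter a TD3 update. It is not the implementation used for the results in this section: plain Python lists stand in for network weights, and scalar Q-values stand in for the target critics. The discount factor `gamma=0.99`, the noise clip of `0.5`, and the action bounds of [-1, 1] are assumptions that do not appear in the table, and all function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TD3Config:
    # Values taken from the table above
    actor_lr: float = 4e-4
    critic_lr: float = 4e-4
    hidden_sizes: tuple = (64, 64)  # [64,relu,64,relu]
    buffer_size: int = 200_000
    batch_size: int = 64
    tau: float = 0.005              # soft (Polyak) target update rate
    policy_update_freq: int = 2     # actor/targets updated every 2 critic steps
    sigma: float = 0.2              # std of the target policy smoothing noise
    # Assumptions: not listed in the table
    gamma: float = 0.99
    noise_clip: float = 0.5
    action_low: float = -1.0
    action_high: float = 1.0


def clip(x: float, low: float, high: float) -> float:
    return max(low, min(high, x))


def smoothed_target_action(mu_target: Sequence[float], cfg: TD3Config,
                           rng: random.Random) -> List[float]:
    """Target policy smoothing: add clipped Gaussian noise (std = sigma) to each
    component of the target actor's action, then clip to the action bounds."""
    return [
        clip(a + clip(rng.gauss(0.0, cfg.sigma), -cfg.noise_clip, cfg.noise_clip),
             cfg.action_low, cfg.action_high)
        for a in mu_target
    ]


def td3_target(reward: float, done: bool, q1_next: float, q2_next: float,
               cfg: TD3Config) -> float:
    """Clipped double-Q target: bootstrap from the smaller target critic value."""
    return reward + cfg.gamma * (1.0 - float(done)) * min(q1_next, q2_next)


def soft_update(target: List[float], online: Sequence[float], tau: float) -> None:
    """In-place Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for i, w in enumerate(online):
        target[i] = tau * w + (1.0 - tau) * target[i]


def should_update_actor(step: int, cfg: TD3Config) -> bool:
    """Delayed policy updates: the actor and the target networks change only
    once every `policy_update_freq` critic updates."""
    return step % cfg.policy_update_freq == 0


if __name__ == "__main__":
    cfg = TD3Config()
    rng = random.Random(0)
    # Noisy, clipped target actions for a 4-dimensional action vector
    print(smoothed_target_action([0.9, -0.2, 0.0, 1.0], cfg, rng))
    # Bootstraps from min(10.0, 12.0): 1.0 + 0.99 * 10.0, about 10.9
    print(td3_target(1.0, False, 10.0, 12.0, cfg))
    target_w, online_w = [0.0, 2.0], [1.0, 0.0]
    soft_update(target_w, online_w, cfg.tau)
    print(target_w)  # about [0.005, 1.99]
    print([s for s in range(1, 7) if should_update_actor(s, cfg)])  # [2, 4, 6]
```

With Tau = 0.005 the target networks move only half a percent toward the online networks at each update, and with a policy update frequency of 2 the critics receive twice as many updates as the actor; both choices slow down the target values, which is where much of TD3's extra stability is usually attributed.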
The result is plotted in the following diagram. The curve follows a smooth trend and reaches good results after about 300K steps, with its highest peaks at 450K steps of training. It comes very close to the goal of 300 points but never actually reaches it.
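Whether a run meets the 300-point goal can be checked programmatically from the logged episode returns. The sketch below uses a hypothetical helper, `first_solved_step`, and assumes the criterion commonly used for BipedalWalker of an average return of at least 300 over 100 consecutive episodes; the window size is an assumption, since the text above only mentions the 300-point goal. The input is an assumed list of (training step, episode return) pairs.

```python
from collections import deque
from typing import Iterable, Optional, Tuple


def first_solved_step(history: Iterable[Tuple[int, float]],
                      goal: float = 300.0, window: int = 100) -> Optional[int]:
    """Return the training step at which the mean return over the last `window`
    episodes first reaches `goal`, or None if it never does."""
    recent = deque(maxlen=window)  # sliding window of the most recent returns
    for step, episode_return in history:
        recent.append(episode_return)
        if len(recent) == window and sum(recent) / window >= goal:
            return step
    return None


# Hypothetical history that hovers just below the goal, as in the run above
near_miss = [(1000 * i, 295.0) for i in range(1, 201)]
print(first_solved_step(near_miss))  # None: close to 300 but never reaches it
```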
Finding a good set of hyperparameters took less time for TD3 than for DDPG. Although we are comparing the two algorithms on only one game, we think this gives a good first insight into their differences in terms of stability and performance. The performance of DDPG and TD3 on BipedalWalker-v2 is compared in the following plot.
TD3's superiority over DDPG is immediately clear in terms of final performance, rate of improvement, and stability.