DDPG and TD3 Applications

In the previous chapter, we completed our overview of the major policy gradient algorithms. Because they can handle continuous action spaces, these algorithms are applied to very complex and sophisticated control systems. Some of them, such as TRPO, use second-order information, while others rely on different strategies, to limit the size of each policy update and so prevent unexpectedly bad behavior. However, the main concern with this type of algorithm is its poor sample efficiency: a large amount of experience is needed to master a task. This drawback comes from the on-policy nature of these algorithms, which requires new experience to be collected each time the policy is updated.

In this chapter, we will introduce a new type of off-policy actor-critic algorithm that learns a deterministic target policy while exploring the environment with a stochastic policy. We call these methods deterministic policy gradient methods, because the policy they learn is deterministic. We'll first show how these algorithms work and how closely they are related to Q-learning methods. Then, we'll present two deterministic policy gradient algorithms: deep deterministic policy gradient (DDPG) and its successor, twin delayed deep deterministic policy gradient (TD3). You'll get a sense of their capabilities by implementing them and applying them to a new environment.
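
To make the distinction between the deterministic target policy and the stochastic exploration policy more concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the linear policy, the function names (`deterministic_policy`, `behavior_policy`), the weights, the action bounds of [-1, 1], and the choice of Gaussian noise are all hypothetical and are not taken from this chapter's implementations.

```python
import numpy as np

def deterministic_policy(state, weights):
    """Hypothetical deterministic policy: the same state always maps to the same action.

    A linear map followed by tanh keeps each action component in (-1, 1).
    """
    return np.tanh(weights @ state)

def behavior_policy(state, weights, noise_std, rng):
    """Hypothetical stochastic exploration policy.

    Takes the deterministic action, adds Gaussian noise (an assumed noise
    model), and clips the result to the assumed action bounds [-1, 1].
    """
    action = deterministic_policy(state, weights)
    noisy_action = action + rng.normal(0.0, noise_std, size=action.shape)
    return np.clip(noisy_action, -1.0, 1.0)

rng = np.random.default_rng(0)

# Arbitrary illustrative values: a 3-dimensional state and a 2-dimensional action.
weights = np.array([[0.5, -0.2, 0.1],
                    [0.3, 0.8, -0.5]])
state = np.array([0.1, -0.4, 0.7])

# The policy being learned: always the same output for this state.
target_action = deterministic_policy(state, weights)

# The policy used to collect experience: scattered around the target action.
explore_actions = np.stack(
    [behavior_policy(state, weights, 0.1, rng) for _ in range(5)]
)

print("deterministic action:", target_action)
print("exploratory actions:\n", explore_actions)
```

Because the experience is generated by a policy that differs from the one being learned, transitions gathered this way can be reused after the learned policy changes, which is what makes an off-policy method possible.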

The following topics will be covered in this chapter:

  • Combining policy gradient optimization with Q-learning
  • Deep deterministic policy gradient (DDPG)
  • Twin delayed deep deterministic policy gradient (TD3)