Combining policy gradient optimization with Q-learning

Throughout this book, we have covered two main families of model-free algorithms: those based on the gradient of the policy, and those based on the value function. From the first family, we saw REINFORCE, actor-critic, PPO, and TRPO. From the second, we saw Q-learning, SARSA, and DQN. Besides the way in which the two families learn a policy (policy gradient algorithms use stochastic gradient ascent in the direction of the steepest increase of the estimated return, while value-based algorithms learn an action value for each state-action pair and then derive a policy from it), there are key differences that can make us prefer one family over the other. These are the on-policy or off-policy nature of the algorithms, and their ability to handle large action spaces. We already discussed the differences between on-policy and off-policy learning in the previous chapters, but it is important to understand them well in order to fully appreciate the algorithms introduced in this chapter.
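As a quick recap in standard notation (here $\alpha$ is the learning rate, $\gamma$ the discount factor, and $\hat{A}$ an estimate of the advantage; the exact estimators vary by algorithm), the two kinds of update can be sketched as follows. Policy gradient methods perform stochastic gradient ascent on the expected return $J$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \hat{A}(s, a)\big]$$

Q-learning instead performs a one-step temporal-difference update of the action value:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$$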

Off-policy learning can use previous experiences to refine the current policy, even though that experience comes from a different distribution. DQN benefits from this by storing all the transitions that the agent has collected throughout its lifetime in a replay buffer, and by sampling mini-batches from that buffer to update the target policy. At the opposite end of the spectrum is on-policy learning, which requires the experience to be generated by the current policy. This means that old experiences cannot be used, and every time the policy is updated, the old data has to be discarded. Because off-policy learning can reuse data multiple times, it requires fewer interactions with the environment to learn a task. In cases where acquiring new samples is expensive or difficult, this difference matters a great deal, and choosing an off-policy algorithm can be vital.
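To make the mechanism concrete, here is a minimal sketch of a replay buffer in Python (the class name, capacity, and batch size are illustrative, not the implementation used elsewhere in the book): transitions are stored regardless of which policy generated them, and random mini-batches can be sampled and reused across many updates.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Minimal experience replay buffer, in the spirit of the one used by DQN."""

    def __init__(self, capacity=100000):
        # Once the capacity is reached, the oldest transitions are dropped.
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        # Store a transition regardless of which policy generated it.
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        # Draw a random mini-batch; the same transition can be reused many times.
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, next_obs, dones = map(np.array, zip(*batch))
        return obs, actions, rewards, next_obs, dones
```

An on-policy method, by contrast, would have to throw away the contents of such a buffer after every policy update, which is exactly why it needs more interactions with the environment.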

The second factor concerns the action space. As we saw in Chapter 7, TRPO and PPO Implementation, policy gradient algorithms can deal with very large and continuous action spaces. Unfortunately, the same does not hold for Q-learning algorithms. To choose an action, they have to perform a maximization over the entire action space, which becomes intractable whenever that space is very large or continuous. Thus, Q-learning algorithms can be applied to arbitrarily complex problems (with a very large state space), but their action space has to be limited.
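As an illustration (a sketch with made-up Q-values, not any specific network), selecting the greedy action over a handful of discrete actions is a single argmax, whereas with a continuous action space there is no finite vector of values to enumerate, and computing max_a Q(s, a) becomes an optimization problem in its own right:

```python
import numpy as np


def greedy_action(q_values):
    # With N discrete actions, the maximization is just an argmax over N numbers.
    return int(np.argmax(q_values))


# Hypothetical Q-values for a state with four discrete actions.
q_s = np.array([0.1, 1.3, -0.4, 0.7])
print(greedy_action(q_s))  # -> 1

# With a continuous action (for example, a torque in [-1, 1]) there is no such
# vector: max_a Q(s, a) would require solving an inner optimization at every
# step, which is what makes plain Q-learning impractical in those settings.
```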

In conclusion, neither family is always preferable to the other, and the choice is mostly task dependent. Nevertheless, their advantages and disadvantages are largely complementary, which raises the question: is it possible to combine the benefits of both families in a single algorithm?
