Q-learning

Q-learning is another TD algorithm, with some useful features that distinguish it from SARSA. From TD learning, Q-learning inherits the ability to learn at each step (one-step learning) and the ability to learn from experience without a model of the environment.

The most distinctive feature of Q-learning compared to SARSA is that it's an off-policy algorithm. As a reminder, off-policy means that the update can be made independently of the policy that gathered the experience, so off-policy algorithms can reuse old experiences to improve the policy. To distinguish between the policy that interacts with the environment and the one that is being improved, we call the former the behavior policy and the latter the target policy.
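As a rough illustration of this distinction (the function names and the epsilon-greedy behavior policy below are our own, not taken from this section), the two algorithms differ only in how they bootstrap the TD target: SARSA uses the action actually chosen by the behavior policy, while Q-learning uses the greedy action of the target policy, regardless of what the behavior policy does next.

```python
import numpy as np

# Minimal sketch, assuming a tabular Q of shape (n_states, n_actions);
# gamma is the discount factor. These are illustrative choices.

def sarsa_target(Q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap with the action the behavior policy actually took
    return reward + gamma * Q[next_state, next_action]

def q_learning_target(Q, reward, next_state, gamma):
    # Off-policy: bootstrap with the greedy (target-policy) action,
    # independently of the action the behavior policy will take next
    return reward + gamma * np.max(Q[next_state])
```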

Here, we'll explain the basic version of the algorithm for the tabular case, but it can easily be adapted to work with function approximators such as artificial neural networks. In fact, in the next chapter, we'll implement a more sophisticated version of the algorithm that uses deep neural networks and reuses previous experiences to exploit the full capabilities of off-policy algorithms.

But first, let's see how Q-learning works, formalize the update rule, and write a pseudocode version that brings all the components together.
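To anticipate how these pieces fit together, here is a minimal sketch of a tabular Q-learning loop, assuming a classic Gym-style environment interface (discrete observation and action spaces, `reset()` returning a state, `step()` returning a four-tuple) and illustrative hyperparameters; it is not the book's pseudocode, only a compact reference implementation of the same idea.

```python
import numpy as np

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular action-value estimates, one entry per (state, action) pair
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current estimates
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])

            next_state, reward, done, _ = env.step(action)

            # Q-learning update: the TD target bootstraps with the greedy
            # (target-policy) action, which is what makes it off-policy
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])

            state = next_state
    return Q
```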
