Theory

The idea of Q-learning is to approximate the Q-function by using the current optimal action value. The Q-learning update is very similar to the SARSA update, with the exception that it takes the maximum state-action value of the next state:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$

Here, $\alpha$ is the usual learning rate and $\gamma$ is the discount factor.
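As a concrete illustration, the following is a minimal sketch of this update applied to a tabular Q estimate. The function name q_learning_update and the default values of alpha and gamma are illustrative assumptions, not part of the original text.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning update on a tabular Q of shape (n_states, n_actions)."""
    # The target bootstraps from the maximum action value in the next state
    # (the greedy target policy), regardless of which action is taken next.
    td_target = r + gamma * np.max(Q[s_next])
    # Move the current estimate toward the target by the learning rate alpha.
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q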

While the SARSA update follows the behavior policy (for example, an $\epsilon$-greedy policy), the Q-learning update follows the greedy target policy that results from taking the maximum action value. If this concept is not clear yet, take a look at figure 4.7. While in SARSA (figure 4.3) both actions $A_t$ and $A_{t+1}$ come from the same behavior policy, in Q-learning the action used in the update target is the one that maximizes the state-action value in the next state. Because the Q-learning update no longer depends on the behavior policy (which is used only for sampling from the environment), it is an off-policy algorithm.

Figure 4.7. Q-learning update
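To make the off-policy nature concrete, the sketch below pairs an $\epsilon$-greedy behavior policy (used only to sample transitions) with the greedy maximum used in the update target. The env.reset()/env.step() interface, the train signature, and the hyperparameter values are assumptions for illustration only.

import numpy as np

def eps_greedy(Q, s, eps, rng):
    """Behavior policy: explore with probability eps, otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))  # random exploratory action
    return int(np.argmax(Q[s]))               # greedy action

def train(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Hypothetical training loop over an environment with a Gym-like interface."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(Q, s, eps, rng)    # sample action from the behavior policy
            s_next, r, done = env.step(a)     # assumed to return (next state, reward, done)
            # The target uses the max over actions (greedy target policy), not the
            # action the behavior policy will actually take next: hence off-policy.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q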