The algorithm

As Q-learning is a TD method, it needs a behavior policy that, as time passes, converges to a deterministic policy. A good strategy is to use an ε-greedy policy with linear or exponential decay (as was done for SARSA).
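As a concrete sketch of the two decay schedules, consider the following; the function and parameter names (eps_start, eps_end, decay_steps, decay_rate) are illustrative assumptions, not from the text:

    def linear_eps(step, eps_start=1.0, eps_end=0.01, decay_steps=10000):
        # Linear decay: move from eps_start to eps_end over decay_steps steps
        frac = min(step / decay_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)

    def exponential_eps(step, eps_start=1.0, eps_end=0.01, decay_rate=0.999):
        # Exponential decay: shrink eps by decay_rate each step, floored at eps_end
        return max(eps_end, eps_start * decay_rate ** step)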

To recap, the Q-learning algorithm uses the following:

  • A target greedy policy that constantly improves
  • A behavior ε-greedy policy that interacts with and explores the environment
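In code, the two policies might look as follows, assuming Q is a NumPy array of shape (n_states, n_actions); this is only a sketch:

    import numpy as np

    def greedy(Q, s):
        # Target policy: always pick the action with the highest Q-value
        return int(np.argmax(Q[s]))

    def eps_greedy(Q, s, eps):
        # Behavior policy: with probability eps take a random action,
        # otherwise act greedily with respect to Q
        if np.random.random() < eps:
            return np.random.randint(Q.shape[1])
        return greedy(Q, s)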

With these observations in place, we can write the following pseudocode for the Q-learning algorithm:

    Initialize Q(s, a) for every state-action pair
    Choose α ∈ (0, 1] and the number of episodes

    for each episode:
        Initialize s
        while s is not a final state:
            a ← ε-greedy(Q, s)
            s', r, done ← env(a)      # env(a) takes a step in the environment
            Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
            s ← s'
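The pseudocode translates almost line by line into NumPy. The following is a minimal sketch against a Gymnasium-style discrete environment; the environment name FrozenLake-v1 and the hyperparameter defaults are illustrative choices, not prescribed by the text:

    import numpy as np
    import gymnasium as gym

    def q_learning(env, alpha=0.1, gamma=0.95, episodes=5000,
                   eps_start=1.0, eps_end=0.01, eps_decay=0.999):
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        eps = eps_start
        for _ in range(episodes):
            s, _ = env.reset()
            done = False
            while not done:
                # Behavior policy: eps-greedy action selection
                if np.random.random() < eps:
                    a = env.action_space.sample()
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                # Off-policy TD update toward the greedy target max_a' Q(s', a')
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
                s = s_next
            # Exponentially decay the exploration rate after each episode
            eps = max(eps_end, eps * eps_decay)
        return Q

    Q = q_learning(gym.make("FrozenLake-v1"))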

In practice, α usually takes values between 0.5 and 0.001, and γ ranges from 0.9 to 0.999.
