As Q-learning is a TD method, it needs a behavior policy that, as time passes, converges to a deterministic policy. A good strategy is to use an ε-greedy policy with linear or exponential decay (as was done for SARSA).
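As a quick illustration of the two schedules, a linear decay subtracts a fixed amount from ε at every episode, while an exponential decay multiplies it by a constant factor. The following is a minimal sketch; the function names and constants are illustrative, not taken from the text:

```python
def linear_decay(eps, eps_min=0.01, step=1e-3):
    # Subtract a fixed amount each episode, never dropping below eps_min
    return max(eps_min, eps - step)

def exponential_decay(eps, eps_min=0.01, factor=0.999):
    # Multiply eps by a constant factor each episode
    return max(eps_min, eps * factor)
```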
To recap, the Q-learning algorithm uses the following:
- A target greedy policy that constantly improves
- A behavior ε-greedy policy to interact with and explore the environment
With these observations in place, we can finally write down the pseudocode for the Q-learning algorithm:
Initialize Q(s, a) for every state-action pair
for N episodes:
    Initialize s
    while s is not a final state:
        a <- ε-greedy(Q, s)                                    # behavior ε-greedy policy
        s', r <- env(a)                                        # env() takes a step in the environment
        Q(s, a) <- Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))   # Q-learning update
        s <- s'
In practice, the learning rate α usually takes values between 0.5 and 0.001, and the discount factor γ ranges from 0.9 to 0.999.
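To make the pseudocode concrete, here is a minimal tabular implementation sketch in Python. It assumes a classic Gym-style discrete environment where `reset()` returns the state and `step()` returns `(state, reward, done, info)`; the environment name, hyperparameter values, and helper names are assumptions for illustration, not a reference implementation:

```python
import numpy as np
import gym

def eps_greedy(Q, s, eps):
    # With probability eps pick a random action, otherwise the greedy one
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.95,
               eps=1.0, eps_min=0.01, eps_decay=0.999):
    # Initialize Q(s, a) for every state-action pair (here: zeros)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(Q, s, eps)           # behavior policy: eps-greedy
            s_next, r, done, _ = env.step(a)    # take a step in the environment
            # Target policy is greedy: bootstrap from max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)     # exponential decay of eps
    return Q

if __name__ == '__main__':
    env = gym.make('FrozenLake-v1')  # any discrete Gym environment works
    Q = q_learning(env)
```

Note how the two policies from the recap appear in the loop: `eps_greedy` selects the actions actually taken (behavior), while the `max` inside the update bootstraps from the greedy (target) policy, which is what makes Q-learning off-policy.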