As Q-learning is a TD method, it needs a behavior policy that, as time passes, converges to a deterministic policy. A good strategy is to use an ε-greedy policy with linear or exponential decay (as was done for SARSA).
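As a quick illustration of the two schedules, a linear decay subtracts a fixed amount from ε at every episode, while an exponential decay multiplies it by a constant factor. The following is a minimal sketch; the function names and constants are illustrative, not taken from the text:

```python
def linear_decay(eps, eps_min=0.01, step=1e-3):
    # Subtract a fixed amount each episode, never dropping below eps_min
    return max(eps_min, eps - step)

def exponential_decay(eps, eps_min=0.01, factor=0.999):
    # Multiply eps by a constant factor each episode
    return max(eps_min, eps * factor)
```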
To recap, the Q-learning algorithm uses the following:
- A target greedy policy that constantly improves
- A behavior ε-greedy policy to interact with and explore the environment
With these observations in place, we can finally write down the pseudocode for the Q-learning algorithm:
Initialize Q(s, a) for every state-action pair
for N episodes:
    Initialize s
    while s is not a final state:
        a <- ε-greedy(Q, s)                                    # behavior ε-greedy policy
        s', r <- env(a)                                        # env() takes a step in the environment
        Q(s, a) <- Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))   # Q-learning update
        s <- s'
In practice, the learning rate α usually takes values between 0.5 and 0.001, and the discount factor γ ranges from 0.9 to 0.999.
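To make the pseudocode concrete, here is a minimal tabular implementation sketch in Python. It assumes a classic Gym-style discrete environment where `reset()` returns the state and `step()` returns `(state, reward, done, info)`; the environment name, hyperparameter values, and helper names are assumptions for illustration, not a reference implementation:

```python
import numpy as np
import gym

def eps_greedy(Q, s, eps):
    # With probability eps pick a random action, otherwise the greedy one
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.95,
               eps=1.0, eps_min=0.01, eps_decay=0.999):
    # Initialize Q(s, a) for every state-action pair (here: zeros)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(Q, s, eps)           # behavior policy: eps-greedy
            s_next, r, done, _ = env.step(a)    # take a step in the environment
            # Target policy is greedy: bootstrap from max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)     # exponential decay of eps
    return Q

if __name__ == '__main__':
    env = gym.make('FrozenLake-v1')  # any discrete Gym environment works
    Q = q_learning(env)
```

Note how the two policies from the recap appear in the loop: `eps_greedy` selects the actions actually taken (behavior), while the `max` inside the update bootstraps from the greedy (target) policy, which is what makes Q-learning off-policy.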