Q-learning algorithm

Solving a reinforcement learning problem means estimating an evaluation function during the learning process. This function must be able to assess, through the sum of the rewards, how advantageous (or otherwise) a policy is. The basic idea of Q-learning is that the algorithm learns the optimal evaluation function over the whole space of states and actions (S × A).

The so-called Q-function provides a mapping of the form Q: S × A → V, where V is the value of the future rewards of an action a ∈ A executed in the state s ∈ S.

Once it has learned the optimal function Q, the agent can recognize which action leads to the highest future reward in a given state s.
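As a minimal sketch of this idea, the following snippet picks the action with the highest Q-value in one state. The action names and the numeric values are made up for illustration, and greedy_action is a hypothetical helper, not something defined in the text.

```python
# Hypothetical learned Q-values for a single state s: action -> Q(s; a).
q_values_in_s = {"left": 0.2, "right": 0.8, "stay": 0.5}


def greedy_action(q_values):
    """Return the action with the highest estimated future reward."""
    return max(q_values, key=q_values.get)


best_action = greedy_action(q_values_in_s)
print(best_action)  # right
```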

One of the most common ways to implement the Q-learning algorithm is to use a table. Each cell of the table holds a value, Q(s; a) = V, initialized to 0.
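Such a table might be sketched as a nested list with one row per state and one column per action. The sizes n_states and n_actions are assumptions chosen only to keep the example small.

```python
n_states = 4   # assumed size of S
n_actions = 2  # assumed size of A

# One row per state, one column per action; every Q(s; a) starts at 0.
q_table = [[0.0] * n_actions for _ in range(n_states)]

print(len(q_table), len(q_table[0]))  # 4 2
print(q_table[1][0])                  # 0.0
```

Building each row inside the comprehension gives every state its own list, so updating one cell does not affect the others.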

The agent can perform any action a ∈ A, where A is the total set of actions known by the agent. The core of the algorithm is the training rule, which updates a table element Q(s; a).

The algorithm follows these basic steps:

Initialize Q(s; a) arbitrarily
Repeat (for each episode):
  Initialize s
  Repeat (for each step of episode):
    Choose an action a ∈ A from s ∈ S using a policy
      derived from Q
    Take action a, observe r, s'
    Q(s; a) ← Q(s; a) + α · (r + γ · max_a' Q(s'; a') − Q(s; a))
    s ← s'
  Until s is terminal

The parameters used in the Q-value update process are as follows:

α is the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value, such as 0.9, means that learning can occur quickly.
γ is the discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set to less than 1 for the algorithm to converge.
max_a' Q(s'; a') is the maximum estimated value attainable from the state following the current one, that is, the expected future reward for taking the optimal action thereafter.
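The steps and parameters above can be sketched in Python as follows. This is only an illustration under stated assumptions: the five-state corridor environment, the step function, and the epsilon-greedy policy (one common way of deriving a policy from Q) are not taken from the text, and all names are hypothetical.

```python
import random

N_STATES = 5          # assumed toy corridor: states 0..4
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GOAL = N_STATES - 1   # state 4 is terminal


def step(state, action):
    """Toy environment (an assumption): reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy policy derived from Q (an assumed choice of policy)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    best = max(q_table[state])
    # Break ties at random so an untrained agent does not always go left.
    return rng.choice([a for a in ACTIONS if q_table[state][a] == best])


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s; a) to 0 for every state-action pair.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0                                  # Initialize s
        while state != GOAL:                       # Until s is terminal
            action = choose_action(q_table, state, epsilon, rng)
            reward, next_state = step(state, action)
            # r + gamma * max over a' of Q(s'; a')
            target = reward + gamma * max(q_table[next_state])
            # Q(s; a) <- Q(s; a) + alpha * (target - Q(s; a))
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state                     # s <- s'
    return q_table


q_table = q_learning()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)  # [1, 1, 1, 1]
```

In this toy setting the learned values for moving right approach 1, 0.9, 0.81, and 0.729 as the distance to the goal grows, which shows how the discount factor makes later rewards worth less.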

For better understanding, we have depicted the algorithm in the following figure:

Q-learning algorithm