The algorithm

Basically, all the observations we made for the TD update are also valid for SARSA. Once we apply them to the definition of the Q-function, we obtain the SARSA update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

$\alpha$ is the learning rate, a coefficient that determines how much the action value is updated at each step. $\gamma$ is the discount factor, a coefficient between 0 and 1 used to give less importance to values that come from distant future decisions (short-term actions are preferred to long-term ones). A visual interpretation of the SARSA update is given in figure 4.3.
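As a concrete illustration, the following is a minimal sketch of this update applied to a tabular Q-function in Python. The function name, the array layout (one row per state, one column per action), and the default values of alpha and gamma are illustrative assumptions, not code from this chapter:

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # Q is a 2D NumPy array of action values: Q[state, action].
    # Move Q[s, a] toward the TD target r + gamma * Q[s_next, a_next]
    # by a step of size alpha (the learning rate).
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q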

The name SARSA comes from the elements the update is based on: the state, $s_t$; the action, $a_t$; the reward, $r_t$; the next state, $s_{t+1}$; and finally, the next action, $a_{t+1}$. Putting everything together, it forms the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$, as can be seen in figure 4.3:

Figure 4.3 SARSA update

SARSA is an on-policy algorithm. On-policy means that the policy used to collect experience through interaction with the environment (called the behavior policy) is the same policy that is being updated. The on-policy nature of the method comes from using the current policy to select the next action, $a_{t+1}$, to estimate $Q(s_{t+1}, a_{t+1})$, and from the assumption that on the following step the agent will actually follow that same policy (that is, it will execute action $a_{t+1}$).

On-policy algorithms are usually simpler than off-policy algorithms, but they are less powerful and usually require more data to learn. Despite this, as with TD learning, SARSA is guaranteed to converge to the optimal policy if every state-action pair is visited an infinite number of times and if the policy, over time, becomes deterministic. Practical algorithms use an $\epsilon$-greedy policy with an $\epsilon$ that decays toward zero, or toward a value close to it. The pseudocode of SARSA is summarized in the following code block. In the pseudocode, we used an $\epsilon$-greedy policy, but any strategy that encourages exploration can be used:

Initialize Q(s, a) for every state-action pair
Choose the learning rate α and the discount factor γ
for N episodes:
    s ← env.reset()
    a ← ε-greedy(Q, s)
    while s is not a final state:
        s', r, done ← env(a)    # env() takes a step in the environment
        a' ← ε-greedy(Q, s')
        Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') - Q(s, a)]
        s ← s'
        a ← a'
ε-greedy(Q, s) is a function that implements the ε-greedy strategy. Note that, in the next iteration, SARSA executes the same action a' that was selected and used in the previous step to update the state-action value.
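To make the pseudocode concrete, here is a minimal runnable sketch of tabular SARSA in Python against a Gym environment with discrete states and actions. The environment name (FrozenLake-v0), the hyperparameter values, and the linear ε decay are illustrative assumptions; the sketch also assumes the classic Gym API, in which reset() returns only the observation and step() returns a 4-tuple (newer Gym/Gymnasium releases changed these signatures):

import numpy as np
import gym

def eps_greedy(Q, s, eps):
    # With probability eps pick a random action, otherwise the greedy one
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa(env, episodes=5000, alpha=0.1, gamma=0.95,
          eps=1.0, eps_min=0.01, eps_decay=0.0005):
    # Tabular Q-function: one row per state, one column per action
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            # Take a step in the environment with the action chosen by the behavior policy
            s_next, r, done, _ = env.step(a)
            # Select the next action with the same policy: this is what makes SARSA on-policy
            a_next = eps_greedy(Q, s_next, eps)
            # SARSA update: move Q(s, a) toward the TD target r + gamma * Q(s', a')
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
        # Decay epsilon so that the policy becomes (almost) deterministic over time
        eps = max(eps_min, eps - eps_decay)
    return Q

if __name__ == "__main__":
    Q = sarsa(gym.make("FrozenLake-v0"))
    print(Q)

Note how a_next is chosen before the update and then becomes the action actually executed on the next loop iteration, which is exactly the on-policy behavior described above.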
