SARSA

So far, we have presented TD learning as a general way to estimate the value function of a given policy. In practice, TD prediction on its own cannot be used as is, because it lacks the component needed to actually improve the policy. SARSA and Q-learning are two one-step, tabular TD algorithms that both estimate the value function and optimize the policy, and that can be applied to a wide variety of RL problems. In this section, we will use SARSA to learn an optimal policy for a given MDP. Then, we'll introduce Q-learning.

A concern with TD learning is that it estimates the value of a state. Think about that: in a given state, how can you choose the action with the highest next-state value? Earlier, we said that you should pick the action that moves the agent to the state with the highest value. However, without a model of the environment that provides the possible next states, you cannot know which action leads to that state. SARSA, instead of learning the state-value function, learns and applies the state-action value function, Q(s, a). Q(s, a) tells us the value of being in state s if action a is taken.
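To make this concrete, the following is a minimal sketch of a one-step tabular SARSA update in Python. The function names, the hyperparameter values (alpha, gamma, eps), and the epsilon-greedy action selection are illustrative assumptions for this sketch, not code taken from the chapter:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA update: move Q(s, a) toward the TD target
    r + gamma * Q(s', a'), where a' is the action actually chosen next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one.
    Acting (mostly) greedily with respect to Q is how the policy improves."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```

Note that the update uses the action a' that the agent will actually take in the next state, which is why SARSA is an on-policy algorithm: it evaluates and improves the same policy it uses to act.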
