Summary

An RL problem can be formalized as an MDP, which provides an abstract framework for goal-directed learning problems. An MDP is defined by a set of states, actions, rewards, and transition probabilities, and solving an MDP means finding a policy that maximizes the expected return from each state. The Markov property is intrinsic to the MDP and ensures that future states depend only on the current state, not on the history that led to it.
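As a quick refresher, the following minimal sketch shows one possible way to represent a small MDP in Python; the two states, the action names, and the transition tuples are purely illustrative and are not taken from the chapter:

```python
# Illustrative toy MDP: mdp[state][action] is a list of
# (transition probability, next state, reward) tuples.
# The Markov property is built in: transitions depend only on
# the current state and action, not on how we got there.
mdp = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "s0", 0.0)],
    },
}

states = list(mdp.keys())
actions = {s: list(mdp[s].keys()) for s in states}
```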

Starting from the definition of an MDP, we formulated the concepts of policy, return, expected return, action-value function, and value function. The latter two can be defined recursively in terms of the values of the successor states; these recursive relations are called the Bellman equations. They are useful because they provide a way to compute value functions iteratively, and the optimal value functions can then be used to derive the optimal policy.
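For instance, the Bellman expectation equation for the state-value function can be read as a one-step backup. The sketch below assumes the toy mdp dictionary format from the previous snippet and a policy stored as policy[state] = {action: probability}; both conventions are assumptions made only for illustration:

```python
def bellman_backup(mdp, policy, V, state, gamma=0.99):
    """One Bellman expectation backup: the value of `state` under
    `policy`, written in terms of the values of its successor states."""
    value = 0.0
    for action, prob_a in policy[state].items():
        for prob_s, next_state, reward in mdp[state][action]:
            value += prob_a * prob_s * (reward + gamma * V[next_state])
    return value
```

Applying this backup repeatedly to every state is exactly the kind of iterative computation described above.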

RL algorithms can be categorized as model-based or model-free. While the former require a model of the environment to plan the next actions, the latter are independent of a model and can learn by interacting directly with the environment. Model-free algorithms can be further divided into policy gradient and value function algorithms. Policy gradient algorithms optimize the policy directly through gradient ascent on the expected return and are typically on-policy. Value function algorithms are usually off-policy; they learn an action-value function or a value function and derive the policy from it. The two approaches can also be combined, giving rise to methods that share the advantages of both worlds.

DP is the first family of model-based algorithms that we looked at in depth. It is applicable whenever the full model of the environment is known and the environment has a limited number of states and actions. DP algorithms use bootstrapping to estimate the value of a state, and they learn the optimal policy through two processes: policy evaluation and policy improvement. Policy evaluation computes the state-value function of an arbitrary policy, while policy improvement improves that policy using the value function obtained from policy evaluation.
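As a rough sketch of these two processes, again assuming the illustrative mdp and policy dictionary formats used above (the discount factor and stopping threshold are arbitrary choices), policy evaluation and policy improvement might look like this:

```python
def policy_evaluation(mdp, policy, gamma=0.99, tol=1e-6):
    """Sweep the Bellman expectation backup over all states until the
    value function stops changing by more than `tol`."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            new_v = sum(
                prob_a * prob_s * (reward + gamma * V[s2])
                for a, prob_a in policy[s].items()
                for prob_s, s2, reward in mdp[s][a]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def policy_improvement(mdp, V, gamma=0.99):
    """Make the policy greedy with respect to the current value function."""
    new_policy = {}
    for s in mdp:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
             for a in mdp[s]}
        best = max(q, key=q.get)
        new_policy[s] = {a: (1.0 if a == best else 0.0) for a in mdp[s]}
    return new_policy
```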

By combining policy evaluation and policy improvement, we obtain the policy iteration algorithm and the value iteration algorithm. The main difference between the two is that policy iteration alternates complete sweeps of policy evaluation and policy improvement, whereas value iteration combines the two processes in a single update.
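Concretely, and still under the same illustrative conventions, policy iteration would call policy_evaluation and policy_improvement in a loop until the policy stops changing, whereas value iteration folds the improvement step into the backup itself by taking a maximum over actions:

```python
def value_iteration(mdp, gamma=0.99, tol=1e-6):
    """Single combined update: back up each state with the max over
    actions instead of running a full policy evaluation first."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            new_v = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Extract a greedy policy from the (approximately) optimal values.
    policy = {}
    for s in mdp:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
             for a in mdp[s]}
        best = max(q, key=q.get)
        policy[s] = {a: (1.0 if a == best else 0.0) for a in mdp[s]}
    return V, policy
```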

Though DP suffers from the curse of dimensionality (the number of states grows exponentially with the number of state variables), the ideas behind policy evaluation and policy improvement are key to almost all RL algorithms, which use a generalized version of these two processes.

Another disadvantage of DP is that it requires an exact model of the environment, which prevents it from being applied to many problems of practical interest.

In the next chapter, you'll see how V-functions and Q-functions can be used to learn a policy in problems where the model is unknown, by sampling directly from the environment.
