Dynamic programming

DP is a general algorithmic paradigm that breaks a problem into smaller, overlapping subproblems and then finds the solution to the original problem by combining the solutions of those subproblems.

DP can be used in reinforcement learning and is one of the simplest approaches. It is suited to computing optimal policies when a perfect model of the environment is available.

DP is an important stepping stone in the history of RL algorithms and provides the foundation for the next generation of algorithms, but it is computationally very expensive. DP works with MDPs that have a limited number of states and actions, as it has to update the value of each state (or action-value) while taking into consideration all the other possible states. Moreover, DP algorithms store value functions in an array or table. This way of storing information is effective and fast, as there isn't any loss of information, but it requires storing large tables. Because DP algorithms use tables to store value functions, this approach is called tabular learning. This is in contrast to approximate learning, which represents value functions with a fixed-size function approximator, such as an artificial neural network.
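As a minimal sketch of the tabular representation described above (the number of states and the values used here are illustrative assumptions, not values from the text), a state-value function can be stored as a NumPy array with one entry per state:

```python
import numpy as np

n_states = 16                # assumed: a small, enumerable state space
V = np.zeros(n_states)       # tabular state-value function, one entry per state

V[3] = 0.5                   # updating a value is a direct table write
print(V[3])                  # reading it back is a direct table lookup -> 0.5
```

The approximate alternative mentioned above would instead replace this table with a fixed-size function, such as a neural network, that maps states to values.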

DP uses bootstrapping, meaning that it improves the estimated value of a state by using the expected value of the following states. As we have already seen, bootstrapping is used in the Bellman equation. Indeed, DP applies the Bellman equations, (6) and (7), to estimate $V^*(s)$ and/or $Q^*(s,a)$. This can be done using the following:

$$V^*(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma V^*(s')\big]$$

Or by using the Q-function:

$$Q^*(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma \max_{a'} Q^*(s',a')\big]$$
Then, once the optimal value and action-value functions are found, the optimal policy can be obtained by simply taking, in each state, the action that maximizes the expectation.
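The two updates above, together with greedy policy extraction, are exactly what the value iteration form of DP carries out. The following is a minimal sketch rather than the book's implementation: the model format `P[s][a]` as a list of `(probability, next_state, reward)` tuples, the discount `gamma`, and the stopping threshold `theta` are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-6):
    """Value iteration on a known MDP model.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
    i.e. a perfect model of the environment's dynamics.
    """
    V = np.zeros(n_states)  # tabular state-value function
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality update: back up the expected value of successor states
            q_values = [
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(n_actions)
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop when the values barely change
            break

    # Greedy policy extraction: in each state, take the action maximizing the expectation
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_values = [
            sum(prob * (reward + gamma * V[s_next])
                for prob, s_next, reward in P[s][a])
            for a in range(n_actions)
        ]
        policy[s] = int(np.argmax(q_values))
    return V, policy
```

After `value_iteration` returns, `policy[s]` gives, for each state, the greedy action with respect to the estimated optimal value function.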
