A known model

When a model is known, it can be used to simulate complete trajectories and compute the return of each one. The actions of the trajectory with the highest return are then chosen. This process is called planning, and the model of the environment is indispensable, as it provides the information required to produce the next state and the reward, given a state and an action.
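As a concrete illustration, here is a minimal sketch of how a known model can be used to simulate a trajectory and compute its return. The model(state, action) -> (next_state, reward, done) interface and the toy model are assumptions made purely for illustration, not code from this chapter:

```python
# A minimal sketch: planning relies on a model that, given a state and an
# action, returns the next state and the reward. The interface below is a
# hypothetical one chosen only for illustration.

def simulate_trajectory(model, start_state, actions, discount=0.99):
    """Roll out a fixed sequence of actions with the model and
    return the discounted return of the simulated trajectory."""
    state, ret = start_state, 0.0
    for t, action in enumerate(actions):
        state, reward, done = model(state, action)
        ret += (discount ** t) * reward
        if done:
            break
    return ret

# Toy model (illustrative): the state is a position on a line, actions move
# it left or right, and the reward is higher the closer the state is to 0.
def toy_model(state, action):
    next_state = state + action            # action in {-1, 0, +1}
    reward = -abs(next_state)
    return next_state, reward, False

print(simulate_trajectory(toy_model, 3, [-1, -1, -1]))
```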

Planning algorithms are used in many domains, but the ones we are interested in differ in the type of action space on which they operate. Some of them work with discrete actions, others with continuous actions.

Planning algorithms for discrete actions are usually search algorithms that build a decision tree, such as the one illustrated in the following diagram:

The current state is the root, the possible actions are represented by the arrows, and the other nodes are the states that are reached following a sequence of actions.

You can see that by trying every possible sequence of actions, you'll eventually find the optimal one. Unfortunately, in most problems this procedure is intractable, as the number of possible action sequences grows exponentially with the planning horizon. Planning algorithms for complex problems therefore adopt strategies that rely on a limited number of trajectories.
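To make the exponential blow-up concrete, the following hedged sketch performs exhaustive planning over discrete actions using the same hypothetical model interface as before; with |A| actions and a horizon of d steps it enumerates |A| to the power d action sequences, which is exactly what makes brute-force search intractable for deep horizons:

```python
# Illustrative brute-force planner: enumerate every action sequence up to
# a fixed depth and keep the one with the highest discounted return.
# The toy model and its interface are assumptions used only for this sketch.

def toy_model(state, action):
    next_state = state + action            # action in {-1, 0, +1}
    return next_state, -abs(next_state), False

def exhaustive_plan(model, state, actions, depth, discount=0.99):
    """Return (best_return, best_action_sequence) found by brute force."""
    if depth == 0:
        return 0.0, []
    best_ret, best_seq = float("-inf"), []
    for action in actions:
        next_state, reward, done = model(state, action)
        if done:
            future, seq = 0.0, []
        else:
            future, seq = exhaustive_plan(model, next_state, actions,
                                          depth - 1, discount)
        ret = reward + discount * future
        if ret > best_ret:
            best_ret, best_seq = ret, [action] + seq
    return best_ret, best_seq

# With 3 actions and a horizon of 10 steps, the search already visits
# 3 ** 10 = 59,049 leaves; the cost doubles or triples with every extra step.
print(exhaustive_plan(toy_model, 3, [-1, 0, 1], depth=5))
```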

One such algorithm, also adopted in AlphaGo, is Monte Carlo Tree Search (MCTS). MCTS iteratively builds a decision tree by generating a finite number of simulated games, while still exploring parts of the tree that haven't been visited yet. Once a simulated game, or trajectory, reaches a leaf (that is, the game ends), the result is backpropagated through the states visited, updating the win/loss statistics or reward held by each node. Finally, the action that leads to the next state with the best win/loss ratio or reward is taken.
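The sketch below is a much-simplified, UCT-style illustration of these four phases (selection, expansion, simulation, backpropagation); it is not AlphaGo's full algorithm, which additionally guides the search with policy and value networks. The toy model, action set, and hyperparameters are assumptions, and terminal-state handling is deliberately simplified:

```python
# Illustrative, simplified MCTS (UCT variant) over a hypothetical
# deterministic model(state, action) -> (next_state, reward, done).
import math
import random

ACTIONS = [-1, 0, 1]

def toy_model(state, action):
    next_state = state + action
    reward = -abs(next_state)            # closer to 0 is better
    done = abs(next_state) > 5           # episode ends if we drift too far
    return next_state, reward, done

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}               # action -> child Node
        self.visits = 0
        self.value = 0.0                 # sum of returns seen through this node

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCT score (mean value + exploration bonus)."""
    return max(node.children.items(),
               key=lambda kv: kv[1].value / kv[1].visits
               + c * math.sqrt(math.log(node.visits) / kv[1].visits))

def rollout(model, state, depth=10):
    """Simulate a random trajectory from the given state and return its total reward."""
    total = 0.0
    for _ in range(depth):
        state, reward, done = model(state, random.choice(ACTIONS))
        total += reward
        if done:
            break
    return total

def mcts(model, root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the current node is fully expanded.
        while len(node.children) == len(ACTIONS):
            _, node = uct_select(node)
        # 2. Expansion: try one action that hasn't been tried from this node.
        action = random.choice([a for a in ACTIONS if a not in node.children])
        next_state, reward, done = model(node.state, action)
        child = Node(next_state, parent=node)
        node.children[action] = child
        # 3. Simulation: random rollout from the newly added node.
        ret = reward + (0.0 if done else rollout(model, next_state))
        # 4. Backpropagation: update statistics along the path to the root.
        while child is not None:
            child.visits += 1
            child.value += ret
            child = child.parent
    # Act with the most visited action at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(toy_model, root_state=3))
```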

On the other hand, planning algorithms that operate with continuous actions involve trajectory optimization techniques. These are much more difficult to solve than their discrete-action counterparts, as they deal with an infinite-dimensional optimization problem.

Furthermore, many of them require the gradient of the model. An example is Model Predictive Control (MPC), which optimizes over a finite time horizon but, instead of executing the whole trajectory it has found, executes only the first action and then re-plans from the resulting state. In this way, MPC responds faster than methods that plan over an infinite time horizon.
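As a gradient-free illustration of the MPC loop, the sketch below uses random shooting rather than gradient-based optimization: at every step it samples candidate action sequences over a finite horizon, evaluates them with a hypothetical model, and executes only the first action of the best candidate before re-planning. All names, the toy model, and the parameters are illustrative assumptions:

```python
# Illustrative MPC with random shooting over a hypothetical model with a
# continuous scalar action (a gradient-based variant would instead
# differentiate through the model to improve the candidate plans).
import random

def toy_model(state, action):
    next_state = state + action             # action is a real number
    return next_state, -abs(next_state), False

def evaluate(model, state, plan, discount=0.99):
    """Discounted return of a candidate action sequence."""
    total = 0.0
    for t, action in enumerate(plan):
        state, reward, done = model(state, action)
        total += (discount ** t) * reward
        if done:
            break
    return total

def mpc_action(model, state, horizon=10, candidates=200):
    """Sample random plans over a finite horizon, keep the best one,
    and return only its first action (re-planning happens every step)."""
    best_plan, best_ret = None, float("-inf")
    for _ in range(candidates):
        plan = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = evaluate(model, state, plan)
        if ret > best_ret:
            best_plan, best_ret = plan, ret
    return best_plan[0]

# Control loop: re-plan at every step and apply only the first action.
state = 3.0
for _ in range(5):
    action = mpc_action(toy_model, state)
    state, reward, _ = toy_model(state, action)
    print(round(state, 2), round(reward, 2))
```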
