MDP

An MDP formalizes the problem of sequential decision-making, where actions influence the next states and the rewards that follow. MDPs are general and flexible enough to formalize the problem of learning to achieve a goal through interaction with an environment, which is exactly the problem addressed by RL. Thus, we can express and reason about RL problems in terms of MDPs.

An MDP is a four-tuple (S, A, P, R):

  • S is the state space, a finite set of states.
  • A is the action space, a finite set of actions.
  • P is the transition function, which defines the probability of reaching a state, s′, from state s by taking action a. It is equal to the conditional probability of s′ given s and a: P(s′, s, a) = p(s′|s, a).
  • R is the reward function, which determines the value received for transitioning to state s′ after taking action a from state s. (A minimal code sketch of this four-tuple follows the list.)
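As a concrete illustration of this four-tuple, the following is a minimal sketch of how it could be stored in plain Python dictionaries. The state names, action names, probabilities, and rewards are invented purely for illustration and do not correspond to the MDP in the figure below.

# S: the state space, a finite set of states
states = ["s0", "s1", "s2"]

# A: the action space, a finite set of actions
actions = ["a0", "a1"]

# P: transition function; P[(s, a)] maps each reachable next state s' to p(s'|s, a).
# The probabilities for every (s, a) pair must sum to 1.
P = {
    ("s0", "a0"): {"s0": 0.1, "s1": 0.9},
    ("s0", "a1"): {"s2": 1.0},
    ("s1", "a0"): {"s1": 0.5, "s2": 0.5},
    ("s1", "a1"): {"s0": 1.0},
    ("s2", "a0"): {"s2": 1.0},
    ("s2", "a1"): {"s2": 1.0},
}

# R: reward function; R[(s, a, s')] is the value received for reaching s'
# after taking action a from state s.
R = {
    ("s0", "a0", "s0"): 0.0,
    ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s2"): -1.0,
    ("s1", "a0", "s1"): 0.0,
    ("s1", "a0", "s2"): 2.0,
    ("s1", "a1", "s0"): 0.0,
    ("s2", "a0", "s2"): 0.0,
    ("s2", "a1", "s2"): 0.0,
}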

An illustration of an MDP is given in the following diagram. The arrows represent the transitions between states, with the transition probabilities attached to the tail of each arrow and the rewards placed on its body. By definition, the transition probabilities leaving a state must add up to 1. In this example, the final state is represented with a square (state S5), and for simplicity, we have represented an MDP with a single action:

Figure 3.1 Example of an MDP with five states and one action

The MDP unfolds over a sequence of discrete time steps that form a trajectory of states and actions (S0, A0, S1, A1, ...), where the states evolve according to the dynamics of the MDP, namely the state transition function, p(s′|s, a). In this way, the transition function fully characterizes the environment's dynamics.
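To make the role of the transition function concrete, here is a small sketch that generates a trajectory by repeatedly sampling the next state from p(s′|s, a). It reuses the P, R, and actions dictionaries from the sketch above; the uniformly random policy and the helper names are illustrative choices, not part of any standard API.

import random

def sample_next_state(P, s, a):
    """Draw s' with probability p(s'|s, a) from the transition dictionary."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    return random.choices(next_states, weights=probs, k=1)[0]

def sample_trajectory(P, R, actions, s0, steps=5):
    """Roll out (state, action, reward) triples under a uniformly random policy."""
    trajectory = []
    s = s0
    for _ in range(steps):
        a = random.choice(actions)            # placeholder policy
        s_next = sample_next_state(P, s, a)   # follow the environment's dynamics
        trajectory.append((s, a, R[(s, a, s_next)]))
        s = s_next
    return trajectory

print(sample_trajectory(P, R, actions, "s0"))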

By definition, the transition function and the reward function depend only on the current state and action, and not on the sequence of previously visited states. This is called the Markov property: the process is memoryless, and the future state depends only on the current one, not on its history. Thus, the current state holds all the information needed to predict the future. A system with this property is called fully observable.
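In symbols, the Markov property for the transition dynamics states that conditioning on the whole history gives the same probability as conditioning on the current state and action alone:

p(St+1 | St, At, St-1, At-1, ..., S0, A0) = p(St+1 | St, At)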

In many practical RL cases, the Markov property does not hold. As a practical workaround, we can still treat the problem as an MDP by building the state from a finite number of previous states (a finite history): St, St-1, St-2, ..., St-k. In this case, the system is partially observable and the states are called observations. We'll use this strategy in the Atari games, where we'll use raw pixels as the input of the agent. A single frame is static and carries no information about the speed or direction of the moving objects; these quantities can be approximately recovered by combining the previous three or four frames, as sketched below.
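As a sketch of this workaround, the following class keeps the last k observations in a buffer and stacks them into a single input for the agent. The class name, the 84x84 frame size, and k=4 are illustrative assumptions, not part of any specific library.

from collections import deque

import numpy as np

class FrameStack:
    """Keep the last k frames and stack them into one observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode, fill the buffer with copies of the first frame
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self._stacked()

    def step(self, frame):
        # Append the newest frame; the oldest one is dropped automatically
        self.frames.append(frame)
        return self._stacked()

    def _stacked(self):
        # Stack along a new leading axis: shape (k, height, width)
        return np.stack(self.frames, axis=0)

# Usage with dummy 84x84 grayscale frames (placeholder values)
stack = FrameStack(k=4)
obs = stack.reset(np.zeros((84, 84), dtype=np.uint8))
obs = stack.step(np.random.randint(0, 255, (84, 84), dtype=np.uint8))
print(obs.shape)  # (4, 84, 84)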

The final objective of an MDP is to find a policy, π, that maximizes the cumulative reward, Σt Rπ(t), where Rπ(t) is the reward obtained at step t by following the policy, π. An MDP is solved when the policy takes the best possible action in every state of the MDP. This policy is known as the optimal policy.
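To make the notion of cumulative reward concrete, here is a minimal sketch that simulates a fixed, deterministic policy on the toy MDP from the earlier sketches and sums the rewards it collects. It reuses states, P, R, and sample_next_state from above; the policy, starting state, and step limit are arbitrary, and this is a simulated return rather than a method for finding the optimal policy.

def cumulative_reward(P, R, policy, s0, steps=10):
    """Sum the rewards collected by following a deterministic policy for a few steps."""
    total, s = 0.0, s0
    for _ in range(steps):
        a = policy[s]                         # action chosen by the policy in state s
        s_next = sample_next_state(P, s, a)   # sample the environment transition
        total += R[(s, a, s_next)]
        s = s_next
    return total

# Example: a policy that always chooses action "a0" in every state
always_a0 = {s: "a0" for s in states}
print(cumulative_reward(P, R, always_a0, "s0"))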
