Markov decision process and the Bellman equation

At the heart of RL is the Markov decision process (MDP). An MDP is often described as a discrete-time stochastic control process. In simpler terms, this just means it is a control process that advances in time steps, choosing among actions whose outcomes are partly random, with each action yielding a reward. This framework is already widely used for automation and control in robotics, drones, and networking, and of course in RL. The classic way we picture this process is shown in the following diagram:



The Markov decision process

We represent an MDP as a tuple or vector (S, A, P, R, γ), using the following variables (a small code sketch follows the list):

  • S - a finite set of states
  • A - a finite set of actions
  • Pa(s, s') - the probability that action a in state s at time t will lead to state s' at time t+1
  • Ra(s, s') - the immediate reward received after moving from state s to state s' as a result of action a
  • γ - gamma, a discount factor we apply in order to reduce or increase the significance of future rewards
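
To make the tuple concrete, here is a minimal Python sketch of a tiny two-state MDP. The state names, actions, probabilities, and rewards are invented purely for illustration and are not taken from the diagram above:

```python
# A minimal sketch of an MDP (S, A, P, R, gamma) as plain Python data.
# All names and numbers below are hypothetical, for illustration only.

states = ["s0", "s1"]            # S: finite set of states
actions = ["stay", "move"]       # A: finite set of actions

# P[(s, a)] is a list of (next_state, probability) pairs: Pa(s, s').
P = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "move"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.7), ("s1", 0.3)],
}

# R[(s, a, s_next)] is the immediate reward for that transition: Ra(s, s').
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.0,
}

gamma = 0.9                      # discount factor for future rewards

mdp = (states, actions, P, R, gamma)
```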

To understand the diagram, picture yourself as an agent in one of the states. You then select an action at random, according to the action probabilities. Taking the action moves you to the next state and gives you a reward, and you update the action probabilities based on that reward. Again, David Silver covers this piece very well in his lectures.
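
As a rough sketch of that loop, the snippet below reuses the hypothetical states, actions, P, and R from the previous example: the agent picks a random action, samples the next state from the transition probabilities, and collects the reward (it does not yet learn anything from that reward):

```python
import random

def step(state, action, P, R):
    """Sample the next state and reward for taking `action` in `state`."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    reward = R[(state, action, next_state)]
    return next_state, reward

# Roll out a few steps, always taking a random action, as described above.
state = "s0"
for t in range(5):
    action = random.choice(actions)
    state, reward = step(state, action, P, R)
    print(f"t={t} action={action} -> state={state} reward={reward}")
```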

Now, the preceding process works, but another variation came along that provided better evaluation of future rewards, and that was done by introducing the Bellman equation and the concept of policy/value iteration. Whereas before we had a value, V(s), we now have a policy, π, for a value function called Vπ(s), and this yields us a new equation, shown here:

Vπ(s) = Σa π(a|s) Σs' Pa(s, s') [ Ra(s, s') + γ Vπ(s') ]
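
To show how this equation can be applied, here is a hedged sketch of iterative policy evaluation on the hypothetical MDP defined earlier, assuming a uniform random policy π(a|s). It repeatedly applies the Bellman expectation equation above until the value estimates settle:

```python
def policy_evaluation(states, actions, P, R, gamma, policy, sweeps=100):
    """Estimate V_pi by repeatedly applying the Bellman expectation equation."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = sum(
                policy[(s, a)] * sum(
                    p * (R[(s, a, s_next)] + gamma * V[s_next])
                    for s_next, p in P[(s, a)]
                )
                for a in actions
            )
    return V

# A uniform random policy: pi(a|s) = 1 / |A| for every state and action.
uniform_policy = {(s, a): 1.0 / len(actions) for s in states for a in actions}
V_pi = policy_evaluation(states, actions, P, R, gamma, uniform_policy)
print(V_pi)
```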

We won't cover much more about this equation other than to say that you should keep the concept of quality iteration in mind. In the next section, we will see how we can reduce this equation to a quality indicator for each action and use that for Q-Learning.
