Markov processes and the Bellman equation – A more formal introduction to RL

Following the long history of modern deep learning being a continuation of quantitative finance with more GPUs, the theoretical foundation of reinforcement learning lies in Markov models.

Note

This section requires a bit of mathematical background knowledge. If you are struggling, there is a beautiful visual introduction by Victor Powell here: http://setosa.io/ev/markov-chains/.

A more formal, but still simple, introduction is available on the website Analytics Vidhya: https://www.analyticsvidhya.com/blog/2014/07/markov-chain-simplified/.

A Markov model describes a stochastic process with different states in which the probability of ending up in a specific state is purely dependent on the state one is currently in. In the following diagram, you can see a simple Markov model describing recommendations given for a stock:

Figure: The Markov model

As you can see, there are three states in this model: BUY, HOLD, and SELL. Between every pair of states, there is a transition probability. For example, the probability that a stock gets a BUY recommendation if it had a HOLD recommendation in the previous round is described by $P(\text{BUY} \mid \text{HOLD})$, which is equal to 0.5. In other words, there is a 50% chance that a stock that is currently in HOLD will move to BUY in the next round.

States are associated with rewards. If you own stock, and that stock has a BUY recommendation, the stock will go up, and you will earn a reward of 1. If the stock has a SELL recommendation, you will receive a negative reward, or punishment, of -1.
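
To make this concrete, the following sketch encodes the recommendation chain as a transition table and simulates it for a few steps. Only $P(\text{BUY} \mid \text{HOLD}) = 0.5$ comes from the diagram; the other transition probabilities, the HOLD reward of 0, and the helper names are placeholders chosen for illustration:

```python
import numpy as np

# Recommendation Markov chain. Only P(BUY | HOLD) = 0.5 is given in the text;
# the remaining transition probabilities and the HOLD reward are placeholders.
states = ["BUY", "HOLD", "SELL"]
rewards = {"BUY": 1, "HOLD": 0, "SELL": -1}

# transition_probs[current][next_state] = probability of that transition
transition_probs = {
    "BUY":  {"BUY": 0.6, "HOLD": 0.3, "SELL": 0.1},
    "HOLD": {"BUY": 0.5, "HOLD": 0.3, "SELL": 0.2},
    "SELL": {"BUY": 0.1, "HOLD": 0.3, "SELL": 0.6},
}

def simulate(start, steps, seed=0):
    """Roll the chain forward and accumulate the (undiscounted) reward."""
    rng = np.random.default_rng(seed)
    state, total_reward = start, 0
    for _ in range(steps):
        total_reward += rewards[state]
        next_states = list(transition_probs[state])
        probs = list(transition_probs[state].values())
        state = str(rng.choice(next_states, p=probs))
    return total_reward

print(simulate("HOLD", steps=10))
```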

Note

In some textbooks, the rewards are associated with state transitions and not states themselves. It turns out to be mathematically equivalent, and for the ease of notation, we are associating the rewards with states here.

In a Markov model, an agent can follow a policy, usually denoted as $\pi(a \mid s)$. A policy describes the probability of taking action a when in state s. Say you are a trader: you own stock and that stock gets a SELL recommendation. In that case, you might choose to sell the stock in 50% of cases, hold the stock in 30% of cases, and buy more in 20% of cases. In other words, your policy for the state SELL can be described as follows:

$$\pi(\text{sell} \mid \text{SELL}) = 0.5$$
$$\pi(\text{hold} \mid \text{SELL}) = 0.3$$
$$\pi(\text{buy} \mid \text{SELL}) = 0.2$$
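
As a minimal sketch, the SELL-state policy above can be written as a probability table from which actions are sampled. The dictionary layout and function name are illustrative, not the book's code:

```python
import numpy as np

# The trader's policy for the SELL state, exactly as given above:
# pi(sell | SELL) = 0.5, pi(hold | SELL) = 0.3, pi(buy | SELL) = 0.2
policy = {"SELL": {"sell": 0.5, "hold": 0.3, "buy": 0.2}}

def sample_action(state, seed=None):
    """Draw an action a with probability pi(a | state)."""
    rng = np.random.default_rng(seed)
    actions = list(policy[state])
    probs = list(policy[state].values())
    return str(rng.choice(actions, p=probs))

print(sample_action("SELL"))  # for example, 'hold'
```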

Some traders have a better policy and can make more money from a state than others. Therefore, the value of state s depends on the policy, $\pi$. The value function, $V_\pi(s)$, describes the value of state s when policy $\pi$ is followed. It is the expected return from state s when policy $\pi$ is followed:

$$V_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

The return, $G_t$, is the reward gained immediately plus the discounted future rewards:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \ldots$$
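
The following short sketch computes this return for a list of rewards, both as the discounted sum and via the recursive form $G_t = r_{t+1} + \gamma G_{t+1}$ that the Bellman derivation below relies on. The discount factor $\gamma = 0.9$ and the reward values are assumptions chosen purely for illustration:

```python
GAMMA = 0.9  # assumed discount factor for illustration

def discounted_return(rewards, gamma=GAMMA):
    """G = r_1 + gamma * r_2 + gamma**2 * r_3 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma=GAMMA):
    """The same value via the recursion G_t = r_{t+1} + gamma * G_{t+1}."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

rewards = [1, 1, -1, 1]  # made-up reward sequence
assert abs(discounted_return(rewards) - discounted_return_recursive(rewards)) < 1e-12
```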

The other value function frequently used in RL is the function Q(s, a), which we have already seen in the previous section. Q describes the expected return of taking action a in state s if policy $\pi$ is followed:

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

Note

We use the expected value since our environment and our actions are stochastic. We cannot say for certain that we will land in a specific state; we can only give a probability.

Q and V describe the same thing. If we find ourselves in a certain state, what should we do? V gives recommendations regarding which state we should seek, and Q gives advice on which action we should take. Of course, V implicitly assumes we have to take some action and Q assumes that the result of our actions is landing in some state. In fact, both Q and V are derived from the so-called Bellman equation, which brings us back to the Markov model from the beginning of this section.

If you assume that the environment you operate in can be described as a Markov model, you would really want to know two things. First, you would want to find out the state transition probabilities. If you are in state s, what is the chance that you end up in state s' if you take action a? Mathematically, it is as follows:

$$P^a_{ss'} = \Pr\left(S_{t+1} = s' \mid S_t = s, A_t = a\right)$$

Equally, you would be interested in the expected reward, $R^a_{ss'}$, of being in state s, taking action a, and ending up in state s':

$$R^a_{ss'} = \mathbb{E}\left[r_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'\right]$$
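
If you do not know $P^a_{ss'}$ and $R^a_{ss'}$, you can estimate them from observed transitions by counting. The logged tuples and helper names below are hypothetical and meant only to show the bookkeeping:

```python
from collections import defaultdict

# Hypothetical logged experience: (state, action, reward, next_state) tuples.
transitions = [
    ("HOLD", "buy", 1, "BUY"),
    ("HOLD", "buy", 1, "BUY"),
    ("HOLD", "buy", -1, "SELL"),
    ("BUY", "hold", 1, "BUY"),
]

visit_counts = defaultdict(int)      # (s, a)      -> how often we took a in s
outcome_counts = defaultdict(int)    # (s, a, s')  -> how often that led to s'
reward_sums = defaultdict(float)     # (s, a, s')  -> summed rewards on the way

for s, a, r, s_next in transitions:
    visit_counts[(s, a)] += 1
    outcome_counts[(s, a, s_next)] += 1
    reward_sums[(s, a, s_next)] += r

def P(s_next, s, a):
    """Estimate of the transition probability P^a_{ss'}."""
    return outcome_counts[(s, a, s_next)] / visit_counts[(s, a)]

def R(s, a, s_next):
    """Estimate of the expected reward R^a_{ss'}."""
    return reward_sums[(s, a, s_next)] / outcome_counts[(s, a, s_next)]

print(P("BUY", "HOLD", "buy"))  # 2/3
print(R("HOLD", "buy", "BUY"))  # 1.0
```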

With this in mind, we can now derive the two Bellman equations for Q and V. First, we rewrite the equation describing V so that it contains the actual formula for the return, $G_t$:

$$V_\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid S_t = s\right]$$

We can pull the first reward out of the sum:

$$V_\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma \left(r_{t+2} + \gamma r_{t+3} + \ldots\right) \mid S_t = s\right]$$

The first part of our expectation is the expected reward we receive directly from being in state s and following policy $\pi$:

$$\mathbb{E}_\pi\left[r_{t+1} \mid S_t = s\right] = \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} R^a_{ss'}$$

The preceding equation shows a nested sum. First, we sum over all actions, a, weighted by their probability of occurrence under policy $\pi$. For each action, we then sum the expected rewards $R^a_{ss'}$ of the transitions from state s to each possible next state, s', after action a, weighted by the probability of each transition occurring, the transition probability $P^a_{ss'}$.

The second part of our expectation can be rewritten as follows:

$$\mathbb{E}_\pi\left[\gamma \left(r_{t+2} + \gamma r_{t+3} + \ldots\right) \mid S_t = s\right] = \gamma \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \, \mathbb{E}_\pi\left[r_{t+2} + \gamma r_{t+3} + \ldots \mid S_{t+1} = s'\right]$$

The expected discounted value of the future rewards after state s is the discounted expected value of each possible next state, s', weighted by the probability of reaching it, $P^a_{ss'}$, and the probability of action a being taken under policy $\pi$.

This formula is quite a mouthful, but it gives us a glimpse into the recursive nature of the value function. If we now substitute both parts back into our expression for the value function, it becomes clearer:

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma \, \mathbb{E}_\pi\left[r_{t+2} + \gamma r_{t+3} + \ldots \mid S_{t+1} = s'\right]\right]$$

The inner expectation represents the value function of the next state, s'! That means we can replace it with the value function, $V_\pi(s')$:

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V_\pi(s')\right]$$
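
A direct use of this equation is iterative policy evaluation: start with arbitrary values and repeatedly apply the right-hand side until $V_\pi$ stops changing. The sketch below does this for the recommendation chain; the policy, transition probabilities, rewards, and $\gamma = 0.9$ are placeholders, since the text only specifies a single transition probability:

```python
# Iterative policy evaluation: apply the Bellman equation for V until it converges.
# The policy, dynamics, rewards, and gamma below are illustrative placeholders.
states = ["BUY", "HOLD", "SELL"]
actions = ["buy", "hold", "sell"]
gamma = 0.9

# pi[s][a]: probability of taking action a in state s (assumed uniform)
pi = {s: {a: 1 / len(actions) for a in actions} for s in states}

# P[(s, a)][s2]: transition probability; R[(s, a, s2)]: expected reward
# (in this toy model the action does not change the dynamics)
P = {(s, a): {"BUY": 0.4, "HOLD": 0.4, "SELL": 0.2} for s in states for a in actions}
R = {(s, a, s2): {"BUY": 1, "HOLD": 0, "SELL": -1}[s2]
     for s in states for a in actions for s2 in states}

V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: sum(pi[s][a] * sum(P[(s, a)][s2] * (R[(s, a, s2)] + gamma * V[s2])
                               for s2 in states)
                for a in actions)
         for s in states}

print({s: round(v, 3) for s, v in V.items()})
```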

Following the same logic, we can derive the Q function as follows:

$$Q_\pi(s, a) = \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V_\pi(s')\right]$$
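
Once $V_\pi$ has converged, one more application of the equation above yields $Q_\pi$. This snippet assumes the `P`, `R`, `gamma`, `states`, `actions`, and `V` placeholders from the previous sketch are still in scope:

```python
# Assumes P, R, gamma, states, actions, and the converged V from the previous
# sketch are still in scope.
def Q(s, a):
    """One application of the Bellman equation for Q, using V for the next state."""
    return sum(P[(s, a)][s2] * (R[(s, a, s2)] + gamma * V[s2]) for s2 in states)

print({(s, a): round(Q(s, a), 3) for s in states for a in actions})
```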

Congratulations, you have just derived the Bellman equation! Pause for a second to ponder these equations and make sure you really understand the mechanics behind them. The core idea is that the value of a state can be expressed in terms of the values of other states. For a long time, the go-to approach to solving the Bellman equation was to build an explicit model of the underlying Markov process, that is, its state transition probabilities and expected rewards.

However, the recursive structure of the equation lends itself to a technique called dynamic programming. The idea behind dynamic programming is to solve easier sub-problems first. You've already seen this in action in the Catch example. There, we used a neural network to estimate $Q(s, a)$. For states that ended the game, finding the value is easy: it is the final reward received at the end of the game. It was for these states that the neural network first developed an accurate estimate of the function Q. From there, it could work backward and learn the values of states that were further away from the end of the game. This combination of dynamic programming and model-free learning has many more applications in reinforcement learning.
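
A toy tabular example makes this backward propagation visible. In the hypothetical five-state chain below, only the transition into the terminal state pays a reward, and each Bellman sweep pushes accurate values one step further back from the end, which mirrors how the Catch network first learned the value of game-ending states:

```python
# Hypothetical five-state chain: moving right from state 3 into terminal state 4
# pays a reward of 1; everything else pays 0.
gamma = 0.9
n_states = 5
V = [0.0] * n_states  # the terminal state keeps a value of 0

for sweep in range(1, 5):
    new_V = list(V)
    for s in range(n_states - 1):
        reward = 1.0 if s + 1 == n_states - 1 else 0.0
        new_V[s] = reward + gamma * V[s + 1]  # deterministic "move right" dynamics
    V = new_V
    print(f"after sweep {sweep}: {[round(v, 2) for v in V]}")
# Accurate values appear next to the terminal state first, then spread backward.
```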

Before we jump into the different kinds of systems that can be built using this theoretical foundation, we will pay a brief visit to the applications of the Bellman equation in economics. Readers who are familiar with the work discussed here will find reference points that they can use to develop a deeper understanding of Bellman equations. Readers unfamiliar with these works will find inspiration for further reading and applications for techniques discussed throughout this chapter.

The Bellman equation in economics

While the Bellman equation was first applied to economics in 1954, Robert C. Merton's 1973 article, An Intertemporal Capital Asset Pricing Model (http://www.people.hbs.edu/rmerton/Intertemporal%20Capital%20Asset%20Pricing%20Model.pdf), is perhaps its most well-known application. Using the Bellman equation, Merton developed a capital asset pricing model that, unlike the classic CAPM, works in continuous time and can account for changes in investment opportunities.

The recursiveness of the Bellman equation inspired the subfield of recursive economics. Nancy Stokey, Robert Lucas, and Edward Prescott wrote an influential 1989 book titled Recursive Methods in Economic Dynamics (http://www.hup.harvard.edu/catalog.php?isbn=9780674750968) in which they apply the recursive approach to solve problems in economic theory. This book inspired others to use recursive economics to address a wide range of economic problems, from the principal-agent problem to optimal economic growth.

Avinash Dixit and Robert Pindyck developed and applied the approach successfully to capital budgeting in their 1994 book, Investment Under Uncertainty (https://press.princeton.edu/titles/5474.html). Patrick Anderson applied it to the valuation of private businesses in his 2009 article, The Value of Private Businesses in the United States (https://www.andersoneconomicgroup.com/the-value-of-private-businesses-in-the-united-states/).

While recursive economics still faces many challenges, including the tremendous computational power it requires, it is a promising subfield of economics.
