Chapter 7. Reinforcement Learning for Financial Markets

Humans don't learn from millions of labeled examples. Instead, we often learn from positive or negative experiences that we associate with our actions. Children who touch a hot stove once will never touch it again. Learning from experiences and the associated rewards or punishments is the core idea behind reinforcement learning (RL). RL allows us to learn sophisticated decision-making policies without requiring any labeled data at all. Through this approach, several high-profile breakthroughs occurred in AI, such as AlphaGo, which beat the world Go champion in 2016.

In finance, RL is making inroads as well. In its 2017 report, Machine learning in investment management (https://www.ahl.com/machine-learning), Man AHL outlined a reinforcement learning system for order routing in the FX and futures markets. Order routing is a classic problem in quantitative finance. When placing an order, funds can usually choose from several brokers and place their orders at different times. The goal is to fill the order as cheaply as possible. This also means minimizing market impact, as large orders can move the prices of stocks.

Traditional routing algorithms with colorful names such as Sniper or Guerilla rely on statistics from historical data and smart engineering. The RL-based routing system learned an optimal routing policy by itself. The advantage of this approach is that the system can adapt to changing markets, which is why it outperforms traditional methods in data-rich markets such as FX.

However, RL can do more. Researchers at OpenAI have used RL to predict when agents will collaborate or fight. Meanwhile, researchers at DeepMind have used RL to gain new insights into the workings of the frontal cortex in the brain and the role of the neurotransmitter dopamine.

This chapter will start with an intuitive introduction to RL using a simple "catch the fruit" game. We will then dive into the underlying theory before covering more advanced RL applications. The examples in this chapter rely on visualizations that are not easily rendered in Kaggle kernels. To keep them simple, the example algorithms are also not optimized for GPU usage. It is, therefore, best to run these examples on your local machine.

The algorithms in this chapter run relatively quickly, so you won't have to wait too long for them to finish. The chapter code was written on a Mid-2012 MacBook Pro, and no example took longer than 20 minutes to run on that machine. Of course, you can also run the code on Kaggle; however, the visualizations will not work there.

Catch – a quick guide to reinforcement learning

Catch is a straightforward arcade game that you might have played as a child. Fruits fall from the top of the screen, and the player has to catch them with a basket. For every fruit caught, the player scores a point. For every fruit lost, the player loses a point.

The goal here is to let the computer play Catch by itself. We will be using a simplified version in this example in order to make the task easier:

[Figure: The "Catch" game that we will be creating]

While playing Catch, the player decides between three possible actions. They can move the basket to the left, to the right, or make it stay put.

The basis for this decision is the current state of the game; in other words, the positions of the falling fruit and of the basket. Our goal is to create a model that, given the content of the game screen, chooses the action that leads to the highest score possible. This task can be seen as a simple classification problem. We could ask expert human players to play the game multiple times and record their actions. Then, we could train a model to choose the "correct" action that mirrors the expert players.

This is not how humans learn, however. Humans can learn a game such as Catch by themselves, without guidance. This is very useful: imagine having to hire a bunch of experts to perform a task thousands of times every time you wanted to learn something as simple as Catch. It would be expensive and slow.

In reinforcement learning, the model trains from experience, rather than labeled data. Instead of providing the model with the correct actions, we provide it with rewards and punishments. The model receives information about the current state of the environment, for example, the computer game screen. It then outputs an action, such as a joystick movement. The environment reacts to this action and provides the next state, along with any rewards:

[Figure: RL scheme]
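The interaction loop shown in the preceding scheme can be sketched in a few lines of Python. This is only an illustrative sketch: env and agent are hypothetical objects standing in for the environment and the model, not part of the code we will write later.

# A minimal sketch of the agent-environment loop, assuming hypothetical
# env and agent objects with the methods shown below
state = env.reset()                                   # observe the initial state
game_over = False
while not game_over:
    action = agent.choose_action(state)               # for example, a joystick movement
    state_next, reward, game_over = env.act(action)   # the environment reacts
    agent.learn(state, action, reward, state_next)    # learn from the experience
    state = state_next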

The model then learns to find actions that lead to maximum rewards. There are many ways this can work in practice. Right now, we are going to look at Q-learning. Q-learning made a splash when it was used to train a computer to play Atari video games. Today, it is still a relevant concept, and many modern RL algorithms are based on some adaptation of Q-learning.

An excellent way to understand Q-learning is to compare playing Catch with playing chess. In both games, you are given a state, s. With chess, this is the position of the pieces on the board. In Catch, this is the location of the fruit and the basket. The player then has to take an action, a. In chess, this is moving a piece. In Catch, this is moving the basket left or right, or keeping it in its current position.

As a result, there will be some reward, r, and a new state, s'. The problem with both Catch and chess is that the rewards do not appear immediately after the action.

In Catch, you only earn rewards when the fruits hit the basket or fall on the floor, and in chess, you only earn a reward when you win or lose the game. This means that rewards are sparsely distributed. Most of the time, r will be zero. When there is a reward, it is not always a result of the action taken immediately before. Some action taken long before might have caused the victory. Figuring out which action is responsible for the reward is often referred to as the credit assignment problem. Because rewards are delayed, good chess players do not choose their plays only by the immediate reward. Instead, they choose the expected future reward.

For example, they do not only think about whether they can capture an opponent's piece with their next move; they also consider how taking a specific action now will help them in the long run. In Q-learning, we choose our action based on the highest expected future reward. We use a Q-function to calculate this. This is a mathematical function that takes two arguments: the current state of the game and a given action. We can write this as Q(state, action).

While in state s, we estimate the future reward for each possible action, a. We assume that after we have taken action a and moved to the next state, s', everything works out perfectly. The expected future reward, Q(s,a), for a given state and action is calculated as the immediate reward, r, plus the expected future reward thereafter, Q(s',a'). We assume the next action, a', is optimal. Because there is uncertainty about the future, we discount Q(s',a') by the factor gamma, γ. We, therefore, arrive at this expected reward:

Q(s, a) = r + γ max_a' Q(s', a')

Note

Note: We discount future rewards in RL for the same reason we discount future returns in finance: they are uncertain. Our choice of gamma reflects how much we value future rewards.
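To make the discounting concrete, here is a tiny worked example with made-up numbers: an immediate reward of zero, a discount factor of 0.9, and a best estimated future value of one give an expected reward of 0.9.

gamma = 0.9                # the discount factor; a value we choose
immediate_reward = 0.0     # r: nothing happened on this step
max_future_q = 1.0         # the best estimated Q(s', a') over all next actions
q_value = immediate_reward + gamma * max_future_q
print(q_value)             # 0.9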

Good chess players are very good at estimating future rewards in their head. In other words, their Q-function, Q(s,a), is very precise.

Most chess practice revolves around developing a better Q-function. Players peruse many old games to learn how specific moves played out in the past, and how likely a given action is to lead to victory. However, this raises the question, how can a machine estimate a good Q-function? This is where neural networks come into play.

Q-learning turns RL into supervised learning

When playing a game, we generate lots of "experiences." These experiences consist of the following:

  • The initial state, s
  • The action taken, a
  • The reward earned, r
  • The state that followed, s'

These experiences are our training data. We can frame the problem of estimating Q(s,a) as a regression problem, and solve it with a neural network. Given a state s and an action a, the network is supposed to predict a value of Q(s,a) equal to the target r + γ max_a' Q(s', a'). If we are good at predicting Q(s,a) for different states s and actions a, we will have a good approximation of the Q-function. In practice, as we will see below, the network will take the state s as input and output one Q-value per possible action.

Note

Note: We estimate Q(s', a') with the same neural network that we use to estimate Q(s,a). This leads to some instability, as our targets change while the network learns, just as with generative adversarial networks (GANs).

Given a batch of experiences, (s, a, r, s'), the training process then looks as follows:

  1. For each possible action, a' (left, right, stay), predict the expected future reward, Q(s', a'), using the neural network.
  2. Choose the highest value of the three predictions as max_a' Q(s', a').
  3. Calculate r + γ max_a' Q(s', a'). This is the target value for the neural network.
  4. Train the neural network using a loss function. This is a function that calculates how near or far the predicted value is from the target value. Here, we will use 0.5 * (predicted Q(s,a) - target)^2 as the loss function. Effectively, we want to minimize the squared error between prediction and target. The factor of 0.5 is just there to make the gradient nicer. A minimal single-experience sketch of these four steps follows this list.
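The following is a minimal sketch of these four steps for a single experience, assuming a Keras model that, like the one we will build below, takes a state and outputs one Q-value per action. The ExperienceReplay class shown next does the same thing for a whole batch:

import numpy as np

def single_experience_target(model, state, action, reward, state_next, gamma=0.9):
    q_next = model.predict(state_next)[0]           # step 1: Q(s', a') for every action a'
    max_q_next = np.max(q_next)                     # step 2: take the maximum
    target = model.predict(state)[0]                # start from the current predictions
    target[action] = reward + gamma * max_q_next    # step 3: target for the action taken
    return target                                   # step 4: train toward this target with a squared error loss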

During gameplay, all the experiences are stored in a replay memory. This acts like a simple buffer in which we store (s, a, r, s') tuples. The ExperienceReplay class also handles preparing the data for training.

Check out the following code:

import numpy as np

class ExperienceReplay(object):                                   #1
    def __init__(self, max_memory=100, discount=.9):
        self.max_memory = max_memory                              #2
        self.memory = []
        self.discount = discount

    def remember(self, states, game_over):                        #3
        self.memory.append([states, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]                                    #4

    def get_batch(self, model, batch_size=10):                    #5
        len_memory = len(self.memory)                             #6
        num_actions = model.output_shape[-1]
        env_dim = self.memory[0][0][0].shape[1]
        
        inputs = np.zeros((min(len_memory, batch_size), env_dim)) #7
        targets = np.zeros((inputs.shape[0], num_actions))
        
        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):           #8
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]                               #9
            game_over = self.memory[idx][1]

            inputs[i:i+1] = state_t                               #10
            
            targets[i] = model.predict(state_t)[0]                #11
            
            Q_sa = np.max(model.predict(state_tp1)[0])            #12
            
            if game_over:                                         #13
                targets[i, action_t] = reward_t
            else:
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets

Let's pause for a second and break down the code that we've just created:

  1. Firstly, we implement the experience replay buffer as a Python class. A replay buffer object is responsible for storing experiences and generating training data. Therefore, it has to implement some of the most critical pieces of the Q-learning algorithm.
  2. To initialize a replay object, we need to let it know how large its buffer should be and what the discount factor, gamma, is. The replay memory itself is a list of lists following this scheme:
    [...[experience, game_over][experience, game_over]...]
  3. Within this, experience is a tuple holding the experience information and game_over is a binary Boolean value indicating whether the game was over after this step.
  4. When we want to remember a new experience, we add it to our list of experiences. Since we cannot store infinite experiences, we delete the oldest experience if our buffer exceeds its maximum length.
  5. With the get_batch function, we can obtain a single batch of training data. To calculate Q(s', a'), we need a neural network as well, so we have to pass a Keras model to the function.
  6. Before we start generating a batch, we need to know how many experiences we have stored in our replay buffer, how many possible actions there are, and how many dimensions a game state has.
  7. Then we need to set up placeholder arrays for the inputs and targets we want to train the neural network on.
  8. We draw random indices from the replay buffer to fill the batch. The batch can be no larger than the number of stored experiences.
  9. We load the experience data as well as the game_over indicator from the replay buffer.
  10. We add state s to the input matrix. Later, the model will train to map from this state to the expected reward.
  11. We then fill the target row for all actions with the rewards currently predicted by the model. This ensures that our model effectively only trains on the action that was actually taken, since the error for all other actions is zero.
  12. Next, we calculate Q(s', a'). We simply assume that for the next state, s', or state_tp1 in code, the neural network will estimate the expected reward perfectly. As the network trains, this assumption slowly becomes true.
  13. Finally, if the game ended after state s, the expected reward from the action, a, should be just the received reward, r. If it did not end, then the expected reward should be the received reward plus the discounted expected future reward.

Defining the Q-learning model

Now it is time to define the model that will learn a Q-function for Catch. It turns out that a relatively simple model can already learn the function well. We need to define the number of possible actions as well as the grid size. There are three possible actions, which are move left, stay in position, and move right. Additionally, the game is being played on a 10x10-pixel grid:

num_actions = 3
grid_size = 10

As this is a regression problem, the final layer has no activation function, and the loss is a mean squared error loss. We optimize the network using stochastic gradient descent without momentum or any other bells and whistles:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(100, input_shape=(grid_size**2,), activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_actions))
model.compile(optimizer='sgd', loss='mse')
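As a quick sanity check of the input and output shapes, assume the game screen arrives as a 10x10 array. We flatten it to shape (1, 100) before feeding it to the model and get back one expected reward per action:

import numpy as np

screen = np.zeros((grid_size, grid_size))          # a blank 10x10 game screen
q_values = model.predict(screen.reshape((1, -1)))
print(q_values.shape)                              # (1, 3): one Q-value per action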

Training to play Catch

The final ingredient to Q-learning is exploration. Everyday life shows that sometimes you have to do something weird and/or random to find out whether there is something better than your daily routine.

The same goes for Q-learning. By always choosing the best option, you might miss out on some unexplored paths. To avoid this, the learner will sometimes choose a random option, and not necessarily the best one.
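Pulled out of the training loop below for clarity, the exploration rule looks like the following sketch. Here, epsilon is a hyperparameter that we choose, for example 0.1:

import numpy as np

def choose_action(model, state, epsilon=0.1):
    # With probability epsilon we explore by acting randomly;
    # otherwise we act greedily on the model's Q-estimates
    if np.random.rand() <= epsilon:
        return np.random.randint(0, num_actions)
    return np.argmax(model.predict(state)[0])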

Now we can define the training method:

def train(model, epochs):
    win_cnt = 0                                                 #1

    win_hist = []

    for e in range(epochs):                                     #2
        loss = 0.
        env.reset()
        game_over = False
        input_t = env.observe()
        
        while not game_over:                                    #3
            input_tm1 = input_t                                 #4
            
            if np.random.rand() <= epsilon:                     #5
                action = np.random.randint(0, num_actions, size=1)
            else:
                q = model.predict(input_tm1)                    #6
                action = np.argmax(q[0])

            input_t, reward, game_over = env.act(action)        #7
            if reward == 1:
                win_cnt += 1

            exp_replay.remember([input_tm1, action, reward, input_t], game_over)           #8

            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)          #9

            batch_loss = model.train_on_batch(inputs, targets)

            loss += batch_loss
            
        win_hist.append(win_cnt)
    return win_hist

Before we go further, again let's break down the code so we can see what we're doing:

  1. We want to keep track of the progress of our Q-learner, so we count the wins of the model over time.
  2. We now play for a number of games, specified by the epochs argument. At the beginning of a game, we first reset the game, set the game_over indicator to False, and observe the initial state of the game.
  3. We will then be playing frame by frame until the game is over.
  4. At the start of a frame cycle, we save the previously observed input as input_tm1, the input at time t minus one.
  5. Now comes the exploration part. We draw a random number between zero and one. If the number is smaller than epsilon, we pick a random action. This technique is also called "epsilon greedy," as we pick a random action with a probability of epsilon and greedily choose the action promising the highest rewards otherwise.
  6. If we choose a non-random action, we let the neural network predict the expected rewards for all actions. We then pick the action with the highest expected reward.
  7. We now act with our chosen or random action and observe the new state, a reward, and information about whether the game is over. The game gives a reward of one if we catch a fruit, in which case we increase our win counter.
  8. We store the new experience in our experience replay buffer.
  9. We then sample a new training batch from the experience replay and train on that batch.

The following graph shows the rolling mean of successful games. After about 2,000 epochs of training, the neural network should be quite good at playing Catch:

[Figure: The progress of a Q-learning neural network playing Catch]

Looking at the preceding graph, it's safe to say that you have now successfully created your first reinforcement learning system, as after 5,000 epochs the average rate of victories per game is between 90% and 100%. In the next section, we will explore the theoretical foundations of reinforcement learning and discover how the same kind of system that learns to play Catch can learn to route orders in the futures market.
