Humans don't learn from millions of labeled examples. Instead, we often learn from positive or negative experiences that we associate with our actions. Children who touch a hot stove once quickly learn never to touch it again. Learning from experiences and the associated rewards or punishments is the core idea behind reinforcement learning (RL). RL allows us to learn sophisticated decision-making policies without any labeled data at all; instead, the learner generates its own training data by interacting with its environment. This approach has led to several high-profile breakthroughs in AI, such as AlphaGo, which beat the world Go champion in 2016.
In finance, reinforcement learning is making inroads as well. In its 2017 report, Machine learning in investment management (https://www.ahl.com/machine-learning), Man AHL outlined a reinforcement learning system for order routing in the FX and futures markets. Order routing is a classic problem in quantitative finance. When placing an order, funds can usually choose from different brokers and place their orders at different times. The goal is to fill the order as cheaply as possible. This also means minimizing the market impact, as large orders can move prices against the trader.
Traditional algorithms with colorful names, such as Sniper or Guerilla, rely on statistics from historical data and smart engineering. The RL-based routing system learned an optimal routing policy by itself. The advantage is that such a system can adapt to changing markets, which is why it outperforms traditional methods in data-rich markets such as the FX market.
However, RL can do more. Researchers at OpenAI have used RL to predict when agents will collaborate or fight. Meanwhile, researchers at DeepMind have used RL to yield new insights into the workings of the frontal cortex in the brain and the role of the neurotransmitter dopamine.
This chapter will start with an intuitive introduction to RL using a simple "catch the fruit" game. We will then dive into the underlying theory before covering more advanced RL applications. The examples in this chapter rely on visualizations that are not easily rendered in Kaggle kernels. To keep them simple, the example algorithms are also not optimized for GPU usage. It is, therefore, best to run these examples on your local machine.
The algorithms in this chapter run relatively quickly, so you won't have to wait too long for them to finish. The chapter code was written on a Mid-2012 MacBook Pro, and no example took longer than 20 minutes to run on that machine. Of course, you can also run the code on Kaggle; however, the visualizations will not work there.
Catch is a straightforward arcade game that you might have played as a child. Fruits fall from the top of the screen, and the player has to catch them with a basket. For every fruit caught, the player scores a point. For every fruit lost, the player loses a point.
The goal here is to let the computer play Catch by itself. We will be using a simplified version in this example in order to make the task easier:
While playing Catch, the player decides between three possible actions. They can move the basket to the left, to the right, or make it stay put.
The basis for this decision is the current state of the game; in other words, the positions of the falling fruit and of the basket. Our goal is to create a model that, given the content of the game screen, chooses the action that leads to the highest score possible. This task can be seen as a simple classification problem. We could ask expert human players to play the game multiple times and record their actions. Then, we could train a model to choose the "correct" action that mirrors the expert players.
This is not how humans learn, however. Humans can learn a game such as Catch by themselves, without guidance. This is very useful, because imagine if you had to hire a bunch of experts to perform a task thousands of times every time you wanted to learn something as simple as Catch: it would be expensive and slow.
In reinforcement learning, the model trains from experience, rather than labeled data. Instead of providing the model with the correct actions, we provide it with rewards and punishments. The model receives information about the current state of the environment, for example, the computer game screen. It then outputs an action, such as a joystick movement. The environment reacts to this action and provides the next state, along with any rewards:
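As a minimal sketch, which is not part of the original text, this interaction loop can be written down in a few lines of Python. The DummyEnv class here is a made-up stand-in for a real environment such as the Catch game used later in this chapter:

import numpy as np

class DummyEnv(object):
    # made-up stand-in for an environment with the observe/act
    # interface used later in this chapter
    def __init__(self):
        self.steps = 0
    def reset(self):
        self.steps = 0
    def observe(self):
        return np.zeros((1, 100))  # a blank 10x10 screen, flattened
    def act(self, action):
        self.steps += 1
        game_over = self.steps >= 10  # this toy episode ends after 10 steps
        reward = int(np.random.choice([-1, 1])) if game_over else 0
        return self.observe(), reward, game_over

env = DummyEnv()
env.reset()
state = env.observe()
game_over = False
while not game_over:
    action = np.random.randint(3)  # a random policy, just for illustration
    state, reward, game_over = env.act(action)
print(reward)  # the final reward of this toy episode, -1 or +1

A real agent would replace the random policy with a model that maps the observed state to the most promising action.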
The model then learns to find actions that lead to maximum rewards. There are many ways this can work in practice. Right now, we are going to look at Q-learning. Q-learning made a splash when it was used to train a computer to play Atari video games. Today, it is still a relevant concept. Most modern RL algorithms are based on some adaptation of Q-learning.
An excellent way to understand Q-learning is to compare playing Catch with playing chess. In both games, you are given a state, s. With chess, this is the position of the pieces on the board. In Catch, this is the location of the fruit and the basket. The player then has to take an action, a. In chess, this is moving a piece. In Catch, this is moving the basket left or right or remaining in the current position.
As a result, there will be some reward, r, and a new state, s'. The problem with both Catch and chess is that the rewards do not appear immediately after the action.
In Catch, you only earn rewards when the fruits hit the basket or fall on the floor, and in chess, you only earn a reward when you win or lose the game. This means that rewards are sparsely distributed; most of the time, r will be zero. When there is a reward, it is not always a result of the action taken immediately before: some action taken long before might have caused the victory. Figuring out which action is responsible for the reward is often referred to as the credit assignment problem. Because rewards are delayed, good chess players do not choose their moves only by the immediate reward. Instead, they consider the expected future reward.
For example, they do not only think about whether they can capture an opponent's piece with the next move; they also consider how taking a specific action now will help them in the long run. In Q-learning, we choose our action based on the highest expected future reward. We use a Q-function to calculate this. This is a mathematical function that takes two arguments: the current state of the game and a given action. We can write this as Q(state, action).
While in state s, we estimate the future reward for each possible action, a. We assume that after we have taken action a and moved to the next state, s', everything works out perfectly. The expected future reward, Q(s,a), for a given state and action is calculated as the immediate reward, r, plus the expected future reward thereafter, Q(s',a'). We assume the next action, a', is optimal. Because there is uncertainty about the future, we discount by the factor gamma, γ. We therefore arrive at the following expected reward:

Q(s,a) = r + γ * max_a' Q(s',a')
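To make the discounting concrete, here is a small numeric sketch; the numbers are made up for illustration, and the γ of 0.9 matches the discount factor used in the code later in this chapter:

def q_target(reward, max_q_next, gamma=0.9):
    # Bellman-style target: immediate reward plus the discounted
    # best estimate of the future reward
    return reward + gamma * max_q_next

print(q_target(0.0, 1.0))  # 0.9: a reward one step away is worth 90% of itself
print(q_target(0.0, 0.9))  # 0.81: two steps away, the discount compounds
print(q_target(1.0, 0.0))  # 1.0: an immediate reward is not discounted

The further away a reward lies, the more heavily it is discounted, which captures our uncertainty about the future.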
Good chess players are very good at estimating future rewards in their head. In other words, their Q-function, Q(s,a), is very precise.
Most chess practice revolves around developing a better Q-function. Players peruse many old games to learn how specific moves played out in the past, and how likely a given action is to lead to victory. However, this raises the question, how can a machine estimate a good Q-function? This is where neural networks come into play.
When playing a game, we generate lots of "experiences." These experiences consist of the following:

- The initial state, s
- The action taken, a
- The reward earned, r
- The state that followed, s'
These experiences are our training data. We can frame the problem of estimating Q(s,a) as a regression problem. To solve this, we can use a neural network. Given an input vector consisting of s and a, the neural network is supposed to predict the value of Q(s,a) equal to the target: r + γ * max_a' Q(s',a'). If we are good at predicting Q(s,a) for different states s and actions a, we will have a good approximation of the Q-function.
Given a batch of experiences, (s, a, r, s'), the training process then looks as follows:

1. For each experience, let the model predict the Q-values, Q(s,a), of the current state for all actions.
2. Let the model also predict the Q-values of the next state, and take the highest of them as max_a' Q(s',a').
3. Set the target for the action that was actually taken to r + γ * max_a' Q(s',a'), or to just r if the game ended with this step.
4. Train the model to predict these targets.
During gameplay, all the experiences are stored in a replay memory. This acts like a simple buffer in which we store (s, a, r, s') pairs. The ExperienceReplay class also handles preparing the data for training. Check out the following code:
import numpy as np

class ExperienceReplay(object):  #1
    def __init__(self, max_memory=100, discount=.9):
        self.max_memory = max_memory  #2
        self.memory = []
        self.discount = discount

    def remember(self, states, game_over):  #3
        self.memory.append([states, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]  #4

    def get_batch(self, model, batch_size=10):  #5
        len_memory = len(self.memory)  #6
        num_actions = model.output_shape[-1]
        env_dim = self.memory[0][0][0].shape[1]
        inputs = np.zeros((min(len_memory, batch_size), env_dim))  #7
        targets = np.zeros((inputs.shape[0], num_actions))
        for i, idx in enumerate(np.random.randint(0, len_memory,
                                                  size=inputs.shape[0])):  #8
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]  #9
            game_over = self.memory[idx][1]
            inputs[i:i+1] = state_t  #10
            targets[i] = model.predict(state_t)[0]  #11
            Q_sa = np.max(model.predict(state_tp1)[0])  #12
            if game_over:  #13
                targets[i, action_t] = reward_t
            else:
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets
Let's pause for a second and break down the code that we've just created:
1. We initialize the replay memory with a maximum length, max_memory, and a discount factor for future rewards.
2. Experiences are stored in a simple list that looks like this:
[...[experience, game_over][experience, game_over]...]
Here, experience is a tuple holding the experience information, and game_over is a binary Boolean value indicating whether the game was over after this step.
3. When we want to remember a new experience, we append it to this list.
4. If the buffer has grown beyond max_memory, we delete the oldest experience.
5. With the get_batch function, we can obtain a single batch of training data. To calculate Q(s',a'), we need a neural network as well, so we need to pass a Keras model to use the function.
6. We first check how many experiences are stored and how many actions the model can predict, as well as how large the game screen is.
7. We then allocate empty arrays for the inputs and targets of the batch.
8. We sample experiences at random, together with their game_over indicator, from the replay buffer.
9. Each experience is unpacked into the state, the action, the reward, and the next state, state_tp1 in code.
10. The sampled state becomes the input of this training example.
11. The targets start out as the model's current predictions for that state, so that only the action that was actually taken receives a new target value.
12. Q_sa is the highest reward the model expects in the next state. By using this estimate as part of the target, we assume the neural network will estimate the expected reward perfectly. As the network trains, this assumption slowly becomes true.
13. If the game was over after this step, the target for the chosen action is just the reward. Otherwise, it is the reward plus the discounted expected future reward, Q_sa.
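To see how the pieces fit together, here is a hedged usage sketch, not taken from the original text. The tiny network and the numbers are stand-ins for the real model defined in the next section:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# a tiny stand-in model: 100 inputs (a flattened 10x10 screen)
# and 3 outputs (one Q-value per action)
model = Sequential()
model.add(Dense(3, input_shape=(100,)))
model.compile(optimizer='sgd', loss='mse')

exp_replay = ExperienceReplay(max_memory=500, discount=0.9)

# store one made-up experience: state, action, reward, next state
state_t = np.zeros((1, 100))
state_tp1 = np.zeros((1, 100))
exp_replay.remember([state_t, 1, 0, state_tp1], game_over=False)

inputs, targets = exp_replay.get_batch(model, batch_size=10)
print(inputs.shape, targets.shape)  # (1, 100) (1, 3); capped by memory size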
Now it is time to define the model that will learn a Q-function for Catch. It turns out that a relatively simple model can already learn the function well. We need to define the number of possible actions as well as the grid size. There are three possible actions, which are move left, stay in position, and move right. Additionally, the game is being played on a 10x10-pixel grid:
num_actions = 3
grid_size = 10
As this is a regression problem, the final layer has no activation function, and the loss is a mean squared error loss. We optimize the network using stochastic gradient descent without momentum or any other bells and whistles:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(100, input_shape=(grid_size**2,), activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_actions))
model.compile(optimizer='sgd', loss='mse')
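As a quick sanity check, which is not part of the original text, we can feed the network one blank, flattened 10x10 screen and confirm that it outputs one Q-value per action:

import numpy as np

blank_screen = np.zeros((1, grid_size**2))  # a batch of one flattened frame
q_values = model.predict(blank_screen)
print(q_values.shape)  # (1, 3): one estimated Q-value per action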
The final ingredient of Q-learning is exploration. Everyday life shows that sometimes you have to do something weird and/or random to find out whether there is something better than your daily routine.
The same goes for Q-learning. By always choosing the best option, you might miss out on some unexplored paths. To avoid this, the learner will sometimes choose a random option, and not necessarily the best one.
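A minimal sketch of this idea, mirroring the logic that appears in the training method below; the epsilon value of 0.1 is an assumption, not a value prescribed by the text:

import numpy as np

def choose_action(model, state, epsilon=0.1, num_actions=3):
    # explore: with probability epsilon, pick a random action
    if np.random.rand() <= epsilon:
        return np.random.randint(num_actions)
    # exploit: otherwise pick the action with the highest predicted Q-value
    q = model.predict(state)
    return np.argmax(q[0])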
Now we can define the training method:
def train(model, epochs):
    win_cnt = 0  #1
    win_hist = []
    for e in range(epochs):  #2
        loss = 0.
        env.reset()
        game_over = False
        input_t = env.observe()
        while not game_over:  #3
            input_tm1 = input_t  #4
            if np.random.rand() <= epsilon:  #5
                action = np.random.randint(0, num_actions, size=1)
            else:
                q = model.predict(input_tm1)  #6
                action = np.argmax(q[0])
            input_t, reward, game_over = env.act(action)  #7
            if reward == 1:
                win_cnt += 1
            exp_replay.remember([input_tm1, action, reward, input_t],
                                game_over)  #8
            inputs, targets = exp_replay.get_batch(model,
                                                   batch_size=batch_size)  #9
            batch_loss = model.train_on_batch(inputs, targets)
            loss += batch_loss
        win_hist.append(win_cnt)
    return win_hist
Before we go further, let's again break down the code so that we can see what we're doing:
1. We want to track our progress, so we count the number of games won in win_cnt and record its running total in win_hist.
2. We play for the number of games given by the epochs argument. At the beginning of a game, we first reset the game, set the game_over indicator to False, and observe the initial state of the game.
3. We then play frame by frame until the game is over.
4. Before acting, we store the current screen as input_tm1, the input at time t minus one.
5. With a probability of epsilon, we pick a random action. This technique is also called "epsilon greedy," as we pick a random action with a probability of epsilon and greedily choose the action promising the highest rewards otherwise.
6. Otherwise, we let the model predict the Q-values for all actions and choose the action with the highest expected reward.
7. We act, and the environment returns the new screen, the reward, and the game_over indicator. If we caught a fruit, we count a win.
8. We store the experience in the replay buffer.
9. Finally, we sample a batch of training data from the replay buffer and train the model on it. A sketch of how this training function might be invoked follows below.
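Putting it all together, here is a hedged sketch of how a training run might be started. The Catch environment class and the hyperparameter values are assumptions standing in for the chapter's accompanying code, not definitions from this text:

# hypothetical setup; Catch stands in for the game environment class
# from the chapter's accompanying code and is not defined in this text
epsilon = 0.1      # exploration rate (an assumed value)
batch_size = 50    # training batch size (an assumed value)
env = Catch(grid_size)
exp_replay = ExperienceReplay(max_memory=500, discount=0.9)

hist = train(model, epochs=5000)  # cumulative win count per epoch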
The following graph shows the rolling mean of successful games. After about 2,000 epochs of training, the neural network should be quite good at playing Catch:
Looking at the preceding graph, it's safe to say that you have now successfully created your first reinforcement learning system: after 5,000 epochs, the share of games won is between 90% and 100%. In the next section, we will explore the theoretical foundations of reinforcement learning and discover how the same system that learns to play Catch can learn to route orders in the futures market.