Q-learning, as we saw in the previous sections, is quite useful but it does have its drawbacks. For example, as we have to estimate a Q value for each action, there has to be a discrete, limited set of actions. So, what if the action space is continuous or extremely large? Say you are using an RL algorithm to build a portfolio of stocks.
In this case, even if your universe of stocks consisted of only two stocks, say, AMZN and AAPL, there would be a huge number of ways to balance them: 10% AMZN and 90% AAPL, 11% AMZN and 89% AAPL, and so on. If your universe gets bigger, the number of ways you can combine stocks explodes.
A workaround to having to select from such an action space is to learn the policy, π, directly. Once you have learned a policy, you can just give it a state, and it will give back a distribution over actions. This means that your actions will also be stochastic. A stochastic policy has advantages, especially in a game theoretic setting.
Imagine you are playing rock, paper, scissors and you are following a deterministic policy. If your policy is to pick rock, you will always pick rock, and as soon as your opponent figures out that you are always picking rock, you will always lose. The Nash equilibrium, the solution of a non-cooperative game, for rock, paper, scissors is to pick actions at random. Only a stochastic policy can do that.
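To make this concrete, here is a small illustrative sketch, not part of the agent we build later, of a stochastic policy that plays the equilibrium strategy; the names and probabilities are just for demonstration:

import random

# The Nash equilibrium mixed strategy for rock, paper, scissors:
# each action is played with probability 1/3.
actions = ['rock', 'paper', 'scissors']
probabilities = [1 / 3, 1 / 3, 1 / 3]      # a stochastic policy over actions

move = random.choices(actions, weights=probabilities)[0]
print(move)                                # unpredictable to any opponent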
To learn a policy, we have to be able to compute a gradient with respect to the policy's parameters. Contrary to most people's expectations, policies are differentiable. In this section, we will build up a policy gradient step by step and use it to create an advantage actor-critic (A2C) model for continuous control.
The first part in the process of differentiating policies is to look at the advantage we can have by picking a particular action, a, rather than just following the policy, π:

A(s, a) = Q(s, a) - V(s)

The advantage of action a in state s is the value of executing a in s minus the value of s under the policy, π. We measure how good our policy, π, is with J(π), a function expressing the expected value of the starting state, s0:

J(π) = E[V(s0)]
Now, to compute the gradient of the policy, we have to do two steps, which are shown inside the expectation in the policy gradient formula:

∇θ J(π) = E[∇θ log π(a|s) A(s, a)]

First, we have to calculate the advantage of a given action, a, with A(s, a). Then, we have to calculate the derivative, ∇θ log π(a|s), which tells us how to change the weights of the neural network, θ, to increase the probability that a is picked under the policy π.
For actions with a positive advantage, A(s, a), we follow the gradient that would make a more likely. For actions with a negative advantage, we go in the exact opposite direction. The expectation says that we are doing this for all states and all actions. In practice, we multiply the advantage of an action with the gradient that makes the action more likely, as the small sketch below demonstrates.
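Here is a small, self-contained sketch of this multiplication for a softmax policy over three discrete actions; the numbers and names are purely illustrative and are not part of the model we build later:

import numpy as np

theta = np.zeros(3)                        # one logit per action

def policy(theta):
    # Softmax distribution over the actions
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # Gradient of log pi(a) with respect to theta for a softmax policy
    onehot = np.eye(len(theta))[a]
    return onehot - policy(theta)

a, advantage = 0, 1.5                      # pretend action 0 beat the baseline
theta += 0.1 * advantage * grad_log_pi(theta, a)   # make action 0 more likely
print(policy(theta))                       # probability of action 0 has risen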
One thing left for us to look at is how we compute the advantage. The value of taking an action is the reward earned directly as a result of taking the action, plus the discounted value of the state we find ourselves in after taking that action:

Q(s, a) = r + γV(s')
So, we can substitute Q(s, a) in the advantage calculation:

A(s, a) = r + γV(s') - V(s)
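As a quick, illustrative sanity check of this formula (the numbers are made up; the agent computes the same quantity later in train_model):

# Advantage: immediate reward plus the discounted value of the next
# state, minus the value of the current state.
reward, gamma = 1.0, 0.9                   # illustrative values
value, next_value = 2.0, 2.5               # illustrative critic estimates

advantage = reward + gamma * next_value - value
print(advantage)                           # 1.25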
As calculating V turns out to be useful for calculating the policy gradient, researchers have come up with the A2C architecture: a single neural network with two heads that learns both V and π. As it turns out, sharing weights between the two functions is useful because it accelerates training when both heads have to extract features from the environment.

If you are training an agent that operates on high-dimensional image data, for instance, the value function head and the policy head both need to learn how to interpret the image. Sharing weights helps with this common task. If you are training on lower-dimensional data, it might make more sense to not share weights.
If the action space is continuous, π is represented by two outputs: the mean, μ, and the standard deviation, σ. This allows us to sample from a learned distribution, just as we did for the autoencoder.
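With made-up values for μ and σ, sampling a continuous action then looks like this:

import numpy as np

mu, sigma = 0.3, 0.8                       # illustrative policy outputs
action = np.random.normal(loc=mu, scale=sigma)   # sample from N(mu, sigma^2)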
A common variant of the A2C approach is the asynchronous advantage actor-critic, or A3C. A3C works exactly like A2C, except that at training time, multiple agents are simulated in parallel. This means that more independent data can be gathered. Independent data is important, as overly correlated examples can make a model overfit to specific situations and forget other situations.
Since both A3C and A2C work by the same principles, and the implementation of parallel gameplay introduces some complexity that obfuscates the actual algorithm, we will just stick with A2C in the following examples.
In this section, we will train an A2C model to swing up and balance a pendulum. The pendulum is controlled by a rotational force that can be applied in either direction. Control is continuous: the agent can apply more or less force, and force can be applied in both directions, as a positive or a negative value.
This relatively simple control task is a useful example of continuous control, and it can easily be extended to a stock trading task, which we will look at later. In addition, the task can be visualized, so that we can get an intuitive grasp of how the algorithm learns, including any pitfalls.
The pendulum environment is part of the OpenAI Gym, a suite of games made to train reinforcement learning algorithms. You can install it via the command line as follows:
pip install gym
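To check that the installation works, we can run one episode of the same Pendulum-v0 environment we will train on later, taking random actions; this is just a quick sketch, not part of the agent:

import gym

env = gym.make('Pendulum-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()     # a random torque in the allowed range
    state, reward, done, info = env.step(action)
env.close()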
Before we start, we have to make some imports:
import gym                                      #1
import numpy as np                              #2
from scipy.stats import norm                    #3
from keras.layers import Dense, Input, Lambda
from keras.models import Model
from keras.optimizers import Adam
from keras import backend as K
from collections import deque                   #4
import random
There are quite a few new imports, so let's walk through them one by one:

1. gym is a toolkit for developing reinforcement learning algorithms. It provides a number of game environments, from classic control tasks, such as a pendulum, to Atari games and robotics simulations.
2. gym is interfaced by numpy arrays. States, actions, and environments are all presented in a numpy-compatible format.
3. From scipy.stats, we import norm, which represents the normal (Gaussian) distribution; it will come in handy when working with our policy's probability density.
4. The deque Python data structure is a highly efficient data structure that conveniently manages a maximum length for us. No more manually removing experiences! We can randomly sample from a deque using Python's random module, as the short demo below shows.
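Here is a quick demonstration of how a deque keeps only the most recent entries and lets us sample minibatches:

from collections import deque
import random

memory = deque(maxlen=3)
for experience in range(5):
    memory.append(experience)

print(memory)                              # deque([2, 3, 4], maxlen=3)
print(random.sample(memory, 2))            # a random minibatch of two entries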
Now it is time to build the agent. The following methods all form the A2CAgent class:
def __init__(self, state_size, action_size):
    self.state_size = state_size                     #1
    self.action_size = action_size
    self.value_size = 1

    self.exp_replay = deque(maxlen=2000)             #2

    self.actor_lr = 0.0001                           #3
    self.critic_lr = 0.001
    self.discount_factor = .9

    self.actor, self.critic = self.build_model()     #4

    self.optimize_actor = self.actor_optimizer()     #5
    self.optimize_critic = self.critic_optimizer()
Let's walk through the code step by step:

1. First, we define some game-related variables: the size of the state space, the size of the action space, and the size of the value, which is a single number (#1).
2. We set up the experience replay memory as a deque with a maximum length of 2,000 entries (#2).
3. The actor and the critic learn at different speeds, so we define separate learning rates for them, together with the discount factor for future rewards (#3).
4. We build the actor and critic networks (#4).
5. Finally, we create the custom optimizers for both networks (#5).
def build_model(self):
    state = Input(batch_shape=(None, self.state_size))           #1

    actor_input = Dense(30,                                      #2
                        activation='relu',
                        kernel_initializer='he_uniform')(state)

    mu_0 = Dense(self.action_size,                               #3
                 activation='tanh',
                 kernel_initializer='he_uniform')(actor_input)

    mu = Lambda(lambda x: x * 2)(mu_0)                           #4

    sigma_0 = Dense(self.action_size,                            #5
                    activation='softplus',
                    kernel_initializer='he_uniform')(actor_input)

    sigma = Lambda(lambda x: x + 0.0001)(sigma_0)                #6

    critic_input = Dense(30,                                     #7
                         activation='relu',
                         kernel_initializer='he_uniform')(state)

    state_value = Dense(1, kernel_initializer='he_uniform')(critic_input)  #8

    actor = Model(inputs=state, outputs=(mu, sigma))             #9
    critic = Model(inputs=state, outputs=state_value)            #10

    actor._make_predict_function()                               #11
    critic._make_predict_function()

    actor.summary()                                              #12
    critic.summary()

    return actor, critic                                         #13
The preceding function sets up the Keras model. It is quite complicated, so let's go through it:

1. The model takes the state as its input (#1).
2. The actor's hidden layer is a dense layer with 30 units and a relu activation function. It is initialized by an he_uniform initializer. This initializer is only slightly different from the default glorot_uniform initializer: the he_uniform initializer draws from a uniform distribution with the limits ±√(6/i), where i is the input dimensionality, while the default glorot_uniform initializer samples from a uniform distribution with the limits ±√(6/(i+o)), with o being the output dimensionality. The difference between the two is rather small, but as it turns out, the he_uniform initializer works better for learning the value function and policy (#2).
3. The pendulum's action space ranges from -2 to 2. For the mean, μ, we use a regular tanh activation, which ranges from -1 to 1 first, and correct the scaling later (#3).
4. To correct the scaling, we multiply the outcome of the tanh function by two. Using the Lambda layer, we can define such a function manually in the computational graph (#4).
5. The standard deviation, σ, must be positive. The softplus activation, softplus(x) = log(1 + e^x), works in principle just like relu, but with a soft edge (#5).
6. To make sure the standard deviation is never exactly zero, we add a tiny constant to it. Again, we use the Lambda layer for this task. This also ensures that the gradients get calculated correctly, as the model is aware of the constant added (#6).
7. The critic gets its own hidden dense layer (#7), followed by a single linear unit that predicts the value of the state (#8).
8. We assemble the actor, which outputs μ and σ (#9), and the critic, which outputs the state value (#10).
9. A Keras model only gets fully built the first time we call predict(). If that happens from multiple threads, things can break. _make_predict_function() makes sure the model is already loaded on a GPU or CPU and is ready to predict, even from multiple threads (#11).
10. Finally, we print summaries of both models (#12) and return them (#13).

Now we have to create the optimizer for the actor. The actor uses a custom optimizer that optimizes it along the policy gradient. Before we define the optimizer, however, we need to look at the last piece of the policy gradient. Remember how the policy gradient depends on the gradient, ∇θ log π(a|s), that would make action a more likely? Keras can calculate this derivative for us, but we need to provide Keras with the value of the policy, π.
To this end, we need to define a probability density function. π(a|s) is a normal distribution with mean μ and standard deviation σ, so the probability density function, f, is as follows:

f(a) = 1/√(2πσ²) · e^(-(a-μ)²/(2σ²))

In this term, π stands for the constant, 3.14…, not for the policy. Later, we only need to take the logarithm of this probability density function. Why the logarithm? Because taking the logarithm results in a smoother gradient. Maximizing the log of a probability means maximizing the probability, so we can just use the "log trick," as it is called, to improve learning.

The value of the policy, π, is the advantage of each action, a, times the log probability of this action occurring, as expressed by the probability density function, f.
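We can sanity check this density and its logarithm against scipy's norm, which we imported earlier; the values for mu, sigma, and the action are illustrative:

import numpy as np
from scipy.stats import norm

mu, sigma, action = 0.0, 1.0, 0.5

# The Gaussian probability density function written out by hand...
pdf = 1. / np.sqrt(2. * np.pi * sigma ** 2) * \
      np.exp(-(action - mu) ** 2 / (2. * sigma ** 2))

# ...matches scipy's implementation, as does its logarithm.
print(np.log(pdf))                               # approx. -1.0439
print(norm.logpdf(action, loc=mu, scale=sigma))  # approx. -1.0439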
The following function optimizes our actor model. Let's go through the optimization procedure:
def actor_optimizer(self):
    action = K.placeholder(shape=(None, 1))                      #1
    advantages = K.placeholder(shape=(None, 1))

    mu, sigma_sq = self.actor.output                             #2

    pdf = 1. / K.sqrt(2. * np.pi * sigma_sq) * \
          K.exp(-K.square(action - mu) / (2. * sigma_sq))        #3

    log_pdf = K.log(pdf + K.epsilon())                           #4

    exp_v = log_pdf * advantages                                 #5

    entropy = K.sum(0.5 * (K.log(2. * np.pi * sigma_sq) + 1.))   #6

    exp_v = K.sum(exp_v + 0.01 * entropy)                        #7

    actor_loss = -exp_v                                          #8

    optimizer = Adam(lr=self.actor_lr)                           #9

    updates = optimizer.get_updates(self.actor.trainable_weights,
                                    [], actor_loss)              #10

    train = K.function([self.actor.input, action, advantages],
                       [], updates=updates)                      #11

    return train                                                 #12
1. First, we set up placeholders for the action that was taken and its advantage; they get filled in when the training function is called (#1).
2. We obtain the outputs of the actor: the mean, mu, and the variance, sigma_sq (#2).
3. From these, we build the probability density function of the normal distribution (#3).
4. We take the logarithm of the probability density function. For numerical stability, we add a tiny constant, Keras' epsilon, before taking the log (#4).
5. The expected value of the policy is the log probability times the advantage (#5).
6. We also compute the entropy of the policy's normal distribution (#6).
7. Using K.sum(), we sum the value over the batch, adding the entropy scaled by a small factor, 0.01, as a bonus that encourages exploration (#7).
8. Keras optimizers minimize, but we want to maximize the expected value, so the actor loss is the negative expected value (#8).
9. We set up an Adam optimizer (#9).
10. optimizer.get_updates() takes three arguments, parameters, constraints, and loss. We provide the parameters of the model, that is, its weights. Since we don't have any constraints, we just pass an empty list as a constraint. For a loss, we pass the actor loss (#10).
11. K.function() rolls everything into a callable that takes the model input, the action, and the advantages, and applies the updates (#11).
12. We return this training function (#12). Since we called actor_optimizer() in the __init__ function of our class, the optimizer function we just created becomes self.optimize_actor.

For the critic, we also need to create a custom optimizer. The loss for the critic is the mean squared error between the predicted value and the reward plus the discounted predicted value of the next state:
def critic_optimizer(self):
    discounted_reward = K.placeholder(shape=(None, 1))           #1

    value = self.critic.output

    loss = K.mean(K.square(discounted_reward - value))           #2

    optimizer = Adam(lr=self.critic_lr)                          #3
    updates = optimizer.get_updates(self.critic.trainable_weights,
                                    [], loss)

    train = K.function([self.critic.input, discounted_reward],
                       [], updates=updates)                      #4

    return train
The preceding function optimizes our critic model:

1. We set up a placeholder for the target value. discounted_reward contains the discounted future value of the next state as well as the reward immediately earned (#1).
2. The critic loss is the mean squared error between this target and the critic's predicted value (#2).
3. We set up an Adam optimizer, from which we obtain an update tensor, just as we did previously (#3).
4. Finally, we roll everything into a training function with K.function(). As before, because we call critic_optimizer() in __init__, this function becomes self.optimize_critic (#4).

For our agent to take actions, we need to define a method that produces actions from a state:
def get_action(self, state):
    state = np.reshape(state, [1, self.state_size])              #1
    mu, sigma_sq = self.actor.predict(state)                     #2
    epsilon = np.random.randn(self.action_size)                  #3
    action = mu + np.sqrt(sigma_sq) * epsilon                    #4
    action = np.clip(action, -2, 2)                              #5
    return action
With this function, our actor can now act. Let's go through it:

1. First, we reshape the state into the batch format the model expects (#1).
2. We predict the mean and variance of the action distribution with the actor (#2).
3. We draw a random sample, epsilon, from a standard normal distribution (#3).
4. The action is the mean plus the standard deviation times epsilon; in other words, we sample the action from the learned normal distribution (#4).
5. To be safe, we clip the action to the allowed range of -2 to 2 (#5).
At last, we need to train the model. The train_model function will train the model after receiving one new experience:
def train_model(self, state, action, reward, next_state, done):
    self.exp_replay.append((state, action, reward,
                            next_state, done))                   #1

    (state, action, reward, next_state, done) = \
        random.sample(self.exp_replay, 1)[0]                     #2

    target = np.zeros((1, self.value_size))                      #3
    advantages = np.zeros((1, self.action_size))

    value = self.critic.predict(state)[0]                        #4
    next_value = self.critic.predict(next_state)[0]

    if done:                                                     #5
        advantages[0] = reward - value
        target[0][0] = reward
    else:
        advantages[0] = reward + self.discount_factor * (next_value) - value
        target[0][0] = reward + self.discount_factor * next_value

    self.optimize_actor([state, action, advantages])             #6
    self.optimize_critic([state, target])
And this is how we optimize both actor and critic:

1. First, the new experience is added to the experience replay memory (#1).
2. Then, we sample one experience at random from the memory. This way, we train on less correlated data (#2).
3. We set up placeholders for the value target and the advantages (#3).
4. We predict the values of the current state and the next state with the critic (#4).
5. If the game is over, the advantage is the reward minus the predicted value, and the target is just the reward. Otherwise, the advantage and the target also include the discounted value of the next state (#5).
6. Finally, we train actor and critic with the custom optimizers we created earlier (#6).
And that is it; our A2CAgent class is done. Now it is time to use it. We define a run_experiment function. This function plays the game for a number of episodes. It is useful to first train a new agent without rendering, because training takes around 600 to 700 games until the agent does well. With your trained agent, you can then watch the gameplay:
def run_experiment(render=False, agent=None, epochs=3000):
    env = gym.make('Pendulum-v0')                                #1

    state_size = env.observation_space.shape[0]                  #2
    action_size = env.action_space.shape[0]

    if agent is None:                                            #3
        agent = A2CAgent(state_size, action_size)

    scores = []                                                  #4

    for e in range(epochs):                                      #5
        done = False                                             #6
        score = 0
        state = env.reset()
        state = np.reshape(state, [1, state_size])

        while not done:                                          #7
            if render:                                           #8
                env.render()

            action = agent.get_action(state)                     #9

            next_state, reward, done, info = env.step(action)    #10
            reward /= 10                                         #11
            next_state = np.reshape(next_state, [1, state_size]) #12
            agent.train_model(state, action, reward, next_state, done)  #13

            score += reward                                      #14
            state = next_state                                   #15

            if done:                                             #16
                scores.append(score)
                print("episode:", e, " score:", score)

                if np.mean(scores[-min(10, len(scores)):]) > -20:  #17
                    print('Solved Pendulum-v0 after {} iterations'.format(len(scores)))
                    return agent, scores
Our experiment boils down to these steps:

1. First, we set up a new gym environment. This environment contains the pendulum game. We can pass actions to it and observe states and rewards (#1).
2. We obtain the sizes of the state space and the action space from the environment (#2).
3. If no agent was passed to the function, we create a new one (#3).
4. We set up an empty list to keep track of the scores (#4).
5. We play the game for a number of epochs (#5).
6. At the start of each game, we set done to false and score to 0, and reset the game. By resetting the game, we obtain the initial starting state (#6).
7. We then play until the game is over (#7).
8. If we passed render = True to the function, the game would be rendered on screen. Note that this won't work on a remote notebook such as in Kaggle or Jupyter (#8).
9. We get an action from the agent (#9).
10. We pass the action to the environment and observe the next state, the reward, and whether the game is done. gym also passes an info dictionary, which we can ignore (#10).
11. We scale the reward down by a factor of 10 (#11).
12. We reshape the next state for the model (#12).
13. We train the agent on the new experience (#13).
14. We add the reward to the game's score (#14).
15. We set the current state to the next state (#15).
16. If the game is over, we record and print the score (#16).
17. If the mean score of the last 10 games is above -20, we consider the task solved and return the trained agent together with the scores (#17).
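A typical way to use the function, following the advice above, is to train headless first and then render a few games with the trained agent. Note that, as written, run_experiment only returns the agent once the task is solved:

# Train a new agent without rendering; this can take 600 to 700 games.
agent, scores = run_experiment(render=False, epochs=3000)

# Then watch the trained agent play (not on a remote notebook).
run_experiment(render=True, agent=agent, epochs=10)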
also passes an info dictionary, which we can ignore.Reinforcement learning algorithms are largely developed in games and simulations where a failing algorithm won't cause any damage. However, once developed, an algorithm can be adapted to other, more serious tasks. To demonstrate this ability, we are now going to create an A2C agent that learns how to balance a portfolio of stocks within a large universe of stocks.
To train a new reinforcement learning algorithm, we first need to create a training environment. In this environment, the agent trades on real-life stock data. The environment can be interfaced just like an OpenAI Gym environment. Following the Gym conventions for interfacing reduces the complexity of development. Given a 100-day lookback of the percentage returns of stocks in the universe, the agent has to return an allocation in the form of a 100-dimensional vector.
The allocation vector describes the share of assets the agent wants to allocate to each stock. A negative allocation means the agent is shorting the stock. For simplicity's sake, transaction costs and slippage are not added to the environment. It would not be too difficult to add them, however.
Tip: The full implementation of the environment and agent can be found at https://www.kaggle.com/jannesklaas/a2c-stock-trading.
The environment looks like this:
import os
import sys
import numpy as np
import pandas as pd

# DATA_PATH is assumed to point at a directory of per-stock CSV files,
# each with a Date index and a Close column.

class TradeEnv():
    def reset(self):
        self.data = self.gen_universe()                          #1
        self.pos = 0                                             #2
        self.game_length = self.data.shape[0]                    #3
        self.returns = []                                        #4

        return self.data[0, :-1, :]                              #5

    def step(self, allocation):                                  #6
        ret = np.sum(allocation * self.data[self.pos, -1, :])    #7
        self.returns.append(ret)                                 #8

        mean = 0                                                 #9
        std = 1
        if len(self.returns) >= 20:                              #10
            mean = np.mean(self.returns[-20:])
            std = np.std(self.returns[-20:]) + 0.0001

        sharpe = mean / std                                      #11

        if (self.pos + 1) >= self.game_length:                   #12
            return None, sharpe, True, {}
        else:                                                    #13
            self.pos += 1
            return self.data[self.pos, :-1, :], sharpe, False, {}

    def gen_universe(self):                                      #14
        stocks = os.listdir(DATA_PATH)
        stocks = np.random.permutation(stocks)
        frames = []
        idx = 0
        while len(frames) < 100:                                 #15
            try:
                stock = stocks[idx]
                frame = pd.read_csv(os.path.join(DATA_PATH, stock),
                                    index_col='Date')
                frame = frame.loc['2005-01-01':].Close
                frames.append(frame)
            except:  # skip files that fail to load
                e = sys.exc_info()[0]
            idx += 1

        df = pd.concat(frames, axis=1, ignore_index=False)       #16
        df = df.pct_change()
        df = df.fillna(0)
        batch = df.values

        episodes = []                                            #17
        for i in range(batch.shape[0] - 101):
            eps = batch[i:i + 101]
            episodes.append(eps)

        data = np.stack(episodes)
        assert len(data.shape) == 3
        assert data.shape[-1] == 100

        return data
Our trade environment is somewhat similar to the pendulum environment. Let's see how we set it up:

1. On reset, we generate a new universe of stock data (#1), set our position within the episode to zero (#2), record the length of the game (#3), and start an empty list of portfolio returns (#4).
2. The first state consists of the first 100 days of returns for all 100 stocks (#5).
3. The step function takes an allocation vector (#6), computes the portfolio return this allocation earns on the following day (#7), and appends it to the list of returns (#8).
4. The reward is the Sharpe ratio over the last 20 returns: their mean divided by their standard deviation (#9 to #11). A small constant is added to the standard deviation to avoid division by zero.
5. If the episode is over, we return no next state, the final reward, and done = True (#12); otherwise, we advance one day and return the next 100-day window of returns (#13).
6. gen_universe loads the closing prices of 100 randomly chosen stocks from disk (#14 and #15), converts them to daily percentage returns (#16), and slices the resulting array into overlapping episodes of 101 days each: a 100-day lookback plus the day on which the allocation is evaluated (#17).
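Before plugging in the agent, we can step through the environment by hand. This sketch assumes DATA_PATH points at the stock CSVs and allocates 1% of the portfolio to each of the 100 stocks:

import numpy as np

env = TradeEnv()
state = env.reset()                        # (100, 100): 100 days x 100 stocks

allocation = np.ones(100) / 100            # equally weighted, long-only portfolio
next_state, reward, done, info = env.step(allocation)
print(reward)                              # Sharpe ratio; 0 until 20 days of history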
We only have to make minor edits to the A2CAgent class. Namely, we only have to modify the model so that it can take in the time series of returns. To this end, we add two LSTM layers, which actor and critic share:
def build_model(self):
    state = Input(batch_shape=(None,                             #1
                               self.state_seq_length,
                               self.state_size))

    x = LSTM(120, return_sequences=True)(state)                  #2
    x = LSTM(100)(x)

    actor_input = Dense(100, activation='relu',                  #3
                        kernel_initializer='he_uniform')(x)

    mu = Dense(self.action_size, activation='tanh',              #4
               kernel_initializer='he_uniform')(actor_input)

    sigma_0 = Dense(self.action_size, activation='softplus',
                    kernel_initializer='he_uniform')(actor_input)
    sigma = Lambda(lambda x: x + 0.0001)(sigma_0)

    critic_input = Dense(30, activation='relu',
                         kernel_initializer='he_uniform')(x)
    state_value = Dense(1, activation='linear',
                        kernel_initializer='he_uniform')(critic_input)

    actor = Model(inputs=state, outputs=(mu, sigma))
    critic = Model(inputs=state, outputs=state_value)

    actor._make_predict_function()
    critic._make_predict_function()

    actor.summary()
    critic.summary()

    return actor, critic
Again, we have built a Keras model in a function. It is only slightly different from the model before. Let's explore it:

1. The input now has a time dimension: each state is a sequence of state_seq_length time steps with state_size features each (#1).
2. The two LSTM layers are shared across actor and critic (#2).
3. The actor and critic heads work just as before (#3 and #4). Note that the mean is no longer multiplied by two, since allocations do not need to span the range from -2 to 2.

And that is it! This algorithm can now learn to balance a portfolio just as it learned to balance the pendulum before.