In this section, we will introduce some practical tips for building RL systems. We will also highlight some current research frontiers that are highly relevant to financial practitioners.
Reinforcement learning is the field of designing algorithms that maximize a reward function. However, creating good reward functions is surprisingly hard. As anyone who has ever managed people will know, both people and machines game the system.
The RL literature is full of examples of agents finding and exploiting bugs in Atari games that had gone unnoticed for years. For example, in the game "Fishing Derby," OpenAI reported a reinforcement learning agent achieving a higher score than is ever possible according to the game makers, all without catching a single fish!
While this is fun in games, such behavior can be dangerous when it occurs in financial markets. An agent trained to maximize returns from trading, for example, could resort to illegal activities such as spoofing, without its owners knowing about it. There are three methods for creating better reward functions, which we will look at in the next three subsections.
By manually creating rewards, practitioners can help the system learn. This works especially well if the natural rewards of the environment are sparse. If a reward is usually only given when a trade is successful, say, and successful trades are rare, it helps to manually add a function that gives a reward when a trade was nearly successful.
Equally, if an agent is engaging in illegal trading, a hard-coded "robot policy" can be set up that gives a huge negative reward to the agent if it breaks the law. Reward shaping works if the rewards and the environment are relatively simple. In complex environments, it can defeat the purpose of using machine learning in the first place. Creating a complex reward function in a very complex environment can be just as big a task as writing a rule-based system acting in the environment.
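As a concrete illustration, here is a minimal sketch of what such manual reward shaping could look like for a trading agent. The target-PnL notion of "nearly successful," the penalty magnitude, and all thresholds are illustrative assumptions, not part of any particular trading system:

```python
import numpy as np

def shaped_reward(pnl, target_pnl, is_illegal_action):
    """Illustrative shaped reward for a trading agent.

    The environment's natural reward is sparse: it only pays out when
    a trade fully succeeds (pnl >= target_pnl, with target_pnl > 0).
    The shaping term adds partial credit for nearly successful trades,
    and a hard-coded "robot policy" penalty punishes illegal actions.
    """
    if is_illegal_action:
        return -1000.0  # huge negative reward: breaking the law never pays

    if pnl >= target_pnl:
        return 1.0      # the sparse natural reward: the trade succeeded

    # Shaping: partial credit proportional to how close the trade came
    # to its target, so the agent gets a learning signal even before
    # it ever completes a fully successful trade.
    return 0.5 * np.clip(pnl / target_pnl, 0.0, 1.0)
```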
Yet, especially in finance, and more so in trading, hand-crafted reward shaping is useful. Risk-averse trading is an example of a cleverly designed objective function. Instead of maximizing the expected reward, risk-averse reinforcement learning maximizes an evaluation function, $U$, which is an extension of the utility-based shortfall to a multistage setting:

$$U(R) = \sup\{m \in \mathbb{R} \mid \mathbb{E}[u(R - m)] \geq 0\}$$

Here $u$ is a concave, continuous, and strictly increasing function that can be freely chosen according to how much risk the trader is willing to take. The RL algorithm now maximizes as follows:

$$\max_{\pi} \; U(R^{\pi})$$

where $R^{\pi}$ is the cumulative reward obtained by following policy $\pi$.
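To make the evaluation function concrete, here is a small numpy sketch that estimates $U(R)$ from sampled returns by bisection. The exponential utility and the bisection bounds are assumed choices; any concave, continuous, strictly increasing $u$ would do:

```python
import numpy as np

def exponential_utility(x, lam=1.0):
    """A concave, continuous, strictly increasing utility function.
    lam controls risk aversion; it is an assumed choice."""
    return (1.0 - np.exp(-lam * x)) / lam

def shortfall_value(returns, u=exponential_utility, lo=-100.0, hi=100.0, tol=1e-6):
    """Estimate U(R) = sup{m : E[u(R - m)] >= 0} from samples of R.

    g(m) = E[u(R - m)] is strictly decreasing in m, so the supremum
    is the root of g, which we find by bisection."""
    g = lambda m: np.mean(u(returns - m))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) >= 0:
            lo = mid   # m is still acceptable; move the lower bound up
        else:
            hi = mid
    return lo

# The risk-averse evaluation penalizes volatility: both return streams
# below have the same mean, but the volatile one scores much lower.
rng = np.random.default_rng(0)
steady   = rng.normal(0.05, 0.01, 100_000)
volatile = rng.normal(0.05, 0.50, 100_000)
print(shortfall_value(steady), shortfall_value(volatile))
```

For normally distributed returns, this $U$ with the exponential utility works out to the mean minus $\lambda/2$ times the variance, which is why the volatile stream scores lower despite having the same mean.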
In inverse reinforcement learning (IRL), a model is trained to predict the reward function of a human expert. A human expert performs a task, and the model observes states and actions. It then tries to find a reward function that explains the human expert's behavior. More specifically, by observing the expert, a policy trace of states and actions is created. One example is the maximum likelihood IRL algorithm, which works as follows:

1. Guess a reward function, $R$.
2. Compute the policy, $\pi$, that follows from $R$.
3. Compute the probability, $p(\text{trace} \mid \pi)$, that the expert's trace was generated under $\pi$.
4. Compute the gradient with respect to $R$ and update $R$ to increase this likelihood.
5. Repeat from step 2.
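The following is a minimal numpy sketch of this loop on a tiny, randomly generated MDP. The MDP itself, the soft value iteration used to turn a reward guess into a policy, the expert trace, the numerical gradient, and all hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
# Random transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def policy(theta, n_iter=100):
    """Step 2: softmax policy induced by the reward guess theta,
    computed via soft value iteration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        Qmax = Q.max(axis=1)
        V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))  # stable softmax
        Q = theta[:, None] + gamma * P @ V
    expQ = np.exp(Q - Q.max(axis=1, keepdims=True))
    return expQ / expQ.sum(axis=1, keepdims=True)

def log_likelihood(theta, trace):
    """Step 3: log-probability that the expert's (state, action)
    trace was generated by the policy induced by theta."""
    pi = policy(theta)
    return sum(np.log(pi[s, a]) for s, a in trace)

trace = [(0, 1), (2, 0), (4, 1), (1, 0), (3, 1)]  # assumed expert trace
theta = np.zeros(n_states)                        # step 1: guess a reward
lr, eps = 0.05, 1e-4
for step in range(100):                           # step 5: repeat
    base = log_likelihood(theta, trace)
    grad = np.zeros(n_states)                     # step 4: numerical gradient
    for i in range(n_states):
        bumped = theta.copy()
        bumped[i] += eps
        grad[i] = (log_likelihood(bumped, trace) - base) / eps
    theta += lr * grad                            # ascend the likelihood
```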
Similar to IRL, which recovers a reward function from human demonstrations, there are also algorithms that learn directly from human preferences. Here, a reward predictor produces a reward function, under which a policy is trained.
The goal of the reward predictor is to produce a reward function that results in a policy humans prefer. Human preference is measured by showing a human the results of two policies and letting them indicate which one is preferable.
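Here is a small numpy sketch of such a reward predictor, trained on pairwise comparisons with a Bradley-Terry-style loss. The linear reward model, the feature dimension, and the synthetic "human" who always prefers a larger first feature are all illustrative assumptions standing in for real human judgments:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                    # assumed dimensionality of the state features
w = np.zeros(dim)          # parameters of the (linear) reward predictor

def update(w, seg_a, seg_b, human_prefers_a, lr=0.05):
    """One gradient step on the pairwise preference loss.

    The predictor scores each trajectory segment by summing its
    per-state rewards. P(a preferred) = sigmoid(R(a) - R(b)), and we
    minimize the cross-entropy against the human's actual choice."""
    r_a, r_b = (seg_a @ w).sum(), (seg_b @ w).sum()
    p_a = 1.0 / (1.0 + np.exp(r_b - r_a))     # sigmoid(r_a - r_b)
    grad = (p_a - float(human_prefers_a)) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
    return w - lr * grad

# Train on synthetic comparisons between pairs of 10-step segments.
for _ in range(1000):
    seg_a = rng.normal(size=(10, dim))
    seg_b = rng.normal(size=(10, dim))
    prefers_a = seg_a[:, 0].sum() > seg_b[:, 0].sum()   # the stand-in "human"
    w = update(w, seg_a, seg_b, prefers_a)

print(w)  # the weight on feature 0 dominates: the preference was recovered
```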
Much like GANs, RL can be fragile and hard to train for good results. RL algorithms are quite sensitive to hyperparameter choices. But there are a few ways to make RL more robust:

- Using a larger experience replay buffer: sampling training experiences from a large buffer of past transitions decorrelates the data the network trains on.
- Target networks: RL is partly unstable because the network trains on targets derived from its own output; keeping a separate, slowly updated target network provides more stable targets (see the sketch at the end of this section).
- Noisy inputs: adding noise to the state representation helps the model generalize and avoid overfitting to quirks of the environment.
- Adversarial examples: relatedly, training on adversarially perturbed states makes the learned policy harder to throw off.
- Separating policy learning from feature extraction: learning a state representation first, for instance with an autoencoder, and then training the policy on top of it can stabilize training and reduce the amount of experience needed.
Similar to the GAN tips, there is little theoretical reason for why these tricks work, but they will make your RL work better in practice.
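As one concrete example, here is a minimal sketch of the target network trick from the list above. The plain numpy "network" (a list of weight arrays) and the tau value are illustrative assumptions; in practice these would be your framework's parameters:

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.005):
    """Polyak averaging: move the target network a small step toward
    the online network, so the targets the online network trains on
    change slowly and smoothly instead of jumping every update."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]

# Usage inside a (hypothetical) training loop:
online = [np.random.randn(4, 16), np.random.randn(16, 2)]
target = [w.copy() for w in online]        # start from identical weights
for step in range(1000):
    # ... gradient step on `online`, with TD targets computed from `target` ...
    target = soft_update(target, online)
```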