Practical tips for RL engineering

In this section, we will introduce some practical tips for building RL systems and highlight some current research frontiers that are highly relevant to financial practitioners.

Designing good reward functions

Reinforcement learning is the field of designing algorithms that maximize a reward function. However, creating good reward functions is surprisingly hard. As anyone who has ever managed people will know, both people and machines game the system.

The RL literature is full of examples of agents exploiting bugs in Atari games that had gone unnoticed for years. In the game "Fishing Derby," for example, OpenAI has reported a reinforcement learning agent achieving a higher score than is ever possible according to the game makers, and doing so without catching a single fish!

While such exploits are amusing in games, this kind of behavior can be dangerous when it occurs in financial markets. An agent trained to maximize returns from trading, for example, could resort to illegal trading activities such as spoofing, without its owners knowing about it. There are three methods for creating better reward functions, which we will look at in the next three subsections.

Careful, manual reward shaping

By manually creating rewards, practitioners can help the system to learn. This works especially well if the natural rewards of the environment are sparse. If, say, a reward is usually only given if a trade is successful, and this is a rare event, it helps to manually add a function that gives a reward if the trade was nearly successful.

Equally, if an agent is engaging in illegal trading, a hard-coded "robot policy" can be set up that gives a huge negative reward to the agent if it breaks the law. Reward shaping works if the rewards and the environment are relatively simple. In complex environments, it can defeat the purpose of using machine learning in the first place. Creating a complex reward function in a very complex environment can be just as big a task as writing a rule-based system acting in the environment.
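
As a minimal sketch of such reward shaping, the following hypothetical function combines a sparse success reward with a small bonus for near-successful trades and a large hard-coded penalty for illegal actions. The thresholds, the `is_illegal` flag, and the bonus size are assumptions for illustration, not prescriptions from the text.

```python
def shaped_reward(pnl, target_pnl, is_illegal, near_miss_bonus=0.1, penalty=-100.0):
    """Hypothetical shaped reward for a trading agent.

    pnl:         realized profit and loss of the trade
    target_pnl:  threshold above which a trade counts as successful
    is_illegal:  flag set by a hard-coded compliance check (spoofing, etc.)
    """
    if is_illegal:
        # Hard-coded "robot policy": breaking the law dominates everything else.
        return penalty
    if pnl >= target_pnl:
        return 1.0                  # sparse "success" reward
    if pnl >= 0.9 * target_pnl:
        return near_miss_bonus      # shaping: reward near-successful trades too
    return 0.0
```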

Yet, especially in finance, and more so in trading, hand-crafted reward shaping is useful. Risk-averse trading is an example of creating a clever objective function. Instead of maximizing the expected reward, risk-averse reinforcement learning maximizes an evaluation function, $U$, which is an extension of the utility-based shortfall to a multistage setting:

$$U(X) = \sup\{m \in \mathbb{R} \mid \mathbb{E}[u(X - m)] \ge 0\}$$

Here, $u$ is a concave, continuous, and strictly increasing function that can be freely chosen according to how much risk the trader is willing to take. The RL algorithm now maximizes as follows:

$$\max_{\pi} \; U\!\left(\sum_{t=0}^{T} \gamma^{t} r_{t}\right)$$
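
To make the utility-based shortfall concrete, the sketch below evaluates $U(X)$ for a sample of simulated returns by searching for the largest $m$ with $\mathbb{E}[u(X - m)] \ge 0$. The exponential utility, the bisection bounds, and the simulated return distribution are illustrative assumptions; any concave, strictly increasing $u$ works.

```python
import numpy as np

def exponential_utility(x, lam=1.0):
    # A concave, continuous, strictly increasing utility; lam controls risk aversion.
    return (1.0 - np.exp(-lam * x)) / lam

def utility_shortfall(samples, u=exponential_utility, lo=-10.0, hi=10.0, tol=1e-6):
    """Largest m such that E[u(X - m)] >= 0, found by bisection.
    Assumes lo and hi bracket the solution for the given samples."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if u(samples - mid).mean() >= 0.0:
            lo = mid   # E[u(X - m)] is decreasing in m, so move the lower bound up
        else:
            hi = mid
    return lo

returns = np.random.default_rng(0).normal(loc=0.05, scale=0.2, size=10_000)
print(utility_shortfall(returns))   # a risk-adjusted value, slightly below the plain mean
```

The more concave (risk-averse) the chosen utility, the further this value drops below the expected return, which is exactly the penalty a risk-averse trader wants the agent to feel.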

Inverse reinforcement learning

In inverse reinforcement learning (IRL), a model is trained to predict the reward function of a human expert. A human expert performs a task, and the model observes the resulting states and actions. It then tries to find a reward function that explains the expert's behavior. More specifically, by observing the expert, a policy trace of states and actions is created. One example is the maximum likelihood inverse reinforcement learning (ML-IRL) algorithm, which works as follows (a minimal code sketch follows the list):

  1. Guess a reward function, R
  2. Compute the policy, π, that follows from R, by training an RL agent
  3. Compute the probability that the observed actions, D, were a result of π, that is, p(D | π)
  4. Compute the gradient of p(D | π) with respect to R and update R
  5. Repeat this process until p(D | π) is very high
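
The following is a minimal sketch of this loop on a tiny, made-up tabular MDP. It is not from the text: the transition table `P`, the soft value iteration used for step 2, and the finite-difference gradient used for step 4 are simplifying assumptions chosen to keep the example short and self-contained.

```python
import numpy as np

# Tiny tabular MDP: 3 states, 2 actions, deterministic transitions (all illustrative).
N_STATES, N_ACTIONS, GAMMA, BETA = 3, 2, 0.9, 5.0
P = np.array([[1, 2], [2, 0], [0, 1]])            # P[s, a] = next state

def soft_policy(R):
    """Step 2: a softmax ("Boltzmann") policy derived from reward R via soft value iteration."""
    V = np.zeros(N_STATES)
    for _ in range(100):
        Q = R + GAMMA * V[P]                      # Q[s, a]
        m = Q.max(axis=1)
        V = m + np.log(np.exp(BETA * (Q - m[:, None])).sum(axis=1)) / BETA
    pi = np.exp(BETA * (Q - Q.max(axis=1, keepdims=True)))
    return pi / pi.sum(axis=1, keepdims=True)

def log_likelihood(R, demos):
    """Step 3: log p(D | pi) for the observed (state, action) pairs D."""
    pi = soft_policy(R)
    return sum(np.log(pi[s, a]) for s, a in demos)

demos = [(0, 1), (1, 1), (2, 1), (0, 1)]          # hypothetical expert always picks action 1

R = np.zeros((N_STATES, N_ACTIONS))               # Step 1: guess a reward function
lr, eps = 0.1, 1e-4
for _ in range(200):                              # Step 5: repeat until p(D | pi) is high
    base = log_likelihood(R, demos)
    grad = np.zeros_like(R)
    for idx in np.ndindex(*R.shape):              # Step 4: finite-difference gradient w.r.t. R
        R_plus = R.copy()
        R_plus[idx] += eps
        grad[idx] = (log_likelihood(R_plus, demos) - base) / eps
    R += lr * grad

print(soft_policy(R))                             # the recovered policy now prefers action 1
```

In practice, step 2 would use a full RL training run and step 4 an analytical or automatic gradient, but the structure of the loop is the same.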

Learning from human preferences

Similar to IRL, which recovers a reward function from human demonstrations, there are also algorithms that learn directly from human preferences. Here, a reward predictor produces a reward function under which a policy is trained.

The goal of the reward predictor is to produce a reward function that results in a policy humans prefer. Human preference is measured by showing a human the results of two policies and letting them indicate which one they prefer:

Figure: Learning from preferences
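
A common way to train such a reward predictor is to treat each human comparison as a label in a Bradley-Terry-style model: the probability that segment A is preferred over segment B is a sigmoid of the difference in predicted rewards. The sketch below uses a linear reward model on hypothetical feature vectors; the data shapes, the hidden preference vector `true_w`, and the learning rate are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each policy rollout (trajectory segment) is summarized by a
# feature vector, and a human labels which of two segments they prefer (1 = first).
N_PAIRS, N_FEATURES = 500, 8
seg_a = rng.normal(size=(N_PAIRS, N_FEATURES))
seg_b = rng.normal(size=(N_PAIRS, N_FEATURES))
true_w = rng.normal(size=N_FEATURES)              # stands in for the human's hidden preferences
prefs = (seg_a @ true_w > seg_b @ true_w).astype(float)

# Linear reward predictor r(x) = w . x, trained with the Bradley-Terry loss:
# P(a preferred over b) = sigmoid(r(a) - r(b))
w = np.zeros(N_FEATURES)
lr = 0.05
for _ in range(2000):
    logits = (seg_a - seg_b) @ w
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = (seg_a - seg_b).T @ (p - prefs) / N_PAIRS  # gradient of the cross-entropy loss
    w -= lr * grad

# The learned w can now serve as the reward function under which the policy is trained.
print(np.corrcoef(w, true_w)[0, 1])               # correlation with the hidden preferences should be high
```

In a full system, the reward predictor would be a neural network over raw observations and would be retrained periodically as new human comparisons come in.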

Robust RL

Much like GANs, RL models can be fragile and hard to train to good results, and RL algorithms are quite sensitive to hyperparameter choices. But there are a few ways to make RL more robust:

  • Using a larger experience replay buffer: The goal of using experience replay buffers is to collect uncorrelated experiences. This can be achieved by just creating a larger buffer or a whole buffer database that can store millions of examples, possibly from different agents.
  • Target networks: RL is unstable in part because the neural network relies on its own output for training. By using a frozen target network to generate the training targets, we can mitigate this problem. The frozen target network should only be updated slowly, for example, by moving its weights a few percent toward the trained network every few epochs (a minimal sketch of this soft update follows the list).
  • Noisy inputs: Adding noise to the state representation helps the model generalize to other situations and avoids overfitting. It has proven especially useful if the agent is trained in a simulation but needs to generalize to the real, more complex world.
  • Adversarial examples: In a GAN-like setup, an adversarial network can be trained to fool the model by changing the state representations. The model can, in turn, learn to ignore the adversarial attacks. This makes learning more robust.
  • Separating policy learning from feature extraction: The most well-known results in reinforcement learning have learned a game from raw inputs. However, this requires the neural network to interpret, for example, an image by learning how that image leads to rewards. It is easier to separate the steps by, for example, first training an autoencoder that compresses state representations, then training a dynamics model that can predict the next compressed state, and then training a relatively small policy network from the two inputs.
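
As a sketch of the target network idea mentioned above, the following shows a Polyak-style soft update between two Keras models of the same architecture. The network architecture, the 10-dimensional state, and the update rate `tau` are illustrative assumptions.

```python
from tensorflow import keras

def build_q_network():
    # Illustrative architecture; in practice this would be your actual Q-network.
    return keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(4)                     # one output per action
    ])

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())  # start from identical weights

def soft_update(target, online, tau=0.01):
    """Move the target network a small step (tau) toward the online network."""
    new_weights = [tau * w_online + (1.0 - tau) * w_target
                   for w_online, w_target in zip(online.get_weights(), target.get_weights())]
    target.set_weights(new_weights)

# During training, targets such as r + gamma * max_a Q_target(s', a) are computed
# with target_net, while gradients only update online_net; then, every few steps:
soft_update(target_net, online_net, tau=0.01)
```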

As with the GAN tips, there is often little theoretical justification for why these tricks work, but they will make your RL models work better in practice.
