Recently, a decades-old optimization algorithm for reinforcement learning has come back into fashion. Evolutionary strategies (ES) are much simpler than Q-learning or A2C.
Instead of training one model through backpropagation, in ES we create a population of models by adding random noise to the weights of the original model. We then let each model run in the environment and evaluate its performance. The new model is the performance-weighted average of all the models.
In the following diagram, you can see a visualization of how evolutionary strategies work:
To get a better grip on how this works, consider the following example. We want to find a vector that minimizes the squared error to a solution vector. The learner is not given the solution, only the total error as a reward signal:
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # Reward is the negative sum of squared errors to the solution
    reward = -np.sum(np.square(solution - w))
    return reward
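As a quick sanity check of the reward function, the reward is zero for a perfect guess and grows more negative as the guess moves away from the solution:

```python
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # Reward is the negative sum of squared errors to the solution
    return -np.sum(np.square(solution - w))

print(f(solution))     # Zero reward: a perfect guess
print(f(np.zeros(3)))  # Roughly -0.35: 0.5^2 + 0.1^2 + 0.3^2
```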
A key advantage of evolutionary strategies is that they have fewer hyperparameters than gradient-based methods such as Q-learning or A2C. In this case, we need just three:
npop = 50     # Population size
sigma = 0.1   # Standard deviation of the noise
alpha = 0.1   # Learning rate
The optimization algorithm will look like the following code:
w = np.random.randn(3)                     # Start with a random guess
for i in range(300):                       # Run 300 optimization steps
    N = np.random.randn(npop, 3) * sigma   # Population of noise vectors
    R = np.zeros(npop)
    for j in range(npop):                  # Evaluate each perturbed model
        w_try = w + N[j]
        R[j] = f(w_try)
    A = (R - np.mean(R)) / np.std(R)       # Standardize the rewards
    w = w + alpha * np.dot(N.T, A) / npop  # Performance-weighted update
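Putting the pieces together, the following self-contained run (seeded here for reproducibility, which the original snippet is not) drives the error close to zero within the 300 iterations:

```python
import numpy as np

np.random.seed(0)  # Seed added for reproducibility; not part of the original

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # Reward is the negative sum of squared errors to the solution
    return -np.sum(np.square(solution - w))

npop, sigma, alpha = 50, 0.1, 0.1

w = np.random.randn(3)
for i in range(300):
    N = np.random.randn(npop, 3) * sigma
    R = np.array([f(w + N[j]) for j in range(npop)])
    A = (R - np.mean(R)) / np.std(R)
    w = w + alpha * np.dot(N.T, A) / npop

print(w)     # Should end up close to [0.5, 0.1, -0.3]
print(f(w))  # Reward should be close to 0
```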
The evolutionary strategy is relatively short in code, so let's go through it step by step:

1. We start with a random weight vector w.
2. We then run the optimization for 300 iterations.
3. In each iteration, we create a population of npop noise vectors drawn from a standard normal distribution and scaled by sigma.
4. We evaluate each perturbed model, w + N[j], and record its reward.
5. We standardize the rewards so they have zero mean and unit standard deviation, which makes the update invariant to the scale of the reward signal.
6. Finally, we move the weights in the direction of the performance-weighted average of the noise vectors, scaled by the learning rate alpha.
Similar to neural networks themselves, evolutionary strategies are loosely inspired by nature. In nature, species optimize themselves for survival through natural selection, and researchers have come up with many algorithms to imitate this process. The preceding neural evolution strategy works not only for single vectors but for large neural networks as well. Evolutionary strategies are still a field of active research, and at the time of writing, no best practice has been settled on.
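To make the claim that the same update works for neural networks concrete, here is a sketch that applies it to the flattened weights of a tiny one-hidden-layer network fitting a sine curve. The network size, task, and seed are illustrative choices, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy task (illustrative): fit y = sin(x) with a one-hidden-layer network
X = np.linspace(-3, 3, 64).reshape(-1, 1)
y = np.sin(X)

shapes = [(1, 16), (16,), (16, 1), (1,)]  # W1, b1, W2, b2
sizes = [int(np.prod(s)) for s in shapes]

def unflatten(w):
    # Split the flat parameter vector back into weights and biases
    parts, i = [], 0
    for shape, size in zip(shapes, sizes):
        parts.append(w[i:i + size].reshape(shape))
        i += size
    return parts

def f(w):
    # Reward is the negative mean squared error of the network's fit
    W1, b1, W2, b2 = unflatten(w)
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    return -np.mean((pred - y) ** 2)

npop, sigma, alpha = 50, 0.1, 0.1
w = rng.standard_normal(sum(sizes)) * 0.5
r_start = f(w)

for i in range(300):  # Same update as before, just in a bigger space
    N = rng.standard_normal((npop, sum(sizes))) * sigma
    R = np.array([f(w + N[j]) for j in range(npop)])
    A = (R - np.mean(R)) / np.std(R)
    w = w + alpha * np.dot(N.T, A) / npop

print(r_start, f(w))  # Reward improves as the fit gets better
```

The only change from the vector example is that f now unpacks the flat parameter vector into the network's weights before scoring it; the update rule itself is untouched.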
Reinforcement learning and evolutionary strategies are the go-to techniques when no supervised learning is possible but a reward signal is available. There are many applications in the financial industry where this is the case, from simple "multi-armed bandit" problems, such as the DHL order routing system, to complex trading systems.
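As a minimal illustration of the multi-armed bandit setting mentioned above, an epsilon-greedy agent learns purely from reward which of several options pays off best. The payout probabilities here are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

true_payouts = np.array([0.3, 0.5, 0.7])  # Hypothetical success rates per arm

n_arms = len(true_payouts)
counts = np.zeros(n_arms)
values = np.zeros(n_arms)  # Running average reward per arm
epsilon = 0.1              # Fraction of pulls spent exploring

for t in range(5000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))  # Explore: pick a random arm
    else:
        arm = int(np.argmax(values))     # Exploit: pick the best arm so far
    reward = float(rng.random() < true_payouts[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # Incremental mean

print(np.argmax(values))  # The agent should settle on arm 2 (payout 0.7)
```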