Stability and reproducibility

Stability and reproducibility are closely related: the goal is to design an algorithm that behaves consistently across multiple runs and that is robust to small tweaks. For example, the algorithm shouldn't be overly sensitive to small changes in the hyperparameter values.

The main factor that makes deep RL algorithms difficult to replicate is intrinsic to deep neural networks themselves: the random initialization of the weights and the stochasticity of the optimization process. This situation is exacerbated in RL, where the environments are stochastic as well. Combined, these factors also hurt the interpretability of the results.

Stability is further put to the test by the inherent instability of RL algorithms, as we saw with Q-learning and REINFORCE. In value-based algorithms, for example, there is no guarantee of convergence, and the algorithms suffer from high bias and instability. DQN employs several tricks to stabilize the learning process, such as experience replay and delayed updates of the target network, as sketched in the following snippet. Although these strategies alleviate the instability problems, they don't make them go away.
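As a rough illustration, here is a minimal sketch of these two tricks, a uniform replay buffer and a delayed copy of the online network. QNetwork, the buffer capacity, and the update frequency are illustrative placeholders, not DQN's actual implementation:

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions; sampling uniformly from it
    breaks the correlation between consecutive experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

class QNetwork:
    """Hypothetical stand-in for any parametrized Q-function."""
    def __init__(self):
        self.weights = [0.0]

online_net = QNetwork()
target_net = copy.deepcopy(online_net)   # frozen copy used for the TD targets
TARGET_UPDATE_FREQ = 1000                # steps between target syncs

buffer = ReplayBuffer(capacity=100_000)
for step in range(5000):
    buffer.store((step, 0, 0.0, step + 1, False))  # dummy (s, a, r, s', done)
    if len(buffer.buffer) >= 32:
        batch = buffer.sample(32)        # decorrelated minibatch for the update
    if step % TARGET_UPDATE_FREQ == 0:
        target_net = copy.deepcopy(online_net)     # delayed target update
```

Keeping the target network fixed between syncs means the bootstrap targets don't chase a moving estimate, which is precisely what dampens the oscillations mentioned above.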

To overcome the constraints that are intrinsic to the algorithm in terms of stability and reproducibility, we need to intervene from the outside. To this end, a number of benchmarks and rules of thumb can be employed to ensure a good level of reproducibility and consistency of results. These are as follows:

  • Whenever possible, test the algorithm on multiple but similar environments. For example, test it on a suite of environments such as Roboschool or the Atari Gym games, where the tasks are comparable in terms of action and state spaces but have different goals.
  • Run many trials with different random seeds. The results may vary significantly when only the seed changes. As an example, the following diagram shows two runs of the exact same algorithm with the same hyperparameters but a different seed; the differences are large. So, depending on your goal, it can be helpful to use multiple random seeds, generally between three and five. In academic papers, for example, it is good practice to average the results across five runs and report the standard deviation as well, as in the sketch that follows the figure.
  • If the results are unstable, consider using a more stable algorithm or employing some additional strategies. Also, keep in mind that the effect of changing a hyperparameter can vary significantly across algorithms and environments:

Figure 13.4. Performance of two trials of the same algorithm with different random seeds
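To make the multi-seed protocol concrete, here is a minimal sketch of averaging results over several seeds. The train function is a hypothetical stand-in for your actual training routine, and the noisy score it returns merely simulates run-to-run variation:

```python
import numpy as np

def train(seed):
    """Hypothetical training routine. In a real experiment, seed every
    source of randomness (NumPy, the DL framework, and the environment)
    and return the final performance; here a noisy score stands in."""
    rng = np.random.default_rng(seed)
    return 100.0 + 10.0 * rng.standard_normal()

seeds = [0, 1, 2, 3, 4]                       # three to five seeds, as suggested
scores = np.array([train(s) for s in seeds])
print(f"mean return: {scores.mean():.1f} +/- {scores.std():.1f} "
      f"over {len(seeds)} seeds")
```

Reporting the mean together with the standard deviation makes it immediately visible when an apparent improvement is smaller than the seed-to-seed noise.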