From zero to one

Once you have chosen the algorithm that best fits your needs, whether it's one of the well-known algorithms or a new one, you have to develop it. As you saw throughout this book, reinforcement learning algorithms don't have much in common with supervised learning algorithms. For this reason, the following aspects are worth pointing out, as they facilitate the debugging, experimentation, and tuning of the algorithm:

  • Start with easy problems: Initially, you will want a workable version of the code as fast as possible. However, it is advisable to proceed gradually through increasingly complex environments, as this greatly reduces the overall training and debugging time. For example, you can start with CartPole-v1 or RoboschoolInvertedPendulum-v1 if you need a discrete or continuous environment, respectively. Then, you can move on to a medium-complexity environment such as RoboschoolHopper-v1, LunarLander-v2, or a related environment with RGB images. At this point, you should have bug-free code that you can finally train and tune on your final task. Moreover, you should be as familiar as possible with the easier tasks so that you know what to look for if something is not working (a short sketch of this progression follows the figure below).
  • Training is slow: Training deep reinforcement learning algorithms takes time, and the learning curve can take almost any shape. As we saw in the previous chapters, the learning curves (that is, the cumulative reward of the trajectories with respect to the number of steps) can resemble a logarithmic function, a hyperbolic tangent, as shown in the following diagram, or a more complex function. The shape depends on the reward function, its sparsity, and the complexity of the environment. If you are working on a new environment and you don't know what to expect, be patient and leave the training running until you are sure that progress has stopped. Also, don't obsess over the plots while training.
  • Develop some baselines: For new tasks, the suggestion is to develop at least two baselines against which you can compare your algorithm. One could simply be a random agent, and the other an established algorithm such as REINFORCE or A2C. These baselines then serve as lower bounds for performance and efficiency (a random-agent baseline is sketched after the figure below).
  • Plots and histograms: To monitor the progress of the algorithm and to help during the debugging phase, it is important to plot, and display histograms of, key quantities such as the loss function, the cumulative reward, the actions (if possible), the length of the trajectories, the KL penalty, the entropy, and the value function. In addition to the means, you can plot the minimum and maximum values and the standard deviation. In this book, we primarily used TensorBoard to visualize this information, but you can use any tool you want (a TensorBoard logging example appears after the figure below).
  • Use multiple seeds: Deep reinforcement learning embeds stochasticity both in the neural networks and in the environments, which often makes the results inconsistent between runs. So, to ensure consistency and stability, it's better to run each experiment with multiple random seeds and report aggregate results (a seeding example appears after the figure below).
  • Normalization: Depending on the design of the environment, it can be helpful to normalize the rewards, the advantages, and the observations. The advantage values (for example, in TRPO and PPO) can be normalized within a batch to have a mean of 0 and a standard deviation of 1. The observations can be normalized using statistics gathered from a set of initial random steps, while the rewards can be normalized using a running estimate of the mean and standard deviation of the discounted or undiscounted reward (a normalization sketch appears after the figure below).
  • Hyperparameter tuning: Hyperparameters vary a lot depending on the class and type of algorithm. For example, value-based methods have hyperparameters that are quite distinct from those of policy gradient methods, and even instances of the same class, such as TRPO and PPO, have many unique hyperparameters. That being said, for each algorithm introduced throughout this book, we specified the hyperparameters that were used and the most important ones to tune. Among them, there are at least two hyperparameters that are shared by all RL algorithms: the learning rate and the discount factor. The former is slightly less critical than in supervised learning but, nevertheless, remains one of the first hyperparameters to tune in order to obtain a working algorithm (a simple grid search over both is sketched after the figure below). The discount factor is unique to RL algorithms. Introducing a discount factor may add bias, as it modifies the objective function; however, in practice, it produces better policies. Thus, to a certain degree, the shorter the horizon, the better, as it reduces instability:

Figure 13.3. Example of a logarithmic and hyperbolic tangent function
For all the color references mentioned in the chapter, please refer to the color images bundle at http://www.packtpub.com/sites/default/files/downloads/9781789131116_ColorImages.pdf.
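
To make the start with easy problems advice concrete, here is a minimal sketch (assuming the classic OpenAI Gym API; the Roboschool environments would additionally require the roboschool package) showing how the same loop can be pointed at increasingly complex environments simply by changing the environment ID. The train function here is only a random-action placeholder for your own training loop:

```python
import gym

# Ordered from easy to harder; debug on the first, tune on the last.
# LunarLander-v2 needs Box2D; Roboschool environments would also need
# the roboschool package and an `import roboschool`.
ENV_IDS = ["CartPole-v1", "LunarLander-v2"]

def train(env, num_steps=10_000):
    """Random-action placeholder for your own training loop (REINFORCE, A2C, PPO, ...)."""
    obs = env.reset()
    for _ in range(num_steps):
        action = env.action_space.sample()   # replace with your policy
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()

for env_id in ENV_IDS:
    env = gym.make(env_id)
    train(env)     # the same code, pointed at a progressively harder environment
    env.close()
```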
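
As a lower bound for the develop some baselines point, a random agent takes only a few lines. This is a minimal sketch, again assuming the classic Gym API; the returned average return is the floor that any learning algorithm must beat:

```python
import numpy as np
import gym

def random_baseline(env_id, episodes=100):
    """Average return of a uniformly random policy: the lower bound to beat."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        env.reset()
        done, ep_return = False, 0.0
        while not done:
            _, reward, done, _ = env.step(env.action_space.sample())
            ep_return += reward
        returns.append(ep_return)
    env.close()
    return np.mean(returns), np.std(returns)

print(random_baseline("CartPole-v1"))   # roughly 20 on average for CartPole
```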
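
For plots and histograms, the following sketch shows one possible way to log scalars and histograms to TensorBoard; it assumes TensorFlow 2's summary API, and the tags and dummy values are purely illustrative:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/run_0")

def log_stats(step, loss, returns, actions, entropy):
    """Write scalars and histograms that TensorBoard can then plot."""
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=step)
        tf.summary.scalar("return/mean", tf.reduce_mean(returns), step=step)
        tf.summary.scalar("return/max", tf.reduce_max(returns), step=step)
        tf.summary.scalar("return/min", tf.reduce_min(returns), step=step)
        tf.summary.scalar("entropy", entropy, step=step)
        tf.summary.histogram("actions", actions, step=step)
        tf.summary.histogram("returns", returns, step=step)
    writer.flush()

# Example call with dummy values
log_stats(step=0, loss=0.5, returns=[10.0, 12.0, 7.5],
          actions=[0, 1, 1, 0], entropy=0.9)
```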
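
For use multiple seeds, a simple pattern is to wrap the whole experiment in a function of the seed and aggregate the resulting learning curves. This is a sketch under the classic Gym API, with a random-action train placeholder standing in for your own training loop:

```python
import numpy as np
import gym

def train(env, epochs=50):
    """Placeholder training loop; returns a learning curve of episode returns."""
    curve = []
    for _ in range(epochs):
        env.reset()
        done, ep_ret = False, 0.0
        while not done:
            _, reward, done, _ = env.step(env.action_space.sample())
            ep_ret += reward
        curve.append(ep_ret)
    return curve

def run_experiment(env_id, seed):
    """Run one full training with every source of randomness seeded."""
    np.random.seed(seed)                 # seed NumPy (and TF/PyTorch in real code)
    env = gym.make(env_id)
    env.seed(seed)                       # classic Gym API
    env.action_space.seed(seed)
    return train(env)

curves = [run_experiment("CartPole-v1", seed) for seed in (0, 1, 2, 3, 4)]
mean_curve = np.mean(curves, axis=0)     # report the mean and std across seeds
std_curve = np.std(curves, axis=0)
```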
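
For normalization, the advantage standardization used in TRPO and PPO is a one-liner, while observations and rewards can use a running estimate of the mean and standard deviation. A minimal NumPy sketch:

```python
import numpy as np

def normalize_advantages(adv):
    """Standardize a batch of advantages to mean 0, std 1 (as in TRPO/PPO)."""
    return (adv - adv.mean()) / (adv.std() + 1e-8)

class RunningStat:
    """Running mean/variance estimate, usable to normalize observations or rewards."""
    def __init__(self, shape=()):
        self.n = 0
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.n

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)

# Example: standardizing a batch of advantages
adv = np.random.randn(64) * 5.0 + 3.0
print(normalize_advantages(adv).mean(), normalize_advantages(adv).std())
```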
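
Finally, for hyperparameter tuning, even a small grid search over the learning rate and the discount factor, run with a couple of seeds each, is already informative. In this sketch, train is a hypothetical placeholder for your own training function that returns a final mean return:

```python
import itertools
import numpy as np

def train(env_id, lr, gamma, seed):
    """Placeholder for your own training function; returns the final mean return."""
    rng = np.random.default_rng(seed)
    return rng.normal()                      # replace with a real training run

learning_rates = [1e-4, 3e-4, 1e-3]
discount_factors = [0.95, 0.99, 0.999]
seeds = [0, 1]

results = {}
for lr, gamma in itertools.product(learning_rates, discount_factors):
    scores = [train("LunarLander-v2", lr, gamma, s) for s in seeds]
    results[(lr, gamma)] = np.mean(scores)   # average over seeds

best = max(results, key=results.get)
print(f"best lr={best[0]}, gamma={best[1]}, score={results[best]:.1f}")
```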

Adopt these techniques and you'll be able to train, develop, and deploy your algorithms much more easily. Furthermore, you'll have algorithms that are more stable and robust.

Having a critical view and understanding of the drawbacks of deep reinforcement learning is key to pushing the boundaries of what RL algorithms can do and to designing better state-of-the-art algorithms. In the following section, we'll present the main challenges of deep RL in a more concise form.
