Choosing the appropriate algorithm

The main factors that differentiate the various types of RL algorithms are sample efficiency and training time.

We define sample efficiency as the number of interactions with the environment that an agent needs in order to learn the task. The numbers we provide are an indication of an algorithm's efficiency relative to other algorithms on typical environments.
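
As a concrete illustration, the sketch below simply counts calls to env.step() made while an agent interacts with the environment; the CartPole environment, the random stand-in policy, and the classic Gym reset/step interface are assumptions made only for this example:

```python
import gym

# Sample efficiency is measured in environment interactions, i.e. calls to
# env.step(). This sketch just counts them; the environment name and the
# random policy are placeholders for an actual learning agent.
env = gym.make("CartPole-v1")

total_interactions = 0
for episode in range(10):
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # stand-in for the agent's policy
        obs, reward, done, _ = env.step(action)
        total_interactions += 1  # one more interaction with the environment

print("Environment interactions used:", total_interactions)
```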

Other parameters also influence this choice, but they usually have a minor impact. To give you an idea, these include the availability of CPUs and GPUs, the type of reward function, scalability, and the complexity of the algorithm and of the environment.

For this comparison, we will take into consideration gradient-free black-box algorithms such as evolution strategies, model-based RL such as DAgger, and model-free RL. Of the latter, we will differentiate between policy gradient algorithms such as DDPG and TRPO and value-based algorithms such as DQN. 
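
To keep these four families straight, here they are written as a small Python dictionary, listing only the example algorithms named above:

```python
# The four families compared in this section, with the example algorithms
# mentioned above. This is a convenient summary, not an exhaustive list.
algorithm_families = {
    "gradient-free (black-box)": ["Evolution Strategies"],
    "model-based RL": ["DAgger"],
    "model-free, policy gradient": ["DDPG", "TRPO"],
    "model-free, value-based": ["DQN"],
}
```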

The following diagram shows the sample efficiency of these four categories of algorithms: efficiency increases as you move from left to right. Gradient-free methods require the most interactions with the environment, followed by policy gradient methods and then value-based methods, with model-based RL being the most sample efficient:

Figure 13.1. Sample efficiency comparison between model-based RL methods, policy gradient algorithms, value-based algorithms, and gradient-free algorithms (the leftmost methods are less efficient than the rightmost methods)

Conversely, the training time of these algorithms is inversely related to their sample efficiency. This relationship is summarized in the following diagram (note that the leftmost methods are slower to train than the rightmost methods). Model-based algorithms are roughly 5x slower to train than value-based algorithms, which in turn are roughly 5x slower than policy gradient algorithms, which are themselves about 5x slower to train than gradient-free methods.

Be aware that these numbers only indicate the average case, and that training time here refers only to the computation needed to train the algorithm, not to the time needed to acquire new transitions from the environment:

Figure 13.2. Training time efficiency comparison between model-based RL methods, policy gradient algorithms, value-based algorithms, and gradient-free algorithms (the leftmost methods are slower to train than the rightmost methods)

We can see that the sample efficiency of an algorithm is complementary to its training time: an algorithm that is data efficient is slow to train, and vice versa. Because the overall learning time of an agent accounts for both the training time and the speed of the environment, you have to find the trade-off between sample efficiency and training time that meets your needs. In fact, the main purpose of model-based and more sample-efficient model-free algorithms is to reduce the number of interactions with the environment so that these algorithms are easier to deploy and train in the real world, where interactions are slower than in simulators.
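
As a rough, purely illustrative sketch of this trade-off, the snippet below estimates the overall learning time as the number of interactions multiplied by the cost of each interaction (environment step time plus training compute per step). All the numbers are made-up placeholders chosen only to mirror the qualitative orderings of Figures 13.1 and 13.2:

```python
# Rough, illustrative estimate of the overall learning time:
#   total time ~ interactions * (env_step_time + compute_time_per_step)
# The interaction counts follow the ordering of Figure 13.1, and the total
# training compute (interactions * compute per step) follows the rough 5x
# progression of Figure 13.2. None of these are real measurements.
env_step_time = 1.0  # seconds per interaction (e.g., a real robot)

# (environment interactions needed, training compute per interaction in seconds)
profiles = {
    "gradient-free":   (10_000_000, 0.0001),
    "policy gradient": ( 1_000_000, 0.005),
    "value-based":     (   200_000, 0.125),
    "model-based":     (    50_000, 2.5),
}

for name, (interactions, compute_per_step) in profiles.items():
    total_hours = interactions * (env_step_time + compute_per_step) / 3600
    print(f"{name:>15}: {total_hours:8.1f} hours of overall learning time")
```

With a slow environment like this (one second per interaction), the most sample-efficient families come out ahead despite their heavier per-step computation; set env_step_time close to zero to mimic a fast simulator and the ordering reverses.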
