Natural policy gradient

REINFORCE and Actor-Critic are intuitive methods that work well on small to medium-sized RL tasks. However, they suffer from some problems that must be addressed before policy gradient algorithms can scale to much larger and more complex tasks. The main problems are as follows:

  • Difficulty in choosing a correct step size: This stems from the non-stationary nature of RL, meaning that the distribution of the data changes continuously over time; as the agent learns new things, it explores different parts of the state space. Finding a learning rate that stays stable throughout training is very tricky.
  • Instability: The algorithms are not aware of how much the policy will change with each update, which is related to the previous problem. A single uncontrolled update can cause a substantial shift in the policy that drastically changes the action distribution and moves the agent toward a bad region of the state space. Moreover, if the new region is very different from the previous one, it can take a long time to recover.
  • Poor sample efficiency: This problem is common to almost all on-policy algorithms. The challenge is to extract more information from the on-policy data before discarding it.

The algorithms proposed in this chapter, namely TRPO and PPO, try to address these three problems with different approaches, though they share a common background that will be explained shortly. Both TRPO and PPO are on-policy policy gradient algorithms that belong to the model-free family, as shown in the following categorization map of RL algorithms:

Figure 7.3. The position of TRPO and PPO inside the categorization map of RL algorithms

Natural Policy Gradient (NPG) was one of the first algorithms proposed to tackle the instability of policy gradient methods. It does so by modifying the policy update step so that the policy changes in a more controlled way. Unfortunately, it was designed with linear function approximation in mind: the update requires computing and inverting the Fisher information matrix, which becomes impractical for deep neural networks with a large number of parameters. Nevertheless, it is the foundation of more powerful algorithms such as TRPO and PPO.
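
To make the idea concrete, the following is a minimal NumPy sketch of a single natural policy gradient step, assuming we already have the per-sample gradients of the log-policy and an ordinary policy gradient estimate. The function name, the damping term, and the array shapes are illustrative assumptions, not code from the algorithms discussed later in the chapter.

```python
import numpy as np

def natural_policy_gradient_step(theta, grad_log_probs, policy_grad,
                                 lr=0.01, damping=1e-3):
    """One natural policy gradient update (illustrative sketch).

    theta          -- current policy parameters, shape (d,)
    grad_log_probs -- per-sample gradients of log pi(a|s), shape (n, d)
    policy_grad    -- ordinary policy gradient estimate, shape (d,)
    """
    n, d = grad_log_probs.shape
    # Empirical Fisher information matrix: F = E[g g^T] over the batch
    fisher = grad_log_probs.T @ grad_log_probs / n
    # A small damping term keeps F invertible when the batch is small
    fisher += damping * np.eye(d)
    # Natural gradient direction: F^{-1} * policy gradient
    natural_grad = np.linalg.solve(fisher, policy_grad)
    # Step in the natural gradient direction
    return theta + lr * natural_grad
```

Compared with the vanilla update, which simply adds the scaled policy gradient to the parameters, the natural gradient rescales the step by the inverse Fisher matrix, so the size of the policy change is measured in terms of the change in the action distribution rather than the raw change in the parameters.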
