Intuition behind NPG

Before looking at a potential solution to the instability of PG methods, let's understand why it appears. Imagine you are climbing a steep volcano with a crater on the top, similar to the function in the following diagram. Let's also imagine that the only sense you have is the inclination of your foot (the gradient) and that you cannot see the world around you—you are blind. Let's also set a fixed length of each step you can take (a learning rate), for example, one meter. You take the first step, perceive the inclination of your feet, and move 1 m toward the steepest ascent direction. After repeating this process many times, you arrive at a point near the top where the crater lies, but still, you are not aware of it since you are blind. At this point, you observe that the inclination is still pointing in the direction of the crater. However, if the volcano only gets higher for a length smaller than your step, with the next step, you'll fall down. At this point, the space around you is totally new. In the case outlined in the following diagram, you'll recover pretty soon as it is a simple function, but in general, it can be arbitrarily complex. As a remedy, you could use a much smaller step size but you'll climb the mountain much slower and still, there is no guarantee of reaching the maximum. This problem is not unique to RL, but here it is more serious as the data is not stationary and the damage could be way bigger than in other contexts, such as supervised learning. Let's take a look at the following diagram:

Figure 7.4. While trying to reach the maximum of this function, you may fall inside the crater

A solution that could come to mind, and one that has been proposed in NPG, is to use the curvature of the function in addition to the gradient. The information regarding the curvature is carried on by the second derivative. It is very useful because a high value indicates a drastic change in the gradient between two points and, as prevention, a smaller and more cautious step could be taken, thus avoiding possible cliffs. With this new approach, you can use the second derivative to gain more information about the action distribution space and make sure that, in the case of a drastic shift, the distribution of the action spaces don't vary too much. In the following section, we'll see how this is done in NPG.

Table of Contents for Intuition behind NPG

Create new playlist

Sign In

Sign Up

Table of Contents for
Intuition behind NPG