Learning the AC algorithm

Simple REINFORCE has the notable property of being unbiased, but it exhibits high variance. Adding a baseline reduces the variance while keeping the estimator unbiased, so that, asymptotically, the algorithm still converges to a local optimum. A major drawback of REINFORCE with baseline is that it converges very slowly, requiring a large number of interactions with the environment.
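
As a reminder of what the baseline changes, the policy gradient estimator takes roughly the following generic form (the notation here is illustrative, not tied to a specific implementation in this book):

$$
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)
$$

Here, $G_t$ is the Monte Carlo reward to go and $b(s_t)$ is the baseline; subtracting $b(s_t)$ does not change the expected value of the gradient, only its variance.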

An approach to speed up training is called bootstrapping, a technique that we've already seen many times throughout the book. It allows the return to be estimated from the subsequent state values. The policy gradient algorithms that use this technique are called actor-critic (AC) methods. In AC algorithms, the actor is the policy, and the critic is the value function (typically, a state-value function) that "critiques" the behavior of the actor to help it learn faster. The advantages of AC methods are numerous, but the most important one is their ability to learn in non-episodic problems.
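
To make the actor and critic roles concrete, the following is a minimal one-step update sketch in NumPy. The linear features, softmax policy, learning rates, and function names are illustrative assumptions, not code taken from this book:

```python
# A minimal one-step actor-critic update sketch (illustrative assumptions only).
import numpy as np

n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))  # actor (policy) parameters
w = np.zeros(n_features)                   # critic (state-value) parameters
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99

def softmax_policy(state):
    """Action probabilities under a linear-softmax actor."""
    prefs = state @ theta
    prefs -= prefs.max()                    # numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def ac_step(state, action, reward, next_state, done):
    """One bootstrapped actor-critic update for a single transition."""
    global theta, w
    v = state @ w
    v_next = 0.0 if done else next_state @ w
    td_error = reward + gamma * v_next - v  # the critic's "critique"
    # Critic: move V(s) toward the bootstrapped target r + gamma * V(s')
    w = w + alpha_critic * td_error * state
    # Actor: policy-gradient step, with the TD error in place of the full return
    probs = softmax_policy(state)
    grad_log_pi = -np.outer(state, probs)   # d log pi / d theta for all actions
    grad_log_pi[:, action] += state         # extra term for the action taken
    theta = theta + alpha_actor * td_error * grad_log_pi

# Illustrative usage on a single made-up transition
s, s_next = np.random.rand(n_features), np.random.rand(n_features)
ac_step(s, action=0, reward=1.0, next_state=s_next, done=False)
```

The critic moves its estimate of V(s) toward the bootstrapped target, and the TD error it produces serves double duty: it corrects the critic and it scales the actor's policy-gradient step.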

It's not possible to solve continuing tasks with REINFORCE because, to compute the reward to go, it needs all the rewards until the end of the trajectory (and if the trajectory is infinite, there is no end). By relying on the bootstrapping technique, AC methods are able to learn value estimates even from incomplete trajectories.
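
Written generically, the key substitution is to replace the full reward to go with a one-step bootstrapped target (assuming a discount factor $\gamma$ and a learned state-value function $V_w$):

$$
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \quad \longrightarrow \quad G_t \approx r_{t+1} + \gamma V_w(s_{t+1})
$$

Only a single transition is needed to form the right-hand side, so updates can be made online, at every step, even when the trajectory never terminates.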
