Using a critic to help an actor to learn

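The action-value function that uses one-step bootstrapping is defined as follows:

$$Q(s, a) = r + \gamma V(s')$$

Here, $r$ is the reward received after taking action $a$ in state $s$, $\gamma$ is the discount factor, and $s'$ is the next state.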

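Thus, with an actor $\pi_\theta$ and a critic $V$ that uses bootstrapping, we obtain the one-step AC update, where $\alpha$ is the learning rate:

$$\theta \leftarrow \theta + \alpha \left( r + \gamma V(s') - V(s) \right) \nabla_\theta \log \pi_\theta(a \mid s)$$

This replaces the REINFORCE update with a baseline, where $G$ is the full return observed from state $s$ onward:

$$\theta \leftarrow \theta + \alpha \left( G - V(s) \right) \nabla_\theta \log \pi_\theta(a \mid s)$$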

Note the difference between the use of the state-value function in REINFORCE and in AC. In the former, it is used only as a baseline, providing the value of the current state. In the latter, it is also used to estimate the value of the next state, so that only the current reward is required to estimate the action value. Because each update needs just a single transition rather than a complete episode, the one-step AC algorithm is fully online and incremental.
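
To make the online nature of the one-step AC update concrete, here is a minimal sketch of a single update applied to one transition. It is an illustration under stated assumptions, not the book's implementation: the tabular softmax actor (`theta`), the tabular critic (`v`), the function name `actor_critic_step`, the discount factor, and the learning rates are all hypothetical choices made for this example.

```python
import math

GAMMA = 0.99  # discount factor (assumed value for this sketch)


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1):
    """Apply one one-step AC update in place and return the TD error.

    theta[s][a]: action preferences of a tabular softmax actor (assumption).
    v[s]:        state-value estimates of a tabular critic (assumption).
    """
    # Bootstrapped target: current reward plus discounted value of the next state.
    target = r + (0.0 if done else GAMMA * v[s_next])
    td_error = target - v[s]

    # Critic update: move V(s) toward the bootstrapped target.
    v[s] += alpha_critic * td_error

    # Actor update. For a tabular softmax policy,
    # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * td_error * grad_log_pi

    return td_error


# Usage: two states, two actions, one observed transition (s=0, a=1, r=1.0, s'=1).
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 1.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

The update uses only the transition just observed, so it can run after every environment step. A REINFORCE update with a baseline would instead have to wait until the episode ends to compute the full return.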
