The action-value function that uses one-step bootstrapping is defined as follows:
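In commonly used notation, the one-step bootstrapped action value can be sketched as below. The symbols are assumptions rather than the source's own: $r_t$ is the reward received after taking action $a_t$ in state $s_t$, $\gamma$ is the discount factor, and $V$ is the critic's state-value estimate.

```latex
Q(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma V(s_{t+1}) \right]
```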
Here, $s_{t+1}$ denotes the next state.
Thus, combining an actor with a critic that bootstraps, we obtain the one-step actor-critic (AC) update:
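A standard way to write this update is shown below, assuming a policy $\pi_\theta$ with parameters $\theta$, a step size $\alpha$, a discount factor $\gamma$, and a critic $V$ (all notation assumed for illustration). The term in parentheses is the one-step temporal-difference error.

```latex
\theta \leftarrow \theta + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```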
This update replaces the REINFORCE update with a baseline:
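For comparison, REINFORCE with a baseline is conventionally written with the full Monte Carlo return $G_t$, which is only known once the episode has ended. The notation is again an assumption, with $T$ the final time step of the episode.

```latex
\theta \leftarrow \theta + \alpha \left( G_t - V(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
```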
Note the difference between how REINFORCE and AC use the state-value function. In REINFORCE, it serves only as a baseline: it supplies the value of the current state, which is subtracted from the full return. In AC, it also estimates the value of the next state, so that only the current reward is needed to estimate the action value $Q(s_t, a_t)$. Because each update needs only the current transition rather than the return of a completed episode, one-step AC is a fully online, incremental algorithm.
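The online character of the update can be illustrated with a minimal sketch, assuming a tabular softmax actor and a tabular critic. The function names, the table layout, and the step sizes are hypothetical choices made for this example; the source does not prescribe them. Each call consumes a single transition and needs no stored episode.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def one_step_ac_update(theta, v, s, a, r, s_next, done,
                       alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """Apply one online actor-critic update in place; return the TD error.

    theta[s][a] holds the actor's action preferences (hypothetical layout).
    v[s] holds the critic's state-value estimates (hypothetical layout).
    """
    # Bootstrapped target: current reward plus the discounted critic
    # estimate of the next state (zero if the episode has terminated).
    target = r + (0.0 if done else gamma * v[s_next])
    td_error = target - v[s]

    # Action probabilities under the current (pre-update) policy.
    probs = softmax(theta[s])

    # Critic: move V(s) toward the bootstrapped target.
    v[s] += alpha_critic * td_error

    # Actor: for a softmax policy, d/d theta[s][b] of log pi(a|s)
    # equals 1[b == a] - pi(b|s).
    for b, p in enumerate(probs):
        grad_log_pi = (1.0 if b == a else 0.0) - p
        theta[s][b] += alpha_actor * td_error * grad_log_pi

    return td_error
```

In a training loop, this function would be called once per environment step with the transition just observed, whereas a REINFORCE-style update would have to wait until the episode ends to compute the return.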