Now that we have an example, let's look at the mathematical expression of the TD prediction update:

V(s) = V(s) + α(r + γV(s') − V(s))
The preceding equation states that the value of the current state, V(s), is updated to the sum of the old value of the current state, V(s), and the learning rate α times the TD error. Wait, what is the TD error? Let's take a look:

δ = r + γV(s') − V(s)
It is the difference between the TD target and the value of the current state, where the TD target is the sum of the reward and the discount factor γ times the value of the next state, that is, r + γV(s'). The TD algorithm for evaluating the value function is given as follows:
Input: the policy π to be evaluated
Initialize V(s) arbitrarily
Repeat (for each episode):
Initialize S
Repeat (for each step of episode):
A ← action given by π for S
Take action A, observe R, S'
V(S) ← V(S) + α[R + γV(S') − V(S)]
S ← S'
until S is Terminal
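The steps above can be sketched in Python as TD(0) prediction on a small random-walk chain. The environment here (a five-state chain with a single terminal reward), the start state, and the hyperparameter values are illustrative assumptions, not from the text; only the update rule itself follows the algorithm above.

```python
import random

random.seed(0)  # for reproducibility of this sketch

def step(state):
    """Hypothetical random-walk policy/environment: move left or right.
    Reward is 1.0 only on reaching the terminal state 4, else 0.0."""
    next_state = max(0, state + random.choice([-1, 1]))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def td0_prediction(episodes=5000, alpha=0.1, gamma=0.9):
    V = [0.0] * 5                    # Initialize V(s) arbitrarily (zeros here)
    for _ in range(episodes):        # Repeat (for each episode)
        s = 2                        # Initialize S (middle of the chain)
        while s != 4:                # until S is terminal
            s_next, r = step(s)      # Take action, observe R, S'
            td_target = r + gamma * V[s_next]   # TD target: R + γV(S')
            td_error = td_target - V[s]         # TD error: δ
            V[s] += alpha * td_error            # V(S) ← V(S) + αδ
            s = s_next               # S ← S'
    return V

values = td0_prediction()
```

States closer to the rewarding terminal state end up with higher estimated values, since the TD updates propagate the terminal reward backward through the chain one step at a time.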