Now that we have an example, let's look at the mathematical expression of the TD prediction update:

V(s) = V(s) + α(r + γV(s') − V(s))
The preceding equation states that the value of the current state, V(s), is updated to the sum of the old value of the current state, V(s), and the learning rate α times the TD error. Wait, what is the TD error? Let's take a look:

δ = r + γV(s') − V(s)
It is the difference between the TD target and the value of the current state, where the TD target is the sum of the reward and the discount factor γ times the value of the next state, that is, r + γV(s'). The TD algorithm for evaluating the value function is given as follows:
Input: the policy π to be evaluated
Initialize V(s) arbitrarily
Repeat (for each episode):
Initialize S
Repeat (for each step of episode):
A ← action given by π for S
Take action A, observe R, S'
V(S) ← V(S) + α[R + γV(S') − V(S)]
S ← S'
until S is Terminal
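The steps above can be sketched in Python as TD(0) prediction on a small random-walk chain. The environment here (a five-state chain with a single terminal reward), the start state, and the hyperparameter values are illustrative assumptions, not from the text; only the update rule itself follows the algorithm above.

```python
import random

random.seed(0)  # for reproducibility of this sketch

def step(state):
    """Hypothetical random-walk policy/environment: move left or right.
    Reward is 1.0 only on reaching the terminal state 4, else 0.0."""
    next_state = max(0, state + random.choice([-1, 1]))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def td0_prediction(episodes=5000, alpha=0.1, gamma=0.9):
    V = [0.0] * 5                    # Initialize V(s) arbitrarily (zeros here)
    for _ in range(episodes):        # Repeat (for each episode)
        s = 2                        # Initialize S (middle of the chain)
        while s != 4:                # until S is terminal
            s_next, r = step(s)      # Take action, observe R, S'
            td_target = r + gamma * V[s_next]   # TD target: R + γV(S')
            td_error = td_target - V[s]         # TD error: δ
            V[s] += alpha * td_error            # V(S) ← V(S) + αδ
            s = s_next               # S ← S'
    return V

values = td0_prediction()
```

States closer to the rewarding terminal state end up with higher estimated values, since the TD updates propagate the terminal reward backward through the chain one step at a time.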