Algorithm explanation

The algorithm begins by initializing the value of every state (usually to 0). Then, for each episode, a state (S) is initialized, followed by a loop over the time steps of that episode: an action (A) is chosen according to the policy, π, the next state (S') and reward (R) are observed, and the value of S is updated using the preceding mathematical equation (the TD equation). Finally, the next state (S') becomes the current state (S), and the loop continues until S is a terminal state.
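To make these steps concrete, here is a minimal sketch of the TD(0) prediction loop in Python. It assumes an environment with Gym-style reset() and step() methods (the exact return signature of step() varies between Gym versions) and a policy callable that returns an action; the names env, policy, and td_prediction are placeholders for illustration, not code from this book:

from collections import defaultdict

def td_prediction(env, policy, alpha=0.5, gamma=0.8, num_episodes=1000):
    V = defaultdict(float)                    # every state value starts at 0
    for _ in range(num_episodes):
        state = env.reset()                   # initialize S
        done = False
        while not done:                       # loop for each step of the episode
            action = policy(state)            # A is chosen by the policy, pi
            next_state, reward, done, _ = env.step(action)   # observe S' and R
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state                # S <- S'
    return V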

Now, let's try to implement this in our analogy. Let's initialize all the values of our states as zeros (as shown in the following table):

State     Value
(1, 1)    0
(1, 2)    0
(1, 3)    0
...       ...
(4, 4)    0
Table with zero values
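A table like this can be represented in Python as a dictionary keyed by the (row, column) positions of the grid; the 4 x 4 layout below is an assumption based on the table above, used only for illustration:

# Zero-initialized value table for the (row, column) positions of the grid
V = {(row, col): 0.0 for row in range(1, 5) for col in range(1, 5)}
print(V[(2, 2)])   # 0.0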

Let's assume that the taxi is in the position (2, 2). The rewards for each action are given as follows:

0 - Move south: Reward = -1
1 - Move north: Reward = -1
2 - Move east: Reward = -1
3 - Move west: Reward = -1
4 - Pickup passenger: Reward = -10
5 - Dropoff passenger: Reward = -10
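If you want to mirror this reward scheme in code, a plain dictionary keyed by the numeric action codes above is enough; this mapping is just for illustration:

# Reward received for each action from the current state,
# following the numbering in the list above
action_rewards = {
    0: -1,    # move south
    1: -1,    # move north
    2: -1,    # move east
    3: -1,    # move west
    4: -10,   # pickup passenger
    5: -10,   # dropoff passenger
}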

The learning rate (α) usually lies between 0 and 1 and cannot be equal to 0, and we have already seen that the discount factor (γ) also lies between 0 and 1, but can be 0. Let's set the learning rate to 0.5 and the discount factor to 0.8 for this example. Let's assume that an action is chosen at random (usually, the ε-greedy algorithm is used to define the policy, π), say, north. For the action north, the reward is -1 and the next state (S') is (1, 2). Hence, the value update for state (2, 2) is as follows:

V(S) ← V(S) + α * [R + γ * V(S') - V(S)]
V(2, 2) ← 0 + 0.5 * [(-1) + 0.8 * (0) - 0]
=> V(2, 2) ← -0.5
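The same single update can be checked numerically with a few lines of Python; the dictionary below contains just the two state entries involved in this step:

# One-step TD(0) update for state (2, 2) with alpha = 0.5 and gamma = 0.8
alpha, gamma = 0.5, 0.8
V = {(2, 2): 0.0, (1, 2): 0.0}                    # the two zero-initialized entries we need
state, next_state, reward = (2, 2), (1, 2), -1    # action north: reward -1, S' = (1, 2)

V[state] += alpha * (reward + gamma * V[next_state] - V[state])
print(V[state])                                   # -0.5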

Now, the updated table is as follows:

State     Value
(1, 1)    0
(1, 2)    0
(1, 3)    0
...       ...
(2, 2)    -0.5
...       ...
(4, 4)    0
Updated table

Likewise, the algorithm keeps updating the state values in this way until the episode terminates.

Hence, TD prediction is advantageous compared to Monte Carlo methods and dynamic programming, but because the policy here is generated randomly, TD prediction can sometimes take longer to solve the problem. In the preceding example, the taxi may accumulate many negative rewards and may take a long time to arrive at a good solution. There are other ways in which the predicted values can be optimized. In the next section, we will cover TD control, which can achieve this.
