Algorithm explanation

The algorithm begins by initializing the value of every state (usually to 0). Then, for each episode, a state (S) is initialized, followed by a loop over the time steps of that episode: an action (A) is chosen according to the policy, π, the next state (S') and reward (R) are observed, and the value of S is updated using the preceding mathematical equation (the TD equation). Finally, the next state (S') becomes the current state (S), and the loop continues until S is a terminal state.
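To make these steps concrete, here is a minimal sketch of the TD(0) prediction loop in Python. It assumes an environment with Gym-style reset() and step() methods (the exact return signature of step() varies between Gym versions) and a policy callable that returns an action; the names env, policy, and td_prediction are placeholders for illustration, not code from this book:

from collections import defaultdict

def td_prediction(env, policy, alpha=0.5, gamma=0.8, num_episodes=1000):
    V = defaultdict(float)                    # every state value starts at 0
    for _ in range(num_episodes):
        state = env.reset()                   # initialize S
        done = False
        while not done:                       # loop for each step of the episode
            action = policy(state)            # A is chosen by the policy, pi
            next_state, reward, done, _ = env.step(action)   # observe S' and R
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state                # S <- S'
    return V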

Now, let's try to implement this in our analogy. Let's initialize all the values of our states as zeros (as shown in the following table):

State     Value
(1, 1)    0
(1, 2)    0
(1, 3)    0
...       ...
(4, 4)    0
Table with zero values
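A table like this can be represented in Python as a dictionary keyed by the (row, column) positions of the grid; the 4 x 4 layout below is an assumption based on the table above, used only for illustration:

# Zero-initialized value table for the (row, column) positions of the grid
V = {(row, col): 0.0 for row in range(1, 5) for col in range(1, 5)}
print(V[(2, 2)])   # 0.0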

Let's assume that the taxi is in the position (2, 2). The rewards for each action are given as follows:

0 - Move south: Reward = -1
1 - Move north: Reward = -1
2 - Move east: Reward = -1
3 - Move west: Reward = -1
4 - Pickup passenger: Reward = -10
5 - Dropoff passenger: Reward = -10
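If you want to mirror this reward scheme in code, a plain dictionary keyed by the numeric action codes above is enough; this mapping is just for illustration:

# Reward received for each action from the current state,
# following the numbering in the list above
action_rewards = {
    0: -1,    # move south
    1: -1,    # move north
    2: -1,    # move east
    3: -1,    # move west
    4: -10,   # pickup passenger
    5: -10,   # dropoff passenger
}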

The learning rate (α) usually lies between 0 and 1 and cannot be equal to 0, and we have already seen that the discount factor (γ) also lies between 0 and 1, but can be 0. Let's set the learning rate to 0.5 and the discount factor to 0.8 for this example. Let's assume that an action is chosen at random (usually, the ε-greedy algorithm is used to define the policy, π), say, north. For the action north, the reward is -1 and the next state (S') is (1, 2). Hence, the value update for state (2, 2) is as follows:

V(S) ← V(S) + α * [R + γ * V(S') - V(S)]
V(2, 2) ← 0 + 0.5 * [(-1) + 0.8 * (0) - 0]
=> V(2, 2) ← -0.5
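The same single update can be checked numerically with a few lines of Python; the dictionary below contains just the two state entries involved in this step:

# One-step TD(0) update for state (2, 2) with alpha = 0.5 and gamma = 0.8
alpha, gamma = 0.5, 0.8
V = {(2, 2): 0.0, (1, 2): 0.0}                    # the two zero-initialized entries we need
state, next_state, reward = (2, 2), (1, 2), -1    # action north: reward -1, S' = (1, 2)

V[state] += alpha * (reward + gamma * V[next_state] - V[state])
print(V[state])                                   # -0.5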

Now, the updated table is as follows:

State     Value
(1, 1)    0
(1, 2)    0
(1, 3)    0
...       ...
(2, 2)    -0.5
...       ...
(4, 4)    0
Updated table

Likewise, the algorithm keeps updating the state values in this way until the episode terminates.

Hence, TD prediction is advantageous compared to Monte Carlo methods and dynamic programming, but because the policy here is generated randomly, TD prediction can sometimes take longer to solve the problem. In the preceding example, the taxi may accumulate many negative rewards and may take a long time to arrive at a good solution. There are other ways in which the predicted values can be optimized. In the next section, we will cover TD control, which can achieve this.
