The agent would like to obtain the different reward values (←rit,→rit)(1≤i≤m)
The immediate return single-agent learning is the simplest agent learning approach. This method has no need to consider whether the follow-up action exerts any influence on the current learning step. The key to this method is to select the maximal reward value during the learning process of the agent, which has a low computational complexity. It turns out that, if we always select the maximal reward value in every step, the final reward value will not be optimal.
Example 7.2 (See Fig. 7.2).
Let point 1 be the initial state S and point 12 be the target state T. Points 2–11 represent the states that are accessible, and the weights are the reward values.
Starting from state S, we obtain the following learning process under the computational method described by Definition 7.3.
(1)S={1}, a1={1→2,1→3,1→4,1→5}, (←r1,→r1)={(←9,→9),(←7,→7),(←3,→3),(←2,→2)}.
(2)S={1,2}, (←r2,→r2)={(←4,→4),(←2,→2),(←1,→1)}.
We know that (←R2,→R2)=(←4,→4), A2=2→6. As (←R,→R)={(←9,→9),(←4,→4)}, A=
(3)S={1,2,6}, a3={6→9,6→10,}, (←r3,→r3)={(←6,→6),(←5,→5)}.
We know that (←R3,→R3)=(←6,→6), A3=6→9. As (←R,→R)={(←9,→9),(←4,→4)},
(4)S={1,2,6,9}, a4={9→12}, (←r4,→r4)={(←4,→4)},
We know that (←R4,→R4)=(←4,→4), A3=9→12. As (←R,→R)={(←9,→9),(←4,→4),(←6,→6)},
Therefore, the final strategy of the agent is π=A={1→2,2→6,6→9,9→12}.
However, in Fig. 7.2, we can see that the accumulated reward value (←23,→23)
How can we find the optimal strategy, in which the enumeration method is obviously clumsy? Baermann proposed using the principle of optimality, and the resulting recurrence relation provides a better method. By this method, we can easily obtain the optimal sequence of strategies and decrease the problem size. The principle of optimality indicates that the optimal sequence of strategies of the learning process has a specific property: no matter what the initial state and the target state, the intermediate strategies must construct an optimal strategy with respect to the initial strategy.
What follows is a non-immediate return reinforcement learning algorithm, Q-learning. This algorithm can obtain the optimal control strategy from the delayed retribution, even without prior knowledge.
Definition 7.4 The evaluation function Q((←s,→s),α)
Q((←s,→s),a)≡(←R,→R)((←s,→s),a)+γV*((←s′,→s′)). (7.2)
Here, (←R,→R)((←s,→s),α)
Now, we have a problem. If the Q function corresponds to the optimal strategy, how should the agent obtain the Q function? Let us first make the following analysis. At first, we build a relationship between Q and V∗ and the following relationship exists:
V*(s)=maxa′Q(s,a′). (7.3)
Through a simple mathematical derivation, we have
Q(←s,→s),a)≡(←R,→R)((←s,→s),a)+γmaxa′Q((←s′,→s′),a′). (7.4)
Now we know the Q function is determined by the subsequent Q function and the immediate return functions.
We now present Algorithm 7.7 (the Q-learning algorithm based on DFL) and give an example.
Example 7.3 We have six lattices, each representing a state that the agent can access (See Fig. 7.3). To express this conveniently, we set the serial numbers of the lattices to run from 1 to 6 from left to right, top to bottom. The central lattice is number 3 and is the target state.
Algorithm 7.7 The Q-learning algorithm based on DFL
Input: state of the agent is (←s,→s)
Repeated: Monitoring the current state
Select an action from the current action list, and conduct it.
When the agent receives the action, its immediate rewarding value is (←r,→r).
Check the new state (←s,→s).
Update the list item Q((←s,→s),α)
Q((←S,→S),α)←(←R,→R)((←s,→s),α)+γ max Q((←s′,→s′),α)
Modify the current state (←s,→s←(←s′,→s′))a′.
Because the state of the agent is a dynamic fuzzy variable, we assume the agent will obtain an enhanced reward value with reduced ability when its state changes from number 6 to number 3. However, the agent will obtain a subdued reward value with enhanced ability when its state changes from number 5 to number 2.
Therefore, we assume that the immediate reward value of every state in the learning process of the agent will change. Then, under the Q-learning algorithm based on DFL, we obtain the corresponding value of the state Q((←s,→s),a)
Example 7.4 We use the Q-learning algorithm based on DFL to solve Example 7.3. Set the maximal reward value of the agent when starting from number n to be Q(n,∗). Then, we have Q(9,9←12)=(←4,→4)So,Q(9,*)=Q(9,9→12)=(←4,→4)
Thus, it convert to Q(10,*)=Q(10,10→12)=(←2,→2)
Then, we have
Q(11,*)=Q(11,11→12)=(←5,→5), (7.5)
Q(6,6→9)=(←6,→6)+Q(9,*)=(←6,→6)+(←4,→4)=(←10,→10), (7.6)
Q(6,6→10)=(←5,→5)+Q(10,*)=(←5,→5)+(←2,→2)=(←7,→7). (7.7)
Therefore, we know
Q(6,*)=max{Q(6,6→9),Q(6,6→10)}=(←10,→10) (7.8)
Q(7,7→9)=(←4,→4)+Q(9,*)=(←4,→4)+(←4,→4)=(←8,→8), (7.9)
Q(7.7→10)=(←3,→3)+Q(10,*)=(←3,→3)+(←2,→2)=(←5,→5). (7.10)
Hence, in the next step, we have
Q(7,*)=max{Q(7,7→9),Q(7,7→10)}=(←8,→8), (7.11)
Q(8,8→10)=(←5,→5)+Q(10,*)=(←5,→5)+(←2,→2)=(←7,→7), (7.12)
Q(8,8→11)=(←6,→6)+Q(11,*)=(←6,→6)+(←5,→5)=(←11,→11) (7.13)
Finally, Q(8,*)=max{Q(8,8→10),Q(8,8→11)}=(←11,→11).
The same can be processed as Q(2,*)=(←14,→14),Q(3,*)=(←15,→15),Q(4,*)=
Thus, we get
Q(1,1→2)=(←9,→9)+Q(2,*)=(←9,→9)+(←14,→14)=(←23,→23), (7.14)
Q(1,1→3)=(←7,→7)+Q(3,*)=(←7,→7)+(←15,→15)=(←22,→22), (7.15)
Q(1,1→4)=(←3,→3)+Q(4,*)=(←3,→3)+(←22,→22)=(←25,→25), (7.16)
Q(1,1→5)=(←2,→2)+Q(5,*)=(←2,→2)+(←19,→19)=(←21,→21). (7.17)
Here, Q(1,*)=(←25,→25),
In the previous section, we developed a single-agent learning algorithm based on DFL. All the single-agent learning algorithms build a foundation for multi-agent learning. In the multi-agent learning model, each agent is learning according to a particular single-agent learning algorithm based on its own position in the model.
Definition 7.5 In the multi-agent learning model, each circular symbol is an agent (see Fig. 7.4). Some agents form an agent group. Each agent in the same group cooperates with the other sin the group. Different groups compete with each other.
We now introduce a cooperative multi-agent learning algorithm and a competitive multi-agent learning algorithm.
In the real world, when a team is tasked with achieving a goal, there are various ways of proceeding. For example, each person in the team can provide an idea, and the team as a whole makes a comprehensive assessment of all the ideas. Alternatively, the mission could be divided into several parts, each of which is assigned to a different team member.