Chapter 14

Adaptive $H_{\infty}$ Tracking Control of Nonlinear Systems Using Reinforcement Learning

Hamidreza Modares^⁎; Bahare Kiumarsi^†; Kyriakos G. Vamvoudakis^‡; Frank L. Lewis^†^,^§
^⁎Missouri University of Science and Technology, Rolla, MO, United States
^†UTA Research Institute, University of Texas at Arlington, Fort Worth, TX, United States
^‡Virginia Tech, Blacksburg, VA, United States
^§State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Abstract

This chapter presents online solutions to the optimal $H_{\infty}$ tracking of nonlinear systems to attenuate the effect of disturbance on the performance of the systems. To obviate the requirement of complete knowledge of the system dynamics, reinforcement learning (RL) is used to learn the solutions to the Hamilton–Jacobi–Isaacs equations arising from solving the $H_{\infty}$ tracking problem. Off-policy RL algorithms are designed for continuous-time systems, which allows the reuse of data for learning and consequently leads to data-efficient RL algorithms. A solution is first presented for the $H_{\infty}$ optimal tracking control of affine nonlinear systems. It is then extended to a special class of nonlinear nonaffine systems. It is shown that for the nonaffine systems the existence of a stabilizing solution depends on the performance function. A performance function is designed to assure the existence of the solution for a class of nonaffine systems, while taking into account the input constraints.

Keywords

$H_{\infty}$ control; Optimal tracking; Reinforcement learning

Chapter Points

• This approach yields an online data-based solution to the $H_{\infty}$ tracking control problem.
• Reinforcement learning is employed to learn the solution to the $H_{\infty}$ tracking in real-time and without requiring the system dynamics.

14.1 Introduction

Reinforcement learning (RL) [1–3], inspired by learning mechanisms observed in animals, is concerned with how an agent or decision maker takes actions so as to optimize a cost of its long-term interactions with the environment. The cost function is prescribed and captures some desired system behaviors such as minimizing the transient error and minimizing the control effort for achieving a specific goal. The agent learns an optimal policy so that, by taking actions produced based on this policy, the long-term cost function is optimized. Similar to RL, optimal control involves finding an optimal policy by optimizing a long-term performance criterion. Strong connections between RL and optimal control have prompted a major effort towards introducing and developing online and model-free RL algorithms to learn the solution to optimal control problems [4–6].

RL methods have been successfully used to solve the optimal regulation problems by learning the solution to the so-called Hamilton–Jacobi equations arising from both optimal $H_{2}$ [7–18] and $H_{\infty}$ [19–30] regulation problems. For continuous-time (CT) systems, [8,9] proposed a promising RL algorithm, called integral RL (IRL), to learn the solution to the Hamilton–Jacobi–Bellman (HJB) equations using only partial knowledge about the system dynamics. They used an iterative online policy iteration (PI) [31] procedure to implement their IRL algorithm. The original IRL algorithm and many of its extensions are on-policy algorithms. That is, the policy that is applied to the system to generate data for learning (behavior policy) is the same as the policy that is being updated and learned about (target policy). The work [15] presented an off-policy RL algorithm for CT systems in which the behavior policy could be different from the target policy. This algorithm does not require any knowledge of the system dynamics and is data efficient because it reuses the data generated by the behavior policy to learn as many target policies as required. Many variants and extensions of off-policy RL algorithms are presented in the literature. Other than the IRL-based PI algorithms and off-policy RL algorithms, efficient synchronous PI algorithms with guaranteed closed-loop stability were proposed for CT systems in [7,11,12] to learn the solution to the HJB equation. Synchronous IRL algorithms were also presented for solving the HJB equation in [23,32].

Although RL algorithms have been widely used to solve the optimal regulation problems, few results considered solving the optimal tracking control problem (OTCP) for either discrete-time [33–36] or continuous-time systems [6,37]. Moreover, existing methods for continuous-time systems require exact knowledge of the system dynamics a priori, because they find the feedforward part of the control input using either the dynamic inversion concept or the solution of output regulator equations [39–41]. While the importance of the RL algorithms is well understood for solving optimal regulation problems for uncertain systems, the requirement of exact knowledge of the system dynamics for finding the steady-state part of the control input in the existing OTCP formulation does not allow a direct extension of the IRL algorithm to the OTCP.

In this chapter, we develop adaptive optimal controllers based on the RL techniques to learn the optimal $H_{\infty}$ tracking control solutions for nonlinear continuous-time systems without knowing the system dynamics or the command generator dynamics. An augmented system is first constructed from the tracking error dynamics and the command generator dynamics to introduce a new discounted performance function for the OTCP. The tracking Hamilton–Jacobi–Isaacs (HJI) equations are then derived to solve OTCPs. Off-policy RL algorithms, implemented on an actor-critic structure, are used to find the solution to the tracking HJI equations online using only measured data along the augmented system trajectories. These algorithms are developed for both affine and nonaffine nonlinear systems. Therefore, they can be employed in control of many real-world applications, including robot manipulators, mobile robots, unmanned aerial vehicles (UAVs), power systems and human–robot interaction systems.

14.2 $H_{\infty}$ Optimal Tracking Control for Nonlinear Affine Systems

Existing solutions to the $H_{\infty}$ tracking problem are composed of two steps [38–41]. A feedforward control input is designed to guarantee perfect tracking using either dynamic inversion or by solving the so-called output regulator equations in the first step. A feedback control input is designed in the second step by solving an HJI equation to stabilize the tracking error dynamics. In these methods, procedures for computing the feedback and feedforward terms are based on offline solution methods which require complete knowledge of the system dynamics. In this section, a new formulation for the $H_{\infty}$ tracking is presented which allows developing model-free RL solutions.

Consider the nonlinear time-invariant system given as

$\dot{x} (t) = f (x (t)) + g (x (t)) u (t) + k (x (t)) w (t),$

(14.1)

where $x (t) \in R^{n}$ , $u (t) \in R^{m}$ and $w (t) \in R^{q}$ represent the state of the system, the control input and the external disturbance of the system, respectively. The drift dynamics is represented by $f (x (t)) \in R^{n}$ , $g (x (t)) \in R^{n \times m}$ is the input dynamics and $k (x (t)) \in R^{n \times q}$ is the disturbance dynamics. It is assumed that $f (0) = 0$ , that $f (x (t))$ , $g (x (t))$ and $k (x (t))$ are unknown Lipschitz functions and that the system is stabilizable.

Assumption 1

Let $r (t)$ be the bounded reference trajectory and assume that there exists a Lipschitz continuous command generator function $h_{d} (r (t)) \in R^{n}$ with $h_{d} (0) = 0$ such that

$\dot{r} (t) = h_{d} (r (t)) .$

(14.2)

Define the tracking error

$e_{d} (t) ≜ x (t) - r (t) .$

(14.3)

Using (14.1)–(14.3), the tracking error dynamics is given by

${\dot{e}}_{d} (t) = f (x (t)) + g (x (t)) u (t) + k (x (t)) w (t) - h_{d} (r (t)) .$

(14.4)

The performance output to be controlled is defined such that it satisfies

${‖ z (t) ‖}^{2} = {e_{d}}^{T} Q e_{d} + u^{T} R u .$

(14.5)

The goal of the $H_{\infty}$ tracking is to attenuate the effect of the disturbance input w on the performance output z. Before defining the $H_{\infty}$ tracking control problem, we define the following general $L_{2}$ -gain or disturbance attenuation condition.

Definition 1

Bounded $L_{2}$ -gain or disturbance attenuation

The nonlinear system (14.1) is said to have $L_{2}$ -gain less than or equal to γ if the following disturbance attenuation condition is satisfied for all $w \in L_{2} [0, \infty)$ :

$\frac{\int_{t}^{\infty} e^{- α (τ - t)} {‖ z (τ) ‖}^{2} d τ}{\int_{t}^{\infty} e^{- α (τ - t)} {‖ w (τ) ‖}^{2} d τ} ⩽ γ^{2},$

(14.6)

where $α > 0$ is the discount factor and γ represents the amount of attenuation from the disturbance input $w (t)$ to the defined performance output variable $z (t)$ .

The disturbance attenuation condition (14.6) implies that the effect of the disturbance input to the desired performance output is attenuated by a degree at least equal to γ. The desired performance output represents a meaningful cost in the sense that it includes a positive penalty on the tracking error and a positive penalty on the control effort. The use of the discount factor is essential. This is because the feedforward part of the control input does not converge to zero in general and thus penalizing the control input in the performance function without a discount factor makes the performance function unbounded.

Using (14.5) in (14.6) one has

$\int_{t}^{\infty} e^{- α (τ - t)} ({e_{d}}^{T} Q e_{d} + u^{T} R u) d τ ⩽ γ^{2} \int_{t}^{\infty} e^{- α (τ - t)} (w^{T} w) d τ .$

(14.7)
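As a rough numerical illustration of the discounted attenuation condition (14.6), the ratio of discounted energies can be estimated from sampled signals over a finite window. The sketch below is not part of the chapter's method: the function names `trapezoid` and `discounted_gain_sq` and the test signals are hypothetical, and the infinite horizon is truncated to the length of the recorded data.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule for samples y taken at (possibly non-uniform) times t."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def discounted_gain_sq(t, z_sq, w_sq, alpha):
    """Estimate the left-hand side of (14.6) on the finite window [t[0], t[-1]].

    z_sq and w_sq hold samples of ||z||^2 and ||w||^2 at the times in t.
    """
    t = np.asarray(t, dtype=float)
    weight = np.exp(-alpha * (t - t[0]))  # discount factor e^{-alpha (tau - t)}
    num = trapezoid(weight * np.asarray(z_sq, dtype=float), t)
    den = trapezoid(weight * np.asarray(w_sq, dtype=float), t)
    return num / den

# Hypothetical signals: ||z||^2 = e^{-2 tau}, ||w||^2 = e^{-tau}, alpha = 1.
# Analytically the ratio is (1/3) / (1/2) = 2/3 on an infinite horizon.
t = np.linspace(0.0, 40.0, 40001)
ratio = discounted_gain_sq(t, np.exp(-2.0 * t), np.exp(-t), alpha=1.0)
print(ratio)
```

For this particular pair of signals and start time, any $γ^{2}$ at or above the printed ratio satisfies (14.6); the definition itself requires the bound to hold for all $w \in L_{2} [0, \infty)$ , which a single recorded trajectory cannot certify.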

Definition 2

$H_{\infty}$ optimal tracking

The $H_{\infty}$ tracking control problem is to find a control policy $u = β (e_{d}, r)$ for some smooth function β depending on the tracking error $e_{d}$ and the reference trajectory r, such that:

(i) The closed-loop system $\dot{x} = f (x) + g (x) β (e_{d}, r) + k (x) w$ satisfies the attenuation condition (14.7).

(ii) The tracking error dynamics (14.4) with $w = 0$ is locally asymptotically stable.

The main difference between Definition 2 and the standard definition of the $H_{\infty}$ tracking control problem (see [38], Definition 5.2.1) is that a more general disturbance attenuation condition is defined here. Previous work on the $H_{\infty}$ optimal tracking divides the control input into feedback and feedforward parts. The feedforward part is first obtained separately without considering any optimality criterion. Then, the problem of optimal design of the feedback part is reduced to an $H_{\infty}$ optimal regulation problem. In contrast, in the new formulation, both feedback and feedforward parts of the control input are obtained simultaneously and optimally as a result of the defined $L_{2}$ -gain with discount factor in (14.7).

14.2.1 HJI Equation for $H_{\infty}$ Optimal Tracking

In this section, it is first shown that the problem of solving the $H_{\infty}$ tracking problem can be transformed into a min–max optimization problem subject to an augmented system composed of the tracking error dynamics and the command generator dynamics. A tracking HJI equation is then developed which gives the solution to the min–max optimization problem. The stability and $L_{2}$ -gain boundedness of the tracking HJI control solution are discussed.

Define the augmented system state

$X (t) = {[e_{d} {(t)}^{T} r {(t)}^{T}]}^{T} \in R^{2 n},$

where $e_{d} (t)$ is the tracking error defined in (14.3) and $r (t)$ is the reference trajectory.

Using (14.2) and (14.4), define the augmented system

$\dot{X} (t) = F (X (t)) + G (X (t)) u (t) + K (X (t)) w (t),$

(14.8)

where $u (t) = u (X (t))$ and

$F (X) = [\begin{matrix} f (e_{d} + r) - h_{d} (r) \\ h_{d} (r) \end{matrix}], G (X) = [\begin{matrix} g (e_{d} + r) \\ 0 \end{matrix}], K (X) = [\begin{matrix} k (e_{d} + r) \\ 0 \end{matrix}] .$
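The construction of (14.8) can be sketched in code for simulation purposes. This is a minimal sketch assuming the dynamics are available as callables, which is only the case in a simulator (the chapter treats $f$ , $g$ , $k$ and $h_{d}$ as unknown to the learner). The helper name `make_augmented` and the two-state example plant are hypothetical.

```python
import numpy as np

def make_augmented(f, g, k, h_d, n):
    """Return callables F, G, K of (14.8) for the augmented state X = [e_d; r]."""
    def F(X):
        e_d, r = X[:n], X[n:]
        # Top block: f(e_d + r) - h_d(r); bottom block: command generator h_d(r).
        return np.concatenate([f(e_d + r) - h_d(r), h_d(r)])

    def G(X):
        gx = g(X[:n] + X[n:])
        return np.vstack([gx, np.zeros_like(gx)])  # the control does not drive r

    def K(X):
        kx = k(X[:n] + X[n:])
        return np.vstack([kx, np.zeros_like(kx)])  # the disturbance does not drive r

    return F, G, K

# Hypothetical 2-state plant tracking a sinusoidal reference r_dot = A_ref r.
A_ref = np.array([[0.0, 1.0], [-1.0, 0.0]])
f = lambda x: np.array([x[1], -x[0] - x[1]])
g = lambda x: np.array([[0.0], [1.0]])
k = lambda x: np.array([[0.0], [1.0]])
h_d = lambda r: A_ref @ r

F, G, K = make_augmented(f, g, k, h_d, n=2)
X = np.array([1.0, 0.0, 0.0, 1.0])  # e_d = [1, 0], r = [0, 1]
u = np.array([0.5])
w = np.array([0.0])
X_dot = F(X) + G(X) @ u + K(X) @ w   # right-hand side of (14.8)
print(X_dot)
```

The zero blocks in `G` and `K` reflect that neither the control nor the disturbance enters the command generator dynamics.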

The disturbance attenuation condition (14.7) using the augmented state becomes

$\int_{t}^{\infty} e^{- α (τ - t)} (X^{T} Q_{T} X + u^{T} R u) d τ ⩽ γ^{2} \int_{t}^{\infty} e^{- α (τ - t)} (w^{T} w) d τ,$

(14.9)

where

$Q_{T} = [\begin{matrix} Q & 0 \\ 0 & 0 \end{matrix}] .$

Based on (14.9), define the performance function

$J (u, w) = \int_{t}^{\infty} e^{- α (τ - t)} (X^{T} Q_{T} X + u^{T} R u - γ^{2} w^{T} w) d τ .$

(14.10)

Solvability of the $H_{\infty}$ control problem is equivalent to solvability of the following zero-sum game [42]:

$V^{⋆} (X (t)) = J (u^{⋆}, w^{⋆}) = \min_{u} \max_{w} J (u, w),$

(14.11)

where J is defined in (14.10) and $V^{⋆} (X (t))$ is defined as the optimal value function. This two-player zero-sum game control problem has a unique solution if a game theoretic saddle point exists, i.e., if the following Nash condition holds:

$V^{⋆} (X (t)) = \min_{u} \max_{w} J (u, w) = \max_{w} \min_{u} J (u, w) .$

Differentiating (14.10) and noting that $V (X (t)) = J (u (t), w (t))$ gives the following Bellman equation:

$H (V, u, w) \overset{Δ}{=} X^{T} Q_{T} X + u^{T} R u - γ^{2} w^{T} w - α V + {V_{X}}^{T} (F + G u + K w) = 0,$

(14.12)

where $F ≜ F (X)$ , $G ≜ G (X)$ , $K ≜ K (X)$ and $V_{X} = \partial V / \partial X$ .

Applying stationarity conditions $\partial H (V^{⋆}, u, w) / \partial u = 0, \partial H (V^{⋆}, u, w) / \partial w = 0$ [43] gives the optimal control and disturbance inputs as

$u^{⋆} = - \frac{1}{2} R^{- 1} G^{T} {V_{X}}^{⋆},$

(14.13)

$w^{⋆} = \frac{1}{2 γ^{2}} K^{T} {V_{X}}^{⋆},$

(14.14)

where $V^{⋆}$ is the optimal value function defined in (14.11). Substituting the control input (14.13) and the disturbance (14.14) into (14.12), the following tracking HJI equation is obtained:

$\begin{matrix} H (V^{⋆}, u^{⋆}, w^{⋆}) ≜ X^{T} Q_{T} X + {V_{X}}^{⋆ T} F - α V^{⋆} \\ - \frac{1}{4} {V_{X}}^{⋆ T} G R^{- 1} G^{T} {V_{X}}^{⋆} + \frac{1}{4 γ^{2}} {V_{X}}^{⋆ T} K K^{T} {V_{X}}^{⋆} = 0 . \end{matrix}$

(14.15)

It is shown in [44] that the control solution (14.13)–(14.15) satisfies the disturbance attenuation condition (14.9) (part (i) of Definition 2) and that it guarantees the stability of the tracking error dynamics (14.4) without the disturbance (part (ii) of Definition 2), if the discount factor is less than an upper bound.
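The substitution that leads from the Bellman equation (14.12) to the HJI equation (14.15) can be checked numerically at a single point. The sketch below uses arbitrary test data rather than a solved value function, so the Hamiltonian is not zero. It only confirms that evaluating (14.12) at the policies (14.13) and (14.14) reproduces the left-hand side of (14.15), with the quadratic control term written in the dimensionally consistent form ${V_{X}}^{T} G R^{- 1} G^{T} V_{X}$ . All function names and numerical values are illustrative assumptions.

```python
import numpy as np

def saddle_policies(V_X, G, K, R, gamma):
    """Stationary control (14.13) and disturbance (14.14) for a given gradient V_X."""
    u_star = -0.5 * np.linalg.solve(R, G.T @ V_X)
    w_star = (0.5 / gamma**2) * (K.T @ V_X)
    return u_star, w_star

def hamiltonian(X, V, V_X, u, w, F, G, K, Q_T, R, gamma, alpha):
    """Left-hand side of the Bellman equation (14.12)."""
    return (X @ Q_T @ X + u @ R @ u - gamma**2 * (w @ w)
            - alpha * V + V_X @ (F + G @ u + K @ w))

def hji_lhs(X, V, V_X, F, G, K, Q_T, R, gamma, alpha):
    """Left-hand side of the HJI equation (14.15)."""
    control_term = V_X @ G @ np.linalg.solve(R, G.T @ V_X)
    disturbance_term = V_X @ K @ K.T @ V_X
    return (X @ Q_T @ X + V_X @ F - alpha * V
            - 0.25 * control_term + 0.25 / gamma**2 * disturbance_term)

# Arbitrary test point (2n = 4, m = 2, q = 1); not a solution of the HJI equation.
rng = np.random.default_rng(0)
X, V_X, F = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
G, K = rng.normal(size=(4, 2)), rng.normal(size=(4, 1))
Q_T = np.diag([1.0, 1.0, 0.0, 0.0])
R = np.diag([1.0, 2.0])
V, gamma, alpha = 0.7, 2.0, 0.1

u_s, w_s = saddle_policies(V_X, G, K, R, gamma)
args = (F, G, K, Q_T, R, gamma, alpha)
H_saddle = hamiltonian(X, V, V_X, u_s, w_s, *args)
print(H_saddle - hji_lhs(X, V, V_X, *args))  # difference should be at rounding level
```

Perturbing `u_s` increases the Hamiltonian and perturbing `w_s` decreases it, which is the saddle-point structure behind the min–max problem (14.11).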

14.2.2 Off-Policy IRL for Learning the Tracking HJI Equation

In this section, an off-policy RL algorithm is given to learn this control solution online without requiring any knowledge of the system dynamics.

The Bellman equation (14.12) is linear in the cost function V, while the HJI equation (14.15) is nonlinear in the value function $V^{⋆}$ . Therefore, solving the Bellman equation for V is easier than solving the HJI for $V^{⋆}$ . Instead of directly solving for $V^{⋆}$ , a policy iteration (PI) algorithm iterates on both control and disturbance players to break the HJI equation into a sequence of differential equations linear in the cost. An offline PI algorithm for solving the $H_{\infty}$ optimal tracking problem is given as follows.
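The listing of Algorithm 1 is not reproduced in this text. Judging from how (14.16)–(14.18) are used in the derivation of (14.20) and in Lemma 1, its steps are presumably of the following form. This is a reconstruction under that assumption, not the authors' verbatim statement, and the initialization in particular is assumed.

```latex
% Assumed form of Algorithm 1 (offline PI for the tracking HJI equation).
% Initialization (assumed): start with an admissible control policy u^{0}.

% Step 1, policy evaluation, cf. (14.16): solve for V^{j}
H (V^{j}, u^{j}, w^{j}) = X^{T} Q_{T} X + (u^{j})^{T} R u^{j} - \gamma^{2} (w^{j})^{T} w^{j}
    - \alpha V^{j} + (V_{X}^{j})^{T} (F + G u^{j} + K w^{j}) = 0

% Step 2, disturbance update, cf. (14.17)
w^{j + 1} = \frac{1}{2 \gamma^{2}} K^{T} V_{X}^{j}

% Step 3, control update, cf. (14.18)
u^{j + 1} = - \frac{1}{2} R^{- 1} G^{T} V_{X}^{j}
```

These forms are the iterated counterparts of (14.12)–(14.14): Step 1 is the Bellman equation evaluated at the current policies, and Steps 2 and 3 are the stationarity conditions applied to $V^{j}$ . They are also exactly the identities needed to pass from the first to the second line of (14.20).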

Algorithm 1 extends the results of the simultaneous RL algorithm in [27] to the tracking problem. The convergence of this algorithm to the minimal nonnegative solution of the HJI equation was shown in [27]. In fact, similar to [27], the convergence of Algorithm 1 can be established by proving that iteration on (14.16) is essentially a Newton iterative sequence which converges to the unique solution of the HJI equation (14.15).

Algorithm 1 requires complete knowledge of the system dynamics. In the following, the off-policy IRL algorithm, which was presented in [14,15] for solving the $H_{2}$ optimal regulation problem, is extended here to solve the $H_{\infty}$ optimal tracking for systems with completely unknown dynamics. To this end, the system dynamics (14.8) is first written as

$\dot{X} = F + G u^{j} + K w^{j} + G (u - u^{j}) + K (w - w^{j}),$

(14.19)

where $u^{j} \in R^{m}$ and $w^{j} \in R^{q}$ are policies to be updated. In this equation, the control input u is the behavior policy which is applied to the system to generate data for learning, while $u^{j}$ is the target policy which is evaluated and updated using data generated by the behavior policy. The fixed control policy u should be a stable and exploring control policy. Moreover, the disturbance input w is the actual external disturbance that comes from an external source and is not under our control, whereas $w^{j}$ is the disturbance policy that is evaluated and updated. One advantage of this off-policy IRL Bellman equation is that, in contrast to on-policy RL-based methods, the disturbance input that is applied to the system does not need to be adjustable.

Differentiating $V^{j} (X)$ along the system dynamics (14.19) and using (14.16)–(14.18) gives

$\begin{matrix} {\dot{V}}^{j} = {({V_{X}}^{j})}^{T} (F + G u^{j} + K w^{j}) + {({V_{X}}^{j})}^{T} G (u - u^{j}) + {({V_{X}}^{j})}^{T} K (w - w^{j}) \\ = α V^{j} - X^{T} Q_{T} X - {(u^{j})}^{T} R u^{j} + γ^{2} {(w^{j})}^{T} w^{j} - \\ 2 {(u^{j + 1})}^{T} R (u - u^{j}) + 2 γ^{2} {(w^{j + 1})}^{T} (w - w^{j}) . \end{matrix}$

(14.20)

Multiplying both sides of (14.20) by $e^{- α (τ - t)}$ and integrating both sides over $[t, t + T]$ yields the following off-policy IRL Bellman equation:

$\begin{matrix} e^{- α T} V^{j} (X (t + T)) - V^{j} (X (t)) = \\ \int_{t}^{t + T} e^{- α (τ - t)} (- X^{T} Q_{T} X - {(u^{j})}^{T} R u^{j} + γ^{2} {(w^{j})}^{T} w^{j}) d τ \\ + \int_{t}^{t + T} e^{- α (τ - t)} (- 2 {(u^{j + 1})}^{T} R (u - u^{j}) + 2 γ^{2} {(w^{j + 1})}^{T} (w - w^{j})) d τ . \end{matrix}$

(14.21)

Note that, for a fixed control policy u (the policy that is applied to the system) and a given disturbance w (the actual disturbance that is applied to the system), Eq. (14.21) can be solved for both the value function $V^{j}$ and the updated policies $u^{j + 1}$ and $w^{j + 1}$ simultaneously.

Lemma 1

The off-policy IRL equation (14.21) gives the same solution for the value function as the Bellman equation (14.16) and the same updated control and disturbance policies as (14.18) and (14.17).

Proof

See [44]. □

The following algorithm uses the off-policy tracking Bellman equation (14.21) to iteratively solve the HJI equation (14.15) without requiring any knowledge of the system dynamics. The implementation of this algorithm is discussed in the next subsection. It is shown how the data collected from a fixed control policy u are reused to evaluate many updated control policies $u^{j}$ sequentially until convergence to the optimal solution is achieved.

Inspired by the off-policy algorithm in [14], Algorithm 2 has two separate phases. First, a fixed initial exploratory control policy u is applied and the system information is recorded over the time interval T. Second, without requiring any knowledge of the system dynamics, the information collected in phase 1 is repeatedly used to find a sequence of updated policies $u^{j}$ and $w^{j}$ converging to $u^{⋆}$ and $w^{⋆}$ . Note that Eq. (14.23) is a scalar equation and can be solved in a least squares sense after collecting enough data samples from the system. It is shown in the following section how to collect the required information in phase 1 and reuse it in phase 2 in a least squares sense to solve (14.23) for $V^{j}$ , $u^{j + 1}$ and $w^{j + 1}$ simultaneously. After the learning is done and the optimal control policy $u^{⋆}$ is found, it can be applied to the system.

Algorithm 2 Online off-policy RL algorithm for solving the tracking HJI equation.

Theorem 1

Convergence of Algorithm 2

The off-policy Algorithm 2 converges to the optimal control and disturbance solutions given by (14.13) and (14.14) where the value function satisfies the tracking HJI equation (14.15).

Proof

See [44]. □

14.2.3 Implementing Algorithm 2 Using Neural Networks

In order to implement the off-policy RL Algorithm 2, it is required to reuse the collected information found by applying a fixed control policy u to the system to solve Eq. (14.23) for $V^{j}$ , $u^{j + 1}$ and $w^{j + 1}$ iteratively. Three neural networks (NNs), i.e., the actor NN, the critic NN and the disturber NN, are used here to approximate the value function and the updated control and disturbance policies in the Bellman equation (14.23). That is, the solution $V^{j}$ , $u^{j + 1}$ and $w^{j + 1}$ of the Bellman equation (14.23) is approximated by three NNs as

${\hat{V}}^{j} (X) = {\hat{W}}_{1}^{T} σ (X),$

(14.24)

${\hat{u}}^{j + 1} (X) = {\hat{W}}_{2}^{T} ϕ (X),$

(14.25)

${\hat{w}}^{j + 1} (X) = {\hat{W}}_{3}^{T} φ (X),$

(14.26)

where $σ = [σ_{1}, \ldots, σ_{l_{1}}] \in R^{l_{1}}$ , $ϕ = [ϕ_{1}, \ldots, ϕ_{l_{2}}] \in R^{l_{2}}$ and $φ = [φ_{1}, \ldots, φ_{l_{3}}] \in R^{l_{3}}$ provide suitable basis function vectors, ${\hat{W}}_{1} \in R^{l_{1}}$ , ${\hat{W}}_{2} \in R^{l_{2} \times m}$ and ${\hat{W}}_{3} \in R^{l_{3} \times q}$ are constant weights and $l_{1}$ , $l_{2}$ and $l_{3}$ are the numbers of neurons. Define $v^{1} = {[v_{1}^{1}, \ldots, v_{m}^{1}]}^{T} = u - u^{j}$ , $v^{2} = {[v_{1}^{2}, \ldots, v_{q}^{2}]}^{T} = w - w^{j}$ and assume $R = \operatorname{diag} (r_{1}, \ldots, r_{m})$ . Then, substituting (14.24)–(14.26) in (14.23) yields

$\begin{matrix} e (t) = {\hat{W}}_{1}^{T} (e^{- α T} σ (X (t + T)) - σ (X (t))) \\ - \int_{t}^{t + T} e^{- α (τ - t)} (- X^{T} Q_{T} X - {(u^{j})}^{T} R u^{j} + γ^{2} {(w^{j})}^{T} w^{j}) d τ \\ + 2 \sum_{l = 1}^{m} r_{l} \int_{t}^{t + T} e^{- α (τ - t)} {\hat{W}}_{2, l}^{T} ϕ (X (t)) v_{l}^{1} d τ \\ - 2 γ^{2} \sum_{k = 1}^{q} \int_{t}^{t + T} e^{- α (τ - t)} {\hat{W}}_{3, k}^{T} φ (X (t)) v_{k}^{2} d τ, \end{matrix}$

(14.27)

where $e (t)$ is the Bellman approximation error, ${\hat{W}}_{2, l}$ is the lth column of ${\hat{W}}_{2}$ and ${\hat{W}}_{3, k}$ is the kth column of ${\hat{W}}_{3}$ . The Bellman approximation error is the continuous-time counterpart of the temporal difference (TD) [10]. In order to bring the TD error to its minimum value, the least squares method is used. To this end, rewrite Eq. (14.27) as

$y (t) + e (t) = {\hat{W}}^{T} h (t),$

(14.28)

where

$\begin{matrix} \hat{W} = {[{\hat{W}}_{1}^{T}, {\hat{W}}_{2, 1}^{T}, \ldots, {\hat{W}}_{2, m}^{T}, {\hat{W}}_{3, 1}^{T}, \ldots, {\hat{W}}_{3, q}^{T}]}^{T} \in R^{l_{1} + m \times l_{2} + q \times l_{3}}, \\ h (t) = [\begin{matrix} e^{- α T} σ (X (t + T)) - σ (X (t)) \\ 2 r_{1} \int_{t}^{t + T} e^{- α (τ - t)} ϕ (X (t)) v_{1}^{1} d τ \\ ⋮ \\ 2 r_{m} \int_{t}^{t + T} e^{- α (τ - t)} ϕ (X (t)) v_{m}^{1} d τ \\ - 2 γ^{2} \int_{t}^{t + T} e^{- α (τ - t)} φ (X (t)) v_{1}^{2} d τ \\ ⋮ \\ - 2 γ^{2} \int_{t}^{t + T} e^{- α (τ - t)} φ (X (t)) v_{q}^{2} d τ \end{matrix}], \end{matrix}$

(14.29)

$y (t) = \int_{t}^{t + T} e^{- α (τ - t)} (- X^{T} Q_{T} X - {(u^{j})}^{T} R u^{j} + γ^{2} {(w^{j})}^{T} w^{j}) d τ .$

(14.30)

The parameter vector $\hat{W}$ , which gives the approximated value function, actor and disturbance (14.24)–(14.26), is found by minimizing, in the least squares sense, the Bellman error. Assume that the system state, input and disturbance information are collected at $N ⩾ l_{1} + m \times l_{2} + q \times l_{3}$ (the number of independent elements in $\hat{W}$ ) time instants $t_{1}$ to $t_{N}$ , each over a time interval of the same length T, in phase 1. Then, for a given $u^{j}$ and $w^{j}$ , one can use this information to evaluate (14.29) and (14.30) at N points to form

$\begin{matrix} H = [h (t_{1}), \ldots, h (t_{N})], \\ Y = {[y (t_{1}), \ldots, y (t_{N})]}^{T} . \end{matrix}$

The least squares solution to (14.28) is then equal to

$\hat{W} = {(H H^{T})}^{- 1} H Y,$

which gives $V^{j}$ , $u^{j + 1}$ and $w^{j + 1}$ . Note that although $X (t + T)$ appears in Eq. (14.27), this equation is solved in a least squares sense after observing N samples $X (t)$ , $X (t + T)$ , …, $X (t + N T)$ . Therefore, knowledge of the system dynamics is not required to predict the future state $X (t + T)$ at time t in order to solve (14.27).
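The batch least squares step can be sketched as follows, assuming the regressors $h (t_{i})$ and targets $y (t_{i})$ of (14.29) and (14.30) have already been computed from recorded data. The function names `solve_weights` and `unpack_weights` are hypothetical, and the data below are synthetic, generated from a known weight vector purely to exercise the solver. Using `numpy.linalg.lstsq` on $H^{T} \hat{W} = Y$ is a numerically safer route to the same solution as ${(H H^{T})}^{- 1} H Y$ when $H H^{T}$ is invertible.

```python
import numpy as np

def solve_weights(H, Y):
    """Least squares solution of Y = H^T W, with H of shape (p, N) and Y of shape (N,)."""
    W, *_ = np.linalg.lstsq(H.T, Y, rcond=None)
    return W

def unpack_weights(W, l1, l2, l3, m, q):
    """Split the stacked vector of (14.29) into critic, actor and disturber weights."""
    W1 = W[:l1]
    W2 = W[l1:l1 + m * l2].reshape(m, l2).T   # column l is W_{2,l}
    W3 = W[l1 + m * l2:].reshape(q, l3).T     # column k is W_{3,k}
    return W1, W2, W3

# Synthetic check: build data from a known weight vector and recover it.
l1, l2, l3, m, q = 3, 2, 2, 2, 1
p = l1 + m * l2 + q * l3
N = 30                                   # N >= p samples, as required in the text
rng = np.random.default_rng(1)
H = rng.normal(size=(p, N))              # columns play the role of h(t_1), ..., h(t_N)
W_true = rng.normal(size=p)
Y = H.T @ W_true                         # noise-free targets y(t_i)

W_hat = solve_weights(H, Y)
W_normal = np.linalg.solve(H @ H.T, H @ Y)   # the closed form (H H^T)^{-1} H Y
W1, W2, W3 = unpack_weights(W_hat, l1, l2, l3, m, q)
print(np.max(np.abs(W_hat - W_true)))
```

The closed form requires $H H^{T}$ to be invertible, which in practice means the behavior policy must excite the system enough for the N regressors to span all directions of $\hat{W}$ .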

14.3 $H_{\infty}$ Optimal Tracking Control for a Class of Nonlinear Nonaffine Systems

This section considers the design of an RL-based optimal tracking control solution for a class of nonaffine systems.

14.3.1 A Class of Nonaffine Dynamical Systems

A special class of nonaffine systems can be described as

$\dot{X} (t) = f (X (t)) + g (X (t)) L (u) + D w (t),$

(14.31)

where $X (t) \in R^{n}$ , $u (t) \in R^{m}$ and $w (t) \in R^{p}$ are the state of the system, the control input and the external disturbance input, respectively. The functions $f (X (t))$ and $g (X (t))$ are Lipschitz functions. This system is affine in a nonlinear function $L (\cdot)$ of the control input $u (t)$ . This class of nonaffine systems allows the definition of a new performance function for the optimal $H_{\infty}$ problem such that the existence of the constrained optimal control is assured (if any exists).

The following example shows that the UAV as a real-world application can be presented in the form of (14.31).

Example 1

A general class of nonlinear nonaffine UAV systems has the following well-known form:

$\begin{matrix} {\dot{x}}_{1} = V \cos γ \cos ψ + d_{1} w_{1}, \\ {\dot{x}}_{2} = V \cos γ \sin ψ + d_{2} w_{2}, \\ {\dot{x}}_{3} = - V \sin γ + d_{3} w_{3}, \\ \dot{V} = - α_{2} V^{2} - g \sin γ + α_{1} \bar{T} - α_{3} n_{z} - α_{4} \frac{n_{z}^{2}}{V^{2}}, \\ \dot{γ} = \frac{g}{V} (n_{z} \cos ϕ - \cos γ), \\ \dot{ψ} = \frac{g}{V \cos γ} n_{z} \sin ϕ, \end{matrix}$

(14.32)

with

$\begin{matrix} n_{x} = \frac{\bar{T} {\bar{T}}_{\max} \cos α - D}{m g}, \\ n_{z} = \frac{\bar{T} {\bar{T}}_{\max} \sin α + K}{m g}, \end{matrix}$

where $x_{1}$ , $x_{2}$ , $x_{3}$ are the UAV location coordinates, γ is the pitch angle, ψ is the heading angle, ϕ is the bank angle, V is the UAV velocity and m is the mass of the UAV. The terms $n_{x}$ and $n_{z}$ denote the longitudinal and normal components of the load factor, which depend on the current thrust $\bar{T}$ , drag force D and lift force K (g is the acceleration due to gravity) [45].

Define the state of the UAV as

$X = {[x_{1}, x_{2}, x_{3}, V, γ, ψ]}^{T}$

(14.33)

and the control input and disturbance inputs (wind velocity) as $u (t) = {[\bar{T}, n_{z}, ϕ]}^{T} = {[\begin{matrix} u_{1} & u_{2} & u_{3} \end{matrix}]}^{T}$ and $w (t)$ , respectively. The constraints on the control input are as follows:

$\begin{matrix} | u_{1} | ⩽ {\bar{u}}_{1}, \\ | u_{2} | ⩽ {\bar{u}}_{2} . \end{matrix}$

(14.34)

Using (14.32) and (14.33), the UAV dynamics can be written as a nonlinear nonaffine CT system as

$\dot{X} (t) = M (X (t), u (t)) + D w (t),$

(14.35)

with

$\begin{matrix} D & = {[\begin{matrix} d_{1} & d_{2} & d_{3} & 0 & 0 & 0 \end{matrix}]}^{T}, \\ M (X, u) & = [\begin{matrix} x_{4} \cos (x_{5}) \cos (x_{6}) \\ x_{4} \cos (x_{5}) \sin (x_{6}) \\ - x_{4} \sin (x_{5}) \\ - α_{2} x_{4}^{2} - g \sin (x_{5}) + α_{1} u_{1} - α_{3} u_{2} - α_{4} \frac{u_{2}^{2}}{x_{4}^{2}} \\ \frac{g}{x_{4}} (- \cos (x_{5}) + u_{2} \cos (u_{3})) \\ \frac{g}{x_{4} \cos (x_{5})} u_{2} \sin (u_{3}) \end{matrix}] . \end{matrix}$

The UAV dynamics (14.35) can be written in the form of (14.31) with

$\begin{matrix} f (X (t)) = [\begin{matrix} x_{4} \cos (x_{5}) \cos (x_{6}) \\ x_{4} \cos (x_{5}) \sin (x_{6}) \\ - x_{4} \sin (x_{5}) \\ - α_{2} x_{4}^{2} - g \sin (x_{5}) \\ - \frac{g}{x_{4}} \cos (x_{5}) \\ 0 \end{matrix}], g (X (t)) = [\begin{matrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ α_{1} & - α_{3} & - \frac{α_{4}}{x_{4}^{2}} & 0 & 0 \\ 0 & 0 & 0 & \frac{g}{x_{4}} & 0 \\ 0 & 0 & 0 & 0 & \frac{g}{x_{4} \cos (x_{5})} \end{matrix}], \\ L (u (t)) = [\begin{matrix} L_{1} \\ L_{2} \\ L_{3} \\ L_{4} \\ L_{5} \end{matrix}] = [\begin{matrix} u_{1} \\ u_{2} \\ u_{2}^{2} \\ u_{2} \cos (u_{3}) \\ u_{2} \sin (u_{3}) \end{matrix}] . \end{matrix}$
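The decomposition of the UAV model into the form (14.31) can be checked numerically by comparing $M (X, u)$ from (14.35) with $f (X) + g (X) L (u)$ at a test point. In the sketch below the fifth row of $g$ carries the factor $g / x_{4}$ multiplying $L_{4} = u_{2} \cos (u_{3})$ , which is what the fifth row of $M$ requires. The coefficient values $α_{1}, \ldots, α_{4}$ and the test state are hypothetical, and the names `M_uav`, `f_uav`, `g_uav` and `L_uav` are introduced only for this illustration.

```python
import numpy as np

G0 = 9.81  # gravitational acceleration (the scalar g in (14.32))

def M_uav(X, u, a):
    """Nonaffine right-hand side M(X, u) of (14.35), without the disturbance term."""
    _, _, _, x4, x5, x6 = X
    u1, u2, u3 = u
    a1, a2, a3, a4 = a
    return np.array([
        x4 * np.cos(x5) * np.cos(x6),
        x4 * np.cos(x5) * np.sin(x6),
        -x4 * np.sin(x5),
        -a2 * x4**2 - G0 * np.sin(x5) + a1 * u1 - a3 * u2 - a4 * u2**2 / x4**2,
        G0 / x4 * (-np.cos(x5) + u2 * np.cos(u3)),
        G0 / (x4 * np.cos(x5)) * u2 * np.sin(u3),
    ])

def f_uav(X, a):
    """Drift term f(X): the part of M that does not depend on u."""
    _, _, _, x4, x5, x6 = X
    return np.array([
        x4 * np.cos(x5) * np.cos(x6),
        x4 * np.cos(x5) * np.sin(x6),
        -x4 * np.sin(x5),
        -a[1] * x4**2 - G0 * np.sin(x5),
        -G0 / x4 * np.cos(x5),
        0.0,
    ])

def g_uav(X, a):
    """Input matrix g(X) of shape (6, 5) multiplying L(u)."""
    _, _, _, x4, x5, _ = X
    g = np.zeros((6, 5))
    g[3, :3] = [a[0], -a[2], -a[3] / x4**2]
    g[4, 3] = G0 / x4
    g[5, 4] = G0 / (x4 * np.cos(x5))
    return g

def L_uav(u):
    """Nonlinear input map L(u) = [u1, u2, u2^2, u2 cos(u3), u2 sin(u3)]."""
    u1, u2, u3 = u
    return np.array([u1, u2, u2**2, u2 * np.cos(u3), u2 * np.sin(u3)])

a = (0.1, 0.01, 0.02, 0.03)                     # hypothetical alpha_1..alpha_4
X = np.array([0.0, 0.0, 0.0, 20.0, 0.1, 0.3])   # hypothetical state
u = np.array([0.5, 1.1, 0.2])                   # hypothetical input
residual = M_uav(X, u, a) - (f_uav(X, a) + g_uav(X, a) @ L_uav(u))
print(np.max(np.abs(residual)))
```

A residual at rounding level confirms that the model is affine in $L (u)$ even though it is not affine in $u$ .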

Eq. (14.31) represents a large class of nonaffine systems far larger than the systems that are affine in the control itself. In fact, most aircraft dynamics can be expressed in the form of (14.31) if the lift equation satisfies certain assumptions [45].

14.3.2 Performance Function and $H_{\infty}$ Control Tracking for Nonaffine Systems

It is shown in [46] that the existence of an admissible optimal control solution for nonaffine systems depends on how the utility function $r (X, u)$ is defined. Moreover, to deal with the input constraints, a nonquadratic performance index needs to be defined as follows.

Let the reference trajectory be generated by the command generator dynamics (14.2). The performance or control output $z (t)$ is defined such that it satisfies

${‖ z (t) ‖}^{2} = {(X - r)}^{T} Q (X - r) + W (L (u)),$

(14.36)

where $Q ⪰ 0$ and $W (L (u))$ is a positive definite nonquadratic function of $L (u)$ which penalizes the control effort and is chosen as follows to assure the constrained control effort:

$W (L (u)) = \int_{0}^{L (u)} w (s) d s = \sum_{j = 1}^{l} (\int_{0}^{L_{j} (u)} w_{j} (s_{j}) d s_{j}),$

where $w (s) = \tanh^{- 1} ({\bar{L}}^{- 1} s) = {[\begin{matrix} w_{1} (s_{1}) & \dots & w_{l} (s_{l}) \end{matrix}]}^{T}$ and $\bar{L}$ is the constant diagonal matrix given by $\bar{L} = \operatorname{diag} ({\bar{L}}_{1}, \ldots, {\bar{L}}_{l})$ , which determines the bounds on $L (u)$ . Note that the bounds are originally given for the control input $u (t)$ itself. However, one can transform these bounds to bounds on $L (u)$ .
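Each scalar integral in $W (L (u))$ has a closed form, $\int_{0}^{L_{j}} \tanh^{- 1} (s / {\bar{L}}_{j}) d s = L_{j} \tanh^{- 1} (L_{j} / {\bar{L}}_{j}) + \frac{{\bar{L}}_{j}}{2} \ln (1 - L_{j}^{2} / {\bar{L}}_{j}^{2})$ , which follows from integration by parts and is valid for $| L_{j} | < {\bar{L}}_{j}$ . The sketch below evaluates this expression and compares it with direct quadrature. The function names and the numerical bounds are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def penalty(L, L_bar):
    """Closed-form W(L) = sum_j int_0^{L_j} atanh(s / L_bar_j) ds, for |L_j| < L_bar_j."""
    L = np.atleast_1d(np.asarray(L, dtype=float))
    L_bar = np.atleast_1d(np.asarray(L_bar, dtype=float))
    ratio = L / L_bar
    terms = L * np.arctanh(ratio) + 0.5 * L_bar * np.log1p(-ratio**2)
    return float(np.sum(terms))

def penalty_quadrature(L, L_bar, n=20001):
    """Same quantity by the trapezoidal rule, used only as a cross-check."""
    total = 0.0
    for L_j, Lb_j in zip(np.atleast_1d(L), np.atleast_1d(L_bar)):
        s = np.linspace(0.0, L_j, n)
        y = np.arctanh(s / Lb_j)
        total += float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(s)))
    return total

L_bar = np.array([1.0, 2.0])      # hypothetical bounds on the elements of L(u)
L = np.array([0.5, -1.2])         # hypothetical control effort inside the bounds
W_closed = penalty(L, L_bar)
W_numeric = penalty_quadrature(L, L_bar)
print(W_closed, W_numeric)
```

The penalty is zero at $L (u) = 0$ , positive elsewhere inside the bounds, and its gradient $\tanh^{- 1} ({\bar{L}}^{- 1} L)$ grows without limit as any $| L_{j} |$ approaches ${\bar{L}}_{j}$ , which is what keeps the minimizing control effort inside the constraint set.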

The $H_{\infty}$ control problem is to develop a control input such that (1) the system (14.31) with $w = 0$ is asymptotically stable and (2) the $L_{2}$ -gain condition (14.6) with $z (t)$ defined in (14.36) is satisfied in the presence of $w \in L_{2} [0, \infty)$ .

The disturbance attenuation condition is satisfied if the following cost function is nonpositive:

$J (X) = \int_{t}^{\infty} e^{- α (τ - t)} [{(X - r)}^{T} Q (X - r) + W (L (u)) - γ^{2} w^{T} w] d τ .$

(14.37)

14.3.3 Solution of the $H_{\infty}$ Control Tracking Problem of Nonaffine Systems

Define the tracking error as (14.3). Then, using (14.2) and (14.31), the tracking error dynamics becomes

${\dot{e}}_{d} (t) = \dot{X} (t) - \dot{r} (t) = f (X (t)) + g (X (t)) L (u) + D w (t) - h_{d} (r (t)) .$

(14.38)

Based on (14.2) and (14.38), an augmented system can be constructed in terms of the tracking error $e (t)$ and the reference trajectory $r (t)$ as

$\begin{matrix} \dot{Z} (t) & = [\begin{matrix} \dot{e} (t) \\ \dot{r} (t) \end{matrix}] = [\begin{matrix} f (e (t) + r (t)) - h_{d} (r (t)) \\ h_{d} (r (t)) \end{matrix}] + [\begin{matrix} g (e (t) + r (t)) \\ 0 \end{matrix}] L (u) + [\begin{matrix} D \\ 0 \end{matrix}] w (t) \\ \equiv F (Z (t)) + G (Z (t)) L (u) + Kw (t), \end{matrix}$

(14.39)

where the augmented state is

$Z (t) = [\begin{matrix} e (t) \\ r (t) \end{matrix}] .$

The performance index (14.37) can be rewritten as

$J (L (u), w) = \int_{t}^{\infty} e^{- α (τ - t)} (Z^{T} (τ) Q_{1} Z (τ) + W (L (u)) - γ^{2} w^{T} w) d τ,$

(14.40)

with $Q_{1} = [\begin{matrix} Q & 0 \\ 0 & 0 \end{matrix}]$ .

The $H_{\infty}$ control problem can be expressed as a two-player zero-sum differential game in which the control effort policy player $L (u)$ seeks to minimize the value function, while the disturbance policy player $w (t)$ desires to maximize it. The goal is to find the feedback saddle point $(L^{⋆} (u), w^{⋆})$ such that [42]

$V^{⋆} (Z (t)) = \min_{L (u)} \max_{w} J (L (u), w) .$

(14.41)

On the basis of (14.40) and noting that $V (Z (t)) = J (L (u), w)$ , the $H_{\infty}$ tracking Bellman equation is

$Z^{T} Q_{1} Z + W (L (u)) - γ^{2} w^{T} w - α V (Z) + \dot{V} (Z) = 0$

(14.42)

and the Hamiltonian is given by

$H (Z, L (u), w, V_{Z}) = Z^{T} Q_{1} Z + W (L (u)) - γ^{2} w^{T} w - α V (Z) + V_{Z}^{T} (F (Z) + G (Z) L (u) + Kw) .$

Then the optimal control effort $L (u)$ and disturbance input $w (t)$ for the given problem are obtained by employing the stationarity condition

$\begin{matrix} L^{⋆} (u) = \underset{L (u)}{\arg \min} H (Z, L (u), w, V_{Z}^{⋆}) \Rightarrow \frac{d [Z^{T} Q_{1} Z + W (L (u)) - γ^{2} w^{T} w - α V^{⋆} + {(V_{Z}^{⋆})}^{T} \dot{Z}]}{d L (u)} = 0, \\ w^{⋆} = \underset{w}{\arg \max} H (Z, L (u), w, V_{Z}^{⋆}) \Rightarrow \frac{d [Z^{T} Q_{1} Z + W (L (u)) - γ^{2} w^{T} w - α V^{⋆} + {(V_{Z}^{⋆})}^{T} \dot{Z}]}{d w} = 0, \end{matrix}$

which give

$L^{⋆} (u) = - \bar{L} \tanh^{T} (v^{⋆}),$

(14.43)

$w^{⋆} = \frac{1}{2} γ^{- 2} {(V_{Z}^{⋆})}^{T} K,$

(14.44)

where

$v^{⋆} = {(V_{Z}^{⋆})}^{T} G .$

(14.45)
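As a minimal sketch, (14.43)–(14.45) can be evaluated numerically once a value gradient is available. The function name `saddle_policies` is hypothetical, the gradient `V_Z` is assumed to be given as a 1-D array, and $\bar{L}$ is assumed to act elementwise as a vector of bounds `L_bar`; none of these choices comes from the chapter.

```python
import numpy as np

def saddle_policies(V_Z, G, K, L_bar, gamma):
    """Evaluate (14.43)-(14.45) for a given value gradient.

    V_Z   : gradient of the value function, shape (2n,)
    G, K  : augmented input and disturbance matrices, shapes (2n, l), (2n, q)
    L_bar : assumed elementwise bounds on L(u), shape (l,)
    """
    v = V_Z @ G                          # (14.45): v = V_Z^T G
    L_star = -L_bar * np.tanh(v)         # (14.43): bounded control effort
    w_star = 0.5 * (V_Z @ K) / gamma**2  # (14.44): worst-case disturbance
    return L_star, w_star, v
```

Because of the hyperbolic tangent, every component of `L_star` stays strictly inside the bound `L_bar`, which is how the input constraint enters the policy.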

Substituting (14.43) and (14.44) into the Bellman equation (14.42) yields the HJI equation

$Z^{T} Q_{1} Z + W (L^{⋆} (u)) - γ^{2} {(w^{⋆})}^{T} w^{⋆} - α V^{⋆} (Z) + {\dot{V}}^{⋆} (Z) = 0 .$

(14.46)

To find the optimal control solution, the tracking HJI equation (14.46) is first solved for $V^{⋆}$ , and the control effort $L^{⋆} (u)$ is then obtained from (14.43).

Note that the minimization problem (14.41) is defined in terms of $L (u)$ . Under certain conditions, this is equivalent to minimization in terms of $u (t)$ .

Lemma 2

We have $\min_{u} H (Z, L (u), w, V_{Z}) = \min_{L (u)} H (Z, L (u), w, V_{Z})$ if the elements of $L (u)$ are independent.

Proof

The minimum of $H (Z, L (u), w, V_{Z})$ with respect to u is equal to

$\min_{u} H (Z, L (u), w, V_{Z}) = {(\frac{\partial L (u)}{\partial u})}^{T} \frac{\partial H (Z, L (u), w, V_{Z})}{\partial L (u)} = 0$

(14.47)

and the minimum of $H (Z, L (u), w, V_{Z})$ with respect to $L (u)$ is equal to

$\min_{L (u)} H (Z, L (u), w, V_{Z}) = \frac{d H (Z, L (u), w, V_{Z})}{d L (u)} = 0 .$

(14.48)

Eqs. (14.47) and (14.48) are equivalent if and only if $J = d L (u) / d u$ is a nonsingular matrix which guarantees the elements of $L (u)$ are independent [46]. □

Note that if the elements of $L^{⋆} (u)$ are independent, then the optimal control is given by

$u^{⋆} = - L^{- 1} (\bar{L} \tanh^{T} (v^{⋆})),$

(14.49)

thus $L (u^{⋆}) = L^{⋆} (u)$ . Otherwise, it is shown in the subsequent sections how to use (14.43) to find $v^{⋆}$ and, consequently, $u^{⋆}$ so as to ensure $L (u^{⋆}) = L^{⋆} (u)$ . The next result holds for both independent and dependent $L (u)$ .

Theorem 2

Solution to bounded $L_{2}$ gain problem

Assume that there exists a continuous-time positive semidefinite solution $V^{⋆} (Z)$ to the tracking HJI equation (14.46). Let $L^{⋆} (u)$ be given by (14.43). Then $L^{⋆} (u)$ in (14.31) makes the $L_{2}$ gain from the disturbance to the performance output less than or equal to γ.

Proof

See [46]. □

If the elements of $L (u)$ are independent, then there exists a $u^{⋆}$ such that $L (u^{⋆}) = L^{⋆} (u)$ and this $u^{⋆}$ makes the $L_{2}$ gain less than or equal to γ. On the other hand, if the elements of $L^{⋆} (u)$ are dependent, a method of solution is suggested in subsequent sections.

14.3.4 Off-Policy Reinforcement Learning for Nonaffine Systems

In this section, the off-policy RL is presented to solve the optimal $H_{\infty}$ control of nonaffine nonlinear systems. In the proposed method, no knowledge about the system dynamics and the reference trajectory dynamics is needed. Moreover, it does not require an adjustable disturbance input and it avoids bias in finding the value function. Two algorithms are developed for two different cases: (1) for nonaffine systems with independent elements in $L (u)$ and (2) for nonaffine systems with dependent elements in $L (u)$ . Then the implementation of these two algorithms is given.

The system dynamics (14.39) can be rewritten as

$\dot{Z} (t) = F (Z (t)) + G (Z (t)) L^{j} (u) + K w^{j} + G (Z (t)) (L (u) - L^{j} (u)) + K (w - w^{j}),$

(14.50)

where $L^{j} (u)$ and $w^{j} (t)$ are the policies that are updated. By contrast, $L (u)$ and $w (t)$ are the policies that are applied to the system to collect the data.

By the definition, it is easy to see that

$e^{- α (t_{k} - t_{k - 1})} V^{j + 1} (Z (t_{k})) - V^{j + 1} (Z (t_{k - 1})) = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} [{(V_{Z}^{j + 1})}^{T} \dot{Z} (τ) - α V^{j + 1} (Z (τ))] d τ .$

(14.51)
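The identity (14.51) can be checked numerically on a toy problem. The sketch below reads the integrand as $e^{- α (τ - t_{k - 1})} [{(V_{Z})}^{T} \dot{Z} - α V]$ and uses assumed scalar dynamics $\dot{Z} = - Z$ with $V (Z) = Z^{2}$ ; these choices are purely illustrative and are not taken from the chapter.

```python
import math

alpha, Z0, dt_k = 0.1, 1.5, 0.8   # discount factor, Z(t_{k-1}), interval length

def Z(tau):
    """Solution of the toy dynamics dZ/dtau = -Z with Z(0) = Z0."""
    return Z0 * math.exp(-tau)

def V(z):
    """Toy value function V(Z) = Z^2, so that V_Z = 2 Z."""
    return z * z

def integrand(tau):
    z = Z(tau)
    z_dot = -z
    return math.exp(-alpha * tau) * (2.0 * z * z_dot - alpha * V(z))

# Composite trapezoidal rule on [t_{k-1}, t_k], shifted to [0, dt_k]
n = 20000
h = dt_k / n
rhs = h * (0.5 * integrand(0.0)
           + sum(integrand(i * h) for i in range(1, n))
           + 0.5 * integrand(dt_k))

# Left-hand side of (14.51): discounted change of V over the interval
lhs = math.exp(-alpha * dt_k) * V(Z(dt_k)) - V(Z(0.0))

print(abs(lhs - rhs))  # small quadrature error
```

Both sides agree up to quadrature error because the integrand is exactly the time derivative of $e^{- α (τ - t_{k - 1})} V (Z (τ))$ .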

Substituting (14.50) into (14.51) yields

$\begin{matrix} e^{- α (t_{k} - t_{k - 1})} V^{j + 1} (Z (t_{k})) - V^{j + 1} (Z (t_{k - 1})) = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} ({(V_{Z}^{j + 1})}^{T} [F (Z (τ)) + G (Z (τ)) L^{j} (u) \\ + K w^{j} + G (Z (τ)) (L (u) - L^{j} (u)) + K (w - w^{j})] - α V^{j + 1} (Z (τ))) d τ . \end{matrix}$

(14.52)

On the other hand, one has

${(V_{Z}^{j + 1})}^{T} [F (Z) + G (Z) L^{j} (u) + K w^{j}] = α V^{j + 1} - r_{a} (Z (t), L^{j} (u), w^{j}),$

(14.53)

where

$r_{a} (Z (t), L^{j} (u), w^{j}) = Z^{T} Q_{1} Z + W (L^{j} (u)) - γ^{2} {(w^{j})}^{T} w^{j} .$

Substituting (14.53) into (14.52) yields

$\begin{matrix} e^{- α (t_{k} - t_{k - 1})} V^{j + 1} (Z (t_{k})) - V^{j + 1} (Z (t_{k - 1})) = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} ({(V_{Z}^{j + 1})}^{T} \\ \times [G (Z (τ)) (L (u) - L^{j} (u)) + K (w - w^{j})] - r_{a} (Z (τ), L^{j} (u), w^{j})) d τ . \end{matrix}$

(14.54)

Using (14.43)–(14.45) in (14.54) yields the following off-policy $H_{\infty}$ Bellman equation:

$\begin{matrix} e^{- α (t_{k} - t_{k - 1})} V^{j + 1} (Z (t_{k})) - V^{j + 1} (Z (t_{k - 1})) = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} (v^{j + 1} \bar{L} (\tanh^{T} (v^{j}) - \tanh^{T} (v)) \\ + 2 γ^{2} w^{j + 1} (w - w^{j}) - r_{a} (Z (τ), v^{j}, w^{j})) d τ . \end{matrix}$

(14.55)
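The step from (14.54) to (14.55) can be written out term by term. The following is a sketch under two assumptions: the updated quantities at iteration $j + 1$ take the forms (14.44) and (14.45) with $V^{⋆}$ replaced by $V^{j + 1}$ , and both the target policy and the applied policy are parametrized as in (14.43), that is, $L^{j} (u) = - \bar{L} \tanh^{T} (v^{j})$ and $L (u) = - \bar{L} \tanh^{T} (v)$ . With

$v^{j + 1} = {(V_{Z}^{j + 1})}^{T} G, \quad w^{j + 1} = \frac{1}{2} γ^{- 2} {(V_{Z}^{j + 1})}^{T} K,$

the two cross terms in (14.54) become

${(V_{Z}^{j + 1})}^{T} G (L (u) - L^{j} (u)) = v^{j + 1} \bar{L} (\tanh^{T} (v^{j}) - \tanh^{T} (v)),$

${(V_{Z}^{j + 1})}^{T} K (w - w^{j}) = 2 γ^{2} w^{j + 1} (w - w^{j}) .$

Neither right-hand side contains $F$ , $G$ or $K$ explicitly, which is why (14.55) can be evaluated from measured data alone.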

Note that if $v^{j}$ and $w^{j}$ are given, the unknown functions $V^{j + 1} (Z)$ , $v^{j + 1}$ and $w^{j + 1}$ can be approximated using (14.55). Then $L^{j + 1} (u)$ is found from $v^{j + 1}$ .

The elements of $L^{j + 1} (u)$ can be either dependent or independent. If elements in $L^{j + 1} (u)$ are independent, then the Bellman equation (14.55) can be solved iteratively using stored data to find $L^{⋆} (u)$ and the optimal control policy is $u^{⋆}$ . The following algorithm shows how to iterate on (14.55) to find the optimal control policy in this case.

Algorithm 3 gives $L^{j + 1} (u)$ and, if the condition of Lemma 2 is satisfied, then the elements of the control input are $u^{j + 1} = - L^{- 1} (\bar{L} \tanh^{T} (v^{j + 1}))$ . However, if elements in $L^{j + 1} (u)$ are dependent, then the dependency of its elements must be taken into account by encoding equality constraints while solving Eq. (14.55) for $v^{j + 1}$ .

Algorithm 3 Online off-policy RL algorithm for nonaffine system with independent elements in L(u).

To find the form of the constraints on $L (u)$ when it has dependent elements, consider the UAV system in Example 1 with

$L (u (t)) = [\begin{matrix} L_{1} \\ L_{2} \\ L_{3} \\ L_{4} \\ L_{5} \end{matrix}] = [\begin{matrix} u_{1} \\ u_{2} \\ u_{2}^{2} \\ u_{2} \cos (u_{3}) \\ u_{2} \sin (u_{3}) \end{matrix}] .$

Then, the dependency of the elements of $L (u)$ becomes

$L_{3} = {L_{2}}^{2} = {L_{4}}^{2} + {L_{5}}^{2} .$

This gives the following equality constraints:

${\bar{L}}_{3} \tanh (v_{3}) = {({\bar{L}}_{2} \tanh (v_{2}))}^{2} = {({\bar{L}}_{4} \tanh (v_{4}))}^{2} + {({\bar{L}}_{5} \tanh (v_{5}))}^{2} .$

In general, it is seen that one has a vector of equality functions

$f (L) = {[f_{1} (L), \ldots, f_{p} (L)]}^{T} = 0,$

(14.57)

with p being the number of dependent elements in $L (u)$ . For example, for the UAV system, one has $f_{1} = {\bar{L}}_{3} \tanh (v_{3}) - {({\bar{L}}_{2} \tanh (v_{2}))}^{2}$ , $f_{2} = {({\bar{L}}_{2} \tanh (v_{2}))}^{2} - {({\bar{L}}_{4} \tanh (v_{4}))}^{2} - {({\bar{L}}_{5} \tanh (v_{5}))}^{2}$ and $f_{3} = {\bar{L}}_{3} \tanh (v_{3}) - {({\bar{L}}_{4} \tanh (v_{4}))}^{2} - {({\bar{L}}_{5} \tanh (v_{5}))}^{2}$ . This constraint must be taken into account when solving (14.55) for v using NNs.
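The dependency among the elements of the UAV vector $L (u)$ can be illustrated with a short sketch. It evaluates the constraint residuals directly in terms of $L_{1}, \ldots, L_{5}$ rather than through the $\tanh$ parametrization; the function names `L_of_u` and `f_residuals` and the numerical input values are hypothetical.

```python
import math

def L_of_u(u1, u2, u3):
    """UAV control-effort vector L(u) from Example 1."""
    return [u1, u2, u2 ** 2, u2 * math.cos(u3), u2 * math.sin(u3)]

def f_residuals(L):
    """Residuals of L3 = L2^2 = L4^2 + L5^2, written as f(L) = 0."""
    L1, L2, L3, L4, L5 = L
    return [
        L3 - L2 ** 2,               # f1
        L2 ** 2 - L4 ** 2 - L5 ** 2,  # f2
        L3 - L4 ** 2 - L5 ** 2,       # f3
    ]

# Any L generated by an actual input u satisfies the constraints
# up to rounding error.
res = f_residuals(L_of_u(0.3, -1.2, 0.7))
print(res)
```

Note that $f_{3} = f_{1} + f_{2}$ , so only two of the three residuals are independent. A vector $L$ that violates these relations cannot be produced by any input $u$ , which is why the constraints must be imposed when $v$ is estimated.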

The following algorithm shows how to find the optimal control solution for the cases where $L (u)$ has dependent elements. The details of solving (14.55) for v, while considering the constraint imposed by the dependency of the elements of $L (u)$ , are presented in the next subsection.

Before proceeding, $\bar{H}$ is defined as

$\begin{matrix} \bar{H} & = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} (v^{j + 1} \bar{L} (\tanh^{T} (v^{j}) - \tanh^{T} (v)) + 2 γ^{2} w^{j + 1} (w - w^{j}) - r_{a} (Z (τ), v^{j}, w^{j})) d τ \\ & - e^{- α (t_{k} - t_{k - 1})} V^{j + 1} (Z (t_{k})) + V^{j + 1} (Z (t_{k - 1})) . \end{matrix}$

The minimum value of $\bar{H}$ in Algorithm 4, without considering the constraint (14.57), is zero. If this algorithm terminates, so that $\bar{H} = 0$ , then by Theorem 2 the $L_{2}$ gain problem is solved and there exists a $u^{⋆}$ such that $L (u^{⋆}) = L^{⋆} (u)$ .

Algorithm 4 Online off-policy RL algorithm for nonaffine system with dependent elements in L(u).

The following subsection shows how to use NNs along with linear and nonlinear LS, respectively, to implement Algorithms 3 and 4.

14.3.5 Neural Networks for Implementation of Off-Policy RL Algorithms

In this subsection, the solution of the off-policy $H_{\infty}$ Bellman equations (14.56) and (14.58) in Algorithms 3 and 4 using three NNs is presented. The unknown functions $V^{j + 1} (Z)$ , $v^{j + 1}$ and $w^{j + 1}$ can be approximated by three NNs as

${\hat{V}}^{j + 1} (Z) = \sum_{i = 1}^{N_{1}} {\hat{c}}_{i}^{j + 1} ϕ_{i} (Z) = {\hat{C}}^{j + 1} ϕ (Z),$

(14.59)

${\hat{v}}_{i}^{j + 1} = \sum_{k = 1}^{N_{2}} {\hat{p}}_{i, k}^{j + 1} σ_{i, k} (Z) = {\hat{P}}_{i}^{j + 1} σ_{i} (Z),$

(14.60)

${\hat{w}}_{i}^{j + 1} = \sum_{k = 1}^{N_{3}} {\hat{q}}_{i, k}^{j + 1} ρ_{i, k} (Z) = {\hat{Q}}_{i}^{j + 1} ρ_{i} (Z),$

(14.61)

where ${\hat{v}}^{j + 1} = [{\hat{v}}_{1}^{j + 1}, \ldots, {\hat{v}}_{l}^{j + 1}]$ , ${\hat{w}}^{j + 1} = [{\hat{w}}_{1}^{j + 1}, \ldots, {\hat{w}}_{q}^{j + 1}]$ . The terms $ϕ (Z) = {[ϕ_{1}, \ldots, ϕ_{N_{1}}]}^{T}$ , $σ_{i} (Z) = {[σ_{i, 1}, \ldots, σ_{i, N_{2}}]}^{T}$ and $ρ_{i} (Z) = {[ρ_{i, 1}, \ldots, ρ_{i, N_{3}}]}^{T}$ are basis function vectors, ${\hat{C}}^{j + 1}$ , ${\hat{P}}_{i}^{j + 1}$ and ${\hat{Q}}_{i}^{j + 1}$ are constant weight vectors and $N_{1}$ , $N_{2}$ and $N_{3}$ are the numbers of neurons. Substituting (14.59)–(14.61) into the off-policy $H_{\infty}$ Bellman equation (14.55) yields

$\begin{matrix} {\hat{C}}^{j + 1} [e^{- α (t_{k} - t_{k - 1})} ϕ (Z (t_{k})) - ϕ (Z (t_{k - 1}))] = \\ \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} (\sum_{i = 1}^{l} {\hat{P}}_{i}^{j + 1} σ_{i} (Z) {\bar{L}}_{i} (\tanh^{T} ({\hat{v}}_{i}^{j}) - \tanh^{T} (v_{i})) \\ + 2 γ^{2} \sum_{i = 1}^{q} {\hat{Q}}_{i}^{j + 1} ρ_{i} (Z) (w_{i} - w_{i}^{j}) - r_{a} (Z (τ), {\hat{v}}^{j}, {\hat{w}}^{j})) d τ . \end{matrix}$

(14.62)

By defining $\hat{P} = [\begin{matrix} {\hat{P}}_{1} & \ldots & {\hat{P}}_{l} \end{matrix}]$ and $\hat{Q} = [\begin{matrix} {\hat{Q}}_{1} & \ldots & {\hat{Q}}_{q} \end{matrix}]$ , Eq. (14.62) can be rewritten as

${\hat{W}}^{T} h (t_{k}) = y (t_{k}),$

(14.63)

where

$\begin{matrix} \hat{W} & = {[\begin{matrix} {({\hat{C}}^{j + 1})}^{T} & {({\hat{P}}^{j + 1})}^{T} & {({\hat{Q}}^{j + 1})}^{T} \end{matrix}]}^{T}, \\ h (t_{k}) & = [\begin{matrix} ϕ (Z (t_{k - 1})) - e^{- α (t_{k} - t_{k - 1})} ϕ (Z (t_{k})) \\ \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} σ_{1} (Z) {\bar{L}}_{1} (\tanh^{T} ({\hat{v}}_{1}^{j}) - \tanh^{T} (v_{1})) d τ \\ ⋮ \\ \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} σ_{l} (Z) {\bar{L}}_{l} (\tanh^{T} ({\hat{v}}_{l}^{j}) - \tanh^{T} (v_{l})) d τ \\ 2 γ^{2} \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} ρ_{1} (Z) (w_{1} - w_{1}^{j}) d τ \\ ⋮ \\ 2 γ^{2} \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} ρ_{q} (Z) (w_{q} - w_{q}^{j}) d τ \end{matrix}], \\ y (t_{k}) & = \int_{t_{k - 1}}^{t_{k}} e^{- α (τ - t_{k - 1})} r_{a} (Z (τ), {\hat{v}}^{j}, {\hat{w}}^{j}) d τ . \end{matrix}$

Case 1: Independency in Elements of $L (u)$

Eq. (14.63) can be solved using the least squares method for the parameter vector $\hat{W}$ . Then the approximated value function and disturbance input are given by (14.59) and (14.61), respectively. The control input ${\hat{u}}^{j + 1}$ is found by determining $L ({\hat{u}}^{j + 1})$ based on (14.45) from (14.60). The number of unknown parameters in $\hat{W}$ is $N_{1} + N_{2} + N_{3}$ . Hence, at least $N ⩾ N_{1} + N_{2} + N_{3}$ data samples, collected at $t_{1}$ to $t_{N}$ , are needed before solving (14.63) in the least squares sense, with

$\begin{matrix} Y & = {[\begin{matrix} y (t_{1}) & \ldots & y (t_{N}) \end{matrix}]}^{T}, \\ H & = [\begin{matrix} h (t_{1}) & \ldots & h (t_{N}) \end{matrix}] . \end{matrix}$

The least squares solution is obtained as

$\hat{W} = {(H H^{T})}^{- 1} HY .$
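A minimal numerical sketch of this batch least squares step is given below. The data are synthetic stand-ins for the regressors $h (t_{k})$ and targets $y (t_{k})$ (random numbers, not measurements from any system), and the sizes `n_params` and `N` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, N = 6, 40                 # hypothetical sizes, with N >= n_params
W_true = rng.normal(size=n_params)  # weights to be recovered

# Columns of H play the role of h(t_1), ..., h(t_N); Y stacks y(t_1..t_N).
H = rng.normal(size=(n_params, N))
Y = H.T @ W_true                    # noise-free targets: W^T h(t_k) = y(t_k)

# Normal-equation form from the text: W = (H H^T)^{-1} H Y
W_normal = np.linalg.solve(H @ H.T, H @ Y)

# Equivalent and usually better conditioned: least squares on H^T W = Y
W_lstsq, *_ = np.linalg.lstsq(H.T, Y, rcond=None)

print(np.allclose(W_normal, W_true), np.allclose(W_lstsq, W_true))
```

The inverse of $H H^{T}$ exists only if the collected regressors span the parameter space, so the data must be sufficiently rich; `np.linalg.lstsq` is used here as an assumed alternative that avoids forming $H H^{T}$ explicitly.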

Case 2: Dependency in Elements of $L (u)$

If the elements of $L (u)$ are dependent, one has to solve a constrained nonlinear least squares problem to take into account the equality constraints imposed by the dependency of the elements of $L (u)$ . To show this, consider the case of the UAV in Example 1. The following constraints are considered when finding the weights of the NNs:

$\begin{matrix} {\bar{L}}_{3} \tanh ({\hat{P}}_{3}^{j + 1} σ_{3} (Z)) = {({\bar{L}}_{2} \tanh ({\hat{P}}_{2}^{j + 1} σ_{2} (Z)))}^{2} = \\ {({\bar{L}}_{4} \tanh ({\hat{P}}_{4}^{j + 1} σ_{4} (Z)))}^{2} + {({\bar{L}}_{5} \tanh ({\hat{P}}_{5}^{j + 1} σ_{5} (Z)))}^{2} . \end{matrix}$

This constraint is nonlinear in the NN weights and thus requires a nonlinear least squares method. In general, (14.58) becomes

$\arg \min_{\hat{W}} {‖ {\hat{W}}^{T} H - Y^{T} ‖}^{2} \quad \text{s.t.} \quad f ({\hat{P}}^{j + 1}, σ_{1}, \ldots, σ_{l}) = 0,$

where the function f is defined in (14.57) and depends on how the elements of $L (u)$ and consequently NN weights are related.
