5 System Identification Under Information Divergence Criteria

The fundamental contribution of information theory is to provide a unified framework for dealing with the notion of information in a precise and technical sense. Information, in this technical sense, can be quantified in a unified manner by using the Kullback–Leibler information divergence (KLID). Two information measures, Shannon’s entropy and mutual information, are special cases of the KL divergence [43]. The use of probability in system identification can also be shown to be equivalent to measuring the KL divergence between the actual and model distributions. In parameter estimation, inference based on the KL divergence is consistent with common statistical approaches, such as maximum likelihood (ML) estimation. Based on the KL divergence, Akaike derived the well-known Akaike information criterion (AIC), which is widely used in model selection. Another important model selection criterion, the minimum description length, first proposed by Rissanen in 1978, is also closely related to the KL divergence. In the identification of stationary Gaussian processes, it has been shown that the optimal solution to an approximation problem for Gaussian random variables under the divergence criterion is identical to the main step of the subspace algorithm [123].

There are many definitions of information divergence, but in this chapter our focus is mainly on the KLID. In most cases, the extension to other definitions is straightforward.

5.1 Parameter Identifiability Under KLID Criterion

Identifiability arises in the context of system identification and indicates whether or not the unknown parameters can be uniquely identified from observations of the system. One would not select a model structure whose parameters cannot be identified, so the problem of identifiability is crucial in the system identification procedure. There are many concepts of identifiability. Typical examples include Fisher information–based identifiability [216], least squares (LS) identifiability [217], consistency-in-probability identifiability [218], transfer function–based identifiability [219], and spectral density–based identifiability [219]. In the following, we discuss the fundamental problem of system parameter identifiability under the KLID criterion.

5.1.1 Definitions and Assumptions

Let image (image) be a sequence of observations with joint probability density functions (PDFs) image, image, where image is a image-dimensional column vector, image is a image-dimensional parameter vector, and image is the parameter space. Let image be the true parameter. The KLID between image and image will be

image (5.1)

where image denotes the expectation of the bracketed quantity taken with respect to the actual parameter value image. Based on the KLID, a natural way of parameter identification is to look for a parameter image, such that the KLID of Eq. (5.1) is minimized, that is,

image (5.2)
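As a simple illustration of the minimum-KLID estimator in Eq. (5.2), the following sketch searches for the divergence-minimizing parameter in a toy scalar Gaussian family; the family, the grid search, and all names are illustrative assumptions, not taken from the text.

```python
import numpy as np

def kl_gaussian(mu0, mu1, var0=1.0, var1=1.0):
    # Closed-form KLID between N(mu0, var0) (true) and N(mu1, var1) (model)
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

theta_true = 1.3
grid = np.linspace(-5.0, 5.0, 10001)           # candidate parameter values
divergences = kl_gaussian(theta_true, grid)    # KLID as a function of the model parameter
theta_hat = grid[np.argmin(divergences)]
print(theta_hat)   # ~1.3: the KLID is uniquely minimized at the true parameter
```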

An important question that arises in the context of such an identification problem is whether or not the parameter image can be uniquely determined. This is the parameter identifiability problem. Assume image lies in image (hence image). The notion of identifiability under the KLID criterion can then be defined as follows.

Definition 5.1

The parameter set image is said to be KLID-identifiable at image, if and only if image, image, image implies image.

By this definition, if the parameter set image is KLID-identifiable at image (we also say image is KLID-identifiable), then for any image, image, we have image, and hence image. Therefore, any change in the parameter yields a change in the output density.

The identifiability can also be defined in terms of the information divergence rate.

Definition 5.2

The parameter set image is said to be KLIDR-identifiable at image, if and only if image, the KL information divergence rate (KLIDR) image exists, and image implies image.

Let image be the image-neighborhood of image, where image denotes the Euclidean norm. The local KLID (or local KLIDR)-identifiability is defined as follows.

Definition 5.3

The parameter set image is said to be locally KLID (or locally KLIDR)-identifiable at image, if and only if there exists image, such that image, image (or image) implies image.

Here, we give some assumptions that will be used later on.

Assumption 5.1

image, image, the KLID image always exists.

Remark:

Let image be the probability space of the output sequence image with parameter image, where image is the related measurable space, and image is the probability measure. image is said to be absolutely continuous with respect to image, denoted by image, if image for every image such that image. Clearly, the existence of image implies image. Thus by Assumption 5.1, image, we have image, image.

Assumption 5.2

The density function image is at least twice continuously differentiable with respect to image, and image, the following interchanges between integration (or limit) and differentiation are permissible:

image (5.3)

image (5.4)

Remark:

The interchange of differentiation and integration can be justified by the bounded convergence theorem for a suitably well-behaved PDF image. Similar assumptions can be found in Ref. [220]. A sufficient condition for the interchange of differentiation and limit is the uniform convergence of the limit in image.

5.1.2 Relations with Fisher Information

Fisher information is a classical criterion for parameter identifiability [216]. There are close relationships between KLID (KLIDR)-identifiability and Fisher information.

The Fisher information matrix (FIM) for the family of densities image is given by:

image (5.5)

As image, the Fisher information rate matrix (FIRM) is:

image (5.6)
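For reference, the FIM and FIRM for a parameterized family of densities are commonly written in the following score form; the notation $p_N(z^N;\theta)$ for the joint density of the first $N$ observations is ours and is assumed to be consistent with Eqs. (5.5) and (5.6).

```latex
M_N(\theta) = \mathbb{E}_{\theta}\!\left[\nabla_{\theta}\log p_N(z^N;\theta)\,
              \nabla_{\theta}\log p_N(z^N;\theta)^{\mathsf T}\right],
\qquad
\bar{M}(\theta) = \lim_{N\to\infty}\frac{1}{N}\,M_N(\theta).
```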

Theorem 5.1

Assume that image is an open subset of image. Then, image will be locally KLID-identifiable if the FIM image is positive definite.

Proof:

As image is an open subset, an obvious sufficient condition for image to be locally KLID-identifiable is that image, and image. This can be easily proved. By Assumption 5.2, we have

image (5.7)

On the other hand, we can derive

image (5.8)

Theorem 5.2

Assume that image is an open subset of image. Then image will be locally KLIDR-identifiable if the FIRM image is positive definite.

Proof:

By Theorem 5.1 and Assumption 5.2, we have

image (5.9)

and

image (5.10)

Thus, image is locally KLIDR-identifiable.

Suppose the observation sequence image (image) is a stationary zero-mean Gaussian process with power spectral density image. According to Theorem 2.7, the spectral expressions of the KLIDR and FIRM are as follows:

image (5.11)

image (5.12)

In this case, we can easily verify that image, and image. In fact, we have

image (5.13)

and

image (5.14)

where image.
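For the scalar case, these spectral expressions are commonly written in the following Whittle-type form; this is a standard result and is assumed to be consistent with Eqs. (5.11) and (5.12).

```latex
\bar{D}(\theta^{*}\,\|\,\theta)
 = \frac{1}{4\pi}\int_{-\pi}^{\pi}
   \left[\frac{S_{\theta^{*}}(\omega)}{S_{\theta}(\omega)}
   -\ln\frac{S_{\theta^{*}}(\omega)}{S_{\theta}(\omega)}-1\right]\mathrm{d}\omega,
\qquad
\bar{M}_{ij}(\theta)
 = \frac{1}{4\pi}\int_{-\pi}^{\pi}
   \frac{\partial\ln S_{\theta}(\omega)}{\partial\theta_{i}}\,
   \frac{\partial\ln S_{\theta}(\omega)}{\partial\theta_{j}}\,\mathrm{d}\omega .
```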

Remark:

Theorems 5.1 and 5.2 indicate that, under certain conditions, the positive definiteness of the FIM (or FIRM) provides a sufficient condition for the local KLID (or local KLIDR)-identifiability.

5.1.3 Gaussian Process Case

When the observation sequence image is jointly Gaussian distributed, the KLID-identifiability can be easily checked. Consider the following joint Gaussian PDF:

image (5.15)

where image is the mean vector, and image is the image-dimensional covariance matrix. Then we have

image (5.16)

where image is

image (5.17)

Clearly, for the Gaussian process image, we have image if and only if image and image. Denote image, image, where image is the ith row and jth column element of image. The element image is said to be a regular element if and only if image, i.e., as a function of image, image is not a constant. In a similar way, we define the regular element of the mean vector image. Let image be a column vector containing all the distinct regular elements from image and image. We call image the regular characteristic vector (RCV) of the Gaussian process image. Then we have image if and only if image. According to Definition 5.1, for the Gaussian process image, the parameter set image is KLID-identifiable at image, if and only if image, image, image implies image.
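For reference, the closed form of the KLID between two N-dimensional Gaussian densities is standard and is assumed to underlie Eq. (5.16); it vanishes exactly when the two mean vectors and covariance matrices coincide.

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_{\theta^{*}},\Sigma_{\theta^{*}})\,\middle\|\,
  \mathcal{N}(\mu_{\theta},\Sigma_{\theta})\right)
 = \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_{\theta}^{-1}\Sigma_{\theta^{*}}\right)
   +(\mu_{\theta}-\mu_{\theta^{*}})^{\mathsf T}\Sigma_{\theta}^{-1}(\mu_{\theta}-\mu_{\theta^{*}})
   -N+\ln\frac{\det\Sigma_{\theta}}{\det\Sigma_{\theta^{*}}}\right].
```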

Assume that image is an open subset of image. By Lemma 1 of Ref. [219], the map image will be locally one to one at image if the Jacobian of image has full rank image at image. Therefore, a sufficient condition for image to be locally KLID-identifiable is that

image (5.18)
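As a quick numerical sanity check of the rank condition in Eq. (5.18), one can differentiate the parameter-to-RCV map numerically and test whether its Jacobian has full column rank at the nominal parameter. The two-parameter map below is a hypothetical stand-in, not the one used in the text.

```python
import numpy as np

def rcv_map(theta):
    # Hypothetical map from the parameters to the regular characteristic vector (RCV)
    a, b = theta
    return np.array([a + b, a * b, a ** 2 - b])

def numerical_jacobian(f, theta, eps=1e-6):
    theta = np.asarray(theta, dtype=float)
    f0 = f(theta)
    J = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        d = np.zeros_like(theta)
        d[j] = eps
        J[:, j] = (f(theta + d) - f0) / eps   # forward-difference column
    return J

theta_star = np.array([1.0, 0.5])
J = numerical_jacobian(rcv_map, theta_star)
print(np.linalg.matrix_rank(J) == theta_star.size)  # True -> locally KLID-identifiable
```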

Example 5.1

Consider the following second-order state-space model (image) [120]:

image (5.19)

where image is a zero-mean white Gaussian process with unit power. Then the output sequence with image is

image (5.20)

It is easy to obtain the RCV:

image (5.21)

The Jacobian matrix can then be calculated as:

image (5.22)

Clearly, we have image for all image with image. So this parameterization is locally KLID-identifiable provided image. The identifiability can also be checked from the transfer function. The transfer function of the above system is:

image (5.23)

image, image, define image. Then image, we have image provided the following two conditions are met:

1. image, image

2. image, image, image

According to Definition 1 of Ref. [219], this system is also locally identifiable from the transfer function provided image.

The KLID-identifiability is also connected with the LS-identifiability [217]. Consider the signal-plus-noise model:

image (5.24)

where image is a parameterized deterministic signal, image is a zero-mean white Gaussian noise, image (image is an image-dimensional identity matrix), and image is the noisy observation. Then we have

image (5.25)

By Eq. (5.16), we derive

image (5.26)

where image. The above KLID is equivalent to the LS criterion of the deterministic part. In this case, the KLID-identifiability reduces to the LS-identifiability of the deterministic part.
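As a sketch of this reduction: with the means differing only through the deterministic part $s(\theta)$ and both covariances equal to $\sigma^{2}I$ (the symbol $s(\theta)$ is my notation), the general Gaussian KLID collapses to a scaled least-squares cost, which is assumed to be the content of Eq. (5.26).

```latex
D_{\mathrm{KL}}\!\left(p(\cdot\,;\theta^{*})\,\middle\|\,p(\cdot\,;\theta)\right)
 = \frac{1}{2\sigma^{2}}\,\bigl\|\,s(\theta)-s(\theta^{*})\,\bigr\|^{2}.
```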

Next, we show that for a stationary Gaussian process, the KLIDR-identifiability is identical to the identifiability from the output spectral density [219].

Let image (image) be a parameterized zero-mean stationary Gaussian process with continuous spectral density image (image-dimensional matrix). By Theorem 2.7, the KLIDR between image and image exists and is given by:

image (5.27)

Theorem 5.3

image, with equality if and only if image, image.

Proof:

image, the spectral density matrices image and image are positive definite. Let image and image be two normally distributed image-dimensional vectors, image and image, respectively. Then we have

image (5.28)

Combining Eqs. (5.28) and (5.27) yields

image (5.29)

It follows easily that image, with equality if and only if image for almost every image (hence image, image).

By Theorem 5.3, we may conclude that for a stationary Gaussian process, image is KLIDR-identifiable if and only if image, image implies image. This is exactly the identifiability from the output spectral density.

5.1.4 Markov Process Case

Now we focus on situations where the observation sequence is a parameterized Markov process. First, let us define the minimum identifiable horizon (MIH).

Definition 5.4

Assume that image is KLID-identifiable. Then the MIH is [120]:

image (5.30)

where image.

The MIH is the minimum length of the observation sequence from which image can be uniquely identified. If the MIH is known, we could identify image with the least amount of observation data. In general, it is difficult to obtain the exact value of the MIH. In some special situations, however, one can derive an upper bound on the MIH. For a parameterized Markov process, this upper bound is straightforward. In the theorem below, we show that for a image-order strictly stationary Markov process, the number image provides an upper bound on the MIH.

Theorem 5.4

If the observation sequence image (image) is a image-order strictly stationary Markov process (image), and the parameter set image is KLID-identifiable at image, then we have image.

Proof:

As parameter set image is KLID-identifiable at image, by Definition 5.1, there exists a number image, such that image, image. Let us consider two cases, one for which image and the other for which image.

1. image: The zero-order strictly stationary Markov process refers to an independent and identically distributed sequence. In this case, we have image, and

image (5.31)


And hence, image, we have image. It follows that image, and image.

2. image: If image, then image. If image, then

image (5.32)


By the Markov and stationarity properties, one can derive

image (5.33)


It follows that

image (5.34)

where image is the conditional KLID. And hence,

image (5.35)


Then image, we have image. Thus, image, and it follows that image.

Example 5.2

Consider the first-order AR model (image) [120]:

image (5.36)

where image is a zero-mean white Gaussian noise with unit power. Assume that the system has reached steady state when the observations begin. The observation sequence image will be a first-order stationary Gaussian Markov process, with covariance matrix:

image (5.37)

image (image, image), we have

image (5.38)

where image are the RCVs. And hence, image.
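The following sketch illustrates Theorem 5.4 on a first-order AR model of the form x_k = a x_{k-1} + w_k with unit-power white Gaussian noise and |a| < 1; this parameterization and the numbers are assumptions, since the equations of Example 5.2 are not reproduced here. Under these assumptions the stationary covariance is Cov(x_i, x_j) = a^{|i-j|}/(1 - a^2), so a block of n = m + 1 = 2 consecutive samples already separates different values of a.

```python
import numpy as np

def ar1_cov(a, n):
    # Stationary covariance matrix of the assumed AR(1) model
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return a ** lags / (1.0 - a ** 2)

def kl_zero_mean_gauss(S0, S1):
    # KLID between N(0, S0) (true) and N(0, S1) (model)
    n = S0.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) - n
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

a_true, a_other = 0.6, 0.3
D = kl_zero_mean_gauss(ar1_cov(a_true, 2), ar1_cov(a_other, 2))
print(D > 0)   # True: two consecutive samples already distinguish the two parameters
```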

The following corollary is a direct consequence of Theorem 5.4.

Corollary 5.1

For a image-order strictly stationary Markov process image, the parameter set image is KLID-identifiable at image if and only if image, image implies image.

From the theory of stochastic processes, for a image-order strictly stationary Markov process image, under certain conditions (see Ref. [221] for details), the conditional density image will uniquely determine the joint density image. In this case, the KLID-identifiability and the KLIDR-identifiability are equivalent.

Theorem 5.5

Assume that the observation sequence image is a image-order strictly stationary Markov process (image), whose conditional density image uniquely determines the joint density image. Then, image, image is KLID-identifiable if and only if it is KLIDR-identifiable.

Proof:

We only need to prove image.

1. When image, we have image, and hence image.

2. When image, we have image, and it follows that

image (5.39)


Since image uniquely determines image, we can derive

image (5.40)


This completes the proof.

5.1.5 Asymptotic KLID-Identifiability

In the previous discussions, we assume that the true density image is known. In most practical situations, however, the actual density, and hence the KLID, needs to be estimated using random data drawn from the underlying density. Let image be an independent and identically distributed (i.i.d.) sample drawn from image. The density estimator for image will be a mapping image [98]:

image (5.41)

The asymptotic KLID-identifiability is then defined as follows:

Definition 5.5

The parameter set image is said to be asymptotic KLID-identifiable at image, if there exists a sequence of density estimates image, such that image (convergence in probability), where image is the minimum KLID estimator, image.

Theorem 5.6

Assume that the parameter space image is a compact subset, and the density estimate sequence image satisfies image. Then, image will be asymptotic KLID-identifiable provided it is KLID-identifiable.

Proof:

Since image, for image and image arbitrarily small, there exists an image such that for image,

image (5.42)

where image is the probability of Borel set image. On the other hand, as image, we have

image (5.43)

Then the event image, and hence

image (5.44)

By Pinsker’s inequality, we have

image (5.45)

where image is the image-distance (or the total variation). It follows that

image (5.46)

In addition, the following inequality holds:

image (5.47)

Then we have

image (5.48)

And hence

image (5.49)

For any image, we define the set image, where image is the Euclidean norm. As image is a compact subset in image, image must be a compact set too. Meanwhile, by Assumption 5.2, the function image (image) will be a continuous mapping image. Thus, a minimum of image over the set image must exist. Denoting image, it follows easily that

image (5.50)

If image is KLID-identifiable, image, image, we have image, or equivalently, image. It follows that image. Let image, we have

image (5.51)

This implies image, and hence image.

According to Theorem 5.6, if the density estimate image is consistent in KLID in probability (image), the KLID-identifiability will be a sufficient condition for the asymptotic KLID-identifiability. The next theorem shows that, under certain conditions, the KLID-identifiability will also be a necessary condition for image to be asymptotic KLID-identifiable.

Theorem 5.7

If image is asymptotic KLID-identifiable, then it is KLID-identifiable provided

1. image is a compact subset in image,

2. image, if image, then there exist image and an infinite set image such that for image,

image (5.52)

where image, image.

Proof:

If image is asymptotic KLID-identifiable, then for image and image arbitrarily small, there exists an image such that for image,

image (5.53)

Suppose image is not KLID-identifiable; then image, image, such that image. Letting image, we have (as image is a compact subset, the minimum exists)

image (5.54)

where (a) follows from image. The above result contradicts condition (2). Therefore, image must be KLID-identifiable.

In the following, we consider several specific density estimation methods and discuss the consistency problems of the related parameter estimators.

5.1.5.1 Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a popular parameter estimation method and is also an important parametric approach to density estimation. With MLE, the density estimator is

image (5.55)

where image is obtained by maximizing the likelihood function, that is,

image (5.56)

Lemma 5.1

The MLE density estimate sequence image satisfies image.

A simple proof of this lemma can be found in Ref. [222]. Combining Theorem 5.6 and Lemma 5.1, we have the following corollary.

Corollary 5.2

Assume that image is a compact subset in image, and image is KLID-identifiable. Then we have image.

According to Corollary 5.2, the KLID-identifiability is a sufficient condition to guarantee that the ML estimator converges to the true value in probability. This is not surprising, since the ML estimator is in essence a special case of the minimum KLID estimator.

5.1.5.2 Histogram-Based Estimation

Histogram-based estimation is a common nonparametric method for density estimation. Suppose the i.i.d. samples image take values in a measurable space image. Let image, image, be a sequence of partitions of image, with image either finite or infinite, such that the image-measure image for each image. Then the standard histogram density estimator with respect to image and image is given by:

image (5.57)

where image is the standard empirical measure of image, i.e.,

image (5.58)

where image is the indicator function.
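A minimal sketch of the standard histogram density estimator follows: the estimate on each cell of a fixed partition is the empirical frequency of the cell divided by its Lebesgue measure. The partition and the sample below are illustrative.

```python
import numpy as np

def histogram_density(samples, edges):
    counts, _ = np.histogram(samples, bins=edges)
    widths = np.diff(edges)
    return counts / (len(samples) * widths)   # piecewise-constant density values

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
edges = np.linspace(-4.0, 4.0, 33)            # 32 equal-width cells
p_hat = histogram_density(x, edges)
print(np.sum(p_hat * np.diff(edges)))          # ~1.0 (probability mass inside the partition)
```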

According to Ref. [223], under certain conditions, the density estimator image will converge in reversed-order information divergence to the true underlying density image, and the expected KLID satisfies

image (5.59)

Since image, by Markov’s inequality [224], for any image, we have

image (5.60)

It follows that image, image, and for any image and image arbitrarily small, there exists an image such that for image,

image (5.61)

Thus we have image. By Theorem 5.6, the following corollary holds.

Corollary 5.3

Assume that image is a compact subset in image, and image is KLID-identifiable. Let image be the standard histogram density estimator satisfying Eq. (5.59). Then we have image, where image.

5.1.5.3 Kernel-Based Estimation

Kernel-based estimation (or kernel density estimation, KDE) is another important nonparametric approach to density estimation. Given an i.i.d. sample image, the kernel density estimator is

image (5.62)

where image is a kernel function satisfying image and image, image is the kernel width.
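A minimal sketch of the kernel density estimator of Eq. (5.62) with a Gaussian kernel is given below; the bandwidth and sample are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, samples, h):
    # p_hat(x) = (1 / (N h)) * sum_i K((x - x_i) / h)
    u = (np.asarray(x_eval)[:, None] - np.asarray(samples)[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(1)
samples = rng.normal(size=500)
grid = np.linspace(-4.0, 4.0, 201)
p_hat = kde(grid, samples, h=0.3)
print(np.trapz(p_hat, grid))   # ~1.0
```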

For the KDE, the following lemma holds (see chapter 9 in Ref. [98] for details).

Lemma 5.2

Assume that image is a fixed kernel, and the kernel width image depends on image only. If image and image as image, then image.

From image, one cannot derive image, and hence Theorem 5.6 cannot be applied here. However, if the parameter is estimated by minimizing the total variation (rather than the KLID), the following theorem holds.

Theorem 5.8

Assume that image is a compact subset in image, image is KLID-identifiable, and the kernel width image satisfies the conditions in Lemma 5.2. Then we have image, where image.

Proof:

As image, by Markov’s inequality, we have image. Following a derivation similar to that of Theorem 5.6, one easily reaches the conclusion.

The KLID and the total variation are both special cases of the family of image-divergence [130]. The image-divergence between the PDFs image and image is

image (5.63)

where image is a class of convex functions. The minimum image-divergence estimator is given by [130]:

image (5.64)
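For reference, the standard form of the φ-divergence family is as follows (assumed to be consistent with Eq. (5.63)); the KLID corresponds to φ(u) = u ln u, and the L1 distance to φ(u) = |u − 1|.

```latex
D_{\phi}(p\,\|\,q) = \int q(x)\,\phi\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x,
\qquad \phi\ \text{convex},\ \ \phi(1)=0 .
```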

Below we give a more general result, which includes Theorems 5.6 and 5.8 as special cases.

Theorem 5.9

Assume that image is a compact subset in image, image is KLID-identifiable, and for a given image, image, image, where function image is strictly increasing over the interval image, and image. Then, if the density estimate sequence image satisfies image, we have image, where image is the minimum image-divergence estimator.

Proof:

The proof is similar to that of Theorem 5.6 and is omitted.

5.2 Minimum Information Divergence Identification with Reference PDF

Information divergences have been suggested by many authors as solutions to related problems in system identification. The ML criterion and its extensions (e.g., AIC) can be derived from the KL divergence approach. The information divergence approach is a natural generalization of the LS view: one can think of a “distance” between the actual (empirical) and model distributions of the data, without necessarily introducing the conceptually more demanding concepts of likelihood or posterior. In the following, we introduce a novel system identification approach based on the minimum information divergence criterion.

In contrast to conventional methods, the new approach adopts the idea of PDF shaping and uses the divergence between the actual error PDF and a reference (or target) PDF (usually with zero mean and a narrow range) as the identification criterion. As illustrated in Figure 5.1, in this scheme the model parameters are adjusted such that the error distribution tends toward the reference distribution. With the KLID, the optimal parameters (or weights) of the model can be expressed as:

image (5.65)

where image and image denote, respectively, the actual error PDF and the reference PDF. Other information divergence measures such as image-divergence can also be used but are not considered here.

image

Figure 5.1 Scheme of system identification with a reference PDF.

The above method shapes the error distribution and can be used to achieve a desired variance or entropy of the error, provided the desired error PDF is achievable. This is expected to be useful in complex signal processing and learning systems. If we choose the image function as the reference PDF, the identification error will be forced to concentrate around zero with a sharp peak, which coincides with the commonsense goal of system identification.

It is worth noting that PDF shaping approaches can be found in other contexts. In the control literature, Karny et al. [225,226] proposed an alternative formulation of the stochastic control design problem: the joint distributions of closed-loop variables should be forced to be as close as possible to their desired distributions. This formulation is called fully probabilistic control. Wang et al. [227–229] designed new algorithms to control the shape of the output PDF of a stochastic dynamic system. In the adaptive signal processing literature, Sala-Alvarez et al. [230] proposed a general criterion for the design of adaptive systems in digital communications, called the statistical reference criterion, which imposes a given PDF at the output of an adaptive system.

It is important to remark that the minimum value of the KLID in Eq. (5.65) may not be zero. In fact, all the possible PDFs of the error are, in general, restricted to a certain set of functions image. If the reference PDF is not contained in the possible PDF set, i.e., image, we have

image (5.66)

In this case, the optimal error PDF image, and the reference distribution can never be realized. This is however not a problem of great concern, since our goal is just to make the error distribution closer (not necessarily identical) to the reference distribution.

In some special situations, this new identification method is equivalent to ML identification. Suppose that in Figure 5.1 the noise image is independent of the input image, and the unknown system can be exactly identified, i.e., the intrinsic error (image) between the unknown system and the model can be zero. In addition, we assume that the noise PDF image is known. In this case, setting image, we have

image (5.67)

where (a) comes from the fact that the weight vector minimizing the KLID (when image) also minimizes the error entropy, and image is the likelihood function.

5.2.1 Some Properties

We present in the following some important properties of the minimum KLID criterion with reference PDF (called the KLID criterion for short).

The KLID criterion is quite different from the minimum error entropy (MEE) criterion. The MEE criterion does not consider the mean of the error, owing to its invariance to translation. Under the MEE criterion, the estimator makes the error PDF as sharp as possible and neglects the PDF’s location. Under the KLID criterion, however, the estimator makes the actual error PDF and the reference PDF as close as possible (in both shape and location).

The KLID criterion is sensitive to the error mean. This can be easily verified: if image and image are both Gaussian PDFs with zero mean and unit variance, we have image; whereas if the error mean becomes nonzero, image, we have image. The following theorem suggests that, under certain conditions, the mean value of the optimal error PDF under the KLID criterion is equal to the mean value of the reference PDF.

Theorem 5.10

Assume that image and image satisfy:

1. image, image (image);

2. image, image is an even function, where image is the mean value of image;

3. image is an even and strictly log-concave function, where image is the mean value of image.

Then, the mean value of the optimal error PDF (image) is image.

Proof:

We argue by contradiction. Suppose image, and let image. Denote image and image. According to the assumptions, image, and image is an even function. Then we have

image (5.68)

where (a), (d), and (e) follow from the shift-invariance of the KLID, (b) is because image is strictly log-concave, and (c) is because image and image are even functions. Therefore, image, such that

image (5.69)

This contradicts image, and hence image holds.

On the other hand, the KLID criterion is also closely related to the MEE criterion. The next theorem provides an upper bound on the error entropy under the constraint that the KLID is bounded.

Theorem 5.11

Let the reference PDF image be a zero-mean Gaussian PDF with variance image. If the error PDF image satisfies

image (5.70)

where image is a positive constant, then the error entropy image satisfies

image (5.71)

where image is the solution of the following equation:

image (5.72)

Proof:

Let image denote the collection of all error PDFs that satisfy image. Clearly, this is a convex set. image, we have

image (5.73)

In order to find the error distribution that achieves the maximum entropy, we form the Lagrangian:

image (5.74)

where image and image are the Lagrange multipliers. When image, image is a concave function of image. If image is a function such that image for image sufficiently small, the Gateaux derivative of image with respect to image is given by:

image (5.75)

If it is zero for all image, we have

image (5.76)

Thus, if image (such that image is a concave function of image), the error PDF that achieves the maximum entropy exists and is given by

image (5.77)

According to the assumptions, image. It follows that

image (5.78)

where image. Obviously, image is a Gaussian density, and we have

image (5.79)

So image can be determined as

image (5.80)

In order to determine the value of image, we use the Kuhn–Tucker condition:

image (5.81)

When image, we have image, that is,

image (5.82)

Therefore, image is the solution of Eq. (5.72).

Define the function image. It is easy to verify that image is continuous and monotonically decreasing over the interval image. Since image, image, and image, the equation image certainly has a solution in image.

From the previous derivations, one may easily obtain:

image (5.83)

The above theorem indicates that, under the KLID constraint image, the error entropy is upper bounded by the reference entropy plus a certain constant. In particular, when image, we have image and image. Therefore, if one chooses a reference PDF with small entropy, the error entropy will also be confined to small values. In practice, the reference PDF is generally chosen as a PDF with zero mean and small entropy (e.g., the image distribution at zero).

In most practical situations, the error PDF is unknown and needs to be estimated from samples. There is always a bias in density estimation; to offset its influence, one can use the same method to estimate the reference density from samples drawn from the reference PDF. Let image and image be, respectively, the actual and reference error samples. The KDEs of image and image are then

image (5.84)

where image and image are the corresponding kernel widths. Using the estimated PDFs, one may obtain the empirical KLID criterion image.

Theorem 5.12

The empirical KLID image, with equality if and only if image.

Proof:

image, we have image, with equality if and only if image. And hence

image (5.85)

with equality if and only if image.

Lemma 5.3

If two PDFs image and image are bounded, then [231]:

image (5.86)

where image.

Theorem 5.13

If image is a Gaussian kernel function, image, then

image (5.87)

Proof:

Since the Gaussian kernel is bounded and image, the kernel-based density estimates image and image will also be bounded, and

image (5.88)

By Lemma 5.3, we have

image (5.89)

where

image (5.90)

Then we obtain Eq. (5.87).

The above theorem suggests that convergence in KLID ensures convergence in image distance (image).

Before giving Theorem 5.14, we introduce some notation. By rearranging the samples in image and image, one obtains the increasing sequences:

image (5.91)

where image, image. Denote image, and image.

Theorem 5.14

If image is a Gaussian kernel function, image, then image if and only if image.

Proof:

By Theorem 5.12, it suffices to prove that image if and only if image.

Sufficiency: If image, we have

image (5.92)

Necessity: If image, then we have

image (5.93)

Let image, where image. Then,

image (5.94)

where image, image, image. Since image, image, we have image. It follows that

image (5.95)

As image is a symmetric and positive definite matrix (image), we get image, that is

image (5.96)

Thus, image, and

image (5.97)

This completes the proof.

Theorem 5.14 indicates that, under certain conditions, the empirical KLID is zero only when the actual and reference sample sets are identical.

Based on the sample sets image and image, one can calculate the empirical distribution:

image (5.98)

where image, and image. According to the limit theorem in probability theory [224], we have

image (5.99)

If image, and image is large enough, we have image, and hence image. Therefore, when the empirical KLID approaches zero, the actual error PDF will be approximately identical to the reference PDF.

5.2.2 Identification Algorithm

In the following, we derive a stochastic gradient–based identification algorithm under the minimum KLID criterion with a reference PDF. Since the KLID is not symmetric, we use the symmetric version of KLID (also referred to as the J-information divergence):

image (5.100)

By dropping the expectation operators image and image and plugging in the estimated PDFs, one may obtain the estimated instantaneous value of the J-information divergence:

image (5.101)

where image and image are

image (5.102)

Then a stochastic gradient–based algorithm can be readily derived as follows:

image (5.103)

where

image (5.104)

This algorithm is called the stochastic information divergence gradient (SIDG) algorithm [125,126].
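The following is a rough sketch of the idea behind the SIDG algorithm: the model weights are adapted so that the kernel-estimated J-divergence between a window of actual errors and a set of reference error samples decreases. The FIR model, the window length, the finite-difference gradient, and all names are illustrative assumptions; the exact analytic update of Eqs. (5.103) and (5.104) is not reproduced here.

```python
import numpy as np

def kde(x_eval, samples, h):
    # Gaussian kernel density estimate evaluated on a grid
    u = (x_eval[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def j_divergence(err, ref, h=0.2):
    # Grid approximation of the symmetric (J) divergence between two KDEs
    grid = np.linspace(-3.0, 3.0, 241)
    p = kde(grid, err, h) + 1e-12
    q = kde(grid, ref, h) + 1e-12
    return np.sum((p - q) * np.log(p / q)) * (grid[1] - grid[0])

rng = np.random.default_rng(0)
w_true = np.array([1.0, 0.5])                  # unknown FIR weights (toy example)
w = np.zeros(2)                                # model weights to be adapted
ref = rng.normal(scale=0.1, size=50)           # reference error samples (near zero)
eta, eps, L = 0.01, 1e-4, 30                   # step size, FD step, window length

for _ in range(1000):
    X = rng.normal(size=(L, 2))                # window of input regressors
    d = X @ w_true + 0.1 * rng.normal(size=L)  # noisy desired output

    def cost(wv):
        return j_divergence(d - X @ wv, ref)

    grad = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
                     for e in np.eye(2)])      # finite-difference gradient
    w -= eta * grad                            # stochastic gradient step
print(w)                                       # drifts toward w_true = [1.0, 0.5]
```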

In order to achieve an error distribution with zero mean and small entropy, one can choose the image function at zero as the reference PDF. It is, however, worth noting that the image function is not always the best choice. In many situations, the desired error distribution may be far from the image distribution. In practice, the desired error distribution can be estimated from some prior knowledge or preliminary identification results.

Remark:

Strictly speaking, if one selects the image function as the reference distribution, the information divergence will be undefined (ill-posed). In practical applications, however, we often use the estimated information divergence as an alternative cost function, where the actual and reference error distributions are both estimated by the KDE approach (usually with the same kernel width). It is easy to verify that, for the image distribution, the estimated PDF is actually the kernel function. In this case, the estimated divergence will always be valid.

5.2.3 Simulation Examples

Example 5.3

Consider the FIR system identification problem [126]:

image (5.105)

where the true weight vector image. The input signal image is assumed to be a zero-mean white Gaussian process with unit power.

We show that the optimal solution under the information divergence criterion may not be unique. Suppose the reference PDF image is a Gaussian PDF with zero mean and variance image. The J-information divergence between image and image can be calculated as:

image (5.106)

Clearly, there are infinitely many weight pairs image that satisfy image. In fact, any weight pair image that lies on the circle image will be an optimal solution. In this case, the system parameters are not identifiable. However, when image, the circle shrinks to a point and all solutions converge to the unique solution image. For the case image, the 3D surface of the J-information divergence is depicted in Figure 5.2. Figure 5.3 shows the convergence trajectories of the weight pair image learned by the SIDG algorithm, starting from the initial point image. As expected, these trajectories converge to the circles centered at image. When image, the weight pair image converges to (1.0047, 0.4888), which is very close to the true weight vector.

image

Figure 5.2 3D surface of J-information divergence. Source: Adapted from Ref. [126].

image

Figure 5.3 Convergence trajectories of weight pair (image,image). Source: Adapted from Ref. [126].

Example 5.4

Identification of the hybrid system (switched system) [125]:

image (5.107)

where image is the state variable, image is the input, image is the process noise, and image is the measurement noise. This system can be written in a parameterized form (with image merged into image) [125]:

image (5.108)

where image is the mode index, image, image, image, and image. In this example, image and image. Based on the parameterized form (Eq. (5.108)), one can establish the noisy hybrid decoupling polynomial (NHDP) [125]. By expanding the NHDP and ignoring the higher-order noise components, we obtain the first-order approximation (FOA) model. In Ref. [125], the SIDG algorithm (based on the FOA model) was applied to identify the above hybrid system. Figure 5.4 shows the identification performance for different measurement noise powers image. For comparison purposes, we also show the performance of the least mean square (LMS) algorithm and the algebraic geometric approach [232]. In Figure 5.4, the identification performance image is defined as:

image (5.109)

image

Figure 5.4 Identification performance for different measurement noise powers. Source: Adapted from Ref. [125].

In the simulation, the image function is selected as the reference PDF for the SIDG algorithm. The simulation results indicate that the SIDG algorithm achieves better performance.

To further verify the performance of the SIDG algorithm, we consider the case in which image and image is uniformly distributed over the range image. The reference samples are set to image according to some preliminary identification results. Figure 5.5 shows the scatter graphs of the estimated parameter vector image (with 300 simulation runs), where (A) and (B) correspond, respectively, to the LMS and SIDG algorithms. In each graph, there are two clusters. Evidently, the clusters generated by SIDG are more compact than those generated by LMS, and their centers are closer to the true values (the true values are image and image). The corresponding error PDFs are illustrated in Figure 5.6. As one can see, the error distribution produced by SIDG is closer to the desired error distribution.

image

Figure 5.5 Scatter graphs of the estimated parameter vector image: (A) LMS and (B) SIDG.

image

Figure 5.6 Comparison of the error PDFs. Source: Adapted from Ref. [125].

5.2.4 Adaptive Infinite Impulse Response Filter with Euclidean Distance Criterion

In Ref. [233], the Euclidean distance criterion (EDC), which can be regarded as a special case of the information divergence criterion with a reference PDF, was successfully applied to develop global optimization algorithms for adaptive infinite impulse response (IIR) filters. In the following, we give a brief introduction to this approach.

The EDC for adaptive IIR filters is defined as the Euclidean distance (or L2 distance) between the error PDF and the image function [233]:

image (5.110)

The above formula can be expanded as:

image (5.111)

where image stands for the parts of this Euclidean distance measure that do not depend on the error distribution. By dropping image, the EDC can be simplified to

image (5.112)

where image is the quadratic information potential of the error.

By substituting the kernel density estimator (usually with Gaussian kernel image) for the error PDF in the integral, one may obtain the empirical EDC:

image (5.113)
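Assuming a Gaussian kernel, a minimal sketch of an empirical EDC of this kind is given below; it uses the fact that the integral of a product of two Gaussian kernels is a Gaussian kernel of width sqrt(2) times the original, and is assumed to take the form "quadratic information potential minus twice the estimated density at zero", consistent with the expansion described above. The test errors and names are illustrative.

```python
import numpy as np

def gauss(x, sigma):
    return np.exp(-x ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def empirical_edc(errors, sigma):
    e = np.asarray(errors)
    quad_ip = gauss(e[:, None] - e[None, :], np.sqrt(2.0) * sigma).mean()  # estimate of int p^2
    p_hat_at_zero = gauss(e, sigma).mean()                                  # KDE of p_e at zero
    return quad_ip - 2.0 * p_hat_at_zero

rng = np.random.default_rng(3)
print(empirical_edc(0.05 * rng.normal(size=200), sigma=0.1))   # small errors -> lower EDC
print(empirical_edc(1.00 * rng.normal(size=200), sigma=0.1))   # large errors -> higher EDC
```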

A gradient-based identification algorithm can then be derived as follows:

image (5.114)

where the gradient image depends on the model structure. Below we derive this gradient for the IIR filters.

Let us consider the following IIR filter:

image (5.115)

which can be written in the form

image (5.116)

where image, image. Then we can derive

image (5.117)

In Eq. (5.117), the parameter gradient is calculated in a recursive manner.

Example 5.5

Consider the identification of the following unknown system [233]:

image (5.118)

The adaptive model is chosen to be the reduced-order IIR filter

image (5.119)

The main goal is to determine the values of the coefficients (or weights) image, such that the EDC is minimized. Assume that the error is Gaussian distributed, image. Then, the empirical EDC can be approximately calculated as [233]:

image (5.120)

where image is the kernel width. Figure 5.7 shows the contours of the EDC performance surface for different image (the input signal is assumed to be white Gaussian noise with zero mean and unit variance). As one can see, the local minima of the performance surface disappear for large kernel widths. Thus, by carefully controlling the kernel width, the algorithm can converge to the global minimum. The convergence trajectory of the adaptation process, with the weights approaching the global minimum, is shown in Figure 5.8.

image

Figure 5.7 Contours of the EDC performance surface: (A) image; (B) image; (C) image; and (D) image. Source: Adapted from Ref. [233].

image

Figure 5.8 Weight convergence trajectory under EDC. Source: Adapted from Ref. [233].
