6 System Identification Based on Mutual Information Criteria

As a central concept in communication theory, mutual information measures the amount of information that one random variable contains about another. The larger the mutual information between two random variables, the more information they share, and the more accurately one can be estimated from the other. Typically, there are two mutual information-based identification criteria: the minimum mutual information (MinMI) and the maximum mutual information (MaxMI) criteria. The MinMI criterion minimizes the mutual information between the identification error and the input signal so that the error signal contains as little information as possible about the input,1 while the MaxMI criterion maximizes the mutual information between the system output and the model output so that the model output contains as much information as possible about the system output. Although the MinMI criterion is essentially equivalent to the minimum error entropy (MEE) criterion, their physical meanings are different. The MaxMI criterion is somewhat similar to the Infomax principle, an optimization principle for neural networks and other information processing systems; they are, however, conceptually different. The Infomax principle states that a function mapping a set of input values I to a set of output values O should be chosen (or learned) so as to maximize the mutual information between I and O, subject to a set of specified constraints and/or noise processes. In the following, we first discuss the MinMI criterion.

6.1 System Identification Under the MinMI Criterion

The basic idea behind the MinMI criterion is that the model parameters should be determined such that the identification error contains as little information as possible about the input signal. The scheme of this identification method is shown in Figure 6.1. The objective function is the mutual information between the error and the input, and the optimal parameter vector is obtained as

image (6.1)

where image is the identification error at time image (the difference between the measurement image and the model output image), image is a vector consisting of all the inputs that influence the model output image (possibly an infinite-dimensional vector), and image is the set of all possible image-dimensional parameter vectors.

image

Figure 6.1 System identification under the MinMI criterion.

For a general causal system, image will be

image (6.2)

If the model output depends on only a finite number of inputs (e.g., for a finite impulse response (FIR) filter), then

image (6.3)

Assuming the initial state of the model is known, the output image will be a function of image, i.e., image. In this case, the MinMI criterion is equivalent to the MEE criterion. In fact, we can derive

image (6.4)

where (a) holds because the conditional entropy image does not depend on the parameter vector image. In Chapter 3, we proved a similar property when discussing MEE Bayesian estimation: minimizing the estimation error entropy is equivalent to minimizing the mutual information between the error and the observation.

Although the two criteria are equivalent, the MinMI and MEE criteria differ considerably in meaning: the former aims to decrease statistical dependence, while the latter tries to reduce uncertainty (scatter or dispersion).

6.1.1 Properties of MinMI Criterion

Let the model be an FIR filter. In the following, we discuss the optimal solution under the MinMI criterion and investigate its connection to the mean square error (MSE) criterion [234].

Theorem 6.1

For the system identification scheme of Figure 6.1, if the model is an FIR filter (image), image and image are zero-mean and jointly Gaussian, and the input covariance matrix image satisfies image, then we have image and image, where image, and image denotes the optimal weight vector under the MSE criterion.

Proof:

According to mean square estimation theory [235], image; thus we only need to prove image. As image, we have

image (6.5)

where image. Then we can derive the following gradient:

image (6.6)

where (a) follows from the fact that the conditional entropy image does not depend on the weight vector image, and (b) holds because image is zero-mean Gaussian. Setting this gradient to zero, we obtain image. Next we prove image. By (2.28), we have

image (6.7)

where (c) follows from image.

Theorem 6.1 indicates that under the Gaussian assumption, the optimal FIR filter under the MinMI criterion is equivalent to that under the MSE criterion (i.e., the Wiener solution), and the minimum mutual information between the error and the input is zero.
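As a quick numerical check of this equivalence, the following sketch (with assumed dimensions, signal statistics, and variable names; not code from [234]) computes the Wiener solution for a simulated Gaussian FIR system and evaluates the Gaussian mutual information between the resulting error and the input, which should be close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 4-tap FIR system with additive Gaussian noise.
m, N = 4, 50_000
w_true = np.array([0.6, -0.3, 0.2, 0.1])
X = rng.normal(size=(N, m))                  # zero-mean Gaussian input vectors
d = X @ w_true + 0.1 * rng.normal(size=N)    # noisy measurements

# Wiener (MSE-optimal) solution: w = R^{-1} p
R = X.T @ X / N
p = X.T @ d / N
w_mse = np.linalg.solve(R, p)

# Error of the Wiener filter; for jointly Gaussian (e, X),
# I(e; X) = 0.5 * log( var(e) * det(C_X) / det(C_joint) ).
e = d - X @ w_mse
C_joint = np.cov(np.column_stack([e, X]), rowvar=False)
C_X = C_joint[1:, 1:]
I_gauss = 0.5 * np.log(C_joint[0, 0] * np.linalg.det(C_X) / np.linalg.det(C_joint))

print("w_mse:", np.round(w_mse, 3))       # close to w_true
print("Gaussian MI I(e;X):", I_gauss)     # close to zero, as Theorem 6.1 predicts
```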

Theorem 6.2

If the unknown system and the model are both FIR filters of the same order, and the noise signal image is independent of the input sequence image (both may have arbitrary distributions), then we have image, where image denotes the weight vector of the unknown system.

Proof:

Without the Gaussian assumption, Theorem 6.1 cannot be applied here. Let image be the weight error vector between the unknown system and the model. We have image and

image (6.8)

where (a) is due to the fact that the entropy of the sum of two independent random variables is not less than the entropy of each individual variable. The equality in (a) holds if and only if image, i.e., image.

Theorem 6.2 suggests that the MinMI criterion may be robust with respect to independent additive noise, regardless of its distribution.

Theorem 6.3

Under the conditions of Theorem 6.2, if the input image and the noise image are both unit-power white Gaussian processes, then

image (6.9)

where image, image, image, image.

Proof:

Obviously, we have image, and

image (6.10)

By the mean square estimation theory [235],

image (6.11)

where (a) follows from image, and image is an image identity matrix. Therefore,

image (6.12)

On the other hand, by (2.28), the mutual information image can be calculated as

image (6.13)

Combining (6.12) and (6.13) yields the result.

The term image in Theorem 6.3 is actually the minimum MSE when estimating image based on image. Figure 6.2 shows the mutual information image and the minimum MSE image versus the weight error norm image. It can be seen that as image (or image), we have image and image. This implies that when the model weight vector image approaches the system weight vector image, the error image contains less and less information about the input vector image (i.e., the information contained in the input signal has been sufficiently utilized), and it becomes more and more difficult to estimate the input based on the error signal (i.e., the minimum MSE image gradually attains its maximum value).

image

Figure 6.2 Mutual information image and the minimum MSE image versus weight error norm. Source: Adopted from [234].

6.1.2 Relationship with Independent Component Analysis

The parameter identification under the MinMI criterion is actually a special case of independent component analysis (ICA) [133]. A brief scheme of the ICA problem is shown in Figure 6.3, where image is the image-dimensional source vector and image is the image-dimensional observation vector, related to the source vector through image, where image is the image mixing matrix [236]. Assume that the components of the source signal image are mutually independent, and that there is no other prior knowledge about image or the mixing matrix image. The aim of ICA is to find an image matrix image (the demixing matrix) such that image approaches image as closely as possible, up to scaling and permutation ambiguities.

image

Figure 6.3 General configuration of the ICA.

ICA can be formulated as an optimization problem. To make the components of image as mutually independent as possible, one can solve for the matrix image under an objective function that measures the degree of dependence (or independence). Since mutual information measures the statistical dependence between random variables, we may use the mutual information between the components of image as the optimization criterion,2 i.e.,

image (6.14)

To some extent, the system parameter identification can be regarded as an ICA problem. Consider the FIR system identification:

image (6.15)

where image, and image and image are the image-dimensional weight vectors of the unknown system and the model, respectively. Regarding the vectors image and image as the source signal and the observation in ICA, respectively, we have

image (6.16)

where image is the mixing matrix and image is the image identity matrix. The goal of the parameter identification is to make the model weight vector image approximate the unknown weight vector image, and hence make the identification error image (image) approach the additive noise image, or in other words, make the vector image approach the ICA source vector image. Therefore, the vector image can be regarded as the demixing output vector, where the demixing matrix is

image (6.17)

Due to the scaling ambiguity of the demixing output, it is reasonable to introduce a more general demixing matrix:

image (6.18)

where image, image. In this case, the demixed output image will be related to the identification error via a proportional factor image.

According to (6.14), the optimal demixing matrix will be

image (6.19)

After obtaining the optimal matrix image, one may get the optimal weight vector [133]

image (6.20)

Clearly, the above ICA formulation is actually the MinMI criterion-based parameter identification.

6.1.3 ICA-Based Stochastic Gradient Identification Algorithm

The MinMI criterion is in essence equivalent to the MEE criterion. Thus, one can utilize the various information gradient algorithms in Chapter 4 to implement the MinMI criterion-based identification. In the following, we introduce an ICA-based stochastic gradient identification algorithm [133].

According to the previous discussion, the MinMI criterion-based identification can be regarded as an ICA problem, i.e.,

image (6.21)

where the demixing matrix image.

Since

image (6.22)

by (2.8), we have

image (6.23)

And hence

image (6.24)

where (a) is due to the fact that the term image does not depend on the matrix image. Denote the objective function by image. The instantaneous value of image is

image (6.25)

in which image is the PDF of image (image).

To solve for the demixing matrix image, one can resort to the natural (or relative) gradient-based method [133,237]:

image (6.26)

where image. As the PDF image is usually unknown, a certain nonlinear function (e.g., the tanh function) is used to approximate the image function.3

Adopting different step-sizes for learning the parameters image and image, we have

image (6.27)

The above algorithm is referred to as the ICA-based stochastic gradient identification algorithm (or simply the ICA algorithm). The model weight vector learned by this method is

image (6.28)

If the parameter image is set to a constant image, the algorithm reduces to

image (6.29)
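As an illustration of how such an update might look in practice, the following sketch implements a per-sample adaptation in the spirit of (6.29), with the score function approximated by the tanh nonlinearity; the function name, step-size, and scaling constant are illustrative assumptions rather than the exact algorithm of [133]:

```python
import numpy as np

def ica_sgd_identify(x, d, m, mu=0.01, scale=1.0):
    """Per-sample adaptation in the spirit of (6.29): the unknown score
    function is approximated by tanh (names and constants are assumptions)."""
    w = np.zeros(m)                 # adaptive FIR weight vector
    x_buf = np.zeros(m)             # tapped-delay-line input vector
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        e = d[k] - w @ x_buf        # identification error
        w = w + mu * np.tanh(scale * e) * x_buf   # tanh plays the role of the score function
    return w
```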

6.1.4 Numerical Simulation Example

Figure 6.4 illustrates the general configuration of an acoustic echo canceller (AEC) [133]. image is the far-end signal sent to the loudspeaker, and image is the echo signal entering the microphone, produced by undesirable acoustic coupling between the loudspeaker and the microphone. image is the near-end signal, which is usually independent of the far-end signal and the echo signal. image is the signal received by the microphone (image). The aim of echo cancelation is to remove the echo component in image by subtracting the output of an adaptive filter driven by the far-end signal. As shown in Figure 6.4, the filter output image is the synthetic echo signal, and the error signal image is the echo-canceled signal (i.e., the estimate of the near-end signal). The key technique in AEC is to build an accurate model of the echo channel (i.e., to accurately identify the parameters of the synthetic filter).

image

Figure 6.4 General configuration of an AEC.

One may use the previously discussed ICA algorithm to implement adaptive echo cancelation [133]. Suppose the echo channel is a 100-tap FIR filter, the input (far-end) signal image is uniformly distributed over the interval [–4, 4], and the noise (near-end) signal image is Cauchy distributed, i.e., image. The performance of the algorithms is measured by the echo return loss enhancement (ERLE) in dB:

image (6.30)
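Assuming the common definition of ERLE as the ratio of the echo (microphone) power to the residual-error power, expressed in decibels and evaluated with sample means, a small helper might look as follows (names and the smoothing constant are illustrative):

```python
import numpy as np

def erle_db(echo, residual, eps=1e-12):
    """ERLE in dB, assuming ERLE = 10*log10(E[d^2]/E[e^2]) with sample means."""
    echo = np.asarray(echo)
    residual = np.asarray(residual)
    return 10.0 * np.log10((np.mean(echo ** 2) + eps) / (np.mean(residual ** 2) + eps))
```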

Simulation results are shown in Figures 6.5 and 6.6. In Figure 6.5, the performances of the ICA algorithm, the normalized least mean square (NLMS), and the recursive least squares (RLS) are compared, while in Figure 6.6, the performances of the ICA algorithm and the algorithm (6.29) with image are compared. During the simulation, the image function in the ICA algorithm is chosen as

image (6.31)

image

Figure 6.5 Plots of the performance of three algorithms (ICA, NLMS, RLS) in Cauchy noise environment. Source: Adopted from [133].

image

Figure 6.6 Plots of the performance of the ICA algorithm and the algorithm (6.29) with image (image). Source: Adopted from [133].

It can be clearly seen that the ICA-based algorithm shows excellent performance in echo cancelation.

6.2 System Identification Under the MaxMI Criterion

Consider the system identification scheme shown in Figure 6.7, in which image is the common input to the unknown system and the model, image is the intrinsic (noiseless) output of the unknown system, image is the additive noise, image is the noisy output measurement, and image stands for the output of the model. Under the MaxMI criterion, the identification procedure is to determine a model image such that the mutual information between the noisy system output image and the model output image is maximized. Thus the optimal model image is given by

image (6.32)

where image denotes the model set (collection of all candidate models), image, image, and image denote, respectively, the PDFs of image, image, and image.

image

Figure 6.7 Scheme of the system identification under the MaxMI criterion.

The MaxMI criterion provides a fresh insight into system identification. Roughly speaking, the noisy measurement image represents the output of an information source that is transmitted over an information channel, i.e., the identifier (including the model set and the search algorithm), and the model output image represents the channel output. The identification problem can then be regarded as an information transmission problem, and the goal of identification is to maximize the channel capacity (measured by image) over all possible identifiers.

6.2.1 Properties of the MaxMI Criterion

In the following, we present some important properties of the MaxMI criterion [135,136].

Property 6.1:

Maximizing the mutual information image is equivalent to minimizing the conditional error entropy image, where image.

Proof:

It is easy to derive

image (6.33)

And hence

image (6.34)

where (a) is due to the fact that the model image has no effect on the entropy image.

The second property states that under certain conditions, the MaxMI criterion will be equivalent to maximizing the correlation coefficient.

Property 6.2:

If image and image are jointly Gaussian, we have image, where image is the correlation coefficient between image and image.

Proof:

Since image and image are jointly Gaussian, the mutual information image can be calculated as

image (6.35)

Since the log function is monotonically increasing, we have

image (6.36)
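For reference, for two jointly Gaussian scalar random variables the mutual information has the standard closed form (presumably the content of (6.35)): I(Y; Ŷ) = −(1/2) ln(1 − ρ²(Y, Ŷ)). This is a monotonically increasing function of the squared correlation coefficient, so maximizing the mutual information amounts to maximizing ρ²(Y, Ŷ).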

Property 6.3:

Assume the noise image is independent of the input signal image. Then maximizing the mutual information image is equivalent to maximizing a lower bound of the intrinsic (noiseless) mutual information image.

Proof:

Denote the intrinsic error by image, i.e., image. Then we have

image (6.37)

where (b) follows from the independence condition and the fact that the entropy of the sum of two independent random variables is not less than the entropy of each individual variable. It follows easily that

image (6.38)

which completes the proof.

In Figure 6.7, the measurement image may be further distorted by a certain function. Denoting the distorted measurement by image, we have

image (6.39)

where image is the distortion function. Such distortions are common in practical systems; typical examples include saturation and dead zone.

Property 6.4:

Suppose the noisy measurement image is distorted by a function image. Then maximizing the distorted mutual information image is equivalent to maximizing a lower bound of the undistorted mutual information image.

Proof:

This property is a direct consequence of the data processing inequality (see Theorem 2.3), which states that for any random variables image and image, and any measurable function image,

image (6.40)

In (6.40), if the function image is invertible, equality holds. In this case, we have

image (6.41)

That is, an invertible distortion does not change the optimal solutions under the MaxMI criterion.

Property 6.5:

If the measurement image is Gaussian, then maximizing the mutual information image will be equivalent to minimizing a lower bound of the MSE.

Proof:

According to Theorem 2.4, the rate distortion function for a Gaussian source image with MSE distortion is

image (6.42)

where image. Letting image, we have

image (6.43)

where image is the variance of image. It follows easily that

image (6.44)

This completes the proof.

Consider now a special case where the model is represented by an FIR filter in which the output image is given by

image (6.45)

where image is the input (regressor) vector and image is the weight vector. Then we have the following results.

Property 6.6:

For the case of the FIR model and under the assumption that image and image are jointly Gaussian, the optimal weight vector under the MaxMI criterion will be

image (6.46)

where image, image, image (image). In particular, if image, the MSE image attains the lower bound in (6.44), i.e.,

image (6.47)

Proof:

Since image and image are jointly Gaussian, image and image are also jointly Gaussian. By Property 6.2, we have

image (6.48)

where (c) holds because image does not depend on image. Then,

image (6.49)

Setting the above gradient to zero and denoting image, we obtain the optimal weight vector

image (6.50)

It can easily be verified that for any image and image, the optimal weight vector (6.50) makes the gradient (6.49) zero. When image, the optimal weight becomes the Wiener solution image. In this case, the MSE is

image (6.51)

Further, the mutual information image is

image (6.52)

Combining (6.51) and (6.52), we obtain image.

Property 6.6 indicates that with an FIR filter structure and under the Gaussian assumption, the MaxMI criterion yields a scaled Wiener solution, which is not unique; thus it does not satisfy the identifiability condition.4 The main reason is that an invertible transformation does not change the mutual information. In this property, image is restricted to be nonzero; if image, we have image, and the mutual information image is undefined (ill-posed).

A priori information is usually of great value in system identification. For example, if the structure of the system or its parameters are partially known, we may use this information to impose constraints on the structure or parameters of the filter. When the desired responses are distorted, a priori information can help improve the accuracy of the solution. In particular, certain parameter constraints may yield a unique optimal solution under the MaxMI criterion. Consider the optimal solution (6.50) under the following parameter constraint:

image (6.53)

where image, image. Letting image, we have

image (6.54)

If image, then image can be uniquely determined as image.

6.2.2 Stochastic Mutual Information Gradient Identification Algorithm

The stochastic gradient identification algorithm under the MaxMI criterion can be expressed as

image (6.55)

where image denotes the instantaneous estimate of the gradient of the mutual information image evaluated at the current weight vector, and image is the step-size. The key problem in the update equation (6.55) is how to calculate the instantaneous gradient image.

Let us start with the calculation of the gradient (not the instantaneous gradient) of image:

image (6.56)

where image, image, and image denote the related PDFs at the instant image. Then the instantaneous value of image can be obtained by dropping the expectation operator and plugging in the estimates of the PDFs, i.e.,

image (6.57)

where image and image are, respectively, the estimates of image and image. To estimate the density functions, one usually adopts the kernel density estimation (KDE) method and uses the following Gaussian functions as the kernels [135]

image (6.58)

where image and image denote the kernel widths.

Based on the above Gaussian kernels, the estimates of the PDFs and their gradients can be calculated as follows:

image (6.59)

image (6.60)

where image is the sliding data length and image. For an FIR filter, we have

image (6.61)

Combining (6.55), (6.57), (6.59), and (6.60), we obtain a stochastic gradient identification algorithm under the MaxMI criterion, which is referred to as the stochastic mutual information gradient (SMIG) algorithm [135].
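To make the construction concrete, the following is a minimal sketch of an SMIG-style update, assuming the instantaneous gradient is approximated by differentiating the kernel-density estimates of the joint and marginal densities only through the current model output; the window length, kernel widths, step-size, and function names are assumptions, not the exact implementation of [135]:

```python
import numpy as np

def gauss(u, h):
    """1-D Gaussian kernel with width h."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def smig_gradient(w, x_k, y_k, X_win, y_win, h1=0.5, h2=0.5):
    """Instantaneous MI gradient in the spirit of (6.57)-(6.60) (a sketch):
    KDE over a sliding window, differentiating only through the current
    model output yhat_k = w @ x_k (window samples treated as constants)."""
    yhat_k = w @ x_k
    yhat_win = X_win @ w
    u = yhat_k - yhat_win                   # differences to window model outputs
    v = y_k - y_win                         # differences to window measurements
    k_joint = gauss(v, h1) * gauss(u, h2)   # product kernels for p(y, yhat)
    k_marg = gauss(u, h2)                   # kernels for p(yhat)
    p_joint = np.mean(k_joint) + 1e-12
    p_marg = np.mean(k_marg) + 1e-12
    dp_joint = np.mean(k_joint * (-u / h2 ** 2))
    dp_marg = np.mean(k_marg * (-u / h2 ** 2))
    # d/dw [log p(y,yhat) - log p(yhat)] = (dp_joint/p_joint - dp_marg/p_marg) * x_k
    return (dp_joint / p_joint - dp_marg / p_marg) * x_k

def smig_identify(x, d, m, L=50, eta=0.05):
    """Gradient-ascent identification under the MaxMI criterion (illustrative)."""
    w = np.zeros(m)
    x_buf = np.zeros(m)
    X_hist, y_hist = [], []
    for k in range(len(x)):
        x_buf = np.concatenate(([x[k]], x_buf[:-1]))
        if y_hist:
            X_win = np.array(X_hist[-L:])
            y_win = np.array(y_hist[-L:])
            w = w + eta * smig_gradient(w, x_buf, d[k], X_win, y_win)
        X_hist.append(x_buf.copy())
        y_hist.append(d[k])
    return w
```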

The performance of the SMIG algorithm is compared with that of the least mean square (LMS) algorithm in the following Monte Carlo simulations. Consider the FIR system identification problem [135]:

image (6.62)

where image and image are, respectively, the transfer functions of the unknown system and the model. Suppose the input signal image and the additive noise image are both unit-power white Gaussian processes. To uniquely determine an optimal solution under the MaxMI criterion, it is assumed that the first component of the unknown weight vector is known a priori (with value 0.8). Thus the goal is to find the optimal solution for the other five weights. The initial weights (except image) of the adaptive FIR filter are zero-mean Gaussian distributed with variance 0.01. Further, the following distortion functions are considered [135]:

1. Undistorted: image

2. Saturation: image

3. Dead zone: image

4. Data loss5: image

Figure 6.8 plots the saturation and dead zone distortion functions. Figure 6.9 shows the desired response signal with data loss (the probability of data loss is 0.3). In the simulations below, Gaussian kernels are used and the kernel sizes are kept fixed at image.
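For illustration, plausible implementations of three of these distortions are sketched below; the saturation level and dead-zone width are assumed values (only the data-loss probability of 0.3 comes from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def saturation(y, limit=1.0):
    """Clip the measurement at +/- limit (assumed saturation level)."""
    return np.clip(y, -limit, limit)

def dead_zone(y, delta=0.5):
    """Zero out small measurements and shift the rest toward zero (assumed width delta)."""
    return np.sign(y) * np.maximum(np.abs(y) - delta, 0.0)

def data_loss(y, p_loss=0.3):
    """Randomly zero measurements with probability p_loss (loss probability from the text)."""
    return np.where(rng.random(len(y)) < p_loss, 0.0, y)
```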

image

Figure 6.8 Distortion functions of saturation and dead zone. Source: Adopted from [135].

image

Figure 6.9 Desired response signal with data loss. Source: Adopted from [135].

Figure 6.10 illustrates the average convergence curves over 50 Monte Carlo simulation runs. One can see that, without measurement distortion, the conventional LMS algorithm performs better. In the presence of measurement distortion, however, the LMS algorithm deteriorates markedly, whereas the SMIG algorithm is little affected and achieves much better performance. The simulation results confirm that the MaxMI criterion is more robust to measurement distortion than the traditional MSE criterion.

image

Figure 6.10 Average convergence curves of SMIG and LMS algorithms: (A) undistorted, (B) saturation, (C) dead zone, and (D) data loss. Source: Adopted from [135].

6.2.3 Double-Criterion Identification Method

The system identification scheme of Figure 6.7 does not, in general, satisfy the condition of parameter identifiability (i.e., the uniqueness of the optimal solution). To uniquely determine an optimal solution, some a priori information about the parameters is required. However, such a priori information is not available in many practical applications. To address this problem, we introduce the double-criterion identification method [136].

Consider the Wiener system shown in Figure 6.11, which has a cascade structure consisting of a discrete-time linear filter image followed by a zero-memory nonlinearity image. Wiener systems are typical nonlinear systems and are widely used for nonlinear modeling [238]. The double-criterion method is mainly aimed at Wiener system identification, but it also applies to many other systems; in fact, any system can be regarded as a cascade of itself followed by image.

image

Figure 6.11 Wiener system.

First, we define the equivalence between two Wiener systems.

Definition 6.1

Two Wiener systems image and image are said to be equivalent if and only if image, such that

image (6.63)

Although there is a scale factor image between two equivalent Wiener systems, they have exactly the same input–output behavior.

The optimal solution of the system identification scheme of Figure 6.7 is usually nonunique. For a Wiener system, this nonuniqueness means the optimal solutions are not all equivalent. According to the data processing inequality, we have

image (6.64)

where image and image denote, respectively, the intermediate output (the output of the linear part) and the zero-memory nonlinearity of the Wiener model. Then under the MaxMI criterion, the optimal Wiener model will be

image (6.65)

where image denotes the parameter vector of the linear subsystem and image denotes the set of all measurable mappings image. Evidently, the optimal solutions given by (6.65) contain infinitely many nonequivalent Wiener models. In fact, we always have image provided image is an invertible function.

To ensure that all optimal Wiener models are equivalent, the identification scheme of Figure 6.7 has to be modified. One can adopt the double-criterion identification method [136]. As shown in Figure 6.12, the double-criterion method utilizes both the MaxMI and MSE criteria to identify the Wiener system: the linear filter part is identified using the MaxMI criterion, and the zero-memory nonlinear part is learned using the MSE criterion. In Figure 6.12, image and image denote, respectively, the linear and nonlinear subsystems of the unknown Wiener system, where image and image are the related parameter vectors. The adaptive Wiener model image usually takes the form of “FIR + polynomial,” that is, the linear subsystem image is an image-order FIR filter, and the nonlinear subsystem image is a image-order polynomial. In this case, the intermediate output image and the final output image of the model are

image (6.66)

where image and image are the image-dimensional FIR weight vector and input vector, respectively, image is the image-dimensional polynomial coefficient vector, and image is the polynomial basis vector.
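A minimal sketch of the forward pass of such an “FIR + polynomial” model, with assumed dimensions and basis ordering, is:

```python
import numpy as np

def wiener_forward(w, p, x_vec):
    """Forward pass of an 'FIR + polynomial' Wiener model (a sketch):
    z = w @ x_vec is the intermediate (linear) output, and the final output
    is a polynomial in z with coefficient vector p = [p0, p1, ..., pK]."""
    z = w @ x_vec                                        # FIR (linear) subsystem
    basis = np.array([z ** j for j in range(len(p))])    # polynomial basis [1, z, z^2, ...]
    return z, p @ basis
```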

image

Figure 6.12 Double-criterion identification scheme for Wiener system: (1) linear filter part is identified using MaxMI criterion and (2) nonlinear part is trained using MSE criterion.

It should be noted that similar two-gradient identification algorithms for the Wiener system have been proposed in [239,240], wherein the linear and nonlinear subsystems are both identified using the MSE criterion.

The optimal solution for the above double-criterion identification is

image (6.67)

In the general case, it is hard to find closed-form expressions for image and image. The following theorem considers only the case in which the unknown Wiener system has the same structure as the assumed Wiener model.

Theorem 6.4

For the Wiener system identification scheme shown in Figure 6.12, if

1. The unknown system and the model have the same structure, that is, image and image are both image-order FIR filters, and image and image are both image-order polynomials.

2. The nonlinear function image is invertible.

3. The additive noise image is independent of the input vector image.

Then the optimal solution of (6.67) will be

image (6.68)

where image, image, and image is expressed as

image (6.69)

Proof:

Since image and image are mutually independent, we have

image (6.70)

where image follows from the fact that the entropy of the sum of two independent random variables is not less than the entropy of each individual variable.

In (6.70), equality holds if and only if, conditioned on image, image is deterministic, that is, image is a function of image. This implies that the mutual information image achieves its maximum value (image) if and only if there exists a function image such that image, i.e.,

image (6.71)

As the nonlinear function image is assumed to be invertible, we have

image (6.72)

where image denotes the inverse function of image and image. It follows that

image (6.73)

And hence

image (6.74)

which implies image. Letting image, we obtain the optimal FIR weight vector image.

Now the optimal polynomial coefficients can easily be determined. By the independence assumption, we have

image (6.75)

with equality if and only if image. This means the MSE cost attains its minimum value (image) if and only if the intrinsic error (image) remains zero. Therefore,

image (6.76)

Then we get image, where image is given by (6.69). This completes the proof.

Theorem 6.4 indicates that for the identification scheme of Figure 6.12, under certain conditions the optimal solution matches the true system exactly (i.e., with zero intrinsic error). There is a free parameter image in the solution; however, its specific value has no substantial effect on the cascaded model. Reference [240] gives a similar result for the optimal solution under the single MSE criterion, wherein the linear FIR subsystem is estimated up to a scaling factor equal to the derivative of the nonlinear function around a bias point.

The double-criterion identification can be implemented in two ways. The first is the sequential identification scheme, in which the MaxMI criterion is first used to learn the linear FIR filter; at the end of this first adaptation phase, the tap weights are frozen, and the MSE criterion is then used to estimate the polynomial coefficients. The second scheme simultaneously trains both the linear and nonlinear parts of the Wiener model. Obviously, the second scheme is more suitable for online identification. In the following, we focus only on the simultaneous scheme.

In [136], a stochastic gradient-based double-criterion identification algorithm was developed as follows:

image (6.77)

where image denotes the stochastic (instantaneous) gradient of the mutual information image with respect to the FIR weight vector image (see (6.57) for its computation), image denotes the Euclidean norm, and image and image are the step-sizes. The update equation (a1) is actually the SMIG algorithm developed in Section 6.2.2. The second part (a2) of the algorithm (6.77) scales the FIR weight vector to a unit vector. The purpose of this scaling is to constrain the output energy of the FIR filter and to avoid very large values of the scale factor image in the optimal solution. Since mutual information is scaling invariant,6 the scaling step (a2) does not influence the search for the optimal solution; however, it does affect the value of image in the optimal solution. In fact, if the algorithm converges to the optimal solution, we have image, and

image (6.78)

That is, the scale factor image equals either image or image, which is no longer a free parameter.

The third part (a3) of the algorithm (6.77) is the NLMS algorithm, which minimizes the MSE cost with the step-size scaled by the energy of the polynomial regression signal image. NLMS is more suitable for nonlinear subsystem identification than the standard LMS algorithm because, during adaptation, the polynomial regression signals are usually nonstationary. The algorithm (6.77) is referred to as the SMIG-NLMS algorithm [136].
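A sketch of one iteration of the three-step update (a1)–(a3) is given below; the mutual-information gradient is passed in as an argument (it could come from a KDE-based SMIG routine such as the one sketched in Section 6.2.2), and the step-sizes and regularization constant are illustrative assumptions:

```python
import numpy as np

def smig_nlms_step(w, p, x_vec, y_k, mi_grad, eta_w=0.05, eta_p=0.5, eps=1e-6):
    """One illustrative iteration of the double-criterion update (6.77):
    (a1) SMIG ascent on the FIR weights, (a2) normalization of the weights,
    (a3) NLMS descent on the polynomial coefficients."""
    # (a1) stochastic MI-gradient ascent; mi_grad is the instantaneous gradient
    #      of the mutual information with respect to w
    w = w + eta_w * mi_grad
    # (a2) scale the FIR weight vector to unit Euclidean norm
    w = w / (np.linalg.norm(w) + eps)
    # (a3) NLMS update of the polynomial coefficients on the model error
    z = w @ x_vec
    basis = np.array([z ** j for j in range(len(p))])
    e = y_k - p @ basis
    p = p + eta_p * e * basis / (basis @ basis + eps)
    return w, p
```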

Next, Monte Carlo simulation results are presented to demonstrate the performance of the SMIG-NLMS algorithm. For comparison purposes, simulation results of the following two algorithms are also included.

image (6.79)

image (6.80)

where image, image, image, and image are step-sizes, image, image, and image denotes the stochastic information gradient (SIG) under the Shannon entropy criterion, calculated as

image (6.81)

where image denotes the kernel function with bandwidth image. The algorithms (6.79) and (6.80) are referred to as the LMS-NLMS and SIG-NLMS algorithms [136], respectively. Note that the LMS-NLMS algorithm is actually the normalized version of the algorithm developed in [240].

Due to the “scaling” property of the linear and nonlinear portions, the representation of a Wiener system is not unique. To evaluate how close the estimated Wiener model is to the true system, we introduce the following measures [136]:

1. Angle between image and image:

image (6.82)

where image.

2. Angle between image and image:

image (6.83)

where image is

image (6.84)

3. Intrinsic error power (IEP)7:

image (6.85)

Among the three performance measures, the angles image and image quantify the identification performance of the subsystems (linear FIR and nonlinear polynomial), while the IEP quantifies the overall performance.
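Assuming the angles are the usual angles between the estimated and true parameter vectors and the IEP is the sample mean of the squared intrinsic error (footnote 7), these measures might be computed as follows (the rescaling of the polynomial vector in (6.84) is omitted):

```python
import numpy as np

def angle_deg(a, b, eps=1e-12):
    """Angle (in degrees) between two parameter vectors, in the spirit of (6.82)/(6.83)."""
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def iep(y_intrinsic, y_model):
    """Intrinsic error power (6.85), evaluated with the sample mean (footnote 7)."""
    return np.mean((np.asarray(y_intrinsic) - np.asarray(y_model)) ** 2)
```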

Let us consider the case in which the FIR weights and the polynomial coefficients of the unknown Wiener system are [136]

image (6.86)

The common input image is a white Gaussian process with unit variance, and the disturbance noise image is another white Gaussian process with variance image. The initial FIR weight vector image of the adaptive model is obtained by normalizing a zero-mean Gaussian-distributed random vector (image), and the initial polynomial coefficients are zero-mean Gaussian distributed with variance 0.01. For the SMIG-NLMS and SIG-NLMS algorithms, the sliding data length is set to image and the kernel widths are chosen according to Silverman’s rule. The step-sizes of the algorithms are experimentally selected so that the initial convergence rates are visually identical.

The average convergence curves of the angles image and image over 1000 independent Monte Carlo simulation runs are shown in Figures 6.13 and 6.14. It is evident that the SMIG-NLMS algorithm achieves the smallest angles (mismatches) for both the linear and nonlinear subsystems during the steady-state phase. More detailed statistical results of the subsystem training are presented in Figures 6.15 and 6.16, which plot the histograms of the angles image and image at the final iteration. The inset plots in Figures 6.15 and 6.16 summarize the mean and spread of the histograms. One can observe again that the SMIG-NLMS algorithm outperforms both the LMS-NLMS and SIG-NLMS algorithms in terms of the angles between the estimated and true parameter vectors.

image

Figure 6.13 Average convergence curves of the angle image over 1000 Monte Carlo runs. Source: Adopted from [136].

image

Figure 6.14 Average convergence curves of the angle image over 1000 Monte Carlo runs. Source: Adopted from [136].

image

Figure 6.15 Histogram plots of the angle image at the final iteration over 1000 Monte Carlo runs. Source: Adopted from [136].

image

Figure 6.16 Histogram plots of the angle image at the final iteration over 1000 Monte Carlo runs. Source: Adopted from [136].

The overall identification performance can be measured by the IEP. Figure 6.17 illustrates the convergence curves of the IEP over 1000 Monte Carlo runs. It is clear that the SMIG-NLMS algorithm achieves the smallest IEP during the steady-state phase. Figure 6.18 shows the probability density functions of the steady-state intrinsic errors. As expected, the SMIG-NLMS algorithm yields the largest and most concentrated peak centered at zero intrinsic error, and hence achieves the best identification accuracy.

image

Figure 6.17 Convergence curves of the IEP over 1000 Monte Carlo runs. Source: Adopted from [136].

image

Figure 6.18 Probability density functions of the steady-state intrinsic errors. Source: Adopted from [136].

In the previous simulations, the unknown Wiener system has the same structure as the assumed model. To show how the algorithm performs when the real system differs from the assumed model (i.e., the unmatched case), another simulation with the same setup is conducted. This time the linear and nonlinear parts of the unknown system are assumed to be

image (6.87)

Table 6.1 lists the mean±deviation results of the IEP at the final iteration over 1000 Monte Carlo runs. Clearly, the SMIG-NLMS algorithm produces the IEP with both the lowest mean and the smallest deviation. Figure 6.19 shows the desired output (the intrinsic output of the true system) and the model outputs (trained by the different algorithms) for the last 100 samples of the test input. The results indicate that the model identified by the SMIG-NLMS algorithm describes the test data with the best accuracy.

Table 6.1

Mean±Deviation Results of the IEP at the Final Iteration Over 1000 Monte Carlo Runs

Algorithm      IEP
SMIG-NLMS      0.0011 ± 0.0065
LMS-NLMS       0.0028 ± 0.0074
SIG-NLMS       0.0033 ± 0.0089

Source: Adopted from [136].

image

Figure 6.19 Desired output (the intrinsic output of the true system) and model outputs for the test input data. Source: Adopted from [136].

Appendix I MinMI Rate Criterion

The authors in [124] propose the MinMI rate criterion. Consider the linear Gaussian system

image (I.1)

where image, image, image. One can adopt the following linear recursive algorithm to estimate the parameters

image (I.2)

The MinMI rate criterion searches for an optimal gain matrix image such that the mutual information rate image between the error signal image and a unit-power white Gaussian noise image is minimized, where image, and image is a certain Gaussian process independent of image. Clearly, the MinMI rate criterion requires that the asymptotic power spectrum image of the error process image satisfy image (otherwise image does not exist). It can be calculated that

image (I.3)

where image is a spectral factor of image. Hence, under the MinMI rate criterion, the optimal gain matrix image will be

image (I.4)


1A minimum mutual information rate criterion was also proposed in [124]; it minimizes the mutual information rate between the error signal and a certain white Gaussian process (see Appendix I).

2Mutual information minimization is a basic optimality criterion in ICA. Other ICA criteria, such as negentropy maximization, Infomax, likelihood maximization, and higher-order statistics, are in general consistent with mutual information minimization.

3One can also apply the kernel density estimation method to estimate image, and then use the estimated PDF to compute the image function.

4It is worth noting that the identifiability problem under the MaxMI criterion has been studied in [134], wherein “identifiability” does not mean the uniqueness of the solution but merely means that the mutual information between the system output and the model output is nonzero.

5Data loss means that measurement data are occasionally lost due to failures in the sensors or communication channels.

6For any random variables image and image, the mutual information image satisfies image, image, image.

7In practice, the IEP is evaluated using the sample mean instead of the expectation value.
