4

System Identification Under Minimum Error Entropy Criteria

In the previous chapter, we gave an overview of information theoretic parameter estimation. These estimation methods, however, are devoted to cases where a large amount of statistical information about the unknown parameter is assumed to be available. For example, minimum divergence estimation needs to know the likelihood function of the parameter. Also, in Bayes estimation with the minimum error entropy (MEE) criterion, the joint distribution of the unknown parameter and the observation is assumed to be known. In this and later chapters, we further investigate information theoretic system identification. Our focus is mainly on system parameter estimation (identification) where no statistical information on the parameters is available (i.e., only data samples are available). To develop identification algorithms under information theoretic criteria, one needs to evaluate the related information measures. This requires knowledge of the data distributions, which are, in general, unknown to us. To address this issue, we can use the estimated (empirical) information measures as the identification criteria.

4.1 Brief Sketch of System Parameter Identification

System identification involves fitting experimental input–output data (training data) into an empirical model. In general, system identification includes the following key steps:

• Experiment design: To obtain good experimental data. Usually, the input signals should be designed such that they provide enough process excitation.

• Selection of model structure: To choose a suitable model structure based on the training data or prior knowledge.

• Selection of the criterion: To choose a suitable criterion (cost) function that reflects how well the model fits the experimental data.

• Parameter identification: To obtain the model parameters1 by optimizing (minimizing or maximizing) the above criterion function.

• Model validation: To test the model so as to reveal any inadequacies.

In this book, we focus mainly on the parameter identification part. Figure 4.1 shows a general scheme of discrete-time system identification, where image and image denote the system input and output (clean output) at time image, image is an additive noise that accounts for the system uncertainty or measurement error, and image is the measured output. Further, image is the model output and image denotes the identification error, which is defined as the difference between the measured output and the model output, i.e., image. The goal of parameter identification is then to search for the model parameter vector (or weight vector) so as to minimize (or maximize) a certain criterion function (usually the model structure is predefined).

image

Figure 4.1 A general scheme of system identification.

The implementation of system parameter identification involves model structure, criterion function, and parameter search (identification) algorithm. In the following, we will briefly discuss these three aspects.

4.1.1 Model Structure

Generally speaking, the model structure is a parameterized mapping from inputs2 to outputs. There are various mathematical descriptions of system model (linear or nonlinear, static or dynamic, deterministic or stochastic, etc.). Many of them can be expressed as the following linear-in-parameter model:

image (4.1)

where image denotes the regression input vector and image denotes the weight vector (i.e., parameter vector). The simplest linear-in-parameter model is the adaptive linear neuron (ADALINE). Let the input be an image-dimensional vector image. The output of the ADALINE model is then

image (4.2)

where image is a bias (some constant). In this case, we have

image (4.3)

If the bias image is zero, and the input vector is image, which is formed by feeding the input signal to a tapped delay line, then the ADALINE becomes a finite impulse response (FIR) filter.
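To make the linear-in-parameter structure concrete, here is a minimal sketch in Python/NumPy of an ADALINE output and the tapped-delay-line regressor of an FIR filter. All variable names and the example numbers are illustrative assumptions, not taken from the text.

```python
import numpy as np

def adaline_output(w, x, b=0.0):
    """ADALINE output: inner product of weight vector and input, plus a bias."""
    return np.dot(w, x) + b

def fir_regressor(u, n, m):
    """Tapped-delay-line vector [u(n), u(n-1), ..., u(n-m+1)] for an m-tap FIR filter."""
    taps = [u[n - k] if n - k >= 0 else 0.0 for k in range(m)]
    return np.array(taps)

# Example: a 3-tap FIR filter (ADALINE with zero bias) driven by a short input sequence
u = np.array([1.0, 0.5, -0.2, 0.3])
w = np.array([0.4, 0.25, -0.1])
y = adaline_output(w, fir_regressor(u, n=3, m=3))
```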

The ARX (autoregressive with external input) dynamic model is another important linear-in-parameter model:

image (4.4)

One can write Eq. (4.4) as image if we let

image (4.5)

The linear-in-parameter model also includes many nonlinear models as special cases. For example, the image-order polynomial model can be expressed as

image (4.6)

where

image (4.7)

Other examples include: the discrete-time Volterra series with finite memory and order, Hammerstein model, radial basis function (RBF) neural networks with fixed centers, and so on.

In most cases, the system model is a nonlinear-in-parameter model, whose output is not linearly related to the parameters. A typical example of a nonlinear-in-parameter model is the multilayer perceptron (MLP) [53]. The MLP with one hidden layer can be generally expressed as follows:

image (4.8)

where image is the image input vector, image is the image weight matrix connecting the input layer with the hidden layer, image is the activation function (usually a sigmoid function), image is the image bias vector for the hidden neurons, image is the image weight vector connecting the hidden layer to the output neuron, and image is the bias for the output neuron.

The model can also be created in kernel space. In kernel machine learning (e.g., the support vector machine, SVM), one often uses a reproducing kernel Hilbert space (RKHS) image associated with a Mercer kernel image as the hypothesis space [161, 162]. According to the Moore–Aronszajn theorem [174, 175], every Mercer kernel image induces a unique function space image, namely the RKHS, whose reproducing kernel is image, satisfying: 1) image, the function image, and 2) image, and for every image, image, where image denotes the inner product in image. If the Mercer kernel image is strictly positive-definite, the induced RKHS image is universal (dense in the space of continuous functions over image). Assuming the input signal image, the model in the RKHS image can be expressed as

image (4.9)

where image is the unknown input–output mapping that needs to be estimated. This model is a nonparametric function over input space image. However, one can regard it as a “parameterized” model, where the parameter space is the RKHS image.

The model (4.9) can alternatively be expressed in a feature space (a vector space in which the training data are embedded). According to Mercer’s theorem, any Mercer kernel image induces a mapping image from the input space image to a feature space image.3 In the feature space, the inner products can be calculated using the kernel evaluation:

image (4.10)

The feature space image is isometric-isomorphic to the RKHS image. This can be easily understood by identifying image and image, where image denotes a vector in feature space image, satisfying image, image. Therefore, in feature space the model (4.9) becomes

image (4.11)

This is a linear model in feature space, with image as the input and image as the weight vector. It is worth noting that the model (4.11) is actually a nonlinear model in the input space, since the mapping image is in general nonlinear. The key principle behind kernel methods is that, as long as a linear model (or algorithm) in the high-dimensional feature space can be formulated in terms of inner products, a nonlinear model (or algorithm) can be obtained by simply replacing the inner product with a Mercer kernel. The model (4.11) can also be regarded as a “parameterized” model in feature space, where the parameter is the weight vector image.

4.1.2 Criterion Function

The criterion (risk or cost) function in system identification reflects how well the model fits the experimental data. In most cases, the criterion is a functional of the identification error image, with the form

image (4.12)

where image is a loss function, which usually satisfies

Nonnegativity: image;

Symmetry: image;

Monotonicity: image, image;

Integrability: i.e., image is an integrable function.

Typical examples of criterion (4.12) include the mean square error (MSE), mean absolute deviation (MAD), mean image-power error (MPE), and so on. In practice, the error distribution is in general unknown, and hence we have to estimate the expectation in Eq. (4.12) from sample data. The estimated criterion function is called the empirical criterion function (empirical risk). Given a loss function image, the empirical criterion function image can be computed as follows:

a. Instantaneous criterion function: image;

b. Average criterion function: image;

c. Weighted average criterion function: image.

Note that for the MSE criterion (image), the average criterion function is the well-known least-squares criterion function (sum of the squared errors).
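The three empirical criteria above can be illustrated by the following sketch (Python/NumPy), using the squared-error loss as an example; the exponential weighting and all names are illustrative assumptions rather than the book's specific choices.

```python
import numpy as np

def loss(e):
    """Example loss function: squared error. Any nonnegative, symmetric,
    monotone, integrable loss could be substituted here."""
    return e ** 2

def instantaneous_cost(e_k):
    """Instantaneous criterion: loss of the current error sample only."""
    return loss(e_k)

def average_cost(errors):
    """Average criterion: sample mean of the loss (least squares up to a 1/N factor)."""
    return np.mean(loss(np.asarray(errors)))

def weighted_average_cost(errors, lam=0.95):
    """Weighted average criterion: an exponentially decaying weighting is assumed,
    giving more weight to recent errors."""
    errors = np.asarray(errors)
    weights = lam ** np.arange(len(errors) - 1, -1, -1)  # most recent error gets weight 1
    return np.sum(weights * loss(errors)) / np.sum(weights)

errors = np.array([0.3, -0.1, 0.25, 0.05])
print(instantaneous_cost(errors[-1]), average_cost(errors), weighted_average_cost(errors))
```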

Besides the criterion functions of form (4.12), there are many other criterion functions for system identification. In this chapter, we will discuss system identification under MEE criterion.

4.1.3 Identification Algorithm

Given a parameterized model, the identification error image can be expressed as a function of the parameters. For example, for the linear-in-parameter model (4.1), we have

image (4.13)

which is a linear function of image (assuming image and image are known). Similarly, the criterion function image (or the empirical criterion function image) can also be expressed as a function of the parameters, denoted by image (or image). Therefore, the identification criterion represents a hyper-surface in the parameter space, which is called the performance surface.

The parameter image can be identified through searching the optima (minima or maxima) of the performance surface. There are two major ways to do this. One is the batch mode and the other is the online (sequential) mode.

4.1.3.1 Batch Identification

In batch mode, parameter identification is performed only after a number of samples, or possibly the whole training data set, has been collected. When these data are available, one can calculate the empirical criterion function image based on the model structure. Then, the parameter image can be estimated by solving the following optimization problem:

image (4.14)

where image denotes the set of all possible values of image. Sometimes, one can achieve an analytical solution by setting the gradient4 of image to zero, i.e.,

image (4.15)

For example, with the linear-in-parameter model (4.1) and under the least-squares criterion (empirical MSE criterion), we have

image (4.16)

where image and image. And hence,

image (4.17)

If image is a nonsingular matrix, we have

image (4.18)
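For the linear-in-parameter model under the least-squares criterion, the analytical solution (4.18) can be computed as in the following sketch (Python/NumPy; the data and names are illustrative). Solving the normal equations directly is shown for clarity; in practice a dedicated least-squares solver is usually preferred for numerical robustness.

```python
import numpy as np

def batch_least_squares(Phi, d):
    """Batch least-squares estimate for a linear-in-parameter model d ~ Phi @ w.
    Phi: (N, m) matrix whose rows are the regression vectors.
    d:   (N,) vector of measured outputs.
    """
    R = Phi.T @ Phi              # sample autocorrelation-type matrix
    p = Phi.T @ d                # sample cross-correlation-type vector
    return np.linalg.solve(R, p)  # assumes R is nonsingular

# Illustrative data: a 2-parameter linear model with additive noise
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 2))
w_true = np.array([1.0, -0.5])
d = Phi @ w_true + 0.01 * rng.normal(size=100)
w_hat = batch_least_squares(Phi, d)
```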

In many situations, however, there is no analytical solution for image, and we have to rely on nonlinear optimization techniques, such as gradient descent methods, simulated annealing methods, and genetic algorithms (GAs).

The batch mode approach has some shortcomings: (i) it is not suitable for online applications, since the identification is performed only after a number of data are available and (ii) the memory and computational requirements will increase dramatically with the increasing amount of data.

4.1.3.2 Online Identification

The online mode identification is also referred to as sequential or incremental identification, or adaptive filtering. Compared with batch mode identification, sequential identification has some desirable features: (i) the training data (examples or observations) are presented sequentially (one by one) to the identification procedure; (ii) at any time, only a few (usually one) training samples are used; (iii) a training observation can be discarded as soon as the identification procedure for that particular observation is completed; and (iv) it is not necessary to know how many training observations will be presented in total. In this book, our focus is primarily on sequential identification.

The sequential identification is usually performed by means of iterative schemes of the type

image (4.19)

where image denotes the estimated parameter at image instant (iteration) and image denotes the adjustment (correction) term. In the following, we present several simple online identification (adaptive filtering) algorithms.

4.1.3.3 Recursive Least Squares Algorithm

Given a linear-in-parameter model, the Recursive Least Squares (RLS) algorithm recursively finds the least-squares solution of Eq. (4.18). With a sequence of observations image up to and including time image, the least-squares solution is

image (4.20)

When a new observation image becomes available, the parameter estimate image is

image (4.21)

One can derive the following relation between image and image:

image (4.22)

where image is the prediction error,

image (4.23)

and image is the gain vector, computed as

image (4.24)

where the matrix image can be calculated recursively as follows:

image (4.25)

Equations (4.22)–(4.25) constitute the RLS algorithm.

Compared to most of its competitors, the RLS exhibits very fast convergence. However, this benefit is achieved at the cost of high computational complexity. If the dimension of image is image, then the time and memory complexities of RLS are both image.
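The recursions (4.22)–(4.25) can be arranged as in the sketch below (Python/NumPy). The initialization of the inverse-correlation matrix with a large diagonal value, the unit forgetting factor, and all names are illustrative assumptions, not a transcription of the book's equations.

```python
import numpy as np

class RLS:
    """Recursive least squares for a linear-in-parameter model y = w^T phi."""
    def __init__(self, dim, delta=1e3):
        self.w = np.zeros(dim)        # parameter estimate
        self.P = delta * np.eye(dim)  # inverse correlation matrix, large initial value

    def update(self, phi, d):
        e = d - self.w @ phi                             # a priori prediction error
        k = self.P @ phi / (1.0 + phi @ self.P @ phi)    # gain vector
        self.w = self.w + k * e                          # parameter update
        self.P = self.P - np.outer(k, phi @ self.P)      # recursive update of P
        return e

# Usage: identify a 3-tap FIR system from streaming data
rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.1])
rls = RLS(dim=3)
for _ in range(200):
    phi = rng.normal(size=3)
    d = w_true @ phi + 0.01 * rng.normal()
    rls.update(phi, d)
```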

4.1.3.4 Least Mean Square Algorithm

The Least Mean Square (LMS) algorithm is much simpler than the RLS; it is a stochastic gradient descent algorithm under the instantaneous MSE cost image. The weight update equation of the LMS can be derived simply as follows:

image (4.26)

where image is the step-size (adaptation gain, learning rate, etc.),5 and the term image is the instantaneous gradient of the model output with respect to the weight vector, whose form depends on the model structure. For an FIR filter (or ADALINE), the instantaneous gradient is simply the input vector image. In this case, the LMS algorithm becomes6

image (4.27)

The computational complexity of the LMS (4.27) is just image, where image is the input dimension.

If the model is an MLP network, the term image can be computed by back propagation (BP), which is a common method of training artificial neural networks so as to minimize the objective function [53].

There are many other stochastic gradient descent algorithms similar to the LMS. Typical examples include the least absolute deviation (LAD) algorithm [31] and the least mean fourth (LMF) algorithm [26]. The LMS, LAD, and LMF algorithms are all special cases of the least mean image-power (LMP) algorithm [30]. The LMP algorithm aims to minimize the image-power of the error and can be derived as

image (4.28)

For the cases image, the above algorithm corresponds to the LAD, LMS, and LMF algorithms, respectively.
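For an FIR/ADALINE model, the LMS update (4.27) and its LMP generalization can be sketched as follows (Python/NumPy). Constant factors arising from the derivative of the error power are absorbed into the step-size, and the parameter values are illustrative assumptions.

```python
import numpy as np

def lms_update(w, x, d, eta=0.01):
    """LMS: stochastic gradient descent on the instantaneous squared error."""
    e = d - w @ x
    return w + eta * e * x, e

def lmp_update(w, x, d, p=2, eta=0.01):
    """LMP: stochastic gradient descent on |e|^p (constants folded into eta).
    p = 1, 2, 4 correspond to LAD-, LMS-, and LMF-type updates, respectively."""
    e = d - w @ x
    return w + eta * np.abs(e) ** (p - 1) * np.sign(e) * x, e

# Usage: LAD-type adaptation, which is robust to heavy-tailed noise
rng = np.random.default_rng(2)
w = np.zeros(3)
for _ in range(500):
    x = rng.normal(size=3)
    d = np.array([0.5, -0.3, 0.1]) @ x + 0.01 * rng.laplace()
    w, _ = lmp_update(w, x, d, p=1, eta=0.02)
```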

4.1.3.5 Kernel Adaptive Filtering Algorithms

The kernel adaptive filtering (KAF) algorithms are a family of nonlinear adaptive filtering algorithms developed in kernel (or feature) space [12], by using the linear structure and inner product of this space to implement the well-established linear adaptive filtering algorithms (e.g., LMS, RLS, etc.) and to obtain nonlinear filters in the original input space. They have several desirable features: (i) if choosing a universal kernel (e.g., Gaussian kernel), they are universal approximators; (ii) under MSE criterion, the performance surface is quadratic in feature space so gradient descent learning does not suffer from local minima; and (iii) if pruning the redundant features, they have moderate complexity in terms of computation and memory. Typical KAF algorithms include the kernel recursive least squares (KRLS) [176], kernel least mean square (KLMS) [177], kernel affine projection algorithms (KAPA) [178], and so on. When the kernel is radial (such as the Gaussian kernel), they naturally build a growing RBF network, where the weights are directly related to the errors at each sample. In the following, we only discuss the KLMS algorithm. Interested readers can refer to Ref. [12] for further information about KAF algorithms.

Let image be an image-dimensional input vector. We can transform image into a high-dimensional feature space image (induced by kernel image) through a nonlinear mapping image, i.e., image. Suppose the model in feature space is given by Eq. (4.11). Then using the LMS algorithm on the transformed observation sequence image yields [177]

image (4.29)

where image denotes the estimated weight vector (at iteration image) in feature space. The KLMS (4.29) is very similar to the LMS algorithm, except for the dimensionality (or richness) of the projection space. The learned mapping (model) at iteration image is the composition of image and image, i.e., image. If identifying image, we obtain the sequential learning rule in the original input space:

image (4.30)

At iteration image, given an input image, the output of the filter is

image (4.31)

From Eq. (4.31) we see that, if choosing a radial kernel, the KLMS produces a growing RBF network by allocating a new kernel unit for every new example with input image as the center and image as the coefficient. The algorithm of KLMS is summarized in Table 4.1, and the corresponding network topology is illustrated in Figure 4.2.

Table 4.1

The KLMS Algorithm

Image

image

Figure 4.2 Network topology of KLMS at iteration image.

Selecting a proper Mercer kernel is crucial for all kernel methods. In KLMS, the kernel is usually chosen to be a normalized Gaussian kernel:

image (4.32)

where image is the kernel size (kernel width) and image is called the kernel parameter. The kernel size in Gaussian kernel is an important parameter that controls the degree of smoothing and consequently has significant influence on the learning performance. Usually, the kernel size can be set manually or estimated in advance by Silverman’s rule [97]. The role of the step-size in KLMS remains in principle the same as the step-size in traditional LMS algorithm. Specifically, it controls the compromise between convergence speed and misadjustment. It has also been shown in Ref. [177] that the step-size in KLMS plays a similar role as the regularization parameter.

The main bottleneck of KLMS (as well as other KAF algorithms) is the linear growing network with each new sample, which poses both computational and memory issues especially for continuous adaptation scenarios. In order to curb the network growth and to obtain a compact representation, a variety of sparsification techniques can be applied, where only the important input data are accepted as the centers. Typical sparsification criteria include the novelty criterion [179], approximate linear dependency (ALD) criterion [176], coherence criterion [180], surprise criterion [181], and so on. The idea of quantization can also be used to yield a compact network with desirable accuracy [182].
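A minimal KLMS sketch with a Gaussian kernel follows (Python/NumPy). For simplicity, it stores every input as a center with coefficient equal to the step-size times the prediction error and applies no sparsification; the target nonlinearity and parameter values in the usage example are illustrative assumptions.

```python
import numpy as np

class KLMS:
    """Kernel LMS with a Gaussian kernel: a growing RBF network where each input
    becomes a center and (step-size * error) becomes its coefficient."""
    def __init__(self, eta=0.5, sigma=1.0):
        self.eta, self.sigma = eta, sigma
        self.centers, self.coeffs = [], []

    def _kernel(self, u, v):
        diff = np.asarray(u) - np.asarray(v)
        return np.exp(-np.dot(diff, diff) / (2.0 * self.sigma ** 2))

    def predict(self, u):
        return sum(a * self._kernel(c, u) for a, c in zip(self.coeffs, self.centers))

    def update(self, u, d):
        e = d - self.predict(u)                       # prediction error
        self.centers.append(np.asarray(u, dtype=float))
        self.coeffs.append(self.eta * e)              # allocate a new kernel unit
        return e

# Usage: learn a static nonlinearity from noisy samples
rng = np.random.default_rng(3)
klms = KLMS(eta=0.5, sigma=0.5)
for _ in range(300):
    u = rng.uniform(-1, 1, size=1)
    d = np.sin(3 * u[0]) + 0.01 * rng.normal()
    klms.update(u, d)
```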

4.2 MEE Identification Criterion

Most existing approaches to parameter identification use the MSE (or equivalently, the Gaussian likelihood) as the identification criterion. The MSE is mathematically tractable and, under the Gaussian assumption, is an optimal criterion for linear systems. However, it is well known that MSE may be a poor descriptor of optimality in nonlinear and non-Gaussian (e.g., multimodal, heavy-tailed, or finite-range distributions) situations, since it constrains only the second-order statistics. To address this issue, one can select a criterion beyond second-order statistics that does not suffer from the limitation of the Gaussian assumption and can improve performance in many realistic scenarios. Information theoretic quantities (entropy, divergence, mutual information, etc.) have attracted ever-increasing attention as identification criteria to this end, since they can capture higher order statistics and the information content of signals rather than simply their energy [64]. In the following, we discuss the MEE criterion for system identification.

Under MEE criterion, the parameter vector (weight vector) image can be identified by solving the following optimization problem:

image (4.33)

where image denotes the probability density function (PDF) of error image. If using the order-image Renyi entropy (image, image) of the error as the criterion function, the estimated parameter will be

image (4.34)

where image follows from the monotonicity of the logarithm function, and image is the order-image information potential (IP) of the error image. If image, minimizing the order-image Renyi entropy is equivalent to minimizing the order-image IP; if image, minimizing the order-image Renyi entropy is equivalent to maximizing the order-image IP. In practical applications, we often use the order-image IP instead of the order-image Renyi entropy as the criterion function for identification.

Further, if using the image-entropy of the error as the criterion function, we have

image (4.35)

where image. Note that the image-entropy criterion includes the Shannon entropy (image) and order-image information potential (image) as special cases.

The error entropy is a functional of the error distribution. In practice, the error distribution is usually unknown to us, and so is the error entropy. Hence, we have to estimate the error entropy from error samples and use the estimated error entropy (called the empirical error entropy) as a criterion to identify the system parameters. In the following, we present several common approaches to estimating the entropy from sample data.

4.2.1 Common Approaches to Entropy Estimation

A straightforward way to estimate the entropy is to estimate the underlying distribution from the available samples and plug the estimated distribution into the entropy expression to obtain the entropy estimate (the so-called “plug-in approach”) [183]. Several plug-in estimates of the Shannon entropy (extension to other entropy definitions is straightforward) are presented as follows.

4.2.1.1 Integral Estimate

Let image denote the estimated PDF based on the sample image. Then the integral estimate of the entropy is of the form

image (4.36)

where image is a set typically used to exclude the small or tail values of image. The evaluation of Eq. (4.36) requires numerical integration and is not an easy task in general.

4.2.1.2 Resubstitution Estimate

The resubstitution estimate substitutes the estimated PDF into the sample mean approximation of the entropy measure (approximating the expectation value by its sample mean), which is of the form

image (4.37)

This estimation method is considerably simpler than the integral estimate, since it involves no numerical integration.

4.2.1.3 Splitting Data Estimate

Here, we decompose the sample image into two subsamples: image, image, image. Based on the subsample image, we obtain a density estimate image; then, using this density estimate and the second subsample image, we estimate the entropy by

image (4.38)

where image is the indicator function and the set image (image). The splitting data estimate is different from the resubstitution estimate in that it uses different samples to estimate the density and to calculate the sample mean.

4.2.1.4 Cross-validation Estimate

If image denotes a density estimate based on sample image (i.e., leaving image out), then the cross-validation estimate of entropy is

image (4.39)

A key step in plug-in estimation is to estimate the PDF from sample data. In the literature, there are mainly two approaches to estimating the PDF of a random variable from its sample data: parametric and nonparametric. Accordingly, there are also parametric and nonparametric entropy estimators. Parametric density estimation assumes a parametric model of the density and estimates the involved parameters using classical estimation methods such as maximum likelihood (ML) estimation. This approach requires selecting a suitable parametric model of the density, which depends on some prior knowledge. Nonparametric density estimation, however, does not require a parametric model and can estimate the PDF of any distribution.

Histogram density estimation (HDE) and kernel density estimation (KDE) are two popular nonparametric density estimation methods. Here we only discuss the KDE method (also referred to as the Parzen window method), which has been widely used in nonparametric regression and pattern recognition. Given a set of independent and identically distributed (i.i.d.) samples7 image drawn from image, the KDE for image is given by [97]

image (4.40)

where image denotes a kernel function8 with width image, satisfying the following conditions:

image (4.41)

where image is the kernel function with width 1. To make the estimated PDF smooth, the kernel function is usually selected to be a continuous and differentiable (and preferably symmetric and unimodal) function. The most widely used kernel function in KDE is the Gaussian function:

image (4.42)

The kernel width of the Gaussian kernel can be optimized by the ML principle, or selected according to rules-of-thumb, such as Silverman’s rule [97].

With a fixed kernel width image, we have

image (4.43)

where image denotes the convolution operator. Using a suitable annealing rate for the kernel width, the KDE can be asymptotically unbiased and consistent. Specifically, if image and image, then image in probability [98].
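The Gaussian KDE described above can be written compactly as in the sketch below (Python/NumPy). The bandwidth value in the example is an illustrative placeholder; in practice it would come from Silverman's rule or ML selection.

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """Kernel density estimate at point(s) x from i.i.d. samples, Gaussian kernel of width h."""
    x = np.atleast_1d(x)[:, None]            # shape (M, 1): evaluation points
    samples = np.asarray(samples)[None, :]   # shape (1, N): data samples
    z = (x - samples) / h
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return np.mean(k, axis=1) / h            # (1/(N h)) * sum of kernels

rng = np.random.default_rng(4)
e = rng.laplace(scale=0.5, size=500)         # error samples
p_hat = gaussian_kde(np.linspace(-3, 3, 61), e, h=0.2)
```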

In addition to the plug-in methods described previously, there are many other methods for entropy estimation, such as the sample-spacing method and the nearest neighbor method. In the following, we derive the sample-spacing estimate. First, let us express the Shannon entropy as [184]

image (4.44)

where image. Using the slope of the curve image to approximate the derivative, we have

image (4.45)

where image is the order statistics of the sample image, and image is the image-order sample spacing (image). Hence, the sample-spacing estimate of the entropy is

image (4.46)

Adding a correction term to compensate for the asymptotic bias, we get

image (4.47)

where image is the Digamma function (image is the Gamma function).
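A sketch of one common m-spacing entropy estimator in the spirit of Eq. (4.46) is given below (Python/NumPy). The exact normalization varies between references, so the constants here should be taken as one illustrative variant rather than a transcription of the book's expression.

```python
import numpy as np

def spacing_entropy(samples, m=3):
    """m-spacing estimate of Shannon entropy (one common variant): average of
    log((N + 1) * (x_(i+m) - x_(i)) / m) over the order statistics."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    spacings = x[m:] - x[:-m]               # m-spacings of the order statistics
    spacings = np.maximum(spacings, 1e-12)  # guard against zero spacings
    return np.mean(np.log((n + 1) * spacings / m))

rng = np.random.default_rng(5)
h_est = spacing_entropy(rng.normal(size=2000), m=5)  # roughly 0.5*log(2*pi*e) for N(0,1)
```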

4.2.2 Empirical Error Entropies Based on KDE

To calculate the empirical error entropy, we usually adopt the resubstitution estimation method with error PDF estimated by KDE. This approach has some attractive features: (i) it is a nonparametric method, and hence requires no prior knowledge on the error distribution; (ii) it is computationally simple, since no numerical integration is needed; and (iii) if choosing a smooth and differentiable kernel function, the empirical error entropy (as a function of the error sample) is also smooth and differentiable (this is very important for the calculation of the gradient).

Suppose now that a set of error samples image is available. By KDE, the error density can be estimated as

image (4.48)

Then by resubstitution estimation method, we obtain the following empirical error entropies:

1. Empirical Shannon entropy

image (4.49)

2. Empirical order-image Renyi entropy

image (4.50)

where image is the empirical order-image IP, i.e.,

image (4.51)

3. Empirical image-entropy

image (4.52)

It is worth noting that for the quadratic information potential (QIP) (image), if a Gaussian kernel function is chosen, the resubstitution estimate is identical to the integral estimate but with kernel width image instead of image. Specifically, if image is given by Eq. (4.42), we can derive

image (4.53)

This result comes from the fact that the integral of the product of two Gaussian functions is itself a Gaussian function, evaluated at the difference of the arguments and with variance equal to the sum of the variances of the two original Gaussians. From Eq. (4.53), we also see that the QIP can be calculated simply by a double summation over the error samples. Due to this fact, when using the order-image IP, we usually set image.
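Following Eq. (4.53), the empirical QIP with a Gaussian kernel reduces to a double sum over the error samples, as in the sketch below (Python/NumPy). In line with the convolution argument above, the Gaussian evaluated on the pairwise differences is assumed to have an effective width of sqrt(2) times the KDE width.

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def empirical_qip(errors, h):
    """Empirical quadratic information potential:
    (1/N^2) * sum_i sum_j G_{sqrt(2)*h}(e_i - e_j)."""
    e = np.asarray(errors, dtype=float)
    diffs = e[:, None] - e[None, :]                  # pairwise error differences
    return np.mean(gaussian(diffs, np.sqrt(2.0) * h))

def empirical_renyi2_entropy(errors, h):
    """Empirical order-2 Renyi entropy: negative log of the empirical QIP."""
    return -np.log(empirical_qip(errors, h))
```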

In the following, we present some important properties of the empirical error entropy, and our focus is mainly on the order-image Renyi entropy (or order-image IP) [64,67].

Property 1:

image.

Proof:

Let image, where image is an arbitrary constant, image, then we have

image (4.54)

Remark:

Property 1 shows that the empirical error entropy is also shift-invariant. In system identification with MEE criterion, we usually set a bias at the model output to achieve a zero-mean error.

Property 2:

image, where image is the empirical Shannon entropy.

Proof:

image (4.55)

Property 3:

Let’s denote image (image is the kernel width). Then image, image, we have image.

Proof:

image (4.56)

where (a) holds because image, the kernel function image satisfies image.

Property 4:

image, where image is a random variable with PDF image (image denotes the convolution operator).

Proof:

According to the theory of KDE [97,98], we have

image (4.57)

Hence, image. Since the PDF of the sum of two independent random variables is equal to the convolution of their individual PDFs, image can be considered as the sum of the error image and another random variable that is independent of the error and has PDF image. And since the entropy of the sum of two independent random variables is no less than the entropy of each individual variable, we have image.

Remark:

Property 4 implies that minimizing the empirical error entropy minimizes an upper bound on the actual error entropy. In general, an identification algorithm seeks extrema (either minima or maxima) of the cost function, independently of its actual value, so the dependence on the estimation error is decreased.

Property 5:

If image, then image, with equality if image.

Proof:

Consider the case where image (the case image is similar); we have

image (4.58)

If image, image, i.e., image, the equality will hold.

Remark:

Property 5 suggests that if the kernel function reaches its maximum value at zero, then the empirical error entropy reaches its minimum when all the error samples have the same value, in which case the uncertainty of the error is minimal.

Property 6:

If the kernel function image is continuously differentiable, symmetric, and unimodal, then the empirical error entropy is smooth at the global minimum of Property 5; that is, image has zero gradient and a positive semidefinite Hessian matrix at image.

Proof:

image (4.59)

If image, we can calculate

image (4.60)

Hence, the gradient vector image, and the Hessian matrix is

image (4.61)

whose eigenvalue–eigenvector pairs are

image (4.62)

where image. According to the assumptions we have image, therefore this Hessian matrix is positive semidefinite.

Property 7:

With a Gaussian kernel, the empirical QIP image can be expressed as the squared norm of the mean vector of the data in kernel space.

Proof:

The Gaussian kernel is a Mercer kernel, and can be written as an inner product in the kernel space (RKHS):

image (4.63)

where image defines the nonlinear mapping between input space and kernel space image. Hence the empirical QIP can also be expressed in terms of an inner product in kernel space:

image (4.64)

where image is the mean vector of the data in kernel space.

In addition to the previous properties, it is pointed out in Ref. [102] that the empirical error entropy has a dilation feature: as the kernel width image increases, the performance surface (the surface of the empirical error entropy in parameter space) becomes flatter and flatter, so that the local extrema gradually shrink and may even disappear. Figure 4.3 illustrates the contour plots of a two-dimensional performance surface of ADALINE parameter identification based on the order-image IP criterion. It is clear that the IPs corresponding to different image values all exhibit this dilation feature. The dilation feature implies that one can obtain a desired performance surface by proper selection of the kernel width.

image

Figure 4.3 Contour plots of a two-dimensional performance surface of ADALINE parameter identification based on the order-image IP criterion (adopted from Ref. [102]).

4.3 Identification Algorithms Under MEE Criterion

4.3.1 Nonparametric Information Gradient Algorithms

In general, information gradient (IG) algorithms refer to gradient-based identification algorithms under the MEE criterion (i.e., minimizing the empirical error entropy), including the batch information gradient (BIG) algorithm, the sliding information gradient algorithm, the forgetting recursive information gradient (FRIG) algorithm, and the stochastic information gradient (SIG) algorithm. If the empirical error entropy is estimated by nonparametric approaches (such as KDE), they are called nonparametric IG algorithms. In the following, we present several nonparametric IG algorithms based on the image-entropy and KDE.

4.3.1.1 BIG Algorithm

With the empirical image-entropy as the criterion function, the BIG identification algorithm is derived as follows:

image (4.65)

where image is the step-size, image is the number of training data, and image and image denote, respectively, the first-order derivatives of the functions image and image. The gradient image of the model output with respect to the parameter image depends on the specific model structure. For example, if the model is an ADALINE or FIR filter, we have image. The algorithm is called the BIG algorithm because the empirical error entropy is calculated based on all the training data.

Given a specific image function, we obtain a specific algorithm:

1. image (Shannon entropy)

image (4.66)

2. image (Corresponding to order-image information potential)

image (4.67)

In the above algorithms, if the kernel function is the Gaussian function, the derivative image will be

image (4.68)

The step-size image is a crucial parameter that controls the compromise between convergence speed and misadjustment, and it has a significant influence on the learning (identification) performance. In practice, the step-size should be selected to guarantee the stability and an adequate convergence rate of the algorithm. To further improve performance, the step-size image can be designed to be variable. In Ref. [105], a self-adjusting step-size (SAS) was proposed to improve the performance under the QIP criterion, i.e.,

image (4.69)

where image and image is a symmetric and unimodal kernel function (hence image).

The kernel width image is another important parameter that controls the smoothness of the performance surface. In general, the kernel width can be set manually or determined in advance by Silverman's rule. To make the algorithm converge to the global solution, one can start with a large kernel width and decrease it slowly during the course of adaptation, just as in stochastic annealing. In Ref. [185], an adaptive kernel width was proposed to improve the performance.

The BIG algorithm needs to acquire all the training data in advance, and hence it is not suitable for online identification. To address this issue, one can use the sliding information gradient algorithm.

4.3.1.2 Sliding Information Gradient Algorithm

The sliding information gradient algorithm utilizes a set of recent error samples to estimate the error entropy. Specifically, the error samples used to calculate the error entropy at time image are as follows9:

image (4.70)

where image denotes the sliding data length (image). Then the error entropy at time image can be estimated as

image (4.71)

And hence, the sliding information gradient algorithm can be derived as

image (4.72)

For the case image (corresponding to QIP), the above algorithm becomes

image (4.73)

4.3.1.3 FRIG Algorithm

In the sliding information gradient algorithm, the error entropy can also be estimated by a forgetting recursive method [64]. Assume that at time image the estimated error PDF is image; then the error PDF at time image can be estimated as

image (4.74)

where image is the forgetting factor. Therefore, we can calculate the empirical error entropy as follows:

image (4.75)

If image (image), there exists a recursive form [186]:

image (4.76)

where image is the QIP at time image. Thus, we have the following algorithm:

image (4.77)

namely the FRIG algorithm. Compared with the sliding information gradient algorithm, the FRIG algorithm is computationally simpler and is more suitable for nonstationary system identification.

4.3.1.4 SIG Algorithm

In the empirical error entropy of Eq. (4.71), dropping the outer averaging operator (image) yields the instantaneous error entropy at time image:

image (4.78)

The instantaneous error entropy is similar to the instantaneous error cost image, as both are obtained by removing the expectation operator (or averaging operator) from the original criterion function. The computational cost of the instantaneous error entropy (4.78) is image of that of the empirical error entropy of Eq. (4.71). The gradient identification algorithm based on the instantaneous error entropy is called the SIG algorithm, which can be derived as

image (4.79)

If image, we obtain the SIG algorithm under Shannon entropy criterion:

image (4.80)

If image, we get the SIG algorithm under QIP criterion:

image (4.81)

The SIG algorithm (4.81) is actually the FRIG algorithm with image.
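For an FIR/ADALINE model with a Gaussian kernel, an SIG-style update under the QIP criterion can be sketched as follows (Python/NumPy). The sign and the constant factors (which can be folded into the step-size) follow the usual stochastic information gradient derivation; treat the exact scaling as an assumption rather than a transcription of Eq. (4.81).

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def sig_qip_update(w, x_buf, d_buf, eta=0.1, sigma=1.0):
    """One SIG update under the QIP criterion for an FIR model.
    x_buf: the L most recent input vectors (current input last).
    d_buf: the corresponding measured outputs.
    """
    X = np.asarray(x_buf, dtype=float)
    d = np.asarray(d_buf, dtype=float)
    e = d - X @ w                                # errors over the sliding window
    e_i, x_i = e[-1], X[-1]                      # current error and input
    grad = np.zeros_like(w)
    for e_j, x_j in zip(e[:-1], X[:-1]):         # pairwise terms with the past samples
        grad += gaussian(e_i - e_j, sigma) * (e_i - e_j) * (x_i - x_j)
    # ascend the information potential (descend the entropy)
    return w + (eta / (len(e) * sigma ** 2)) * grad
```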

4.3.2 Parametric IG Algorithms

In the IG algorithms described above, the error distribution is estimated by the nonparametric KDE approach. With this approach, one is often confronted with the problem of how to choose a suitable kernel width. An inappropriate choice of width will significantly deteriorate the performance of the algorithm. Although the effects of the kernel width on the shape of the performance surface and on the eigenvalues of the Hessian at and around the optimal solution have been carefully investigated [102], the choice of the kernel width is still a difficult task. Thus, a parameterized density estimate, which does not involve the choice of a kernel width, can sometimes be more practical. In particular, if some prior knowledge about the data distribution is available, parametric density estimation may achieve better accuracy than nonparametric alternatives.

Next, we discuss the parametric IG algorithms that adopt parametric approaches to estimate the error distribution. To simplify the discussion, we only present the parametric SIG algorithm.

In general, the SIG algorithm can be expressed as

image (4.82)

where image is the value of the error PDF at image estimated based on the error samples image. By KDE approach, we have

image (4.83)

Now we use a parametric approach to estimate image. Let’s consider the exponential (maximum entropy) PDF form:

image (4.84)

where the parameters image (image) can be estimated by some classical estimation methods like the ML estimation. After obtaining the estimated parameter values image, one can calculate image as

image (4.85)

Substituting Eq. (4.85) into Eq. (4.82), we obtain the following parametric SIG algorithm:

image (4.86)

If adopting Shannon entropy (image), the algorithm becomes

image (4.87)

The selection of the PDF form is very important. Some typical PDF forms are as follows [187–190]:

1. image

2. image

3. Generalized Gaussian density (GGD) model [191]:

image (4.88)

where image, image is the standard deviation, and image is the Gamma function:

image (4.89)

In the following, we present the SIG algorithm based on GGD model.

The GGD model has three parameters: the location (mean) parameter image, the shape parameter image, and the dispersion parameter image. It has a simple yet flexible functional form, can approximate a large number of statistical distributions, and is widely used in image coding, speech recognition, blind source separation (BSS), etc. The GGD densities include the Gaussian (image) and Laplace (image) distributions as special cases. Figure 4.4 shows the GGD distributions for several shape parameters with zero mean and deviation image. It is evident that smaller values of the shape parameter correspond to heavier tails and therefore to sharper distributions. In the limiting cases, as image, the GGD approaches the uniform distribution, whereas as image, it approaches an impulse function (image-distribution).

image

Figure 4.4 GGD distributions for different shape parameters image (image, image).

Utilizing the GGD model to estimate the error distribution amounts to estimating the parameters image, image, and image from the error samples. There are many methods for estimating the GGD parameters [192]. Here, we only discuss the moment matching method (method of moments).

The order-image absolute central moment of GGD distribution can be calculated as

image (4.90)

Hence we have

image (4.91)

The right-hand side of Eq. (4.91) is a function of image, denoted by image. Thus the parameter image can be expressed as

image (4.92)

where image is the inverse of function image. Figure 4.5 shows the curves of the inverse function image when image.

image

Figure 4.5 Inverse functions image (image).

According to Eq. (4.92), based on the moment matching method one can estimate the parameters image, image, and image as follows:

image (4.93)

where the subscript image represents that the values are estimated based on the error samples image. Thus image will be

image (4.94)

Substituting Eq. (4.94) into Eq. (4.82) and letting image (Shannon entropy), we obtain the SIG algorithm based on GGD model:

image (4.95)

where image, image, and image are calculated by Eq. (4.93).
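A sketch of the moment-matching step for the GGD model is given below (Python with NumPy/SciPy). It matches the ratio of the first absolute central moment squared to the variance against the corresponding function of the shape parameter and inverts that function numerically over a grid, which is one way to realize the look-up/interpolation mentioned later in the text; the particular moment orders, the grid, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def _ratio(alpha):
    """M(alpha) = Gamma(2/alpha)^2 / (Gamma(1/alpha) * Gamma(3/alpha)):
    (mean absolute deviation)^2 / variance of a GGD with shape alpha."""
    return np.exp(2.0 * gammaln(2.0 / alpha) - gammaln(1.0 / alpha) - gammaln(3.0 / alpha))

def ggd_moment_match(errors, alpha_grid=np.linspace(0.2, 4.0, 2000)):
    """Estimate GGD mean, shape, and dispersion from samples by moment matching."""
    e = np.asarray(errors, dtype=float)
    mu = np.mean(e)
    m1 = np.mean(np.abs(e - mu))           # first absolute central moment
    m2 = np.var(e)                         # second central moment (variance)
    ratios = _ratio(alpha_grid)
    alpha = alpha_grid[np.argmin(np.abs(ratios - m1 ** 2 / m2))]  # invert M by grid search
    beta = np.sqrt(m2 * np.exp(gammaln(1.0 / alpha) - gammaln(3.0 / alpha)))  # dispersion
    return mu, alpha, beta

rng = np.random.default_rng(6)
mu_hat, alpha_hat, beta_hat = ggd_moment_match(rng.laplace(size=5000))  # alpha_hat near 1
```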

To make the distinction clear, we denote by “SIG-kernel” and “SIG-GGD” the SIG algorithms based on the kernel approach and on the GGD density, respectively. Compared with the SIG-kernel algorithm, the SIG-GGD only needs to estimate the three parameters of the GGD density, without resorting to the choice of a kernel width or the evaluation of a kernel function.

Comparing Eqs. (4.95) and (4.28), we find that when image, the SIG-GGD algorithm can be considered an LMP algorithm with adaptive order image and variable step-size image. In fact, under certain conditions, the SIG-GGD algorithm will converge to an LMP algorithm with fixed order and step-size. Consider FIR system identification, in which the plant and the adaptive model are both FIR filters of the same order, and the additive noise image is zero-mean, ergodic, and stationary. When the model weight vector image converges to the neighborhood of the plant weight vector image, we have

image (4.96)

where image is the weight error vector. In this case, the estimated values of the parameters in SIG-GGD algorithm will be

image (4.97)

Since the noise image is an ergodic and stationary process, if image is large enough, the estimated values of the three parameters will tend to constants, and consequently the SIG-GGD algorithm will converge to an LMP algorithm with fixed order and step-size. Clearly, if the noise image is Gaussian distributed, the SIG-GGD algorithm will converge to the LMS algorithm (image), and if image is Laplacian distributed, the algorithm will converge to the LAD algorithm (image). In Ref. [28], it has been shown that under slow adaptation, the LMS and LAD algorithms are, respectively, the optimum algorithms for Gaussian and Laplacian interfering noises. We may therefore conclude that the SIG-GGD algorithm has the ability to adjust its parameters so as to switch automatically to a certain optimum algorithm.

Two points deserve special mention concerning the implementation of the SIG-GGD algorithm: (i) since there is no analytical expression, the calculation of the inverse function image requires a look-up table or some interpolation method; and (ii) in order to avoid too large a gradient and to ensure the stability of the algorithm, it is necessary to set an upper bound on the parameter image.

4.3.3 Fixed-Point Minimum Error Entropy Algorithm

Given a mapping image, the fixed points are the solutions of the equation image, image. Fixed-point (FP) iteration is a numerical method for computing fixed points of iterated functions. Given an initial point image, the FP iteration algorithm is

image (4.98)

where image is the iteration index. If image is a real-valued function defined on the real line and is Lipschitz continuous with a Lipschitz constant smaller than 1.0, then image has precisely one fixed point, and the FP iteration converges toward that fixed point for any initial guess image. This result can be generalized to any complete metric space.

The FP algorithm can be applied in parameter identification under MEE criterion [64]. Let’s consider the QIP criterion:

image (4.99)

under which the optimal parameter (weight vector) image satisfies

image (4.100)

If the model is an FIR filter, and the kernel function is the Gaussian function, we have

image (4.101)

One can write Eq. (4.101) in an FP iterative form (utilizing image):

image (4.102)

Then we have the following FP algorithm:

image (4.103)

where

image (4.104)

The above algorithm is called the fixed-point minimum error entropy (FP-MEE) algorithm. The FP-MEE algorithm can also be implemented by using the forgetting recursive form [194], i.e.,

image (4.105)

where

image (4.106)

This is the recursive fixed-point minimum error entropy (RFP-MEE) algorithm.
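For an FIR model under the QIP criterion with a Gaussian kernel, the fixed-point update can be sketched as below (Python/NumPy). The pairwise-difference form of the two matrices follows the usual FP-MEE derivation; the regularization of the matrix inverse, the iteration count, and all names are illustrative assumptions rather than a transcription of Eqs. (4.103)–(4.106).

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def fp_mee(X, d, sigma=1.0, iters=20, reg=1e-6):
    """Fixed-point MEE (QIP criterion, Gaussian kernel) for an FIR model d ~ X @ w.
    Each iteration solves a weighted least-squares problem built on pairwise
    differences of the data, weighted by the kernel of the error differences."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        e = d - X @ w
        dX = X[:, None, :] - X[None, :, :]            # pairwise input differences (n, n, m)
        dd = d[:, None] - d[None, :]                  # pairwise output differences (n, n)
        g = gaussian(e[:, None] - e[None, :], sigma)  # kernel weights on error differences
        R = np.einsum('ij,ijk,ijl->kl', g, dX, dX)    # weighted correlation-type matrix
        p = np.einsum('ij,ij,ijk->k', g, dd, dX)      # weighted cross-correlation vector
        w = np.linalg.solve(R + reg * np.eye(m), p)   # fixed-point update
    return w

# Usage: identify a 3-tap FIR plant from a batch of noisy data
rng = np.random.default_rng(7)
X_in = rng.normal(size=(200, 3))
d_obs = X_in @ np.array([0.5, -0.3, 0.1]) + 0.05 * rng.laplace(size=200)
w_hat = fp_mee(X_in, d_obs, sigma=0.5)
```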

In addition to the parameter search algorithms described above, there are many other parameter search algorithms to minimize the error entropy. Several advanced parameter search algorithms are presented in Ref. [104], including the conjugate gradient (CG) algorithm, Levenberg–Marquardt (LM) algorithm, quasi-Newton method, and others.

4.3.4 Kernel Minimum Error Entropy Algorithm

System identification algorithms under the MEE criterion can also be derived in kernel space. Existing KAF algorithms are mainly based on the MSE (or least-squares) criterion. MSE is not always a suitable criterion, especially in nonlinear and non-Gaussian situations. Hence, it is attractive to develop new KAF algorithms based on non-MSE (nonquadratic) criteria. In Ref. [139], a KAF algorithm under the maximum correntropy criterion (MCC), namely the kernel maximum correntropy (KMC) algorithm, was developed. Similar to the KLMS, the KMC is a stochastic gradient algorithm in RKHS. If the kernel function used in the correntropy, denoted by image, is the Gaussian kernel, the KMC algorithm can be derived as [139]

image (4.107)

where image denotes the estimated weight vector at iteration image in a high-dimensional feature space image induced by Mercer kernel image and image is a feature vector obtained by transforming the input vector image into the feature space through a nonlinear mapping image. The KMC algorithm can be regarded as a KLMS algorithm with variable step-size image.

In the following, we will derive a KAF algorithm under the MEE criterion. Since the image-entropy is a very general and flexible entropy definition, we use the image-entropy of the error as the adaptation criterion. In addition, for simplicity we adopt the instantaneous error entropy (4.78) as the cost function. Then, one can easily derive the following kernel minimum error entropy (KMEE) algorithm:

image (4.108)

The KMEE algorithm (4.108) is actually the SIG algorithm in kernel space. By selecting a certain image function, we can obtain a specific KMEE algorithm. For example, if setting image (i.e., image), we get the KMEE under Shannon entropy criterion:

image (4.109)

The weight update equation of Eq. (4.108) can be written in a compact form:

image (4.110)

where image, image, and image is a vector-valued function of image, expressed as

image (4.111)

The KMEE algorithm is similar to the KAPA [178], except that the error vector image in KMEE is nonlinearly transformed by the function image. The learning rule of the KMEE in the original input space can be written as (image)

image (4.112)

where image.

The model learned by KMEE has the same structure as that learned by KLMS and can be represented as a linear combination of kernels centered at each data point:

image (4.113)

where the coefficients (at iteration image) are updated as follows:

image (4.114)

The pseudocode for KMEE is summarized in Table 4.2.

Table 4.2

The KMEE Algorithm

Image

4.3.5 Simulation Examples

In the following, we present several simulation examples to demonstrate the performance (accuracy, robustness, convergence rate, etc.) of the identification algorithms under MEE criterion.

Example 4.1 [102]

Assume that both the unknown system and the model are two-dimensional ADALINEs, i.e.,

image (4.115)

where the unknown weight vector is image, and image is independent, zero-mean Gaussian noise. The goal is to identify the model parameters image under noises of different signal-to-noise ratios (SNRs). For each noise energy, 100 independent Monte Carlo simulations are performed with image (image) randomly chosen training data. In the simulation, the BIG algorithm under the QIP criterion is used, and the kernel function image is the Gaussian function with bandwidth image. Figure 4.6 shows the average distance between the actual and estimated optimal weight vectors. For comparison purposes, the figure also includes the identification results (obtained by solving the Wiener–Hopf equations) under the MSE criterion. Simulation results indicate that, when the SNR is higher (image), the MEE criterion achieves much better accuracy than the MSE criterion (or requires fewer training data to achieve the same accuracy).

image

Figure 4.6 Comparison of the performance of MEE against MSE (adopted from Ref. [102]).

Example 4.2 [100]

Identify the following nonlinear dynamic system:

image (4.116)

where image and image are the state variables and image is the input signal. The identification model is a time delay neural network (TDNN), whose structure is an MLP with multiple inputs, a single hidden layer, and a single output. The input vector of the network contains the current and past values of the input and output of the nonlinear system; that is, the training data can be expressed as

image (4.117)

In this example, image and image are set as image. The number of hidden units is set to 7, and the symmetric sigmoid function is selected as the activation function. In addition, the number of training data is image. We continue to compare the performance of MEE (using the BIG algorithm under the QIP criterion) with that of MSE. For each criterion, the TDNN is trained starting from 50 different initial weight vectors, and the best solution (the one with the highest QIP or lowest MSE) among the 50 candidates is selected to test the performance.10 Figure 4.7 illustrates the probability densities of the error between the actual system output and the TDNN output with 10,000 testing data. One can see that the MEE criterion achieves a higher peak around zero error. Figure 4.8 shows the probability densities of the actual system output (desired output) and the model output. Evidently, the output of the model trained under the MEE criterion matches the desired output better.

image

Figure 4.7 Probability densities of the error between system actual output and model output (adopted from Ref. [100]).

image

Figure 4.8 Probability densities of system actual output and model output (adopted from Ref. [100]).

Example 4.3 [190]

Compare the performances of SIG-kernel, SIG-GGD, and LMP family algorithms (LAD, LMS, LMF, etc.). Assume that both the unknown system and the model are FIR filters:

image (4.118)

where image and image denote the transfer functions of the system and the model, respectively. The initial weight vector of the model is set to image, and the input signal is white Gaussian noise with zero mean and unit power (variance). The kernel function in the SIG-kernel algorithm is the Gaussian kernel with bandwidth determined by Silverman's rule. In the SIG-GGD algorithm, we set image, and to avoid too large a gradient, we set the upper bound of image at 4.0.

In the simulation, we consider four noise distributions (Laplace, Gaussian, Uniform, MixNorm), as shown in Figure 4.9. For each noise distribution, the average convergence curves, over 100 independent Monte Carlo simulations, are illustrated in Figure 4.10, where WEP denotes the weight error power, defined as

image (4.119)

where image is the weight error vector (the difference between the desired and estimated weight vectors) and image is the weight error norm. Table 4.3 lists the average identification results (mean ± deviation) of image (image). Further, the average evolution curves of image in SIG-GGD are shown in Figure 4.11.

image

Figure 4.9 Four PDFs of the additive noise (adopted from Ref. [190]).

image

Figure 4.10 Average convergence curves of several algorithms for different noise distributions: (A) Laplace, (B) Gaussian, (C) Uniform, and (D) MixNorm (adopted from Ref. [190]).

Table 4.3

Average Identification Results of image Over 100 Monte Carlo Simulations

Image

(adopted from Ref. [190])

image

Figure 4.11 Evolution curves of image over 100 Monte Carlo runs: (A) Laplace, (B) Gaussian, and (C) Uniform (adopted from Ref. [190]).

From the simulation results, we have the following observations:

i. The performances of LAD, LMS, and LMF depend crucially on the distribution of the disturbance noise. These algorithms may achieve the smallest misadjustment for a certain noise distribution (e.g., the LMF performs best in uniform noise); however, for other noise distributions, their performances may deteriorate dramatically (e.g., the LMF performs worst in Laplace noise).

ii. Both SIG-GGD and SIG-kernel are robust to the noise distribution. In the case of symmetric and unimodal noises (e.g., Laplace, Gaussian, Uniform), SIG-GGD may achieve a smaller misadjustment than SIG-kernel. Although in the case of nonsymmetric and nonunimodal noises (e.g., MixNorm) SIG-GGD may not be as good as SIG-kernel, it is still better than most of the LMP algorithms.

iii. Near convergence, the parameter image in SIG-GGD converges approximately to 1, 2, and 4 (note that image is artificially restricted to image) when disturbed by Laplace, Gaussian, and Uniform noises, respectively. This confirms that SIG-GGD has the ability to adjust its parameters so as to switch to a certain optimum algorithm.

Example 4.4 [194]

Apply the RFP-MEE algorithm to identify the following FIR filter:

image (4.120)

The adaptive model is also an FIR filter of equal order. The input to both the plant and the adaptive model is white Gaussian noise with unit power. The observation noise is white Gaussian with zero mean and variance image. The main objective is to investigate the effect of the forgetting factor on the convergence speed and the convergence accuracy (WEP after convergence) of the RFP-MEE algorithm. Figure 4.12 shows the convergence curves of RFP-MEE with different forgetting factors. One can see that smaller forgetting factors result in faster convergence and larger steady-state WEP. This result conforms to the well-known general behavior of the forgetting factor in recursive estimation. Thus, selecting a proper forgetting factor for RFP-MEE must take into account the intrinsic trade-off between convergence speed and identification accuracy.

image

Figure 4.12 Convergence curves of RFP-MEE algorithm with different forgetting factors (adopted from Ref. [194]).

Example 4.5

This example aims to demonstrate the performance of KMEE (with the Shannon entropy or QIP criterion). For comparison purposes, we also show the performance of several other KAF algorithms: KLMS, KMC, and KAPA. Let's consider nonlinear system identification, where the nonlinear system is as follows [195]:

image (4.121)

The noise image is of symmetric image-stable (image) distribution with characteristic function

image (4.122)

where image, image. When image, the distribution is a zero-mean Gaussian distribution with variance 0.01; while when image, the distribution corresponds to an impulsive noise with infinite variance.

Figure 4.13 illustrates the average learning curves (over 200 Monte Carlo runs) for different image values, and Table 4.4 lists the testing MSE at the final iteration. In the simulation, 1000 samples are used for training and another 100 clean samples are used for testing (the filter is fixed during the testing phase). Further, all the kernels (the kernel of the RKHS, the kernel of the correntropy, and the kernel for density estimation) are selected to be Gaussian. The kernel parameter for the RKHS is set at image, the kernel size for the correntropy is image, and the kernel bandwidth for density estimation is image. The sliding data lengths for KMEE and KAPA are both set at image. The step-sizes for KLMS, KMC, KAPA, KMEE (Shannon), and KMEE (QIP) are, respectively, set at 0.8, 1.0, 0.05, 1.0, and 2.0. These parameters are experimentally selected to achieve the desired performance. From Figure 4.13 and Table 4.4, one can see that the KMEE algorithms outperform all the other algorithms, except that for small image the KMC algorithm may achieve a relatively smaller deviation in testing MSE. Simulation results also show that the performances of KMEE (Shannon) and KMEE (QIP) are very close.

image

Figure 4.13 Average learning curves for different image values: (A) image, (B) image, (C) image, (D) image.

Table 4.4

Testing MSE at Final Iteration for Different image Values

Image

4.4 Convergence Analysis

Next, we analyze the convergence properties of the parameter identification under MEE criterion. For simplicity, we consider only the ADALINE model (which includes the FIR filter as a special case). The convergence analysis of error entropy minimization algorithms is, in general, rather complicated. This is mostly because

1. the objective function (the empirical error entropy) is not only related to the current error but also concerned with the past error values;

2. the entropy function and the kernel function are nonlinear functions;

3. the shape of performance surface is very complex (nonquadratic and nonconvex).

There are two ways for the convergence analysis of such algorithms: (i) using the Taylor series expansion near the optimal solution to obtain an approximate linearization algorithm, and then performing the convergence analysis for the linearization algorithm and (ii) applying the energy conservation relation to analyze the convergence behavior [106]. The first approach is relatively simple, but only the approximate analysis results near the optimal solution can be achieved. With the second approach it is possible to acquire rigorous analysis results of the convergence, but usually more assumptions are needed. In the following, we first briefly introduce the first analysis approach, and then focus on the second approach, performing the mean square convergence analysis based on the energy conservation relation. The following analysis is mainly aimed at the nonparametric IG algorithms with KDE.

4.4.1 Convergence Analysis Based on Approximate Linearization

In Ref. [102], an approximate linearization approach has been used to analyze the convergence of the gradient-based algorithm under order-image IP criterion. Consider the ADALINE model:

image (4.123)

where image is the image-dimensional input vector, image is the weight vector. The gradient based identification algorithm under order-image (image) information potential criterion can be expressed as

image (4.124)

where image denotes the gradient of the empirical order-image IP with respect to the weight vector, which can be calculated as

image (4.125)

As the kernel function image satisfies image, we have

image (4.126)

So, one can rewrite Eq. (4.125) as

image (4.127)

where image.

If the weight vector image lies in the neighborhood of the optimal solution image, one can obtain a linear approximation of the gradient image using the first-order Taylor expansion:

image (4.128)

where image is the Hessian matrix, i.e.,

image (4.129)

Substituting Eq. (4.128) into Eq. (4.124), and subtracting image from both sides, one obtains

image (4.130)

where image is the weight error vector. The convergence analysis of the above linear recursive algorithm is very simple. Actually, one can just borrow the well-known convergence analysis results from the LMS convergence theory [18–20]. Assume that the Hessian matrix image is a normal matrix and can be decomposed into the following normal form:

image (4.131)

where image is an image orthogonal matrix, image, image is the eigenvalue of image. Then, the linear recursion (4.130) can be expressed as

image (4.132)

Clearly, if the following conditions are satisfied, the weight error vector image will converge to the zero vector (or equivalently, the weight vector image will converge to the optimal solution):

image (4.133)

Thus, a sufficient condition that ensures the convergence of the algorithm is

image (4.134)

Further, the approximate time constant corresponding to image will be

image (4.135)

In Ref. [102], it has also been proved that if the kernel width image increases, the absolute values of the eigenvalues will decrease, and the time constants will increase, that is, the convergence speed of the algorithm will become slower.
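To make these conditions concrete, the sketch below assumes (as a hedged illustration, not a result taken from Ref. [102]) that the linearized recursion has the form w(k+1) = (I + eta*H) w(k) for the weight error vector, with H a negative-definite Hessian at the maximum of the information potential. It checks the per-mode condition |1 + eta*lambda_i| < 1 and reports the approximate time constants 1/(eta*|lambda_i|); the Hessian used is purely hypothetical.

```python
import numpy as np

def linearized_convergence_report(H, eta):
    """Check convergence of w(k+1) = (I + eta*H) w(k) for a symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)               # real eigenvalues (H symmetric)
    modes = 1.0 + eta * eigvals                   # per-mode contraction factors
    converges = np.all(np.abs(modes) < 1.0)
    # for small eta*|lambda|, each mode decays with time constant about 1/(eta*|lambda|)
    time_constants = 1.0 / (eta * np.abs(eigvals))
    return converges, eigvals, time_constants

# hypothetical negative-definite Hessian at the optimum (for illustration only)
H = -np.array([[2.0, 0.3],
               [0.3, 0.5]])
for eta in (0.1, 0.5, 1.5):
    ok, lam, tau = linearized_convergence_report(H, eta)
    print(f"eta={eta}: converges={ok}, eigenvalues={lam}, time constants={tau}")
```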

4.4.2 Energy Conservation Relation

Here, the energy conservation relation is not the well-known physical law of conservation of energy, but refers to a certain identity that holds among the WEP, the a priori error, and the a posteriori error in adaptive filtering algorithms (such as the LMS algorithm). This fundamental relation can be used to analyze the mean square convergence behavior of an adaptive filtering algorithm [196].

Given an adaptive filtering algorithm with error nonlinearity [29]:

image (4.136)

where image is the error nonlinearity function (image corresponds to the LMS algorithm), the following energy conservation relation holds:

image (4.137)

where image is the WEP at instant image (or iteration image), image and image are, respectively, the a priori error and a posteriori error:

image (4.138)

One can show that the IG algorithms (assuming ADALINE model) also satisfy the energy conservation relation similar to Eq. (4.137) [106]. Let us consider the sliding information gradient algorithm (with image entropy criterion):

image (4.139)

where image is the empirical error entropy at iteration image:

image (4.140)

where image, image is the image th error sample used to calculate the empirical error entropy at iteration image. Given the image-function and the kernel function image, the empirical error entropy image will be a function of the error vector image, which can be written as

image (4.141)

This function will be continuously differentiable with respect to image, provided that both functions image and image are continuously differentiable.

In ADALINE system identification, the error image will be

image (4.142)

where image is the image th measurement at iteration image, image is the image th model output at iteration image, image is the image th input vector11 at iteration image, and image is the ADALINE weight vector. One can write Eq. (4.142) in the form of block data, i.e.,

image (4.143)

where image, image, and image (image input matrix).

The block data image can be constructed in various manners. Two typical examples are

1. One time shift:

image (4.144)

2. image-time shift:

image (4.145)

Combining Eqs. (4.139), (4.141), and (4.143), we can derive (assuming ADALINE model)

image (4.146)

where image, in which

image (4.147)

With QIP criterion (image), the function image can be calculated as (assuming image is a Gaussian kernel):

image (4.148)
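As a concrete reference point, the following sketch implements one sliding-window update of an information-gradient algorithm for an ADALINE model under the QIP criterion with a Gaussian kernel, obtained by differentiating the double-sum kernel estimate of the QIP with respect to the weights. The window size, kernel width, step-size, and toy data are placeholders; the sketch only illustrates the structure of an update of the form (4.146), not the exact algorithm analyzed here.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def sig_qip_update(w, X, d, eta, sigma):
    """One sliding-window step ascending the empirical QIP
    V = (1/L^2) * sum_i sum_j G_sigma(e_i - e_j), with ADALINE errors e = d - X w."""
    L = len(d)
    e = d - X @ w                                  # errors over the window
    de = e[:, None] - e[None, :]                   # pairwise differences e_i - e_j
    dX = X[:, None, :] - X[None, :, :]             # pairwise differences x_i - x_j
    K = gaussian_kernel(de, sigma)
    # gradient of the empirical QIP with respect to the weight vector
    grad = ((K * de)[:, :, None] * dX).sum(axis=(0, 1)) / (L**2 * sigma**2)
    return w + eta * grad                          # gradient ascent on the QIP

# toy usage with placeholder data
rng = np.random.default_rng(0)
w_true = np.array([0.8, -0.4])
w = np.zeros(2)
for k in range(500):
    X = rng.standard_normal((10, 2))               # sliding window of length 10
    d = X @ w_true + 0.05 * rng.standard_normal(10)
    w = sig_qip_update(w, X, d, eta=0.5, sigma=1.0)
print("estimated weights:", w)
```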

The algorithm in Eq. (4.146) is very similar in form to the adaptive filtering algorithm with error nonlinearity expressed in Eq. (4.136). In fact, Eq. (4.146) can be regarded, to some extent, as a “block” version of Eq. (4.136). Thus, one can study the mean square convergence behavior of the algorithm (4.146) by an approach similar to that used in the mean square analysis of the algorithm (4.136). It should also be noted that the objective function behind algorithm (4.146) is not limited to the error entropy. Actually, the cost function image can be extended to any function of image, including the simple block mean square error (BMSE) criterion given by [197]

image (4.149)

We now derive the energy conservation relation for the algorithm (4.146). Assume that the unknown system and the adaptive model are both ADALINE structures with the same dimension of weights.12 Let the measured output (in the form of block data) be

image (4.150)

where image is the weight vector of unknown system and image is the noise vector. In this case, the error vector image can be expressed as

image (4.151)

where image is the weight error vector. In addition, we define the a priori and a posteriori error vectors image and image:

image (4.152)

Clearly, image and image have the following relationship:

image (4.153)

Combining Eqs. (4.146) and (4.153) yields

image (4.154)

where image is an image symmetric matrix with elements image.

Assume that the matrix image is invertible (i.e., image). Then we have

image (4.155)

And hence

image (4.156)

Both sides of Eq. (4.156) should have the same energy, i.e.,

image (4.157)

From Eq. (4.157), after some simple manipulations, we obtain the following energy conservation relation:

image (4.158)

where image, image, image. Further, in order to analyze the mean square convergence performance of the algorithm, we take the expectations of both sides of (4.158) and write

image (4.159)

The energy conservation relation (4.159) shows how the energies (powers) of the error quantities evolve in time, which is exact for any adaptive algorithm described in Eq. (4.146), and is derived without any approximation and assumption (except for the condition that image is invertible). One can also observe that Eq. (4.159) is a generalization of Eq. (4.137) in the sense that the a priori and a posteriori error quantities are extended to vector case.

4.4.3 Mean Square Convergence Analysis Based on Energy Conservation Relation

The energy conservation relation (4.159) characterizes the evolution behavior of the weight error power (WEP). Substituting image into (4.159), we obtain

image (4.160)

To evaluate the expectations image and image, some assumptions are given below [106]:

• A1: The noise image is independent, identically distributed, and independent of the input image;

• A2: The a priori error vector image is jointly Gaussian distributed;

• A3: The input vectors image are zero-mean independent, identically distributed;

• A4: image, image and image are independent.

Remark:

Assumptions A1 and A2 are popular and have been widely used in the convergence analysis of many adaptive filtering algorithms [196]. As pointed out in [29], assumption A2 is reasonable for longer weight vectors by central limit theorem arguments. Assumption A3 restricts the input sequence image to white regression data, which is also common practice in the literature (e.g., as in Refs. [28,198,199]). Assumption A4 is somewhat similar to the uncorrelatedness assumption in Ref. [29], but it is a little stronger. This assumption is reasonable under assumptions A1 and A3, and becomes more realistic as the weight vector gets longer (justified by the law of large numbers).

In the following, for tractability, we only consider the case in which the block data are constructed by “image-time shift” approach (see Eq. (4.145)). In this case, the assumption A3 implies: (i) the input matrices image are zero-mean, independent, identically distributed and (ii) image and image are mutually independent. Combining assumption A3 and the independence between image and image, one may easily conclude that the components of the a priori error vector image are also zero-mean, independent, identically distributed. Thus, by Gaussian assumption A2, the PDF of image can be expressed as

image (4.161)

where image is the a priori error power. Further, by assumption A1, the a priori error vector image and the noise vector image are independent, and hence

image (4.162)

where image denotes the PDF of image. The inner integral depends on image through the second moment image only, and so does image. Thus, given the noise distribution image, the expectation image can be expressed as a function of image, which enables us to define the following function13:

image (4.163)

It follows that

image (4.164)

Next, we evaluate the expectation image. As image, image, by assumptions A1, A3, and A4, image and image will be independent. Thus

image (4.165)

where (a) and (b) follow from the assumption A3. Since image is Gaussian and independent of the noise vector image, the term image will also depend on image through image only, and this prompts us to define the function image:

image (4.166)

Combining Eqs. (4.165) and (4.166) yields

image (4.167)

Substituting Eqs. (4.164) and (4.167) into Eq. (4.160), we obtain

image (4.168)

Now, we use the recursion formula (4.168) to analyze the mean square convergence behavior of the algorithm (4.146), following the similar derivations in Ref. [29].

4.4.3.1 Sufficient Condition for Mean Square Convergence

From Eq. (4.168), it is easy to observe

image (4.169)

Therefore, if we choose the step-size image such that for all image

image (4.170)

then the sequence of WEP image will be monotonically decreasing (and hence convergent). Thus, a sufficient condition for the mean square convergence of the algorithm (4.146) would be

image (4.171)

where (a) comes from the assumption that the input vector is stationary, image, image denotes the set of all possible values of image (image).

In general, the above upper bound of the step-size is conservative. However, if image is a monotonically increasing function over image, i.e., image, image, one can explicitly derive the maximum value (tight upper bound) of the step-size that ensures the mean square convergence. Let’s come back to the condition in Eq. (4.170) and write

image (4.172)

Under this condition, the WEP will be monotonically decreasing. As the a priori error power image is proportional to the WEP image (see Eq. (4.177)), in this case, the a priori error power will also be monotonically decreasing, i.e.,

image (4.173)

where image is the initial a priori error power. So, the maximum step-size is

image (4.174)

where (a) follows from Eq. (4.173) and the monotonic property of image. This maximum step-size depends upon the initial a priori error power. When image, we have

image (4.175)

In this case, the learning is at the edge of convergence (WEP remains constant).

Remark

If the step-size image is below the upper bound or smaller than the maximum value image, the WEP will be decreasing. However, this does not imply that the WEP will converge to zero. There are two reasons for this. First, for a stochastic gradient-based algorithm, there always exist some excess errors (misadjustments). Second, the algorithm may converge to a local minimum (if any).

4.4.3.2 Mean Square Convergence Curve

One can establish a more homogeneous form for the recursion formula (4.168). First, it is easy to derive

image (4.176)

where (a) follows from the independence between image and image, image, image. As the input data are assumed to be zero-mean, independent, identically distributed, we have image (image is an image-dimensional unit matrix), and hence

image (4.177)

Substituting Eq. (4.177) and image into Eq. (4.168) yields the equation that governs the mean square convergence curve:

image (4.178)

4.4.3.3 Mean Square Steady-State Performance

We can use Eq. (4.178) to evaluate the mean square steady-state performance. Suppose the WEP reaches a steady-state value, i.e.,

image (4.179)

Then the mean square convergence equation (4.178) becomes, in the limit

image (4.180)

It follows that

image (4.181)

Denoting by image the steady-state WEP, i.e., image, we have

image (4.182)

Therefore, if the adaptive algorithm (4.146) converges, the steady-state WEP image will be a positive solution of Eq. (4.182), or equivalently, image will be a positive fixed point of the function image.

Further, denote by image the steady-state excess mean square error (EMSE), i.e., image. By Eq. (4.177), we can easily evaluate image as

image (4.183)

The steady-state EMSE is thus linearly proportional to the steady-state WEP.

So far we have derived the mean square convergence performance for adaptive algorithm (4.146), under the assumptions A1–A4. The derived results depend mainly on two functions: image, image. In the following, we will derive the exact expressions for the two functions for QIP criterion (image).

By Eq. (4.148), under QIP criterion we have

image (4.184)

Hence, the function image can be expressed as

image (4.185)

where (a) comes from the fact that the error pairs image are independent, identically distributed. In addition, substituting (4.184) into (4.166) yields

image (4.186)

where (b) and (c) follow from the fact that the error samples image are independent, identically distributed.

In (4.185) and (4.186), the functions image and image do not yet have the explicit expressions in terms of the argument image. In order to obtain the explicit expressions, one has to calculate the involved expectations using the PDFs of the a priori error and the noise. The calculation is rather complex and tedious. In the following, we only present the results for the Gaussian noise case.

Let the noise image be a zero-mean white Gaussian process with variance image. Then the error image will also be zero-mean Gaussian distributed with variance image. In this case, one can derive

image (4.187)

Substituting the explicit expressions in Eq. (4.187) into Eq. (4.178), one may obtain the exact convergence curve of the WEP, which can be described by a nonlinear dynamic system:

image (4.188)

where the function image.
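In practice, the theoretical learning curve can be generated simply by iterating this scalar map, and the steady-state WEP can be located as its positive fixed point. The sketch below is purely structural: the map f used here is a placeholder standing in for Eq. (4.188) (which depends on the functions hG and hU, the step-size, and the input and noise statistics), and its coefficients are invented solely to make the code runnable.

```python
import numpy as np
from scipy.optimize import brentq

def iterate_wep(f, S0, n_iter):
    """Iterate the scalar map S_{k+1} = f(S_k) that governs the theoretical WEP curve."""
    curve = [S0]
    for _ in range(n_iter):
        curve.append(f(curve[-1]))
    return np.array(curve)

# placeholder map standing in for Eq. (4.188); the true map is built from hG and hU
def f(S, eta=0.5, a=0.2, b=0.004):
    return S - 2 * eta * a * S / (1 + S) + eta**2 * b

curve = iterate_wep(f, S0=1.0, n_iter=200)
# steady-state WEP = positive fixed point of f, i.e., root of f(S) - S
S_inf = brentq(lambda S: f(S) - S, 1e-8, 10.0)
print("WEP after 200 iterations:", curve[-1])
print("steady-state WEP (fixed point):", S_inf)
```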

In the following, a Monte Carlo simulation example is presented to verify the previous theoretical analysis results [106]. Consider the case in which the input signal and additive noise are both white Gaussian processes with unit power. Assume the unknown and adaptive systems are both ADALINE structures with weight vector of length 25. The initial weight vector of the adaptive system was obtained by perturbing each coefficient of the ideal weight vector image by a random variable that is zero-mean Gaussian distributed with variance 0.04 (hence the initial WEP is 1.0 or 0 dB).

First, we examine the mean square convergence curves of the adaptive algorithm. For different values of the step-size image, kernel width image, and sliding data length image, the average convergence curves (solid) over 100 Monte Carlo runs and the corresponding theoretical learning curves (dotted) are plotted in Figures 4.14–4.16. Clearly, the experimental and theoretical results agree very well. Second, we verify the steady-state performance. As shown in Figures 4.17–4.19, the steady-state EMSEs generated by simulations match well with those calculated by theory. These simulated and theoretical results also demonstrate how the step-size image, kernel width image, and sliding data length image affect the performance of the adaptation: (i) a larger step-size produces a faster initial convergence, but results in a larger misadjustment; (ii) a larger kernel width causes a slower initial convergence, but yields a smaller misadjustment; (iii) a larger sliding data length achieves a faster initial convergence and a smaller misadjustment.14

image

Figure 4.14 Simulated and theoretical learning curves for different step-sizes (image, image).

image

Figure 4.15 Simulated and theoretical learning curves for different kernel widths (image, image).

image

Figure 4.16 Simulated and theoretical learning curves for different sliding data lengths (image, image).

image

Figure 4.17 Simulated and theoretical EMSE versus step-size image.

image

Figure 4.18 Simulated and theoretical EMSE versus kernel width image.

image

Figure 4.19 Simulated and theoretical EMSE versus sliding data length image.

In addition, we verify the upper bound on step-sizes that guarantee the convergence of the learning. For the case in which the kernel width image, and the sliding data length image, we plot in Figure 4.20 the curve of the function image. Clearly, this function is monotonically increasing. Thus by (4.174), we can calculate the maximum step-size image. For step-sizes around image the simulated and theoretical learning curves are shown in Figure 4.21. As expected, when image, the learning is at the edge of convergence. If image is above (or below) the maximum step-size image, the weight error power will be increasing (or decreasing).

image

Figure 4.20 The function image.

image

Figure 4.21 Simulated (solid) and theoretical (dotted) learning curves for step-sizes around image.

4.5 Optimization of image-Entropy Criterion

The image-entropy criterion is very flexible. In fact, many entropy definitions are special cases of the image-entropy. This flexibility, however, also brings the problem of how to select a good image function to maximize the performance of the adaptation algorithm [200].

The selection of the image function is actually an optimization problem. Denote image a quantitative performance index (convergence speed, steady-state accuracy, etc.) of the algorithm that we would like to optimize. Then the optimal image function will be

image (4.189)

where image represents the set of all possible image functions. Due to the fact that different identification scenarios usually adopt different performance indexes, the above optimization problem must be further formulated in response to a specific identification system. In the following, we will focus on the FIR system identification, and assume for simplicity that the unknown system and the adaptive model are both FIR filters of the same order.

Before proceeding, we briefly review the optimization of the general error criterion image. In the adaptive filtering literature, there are mainly two approaches for the optimization of the general (usually non-MSE) error criterion. One approach regards the choice of the error criterion (or the image function) as a parameter search, in which a suitable structure of the criterion is assumed [26,201]. Such a design method usually leads to a suboptimal algorithm since the criterion is limited to a restricted class of functions. Another approach is proposed by Douglas and Meng [28], where the calculus of variations is used, and no prior information about the structure of error criterion is assumed. According to Douglas’ method, in FIR system identification one can use the following performance index to optimize the error criterion:

image (4.190)

where image is the weight error vector at image iteration. With the above performance index, the optimization of the image function can be formulated as the following optimization problem [28]:

image (4.191)

where image, image is the error PDF at image iteration, image is the step-size, and image is the input signal power. By calculus of variations, one can obtain the optimal image function [28]:

image (4.192)

The optimal image function can thus be expressed in the form of indefinite integral:

image (4.193)

which depends crucially on the error PDF image.

Next, we will utilize Eq. (4.193) to derive an optimal image-entropy criterion [200]. Assume that the error PDF image is symmetric, continuously differentiable (up to the second order), and unimodal with a peak at the origin. Then image satisfies: (i) invertible over interval image and (ii) image is symmetric. Therefore, we have

image (4.194)

where image denotes the inverse function of image over image and image. So, we can rewrite Eq. (4.193) as

image (4.195)

Letting the optimal image-entropy criterion equal the optimal error criterion image, we have

image (4.196)

Hence

image (4.197)

Letting image, we obtain

image (4.198)

To achieve an explicit form of the function image, we consider a special case in which the error is zero-mean Gaussian distributed:

image (4.199)

Then we have

image (4.200)

It follows that

image (4.201)

Substituting Eq. (4.201) into Eq. (4.198) yields

image (4.202)

where image, image, and image is a constant. Thus, we obtain the following optimal image-entropy:

image (4.203)

When image, and image, we have

image (4.204)

That is, as image the derived optimal entropy approaches the Shannon entropy. One may therefore conclude that, under the slow adaptation condition (image small enough), Shannon's entropy is a nearly optimal (though strictly suboptimal) entropy criterion. Figure 4.22 shows the optimal image functions for different step-sizes image (assuming image).

image

Figure 4.22 Optimal image functions for different step-sizes image (adopted from Ref. [200]).

One can easily derive the IG algorithms under the optimal image-entropy criterion. For example, substituting Eq. (4.202) into Eq. (4.79), we have the following SIG algorithm:

image (4.205)

It is worth noting that the above optimal image-entropy was derived as an example for a restricted setting, in which the constraints include FIR identification, white Gaussian input and noise, etc. For more general situations, such as nonlinear and non-Gaussian cases, the derived image function would no longer be optimal.

As pointed out in Ref. [28], the implementation of the optimal nonlinear error adaptation algorithm requires the exact knowledge of the noise’s or error’s PDFs. This is usually not the case in practice, since the characteristics of the noise or error may only be partially known or time-varying. In the implementation of the algorithm under the optimal image-entropy criterion, however, the required PDFs are estimated by a nonparametric approach (say the KDE), and hence we don’t need such a priori information. It must be noted that in Eq. (4.205), the parameters image and image are both related to the error variance image, which is always time-varying during the adaptation. In practice, we should estimate this variance and update the two parameters online.

In the following, we present a simple numerical example to verify the theoretical conclusions and illustrate the improvements that may be achieved by optimizing the image function. Consider the FIR system identification, where the transfer functions of the plant and the adaptive filter are [200]

image (4.206)

The input signal and the noise are white Gaussian processes with powers 1.0 and 0.64, respectively. The initial weight vector of adaptive filter was obtained by perturbing each component of the optimal weight vector by a random variable that is uniformly distributed in the interval image. In the simulation, the SIG algorithms under three different entropy criteria (optimal entropy, Shannon entropy, QIP) are compared. The Gaussian kernel is used and the kernel size is kept fixed at image during the adaptation.

First, the step-size of the SIG algorithm under the optimal entropy criterion was chosen to be image. The step-sizes of the other two SIG algorithms are adjusted such that the three algorithms converge at the same initial rate. Figure 4.23 shows the average convergence curves over 300 simulation runs. Clearly, the optimal entropy criterion achieves the smallest final misadjustment (steady-state WEP). The step-sizes of the other two SIG algorithms can also be adjusted such that the three algorithms yield the same final misadjustment. The corresponding results are presented in Figure 4.24, which indicates that, beginning at the same initial WEP, the algorithm under optimal entropy criterion converges faster to the optimal solution than the other two algorithms. Therefore, a noticeable performance improvement can be achieved by optimizing the image function.

image

Figure 4.23 Convergence curves of three SIG algorithms with the same initial convergence rate (adopted from Ref. [200]).

image

Figure 4.24 Convergence curves of three SIG algorithms with the same final misadjustment (adopted from Ref. [200]).

Further, we consider the slow adaptation case in which the step-size for the optimal entropy criterion was chosen to be image (smaller than 0.015). It has been proved that, if the step-size becomes smaller (tends to zero), the optimal entropy will approach the Shannon entropy. Thus, in this case, the adaptation behavior of the SIG algorithm under Shannon entropy criterion would be nearly equivalent to that of the optimal entropy criterion. This theoretical prediction is confirmed by Figure 4.25, which illustrates that the Shannon entropy criterion and the optimal entropy criterion may produce almost the same convergence performance. In this figure, the initial convergence rates of the three algorithms are set equal.

image

Figure 4.25 Convergence curves of three SIG algorithms in slow adaptation (adopted from Ref. [200]).

4.6 Survival Information Potential Criterion

Traditional entropy measures, such as Shannon and Renyi’s entropies of a continuous random variable, are defined based on the PDF. As argued by Rao et al. [157], this kind of entropy definition has several drawbacks: (i) the definition will be ill-suited for the case in which the PDF does not exist; (ii) the value can be negative; and (iii) the approximation using empirical distribution is impossible in general. In order to overcome these problems, Rao et al. proposed a new definition of entropy based on the cumulative distribution function (or equivalently, the survival function) of a random variable, which they called the cumulative residual entropy (CRE) [157]. Motivated by the definition of CRE, Zografos and Nadarajah proposed two new broad classes of entropy measures based on the survival function, that is, the survival exponential and generalized survival exponential entropies, which include the CRE as a special case [158].

In the following, a new IP, namely the survival information potential (SIP) [159] is defined in terms of the survival function instead of the density function. The basic idea of this definition is to replace the density function with the survival function in the expression of the traditional IP. The SIP is, in fact, the argument of the power function in the survival exponential entropy. In a sense, this parallels the relationship between the IP and the Renyi entropy. When used as an adaptation criterion, the SIP has some advantages over the IP: (i) it has consistent definition in the continuous and discrete domains; (ii) it is not shift-invariant (i.e., its value will vary with the location of distribution); (iii) it can be easily computed from sample data (without kernel computation and the choice of kernel width), and the estimation asymptotically converges to the true value; and (iv) it is a more robust measure since the distribution function is more regular than the density function (note that the density function is computed as the derivative of the distribution function).

4.6.1 Definition of SIP

Before proceeding, we review the definitions of the CRE and survival exponential entropy.

Let image be a random vector in image. Denote image the absolute value transformed random vector of image, which is an image-dimensional random vector with components image. Then the CRE of image is defined by [157]

image (4.207)

where image is the multivariate survival function of the random vector image, and image. Here the notation image means that image, image, and image denotes the indicator function.

Based on the same notations, the survival exponential entropy of order image is defined as [158]

image (4.208)

From Eq. (4.208), we have

image (4.209)

It can be shown that the following limit holds [158]:

image (4.210)

The definition of the IP, along with the similarity between the survival exponential entropy and the Renyi entropy, motivates us to define the SIP.

Definition

For a random vector image in image, the SIP of order image is defined by [159]

image (4.211)

The SIP (4.211) is just defined by replacing the density function with the survival function (of an absolute value transformation of image) in the original IP. This new definition seems more natural and reasonable, because the survival function (or equivalently, the distribution function) is more regular and general than the PDF. For the case image, we call the SIP the quadratic survival information potential (QSIP).

The SIP image can be interpreted as the image-power of the image-norm in the survival functional space. When image, the survival exponential entropy image is a monotonically increasing function of image, and minimizing the SIP is equivalent to minimizing the survival exponential entropy; when image, the survival exponential entropy image is a monotonically decreasing function of image, and in this case minimizing the SIP is equivalent to maximizing the survival exponential entropy. We stress that, when used as an approximation criterion in system identification, the SIP should be minimized to achieve smaller errors no matter what the value of image is. This is quite different from the IP criterion, which should be maximized when image [64]. The reason for this is that, for any image, a smaller SIP corresponds to errors more concentrated around the zero value. To demonstrate this fact, we give a simple example below.

Assume that image is zero-mean Gaussian distributed, image. For different variance image and image values, we can calculate the SIP and IP, which are shown in Figure 4.26. It is clear that the SIP is a monotonically increasing function of image for all the image values, while the IP is a monotonically increasing function only when image.

image

Figure 4.26 The SIP and IP for different image and image (adopted from Ref. [159]).
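This monotonicity behavior can be checked numerically. For a zero-mean Gaussian variable with standard deviation sigma, the survival function of its absolute value is erfc(x/(sigma*sqrt(2))), so the order-alpha SIP is the integral of this function raised to the power alpha, while the order-alpha IP of a Gaussian density has the closed form (2*pi*sigma^2)^((1-alpha)/2)/sqrt(alpha). The sketch below evaluates both over a grid of sigma; it is a verification aid rather than part of the original example.

```python
import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

def sip_gaussian(sigma, alpha):
    """SIP of order alpha for X ~ N(0, sigma^2): integral of survival(|X|)^alpha."""
    surv = lambda x: erfc(x / (sigma * np.sqrt(2.0))) ** alpha
    value, _ = quad(surv, 0.0, np.inf)
    return value

def ip_gaussian(sigma, alpha):
    """Order-alpha information potential of N(0, sigma^2) in closed form."""
    return (2.0 * np.pi * sigma**2) ** ((1.0 - alpha) / 2.0) / np.sqrt(alpha)

for alpha in (0.5, 1.0, 2.0, 3.0):
    sips = [sip_gaussian(s, alpha) for s in (0.5, 1.0, 2.0)]
    ips = [ip_gaussian(s, alpha) for s in (0.5, 1.0, 2.0)]
    # the SIP grows with sigma for every alpha; the IP grows with sigma only for alpha < 1
    print(f"alpha={alpha}: SIP {sips}, IP {ips}")
```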

4.6.2 Properties of the SIP

To further understand the SIP, we present in the following some important properties.

Property 1:

image, image.

Proof:

This property is a direct consequence of the definition of SIP.

Property 2:

image, with equality if and only if image.

Proof:

It is obvious that image. Now image if and only if image. Therefore, image implies image for almost all image; in other words, for almost all image, image, which implies image.

Remark:

The global minimum value of the SIP is zero, and it corresponds to the image distribution located at zero. This is a desirable property that fails to hold for conventional MEE criteria, whose global minimum corresponds to the image distribution located at any position (shift-invariant). Hence, when using the SIP as an approximation criterion in system identification, we do not need to add a bias term at the system output.

The next five properties (Properties 3–7) are direct consequences of Ref. [158], and will, therefore, not be proved here (for detailed proofs, please refer to Theorems 1–5 in Ref. [158]).

Property 3:

If image and image (image) for some image, then image.

Property 4:

Let image be an image-dimensional random vector, and let image with image, image, image. Then image.

Property 5 (Weak convergence):

Let image be a sequence of image-dimensional random vectors converging in law to a random vector image. If image are all bounded in image for some image, then image.

Property 6:

If the components of an image-dimensional random vector image are independent of each other, then image.

Property 7:

Let image and image be nonnegative and independent random variables (image). Then image.

Property 8:

Given two image-dimensional continuous random vectors image and image with PDFs image and image, if image is symmetric (rotation invariant for image) and unimodal around zero, and image is independent of image, then image.

Proof:

Since image and image are independent,

image (4.212)

It follows that image

image (4.213)

where (a) follows from the condition that image is symmetric (rotation invariance for image) and unimodal around zero. Thus we get image.

Property 9:

For the case image, the SIP of image equals the expectation of image.

Proof:

image (4.214)

The above property can be generalized to the case where image is a natural number, as stated in Property 10.

Property 10:

If image, then image, where image, image are independent and identically distributed (i.i.d.) random vectors that are independent of, but have the same distribution as, image.

Proof:

As the random vectors image are i.i.d., being independent of, but having the same distribution as, image, we can derive

image (4.215)

And hence

image (4.216)

The next property establishes a relationship between SIP and IP (which exists when image has PDF). A similar relationship has been proved by Rao et al. for their CRE (see Proposition 4 in [157]).

Property 11:

Let image be a nonnegative random variable with continuous distribution. Then there exists a function image such that the image-order IP of image is related to image via

image (4.217)

Proof:

Let image be the distribution function with density image. If we choose image, where image is defined as in the remarks preceding Proposition 4 in [157], then image has the distribution image. Therefore, we have

image (4.218)

Property 12:

Let image and image be two continuous random vectors. Assume that, for every value image, the conditional density image is symmetric (rotation invariant for image) and unimodal in image around image. Then image, where image is any mapping image for which image exists.

Proof:

Denote image and image, respectively, the densities of image and image, i.e.,

image (4.219)

Then for any image, we have

image (4.220)

where (b) comes from the condition that for every image, image is symmetric (rotation invariant for image) and unimodal in image around image. Therefore

image (4.221)

Remark:

The above property suggests that under certain conditions, the conditional mean image, which minimizes the MSE, also minimizes the error’s SIP.

4.6.3 Empirical SIP

Next, we discuss the empirical SIP of image. Since image (see Property 1), we assume without loss of generality that image. Let image be image i.i.d. samples of image with survival function image. The empirical survival function of image can be estimated by placing mass 1/N at each of the sample points, i.e.,

image (4.222)

Consequently, the empirical SIP can be calculated as

image (4.223)

According to the Glivenko–Cantelli theorem [202],

image (4.224)

Combining Eq. (4.224) and Property 5 yields the following proposition.

Proposition:

For any random vector image in image, if image is bounded in image for some image, then the empirical SIP (4.223) will converge to the true SIP of image, i.e., image.

In the sequel, we will derive more explicit expressions for the empirical SIP (4.223). Let image be a realization of image.

4.6.3.1 Scalar Data Case

First, we consider the scalar data case, i.e., image. We assume, without loss of generality, that image. Then we derive

image (4.225)

where we assume image. One can rewrite Eq. (4.225) in a simpler form:

image (4.226)

where

image (4.227)

From Eq. (4.226), the empirical SIP for scalar data can be expressed as a weighted sum of the ordered sample data image, where the weights image depend on the sample size image and the image value, satisfying image, image. For the case image, the weights for different image values are shown in Figure 4.27. One can observe that: (i) when image, all the weights are equal (image), and in this case the empirical SIP is identical to the sample mean of image; and (ii) when image, the weights are not equal. Specifically, when image (image), the weight image is a monotonically increasing (decreasing) function of the order index image, that is, the larger weights are assigned to the larger (smaller) sample data.

image

Figure 4.27 The weights for different image values (adopted from Ref. [159]).
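A minimal sketch of the empirical SIP for scalar data follows. It assumes the empirical survival function that places mass 1/N at each sample point, in which case the estimate reduces to a weighted sum of the ordered absolute values with weight ((N-i+1)/N)^alpha - ((N-i)/N)^alpha attached to the i-th smallest value; for alpha = 1 all weights equal 1/N and the estimate is the sample mean of the absolute values, consistent with the observation above. The exact weight expression of Eq. (4.227) is not reproduced here, so this should be read as one admissible construction.

```python
import numpy as np

def empirical_sip(x, alpha):
    """Empirical survival information potential of order alpha for scalar samples x."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))   # ordered absolute values
    N = len(a)
    i = np.arange(1, N + 1)
    # weight attached to the i-th ordered value; the weights sum to one
    lam = ((N - i + 1) / N) ** alpha - ((N - i) / N) ** alpha
    return np.sum(lam * a)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print(empirical_sip(x, 1.0), np.mean(np.abs(x)))   # alpha = 1 recovers the mean absolute value
print(empirical_sip(x, 2.0), empirical_sip(x, 0.5))
```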

4.6.3.2 Multidimensional Data Case

Computing the empirical SIP for multidimensional data (image) is, in general, not an easy task. If image is a natural number, however, we can still obtain a simple explicit expression. In this case, one can derive

image (4.228)

The empirical SIP in Eq. (4.228) can also be derived using Property 10. By Property 10, we have image, where image are i.i.d. random vectors which are independent of but have the same distribution with image. It is now evident that the empirical SIP of Eq. (4.228) is actually the sample mean estimate of image.

Remark:

Compared with the empirical IP (say, the IP estimated by the Parzen window method), the empirical SIP is much simpler to compute (it requires only an ordering of the samples), since there is no kernel evaluation, which is the most time-consuming part in the calculation of the empirical IP. In addition, there is no kernel width to choose.

4.6.4 Application to System Identification

Similar to the IP, the SIP can also be used as an optimality criterion in adaptive system training. Under the SIP criterion, the unknown system parameter vector (or weight vector) image can be estimated as

image (4.229)

where image is the survival function of the absolute value transformed error image. In practical applications, the error distribution is usually unknown, so we have to use the empirical SIP, rather than the theoretical SIP, as the cost function. Given a sequence of error samples image, and assuming without loss of generality that image, the empirical SIP will be (assuming a scalar error)

image (4.230)

where image is calculated by Eq. (4.227). The empirical cost (4.230) is a weighted sum of the ordered absolute errors. One drawback of Eq. (4.230) is that it is not smooth at image. To address this problem, one can use the empirical SIP of the squared errors image as an alternative adaptation cost, given by

image (4.231)

The above cost is the weighted sum of the ordered squared errors, which includes the popular MSE cost as a special case (when image). A more general cost can be defined as the empirical SIP of any mapped errors image, i.e.,

image (4.232)

where function image usually satisfies

image (4.233)

Based on the general cost (4.232), the weight update equation for system identification is

image (4.234)

The weight update can be performed online (i.e., over a short sliding window), as described in Table 4.5.

Table 4.5

Online System Identification with SIP Criterion

Image
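Since Table 4.5 appears above only as a placeholder, the sketch below gives one plausible reading of such an online procedure rather than the book's exact algorithm. Over each sliding window it sorts the squared errors, forms the weighted cost of Eq. (4.231) using the same survival-function weights as in the scalar empirical SIP, and takes a gradient step while treating the ordering (and hence the weights) as fixed within the window.

```python
import numpy as np

def sip_weights(N, alpha):
    i = np.arange(1, N + 1)
    return ((N - i + 1) / N) ** alpha - ((N - i) / N) ** alpha

def sip_update(w, X, d, mu, alpha):
    """One sliding-window update minimizing the empirical SIP of the squared errors.
    The sorting permutation (and hence the weights) is held fixed during the step."""
    e = d - X @ w
    order = np.argsort(e**2)                 # indices of squared errors, ascending
    lam = sip_weights(len(d), alpha)         # weight for the i-th smallest squared error
    grad = -2.0 * np.sum((lam * e[order])[:, None] * X[order], axis=0)
    return w - mu * grad

# toy FIR identification with placeholder parameters
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])
w = np.zeros(3)
for k in range(2000):
    X = rng.standard_normal((10, 3))         # sliding window of regressors
    d = X @ w_true + 0.1 * rng.standard_normal(10)
    w = sip_update(w, X, d, mu=0.05, alpha=2.0)
print("estimated weights:", w)
```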

In the following, we present two simulation examples to demonstrate the performance of the SIP minimization criterion. In the simulations below, the empirical cost (4.231) is adopted (image).

4.6.4.1 FIR System Identification

First, we consider a simple FIR system identification problem. Let the unknown system be an FIR filter given by [159]

image (4.235)

The adaptive system is another FIR filter of the same order. The input image is a white Gaussian process with unit variance. Assume that the output of the unknown system is disturbed by additive noise. Three different distributions are utilized to generate the noise data:

image (4.236)

where image denotes the characteristic function of the image distribution. The above three distributions have, respectively, heavy, medium, and light tails. The noise signals are shown in Figure 4.28.

image

Figure 4.28 Three different noises: (A) image, (B) Gaussian, and (C) Binary.

In the simulation, the sliding data length is set at 10. The step-size of each algorithm is chosen such that the initial convergence rates are visually identical. Figure 4.29 shows the average convergence curves over 100 Monte Carlo runs for different image. Notice that when image, the algorithm is actually the LMS algorithm (strictly speaking, the block LMS algorithm). The WEPs at the final iteration are summarized in Table 4.6. From the simulation results, one can observe:

i. For the case of image noise, the algorithms with larger image values (say image) converge to a smaller WEP, and can even outperform the LAD algorithm (for comparison purposes, we also plot in Figure 4.29(A) the convergence curve of LAD). It is well known that the LAD algorithm performs well in image-stable noises [31].

ii. For the case of Gaussian noise, the algorithm with image (the LMS algorithm) performs better, which is to be expected, since the MSE criterion is optimal for linear Gaussian systems.

iii. For the case of binary noise, the algorithms with smaller image values (say image) obtain better performance.

image

Figure 4.29 Convergence curves averaged over 100 Monte Carlo runs: (A) image, (B) Gaussian, and (C) Binary.

Table 4.6

WEPs at Final Iteration Over 100 Monte Carlo Runs

Image

The basic reasons for these findings are as follows. As shown in Figure 4.27, the larger the image value, the smaller the weights assigned to the larger errors. For heavy-tailed noises (e.g., image noise), the larger errors are usually caused by the impulsive noise; in this case, a larger image value reduces the influence of the outliers and improves the performance. On the other hand, for light-tailed noises (e.g., binary noise), the larger errors are mainly caused by the system mismatch, and thus a smaller image value decreases the larger mismatch more rapidly (as the larger weights are assigned to the larger errors).

4.6.4.2 TDNN Training

The second simulation example concerns TDNN training (in batch mode) with the SIP minimization criterion for one-step prediction of the Mackey–Glass (MG) chaotic time series [203] with delay parameter image and sampling period 6 s. The TDNN is an MLP that consists of six processing elements (PEs) in a hidden layer with biases and tanh nonlinearities, and a single linear output PE with an output bias. The goal is to predict the value of the current sample image using the previous seven points image (the size of the input delay line is consistent with Takens' embedding theorem [204]). In essence, the problem is to identify the underlying mapping between the input vector image and the desired output image. For comparison purposes, we also present simulation results of TDNN training with the image-order (image) IP maximization criterion. Since the IP is shift-invariant, after training the bias value of the output PE was adjusted so as to yield zero-mean error over the training set. The Gaussian kernel was used to evaluate the empirical IP and the kernel size was experimentally set at 0.8. A segment of 200 samples is used as the training data. To avoid local-optimal solutions, each TDNN is trained starting from 500 predetermined initial weights generated from a zero-mean Gaussian distribution with variance 0.01. The best solution (the one with the lowest SIP or the highest IP after training) among the 500 candidates is selected to test the accuracy performance. In each simulation, the training algorithms utilized BP with variable step-sizes [205], and 1000 iterations were run to ensure convergence. The trained networks are tested on an independently generated test sequence of 4000 samples, and the testing errors are listed in Table 4.7. One can see that the TDNN trained using SIP with image achieves the smallest testing error. Thus, with a properly chosen order image, the SIP criterion is capable of outperforming the IP criterion. Figure 4.30 shows the computation time per iteration versus the number of training data. Clearly, the SIP-based training is computationally much more efficient than the IP-based training, especially for large data sets. The training time for both methods was measured on a personal computer with a 2.2-GHz processor and 3 GB of memory.

Table 4.7

Testing Errors Over 4000 Test Samples in MG Time Series Prediction

Image

image

Figure 4.30 Execution time per iteration versus the number of training data.

4.7 Δ-Entropy Criterion

System identification usually handles continuous-valued random processes rather than discrete-valued processes. In many practical situations, however, the input and/or output of the unknown system may be discrete-valued for a variety of reasons:

a. For many systems, especially in the field of digital communication, the input signals take values only in finite alphabetical sets.

b. Coarsely quantized signals are commonly used when the data are obtained from an A/D converter or from a communication channel. Typical contexts involving quantized data include digital control systems (DCSs), networked control systems (NCSs), wireless sensor networks (WSNs), etc.

c. Binary-valued sensors occur frequently in practical systems. Some typical examples of binary-valued sensors can be found in Ref. [206].

d. Discrete-valued time series are common in practice. In recent years, count or integer-valued time series have gained increasing attention [207–210].

e. Sometimes, due to computational considerations, even if the observed signals are continuous-valued, one may classify the data into groups and obtain discrete-valued data [130, Chap. 5].

In these situations, one may apply the differential entropy (or IP) to implement the MEE criterion, in spite of the fact that the random variables are actually discrete. When the discretization is coarse (i.e., has few levels), the use of differential entropy may carry a performance penalty that is normally not quantified. Alternatively, implementing MEE with the discrete entropy becomes ill-suited, since minimizing the discrete entropy fails to constrain the dispersion of the error values, which should be pursued so that the error dynamic range decreases over the iterations.

In the following, we augment the MEE criterion choices by providing a new entropy definition for discrete random variables, called the image-entropy, which comprises two terms: one is the discrete entropy and the other is the logarithm of the average interval between two successive discrete values. This new entropy retains important properties of the differential entropy and reduces to the traditional discrete entropy for a special case. More importantly, the proposed entropy definition can still be used to measure the value dispersion of a discrete random variable, and hence can be used as an MEE optimality criterion in system identification with discrete-valued data.

4.7.1 Definition of Δ-Entropy

Before giving the definition of image-entropy, let’s review a fundamental relationship between the differential entropy and discrete entropy (for details, see also Ref. [43]).

Consider a continuous scalar random variable image with PDF image. One can produce a quantized random variable image (see Figure 4.31), given by

image (4.237)

where image is one of countable values, satisfying

image (4.238)

image

Figure 4.31 Quantization of a continuous random variable.

The probability that image is

image (4.239)

And hence, the discrete entropy image is calculated as

image (4.240)

If the density function image is Riemann integrable, the following limit holds

image (4.241)

Here, to make a distinction between the discrete entropy and differential entropy, we use image instead of image to denote the differential entropy of image. Thus, if the quantization interval image is small enough, we have

image (4.242)

So, the differential entropy of a continuous random variable image is approximately equal to the discrete entropy of the quantized variable image plus the logarithm of the quantization interval image. The above relationship explains why differential entropy is sensitive to value dispersion. That is, compared with the discrete entropy, the differential entropy “contains” the term image, which measures the average interval between two successive quantized values since

image (4.243)
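A quick numerical check of this relationship (a verification aid, not part of the original text): for a Gaussian variable the differential entropy has the closed form 0.5*log(2*pi*e*sigma^2) in nats, and the sketch below compares it with the discrete entropy of the quantized variable plus the logarithm of the quantization interval for several interval sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = sigma * rng.standard_normal(200_000)
h_true = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # differential entropy in nats

for delta in (1.0, 0.5, 0.1, 0.02):
    q = np.floor(x / delta)                           # index of the quantization cell
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    h_discrete = -np.sum(p * np.log(p))               # discrete entropy of the quantized variable
    print(f"delta={delta:5.2f}: H_discrete + log(delta) = {h_discrete + np.log(delta):.4f}"
          f"  (differential entropy = {h_true:.4f})")
```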

This important relationship also inspired us to seek a new entropy definition for discrete random variables that measures uncertainty as well as value dispersion; it is defined as follows.

Definition: Given a discrete random variable image with values image, and the corresponding distribution image, the image-entropy, denoted by image or image, is defined as [211]

image (4.244)

where image (or image) stands for the average interval (distance) between two successive values.

The image-entropy contains two terms: the first term is identical to the traditional discrete entropy and the second term equals the logarithm of the average interval between two successive values. This new entropy can be used as an optimality criterion in estimation or identification problems, because minimizing the error's image-entropy decreases the average interval and automatically forces the error values to concentrate.

Next, we discuss how to calculate the average interval image. Assume without loss of generality that the discrete values satisfy image. Naturally, one immediately thinks of the arithmetic and geometric means, i.e.,

image (4.245)

Both arithmetic and geometric means take no account of the distribution. A more reasonable approach is to calculate the average interval using a probability-weighted method. For example, one can use the following formula:

image (4.246)

However, if image, the sum of the weights will be less than 1, because

image (4.247)

To address this issue, we give another formula:

image (4.248)

The second term of (4.248) equals the arithmetic mean multiplied by image, which normalizes the weight sum to one. Substituting (4.248) into (4.244), we obtain

image (4.249)

The image-entropy can be immediately extended to the infinite value-set case, i.e.,

image (4.250)

In the following, we use Eq. (4.249) or (4.250) as the image-entropy expression.

4.7.2 Some Properties of the Δ-Entropy

The image-entropy maintains a close connection to the differential entropy. It is clear that the image-entropy and the differential entropy have the following relationship in the limit:

Theorem 1

For any continuous random variable image with Riemann integrable PDF image, we have image, where quantized variable image is given by Eq. (4.237).

Proof:

Combining Eqs. (4.239) and (4.250), we have

image

As image is Riemann integrable, it follows that

image

This completes the proof.

Remark:

The differential entropy of image is the limit of the image-entropy of image as image. Thus, to some extent one can regard the image-entropy as a “quantized version” of the differential entropy.

Theorem 2

image.

Proof:

Straightforward and therefore omitted.

Remark:

By Theorem 2, if the minimum interval between two successive values is larger than 1, we have image, whereas if the maximum interval between two successive values is smaller than 1, we have image.

Theorem 3

If image is a discrete random variable with equally spaced values, and the interval image, then image.

Proof:

For equally spaced intervals, the difference between the image-entropy and the discrete entropy equals image. Hence, the statement follows directly.

Remark:

Classification is, in general, a typical example in which the error variable is distributed on equally spaced values image. Thus, in classification, the error's discrete entropy is equivalent to the image-entropy. This fact also explains why the discrete entropy can be used in testing and classification problems [212,213].

In information theory, it has been proved that the discrete entropy satisfies (see Ref. [43], p. 489)

image (4.251)

Combining Eq. (4.251) and Theorem 2, we obtain a bound on image-entropy:

image (4.252)

A lower bound of the image-entropy can also be expressed in terms of the variance image, as given in the following theorem:

Theorem 4

If image, then image.

Proof:

It is easy to derive

image

It follows that image, and hence

image

The lower bound of Theorem 4 suggests that, under certain conditions, minimizing the image-entropy constrains the variance. This is a key difference between the image-entropy and the conventional discrete entropy.

Theorem 5

For any discrete random variable image, image, image.

Proof:

Since image and image, we have image.

Theorem 6

image, image.

Proof:

Since image and image, we have image.

Theorems 5 and 6 indicate that the image-entropy has the same shifting and scaling properties as the differential entropy.

Theorem 7

The image-entropy is a concave function of image.

Proof:

image, and image, we have

image (4.253)

By the concavity of the logarithm function,

image (4.254)

It is well known that the discrete entropy image is a concave function of the distribution image, i.e.,

image (4.255)

Combining Eqs. (4.254) and (4.255) yields

image (4.256)

which implies image-entropy is a concave function of image.

The concavity of the image-entropy is a desirable property for the entropy optimization problem. This property ensures that when a stationary value of the image-entropy subject to linear constraints is found, it gives the global maximum value [149].

Next, we solve the maximum image-entropy distribution. Consider the following constrained optimization problem:

image (4.257)

where image is the expected value of the function image. The Lagrangian is given by

image (4.258)

where image are the image Lagrange multipliers corresponding to the image constraints. Here image is used as the first Lagrange multiplier instead of image as a matter of convenience. Letting image, we have

image (4.259)

where

image (4.260)

Solving Eq. (4.259), we obtain the following theorem:

Theorem 8

The distribution image that maximizes the image-entropy subject to the constraints of Eq. (4.257) is given by

image (4.261)

where image are determined by substituting image from Eq. (4.261) into the constraints of Eq. (4.257).

For the case in which the discrete values are equally spaced, we have image, and Eq. (4.261) becomes

image (4.262)

In this case, the maximum image-entropy distribution is identical to the maximum discrete entropy distribution [149].

4.7.3 Estimation of Δ-Entropy

In practical situations, the discrete values image and probabilities image are usually unknown, and we must estimate them from sample data image. An immediate approach is to group the sample data into different values image and calculate the corresponding relative frequencies:

image (4.263)

where image denotes the number of these outcomes belonging to the value image with image.

Based on the estimated values image and probabilities image, a simple plug-in estimate of image-entropy can be obtained as

image (4.264)

where image and image.
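For readers who wish to experiment with this estimator, a minimal sketch is given below. It assumes the plug-in image-entropy takes the form of a negative sum of terms p_i log(p_i/Delta_i), with Delta_i a local spacing attached to the estimated value a_i (for equally spaced values, Delta_i reduces to the common spacing). The boundary spacing convention, the function names, and the example data are illustrative assumptions, not the book's exact definitions.

```python
import numpy as np

def plug_in_delta_entropy(samples):
    """Rough plug-in estimate of the Delta-entropy from i.i.d. samples.

    Assumes H_Delta = -sum_i p_i * log(p_i / Delta_i), where Delta_i is a
    local spacing attached to the i-th distinct value (here: the average of
    the gaps to the neighboring values; boundary values use the single gap).
    This spacing convention is an illustrative assumption.
    """
    values, counts = np.unique(np.asarray(samples, dtype=float), return_counts=True)
    p_hat = counts / counts.sum()

    if len(values) == 1:                 # degenerate case: a single observed value
        return 0.0

    gaps = np.diff(values)               # gaps between adjacent estimated values
    delta = np.empty_like(values)
    delta[0], delta[-1] = gaps[0], gaps[-1]
    delta[1:-1] = 0.5 * (gaps[:-1] + gaps[1:])

    return float(-np.sum(p_hat * np.log(p_hat / delta)))

# Example: errors quantized to a grid of spacing 0.1 (equally spaced case)
rng = np.random.default_rng(0)
e = 0.1 * np.round(rng.normal(scale=0.5, size=5000) / 0.1)
print(plug_in_delta_entropy(e))          # compare with discrete entropy + log(0.1)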

As the sample size increases, the estimated value set image will approach the true value set image with probability one, i.e., image, as image. In fact, assuming image is an i.i.d. sample from the distribution image, and image, image, we have

image (4.265)

In the following, we investigate the asymptotic behavior of the image-entropy under random sampling. For tractability, we assume that the value set image is known (or has been exactly estimated). Following a derivation similar to that of the asymptotic distribution of the image-entropy (see Ref. [130], Chap. 2), we denote the parameter vector image and rewrite Eq. (4.264) as

image (4.266)

The first-order Taylor expansion of image around image gives

image (4.267)

where image, and image is

image (4.268)

where

image (4.269)

According to Ref. [130, Chap. 2], we have

image (4.270)

where the inverse of the Fisher information matrix of image is given by image. Then image is bounded in probability, and

image (4.271)

Hence, the random variables image and image have the same asymptotic distribution, and we obtain the following theorem.

Theorem 9

The estimate image, obtained by replacing the image with their relative frequencies image in a random sample of size image, satisfies

image (4.272)

provided image, where image, image, and

image (4.273)

where image is calculated as in Eq. (4.268).

The plug-in estimator of the image-entropy discussed above is closely related to certain estimators of the differential entropy.

4.7.3.1 Relation to KDE-based Differential Entropy Estimator

Suppose image are samples from a discrete random variable image. We rewrite the plug-in estimate (4.264) as

image (4.274)

where image. Denoting image and letting image, we construct another set of samples:

image (4.275)

which are samples from the discrete random variable image, and satisfy

image (4.276)

Now we consider image as samples from a “continuous” random variable image. The PDF of image can be estimated by the KDE approach:

image (4.277)

The kernel function satisfies image and image. If the kernel function is selected as the following uniform kernel:

image (4.278)

then Eq. (4.277) becomes

image (4.279)

where image follows from image, and image, image. The differential entropy of image can then be estimated as

image (4.280)

As a result, the plug-in estimate of the image-entropy is identical to a uniform kernel-based estimate of the differential entropy from the scaled samples (4.275).

4.7.3.2 Relation to Sample-Spacing Based Differential Entropy Estimator

The plug-in estimate of the image-entropy also has a close connection with the sample-spacing based estimate of the differential entropy. Suppose the sample data are distinct and have been rearranged in increasing order: image. Then the image-spacing estimate is given by [214]

image (4.281)

where image and image. If image, we obtain the one-spacing estimate:

image (4.282)
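For a numerical feel for this comparison, the sketch below implements one common form of the spacing estimator; the exact normalizing constants in Eq. (4.281) may differ from those used here, and setting the spacing order to one corresponds to the one-spacing estimate of Eq. (4.282). The function name and the test distribution are illustrative choices.

```python
import numpy as np

def m_spacing_entropy(samples, m=1):
    """Differential entropy estimate from m-spacings of the sorted sample.

    One common form: (1/(n - m)) * sum_i log((n + 1)/m * (x_(i+m) - x_(i))).
    The exact normalization in Eq. (4.281) of the text may differ slightly;
    this variant is illustrative only.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    spacings = np.maximum(x[m:] - x[:-m], 1e-12)   # guard against tied samples
    return float(np.mean(np.log((n + 1) / m * spacings)))

rng = np.random.default_rng(1)
x = rng.normal(size=10000)
print(m_spacing_entropy(x, m=1))        # one-spacing estimate (biased low,
                                        # roughly by Euler's constant ~ 0.577)
print(0.5 * np.log(2 * np.pi * np.e))   # true differential entropy of N(0,1) ~ 1.419
```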

On the other hand, based on the samples one can estimate the value set and corresponding probabilities:

image (4.283)

Then the plug-in estimate of the image-entropy will be

image (4.284)

It follows that

image (4.285)

where (b) comes from the concavity of the logarithm function. If image is bounded, we have

image (4.286)

In this case, the plug-in estimate of the image-entropy provides an asymptotic upper bound on the one-spacing entropy estimate.

4.7.4 Application to System Identification

The image-entropy can be used as an optimality criterion in system identification, especially when the error signal is distributed on a countable value set (which is usually unknown and varies with time).15 A typical example is system identification with quantized input/output (I/O) data, as shown in Figure 4.32, where image and image represent the quantized I/O observations obtained via the I/O quantizers image and image. With uniform quantization, image and image can be expressed as

image (4.287)

where image and image denote the quantization box-sizes, and image gives the largest integer less than or equal to image.
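As a small illustration, one plausible reading of Eq. (4.287) is the floor-based uniform quantizer sketched below; the function name and the box-size value are placeholders.

```python
import numpy as np

def uniform_quantize(x, box_size):
    """Uniform quantizer: maps x to box_size * floor(x / box_size),
    i.e., the largest grid point not exceeding x (cf. Eq. (4.287))."""
    return box_size * np.floor(np.asarray(x, dtype=float) / box_size)

u_q = uniform_quantize([0.23, -0.41, 1.07], box_size=0.25)
print(u_q)   # [ 0.  -0.5  1. ]
```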

image

Figure 4.32 System identification with quantized I/O data.

In practice, the image-entropy cannot, in general, be computed analytically, since the error’s values and the corresponding probabilities are unknown. In this case, we need to estimate the image-entropy by the plug-in method discussed previously. Traditional gradient-based methods, however, cannot be used to solve the image-entropy minimization problem, since the objective function is usually nondifferentiable. Thus, we have to resort to other methods, such as estimation of distribution algorithms (EDAs) [215], a class of evolutionary algorithms (EAs), although these are usually more computationally demanding. Instead of applying crossover and mutation as in traditional GAs, EDAs build a probability model (guided by the objective function values) of the promising individuals and sample new search points from it. Some theoretical results on the convergence and time complexity of EDAs can be found in Ref. [215]. Table 4.8 presents the EDA-based identification algorithm with the image-entropy criterion.

Table 4.8

EDA Based Identification Algorithm with image-entropy Criterion

Image

Usually, we use a Gaussian model with diagonal covariance matrix (GM/DCM) [215] to estimate the density function image of the imageth generation. With the GM/DCM model, we have

image (4.288)

where the means image and the standard deviations image can be estimated as

image (4.289)
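To make the procedure of Table 4.8 more concrete, the following is a minimal sketch of an EDA with the GM/DCM model and truncation selection, where the objective would typically be the plug-in image-entropy of the identification error. The population size, selection fraction, number of generations, and all function names are illustrative assumptions, not the settings of Table 4.8.

```python
import numpy as np

def eda_identify(objective, n_weights, pop=50, keep=0.3, n_gen=100, seed=0):
    """EDA sketch with a Gaussian model / diagonal covariance (GM/DCM).

    Each generation: sample candidate weight vectors from independent
    Gaussians, evaluate the objective (e.g., the plug-in Delta-entropy of the
    identification error), keep the best fraction (truncation selection), and
    re-estimate the means and standard deviations from the selected
    individuals, as in Eq. (4.289).
    """
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(n_weights), np.ones(n_weights)
    n_keep = max(2, int(keep * pop))

    for _ in range(n_gen):
        cand = rng.normal(mean, std, size=(pop, n_weights))    # sample the population
        scores = np.array([objective(w) for w in cand])        # lower is better
        best = cand[np.argsort(scores)[:n_keep]]               # truncation selection
        mean = best.mean(axis=0)                               # GM/DCM update
        std = best.std(axis=0) + 1e-6                          # keep a minimal spread
    return mean

# Usage (illustrative): identify a two-tap FIR model under the Delta-entropy
# criterion, where `delta_entropy_of_error(w)` would compute the plug-in
# Delta-entropy of z(k) - w[0]*u(k) - w[1]*u(k-1) from the training data.
# w_hat = eda_identify(delta_entropy_of_error, n_weights=2)
```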

In the following, two simple examples are presented to demonstrate the performance of the above algorithm. In all of the simulations below, we set image and image.

First, we consider system identification based on quantized I/O data, where the unknown system and the parametric model are both two-tap FIR filters, i.e.,

image (4.290)

The true weight vector of the unknown system is image, and the initial weight vector of the model is image. The input signal and the additive noise are both white Gaussian processes with variances image and 0.04, respectively. The number of training data is image. In addition, the quantization box-sizes image and image are equal. We compare the performance of three entropy criteria: the image-entropy, the differential entropy,17 and the discrete entropy. For different quantization sizes, the average evolution curves of the weight error norm over 100 Monte Carlo runs are shown in Figure 4.33. We can see that the image-entropy criterion achieves the best performance, while the discrete entropy criterion fails to converge (the discrete entropy cannot constrain the dispersion of the error values). When the quantization size becomes smaller, the performance of the differential entropy criterion approaches that of the image-entropy. This agrees with the limiting relationship between the image-entropy and the differential entropy.

image

Figure 4.33 Evolution curves of the weight error norm for different entropy criteria (adopted from Ref. [211]).

The second example illustrates that the image-entropy criterion may yield an approximately unbiased solution even when the input and output data are both corrupted by noise. Consider again the identification of a two-tap FIR filter in which image. We assume that the input signal image, the input noise image, and the output noise image are all zero-mean white Bernoulli processes with the distributions below

image (4.291)

where image, image, and image denote, respectively, the standard deviations of image, image, and image. In the simulation we set image, and the number of training data is image. Simulation results over 100 Monte Carlo runs are listed in Tables 4.9 and 4.10. For comparison purposes, we also present the results obtained using the MSE criterion. As one can see, the image-entropy criterion produces nearly unbiased estimates under various SNR conditions, whereas the MSE criterion yields biased solutions, especially as the input noise power increases.

Table 4.9

Simulation Results for Different image (image)

Image

(adopted from Ref. [211])

Table 4.10

Simulation Results for Different image (image)

Image

(adopted from Ref. [211])

4.8 System Identification with MCC

Correntropy is closely related to Renyi’s quadratic entropy. With a Gaussian kernel, correntropy is a localized similarity measure between two random variables: when two points are close, the correntropy induced metric (CIM) behaves like an L2 norm; outside the L2 zone, the CIM behaves like an L1 norm; and as the two points move further apart, the metric approaches the L0 norm [137]. This property makes the MCC a robust adaptation criterion in the presence of non-Gaussian impulsive noise. To close this chapter, we briefly discuss the application of the MCC criterion to system identification.

Consider a general scheme of system identification as shown in Figure 4.1. The objective of the identification is to optimize a criterion function (or cost function) in such a way that the model output image resembles as closely as possible the measured output image. Under the MCC criterion, the cost function that we want to maximize is the correntropy between the measured output and the model output, i.e.,

image (4.292)

In practical applications, one often uses the following empirical correntropy as the cost function:

image (4.293)
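As a quick illustration, the empirical correntropy of Eq. (4.293) with a Gaussian kernel can be computed as follows. Whether the Gaussian normalization constant is kept is a matter of convention and does not change the maximizing weight vector; the function name is a placeholder.

```python
import numpy as np

def empirical_correntropy(d, y, sigma):
    """Empirical correntropy between measured output d and model output y,
    using a Gaussian kernel of size sigma (normalization constant included;
    dropping it only rescales the cost)."""
    e = np.asarray(d, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(np.exp(-e**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)))
```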

Then a gradient-based identification algorithm can be easily derived as follows:

image (4.294)

When image, the above algorithm becomes a stochastic gradient-based (LMS-like) algorithm:

image (4.295)
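The following is a minimal sketch of the resulting LMS-like MCC update for an FIR model: the Gaussian-kernel factor simply scales the usual LMS correction term, which is what suppresses the influence of outlier-corrupted samples. The step-size, kernel size, and data-generation details are illustrative choices only.

```python
import numpy as np

def mcc_lms(u, d, n_taps, step=0.01, sigma=2.0):
    """Stochastic-gradient (LMS-like) identification under the MCC criterion.

    At each step the weight correction is the ordinary LMS term e(k) * x(k)
    weighted by the Gaussian kernel exp(-e(k)^2 / (2*sigma^2)), so samples hit
    by impulsive noise (large e) contribute very little to the update.
    """
    u = np.asarray(u, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(n_taps)
    for k in range(n_taps - 1, len(u)):
        x = u[k - n_taps + 1:k + 1][::-1]      # current input (regression) vector
        e = d[k] - w @ x                       # identification error
        w = w + step * np.exp(-e**2 / (2 * sigma**2)) * e * x
    return w

# Example: identify a two-tap FIR plant from noisy data
rng = np.random.default_rng(2)
u = rng.normal(size=5000)
d = 0.8 * u + 0.2 * np.r_[0.0, u[:-1]] + 0.05 * rng.normal(size=5000)
print(mcc_lms(u, d, n_taps=2))   # should be close to [0.8, 0.2]
```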

In the following, we present two simulation examples of FIR identification to demonstrate the performance of the MCC criterion and compare it with that of the MSE and MEE criteria.

In the first example, the weight vector of the plant is [138]

image (4.296)

The input signal is a white Gaussian process with zero mean and unit variance. The noise distribution is a mixture of Gaussian:

image (4.297)

In this distribution, the Gaussian density with variance 10 creates strong outliers. The kernel sizes for the MCC and the MEE are set to 2.0. The step-sizes for the three identification criteria are chosen such that, when the observation noise is Gaussian, their performance is similar in terms of the weight signal-to-noise ratio (WSNR),

image (4.298)
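For readers reproducing the curves, the WSNR can be computed as below; this sketch assumes the common definition WSNR = 10 log10(||w_o||^2 / ||w_o - w_hat||^2), with w_o the true weight vector, which may differ in detail from Eq. (4.298).

```python
import numpy as np

def wsnr_db(w_true, w_est):
    """Weight SNR in dB, assuming WSNR = 10*log10(||w_o||^2 / ||w_o - w_hat||^2)."""
    w_true = np.asarray(w_true, dtype=float)
    err = w_true - np.asarray(w_est, dtype=float)
    return float(10 * np.log10(w_true @ w_true / (err @ err)))
```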

Figure 4.34 shows the performance in the presence of impulsive noise. One can see that the MCC criterion achieves more robust performance.

image

Figure 4.34 WSNR of three identification criteria in impulsive measurement noise (adopted from Ref. [138]).

In the second example, the plant has a time-varying transfer function, with the weight vector changing as follows [138]:

image (4.299)

where image is the unit step function. The simulation results are shown in Figure 4.35. Once again, the MCC criterion performs better in the impulsive noise environment.

image

Figure 4.35 WSNR of three identification criteria while tracking a time-varying system in impulsive measurement noise (adopted from Ref. [138]).

Appendix H Vector Gradient and Matrix Gradient

In many system identification applications, we encounter a cost function that we want to maximize (or minimize) with respect to a vector or matrix. To carry out this optimization, we usually need vector or matrix derivatives.

Derivative of Scalar with Respect to Vector

If image is a scalar and image is an image vector, then

image (H.1)

image (H.2)

Derivative of Scalar with Respect to Matrix

If image is a scalar and image is an image matrix, image, then

image (H.3)

Derivative of Vector with Respect to Vector

If image and image are, respectively, image and image vectors, then

image (H.4)

image (H.5)

Second Derivative (Hessian matrix) of Scalar with Respect to Vector

image (H.6)

With the above definitions, there are some basic results (a numerical sanity check is sketched after the list):

1. image, where image and image are scalars, image are constants.

2. image.

3. image.

4. image, where image and image are both image vectors, and image and image are independent.

5. image, image, where image is a matrix independent of image.

6. Let image and image be respectively image and image vectors, image and image be respectively image and image constant matrices. Then

image

7. If image is a image matrix with independent elements, then

image
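These identities are easy to sanity-check numerically. As an example, the sketch below compares the analytic gradient of the quadratic form x^T A x, namely (A + A^T) x, with a central finite-difference approximation; the example is illustrative and not tied to any particular result above.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

analytic = (A + A.T) @ x                             # d/dx (x^T A x) = (A + A^T) x
numeric = numerical_gradient(lambda v: v @ A @ v, x)
print(np.allclose(analytic, numeric, atol=1e-5))     # True
```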


1In the case of black-box identification, the model parameters are basically viewed as vehicles for adjusting the fit to data and do not reflect any physical consideration in the system.

2For a dynamic system, the input vector may contain past inputs and outputs.

3The Mercer theorem states that any reproducing kernel image can be expanded as follows [160]:

image

where image and image are the eigenvalues and the eigenfunctions, respectively. The eigenvalues are nonnegative. Therefore, a mapping image can be constructed as

image


By construction, the dimensionality of image is determined by the number of strictly positive eigenvalues, which can be infinite (e.g., for the Gaussian kernel case).

4See Appendix H for the calculation of the gradient in vector or matrix form.

5The step-size is critical to the performance of the LMS. In general, the choice of step-size is a trade-off between the convergence rate and the asymptotic EMSE [19,20].

6The LMS algorithm usually assumes an FIR model [19,20].

7In practical applications, if the samples are not i.i.d., the KDE method can still be applied.

8The kernel function for density estimation is not necessarily a Mercer kernel. In this book, to make a distinction between these two types of kernels, we denote them by image and image, respectively.

9It should be noted that the error samples at time k are not necessarily a tap-delayed sequence. A more general expression for the error samples is image.

10Since the error entropy is shift-invariant, after training under the MEE criterion the bias of the output PE was adjusted so as to yield a zero-mean error over the training set.

11Here the input vector is a row vector.

12This configuration has been widely adopted due to its convenience for convergence analysis (e.g., see Ref. [196]).

13Similar to [29], the subscript image in image indicates that the Gaussian assumption A2 is the main assumption leading to the defined expression (4.163). The subscript image for image, which is defined later in (4.166), however, suggests that the independence assumption A4 is the major assumption in evaluating the expectation.

14Increasing the sliding data length can improve both the convergence speed and the steady-state performance; however, it will dramatically increase the computational burden (image).

15For the case in which the error distribution is continuous, one can still use the image-entropy as the optimization criterion by grouping the errors and working with the quantized data.

16Truncation selection is a widely used selection method in EDAs: individuals are sorted according to their objective (or fitness) function values and only the best individuals are selected.

17Strictly speaking, the differential entropy criterion is invalid in this example, because the error is discrete-valued. However, in the simulation one can still adopt the empirical differential entropy.
