3 Information Theoretic Parameter Estimation

Information theory is closely associated with estimation theory. For example, the maximum entropy (MaxEnt) principle has been widely used to deal with estimation problems given incomplete knowledge or data. Another example is the Fisher information, which is a central concept in statistical estimation theory. Its inverse yields a fundamental lower bound on the variance of any unbiased estimator, i.e., the well-known Cramér–Rao lower bound (CRLB). An interesting link between information theory and estimation theory was also shown for the Gaussian channel, which relates the derivative of the mutual information to the minimum mean square error (MMSE) [81].

3.1 Traditional Methods for Parameter Estimation

Estimation theory is a branch of statistics and signal processing that deals with estimating the unknown values of parameters based on measured (observed) empirical data. Many estimation methods can be found in the literature. In general, statistical estimation can be divided into two main categories: point estimation and interval estimation. Point estimation uses empirical data to calculate a single value of an unknown parameter, while interval estimation uses empirical data to calculate an interval of plausible values of an unknown parameter. In this book, we only discuss point estimation. The most common approaches to point estimation include maximum likelihood (ML), the method of moments (MM), MMSE (also known as Bayes least squared error), maximum a posteriori (MAP), and so on. These estimation methods fall into two categories, namely, classical estimation (ML, MM, etc.) and Bayes estimation (MMSE, MAP, etc.).

3.1.1 Classical Estimation

The general description of the classical estimation is as follows: let the distribution function of population image be image, where image is an unknown (but deterministic) parameter that needs to be estimated. Suppose image are samples (usually independent and identically distributed, i.i.d.) drawn from image (image are the corresponding sample values). The goal of estimation is then to construct an appropriate statistic image that serves as an approximation of the unknown parameter image. The statistic image is called an estimator of image, and its sample value image is called the estimated value of image. Both the samples image and the parameter image can be vectors.

The ML estimation and the MM are two prevalent types of classical estimation.

3.1.1.1 ML Estimation

The ML method, proposed by the famous statistician R.A. Fisher, underlies many well-known estimation methods in statistics. The basic idea of the ML method is quite simple: the event with the greatest probability is the most likely to occur. Thus, one should choose the parameter that maximizes the probability of the observed sample data. Assume that image is a continuous random variable with probability density function (PDF) image, image, where image is an unknown parameter and image is the set of all possible parameters. The ML estimate of parameter image is expressed as

image (3.1)

where image is the joint PDF of samples image. By considering the sample values image to be fixed “parameters,” this joint PDF is a function of the parameter image, called the likelihood function, denoted by image. If samples image are i.i.d., we have image. Then the ML estimate of image becomes

image (3.2)

In practice, it is often more convenient to work with the logarithm of the likelihood function (called the log-likelihood function). In this case, we have

image (3.3)

The ML estimate is the same whether we maximize the likelihood or the log-likelihood function, since the logarithm is a monotone transformation.

In most cases, the ML estimate can be found by setting the derivative of the log-likelihood function to zero:

image (3.4)

For many models, however, there is no closed-form solution for the ML estimate, and it has to be found numerically using optimization methods.
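As a purely illustrative sketch (not part of the original text), the following Python snippet finds the ML estimate of a Gamma distribution, whose shape parameter admits no closed-form solution, by numerically minimizing the negative log-likelihood; the chosen distribution, sample size, and optimizer are arbitrary assumptions.

```python
import numpy as np
from math import lgamma
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=1000)   # synthetic sample

def neg_log_likelihood(params):
    """Negative log-likelihood of a Gamma(k, theta) model for the sample x."""
    k, theta = params
    if k <= 0 or theta <= 0:                     # keep the search in the valid region
        return np.inf
    return -np.sum((k - 1) * np.log(x) - x / theta - k * np.log(theta) - lgamma(k))

# No closed-form ML solution exists for the Gamma shape parameter,
# so the log-likelihood is maximized numerically.
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
k_ml, theta_ml = result.x
print(f"ML estimates: shape ≈ {k_ml:.3f}, scale ≈ {theta_ml:.3f}")
```

In practice one would typically call scipy.stats.gamma.fit, which performs the same kind of numerical maximization; the explicit objective above is written out only to mirror the log-likelihood in (3.3).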

If the likelihood function involves latent variables in addition to unknown parameter image and known data observations image, one can use the expectation–maximization (EM) algorithm to find the ML solution [164,165] (see Appendix E). Typically, the latent variables are included in a likelihood function because either there are missing values among the data or the model can be formulated more simply by assuming the existence of additional unobserved data points.

ML estimators possess a number of attractive properties, especially as the sample size tends to infinity. In general, they have the following properties:

• Consistency: As the sample size increases, the estimator converges in probability to the true value being estimated.

• Asymptotic normality: As the sample size increases, the distribution of the estimator tends to the Gaussian distribution.

• Efficiency: The estimator achieves the CRLB when the sample size tends to infinity.

3.1.1.2 Method of Moments

The MM uses the sample algebraic moments to approximate the population algebraic moments, and then solves for the parameters. Consider a continuous random variable image, with PDF image, where image are image unknown parameters. By the law of large numbers, the image-order sample moment image of image converges in probability to the image-order population moment image, which is a function of image, i.e.,

image (3.5)

The sample moment image is a good approximation of the population moment image; thus one can obtain an estimator of the parameters image (image) by solving the following equations:

image (3.6)

The solution of (3.6) is the MM estimator, denoted by image, image.
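To make the procedure concrete, here is a minimal sketch (our own illustration, assuming a Gamma population) that solves the two moment equations E[X] = kθ and Var[X] = kθ² for the parameters; matching the mean and variance is equivalent to matching the first two algebraic moments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=5000)   # synthetic sample

# Match the first two sample moments to the population moments of Gamma(k, theta):
#   E[X] = k * theta,   Var[X] = k * theta**2
m1 = np.mean(x)
v = np.var(x)

theta_mm = v / m1          # solve the two moment equations
k_mm = m1 / theta_mm

print(f"MM estimates: shape ≈ {k_mm:.3f}, scale ≈ {theta_mm:.3f}")
```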

3.1.2 Bayes Estimation

The basic viewpoint of Bayesian statistics is that in any statistical inference problem, a prior distribution must be prescribed as a basic element of the reasoning process, in addition to the empirical data. Unlike classical estimation, Bayes estimation regards the unknown parameter as a random variable (or random vector) with some prior distribution. In many situations, this prior distribution does not need to be precise, and can even be improper (e.g., a uniform distribution on the whole space). Since the unknown parameter is a random variable, in the following we use image to denote the parameter to be estimated, and image to denote the observation data image.

Assume that both the parameter image and observation image are continuous random variables with joint PDF

image (3.7)

where image is the marginal PDF of image (the prior PDF) and image is the conditional PDF of image given image (also known as the likelihood function if considering image as the function’s variable). By using the Bayes formula, one can obtain the posterior PDF of image given image:

image (3.8)

Let image be an estimator of image (based on the observation image), and let image be a loss function that measures the difference between random variables image and image. The Bayes risk of image is defined as the expected loss (the expectation is taken over the joint distribution of image and image):

image (3.9)

where image denotes the posterior expected loss (posterior Bayes risk). An estimator is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Thus, the Bayes estimator can be obtained by solving the following optimization problem:

image (3.10)

where image denotes all Borel measurable functions image. Obviously, the Bayes estimator also minimizes the posterior Bayes risk for each image.

The loss function in Bayes risk is usually a function of the estimation error image. The common loss functions used for Bayes estimation include:

1. squared error function: image;

2. absolute error function: image;

3. 0–1 loss function: image, where image denotes the delta function.1

The squared error loss corresponds to the MSE criterion, which is perhaps the most prevalent risk function in use due to its simplicity and efficiency. With the above loss functions, the Bayes estimates of the unknown parameter are, respectively, the mean, median, and mode2 of the posterior PDF image, i.e.,

image (3.11)

The estimators (a) and (c) in (3.11) are known as the MMSE and MAP estimators. A simple proof of the MMSE estimator is given in Appendix F. It should be noted that if the posterior PDF is symmetric and unimodal (SUM, such as Gaussian distribution), the three Bayes estimators are identical.

The MAP estimate is a mode of the posterior distribution. It is the limit of Bayes estimation under the 0–1 loss function. When the prior distribution is uniform (i.e., a constant function), the MAP estimation coincides with the ML estimation. Indeed, in this case we have

image (3.12)

where (a) comes from the fact that image is a constant.

Besides the previous common risks, other Bayes risks can be conceived, such as the mean image-power error [30], Huber’s M-estimation cost [33], and the risk-sensitive cost [38]. It has been shown in [24] that if the posterior PDF is symmetric, the posterior mean is an optimal estimate for a large family of Bayes risks in which the loss function is even and convex.

In general, a Bayes estimator is a nonlinear function of the observation. However, if image and image are jointly Gaussian, then the MMSE estimator is linear. Suppose image, image, with jointly Gaussian PDF

image (3.13)

where image is the covariance matrix:

image (3.14)

Then the posterior PDF image is also Gaussian and has mean (the MMSE estimate)

image (3.15)

which is, obviously, a linear function of image.
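A small numerical sketch (illustrative only, with an assumed 2×2 covariance) confirms that the linear estimator of (3.15) attains the theoretical minimum MSE for jointly Gaussian variables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Jointly Gaussian (theta, r): r is a noisy observation correlated with theta.
mu = np.array([0.0, 0.0])                        # [mean of theta, mean of r]
cov = np.array([[1.0, 0.8],
                [0.8, 1.5]])                     # assumed joint covariance
samples = rng.multivariate_normal(mu, cov, size=100_000)
theta, r = samples[:, 0], samples[:, 1]

# Linear MMSE estimator: E[theta | r] = mu_theta + C_tr / C_rr * (r - mu_r)
gain = cov[0, 1] / cov[1, 1]
theta_hat = mu[0] + gain * (r - mu[1])

print(f"empirical MSE of linear MMSE estimator: {np.mean((theta - theta_hat)**2):.4f}")
print(f"theoretical MMSE: {cov[0, 0] - cov[0, 1]**2 / cov[1, 1]:.4f}")
```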

There are close relationships between estimation theory and information theory. The concepts and principles of information theory can shed new light on estimation problems and suggest new methods for parameter estimation. In the sequel, we discuss information theoretic approaches to parameter estimation.

3.2 Information Theoretic Approaches to Classical Estimation

In the literature, there have been many reports on the use of information theory to deal with classical estimation problems (e.g., see [149]). Here, we only give several typical examples.

3.2.1 Entropy Matching Method

Similar to the MM, the entropy matching method obtains the parameter estimator by using the sample entropy (entropy estimator) to approximate the population entropy. Suppose the PDF of population image is image (image is an unknown parameter). Then its differential entropy is

image (3.16)

At the same time, one can use the sample image to calculate the sample entropy image.3 Thus, we can obtain an estimator of parameter image through solving the following equation:

image (3.17)

If there are several parameters, the above equation may have an infinite number of solutions; a unique solution can be obtained by combining it with the MM. In [166], the entropy matching method was used to estimate the parameters of the generalized Gaussian distribution (GGD).
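The following sketch (our own example, not from [166]) applies entropy matching to a Gaussian scale parameter: the model entropy 0.5·log(2πeσ²) is set equal to a Vasicek-type spacing estimate of the sample entropy (one of the estimators discussed in Chapter 4) and solved for σ.

```python
import numpy as np

def spacing_entropy(x, m=None):
    """Vasicek m-spacing estimate of differential entropy (one common choice)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = m or int(round(np.sqrt(n)))
    upper = x[np.minimum(np.arange(n) + m, n - 1)]
    lower = x[np.maximum(np.arange(n) - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

rng = np.random.default_rng(3)
sigma_true = 2.0
x = rng.normal(0.0, sigma_true, size=5000)

# Entropy matching: set the model entropy 0.5*log(2*pi*e*sigma^2)
# equal to the sample entropy and solve for sigma.
H_hat = spacing_entropy(x)
sigma_hat = np.sqrt(np.exp(2.0 * H_hat) / (2.0 * np.pi * np.e))
print(f"entropy-matching estimate of sigma: {sigma_hat:.3f} (true {sigma_true})")
```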

3.2.2 Maximum Entropy Method

The maximum entropy method applies the famous MaxEnt principle to parameter estimation. The basic idea is that, subject to the information available, one should choose the parameter image such that the entropy is as large as possible, or the distribution as nearly uniform as possible. Here, the maximum entropy method refers to a general approach rather than a specific parameter estimation method. In the following, we give three examples of maximum entropy method.

3.2.2.1 Parameter Estimation of Exponential Type Distribution

Assume that the PDF of population image is of the following form:

image (3.18)

where image, image, are image (generalized) characteristic moment functions, image is an unknown parameter vector to be estimated. Many known probability distributions are special cases of this exponential type distribution. By Theorem 2.2, image is the maximum entropy distribution satisfying the following constraints:

image (3.19)

where image denotes the population characteristic moment:

image (3.20)

Since image is unknown, the population characteristic moments cannot be calculated, but we can approximate them using the sample characteristic moments. An estimator of the parameter image can then be obtained by solving the following optimization problem:

image (3.21)

where image, image, are image sample characteristic moments, i.e.,

image (3.22)

According to Theorem 2.2, the estimator of image satisfies the equations:

image (3.23)

If image, the above estimation method will be equivalent to the MM.
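As an illustration of (3.21) (a sketch under our own assumptions), the code below fits a maximum entropy density of the exponential type exp(λ₁x + λ₂x² − log Z) by minimizing the standard dual objective log Z(λ) − λᵀm̂, where m̂ collects the sample characteristic moments (here the sample mean and mean square); the grid-based integration and the choice of moment functions are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.normal(loc=0.5, scale=1.2, size=2000)
m = np.array([np.mean(x), np.mean(x**2)])        # sample characteristic moments

grid = np.linspace(-10.0, 10.0, 4001)            # numerical support for the density

def dual(lmbda):
    """Dual objective log Z(lambda) - lambda . m of the MaxEnt problem."""
    energy = np.clip(lmbda[0] * grid + lmbda[1] * grid**2, None, 500.0)  # overflow guard
    log_Z = np.log(np.trapz(np.exp(energy), grid))
    return log_Z - lmbda @ m

res = minimize(dual, x0=np.array([0.0, -0.5]), method="Nelder-Mead")
lam = res.x
# With moment functions (x, x^2) the fitted MaxEnt density is Gaussian,
# so its mean and variance can be read off from the natural parameters.
print(f"MaxEnt fit: mean ≈ {-lam[0] / (2 * lam[1]):.3f}, "
      f"variance ≈ {-1.0 / (2 * lam[1]):.3f}")
```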

3.2.2.2 Maximum Spacing Estimation

Suppose the distribution function of population image is image, and the true value of the unknown parameter image is image. Then the random variable image is distributed over the interval image, and this distribution is uniform if image. According to the MaxEnt principle, a distribution over a finite interval attains the maximum entropy when it is uniform. Therefore, the entropy of the random variable image attains its maximum value if image, and one can obtain an estimator of the parameter image by maximizing the sample entropy of image. Let a sample of population image be image; the corresponding sample of image is image. Letting image denote the sample entropy of image, the estimator of parameter image can be expressed as

image (3.24)

If the sample entropy is calculated by using the one-spacing estimation method (see Chapter 4), then we have

image (3.25)

where image denotes the order statistics of image. Formula (3.25) is called the maximum spacing estimation of parameter image.
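A brief sketch of (3.24)–(3.25) (our own example, assuming a Gaussian model with unknown mean and known unit variance): the estimate maximizes the mean log-spacing of the transformed order statistics under the candidate distribution function.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.sort(rng.normal(loc=1.5, scale=1.0, size=500))   # known scale, unknown mean (assumed setup)

def neg_msp(mu):
    """Negative mean log-spacing of F_mu evaluated at the order statistics, padded with 0 and 1."""
    u = norm.cdf(x, loc=mu, scale=1.0)
    spacings = np.diff(np.concatenate(([0.0], u, [1.0])))
    spacings = np.maximum(spacings, 1e-300)     # guard against zero spacings
    return -np.mean(np.log(spacings))

res = minimize_scalar(neg_msp, bounds=(-5.0, 5.0), method="bounded")
print(f"maximum spacing estimate of the mean: {res.x:.3f} (true 1.5)")
```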

3.2.2.3 Maximum Equality Estimation

Suppose image is an i.i.d. sample of population image with PDF image. Let image be the order statistics of image. Then the random sample divides the real axis into image subintervals image, image, where image and image. Each subinterval has the probability:

image (3.26)

Since the sample is random and i.i.d., the most reasonable situation is that the probabilities of the image subintervals are equal. Hence, the parameter image should be chosen in such a way as to maximize the entropy of the distribution image (or to make image as nearly uniform as possible), i.e.,

image (3.27)

The above estimation is called the maximum equality estimation of parameter image.

It is worth noting that besides parameter estimation, the MaxEnt principle can also be applied to spectral density estimation [48]. The general idea is that the maximum entropy rate stochastic process satisfying the given autocorrelation and variance constraints is a linear Gauss–Markov process driven by an i.i.d. zero-mean Gaussian input.

3.2.3 Minimum Divergence Estimation

Let image be an i.i.d. random sample from a population image with PDF image, image. Let image be the estimated PDF based on the sample, and let image be an estimator of image, so that image is also an estimator of image. The estimator image should then be chosen so that image is as close as possible to image. This can be achieved by minimizing any measure of information divergence, say the KL-divergence

image (3.28)

or, alternatively,

image (3.29)

The estimate image in (3.28) or (3.29) is called the minimum divergence (MD) estimate of image. In practice, we usually use (3.28) for parameter estimation, because it can be simplified as

image (3.30)

Depending on the estimated PDF image, the MD estimator may take many different forms. Next, we present three specific examples of MD estimator.

• MD Estimator 1
Without loss of generality, we assume that the sample satisfies image. Then the distribution function can be estimated as

image (3.31)


Thus, we have

image (3.32)

where image is the likelihood function. In this case, the MD estimation is exactly the ML estimation.

• MD Estimator 2
Suppose the population image is distributed over the interval image, and the sample satisfies image. It is reasonable to assume that in each subinterval image, the probability is image. Hence, the PDF of image can be estimated as

image (3.33)


Substituting (3.33) into (3.30) yields

image (3.34)


If image is a continuous function of image, then according to the mean value theorem of integral calculus, we have

image (3.35)

where image. Hence, (3.34) can be written as

image (3.36)


The above parameter estimation is very similar in form to the ML estimation. However, unlike Fisher’s likelihood function, the values of the cost function in (3.36) are taken at image, which are determined by the mean value theorem of integral calculus.

• MD Estimator 3
Assume population image is distributed over the interval image, and this interval is divided into image subintervals image (image) by image data image. The probability of each subinterval is determined by

image (3.37)


If image are given proportions of the population that lie in the image subintervals (image), then parameter image should be chosen so as to make image and image as close as possible, i.e.,

image (3.38)


This is a useful estimation approach, especially when the available information consists of proportions of the population, such as the proportions of persons in different income intervals or the proportions of students in different score intervals (a numerical sketch of this case is given at the end of this subsection).
In the previous MD estimations, the KL-divergence can be replaced by other divergence measures. For instance, using the image-divergence, we have

image (3.39)

or

image (3.40)


For details on the minimum image-divergence estimation, readers can refer to [130].
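The sketch below (an illustration with assumed numbers) implements the idea of MD Estimator 3 mentioned above: only the proportions of the population falling in a few intervals are available, and the parameter of an exponential model is chosen to minimize the KL-divergence between these proportions and the model's interval probabilities.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon

# Hypothetical setting for MD Estimator 3: only bin proportions are observed.
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0, np.inf])           # income-like intervals
observed_prop = np.array([0.38, 0.24, 0.22, 0.12, 0.04])       # assumed known proportions

def kl_to_model(rate):
    """KL-divergence between observed proportions and exponential interval probabilities."""
    cdf = expon.cdf(edges, scale=1.0 / rate)
    model_prob = np.maximum(np.diff(cdf), 1e-12)
    return np.sum(observed_prop * np.log(observed_prop / model_prob))

res = minimize_scalar(kl_to_model, bounds=(1e-3, 10.0), method="bounded")
print(f"minimum-divergence estimate of the exponential rate: {res.x:.3f}")
```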

3.3 Information Theoretic Approaches to Bayes Estimation

Bayes estimation can also be embedded within the framework of information theory. In particular, some information theoretic measures, such as entropy and correntropy, can be used in place of the traditional Bayes risks.

3.3.1 Minimum Error Entropy Estimation

In the scenario of Bayes estimation, the minimum error entropy (MEE) estimation aims to minimize the entropy of the estimation error, and hence decrease the uncertainty in estimation. Given two random variables: image, an unknown parameter to be estimated, and image, the observation (or measurement), the MEE (with Shannon entropy) estimation of image based on image can be formulated as

image (3.41)

where image is the estimation error, image is an estimator of image based on image, image is a measurable function, image stands for the collection of all measurable functions image, and image denotes the PDF of the estimation error. Comparing (3.41) with (3.9), one concludes that the "loss function" in MEE is image, which is different from traditional Bayes risks such as MSE. Indeed, one does not need to impose a risk functional in MEE; the risk is directly related to the error PDF. Obviously, other entropy definitions (such as the order-image Renyi entropy) can also be used in MEE estimation. This feature is potentially beneficial because the risk is matched to the error distribution.
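To show how (3.41) can be used in practice, the following sketch (our own construction) restricts the estimator to the linear family g(r) = a·r + b and minimizes a Parzen (kernel density) estimate of the order-2 Renyi entropy of the error, one of the alternative entropy definitions mentioned above; the data model, kernel size, and optimizer are arbitrary choices, and the entropy estimate is defined only up to an additive constant, which does not affect the minimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 500
theta = rng.standard_t(df=3, size=n)                 # non-Gaussian parameter (assumed model)
r = theta + 0.5 * rng.standard_t(df=3, size=n)       # noisy observation

def renyi2_entropy(e, sigma=0.5):
    """Parzen estimate of order-2 Renyi entropy of e, up to an additive constant."""
    diff = e[:, None] - e[None, :]
    information_potential = np.mean(np.exp(-diff**2 / (4.0 * sigma**2)))
    return -np.log(information_potential)

def mee_cost(params):
    a, b = params
    return renyi2_entropy(theta - (a * r + b))

a_mee, b_mee = minimize(mee_cost, x0=[0.0, 0.0], method="Nelder-Mead").x
# Entropy is invariant to the error mean (see Property 1 below), so the bias
# is fixed afterwards by removing the residual mean of the error.
b_mee += np.mean(theta - (a_mee * r + b_mee))
print(f"linear MEE estimator: a ≈ {a_mee:.3f}, b ≈ {b_mee:.3f}")
```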

The early work on MEE estimation can be traced back to the late 1960s, when Weidemann and Stear [86] studied the use of error entropy as a criterion (risk) function for analyzing the performance of sampled-data estimation systems. They proved that minimizing the error entropy is equivalent to minimizing the mutual information between the error and the observation, and also proved that the reduction in error entropy is upper-bounded by the amount of information obtained from the observation. Minamide [89] extended Weidemann and Stear’s results to continuous-time estimation systems. Tomita et al. applied the MEE criterion to linear Gaussian systems and studied state estimation (Kalman filtering), smoothing, and prediction problems from the information theory viewpoint. In recent years, MEE has become an important criterion in supervised machine learning [64].

In the following, we present some important properties of the MEE criterion and discuss its relationship to conventional Bayes risks. For simplicity, we assume that the error image is a scalar (image); the extension to arbitrary dimensions is straightforward.

3.3.1.1 Some Properties of MEE Criterion

Property 1:

image, image

Proof:

This is the shift invariance of the differential entropy. According to the definition of differential entropy, we have

image (3.42)

Remark:

The MEE criterion is invariant with respect to the error’s mean. In practice, in order to obtain small error values, the MEE estimate is usually restricted to a zero-mean (unbiased) error, which requires special user attention (i.e., mean removal). We should note that the unbiased MEE estimate can still be nonunique (see Property 6).

Property 2:

If image is a random variable independent of the error image, then image.

Proof:

According to the properties of differential entropy and the independence condition, we have

image (3.43)

Remark:

Property 2 implies that MEE criterion is robust to independent additive noise. Specifically, if error image contains an independent additive noise image, i.e., image, where image is the true error, then minimizing the contaminated error entropy image will constrain the true error entropy image.

Property 3:

Minimizing the error entropy image is equivalent to minimizing the mutual information between error image and the observation image, i.e., image.

Proof:

As image, we have

image (3.44)

It follows easily that

image (3.45)

where (a) comes from the fact that the conditional entropy image is not related to image.

Remark:

The minimization of the error entropy also minimizes the mutual information between the error and the observation. Hence, under the MEE criterion, the observation is "fully utilized," so that the error contains the least information about the observation.

Property 4:

The error entropy image is lower bounded by the conditional entropy image, and this lower bound is achieved if and only if error image is independent of the observation image.

Proof:

By (3.44), we have

image (3.46)

where the inequality follows from the fact that image, with equality if and only if error image and observation image are independent.

Remark:

The error entropy can never be smaller than the conditional entropy of the parameter image given observation image. This lower bound is achieved if and only if the error contains no information about the observation.

For MEE estimation, there is no explicit expression for the optimal estimate unless some constraints on the posterior PDF are imposed. The next property shows that, if the posterior PDF is SUM, the MEE estimate will be equal to the conditional mean (i.e., the MMSE estimate).

Property 5:

If for any image, the posterior PDF image is SUM, then the MEE estimate equals the posterior mean, i.e., image.

Proof:

This property is a direct consequence of Theorem 1 of [91] (proof omitted).

Remark:

Since the posterior PDF is SUM, in the above property the MEE estimate also equals the posterior median and posterior mode. We point out that this property still holds for the order-image Renyi entropy. One may be led to conclude that MEE has no advantage over MMSE, since they yield the same solution in the SUM case. However, the risk functional in MEE is still well defined for unimodal but asymmetric distributions, although no general results are known for this class of distributions, and we suspect that the practical advantages of MEE over MSE reported in the literature [64] exploit precisely this case. Moreover, in the context of adaptive filtering, the two criteria may yield very different performance surfaces even when they share the same optimal solution.

Table 3.1 gives a summary of the optimal estimates for several Bayes estimation methods.

Table 3.1

Optimal Estimates for Several Bayes Estimation Methods

Image

Property 6:

The MEE estimate may be nonunique even if the error distribution is restricted to zero mean (unbiased).

Proof:

We prove this property by means of a simple example as follows [94]: suppose image is a discrete random variable with Bernoulli distribution:

image (3.47)

The posterior PDF of image given image is (image):

image

image

Given an estimator image, the error PDF will be

image (3.48)

Let image be an unbiased estimator, then image, and hence image. Assuming image (due to symmetry, one can obtain similar results for image), the error entropy can be calculated as

image (3.49)

One can easily verify that the error entropy achieves its minimum value when image. Clearly, in this example there are infinitely many unbiased MEE estimators.

Property 7 (Score Orthogonality [92]):

Given an MEE estimator image, the error’s score image is orthogonal to any measurable function image of image, where image is defined as

image (3.50)

where image denotes the error PDF for estimator image.

Proof:

Given an MEE estimator image, image and image, we have

image (3.51)

For image small enough, image will be a “small” random variable. According to [167], we have

image (3.52)

where image denotes the higher order terms. Then the derivative of entropy image with respect to image at image is

image (3.53)

Combining (3.51) and (3.53) yields

image (3.54)

Remark:

If the error is zero-mean Gaussian distributed with variance image, the score function will be image. In this case, the score orthogonality condition reduces to image. This is the well-known orthogonality condition for MMSE estimation.

In MMSE estimation, the orthogonality condition is a necessary and sufficient condition for optimality and can be used to find the MMSE estimator. In MEE estimation, however, the score orthogonality condition is only a necessary condition for optimality, not a sufficient one. Next, we present an example demonstrating that an estimator satisfying the score orthogonality condition can be a local minimum or even a local maximum of image in a certain direction. Before proceeding, we give a definition.

Definition 3.1

Given an estimator image, image is said to be a local minimum (or maximum) of image in the direction of image, if and only if image, such that image, image, we have

image (3.55)

Example 3.1 [92]

Suppose the joint PDF of image and image is the mix-Gaussian density (image):

image (3.56)

The MMSE estimation of image based on image can be computed as image. It is easy to check that the estimator image satisfies the score orthogonality condition.

Now we examine whether the MMSE estimator image is a local minimum or maximum of the error entropy image in a certain direction image. We focus here on the case where image (linear function). In this case, we have

image (3.57)

For the case image, the error’s entropies image with respect to different image and image values are shown in Figure 3.1, from which we see that the MMSE estimator (image) is a local minimum of the error entropy in the direction of image for image and a local maximum for image. In addition, for the case image, the error’s entropies image with respect to different image and image values are depicted in Figure 3.2. It can be seen from Figure 3.2 that when image, image is a global minimum of image; while when image, it becomes a local maximum.

image

Figure 3.1 Error entropy image for different image and image values. Source: Adapted from [92].

image

Figure 3.2 Error entropy image for different image and image values. Source: Adapted from [92].

Whether a point is a local minimum or a local maximum can also be determined from the second-order derivative of the error entropy with respect to image. For instance, if image, we can calculate the following second-order derivative using the results of [167]:

image (3.58)

And hence

image (3.59)

which implies that if image, the MMSE estimator (image) will be a local minimum of the error entropy in the direction of image, whereas if image, it becomes a local maximum.

As can be seen from Figure 3.2, if image, the error entropy image achieves its global minima at image. Figure 3.3 depicts the error PDF for image (the MMSE estimator) and image (the linear MEE estimator), where image. We can see that the MEE solution is not unique in this case, but it is much more concentrated (with a higher peak) than the MMSE solution, which potentially gives an estimator with a much smaller variance. We can also observe that the peaks of the MMSE and MEE error distributions occur at two different error locations, which means that the best parameter sets are different in each case.

image

Figure 3.3 Error PDF image for image and image. Source: Adapted from [92].

3.3.1.2 Relationship to Conventional Bayes Risks

The loss function of the MEE criterion is directly related to the error’s PDF, which is quite different from the loss functions in conventional Bayes risks, since the user does not need to select a risk functional. To some extent, the error entropy can be viewed as an "adaptive" Bayes risk, in which the loss function varies with the error distribution; that is, different error distributions correspond to different loss functions. Figure 3.4 shows the loss functions (the lower subplots) of MEE corresponding to three different error PDFs. Notice that the third case yields a risk function that is nonconvex in the space of the errors. This is an unconventional risk function, because the role of the weight function is to privilege one solution over all others in the space of the errors.

image

Figure 3.4 The loss functions of MEE corresponding to three different error PDFs.

There is an important relationship between the MEE criterion and the traditional MSE criterion. The following theorem shows that the MSE is equivalent to the error entropy plus the KL-divergence between the error PDF and any zero-mean Gaussian density.

Theorem 3.1

Let image denote a Gaussian density, image, where image. Then we have

image (3.60)

Proof:

Since image, we have

image

And hence

image (3.61)

It follows easily that image.

Remark:

The above theorem suggests that the minimization of the MSE minimizes both the error entropy image and the KL-divergence image. Thus MMSE estimation decreases the error entropy and, at the same time, makes the error distribution close to a zero-mean Gaussian distribution. In nonlinear and non-Gaussian estimation systems, the desirable error distribution can be far from Gaussian, yet the MSE criterion still pushes the error distribution toward a zero-mean Gaussian, which may lead to poor estimation performance. Theorem 3.1 therefore also explains why the performance of MMSE estimation may be poor for non-Gaussian samples.

Suppose that the error is zero-mean Gaussian distributed. Then we have image, where image. In this case, the MSE criterion is equivalent to the MEE criterion.

From the derivation of Theorem 3.1, letting image, we have

image (3.62)

Since image, then

image (3.63)

Inequality (3.63) suggests that minimizing the MSE is equivalent to minimizing an upper bound on the error entropy.

A similar relationship exists between the MEE criterion and a large family of Bayes risks [168]. To prove this fact, we need a lemma.

Lemma 3.1

Any Bayes risk image corresponds to a PDF4 as follows:

image (3.64)

where image and image satisfy

image (3.65)

Theorem 3.2

For any Bayes risk image, if the loss function image satisfies image, then

image (3.66)

where image is the PDF given in (3.64).

Proof:

First, we show that in the PDF image, image is a positive number. Since image satisfies image and image, we have image. Then

image (3.67)

where (a) follows from image. Therefore,

image (3.68)

which completes the proof.

Remark:

The condition image in the theorem is not very restrictive, because for most Bayes risks the loss function increases rapidly as image goes to infinity. It does not, however, apply to the maximum correntropy (MC) criterion studied next.

3.3.2 MC Estimation

Correntropy is a novel measure of similarity between two random variables [64]. Let image and image be two random variables with the same dimensions; the correntropy is defined as

image (3.69)

where image is a translation invariant Mercer kernel. The most popular kernel used in correntropy is the Gaussian kernel:

image (3.70)

where image denotes the kernel size (kernel width). The Gaussian kernel image is translation invariant and is a function of image, so it can be rewritten as image.

Compared with other similarity measures, such as the mean square error, correntropy (with a Gaussian kernel) has some nice properties: (i) it is always bounded (image); (ii) it contains all even-order moments of the difference variable (as can be seen from a series expansion of the Gaussian kernel); (iii) the weights of the higher order moments are controlled by the kernel size; and (iv) it is a local similarity measure and is very robust to outliers.
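A small sketch (our own illustration) estimates correntropy from samples with a Gaussian kernel and contrasts its robustness with the MSE when a few outliers are injected; the normalization constant of the Gaussian kernel is dropped here, which only rescales the value.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimate of correntropy with an (unnormalized) Gaussian kernel."""
    d = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-d**2 / (2.0 * sigma**2)))

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
y_clean = x + 0.1 * rng.normal(size=1000)
y_outlier = y_clean.copy()
y_outlier[:10] += 50.0                       # a few large outliers

print(f"correntropy, clean pair   : {correntropy(x, y_clean):.3f}")
print(f"correntropy, with outliers: {correntropy(x, y_outlier):.3f}")
print(f"MSE, clean pair           : {np.mean((x - y_clean)**2):.3f}")
print(f"MSE, with outliers        : {np.mean((x - y_outlier)**2):.3f}")
```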

The correntropy function can also be applied to Bayes estimation [169]. Let image be an unknown parameter to be estimated and image be the observation. For simplicity, we assume that image is a scalar random variable (the extension to the vector case is straightforward), image, and image is a random vector taking values in image. The MC estimation of image based on image is to find a measurable function image such that the correntropy between image and image is maximized, i.e.,

image (3.71)

With any translation invariant kernel such as the Gaussian kernel image, the MC estimator will be

image (3.72)

where image is the estimation error. If image, image has posterior PDF image, then the estimation error has PDF

image (3.73)

where image denotes the distribution function of image. In this case, we have

image (3.74)

The following theorem shows that the MC estimation is a smoothed MAP estimation.

Theorem 3.3

The MC estimator (3.74) can be expressed as

image (3.75)

where image (“image” denotes the convolution operator with respect to image).

Proof:

One can derive

image (3.76)

where image and (a) comes from the symmetry of image. It follows easily that

image (3.77)

This completes the proof.

Remark:

The function image can be viewed as a smoothed version (through convolution) of the posterior PDF image. Thus according to Theorem 3.3, the MC estimation is in essence a smoothed MAP estimation, which is the mode of the smoothed posterior distribution. The kernel size image plays an important role in the smoothing process by controlling the degree of smoothness. When image, the Gaussian kernel will approach the Dirac delta function, and the function image will reduce to the original posterior PDF. In this case, the MC estimation is identical to the MAP estimation. On the other hand, when image, the second-order moment will dominate the correntropy, and the MC estimation will be equivalent to the MMSE estimation [137].

In the global optimization literature, the convolution smoothing method [170–172] has proven very effective in searching for global minima (or maxima). Usually, one convolves a nonconvex cost function with a suitable smoothing function to eliminate local optima, and then gradually decreases the degree of smoothness to achieve global optimization. Therefore, we believe that by properly annealing the kernel size, MC estimation can be used to obtain the global maximum of the MAP estimation.
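The sketch below (an illustration on an assumed bimodal toy posterior defined on a grid) computes the smoothed posterior of Theorem 3.3 for a decreasing sequence of kernel sizes and reports its mode; with a large kernel the smoothed PDF has a single, easy-to-locate maximum, and as the kernel size is annealed the mode moves toward the MAP-like solution.

```python
import numpy as np

# Hypothetical bimodal posterior p(theta | r): a narrow spurious peak plus a broad mode.
grid = np.linspace(-6.0, 6.0, 2401)

def gauss(t, m, s):
    return np.exp(-(t - m)**2 / (2.0 * s**2)) / (np.sqrt(2.0 * np.pi) * s)

posterior = 0.35 * gauss(grid, 3.0, 0.12) + 0.65 * gauss(grid, -1.0, 1.0)

def smoothed(sigma):
    """Posterior convolved with a Gaussian kernel of size sigma (grid convolution)."""
    kernel = gauss(grid, 0.0, sigma)                 # kernel centered on the grid
    dx = grid[1] - grid[0]
    return np.convolve(posterior, kernel, mode="same") * dx

# Annealing the kernel size: large sigma -> MMSE-like (mean-seeking) estimate,
# small sigma -> MAP-like (mode-seeking) estimate.
for sigma in [3.0, 1.5, 0.8, 0.4, 0.2, 0.05]:
    theta_hat = grid[np.argmax(smoothed(sigma))]
    print(f"kernel size {sigma:4.2f} -> MC estimate ≈ {theta_hat:+.3f}")
```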

The optimal solution of MC estimation is the mode of the smoothed posterior distribution, which is obviously not necessarily unique. The next theorem, however, shows that when the kernel size is larger than a certain value, the MC estimation has a unique optimal solution that lies in a strictly concave region of the smoothed PDF (see [169] for the proof).

Theorem 3.4 [169]

Assume that function image converges uniformly to 1 as image. Then there exists an interval image, image, such that when kernel size image is larger than a certain value, the smoothed PDF image will be strictly concave in image, and has a unique global maximum lying in this interval for any image.

Remark

Theorem 3.4 suggests that in MC estimation, one can use a larger kernel size to eliminate local optima by constructing a concave (in a certain interval) cost function with a unique global optimal solution. This result also shows that the initial condition for convolution smoothing should be chosen in this range. The only other parameter in the method that the user has to select is the annealing rate.

In the following, a simple example is presented to illustrate how the kernel size affects the solution of MC estimation. Suppose the joint PDF of image and image (image) is the mixture density (image) [169]:

image (3.78)

where image denotes the “clean” PDF and image denotes the contamination part corresponding to “bad” data or outliers. Let image, and assume that image and image are both jointly Gaussian:

image (3.79)

image (3.80)

For the case image, the smoothed posterior PDFs with different kernel sizes are shown in Figure 3.5, from which we observe: (i) when the kernel size is small, the smoothed PDFs are nonconcave within the dominant region (say the interval image), and there may exist local optima or even nonunique optimal solutions; (ii) when the kernel size is larger, the smoothed PDFs become concave within the dominant region, and there is a unique optimal solution. Several estimates of image given image are listed in Table 3.2. It is evident that when the kernel size is very small, the MC estimate is the same as the MAP estimate, while when the kernel size is very large, the MC estimate is close to the MMSE estimate. In particular, for some kernel sizes (say image or image), the MC estimate of image equals −0.05, which is exactly the MMSE (or MAP) estimate of image based on the "clean" distribution image. This result confirms that MC estimation is much more robust (with respect to outliers) than both MMSE and MAP estimation.

image

Figure 3.5 Smoothed conditional PDFs given image. Source: Adapted from [169].

Table 3.2

Estimates of image Given image

Image

(adapted from [169])

3.4 Information Criteria for Model Selection

Information theoretic approaches have also been used to solve the model selection problem. Consider the problem of estimating the parameter image of a family of models

image (3.81)

where the parameter space dimension image is also unknown. This is actually a model structure selection problem, and the value of image is the structure parameter. There are many approaches to selecting the space dimension image. Here, we only discuss several information criteria for selecting the most parsimonious correct model; the name indicates that they are closely related to, or can be derived from, information theory.

1. Akaike’s information criterion
Akaike’s information criterion (AIC) was first developed by Akaike [6,7]. AIC is a measure of the relative goodness of fit of a statistical model. It describes the tradeoff between bias and variance (or between accuracy and complexity) in model construction. In the general case, AIC is defined as

image (3.82)

where image is the maximized value of the likelihood function for the estimated model. To apply AIC in practice, we start with a set of candidate models, and find the models’ corresponding AIC values, and then select the model with the minimum AIC value. Since AIC includes a penalty term image, it can effectively avoid overfitting. However, the penalty is constant regardless of the number of samples used in the fitting process.
The AIC criterion can be derived from the KL-divergence minimization principle or the equivalent relative entropy maximization principle (see Appendix G for the derivation).

2. Bayesian Information Criterion
The Bayesian information criterion (BIC), also known as the Schwarz criterion, was developed independently by Akaike and by Schwarz in 1978, using a Bayesian formalism. Akaike’s version of BIC is often referred to as the ABIC (for "a BIC") or, more casually, as Akaike’s Bayesian information criterion. BIC is based, in part, on the likelihood function and is closely related to the AIC. The formula for the BIC is

image (3.83)

where image denotes the number of observed data points (i.e., the sample size). The BIC has a form very similar to that of AIC; as one can see, the penalty term in BIC is in general larger than in AIC, so BIC generally favors smaller model sizes (a numerical sketch comparing AIC and BIC on a model order selection problem is given after this list).

3. Minimum Description Length Criterion
The minimum description length (MDL) principle was introduced by Rissanen [9]. It is an important principle in information and learning theories. The fundamental idea behind the MDL principle is that any regularity in a given set of data can be used to compress the data, that is, to describe it using fewer symbols than needed to describe the data literally. According to MDL, the best model for a given set of data is the one that leads to the best compression of the data. Because data compression is formally equivalent to a form of probabilistic prediction, MDL methods can be interpreted as searching for a model with good predictive performance on unseen data. The ideal MDL approach requires the estimation of the Kolmogorov complexity, which is noncomputable in general. However, there are nonideal, practical versions of MDL.
From a coding perspective, assume that both sender and receiver know which member image of the parametric family image generated a data string image. Then, from a straightforward generalization of Shannon’s Source Coding Theorem to continuous random variables, it follows that the best description length of image (in an average sense) is simply image, because on average the code length achieves the entropy lower bound image. Clearly, minimizing image is equivalent to maximizing image, so MDL coincides with ML in parametric estimation problems. In addition, we have to transmit image, because the receiver does not know its value in advance. Adding in this cost, we arrive at a code length for the data string image:

image (3.84)

where image denotes the number of bits for transmitting image. If we assume that the machine precision is image for each component of image and image is transmitted with a uniform encoder, then the term image is expressed as

image (3.85)


In this case, the MDL takes the form of BIC. An alternative expression of image is

image (3.86)

where image is a constant related to the number of bits in the exponent of the floating point representation of image and image is the optimal precision of image.
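As promised above, here is a brief numerical sketch (our own example) comparing AIC and BIC for selecting the order of a polynomial regression model with Gaussian noise, using the usual definitions AIC = 2k − 2 ln L̂ and BIC = k ln N − 2 ln L̂.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**3 + 0.3 * rng.normal(size=n)   # cubic ground truth

for order in range(1, 7):
    coeffs = np.polyfit(x, y, deg=order)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid**2)                               # ML estimate of the noise variance
    # Maximized Gaussian log-likelihood: -n/2 * (log(2*pi*sigma2) + 1)
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = (order + 1) + 1                                      # polynomial coefficients + variance
    aic = 2.0 * k - 2.0 * loglik
    bic = k * np.log(n) - 2.0 * loglik
    print(f"order {order}: AIC = {aic:8.2f}, BIC = {bic:8.2f}")
```

Both criteria should typically select the true order here; with smaller samples or noisier data, the heavier BIC penalty tends to favor the smaller model.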

Appendix E: EM Algorithm

The EM algorithm is an iterative method for finding the ML estimate of parameters in statistical models where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which calculates the expectation of the log-likelihood evaluated using the current parameter estimate, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

Let image be the observed data, image be the unobserved data, and image be a vector of unknown parameters. Further, let image be the likelihood function, image be the marginal likelihood function of the observed data, and image be the conditional density of image given image under the current parameter estimate image. The EM algorithm seeks the ML estimate of the parameters (i.e., the maximizer of the marginal likelihood) by iteratively applying the following two steps:

1. Expectation step (E step): Calculate the expected value of the log-likelihood function, with respect to the conditional distribution of image given image under the current estimate image:

image (E.1)

2. Maximization step (M step): Find the next estimate image of the parameters by maximizing this quantity:

image (E.2)


The iteration image continues until image is sufficiently small.
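To make the two steps concrete, here is a minimal sketch (our own example) of EM for a two-component Gaussian mixture, where the component labels play the role of the unobserved data.

```python
import numpy as np

rng = np.random.default_rng(9)
# Synthetic data from a two-component Gaussian mixture (the labels are unobserved).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial parameter guesses: weights, means, variances.
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m)**2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

for _ in range(100):
    # E step: posterior responsibilities of each component for each sample.
    joint = w[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters by maximizing the expected log-likelihood.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu[None, :])**2).sum(axis=0) / nk

print(f"weights ≈ {w.round(3)}, means ≈ {mu.round(3)}, variances ≈ {var.round(3)}")
```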

Appendix F: Minimum MSE Estimation

The MMSE estimate image of image is obtained by minimizing the following cost:

image (F.1)

Since image, image, one only needs to minimize image. Let

image (F.2)

Then we get image.

Appendix G: Derivation of AIC Criterion

The information theoretic KL-divergence plays a crucial role in the derivation of AIC. Suppose that the data are generated by some distribution image. We consider two candidate models (distributions) to represent image: image and image. If we knew image, we could evaluate the information lost by using image to represent image by calculating the KL-divergence image; similarly, the information lost by using image to represent image would be found by calculating image. We would then choose the candidate model that minimizes the information loss. If image is unknown, we can estimate, via AIC, how much more (or less) information is lost by image than by image. The estimate is, of course, only valid asymptotically.

Given a family of density models image, where image denotes a image-dimensional parameter space, image, and a sequence of independent and identically distributed observations image, the AIC can be expressed as

image (G.1)

where image is the ML estimate of image:

image (G.2)

Let image be the unknown true parameter vector. Then

image (G.3)

where

image (G.4)

Taking the first term in a Taylor expansion of image, we obtain

image (G.5)

where image is the image Fisher information matrix. Then we have

image (G.6)

Suppose image can be decomposed into

image (G.7)

where image is some nonsingular matrix. We can derive

image (G.8)

According to the statistical properties of the ML estimator, when the sample size image is large enough, we have

image (G.9)

Combining (G.9) and (G.7) yields

image (G.10)

where image is the image identity matrix. From (G.8) and (G.10), we obtain

image (G.11)

That is, image is chi-squared distributed with image degrees of freedom. This implies

image (G.12)

It follows that

image (G.13)

And hence

image (G.14)

It has been proved in [173] that

image (G.15)

Therefore

image (G.16)

Combining (G.16) and (G.14), we have

image (G.17)

To minimize the KL-divergence image, one needs to maximize image or, equivalently, to minimize (in an asymptotic sense) the following objective function:

image (G.18)

This is exactly the AIC criterion.


1For a discrete variable image, image is defined by image, while for a continuous variable, it is defined as image, satisfying image.

2The mode of a continuous probability distribution is the value at which its PDF attains its maximum value.

3Several entropy estimation methods will be presented in Chapter 4.

4Here, image is actually the maximum entropy density that satisfies the constraint condition image.
