2 Information Measures

The concept of information is so rich that various definitions of information measures exist. Kolmogorov proposed three approaches to defining an information measure: the probabilistic method, the combinatorial method, and the computational method [146]. Accordingly, information measures can be divided into three categories: probabilistic (or statistical) information, combinatorial information, and algorithmic information. This book focuses mainly on statistical information, which was first conceptualized by Shannon [44]. As a branch of mathematical statistics, Shannon information theory lays down a mathematical framework for designing optimal communication systems. Its core issues are how to measure the amount of information and how to describe information transmission. Motivated by the features of data transmission in communication, Shannon proposed the use of entropy, which measures the uncertainty contained in a probability distribution, as the definition of the information in a data source.

2.1 Entropy

Definition 2.1

Given a discrete random variable $X$ with probability mass function $p(x_i) = \Pr\{X = x_i\}$, $i = 1, \ldots, N$, Shannon's (discrete) entropy is defined by [43]

$H(X) = -\sum_{i=1}^{N} p(x_i)\log p(x_i) = \sum_{i=1}^{N} p(x_i)\, I(x_i)$   (2.1)

where $I(x_i) = -\log p(x_i)$ is Hartley's amount of information associated with the discrete value $x_i$ with probability $p(x_i)$.1 This information measure was originally devised by Claude Shannon in 1948 to study the amount of information in a transmitted message. Shannon entropy measures the average information (or uncertainty) contained in a probability distribution and can also be used to measure many other concepts, such as diversity, similarity, disorder, and randomness. However, since the discrete entropy depends only on the distribution $\{p(x_i)\}$ and takes no account of the values $x_i$ themselves, it is independent of the dynamic range of the random variable. The discrete entropy is therefore unable to differentiate between two random variables that have the same distribution but different dynamic ranges. In fact, discrete random variables with the same entropy may have arbitrarily small or large variance, a typical measure of the value dispersion of a random variable.
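To make the definition concrete, here is a minimal Python sketch (ours, not from the book) that evaluates Eq. (2.1) in nats for an arbitrary probability mass function; the function name discrete_entropy is our own choice.

```python
import numpy as np

def discrete_entropy(pmf):
    """Shannon entropy (in nats) of a discrete distribution, Eq. (2.1)."""
    p = np.asarray(pmf, dtype=float)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0), "pmf must be a valid distribution"
    p = p[p > 0]                      # 0*log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))

print(discrete_entropy([0.5, 0.5]))   # fair coin: log(2) ~ 0.693 nats
print(discrete_entropy([0.9, 0.1]))   # biased coin: ~0.325 nats, less uncertainty
```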

Since system parameter identification deals, in general, with continuous random variables, we are more interested in the entropy of a continuous random variable.

Definition 2.2

If $X$ is a continuous random variable with PDF $p(x)$, $x \in \mathbb{R}$, Shannon's differential entropy is defined as

$H(X) = -\int_{\mathbb{R}} p(x)\log p(x)\,dx$   (2.2)

The differential entropy is a functional of the PDF $p(x)$. For this reason, we also denote it by $H(p)$. The entropy definition in (2.2) can be extended to multiple random variables. The joint entropy of two continuous random variables $X$ and $Y$ is

$H(X,Y) = -\iint p(x,y)\log p(x,y)\,dx\,dy$   (2.3)

where $p(x,y)$ denotes the joint PDF of $(X,Y)$. Furthermore, one can define the conditional entropy of $X$ given $Y$ as

$H(X|Y) = -\iint p(x,y)\log p(x|y)\,dx\,dy$   (2.4)

where $p(x|y)$ is the conditional PDF of $X$ given $Y$.2

If $X$ and $Y$ are discrete random variables, the entropy definitions in (2.3) and (2.4) still apply, with the PDFs replaced by the probability mass functions and the integrals replaced by summations.

Theorem 2.1

Properties of the differential entropy:3

1. Differential entropy can be either positive or negative.

2. Differential entropy is not related to the mean value (shift invariance), i.e., $H(X + c) = H(X)$, where $c$ is an arbitrary constant.

3. $H(aX) = H(X) + \log|a|$, where $a$ is an arbitrary nonzero constant (scaling property).

4. $H(X|Y) \le H(X)$, $H(X,Y) \le H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are mutually independent.

5. Entropy has the concavity property: $H(p)$ is a concave function of $p$, that is, for any $0 \le \lambda \le 1$ and any two PDFs $p_1$ and $p_2$, we have

$H\big(\lambda p_1 + (1-\lambda)p_2\big) \ge \lambda H(p_1) + (1-\lambda)H(p_2)$   (2.5)

6. If random variables $X$ and $Y$ are mutually independent, then

$H(X+Y) \ge \max\{H(X),\, H(Y)\}$   (2.6)

that is, the entropy of the sum of two independent random variables is no smaller than the entropy of each individual variable.

7. Entropy power inequality (EPI): If $X$ and $Y$ are mutually independent $d$-dimensional random variables, we have

$\exp\!\left(\dfrac{2}{d}H(X+Y)\right) \ge \exp\!\left(\dfrac{2}{d}H(X)\right) + \exp\!\left(\dfrac{2}{d}H(Y)\right)$   (2.7)

with equality if and only if $X$ and $Y$ are Gaussian distributed and their covariance matrices are in proportion to each other.

8. Assume $X$ and $Y$ are two $d$-dimensional random variables related by $Y = \phi(X)$, where $\phi$ denotes a smooth bijective mapping defined over $\mathbb{R}^d$ and $J_\phi(x)$ is the Jacobian matrix of $\phi$; then

$H(Y) = H(X) + E\big[\log\left|\det J_\phi(X)\right|\big]$   (2.8)

where $\det(\cdot)$ denotes the determinant.

9. Suppose $X$ is a $d$-dimensional Gaussian random variable, $X \sim N(\mu, \Sigma)$, i.e.,

$p(x) = \dfrac{1}{(2\pi)^{d/2}(\det\Sigma)^{1/2}}\exp\!\left(-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$   (2.9)

Then the differential entropy of $X$ is

$H(X) = \dfrac{1}{2}\log\big[(2\pi e)^d\det\Sigma\big]$   (2.10)


Differential entropy measures the uncertainty and dispersion in a probability distribution. Intuitively, the larger the value of entropy, the more scattered the probability density of a random variable; in other words, the smaller the value of entropy, the more concentrated the probability density. For a one-dimensional random variable, the differential entropy behaves similarly to the variance. For instance, the differential entropy of a one-dimensional Gaussian random variable $X \sim N(\mu, \sigma^2)$ is $H(X) = \frac{1}{2}\log(2\pi e\sigma^2)$, where $\sigma^2$ denotes the variance of $X$. In this case the differential entropy clearly increases monotonically with increasing variance. However, the entropy is in essence quite different from the variance; it is a more comprehensive measure. Some random variables have infinite variance while their entropy is still finite. For example, consider the following Cauchy distribution4:

$p(x) = \dfrac{1}{\pi}\,\dfrac{\lambda}{\lambda^2 + x^2}, \quad \lambda > 0$   (2.11)

Its variance is infinite, while the differential entropy is $\log(4\pi\lambda)$ [147].
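As a quick numerical check of this discussion (our own sketch, not the book's code), the snippet below compares the closed-form Gaussian differential entropy $\frac{1}{2}\log(2\pi e\sigma^2)$ with a Monte Carlo estimate of $-E[\log p(X)]$, and does the same for the Cauchy example; the helper name mc_differential_entropy is hypothetical.

```python
import numpy as np
from scipy.stats import norm, cauchy

rng = np.random.default_rng(0)

def mc_differential_entropy(dist, n=200_000):
    """Monte Carlo estimate of H(X) = -E[log p(X)] for a scipy distribution."""
    x = dist.rvs(size=n, random_state=rng)
    return -np.mean(dist.logpdf(x))

sigma = 2.0
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))      # closed form, ~2.112
print(mc_differential_entropy(norm(scale=sigma)))     # Monte Carlo, ~2.11

# Cauchy with scale lam: infinite variance, but finite entropy log(4*pi*lam)
lam = 1.0
print(np.log(4 * np.pi * lam))                        # ~2.531
print(mc_differential_entropy(cauchy(scale=lam)))     # ~2.53
```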
An important entropy optimization principle is the maximum entropy (MaxEnt) principle, enunciated by Jaynes [148] and by Kapur and Kesavan [149]. According to MaxEnt, among all the distributions that satisfy certain constraints, one should choose the distribution that maximizes the entropy, which is considered to be the most objective and most impartial choice. MaxEnt is a powerful and widely accepted principle for statistical inference with incomplete knowledge of the probability distribution.
The maximum entropy distribution under characteristic moment constraints can be obtained by solving the following optimization problem:

$\max_{p}\ H(X) = -\int p(x)\log p(x)\,dx \quad \text{s.t.}\quad \int p(x)\,dx = 1,\ \ \int g_k(x)\,p(x)\,dx = \mu_k,\ \ k = 1,\ldots,m$   (2.12)

where $\int p(x)\,dx = 1$ is the natural constraint (the normalization constraint) and $\int g_k(x)\,p(x)\,dx = \mu_k$ ($k = 1,\ldots,m$) denote the $m$ (generalized) characteristic moment constraints.

Theorem 2.2

(Maximum Entropy PDF) Subject to the constraints in (2.12), the maximum entropy PDF is given by

$p(x) = \exp\!\left(\lambda_0 + \sum_{k=1}^{m}\lambda_k\, g_k(x)\right)$   (2.13)

where the coefficients $\lambda_0, \lambda_1, \ldots, \lambda_m$ are the solution of the following equations:5

$\int \exp\!\left(\lambda_0 + \sum_{k=1}^{m}\lambda_k\, g_k(x)\right)dx = 1, \qquad \int g_k(x)\exp\!\left(\lambda_0 + \sum_{j=1}^{m}\lambda_j\, g_j(x)\right)dx = \mu_k,\ \ k = 1,\ldots,m$   (2.14)
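As an illustration of Theorem 2.2 (a sketch under our own assumptions, not code from the book), the snippet below solves Eq. (2.14) numerically for a single mean constraint $E[X] = \mu_1$ with $g_1(x) = x$ on the support $[0,\infty)$; the MaxEnt solution should recover the exponential PDF with mean $\mu_1$, i.e., $\lambda_0 = \log(1/\mu_1)$ and $\lambda_1 = -1/\mu_1$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

mu1 = 2.0   # prescribed mean E[X] for the single constraint g_1(x) = x on [0, inf)

def residuals(lams):
    """Residuals of Eq. (2.14): normalization and first-moment constraints."""
    lam0, lam1 = lams
    if lam1 >= 0:                        # the integrals would diverge on [0, inf)
        return [1e6, 1e6]
    p = lambda x: np.exp(lam0 + lam1 * x)
    norm_c, _ = quad(p, 0, np.inf)
    mean_c, _ = quad(lambda x: x * p(x), 0, np.inf)
    return [norm_c - 1.0, mean_c - mu1]

lam0, lam1 = fsolve(residuals, x0=[-1.0, -1.0])
# The exponential PDF (1/mu1)*exp(-x/mu1) corresponds to lam0 = log(1/mu1), lam1 = -1/mu1
print(lam0, np.log(1 / mu1))   # both ~ -0.693
print(lam1, -1 / mu1)          # both ~ -0.5
```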

In statistical information theory, in addition to Shannon entropy, there are many other definitions of entropy, such as Renyi entropy (named after Alfred Renyi) [152], Havrda–Charvat entropy [153], Varma entropy [154], Arimoto entropy [155], and the $(h,\phi)$-entropy [156]. Among them, the $(h,\phi)$-entropy is the most general definition of entropy. The $(h,\phi)$-entropy of a continuous random variable $X$ is defined by [156]

$H_{h,\phi}(X) = h\!\left(\int_{\mathbb{R}} \phi\big(p(x)\big)\,dx\right)$   (2.15)

where either $\phi$ is a concave function and $h$ is a monotonically increasing function, or $\phi$ is a convex function and $h$ is a monotonically decreasing function. When $h(x) = x$, the $(h,\phi)$-entropy becomes the $\phi$-entropy:

$H_{\phi}(X) = \int_{\mathbb{R}} \phi\big(p(x)\big)\,dx$   (2.16)

where $\phi$ is a concave function. Similar to Shannon entropy, the $(h,\phi)$-entropy is also shift invariant and satisfies (see Appendix C for the proof)

$H_{h,\phi}(X+Y) \ge \max\big\{H_{h,\phi}(X),\, H_{h,\phi}(Y)\big\}$   (2.17)

where $X$ and $Y$ are two mutually independent random variables. Some typical examples of the $(h,\phi)$-entropy are given in Table 2.1. As one can see, many entropy definitions can be regarded as special cases of the $(h,\phi)$-entropy.

Table 2.1

$(h,\phi)$-Entropies with Different $h$ and $\phi$ Functions [130]

Image

From Table 2.1, Renyi's entropy of order $\alpha$ is defined as

$H_{\alpha}(X) = \dfrac{1}{1-\alpha}\log\int_{\mathbb{R}} p^{\alpha}(x)\,dx = \dfrac{1}{1-\alpha}\log V_{\alpha}(X)$   (2.18)

where $\alpha > 0$, $\alpha \neq 1$, and $V_{\alpha}(X) = \int_{\mathbb{R}} p^{\alpha}(x)\,dx$ is called the order-$\alpha$ information potential (when $\alpha = 2$, it is called the quadratic information potential, QIP) [64]. The Renyi entropy is a generalization of Shannon entropy: in the limit $\alpha \to 1$, it converges to Shannon entropy, i.e., $\lim_{\alpha\to 1} H_{\alpha}(X) = H(X)$.
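In information theoretic learning, the QIP is usually estimated from samples by a Parzen (kernel density) plug-in estimator; with a Gaussian kernel of width $\sigma$, this reduces to a double sum of Gaussian kernels of width $\sigma\sqrt{2}$ over sample pairs [64]. The sketch below is our own minimal version; the function names are ours.

```python
import numpy as np

def quadratic_information_potential(x, sigma=1.0):
    """Parzen plug-in estimate of the QIP V_2(X) = int p^2(x) dx.

    With a Gaussian Parzen kernel of width sigma, the plug-in estimate is a
    double sum over sample pairs of Gaussian kernels of width sigma*sqrt(2).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = x[:, None] - x[None, :]              # all pairwise differences
    s2 = 2.0 * sigma**2                         # variance of the sqrt(2)-wide kernel
    g = np.exp(-diff**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return float(g.sum() / n**2)

def renyi_quadratic_entropy(x, sigma=1.0):
    """Sample estimate of H_2(X) = -log V_2(X)."""
    return -np.log(quadratic_information_potential(x, sigma))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=2000)
# For N(0, s^2) the true H_2 is 0.5*log(4*pi*s^2) ~ 1.96 when s = 2
print(renyi_quadratic_entropy(x, sigma=0.3))
```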

The previous entropies are all defined in terms of PDFs (in the continuous random variable case). Recently, some researchers have also proposed entropy measures based on the distribution or survival functions [157,158]. For example, the cumulative residual entropy (CRE) of a scalar random variable $X$ is defined by [157]

$\mathcal{E}(X) = -\int_{0}^{\infty} \bar{F}_{|X|}(x)\log \bar{F}_{|X|}(x)\,dx$   (2.19)

where $\bar{F}_{|X|}(x) = P(|X| > x)$ is the survival function of $|X|$. The CRE is obtained simply by replacing the PDF in the original differential entropy (2.2) with the survival function (of the absolute value transformation of $X$). Further, the order-$\alpha$ ($\alpha > 0$) survival information potential (SIP) is defined as [159]

$S_{\alpha}(X) = \int_{0}^{\infty} \big(\bar{F}_{|X|}(x)\big)^{\alpha}\,dx$   (2.20)

This new definition of information potential is valid for both discrete and continuous random variables.
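One simple way to appreciate (2.19) and (2.20) is to estimate them from data by replacing the survival function with its empirical counterpart on a grid. The following is our own illustrative sketch (the function names and the Riemann-sum integration are our choices, not the book's).

```python
import numpy as np

def empirical_survival(x, t):
    """Empirical survival function of |X| evaluated on the grid t."""
    a = np.abs(np.asarray(x, dtype=float))
    return np.array([(a > ti).mean() for ti in t])

def empirical_cre(x, grid_size=2000):
    """Empirical cumulative residual entropy, Eq. (2.19)."""
    a = np.abs(np.asarray(x, dtype=float))
    t = np.linspace(0.0, a.max(), grid_size)
    surv = empirical_survival(x, t)
    dt = t[1] - t[0]
    s = surv[surv > 0]                      # 0*log(0) is treated as 0
    return float(-np.sum(s * np.log(s)) * dt)

def empirical_sip(x, alpha=2.0, grid_size=2000):
    """Empirical order-alpha survival information potential, Eq. (2.20)."""
    a = np.abs(np.asarray(x, dtype=float))
    t = np.linspace(0.0, a.max(), grid_size)
    surv = empirical_survival(x, t)
    dt = t[1] - t[0]
    return float(np.sum(surv**alpha) * dt)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=5000)
print(empirical_cre(x), empirical_sip(x, alpha=2.0))
```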

In recent years, the concept of correntropy has also been applied successfully in signal processing and machine learning [137]. The correntropy is not a true entropy measure, but in this book it is still regarded as an information theoretic measure since it is closely related to Renyi's quadratic entropy $H_2(X)$; that is, the negative logarithm of the sample mean of the correntropy (with Gaussian kernel) yields the Parzen estimate of Renyi's quadratic entropy [64]. Let $X$ and $Y$ be two random variables with the same dimensions; the correntropy is defined by

$V(X,Y) = E\big[\kappa(X,Y)\big] = \iint \kappa(x,y)\,dF_{XY}(x,y)$   (2.21)

where $E[\cdot]$ denotes the expectation operator, $\kappa(\cdot,\cdot)$ is a translation invariant Mercer kernel6, and $F_{XY}(x,y)$ denotes the joint distribution function of $(X,Y)$. According to Mercer's theorem, any Mercer kernel $\kappa$ induces a nonlinear mapping $\varphi$ from the input space (original domain) to a high (possibly infinite) dimensional feature space $\mathbb{F}$ (a vector space in which the input data are embedded), and the inner product of two points $\varphi(x)$ and $\varphi(y)$ in $\mathbb{F}$ can be implicitly computed by using the Mercer kernel (the so-called "kernel trick") [160–162]. Then the correntropy (2.21) can alternatively be expressed as

$V(X,Y) = E\big[\langle \varphi(X), \varphi(Y)\rangle_{\mathbb{F}}\big]$   (2.22)

where $\langle\cdot,\cdot\rangle_{\mathbb{F}}$ denotes the inner product in $\mathbb{F}$. From (2.22), one can see that the correntropy is in essence a new measure of the similarity between two random variables, which generalizes the conventional correlation function to feature spaces.
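In practice, the expectation in (2.21) is replaced by a sample mean over paired observations. Below is a minimal sketch (ours) of the sample correntropy with a Gaussian kernel of width $\sigma$; larger values indicate that the two variables are more similar on average.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Translation-invariant Gaussian (Mercer) kernel evaluated at u = x - y."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def sample_correntropy(x, y, sigma=1.0):
    """Sample estimate of V(X, Y) = E[kappa(X - Y)], cf. Eq. (2.21)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(x - y, sigma)))

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = x + 0.1 * rng.normal(size=5000)      # nearly identical to x
z = rng.normal(size=5000)                # independent of x
print(sample_correntropy(x, y, sigma=1.0))   # larger: x and y are very similar
print(sample_correntropy(x, z, sigma=1.0))   # smaller: x and z are unrelated
```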

2.2 Mutual Information

Definition 2.3

The mutual information between continuous random variables $X$ and $Y$ is defined as

$I(X;Y) = \iint p(x,y)\log\dfrac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$   (2.23)

The conditional mutual information between $X$ and $Y$, conditioned on random variable $Z$, is given by

$I(X;Y|Z) = \iiint p(x,y,z)\log\dfrac{p(x,y|z)}{p(x|z)\,p(y|z)}\,dx\,dy\,dz$   (2.24)

For a random vector7 $X = [X_1, X_2, \ldots, X_d]^T$ ($d \ge 2$), the mutual information between the components is

$I(X_1; X_2; \ldots; X_d) = \int_{\mathbb{R}^d} p(x_1,\ldots,x_d)\log\dfrac{p(x_1,\ldots,x_d)}{\prod_{i=1}^{d} p_i(x_i)}\,dx_1\cdots dx_d$   (2.25)

Theorem 2.3

Properties of the mutual information:

1. Symmetry, i.e., $I(X;Y) = I(Y;X)$.

2. Nonnegativity, i.e., $I(X;Y) \ge 0$, with equality if and only if $X$ and $Y$ are mutually independent.

3. Data processing inequality (DPI): If random variables $X$, $Y$, $Z$ form a Markov chain $X \to Y \to Z$, then $I(X;Z) \le I(X;Y)$. In particular, if $Z$ is a function of $Y$, $Z = g(Y)$, where $g$ is a measurable mapping, then $I(X; g(Y)) \le I(X;Y)$, with equality if $g$ is invertible and $g^{-1}$ is also a measurable mapping.

4. The relationship between mutual information and entropy:

$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$   (2.26)

5. Chain rule: Let $X_1, X_2, \ldots, X_n$ and $Y$ be random variables. Then

$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$   (2.27)

6. If $X$, $Y$, and $Z = [X^T, Y^T]^T$ are $d_1$-, $d_2$-, and $(d_1+d_2)$-dimensional Gaussian random variables with, respectively, covariance matrices $\Sigma_X$, $\Sigma_Y$, and $\Sigma_Z$, then the mutual information between $X$ and $Y$ is

$I(X;Y) = \dfrac{1}{2}\log\dfrac{\det\Sigma_X\,\det\Sigma_Y}{\det\Sigma_Z}$   (2.28)

In particular, if $d_1 = d_2 = 1$, we have

$I(X;Y) = -\dfrac{1}{2}\log\big(1-\rho_{XY}^2\big)$   (2.29)

where $\rho_{XY}$ denotes the correlation coefficient between $X$ and $Y$.

7. Relationship between mutual information and MSE: Assume $X$ and $N$ are two Gaussian random variables satisfying $Y = \sqrt{\mathrm{snr}}\,X + N$, where $\mathrm{snr} \ge 0$, $N \sim N(0,1)$, and $X$ and $N$ are mutually independent. Then we have [81]

$\dfrac{d}{d\,\mathrm{snr}}\, I(X;Y) = \dfrac{1}{2}\,\mathrm{mmse}(\mathrm{snr})$   (2.30)

where $\mathrm{mmse}(\mathrm{snr}) = E\big[(X - E[X|Y])^2\big]$ denotes the minimum MSE when estimating $X$ based on $Y$.
Mutual information is a measure of the amount of information that one random variable contains about another random variable. The stronger the dependence between two random variables, the greater the mutual information. If two random variables are mutually independent, the mutual information between them attains its minimum value of zero. Mutual information is closely related to the correlation coefficient: according to (2.29), for two Gaussian random variables, the mutual information is a monotonically increasing function of the magnitude of the correlation coefficient. However, mutual information and the correlation coefficient differ in nature. Zero mutual information implies that the random variables are mutually independent, and hence that the correlation coefficient is also zero, whereas a zero correlation coefficient does not imply that the mutual information is zero (i.e., mutual independence). In fact, independence is a much stronger condition than mere uncorrelatedness. Consider the following Pareto distribution [149]:

image (2.31)

where the variables and the distribution parameters are restricted to their appropriate ranges. One can calculate $E[X]$, $E[Y]$, and $E[XY]$, and verify that $E[XY] = E[X]\,E[Y]$, hence $\rho_{XY} = 0$ ($X$ and $Y$ are uncorrelated). In this case, however, $p(x,y) \neq p(x)\,p(y)$, that is, $X$ and $Y$ are not mutually independent (the mutual information is not zero).
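Equation (2.29) can be checked directly by simulation. The short sketch below (our own) draws correlated Gaussian samples and evaluates the closed-form mutual information from both the target and the sample correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                                          # target correlation coefficient
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

rho_hat = np.corrcoef(x, y)[0, 1]                  # sample correlation
print(-0.5 * np.log(1 - rho**2))                   # Eq. (2.29) with the target rho, ~0.511 nats
print(-0.5 * np.log(1 - rho_hat**2))               # same formula with the sample estimate
```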
With mutual information, one can define the rate distortion function and the distortion rate function. The rate distortion function $R(D)$ of a random variable $X$ with MSE distortion is defined by

$R(D) = \min_{p(\hat{x}|x):\ E[(X-\hat{X})^2]\le D}\ I(X;\hat{X})$   (2.32)


At the same time, the distortion rate function is defined as

$D(R) = \min_{p(\hat{x}|x):\ I(X;\hat{X})\le R}\ E\big[(X-\hat{X})^2\big]$   (2.33)

Theorem 2.4

If $X$ is a Gaussian random variable, $X \sim N(\mu, \sigma^2)$, then

$R(D) = \max\!\left\{\dfrac{1}{2}\log\dfrac{\sigma^2}{D},\, 0\right\}, \qquad D(R) = \sigma^2 e^{-2R}$   (2.34)

2.3 Information Divergence

In statistics and information geometry, an information divergence measures the "distance" from one probability distribution to another. However, a divergence is a much weaker notion than a distance (metric) in mathematics; in particular, it need not be symmetric and need not satisfy the triangle inequality.

Definition 2.4

Assume that $X$ and $Y$ are two random variables with PDFs $p(x)$ and $q(x)$, respectively, with common support. The Kullback–Leibler information divergence (KLID) between $X$ and $Y$ is defined by

$D_{\mathrm{KL}}(X\|Y) = D_{\mathrm{KL}}(p\|q) = \int p(x)\log\dfrac{p(x)}{q(x)}\,dx$   (2.35)

In the literature, the KL-divergence is also referred to as the discrimination information, the cross entropy, the relative entropy, or the directed divergence.

Theorem 2.5

Properties of KL-divergence:

1. $D_{\mathrm{KL}}(p\|q) \ge 0$, with equality if and only if $p = q$ (almost everywhere).

2. Nonsymmetry: In general, $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$.

3. $I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)p(y)\big)$, that is, the mutual information between two random variables is actually the KL-divergence between the joint probability density and the product of the marginal probability densities.

4. Convexity property: $D_{\mathrm{KL}}(p\|q)$ is a convex function of the pair $(p,q)$, i.e., for any $0 \le \lambda \le 1$, we have

$D_{\mathrm{KL}}\big(\lambda p_1 + (1-\lambda)p_2\,\big\|\,\lambda q_1 + (1-\lambda)q_2\big) \le \lambda D_{\mathrm{KL}}(p_1\|q_1) + (1-\lambda) D_{\mathrm{KL}}(p_2\|q_2)$   (2.36)

where $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of PDFs.

5. Pinsker's inequality: an inequality that relates the KL-divergence and the total variation distance. It states that

$D_{\mathrm{KL}}(p\|q) \ge \dfrac{1}{2}\left(\int \big|p(x) - q(x)\big|\,dx\right)^{2}$   (2.37)

6. Invariance under invertible transformation: Given random variables $X$ and $Y$ and an invertible transformation $\phi$, the KL-divergence remains unchanged after the transformation, i.e., $D_{\mathrm{KL}}(X\|Y) = D_{\mathrm{KL}}(\phi(X)\|\phi(Y))$. In particular, if $\phi(x) = x + c$, where $c$ is a constant, then the KL-divergence is shift-invariant:

$D_{\mathrm{KL}}(X\|Y) = D_{\mathrm{KL}}(X+c\,\|\,Y+c)$   (2.38)

7. If $X$ and $Y$ are two $d$-dimensional Gaussian random variables, $X \sim N(\mu_1, \Sigma_1)$, $Y \sim N(\mu_2, \Sigma_2)$, then

$D_{\mathrm{KL}}(X\|Y) = \dfrac{1}{2}\left[\log\dfrac{\det\Sigma_2}{\det\Sigma_1} + \mathrm{Tr}\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - d\right]$   (2.39)

where $\mathrm{Tr}(\cdot)$ denotes the trace operator.
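Property 7 is easy to evaluate numerically. The sketch below (our own; kl_gauss is a hypothetical helper name) implements Eq. (2.39) for two multivariate Gaussians and confirms that the divergence vanishes when the two distributions coincide.

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, Sigma1) || N(mu2, Sigma2)) in nats, Eq. (2.39)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    s1, s2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    d = mu1.size
    s2_inv = np.linalg.inv(s2)
    dm = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(s2) / np.linalg.det(s1))
                  + np.trace(s2_inv @ s1)
                  + dm @ s2_inv @ dm
                  - d)

mu_a, cov_a = [0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]
mu_b, cov_b = [1.0, -1.0], [[2.0, 0.0], [0.0, 2.0]]
print(kl_gauss(mu_a, cov_a, mu_b, cov_b))   # strictly positive
print(kl_gauss(mu_a, cov_a, mu_a, cov_a))   # zero when the two Gaussians coincide
```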
There are many other definitions of information divergence. Some quadratic divergences are frequently used in machine learning, since they involve only simple quadratic forms of the PDFs. Among them, the Euclidean distance (ED) in probability space and the Cauchy–Schwarz (CS) divergence are popular; they are defined, respectively, as [64]

$D_{\mathrm{ED}}(p\|q) = \int \big(p(x) - q(x)\big)^2\,dx$   (2.40)

$D_{\mathrm{CS}}(p\|q) = -\log\dfrac{\left(\int p(x)q(x)\,dx\right)^{2}}{\int p^2(x)\,dx\,\int q^2(x)\,dx}$   (2.41)

Clearly, the ED in (2.40) can be expressed in terms of the QIP:

$D_{\mathrm{ED}}(p\|q) = V_2(p) - 2V_2(p;q) + V_2(q)$   (2.42)

where $V_2(p;q) = \int p(x)q(x)\,dx$ is named the cross information potential (CIP). Further, the CS-divergence of (2.41) can also be rewritten in terms of Renyi's quadratic entropy:

$D_{\mathrm{CS}}(p\|q) = 2H_2(p;q) - H_2(p) - H_2(q)$   (2.43)

where $H_2(p;q) = -\log\int p(x)q(x)\,dx$ is called Renyi's quadratic cross entropy.
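Both quadratic divergences can be estimated from two sample sets by plugging Parzen estimates of the information potentials into (2.42) and (2.43). The sketch below (our own, reusing the Gaussian-kernel double sum shown earlier for the QIP) is one possible implementation; the function names are ours.

```python
import numpy as np

def cross_information_potential(x, y, sigma=1.0):
    """Parzen estimate of the CIP V_2(p;q) = int p(x) q(x) dx."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diff = x[:, None] - y[None, :]
    s2 = 2.0 * sigma**2                         # Gaussian kernel of width sigma*sqrt(2)
    g = np.exp(-diff**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return float(g.mean())

def euclidean_distance_pdf(x, y, sigma=1.0):
    """Eq. (2.42): D_ED = V_2(p) - 2 V_2(p;q) + V_2(q)."""
    vp = cross_information_potential(x, x, sigma)
    vq = cross_information_potential(y, y, sigma)
    vpq = cross_information_potential(x, y, sigma)
    return vp - 2.0 * vpq + vq

def cs_divergence(x, y, sigma=1.0):
    """Eq. (2.43): D_CS = 2 H_2(p;q) - H_2(p) - H_2(q)."""
    vp = cross_information_potential(x, x, sigma)
    vq = cross_information_potential(y, y, sigma)
    vpq = cross_information_potential(x, y, sigma)
    return -2.0 * np.log(vpq) + np.log(vp) + np.log(vq)

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 2000)
y = rng.normal(1.5, 1.0, 2000)                  # shifted distribution
print(euclidean_distance_pdf(x, y, 0.3), cs_divergence(x, y, 0.3))   # both > 0
print(euclidean_distance_pdf(x, x, 0.3), cs_divergence(x, x, 0.3))   # both 0 for identical samples
```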
Also, there is a more general definition of divergence, the $\phi$-divergence, which is defined as [130]

$D_{\phi}(p\|q) = \int q(x)\,\phi\!\left(\dfrac{p(x)}{q(x)}\right)dx$   (2.44)

where $\phi \in \Phi$, the collection of convex functions on $(0,\infty)$ satisfying $\phi(1) = 0$, $0\,\phi(0/0) = 0$, and $0\,\phi(p/0) = p\,\lim_{u\to\infty}\phi(u)/u$. When $\phi(x) = x\log x$ (or $\phi(x) = -\log x$), the $\phi$-divergence becomes the KL-divergence $D_{\mathrm{KL}}(p\|q)$ (or $D_{\mathrm{KL}}(q\|p)$). It is easy to verify that the $\phi$-divergence satisfies properties (1), (4), and (6) in Theorem 2.5. Table 2.2 gives some typical examples of the $\phi$-divergence.

Table 2.2

$\phi$-Divergences with Different $\phi$-Functions [130]

Image

2.4 Fisher Information

The most celebrated information measure in statistics is perhaps the one developed by R.A. Fisher (1921) for the purpose of quantifying the information that a distribution carries about its parameter.

Definition 2.5

Given a parameterized PDF $p(x;\theta)$, where $x \in \mathbb{R}$ and $\theta = [\theta_1, \ldots, \theta_d]^T$ is a $d$-dimensional parameter vector, and assuming $p(x;\theta)$ is continuously differentiable with respect to $\theta$, the Fisher information matrix (FIM) with respect to $\theta$ is

$J_F(\theta) = \int p(x;\theta)\left(\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right)\left(\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right)^{T} dx$   (2.45)

Clearly, the FIM $J_F(\theta)$, also referred to as the Fisher information, is a $d\times d$ matrix. If $\theta$ is a (scalar) location parameter, i.e., $p(x;\theta) = p(x-\theta)$, the Fisher information becomes

$J_F = \int \dfrac{\big(p'(x)\big)^2}{p(x)}\,dx$   (2.46)

The Fisher information of (2.45) can alternatively be written as

$J_F(\theta) = E_{\theta}\!\left[\left(\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right)\left(\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right)^{T}\right]$   (2.47)

where $E_{\theta}[\cdot]$ stands for the expectation with respect to $p(x;\theta)$. From (2.47), one can see that the Fisher information measures the "average sensitivity" of the logarithm of the PDF to the parameter $\theta$, or the "average influence" of the parameter $\theta$ on the logarithm of the PDF. The Fisher information is also a measure of the minimum error in estimating the parameter of a distribution. This is illustrated in the following theorem.

Theorem 2.6

(Cramer–Rao Inequality) Let $p(x;\theta)$ be a parameterized PDF, where $x \in \mathbb{R}$ and $\theta$ is a $d$-dimensional parameter vector, and assume that $p(x;\theta)$ is continuously differentiable with respect to $\theta$. Denote by $\hat{\theta}(x)$ an unbiased estimator of $\theta$ based on $x$, satisfying $E_{\theta}[\hat{\theta}(x)] = \theta^{*}$, where $\theta^{*}$ denotes the true value of $\theta$. Then

$P \triangleq E_{\theta}\!\left[\big(\hat{\theta}(x) - \theta^{*}\big)\big(\hat{\theta}(x) - \theta^{*}\big)^{T}\right] \ge J_F^{-1}(\theta)$   (2.48)

where $P$ is the covariance matrix of $\hat{\theta}(x)$.

Cramer–Rao inequality shows that the inverse of the FIM provides a lower bound on the error covariance matrix of the parameter estimator, which plays a significant role in parameter estimation. A proof of the Theorem 2.6 is given in Appendix D.
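To connect Theorem 2.6 with a familiar special case (our own illustrative sketch, not the book's example): for $N$ i.i.d. samples from $N(\theta, \sigma^2)$ with known $\sigma$, the Fisher information about the location parameter $\theta$ is $N/\sigma^2$, so the Cramer–Rao bound is $\sigma^2/N$, which the sample mean attains.

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true, sigma, n_samples, n_trials = 1.0, 2.0, 50, 20_000

# Fisher information of n i.i.d. Gaussian samples about the location parameter
fisher_info = n_samples / sigma**2
crb = 1.0 / fisher_info                       # Cramer-Rao lower bound, cf. Eq. (2.48)

# Monte Carlo variance of the sample mean, an unbiased estimator of theta
estimates = np.array([
    rng.normal(theta_true, sigma, n_samples).mean() for _ in range(n_trials)
])
print(crb)                 # 0.08
print(estimates.var())     # ~0.08: the sample mean attains the bound
```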

2.5 Information Rate

The previous information measures, such as entropy, mutual information, and KL-divergence, are all defined for random variables. These definitions can be further extended to various information rates, which are defined for random processes.

Definition 2.6

Let $\{X_t\}$ and $\{Y_t\}$ be two discrete-time stochastic processes, and denote $X^{(n)} = [X_1, \ldots, X_n]^T$, $Y^{(n)} = [Y_1, \ldots, Y_n]^T$. The entropy rate of the stochastic process $\{X_t\}$ is defined as

$\bar{H}(X) = \lim_{n\to\infty}\dfrac{1}{n}\, H\big(X^{(n)}\big)$   (2.49)

The mutual information rate between $\{X_t\}$ and $\{Y_t\}$ is defined by

$\bar{I}(X;Y) = \lim_{n\to\infty}\dfrac{1}{n}\, I\big(X^{(n)}; Y^{(n)}\big)$   (2.50)

Denoting by $p_n$ and $q_n$ the PDFs of $X^{(n)}$ and $Y^{(n)}$, the KL-divergence rate between $\{X_t\}$ and $\{Y_t\}$ is

$\bar{D}_{\mathrm{KL}}(X\|Y) = \lim_{n\to\infty}\dfrac{1}{n}\, D_{\mathrm{KL}}\big(p_n\,\|\,q_n\big)$   (2.51)

If the PDF of the stochastic process $\{X_t\}$ depends on and is continuously differentiable with respect to the parameter vector $\theta$, then the Fisher information rate matrix (FIRM) is

$\bar{J}_F(\theta) = \lim_{n\to\infty}\dfrac{1}{n}\, J_F^{(n)}(\theta)$   (2.52)

where $J_F^{(n)}(\theta)$ denotes the FIM computed from the joint PDF of $X^{(n)}$.

The information rates measure the average amount of information carried by a stochastic process per unit time. The limits in Definition 2.6 may not always exist; however, if the stochastic processes are stationary, these limits in general exist. The following theorem gives the information rates for stationary Gaussian processes.

Theorem 2.7

Given two jointly Gaussian stationary processes $\{X_t\}$ and $\{Y_t\}$, with power spectral densities $S_X(\omega)$ and $S_Y(\omega)$ and cross-spectral density $S_{XY}(\omega)$, the entropy rate of the Gaussian process $\{X_t\}$ is

$\bar{H}(X) = \dfrac{1}{2}\log(2\pi e) + \dfrac{1}{4\pi}\int_{-\pi}^{\pi}\log S_X(\omega)\,d\omega$   (2.53)

The mutual information rate between $\{X_t\}$ and $\{Y_t\}$ is

$\bar{I}(X;Y) = -\dfrac{1}{4\pi}\int_{-\pi}^{\pi}\log\!\left(1 - \dfrac{|S_{XY}(\omega)|^2}{S_X(\omega)\,S_Y(\omega)}\right)d\omega$   (2.54)

If both processes are zero mean, the KL-divergence rate between $\{X_t\}$ and $\{Y_t\}$ is

$\bar{D}_{\mathrm{KL}}(X\|Y) = \dfrac{1}{4\pi}\int_{-\pi}^{\pi}\left[\dfrac{S_X(\omega)}{S_Y(\omega)} - \log\dfrac{S_X(\omega)}{S_Y(\omega)} - 1\right]d\omega$   (2.55)

If the PDF of $\{X_t\}$ depends on and is continuously differentiable with respect to the parameter vector $\theta$, then the FIRM (assuming the mean of the process does not depend on $\theta$) is [163]

$\bar{J}_F(\theta) = \dfrac{1}{4\pi}\int_{-\pi}^{\pi}\left(\dfrac{\partial}{\partial\theta}\log S_X(\omega;\theta)\right)\left(\dfrac{\partial}{\partial\theta}\log S_X(\omega;\theta)\right)^{T} d\omega$   (2.56)
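As a numerical illustration of Eq. (2.53) (our own sketch, with our own choice of process): for a stationary Gaussian AR(1) process $X_t = a X_{t-1} + W_t$, $W_t \sim N(0, \sigma_w^2)$, the PSD is $S_X(\omega) = \sigma_w^2/|1 - a e^{-j\omega}|^2$, and the spectral integral reproduces the innovations-form entropy rate $\frac{1}{2}\log(2\pi e \sigma_w^2)$.

```python
import numpy as np
from scipy.integrate import quad

a, sigma_w = 0.7, 1.5      # AR(1): X_t = a*X_{t-1} + W_t, W_t ~ N(0, sigma_w^2)

def psd(omega):
    """Power spectral density of the stationary Gaussian AR(1) process."""
    return sigma_w**2 / np.abs(1.0 - a * np.exp(-1j * omega))**2

# Entropy rate from the spectral formula, Eq. (2.53)
integral, _ = quad(lambda w: np.log(psd(w)), -np.pi, np.pi)
h_rate_spectral = 0.5 * np.log(2 * np.pi * np.e) + integral / (4 * np.pi)

# For AR(1), the innovations representation gives 0.5*log(2*pi*e*sigma_w^2)
h_rate_closed = 0.5 * np.log(2 * np.pi * np.e * sigma_w**2)
print(h_rate_spectral, h_rate_closed)   # both ~1.824
```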

Appendix B α-Stable Distribution

α-stable distributions are a class of probability distributions satisfying the generalized central limit theorem; they are extensions of the Gaussian distribution. The Gaussian, Cauchy, and Lévy distributions are its special cases. Apart from these three distributions, α-stable distributions do not have PDFs with analytical expressions. However, their characteristic functions can be written in the following form:

$\psi(\omega) = \exp\!\big\{ j\mu\omega - \gamma|\omega|^{\alpha}\big[1 + j\beta\,\mathrm{sign}(\omega)\,S(\omega,\alpha)\big]\big\}$   (B.1)

where $S(\omega,\alpha) = \tan(\alpha\pi/2)$ for $\alpha \neq 1$ and $S(\omega,\alpha) = (2/\pi)\log|\omega|$ for $\alpha = 1$; $\mu \in \mathbb{R}$ is the location parameter, $\gamma > 0$ is the dispersion parameter, $\alpha \in (0,2]$ is the characteristic factor, and $\beta \in [-1,1]$ is the skewness factor. The parameter $\alpha$ determines the tail behavior of the distribution: the smaller the value of $\alpha$, the heavier the tail of the distribution. The distribution is symmetric if $\beta = 0$, in which case it is called a symmetric α-stable (SαS) distribution. The Gaussian and Cauchy distributions are α-stable distributions with $\alpha = 2$ and $\alpha = 1$ (and $\beta = 0$), respectively.

When $\alpha < 2$, the tails of an α-stable distribution decay more slowly than those of the Gaussian distribution, so such distributions can be used to describe outliers or impulsive noise. In this case the distribution has an infinite second-order moment, while its entropy is still finite.
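The heavy tails can be seen directly by sampling. The sketch below (our own) uses SciPy's levy_stable distribution; note that SciPy's parameterization of the characteristic function may differ from Eq. (B.1) in sign conventions, so the parameter mapping here is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(7)

# Symmetric alpha-stable samples (beta = 0): alpha = 2 is Gaussian,
# alpha = 1.5 has heavy (algebraic) tails.
gauss_like = levy_stable.rvs(2.0, 0.0, size=20_000, random_state=rng)
heavy = levy_stable.rvs(1.5, 0.0, size=20_000, random_state=rng)

# Empirical tail probability P(|X| > 5): much larger for alpha < 2.
print((np.abs(gauss_like) > 5).mean())
print((np.abs(heavy) > 5).mean())

# The sample variance of the alpha = 1.5 data is large and unstable across runs,
# reflecting the infinite second-order moment; the alpha = 2 case is well behaved.
print(np.var(gauss_like), np.var(heavy))
```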

Appendix C Proof of (2.17)

Proof

Assume $\phi$ is a concave function and $h$ is a monotonically increasing function. Denoting by $h^{-1}$ the inverse of the function $h$, we have

$h^{-1}\big(H_{h,\phi}(X+Y)\big) = \int \phi\big(p_{X+Y}(\tau)\big)\,d\tau$   (C.1)

Since $X$ and $Y$ are independent, their sum has the convolved PDF

$p_{X+Y}(\tau) = \int p_X(x)\,p_Y(\tau - x)\,dx$   (C.2)

According to Jensen's inequality (applied to the concave function $\phi$), we can derive

$\int \phi\big(p_{X+Y}(\tau)\big)\,d\tau = \int \phi\!\left(\int p_X(x)\,p_Y(\tau-x)\,dx\right) d\tau \ge \iint p_X(x)\,\phi\big(p_Y(\tau-x)\big)\,dx\,d\tau = \int \phi\big(p_Y(u)\big)\,du = h^{-1}\big(H_{h,\phi}(Y)\big)$   (C.3)

As $h$ is monotonically increasing, $h^{-1}$ must also be monotonically increasing; thus we have $H_{h,\phi}(X+Y) \ge H_{h,\phi}(Y)$. Similarly, $H_{h,\phi}(X+Y) \ge H_{h,\phi}(X)$. Therefore,

$H_{h,\phi}(X+Y) \ge \max\big\{H_{h,\phi}(X),\, H_{h,\phi}(Y)\big\}$   (C.4)

For the case in which image is a convex function and image is monotonically decreasing, the proof is similar (omitted).

Appendix D Proof of Cramer–Rao Inequality

Proof

First, one can derive the following two equalities:

$E_{\theta}\!\left[\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right] = \int p(x;\theta)\,\dfrac{\partial}{\partial\theta}\log p(x;\theta)\,dx = 0$   (D.1)

$E_{\theta}\!\left[\big(\hat{\theta}(x) - \theta^{*}\big)\left(\dfrac{\partial}{\partial\theta}\log p(x;\theta)\right)^{T}\right] = I$   (D.2)

where $I$ is the $d \times d$ identity matrix. Denote $u = \hat{\theta}(x) - \theta^{*}$ and $v = \dfrac{\partial}{\partial\theta}\log p(x;\theta)$.

Then

$E_{\theta}\!\left[\begin{bmatrix} u \\ v \end{bmatrix}\begin{bmatrix} u^T & v^T \end{bmatrix}\right] = \begin{bmatrix} P & I \\ I & J_F(\theta) \end{bmatrix}$   (D.3)

So we obtain

$\begin{bmatrix} P & I \\ I & J_F(\theta) \end{bmatrix} \ge 0$   (D.4)

According to matrix theory, if the symmetric block matrix $\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$ is positive semidefinite and $C$ is positive definite, then the Schur complement satisfies $A - BC^{-1}B^T \ge 0$. It follows that

$P - J_F^{-1}(\theta) \ge 0$   (D.5)

i.e., $P \ge J_F^{-1}(\theta)$.


1In this book, “log” always denotes the natural logarithm. The entropy will then be measured in nats.

2Strictly speaking, we should use subscripts to distinguish the PDFs $p(x)$, $p(x,y)$, and $p(x|y)$. For example, we can write them as $p_X(x)$, $p_{XY}(x,y)$, and $p_{X|Y}(x|y)$. In this book, for simplicity, we often omit these subscripts if no confusion arises.

3The detailed proofs of these properties can be found in related information theory textbooks, such as “Elements of Information Theory” written by Cover and Thomas [43].

4The Cauchy distribution is a non-Gaussian α-stable distribution (see Appendix B).

5On how to solve these equations, interested readers are referred to [150,151].

6Let $(\mathcal{X}, \mathcal{B})$ be a measurable space and assume a real-valued function $\kappa$ is defined on $\mathcal{X}\times\mathcal{X}$, i.e., $\kappa: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. Then the function $\kappa$ is called a Mercer kernel if and only if it is a continuous, symmetric, and positive-definite function. Here, $\kappa$ is said to be positive-definite if and only if

$\iint_{\mathcal{X}\times\mathcal{X}} \kappa(x,y)\,d\mu(x)\,d\mu(y) \ge 0$

where $\mu$ denotes any finite signed Borel measure on $(\mathcal{X},\mathcal{B})$. If the equality holds only for the zero measure, then $\kappa$ is said to be strictly positive-definite (SPD).

7Unless mentioned otherwise, in this book a vector refers to a column vector.
