1
Introduction to Information Theory

1.1. Introduction

Information theory was developed by Claude Shannon in the 1940s [SHA 48, SHA 59b]. It is a mathematical theory that introduced source coding and channel coding, two fundamental concepts of communication systems. This theory allows us to determine, depending on the source's properties, the theoretical limits of lossless source coding (exact reconstruction of the source messages) or lossy source coding (reconstruction of the messages under a fidelity criterion). It also gives us the rates that are reachable on a given transmission channel thanks to channel coding.

In this chapter, after reviewing the basics of discrete and continuous probabilities in section 1.2, we will start by introducing the fundamental notions of information theory such as the entropy and the average mutual information in section 1.3. We will then focus on the fundamental theorems of information theory for communication systems. We will state the lossless and lossy source coding theorems in sections 1.4 and 1.5, respectively. Then we will determine the theoretical limits of error-free communication over a noisy channel. In section 1.6, we will introduce different channel models and finally, in section 1.7, we will compute the capacity of these channels and state the channel coding theorem.

1.2. Review of probabilities

Probability theory is a mathematical domain that describes and models random processes. In this section, we present a summary of this theory. We recommend for further reading the books of Papoulis and Pillai [PAP 02] and R. Durrett [DUR 10].

Let X be an experiment or an observation that can be repeated several times under similar circumstances. At each repetition, the result of the observation is an event denoted as x, which can take several possible values. The set of these values is denoted as image.

The result X = x of this observation is not known before it takes place. X is consequently called a random variable. It is modeled by the frequency of appearance of all the outcomes.

Two classes of random variables can be distinguished as follows:

  • – discrete random variables, when the set of outcomes is discrete;
  • – continuous random variables, when their distribution functions are continuous.

1.2.1. Discrete random variables

A discrete random variable X takes its values in a discrete set, called its alphabet image. This alphabet may be infinite (for instance, if image = image) or finite with a size n, if image = {x1, x2, … , xn}. Each outcome is associated with a probability of occurrence PX = {p1, p2, … , pn}:

[1.1] images

For discrete random variables, the probability density fX(x) is defined by:

[1.2] images

where δ(u) is the Dirac function.

1.2.1.1. Joint probability

Let X and Y be two discrete random variables whose respective sets of possible outcomes are image = {x1, x2, … , xn} and image = {y1, y2, … , ym}.

Pr(X = xi, Y = yj) is called the joint probability of the events X = xi and Y = yj. Of course, the following property is verified:

[1.3] images

1.2.1.2. Marginal probability

The probability Pr(X = xi) can be computed from the set of joint probabilities Pr(X = xi, Y = yj):

[1.4] images

1.2.1.3. Conditional probability

[1.5] images

Similarly, we can write as:

[1.6] images

As a consequence, the following relation stands:

[1.7] Pr(X = xi, Y = yj) = Pr(X = xi|Y = yj) Pr(Y = yj) = Pr(Y = yj|X = xi) Pr(X = xi)

which can be further developed to:

Pr(X = xi|Y = yj) = Pr(Y = yj|X = xi) Pr(X = xi) / Pr(Y = yj)

This relation is called the Bayes law. From this equation, we can see that Pr(X = xi|Y = yj) is the a posteriori probability, whereas Pr(X = xi) is the a priori probability.

1.2.1.4. Independence

If two discrete random variables X and Y are independent, then

[1.8] images

and

[1.9] images
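
These relations can be illustrated with a short Python sketch. The joint probability table used below is an arbitrary numerical example (it does not come from this chapter); the marginal and conditional probabilities are computed from it and the Bayes law is checked numerically.

import numpy as np

# arbitrary joint probability table Pr(X = xi, Y = yj), for illustration only
Pxy = np.array([[0.10, 0.30],
                [0.20, 0.40]])

Px = Pxy.sum(axis=1)              # marginal probabilities Pr(X = xi)
Py = Pxy.sum(axis=0)              # marginal probabilities Pr(Y = yj)

P_x_given_y = Pxy / Py            # Pr(X = xi | Y = yj), column-wise division
P_y_given_x = Pxy / Px[:, None]   # Pr(Y = yj | X = xi), row-wise division

# Bayes law: Pr(X = xi | Y = yj) = Pr(Y = yj | X = xi) Pr(X = xi) / Pr(Y = yj)
assert np.allclose(P_x_given_y, P_y_given_x * Px[:, None] / Py)
print(P_x_given_y)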

1.2.2. Continuous random variables

The random variable X is continuous if its cumulative distribution function FX(x) is continuous. FX(x) is related to the probability density in the following way:

The mean of the random variable X is defined as:

[1.11] images

Its Nth moment is equal to:

[1.12] images

1.2.3. Jensen’s inequality

Let us first recall that a function f(x) is convex if, for any x, y and 0 < λ < 1, the following inequality stands:

[1.13] images

Let f be a convex function, (x1, … , xn) a real n-tuple belonging to the domain of definition of f and (p1, … , pn) a positive real n-tuple such that image. Then:

[1.14] images

Jensen’s inequality is obtained by interpreting the pi terms as probabilities: if f(x) is convex, then for any real discrete random variable X:

[1.15] images
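
Jensen's inequality can be checked numerically on a small example. The sketch below uses an arbitrary discrete distribution and the convex function f(x) = x2; it simply compares f(E[X]) with E[f(X)].

import numpy as np

p = np.array([0.2, 0.5, 0.3])    # arbitrary probabilities summing to 1
x = np.array([-1.0, 0.5, 2.0])   # values taken by the discrete random variable X
f = np.square                    # f(x) = x**2 is a convex function

lhs = f(np.dot(p, x))            # f(E[X])
rhs = np.dot(p, f(x))            # E[f(X)]
print(lhs, rhs, lhs <= rhs)      # Jensen's inequality: f(E[X]) <= E[f(X)]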

1.2.4. Random signals

The signals used in digital communications depend on time t.

Signal x(t) is deterministic if the function t images x(t) is perfectly known. If, however, the values taken by x(t) are unknown, the signal follows a random process. At time t, the random variable is denoted by X(t), and an outcome of this random variable is denoted as x(t). The set of all signal values x(t), for any t in the definition domain, is a given outcome of the random process X.

A random process is defined by its probability density and its statistical moments. The probability density is equal to:

[1.16] images

The random process is stationary if its probability density is independent of time: fX(x, t) = fX(x) ∀t. As a consequence, all of its statistical properties are independent of t. Its probability density can thus be obtained from equation [1.10] in the following way:

[1.17] images

mx(t), the mean of the random variable x(t) from the random process X, is defined as:

[1.18] images

The autocorrelation function Rxx(τ) of a random variable is:

[1.19] images

The random process X is second-order stationary or wide-sense stationary if, for any random signal x(t):

  1) its mean mx(t) is independent of t;
  2) its autocorrelation function verifies Rxx(t1, t2) = Rxx(t1 + t, t2 + t) ∀t.

Then it can simply be denoted as:

[1.20] images

In that case, the power spectrum density γxx(f) is obtained by applying the Fourier transform on the autocorrelation function:

[1.21] images

Reciprocally, the autocorrelation function Rxx(τ) is determined from the power spectrum density as follows:

[1.22] images

Generally, the mean and the autocorrelation function of a stationary random process are estimated from a set of outcomes of the signal X(t). When the time average tends to the random process mean, the random process is said to be ergodic: only one outcome of the random process X is then required to evaluate its mean and autocorrelation function. Most random processes considered in digital communications are second-order stationary and ergodic.

For discrete signals (for instance, signals that have been sampled from a continuous random signal x(t) at frequency image) xn = x(nTe), the autocorrelation function Rxx(τ) is only defined at discrete times τ = nTe, and the power spectrum density becomes:

[1.23] images

1.3. Entropy and mutual information

1.3.1. Comments on the concept of information

In everyday language, the notion of information associated with a message exchanged between a source and a receiver is linked, for instance, to the veracity of the message, to the a priori knowledge of the message by the receiver or to the receiver's understanding of it (a problem of language can occur, etc.).

All these considerations are a matter of semantics and are not taken into account by information theory. In information theory, we will only retain a part of the general concept of information: the quantitative measure of information is a measure of the uncertainty associated with an event. This notion of information is fundamental for the study of communication systems.

1.3.2. A logarithmic measure of information

A measure of the information associated with the event X = xi denoted as h(xi) should satisfy the following properties [SHA 48]:

  • h(xi) should be continuous in Pr(X = xi), for Pr(X = xi) between 0 and 1;
  • h(xi) = ∞ if Pr(X = xi) = 0;
  • h(xi) = 0 if Pr(X = xi) = 1: a certain event brings no information;
  • h(xi) > h(yj) if Pr(Y = yj) > Pr(X = xi): the more uncertain an event is, the more information its realization brings;
  • h(xi) + h(yj) = h(xi, yj) if the events Y = yj and X = xi are independent: the realization of two independent events brings a quantity of information equal to the sum of the quantities of information h(xi) and h(yj) of these two events.

In order to satisfy the above properties, the quantity of information h(xi) associated with the realization of the event X = xi should be equal to the logarithm of the inverse of the probability Pr(X = xi). With this definition, a high probability event will carry a smaller quantity of information than a low probability event. When the binary logarithm is used1, the unit of h(xi) is the Shannon (Sh). When the natural logarithm is used, the unit is the natural unit (Nat). In this book, we will consider binary logarithms. Consequently, we have the following relation:

[1.24] images

EXAMPLE 1.1.– Let a discrete source generate bits (0 or 1) with Pr(X = 0) = Pr(X = 1) = 1/2. The quantity of information associated with the realization of the event X = 0 or X = 1 is equal to:

[1.25] images

If the source generates a sequence composed of n independent bits, we have 2n different sequences. Each of these sequences can appear with the probability 2−n. The quantity of information associated with the realization of a specific sequence is equal to:

[1.26] images

Let us consider the realization of two events X = xi and Y = xj. The associated quantity of information is:

[1.27] images

where Pr(X = xi, Y = yj) is the joint probability of the two events.

The quantity of information associated with the realization of the event X = xi conditionally to the event Y = yj is as follows:

[1.28] images

From relation [1.7], we deduce:

[1.29] images

or also:

EXAMPLE 1.2.– A card is randomly drawn from a standard deck of 32 cards (4 suits: heart, spade, diamond, club – 8 values: 7, 8, 9, 10, Jack, Queen, King and Ace). Let x be the event “the drawn card is the club ace” and y be the event “the drawn card is a club”. We can compute h(x), h(y) and h(x|y). Since we have:

[1.31] images

we obtain:

[1.32] images
[1.33] images
[1.34] images

1.3.3. Mutual information

We define the mutual information as the quantity of information that the realization of the event Y = yj gives about the event X = xi. In other words, the mutual information is the difference between the quantity of information associated with the realization of the event X = xi and the quantity of information associated with the realization of the event X = xi conditionally to the event Y = yj. This quantity of information is given by:

[1.35] images

If the two events are independent, we have Pr(X = xi|Y = yj) = Pr(X = xi) and consequently i(xi; yj) = 0. On the contrary, if the event X = xi is equivalent to the event Y = yj, then we have Pr(X = xi|Y = yj) = 1 and i(xi; yj) = h(xi).

Since we have the following relation:

images

The quantity of information that the realization of the event Y = yj gives about the event X = xi is therefore the same as the quantity of information that the realization of the event X = xi gives about the event Y = yj:

[1.36] images

We also have the following relations:

images

Unlike h(xi), the mutual information i(xi; yj) can be negative.

EXAMPLE 1.3.– (cont.) Compute i(x; y):

[1.37] images

The quantity of information that the realization of the event “the drawn card is a club” gives about the event “the drawn card is a club ace” is equal to 2 Sh.
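
The computations of examples 1.2 and 1.3 can be verified with a few lines of Python:

import math

# example 1.2: a card is randomly drawn from a standard deck of 32 cards
p_x = 1 / 32           # Pr("the drawn card is the club ace")
p_y = 8 / 32           # Pr("the drawn card is a club")
p_x_given_y = 1 / 8    # Pr("club ace" given "club")

h_x = -math.log2(p_x)                  # 5 Sh
h_y = -math.log2(p_y)                  # 2 Sh
h_x_given_y = -math.log2(p_x_given_y)  # 3 Sh
i_xy = h_x - h_x_given_y               # mutual information of example 1.3: 2 Sh
print(h_x, h_y, h_x_given_y, i_xy)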

We will see in section 1.7 that the mutual information is important for communications when we associate X with the input of the transmission channel and Y with its output.

1.3.4. Entropy and average mutual information

After having considered individual events, we will now compute the entropy of a source described by the random variable X with sample space image = {x1, x2, … , xn} and associated probabilities image = {p1, p2, … , pn}, where n is the size of the sample space. The average quantity of information, or entropy, of the source is the mean of the quantity of information associated with each possible realization of the event X = xi:

[1.38] images

H(X) is a measure of the uncertainty on X.

The entropy H(X) has the following properties:

[1.39] images
[1.40] images

The entropy is maximum when all the probabilities pi are the same.

EXAMPLE 1.4.– Let a source have two states x0 and x1 with probabilities p0 = p and p1 = 1 − p. The entropy of this source is the following:

[1.42] images

The function H2(p) defined above will often be used in this chapter to simplify the notations. It is plotted in Figure 1.1.

image

Figure 1.1. Entropy of a binary source

The entropy is maximum (1 Sh/bit) when p = 0.5.
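
The binary entropy function H2(p) will be used repeatedly in this chapter; the following minimal Python sketch evaluates it for a few values of p.

import math

def H2(p):
    """Entropy of a binary source with Pr(x0) = p, in Shannon."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.25, 0.5, 0.9):
    print(f"H2({p}) = {H2(p):.4f} Sh")   # the maximum value 1 Sh is reached for p = 0.5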

We define two random variables X and Y with state spaces image = {x1, x2, … , xn} and image = {y1, y2, … , ym}, respectively. The joint entropy H(X, Y) is defined as follows:

[1.43] images

If the two random variables X and Y are independent, the joint entropy is equal to the sum of the entropies H(X) and H(Y).

We can also compute the conditional entropy H(X|Y), corresponding to the average quantity of information of X given the observation of Y, from h(xi|yj):

[1.44] images

The relations [1.7] and [1.30] enable us to write the joint entropy as a function of the entropy and the conditional entropy:

[1.45] images

The uncertainty on X and Y is equal to the sum of the uncertainty on X and the uncertainty on Y given X.

The mutual information associated with the realization of an event can be extended to the random variable X and Y. The average mutual information between the random variables X and Y is given by:

[1.46] images

Consequently, we have also the following relations:

[1.47] images

The average mutual information I(X; Y) measures the average quantity of information on X (or reduction of average uncertainty) due to the observation of Y. Figure 1.2 shows graphically the relations between the different entropies and the average mutual information.

image

Figure 1.2. Relations between entropies and average mutual information

While i(xi; yj) can be negative, we always have I(X; Y) ≥ 0.
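
These relations between the entropies and the average mutual information can be checked numerically. The joint distribution used in the sketch below is an arbitrary example, not taken from this chapter.

import numpy as np

# arbitrary joint distribution Pr(X = xi, Y = yj), for illustration only
Pxy = np.array([[0.25, 0.25],
                [0.05, 0.45]])

Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)

Hx = -np.sum(Px * np.log2(Px))       # entropy H(X)
Hy = -np.sum(Py * np.log2(Py))       # entropy H(Y)
Hxy = -np.sum(Pxy * np.log2(Pxy))    # joint entropy H(X, Y)
Hx_given_y = Hxy - Hy                # conditional entropy H(X|Y)
Ixy = Hx - Hx_given_y                # average mutual information I(X; Y)

print(f"H(X) = {Hx:.3f}  H(Y) = {Hy:.3f}  H(X,Y) = {Hxy:.3f}")
print(f"H(X|Y) = {Hx_given_y:.3f}  I(X;Y) = {Ixy:.3f} Sh")   # I(X;Y) >= 0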

1.3.5. Kullback–Leibler divergence

Let X and Y be two random variables with state spaces image = {x1, x2, … , xn} and image = {y1, y2, … , yn} and with associated probabilities image = {p1, p2, … , pn} and image = {q1, q2, … , qn}, respectively.

We define the relative entropy or Kullback–Leibler divergence as follows:

[1.48] images

While the Kullback–Leibler divergence is often considered as a measure of distance between two probability distributions, it is not strictly speaking a distance. For example, it is generally not symmetric: image.

We can prove that image and that image if and only if pi = qi ∀i = 1, … , n.

For continuous random variables P and Q with probability densities p(x) and q(x), respectively, the Kullback–Leibler divergence is defined by the integral:

[1.49] images
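
As an illustration of the discrete definition [1.48], the sketch below computes the Kullback–Leibler divergence for two arbitrary distributions and shows that it is not symmetric.

import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in Shannon (all qi assumed non-zero)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.4, 0.4, 0.2]     # arbitrary distributions, for illustration only
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values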

1.3.6. Differential entropy

The differential entropy is a generalization of the entropy to the case of continuous random variables. Let a continuous random variable X be defined by the probability density p(x). The differential entropy HD(X) of X is equal to:

[1.50] images

The differential entropy characterizes the quantity of information of the continuous random variable X.

Let us compute the differential entropy HD(X) of a random variable X with a centered Gaussian probability distribution of variance image. We have:

[1.51] images
[1.52] images

Since log2 image does not depend on x and

images

we can extract the first term from the integral. We then obtain:

[1.53] images

By definition,

[1.54] images

Finally, we have:

[1.55] HD(X) = ½ log2 (2πeσx2)
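
The differential entropy of a Gaussian random variable can be checked numerically by a Monte Carlo estimation of the mean of −log2 p(X); the sketch below uses an arbitrary seed and sample size.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, 1_000_000)

# differential entropy estimated as the empirical mean of -log2 p(x)
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
hd_monte_carlo = np.mean(-np.log2(p))
hd_theory = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(hd_monte_carlo, hd_theory)   # both close to 2.05 Sh for sigma = 1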

1.3.7. Typical and jointly typical sequences

1.3.7.1. Typical sequences

The set of all sequences of a random variable can be divided into two disjoint sets: the set of typical sequences and the set of non-typical sequences. Typical sequences are an important tool for the proof of the fundamental theorems of information theory.

Let X = (X1, X2, … , XN) be a sequence of independent and identically distributed (i.i.d.) random variables with state space image, and let x be a realization of X.

Since the variables Xi are independent, the terms log2(Pr(Xi)) are also independent and we have the following relation:

[1.56] − (1/N) log2 Pr(x1, x2, … , xN) = − (1/N) Σ_{i=1}^{N} log2 Pr(xi)

This relation converges to H(X) when N is large enough. This property is called the asymptotic equipartition principle (AEP). Among the set of all the possible sequences image, we define the typical sequences as those whose probability of occurrence is close to 2−N(H(X)). The set of typical sequences image is described as follows:

From the relations [1.56] and [1.57], we can easily prove the following properties:

  • 1) The probability that a sequence x is typical converges to 1 when N → ∞.
  • 2) The number of typical sequences |image| is upper bounded by:
[1.58] images

To summarize, for a sequence of random variables X with N large enough, the realizations x belong almost surely to the set image composed of around 2N(H(X)) sequences, each having an occurrence probability close to 2−N(H(X)).

In order to illustrate the typical sequences, let us consider a simple example. Let X be a binary memoryless random variable with occurrence probabilities p1 = image and p2 = image. The entropy of X is equal to 0.9183 Sh/symb. For a vector of length N, we have image sequences2 composed of n symbols x1 and N − n symbols x2, each with occurrence probability image. In Figure 1.3, we have plotted the probability distribution of the sequences for N = 100 (continuous line) and the distribution of the typical sequence set (dashed line). We can see that, for N = 100, the probability distribution is close enough to that of the set composed of about 4.4 × 10^27 typical sequences.

image

Figure 1.3. Density probability and typical sequences set
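
The size and the probability of the typical set of this example can be evaluated exactly by enumerating the sequence compositions. The sketch below assumes the probabilities p1 = 1/3 and p2 = 2/3 (consistent with the entropy of 0.9183 Sh/symb quoted above) and an arbitrary tolerance ε = 0.05.

import math

p1, p2 = 1/3, 2/3       # assumed probabilities, consistent with H(X) = 0.9183 Sh/symb
N, eps = 100, 0.05      # sequence length and (arbitrary) tolerance
H = -(p1 * math.log2(p1) + p2 * math.log2(p2))

count, prob_mass = 0, 0.0
for n in range(N + 1):                 # n = number of symbols x1 in the sequence
    p_seq = p1**n * p2**(N - n)        # probability of one such sequence
    if 2**(-N * (H + eps)) <= p_seq <= 2**(-N * (H - eps)):
        count += math.comb(N, n)
        prob_mass += math.comb(N, n) * p_seq

print(f"H(X) = {H:.4f} Sh/symb")
print(f"number of typical sequences ~ {count:.2e}  (2**(N*H) ~ {2**(N * H):.2e})")
print(f"probability of the typical set = {prob_mass:.3f}")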

1.3.7.2. Jointly typical sequences

The concept of typical sequences can be extended to the case of sequences of couples of random variables. The set of jointly typical sequences image is described as follows:

Let (x, y) be a sequence of couples of random variables, i.i.d. according to the following joint probability:

[1.60] images

We can easily prove the following properties:

  • – the probability that a sequence of couples (x, y) is jointly typical tends to 1 when N → ∞;
  • – the number of sequences of couples that are jointly typical |image| is upper bounded by:
images
  • – let x′ and y′ be two independent realizations of X and Y, respectively; the probability that (x′, y′) belongs to the set of jointly typical sequences of couples is:
[1.61] images

The first two properties are obtained from the relations [1.56] and [1.59]. The proof of the third property is the following:

[1.62] images

Since we have about 2N(H(X)) typical sequences x, 2N(H(Y)) typical sequences y and only 2N(H(X,Y)) jointly typical sequences of couples, the probability of choosing a jointly typical sequence of couples among the set of couples of independently selected typical sequences is approximately equal to:

[1.63] images

1.4. Lossless source coding theorems

1.4.1. Introduction

Source coding is divided into two classes: lossless source coding and lossy source coding. We will first study lossless source coding, also called entropy coding. The aim of lossless source coding is to describe the digital sequence delivered by the source with the shortest possible sequence of symbols, while allowing the source decoder to reconstruct it exactly.

As in most of the classical books on digital communications, in this chapter, we will study the techniques of source coding by ignoring the effect of the transmission channel and using Shannon’s paradigm. Consequently, we assume that the output of the source coder is directly connected to the input of the source decoder as shown in Figure 1.4.

1.4.2. Entropy and source redundancy

We consider a discrete and stationary source whose output symbols are Q-ary (the size of the alphabet is equal to Q). The output of this source is described by the random variable X. Consequently, the entropy H(X) is the average quantity of information per symbol at the output of the source.

image

Figure 1.4. Block diagram of the studied chain for source coding

A discrete source is memoryless if the output symbols are independent. Otherwise, we will say that the source is with memory.

If the source is memoryless, H(X) can be computed as previously:

[1.64] images

The entropy H(X) is maximum if the symbols are equiprobable. For Q-ary symbols, the maximum entropy is HMAX = log2 Q.

If the source is with memory, then the entropy per symbol H(X) can be computed by:

[1.65] images

where HJ(X) is the entropy per group of J symbols.

The redundancy of the source Rred characterizes the difference between the quantity of information of the source and the quantity of information of a source with equiprobable symbols. We have:

[1.66] images

The range of Rred is between 0 (the source symbols are independent and equiprobable) and 1 (the entropy of this source is zero).

1.4.3. Fundamental theorem of source coding

The fundamental theorem of source coding stated by Shannon [SHA 48] is the following:

THEOREM 1.1.– Let ε > 0. For any stationary source with entropy per symbol H(X), there is a binary source coding method that associates with each message x of length N a binary word of average length NRmoy such that:

[1.67] images

Consequently, we can associate, on average, NH(X) bits with each message x.

PROOF.– We have seen previously that the typical sequences allow us to divide the set of sequences of random variables into two disjoint sets. The total set of the |image|N sequences x can be divided into two sets: the set image of typical sequences with image and the set image of the non-typical sequences.

In order to distinguish these two sets, we can add a prefix “0” for the typical sequences and “1” for the non-typical sequences. The typical sequences can thus be encoded using N(H(X) + ε) + 2 bits (the additional bit takes into account the fact that N(H(X) + ε) is not necessarily an integer). The non-typical sequences can be encoded using at most N log2 |image| + 2 bits.

We can bound the rate or average bits per realization Rmoy as follows:

[1.68] images
images

image can be made as small as we want by choosing a large enough value of N.

1.4.4. Lossless source coding

1.4.4.1. Introduction

From a general point of view, the source coder associates with each message delivered by the source a word composed of q-ary symbols while trying to minimize the average number of these symbols. A message, depending on the context, will be a Q-ary symbol from the source or a set of J Q-ary symbols.

image

Figure 1.5. Source coding

We will restrict ourselves to the case where the symbols at the output of the source coder are bits (q = 2). The generalization to other alphabet sizes is possible without any particular difficulties.

The source coding should satisfy the two following criteria:

  • – unique coding: each message should be coded with a different word;
  • – unique decoding: each word should be distinguishable without ambiguity. This criterion can be satisfied using:
    • - coding with fixed-length words,
    • - coding using a distinguishable separation symbol (the Morse system for example),
    • - coding with variable-length words.

Since source coding with a separation symbol is a suboptimal and obsolete solution, in the rest of this chapter we will focus only on fixed and variable length source coding.

A source code is instantaneous if no word is the beginning of another one. This condition, called the prefix condition, facilitates the decoding.

1.4.4.2. Variable length coding

EXAMPLE 1.5.– Let a discrete source generate four different messages a1, a2, a3 and a4 with respective probabilities Pr(a1) = image, Pr(a2) = image and Pr(a3) = Pr(a4) = image. One word is associated with each message as described in Table 1.1.

Table 1.1. Variable length code of example 1

Message Word
a1 1
a2 00
a3 01
a4 10

We can check that this source code satisfies the criterion of unique coding but does not allow a unique decoding. Indeed, it is for example not possible to decode the message a1, a2, a1 coded by the sequence 1001: at the receiver, we have no way to decide whether the transmitted message was a1, a2, a1 or a4, a3. Consequently, this code is unusable.

EXAMPLE 1.6.– One word is associated with each message as described in Table 1.2.

We can check that this source code satisfies both unique coding and unique decoding. However, this code is not instantaneous. Indeed, the word associated with a3 is the beginning of the word associated with a4. After the sequence 11, it is necessary to determine the parity of the number of following zeros in order to be able to decode the rest of the transmitted message. Consequently, the decoding is more complex.

Table 1.2. Variable length code of example 2

Message Word
a1 00
a2 10
a3 11
a4 110

EXAMPLE 1.7.– One word is associated with each message as described in Table 1.3.

Table 1.3. Variable length code of example 3

Message Word
a1 0
a2 10
a3 110
a4 111

This source code satisfies both unique coding and unique decoding. It is also instantaneous, as we can check in Figure 1.6, which shows the tree associated with this source code.

image

Figure 1.6. Tree associated with the source code of example 3

1.4.4.3. Kraft inequality

THEOREM 1.2.– An instantaneous code composed of Q binary words of respective lengths n1, n2, … , nQ, with n1 ≤ n2 ≤ … ≤ nQ, must satisfy the following inequality:

[1.69] images

PROOF.– An instantaneous code can be graphically represented using a complete binary tree of depth nQ. Each leaf of the final tree is associated with one of the source messages. Each word is the sequence of labels along the path going from the root of the tree to the corresponding leaf.

image

Figure 1.7. Kraft inequality

A complete tree is composed of image leaves. Let us choose a node of depth n1 as the leaf associated with the first word C1; this choice eliminates image leaves. Among the remaining nodes, we choose a node of depth n2 as the leaf associated with the second word C2. This choice eliminates image leaves. We continue this procedure until the last word. The necessary condition to guarantee an instantaneous decoding is the following:

images

By dividing both sides by image, we obtain the Kraft inequality.
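
The Kraft inequality and the prefix condition can be checked directly on the codes of Tables 1.1, 1.2 and 1.3 with the following minimal Python sketch:

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality for binary codewords."""
    return sum(2**(-n) for n in lengths)

codes = {
    "Table 1.1": ["1", "00", "01", "10"],
    "Table 1.2": ["00", "10", "11", "110"],
    "Table 1.3": ["0", "10", "110", "111"],
}
for name, words in codes.items():
    s = kraft_sum(len(w) for w in words)
    prefix = all(not w2.startswith(w1)
                 for w1 in words for w2 in words if w1 != w2)
    print(f"{name}: Kraft sum = {s:.3f}, instantaneous (prefix) code: {prefix}")

The code of Table 1.1 violates the inequality, while the code of Table 1.2 satisfies it without being instantaneous: the Kraft inequality is a condition on the codeword lengths, not on the codewords themselves.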

1.4.4.4. Fundamental theorem of source coding

We first consider a memoryless source with entropy per symbol H(X). We will prove that, for this source, it is possible to build an instantaneous code for which the average length of the words Rmoy satisfies the following inequality:

[1.70] images

with

[1.71] images

PROOF.– We choose ni, the length of the word associated with the ith message, as follows:

[1.72] images

⌈x⌉ is the smallest integer not less than x.

Let us verify that such a source code is instantaneous or, equivalently, that it satisfies the Kraft inequality:

[1.73] images

since ⌈− log2 pi⌉ ≥ − log2 pi.

Consequently, we have:

[1.74] images
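
The construction [1.72] can be illustrated numerically; the source distribution used below is an arbitrary example, not taken from the book.

import math

p = [0.4, 0.2, 0.2, 0.1, 0.1]                 # arbitrary source probabilities
H = -sum(pi * math.log2(pi) for pi in p)      # entropy per symbol
n = [math.ceil(-math.log2(pi)) for pi in p]   # word lengths given by [1.72]

Rmoy = sum(pi * ni for pi, ni in zip(p, n))   # average word length
kraft = sum(2**(-ni) for ni in n)             # must be <= 1 (Kraft inequality)

print(f"H(X) = {H:.3f} Sh, Rmoy = {Rmoy:.3f} bits, Kraft sum = {kraft:.3f}")
# one can check that H(X) <= Rmoy < H(X) + 1, as stated by [1.70]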

The fundamental theorem of source coding can be expressed as follows:

THEOREM 1.3.– For any stationary source with entropy per symbol H(X), there is a source coder that can encode the messages into binary words with an average length Rmoy as close as desired to H(X):

[1.75] images

We consider again a memoryless source with entropy per symbol H(X). By grouping the symbols of this source into messages composed of J symbols, we obtain a new source that can also be encoded using an instantaneous code. The average length of the words of this new code RJmoy satisfies the following inequality:

[1.76] images

Dividing the terms by J, we obtain:

[1.77] images

Rmoy is the average number of bits associated with one symbol of the source image. By increasing J, Rmoy can asymptotically reach H(X):

[1.78] images

This result can be directly generalized to the case of sources with memory.

1.4.4.5. Entropy rate

We consider a stationary and discrete source X with entropy per symbol H(X) Sh/symbol generating DS symbols per second. We define the entropy rate DI as follows:

[1.79] images

The binary data rate at the output of the coder D′B is the product of the symbol rate DS by the average number of bits per symbol Rmoy:

[1.80] images

At the output of the binary source encoder, we define H′(X), the entropy per bit, as follows:

[1.81] images

The entropy rate D′I at the output of the source coder is then given by:

[1.82] images

As expected, the entropy rate is not changed by the source coding. From the theorem of source coding, we have:

[1.83] images

Multiplying both sides by DS, we obtain:

[1.84] images

Consequently, DI, the entropy rate of the source, is the lower bound on the binary data rate obtained after source coding. In case of equality, one bit will carry the quantity of information of one Shannon. If the redundancy of the output sequence is not zero, then one bit will carry less than one Shannon.

image

Figure 1.8. Entropy rate

1.5. Theorem for lossy source coding

1.5.1. Introduction

We will now study lossy source coding. When the source generates a continuous real-valued signal, we obtain real samples after sampling. Compared to the previous section, the size of the state space is infinite and it is not possible to describe the source exactly using a finite number of bits. The aim of lossy source coding is to minimize a fidelity criterion, such as the mean square error or a subjective quality measure, under a binary rate constraint, or to minimize the binary rate under a given fidelity constraint. While the mean square error is not always the best criterion, it is still the most widely used and we will consider it in this section. In the next chapter, we will study different practical solutions to implement lossy source coding such as scalar quantization with or without prediction, vector quantization, transform coding, etc.

In this section, we will introduce the concepts of distortion and rate-distortion function. Then we will give the fundamental theorem of lossy source coding.

1.5.2. Definitions

As previously mentioned, we will assume that the output of the source coder is directly connected to the input of the source decoder. The lossy source coder associates a binary word of length R bits with each sequence x ∈ χ and the source decoder associates a sequence image with each of the 2R possible binary words. The sequence image is called the quantized or estimated sequence.

image

Figure 1.9. Block diagram of the coder-decoder

DEFINITION 1.1.– The distortion per dimension between the sequences x and image of dimension N is defined by:

[1.85] images

DEFINITION 1.2.– The average distortion per dimension of the coder-decoder is defined by:

[1.86] images

where f(x) is the probability density of x.

DEFINITION 1.3.– A pair (R, D) is said to be achievable if there is a coder-decoder such that:

[1.87] images

In his famous 1959 paper [SHA 59b], Shannon introduced the rate-distortion function, which gives the minimum theoretical rate required under a given distortion constraint.

DEFINITION 1.4.– For a given memoryless source, the rate distortion function R(D) is defined as follows:

[1.88] images

1.5.3. Lossy source coding theorem

THEOREM 1.4.– The number of bits per dimension R required to describe a sequence of real samples with a given average distortion D must be higher than or equal to R(D).

[1.89] images

The proof of this theorem [COV 91] is based on the jointly typical sequences.

If the source is Gaussian, it can be proved that the following relation is always true:

[1.90] R(D) = ½ log2 (σx2/D) if 0 ≤ D ≤ σx2, and R(D) = 0 otherwise

where σx2 is the source variance. By introducing the distortion-rate function D(R), the relation [1.90] can also be written as:

[1.91] D(R) = σx2 2−2R

As for the lossless source coding theorem, this theorem only gives a theoretical limit. For example, a simple uniform quantization does not generally allow us to reach this limit. Indeed, in the above theorem, it is assumed that the dimension N of the vector x to be coded tends to infinity. The rate-distortion function can be generalized to the case of sources with memory.

In Figure 1.10, we have plotted the distortion-rate given by the relation [1.91] for image = 1.

image

Figure 1.10. Shannon distortion-rate for a Gaussian source of unitary variance

In Figure 1.11, we have shown the optimal values yi obtained using a uniform quantization (illustrated by “*”) and a non-uniform quantization (illustrated by “o”) for L = 8 and a Gaussian source.

image

Figure 1.11. Uniform and non-uniform quantization L = 8 for a Gaussian source with variance σx = 1

In this simple case (R = 3 bits/sample and σx = 1), the average distortions of the uniform and non-uniform quantizations are −14.27 dB and −14.62 dB, respectively. The Shannon distortion-rate, in that case, is 10 log10(2−6) = −18.06 dB. Since the probabilities associated with each interval are not the same, we can apply a lossless source coding after the quantization in order to reduce the binary data rate. This lossless source coding brings an additional gain of 2 dB. However, in order to reach the Shannon theoretical limit, we will have to perform vector quantization, which means that we will have to combine several samples together. These different techniques will be developed in Chapter 2.
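
Assuming the Gaussian distortion-rate function D(R) = σx2 2−2R of relation [1.91], the theoretical limit quoted above can be recomputed with a few lines of Python:

import math

sigma2 = 1.0                          # source variance (sigma_x = 1 as in Figure 1.10)
for R in range(1, 6):                 # rate in bits per sample
    D = sigma2 * 2**(-2 * R)          # Gaussian distortion-rate function
    print(f"R = {R} bit/sample : D(R) = {D:.5f} = {10 * math.log10(D):.2f} dB")
# for R = 3 bits/sample, D(R) = 2**(-6), i.e. -18.06 dB as indicated above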

1.6. Transmission channel models

1.6.1. Binary symmetric channel

The binary symmetric channel is the simplest transmission channel: its input and output are binary and it is defined using a single parameter.

This channel can be described using a graph as shown in Figure 1.12 on which we list the possible values of the input and output. The labels on the branches are the conditional probabilities Pr(Y = y|X = x).

image

Figure 1.12. Binary symmetric channel

This channel is characterized by the four following conditional probabilities:

[1.92] images

p is often called the inversion probability. We also define the probabilities associated with the input bits, also called a priori probabilities: Pr(X = 0) = q and Pr(X = 1) = 1 − q.

The error probability for this channel can be easily calculated as follows:

[1.93] images

The binary symmetric channel is memoryless: let x and y be respectively the input and output sequences composed of n bits: x = [x0, x1, … , xn−1], and y = [y0, y1, … , yn−1]. Since the channel is memoryless, we have the following relation:

images

The joint conditional probability is the product of the n conditional probabilities Pr(Y = yi|X = xi).

Using the Bayes rule, we can compute the probabilities Pr(X, Y), Pr(Y) and Pr(X|Y) from Pr(Y|X) and Pr(X) as given in Table 1.4.

Table 1.4. Probabilities Pr(X, Y), Pr(Y) and Pr(X|Y) for the binary symmetric channel

Pr(X, Y) Y = 0 Y = 1
X = 0 q(1 – p) qp
X = 1 (1 – q)p (1 – q)(1 – p)
Pr(Y)  
Y = 0 q(1 – p)+(1 – q)p
Y = 1 qp + (1 – q)(1 – p)
Pr(X|Y) Y = 0 Y = 1
X = 0 images images
X = 1 images images

From Pr(Y|X), we can compute the conditional entropy H(Y|X). We have:

[1.94] images

If q = 0.5, we also have H(X|Y) = H2(p).

In Figure 1.13, we have plotted the curves H(X|Y) = f(q) for a binary symmetric channel with p =0.1, 0.2 and 0.5.

image

Figure 1.13. Conditional entropy H(X|Y) versus q and p

1.6.2. Discrete channels without memory

The binary symmetric channel is a specific case of the class of discrete channels without memory. The input symbols of a discrete channel are M-ary symbols and the output symbols are L-ary symbols. The channel is described using a set of LM conditional probabilities Pr(Y = yj|X = xi) = pij, together with the a priori probabilities Pr(X = xi) = qi. As in the binary case, we can easily show that the conditional probabilities satisfy the following relation:

[1.95] images

These channels are described using a graph as shown in Figure 1.14.

image

Figure 1.14. Discrete channel without memory

The symbol error probability Pe in this channel is given by:

[1.96] images

1.6.3. Binary erasure channel

The binary erasure channel, introduced by Elias in 1955 [ELI 55], is a channel in which some bits can be lost or erased. Compared to the binary symmetric channel, we add an event Y = ε corresponding to the case where a transmitted bit has been erased. This channel is characterized by the following conditional probabilities:

[1.97] images

p is the erasure probability. This transmission channel model is often used to model the packet loss in high level communication protocols. This channel is described by the diagram in Figure 1.15.

image

Figure 1.15. Erasure channel

Using the Bayes rule, we can compute Pr(X, Y), Pr(Y) and Pr(X|Y) from Pr(Y|X) and Pr(X) as shown in Table 1.5.

The conditional entropy H(X|Y) is equal to pH2(q).

1.6.4. Additive white Gaussian noise channel

The additive white Gaussian noise (AWGN) channel is the most important channel model. It allows us to model transmission channels for which the predominant noise is thermal noise. We assume that the bandwidth of the transmitted baseband signal is B and that the noise added to the transmitted signal is stationary, white, Gaussian, with a unilateral power spectral density N0. The noise power N is equal to:

[1.98] images

Table 1.5. Probabilities Pr(X,Y), Pr(Y) and Pr(X|Y) for the erasure channel

Pr(X,Y) Y = 0 Y = ε Y = 1
X = 0 q(1 − p) qp 0
X = 1 0 (1 − q)p (1 − q)(1 − p)
Pr(Y)
Y = 0 q(1 − p)
Y = ε p
Y = 1 qp + (1 − q)(1 − p)
Pr(X|Y) Y = 0 Y = ε Y = 1
X = 0 1 q 0
X = 1 0 1 − q 1

The sampling theorem says that 2BT samples are enough to represent a band-limited signal (fmax = B) of duration T. We will show in Volume 2 of this book [PIS 15] that the optimal demodulator is composed of a matched filter that limits the noise bandwidth. After matched filtering and sampling, the relation between the input symbol xi, drawn from a discrete alphabet, and the output symbol yi at time i can be written for the AWGN channel as follows:

[1.99] images

ni is the white noise sample with a centered Gaussian probability density:

[1.100] images

The variance image of the noise sample ni is equal to:

images

Consequently, the probability density of yi conditionally to xi is:

[1.101] images

1.7. Capacity of a transmission channel

1.7.1. Introduction

We would like to solve the following problem: let us assume that we transmit equiprobable bits Pr(X = 0) = Pr(X = 1) = ½ at a binary rate of 1000 bits per second through a binary symmetric channel with parameter p = 0.1. What is the maximum information rate that can be transmitted? We could imagine that this rate is 900 Sh/s, by subtracting the number of errors per second. However, this is not a good idea since we do not know the error positions. For example, when p = 0.5, we have on average 500 errors per second and no information is transmitted. In the following, we will answer this question.

1.7.2. Capacity of a transmission channel

We denote by X and Y the random variables associated with the input and the output of the channel, respectively.

DEFINITION 1.5.– We define the capacity of a transmission channel as follows:

[1.102] C = max I(X; Y)

The capacity of a transmission channel is the maximum of average mutual information. The maximization is performed over all the possible sources. If the channel is memoryless, the maximization is done over the set of all possible distributions p(x) of the input symbols x.

We first consider that the transmission channel is noiseless. In that case, the capacity is the average quantity of information that the input symbols can carry.

The capacity C is defined in Shannon/symbol. It is also possible to express the capacity in Shannon/second (we will refer to capacity per time unit instead of capacity per symbol). To distinguish it from the capacity per symbol, we will denote it by C′. We have:

[1.103] images

When the channel is noiseless, the capacity C is equal to log2 Q. Indeed, the average quantity of information is maximized when the source entropy is maximized, that is to say, when all the Q-ary input symbols are equiprobable. Then we have:

[1.104] C = HMAX(X) = log2 Q

When the channel is noisy, we have C < HMAX(X). In order to compute the capacity of a transmission channel, we have to calculate the average quantity of information that is lost in the channel. We have seen previously that H(X|Y) is the measure of the residual uncertainty on X knowing Y. In order to perform a good transmission, it is desirable that this quantity is equal to zero or negligible.

H(X|Y) corresponds to the average quantity of information lost in the channel. When the channel is noiseless, we have H(X|Y) = H(X|X) = 0 and consequently C = HMAX(X) corresponding to the relation [1.104].

When the channel is so noisy that X and Y are independent, we have H(X|Y) = H(X). In that case, the capacity is zero: C = 0. These two cases are illustrated in Figures 1.16 and 1.17.

image

Figure 1.16. Case C = HMAX(X)

image

Figure 1.17. Case C = 0

When the source entropy is equal to HMAX(X), H(X|Y) only depends on the transmission channel. If H(X|Y) is non-negligible (case of a noisy channel), it is not possible to perform a transmission without errors by simply connecting the source to the input of the transmission channel. We have to add an element called a channel coder between the source and the transmission channel. Figure 1.18 illustrates the new communication chain including the channel coder, the noisy channel and the channel decoder.

image

Figure 1.18. Communication system with channel coding

For this new system, we can define the average mutual information I(U;V) = H(U) − H(U|V). The role of channel coding is to make the average quantity of information H(U|V) as low as desired. It is then possible to transmit through the noisy channel an average quantity of information H(U) with the desired quality criterion. Of course, we have H(U) < H(X) due to the redundancy added by the channel coding.

We will now state the fundamental theorem of channel coding.

1.7.3. Fundamental theorem of channel coding

There exists a channel coding scheme guaranteeing a communication with an error rate as low as desired, under the condition that the average quantity of information entering the channel coder–channel–channel decoder block is less than the capacity C of the channel [SHA 48]:

[1.105] images

Thus, the capacity of a transmission channel as defined in equation [1.102] is equal to the highest number of information bits that can be transmitted through the channel with an error rate as low as desired.

Multiplying the two terms of this inequality by Ds (the data rate of the source), we obtain an inequality between the maximum binary information rate Db and the capacity per time unit C′:

[1.106] images

The proof of the fundamental theorem of channel coding is based on the principle of random coding and on the properties of jointly typical sequences of couples. Shannon proposed to use the following channel coder and channel decoder to prove the theorem: the codewords associated with the information words are randomly generated among a set of 2NR codewords. The decoder checks whether there exists a unique codeword jointly typical with the received word. If it exists, the decoding is successful. In the opposite case (no codeword or several codewords satisfy the test), the decoding fails. Since the probability that another codeword is jointly typical with the received word is equal to 2−N(I(X;Y)), if we limit the number of codewords to 2N(I(X;Y)), we can guarantee with a high probability that there will be no confusion between the transmitted codeword and the other codewords. We will see later that the optimal decoding is to choose the closest codeword according to the Euclidean distance; however, the decoding scheme proposed by Shannon facilitates the proof of the theorem.

The work of Shannon does not give us a practical solution (i.e. with a reasonable complexity) for the realization of the channel coder and the channel decoder. Since 1948, researchers have proposed practical error correcting codes and decoding algorithms to approach this theoretical limit. It is only in 1993, with the discovery of the so-called turbo codes [BER 93], and then the rediscovery of LDPC codes in 1995 [MAC 99], that it became possible to approach this limit within 1 dB.

1.7.4. Capacity of the binary symmetric channel

We have seen that the binary symmetric channel is described by the probability of error p. Since H(Y|X) = H2(p), the average mutual information I(X; Y) of the binary symmetric channel is given by:

[1.107] images

Since H2(p) is independent of q, in order to maximize I(X;Y), it is necessary to maximize H(Y). We have seen in equation [1.41] that to maximize H(Y), we should have Pr(Y = 0) = Pr(Y = 1) = 1/2 i.e. Pr(X = 0) = Pr(X = 1) = q = 1/2. Then the capacity of the binary symmetric channel is given by:

[1.108] images
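
The following minimal Python sketch evaluates this capacity and, incidentally, answers the question raised in section 1.7.1 for p = 0.1 and a rate of 1000 transmitted bits per second.

import math

def H2(p):
    """Binary entropy function in Shannon."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel in Sh/symbol."""
    return 1 - H2(p)

p = 0.1
C = bsc_capacity(p)
print(f"C = {C:.3f} Sh/symbol")                       # about 0.531 Sh/symbol
print(f"C' = {1000 * C:.0f} Sh/s at 1000 bits per second")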

In Figure 1.19, we plot the curves I(X; Y) = f(q) for a binary symmetric channel with p = 0.0, 0.1, 0.2 and 0.5. We can see on this figure that indeed the average mutual information is maximized when Pr(X = 0) = Pr(X = 1) = q = 1/2.

image

Figure 1.19. Mutual information I(X;Y) versus q and p

In Figure 1.20, we plot the curve C = f(p) for a binary symmetric channel with q = 0.5. As expected, the capacity is maximum when p = 0 and zero when p = 0.5.

image

Figure 1.20. Capacity of the binary symmetric channel versus p

1.7.5. Capacity of erasure channel

For the erasure channel, we have seen that the conditional entropy H(X|Y) is equal to pH2(q) and the average mutual information I(X; Y) is the following:

[1.109] images

The average mutual information is maximum for q = 0.5 i.e. H2(q) = 1. Consequently, the capacity of this channel is:

[1.110] images

1.7.6. Capacity of additive white Gaussian noise channel

To determine the capacity of the AWGN channel, we will first compute the average mutual information I(X; Y).

We have introduced in section 1.6.4 the relation yi = xi + ni between the samples yi at the output of the AWGN channel and the input samples xi and noise samples ni. The samples xi, yi and ni can be seen as realizations of three random variables X, Y and Z. Consequently, the average mutual information can be written as follows:

[1.112] images

since Z is the random variable associated with the noise and is independent of X.

In section 1.3.6, we have computed the differential entropy HD(X) of X, a Gaussian random variable with variance image. From equation [1.55], we have:

[1.113] images

It is possible to prove that the maximum of I(X; Y) is reached when the probability density of X is Gaussian, centered, with variance image. The variance of the noise Z is equal to image.

Let us compute the variance of Y. We have:

[1.114] images

From [1.111], we can derive the capacity of the AWGN channel as:

[1.115] C = ½ log2 (1 + σx2/σn2)

It should be noted that the capacity of the AWGN channel has been computed here under an average power constraint at the transmitter. Other constraints, such as a peak power constraint, would give a different expression of the capacity.

When the noise is not white, the capacity is obtained using the water-filling technique, which generalizes the relation [1.115].

Let us introduce P as the average power of the signal x(t). When considering a sampling frequency of 2B, we have 2BT samples for a duration T. Then the power is derived as follows:

[1.116] images

We finally obtain from [1.115] the classical expression of the capacity of the AWGN channel:

[1.117] C = ½ log2 (1 + P/N)

The capacity C is the capacity per real symbol, i.e. per dimension. Some authors express it in Shannon/dimension.

Figure 1.21 gives the curve of the capacity C of the AWGN channel versus the signal to noise ratio SNR = image. We can observe that for SNR > 5 dB, the capacity C is well approximated by C ≈ ½ log2(SNR).

image

Figure 1.21. Capacity of the additive white Gaussian noise channel

When multiplying [1.117] by the sampling frequency 2B, we finally obtain the expression of the capacity per time unit:

[1.118] images

where N = BN0 is the noise power.
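
A minimal sketch evaluating the capacity per dimension [1.117] and the capacity per time unit [1.118] is given below; the numerical values of P, N0 and B are arbitrary.

import math

def awgn_capacity_per_dimension(snr):
    """C = 1/2 log2(1 + SNR) in Sh/dimension."""
    return 0.5 * math.log2(1 + snr)

def awgn_capacity_per_second(P, N0, B):
    """C' = B log2(1 + P/(N0 B)) in Sh/s."""
    return B * math.log2(1 + P / (N0 * B))

snr_db = 10.0
snr = 10**(snr_db / 10)
print(f"C  = {awgn_capacity_per_dimension(snr):.3f} Sh/dimension at SNR = {snr_db} dB")
print(f"C' = {awgn_capacity_per_second(P=1.0, N0=1e-6, B=1e5):.3e} Sh/s")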

When the signal to noise ratio is high, we can approximate the capacity of the AWGN channel as follows:

images

1.7.7. Graphical representation

It is also possible to geometrically demonstrate that, in order to ensure a transmission without error, the average quantity of information H(U) should not be higher than image.

Let us recall the relation between the transmitted vector x and the received vector y of dimension D.

[1.119] images

n = (n1, n2, … , nD) is the noise vector composed of D independent Gaussian samples of variance image. The probability density of the vector n can be written as:

[1.120] images

For D tending to infinity, we have shown in Appendix A that the noise vector3 is concentrated at the surface of a D-dimensional sphere with radius image.

The transmitted vector x is randomly generated with a variance image per dimension and a Gaussian probability distribution in order to maximize the capacity:

[1.121] images

For the same reason as above, the vector x is concentrated at the surface of a sphere of radius image. Since the power of the received signal is the sum image, the vector associated with the received signal lies on the surface of the D-dimensional sphere with radius image.

We wish to perform a transmission without error of an average quantity of information H(U) = image, where M = 2DH(U) is the number of possible transmitted signals. To meet this goal, all the spheres of noise should be disjoint. Consequently, the volume of the M spheres of noise must be less than the volume of the sphere of radius image. Let us recall that the volume of a D-dimensional sphere with radius r is given by:

[1.122] images

where Γ(.) is the Gamma function4.

As a consequence, we should have:

[1.123] images

This idea is illustrated in Figure 1.22.

image

Figure 1.22. Spheres of noise illustration

The expression can be simplified as follows:

[1.124] images

Then we obtain the following inequality:

[1.125] images

This inequality can be rewritten as:

[1.126] images

Finally, since the capacity C is the highest possible value of the average information quantity H(U), we find again the formula of the capacity of the AWGN channel:

[1.127] images

When the bandwidth is limited, the dimension D is equal to D = 2BT, with B the bandwidth of the system and T the duration of the transmission. The noise power is equal to N = N0B = 2Bimage and the average power of the signal X is equal to P = 2Bimage. Then the capacity C′ per time unit is given by:

[1.128] images

Let Eb be the average energy per information bit and Es be the average energy per symbol. We have:

[1.129] images

where Ts and Tb are the symbol and information bit durations, respectively (assuming an M-ary modulation with M = 2g and a code rate R, we have Ts = gRTb).

We have the following relation between the signal to noise ratio P/N and the ratio Eb/N0:

[1.130] images

η is the spectral efficiency in bits/sec/Hz:

[1.131] images

The spectral efficiency η is maximum when the bandwidth is minimum, i.e. Bmin = 1/Ts. We then have:

[1.132] images

When considering that the binary rate is equal to the channel capacity (Db = C′), the maximum spectral efficiency ηmax can also be written as follows:

[1.133] images

This equation can also be rewritten as:

[1.134] images

The minimum value of Eb/N0 for a communication without error is obtained when the maximum spectral efficiency tends to zero (the bandwidth tends to infinity). We then obtain:

[1.135] images
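
Assuming the standard form Eb/N0 = (2^ηmax − 1)/ηmax implied by relation [1.134], the required Eb/N0 can be tabulated as a function of the maximum spectral efficiency; the limit of −1.59 dB given by [1.135] is recovered when ηmax tends to zero.

import math

def ebn0_min_db(eta):
    """Minimum Eb/N0 (in dB) for error-free transmission at spectral efficiency eta,
    assuming Eb/N0 = (2**eta - 1) / eta."""
    return 10 * math.log10((2**eta - 1) / eta)

for eta in (0.001, 0.5, 1, 2, 4, 8):
    print(f"eta = {eta:5} bit/s/Hz -> Eb/N0 >= {ebn0_min_db(eta):6.2f} dB")
# as eta tends to 0, Eb/N0 tends to ln(2), i.e. -1.59 dB (the Shannon limit)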

Figure 1.23 shows the curve of the maximum spectral efficiency versus Eb/N0. We have also plotted the required ratio Eb/N0, for a bit error rate of 10−5, of systems using BPSK and QPSK digital modulations without coding. These modulations will be studied in Volume 2 of this book [PIS 15]. The performance of these communication systems is at 9.5 dB and 7.75 dB, respectively, from the Shannon limit. Adding a convolutional code (133,171) of rate R = 1/2 to a system using BPSK modulation gives a 5.1 dB gain compared to the system without coding. The concatenation of this convolutional code with a Reed–Solomon code (255,223), as proposed by Forney [FOR 66], allows us to be 2.5 dB from the Shannon limit. The last point corresponds to the performance obtained using the turbo codes of Berrou et al. [BER 93]. We will study these error correcting codes in Chapters 3, 4 and 5.

image

Figure 1.23. Maximum spectral efficiency of an additive white Gaussian noise channel

Figure 1.24 shows the maximum spectral efficiency for different digital modulations. For a modulation with M possible states, the spectral efficiency is always limited by log2(M). These curves can be obtained numerically from the mathematical expression of the average mutual information given by equation [1.111], by taking into account the distribution of y, which is no longer Gaussian but depends on the considered modulation.

image

Figure 1.24. Spectral efficiency versus Eb/N0

1.8. Exercises

1.8.1. Exercise 1: calculation of the entropy for a discrete channel

A source X can generate three different symbols a1, a2 and a3 with the associated probabilities Pr(X = a1) = 0.25, Pr(X = a2) = 0.5 and Pr(X = a3) = 0.25

This source is connected to a discrete channel defined by the following conditional probabilities:

images

pij is the probability of receiving aj when ai is transmitted.

1) Draw this transmission channel graphically.

2) Compute the probabilities Pr(Y = aj) for j ∈ {1,2,3} and the conditional probabilities Pr(X = ai|Y = aj).

3) Compute the entropies H(X) and H(Y), the joint entropy H(X, Y) and the conditional entropy H(Y|X).

4) Check that H(X, Y) = H(Y|X) + H(X) = H(X|Y) + H(Y).

1.8.2. Exercise 2: computation of the mutual information [BAT 97]

We draw four cards randomly from a standard deck of 32 cards (4 suits: heart, spade, diamond, club – 8 values: 7, 8, 9, 10, Jack, Queen, King and Ace).

Let us define the following events:

  • – E1: the event “the hand contains no 7, 8, 9 and 10”;
  • – E2: the event “the hand contains no Jack, Queen and King”;
  • – E3: the event “the hand contains four cards of the same value”.

1) Compute the information h(E1), h(E2) and h(E3).

2) Compute the mutual information i(E1; E2) and i(E1; E3).

1.8.3. Exercise 3: capacity of the additive white Gaussian noise channel

Determine the capacity of the AWGN channel assuming that the signal power is 10 W, the bandwidth is 1 MHz and the noise power spectral density ½N0 is equal to 10−9 W/Hz.

1.8.4. Exercise 4: binary symmetric channel

We consider a binary symmetric channel. The source X generates equiprobable bits p(X = 0) = p(X = 1) = 0.5.

Determine H(Y), H(X|Y) and I(X;Y) as a function of p. Carry out the numerical application for p = 0.11.

1.8.5. Exercise 5: Z channel

Let us define the Z channel as follows:

image

The source X generates equiprobable bits p(X = 0) = p(X = 1) = 0.5.

1) Determine p(Y = yi), p(X = xi, Y = yj) and p(X = xi|Y = yj) as a function of p.

2) Determine H(X|Y) and I(X;Y) as a function of p. Carry out the numerical application for p = 0.5.

1.8.6. Exercise 6: discrete channel

Let us consider the discrete channel with input X and output Y with state space A = {0,1, 2, 3, 4} and conditional probabilities as follows:

images

1) Draw the channel graphically.

2) Compute H(X|Y) and I(X;Y) assuming that the input symbols are equiprobable.

3) Show that it is possible to transmit at the rate 1 bit/symb with a zero error probability.

4) Is it possible to increase the rate by grouping the symbols two by two?

Notes
