1
Introduction to Information Theory

1.1. Introduction

Information theory was developed by Claude Shannon in the 1940s [SHA 48, SHA 59b]. It is a mathematical theory that introduced source coding and channel coding, two fundamental concepts of communication systems. This theory allows us to determine, depending on the source's properties, the theoretical limits of lossless source coding (exact reconstruction of the source messages) or lossy source coding (reconstruction of the messages under a fidelity criterion). It also gives us the rates that are reachable on a given transmission channel thanks to channel coding.

In this chapter, after reviewing the basics of discrete and continuous probabilities in section 1.2, we will start by introducing the fundamental notions of information theory such as the entropy and the average mutual information in section 1.3. We will then focus on the fundamental theorems of information theory for communication systems. We will state the lossless and lossy source coding theorems in sections 1.4 and 1.5, respectively. Then we will determine the theoretical limits of error-free communication over a noisy channel. In section 1.6, we will introduce different channel models and finally, in section 1.7, we will compute the capacity of these channels and state the channel coding theorem.

1.2. Review of probabilities

Probability theory is a mathematical domain that describes and models random processes. In this section, we present a summary of this theory. We recommend for further reading the books of Papoulis and Pillai [PAP 02] and R. Durrett [DUR 10].

Let X be an experiment or an observation that can be repeated several times under similar circumstances. At each repetition, the result of the observation is an event denoted as x, which can take several possible values. The set of these values is denoted as image.

The result X = x of this observation is not known before it takes place. X is consequently called a random variable. It is modeled by the frequency of appearance of all the outcomes.

Two classes of random variables can be distinguished as follows:

  • – discrete random variables, when the set of outcomes is discrete;
  • – continuous random variables, when their distribution functions are continuous.

1.2.1. Discrete random variables

A discrete random variable X takes its values in a discrete set, called its alphabet image. This alphabet may be infinite (for instance, if image = image) or finite with a size n, if image = {x1, x2, … , xn}. Each outcome is associated with a probability of occurrence PX = {p1, p2, … , pn}:

[1.1] images

For discrete random variables, the probability density fX(x) is defined by:

[1.2] images

where δ(u) is the Dirac function.

1.2.1.1. Joint probability

Let X and Y be two discrete random variables whose respective sets of possible outcomes are image = {x1, x2, … , xn} and image = {y1, y2, … , ym}.

Pr(X = xi, Y = yj) is called the joint probability of the events X = xi and Y = yj. Of course, the following property is verified:

[1.3] images

1.2.1.2. Marginal probability

The probability Pr(X = xi) can be computed from the set of joint probabilities Pr(X = xi, Y = yj):

[1.4] images

1.2.1.3. Conditional probability

[1.5] images

Similarly, we can write as:

[1.6] images

As a consequence, the following relation stands:

[1.7] Pr(X = xi, Y = yj) = Pr(X = xi|Y = yj) Pr(Y = yj) = Pr(Y = yj|X = xi) Pr(X = xi)

which can be further developed to:

Pr(X = xi|Y = yj) = Pr(Y = yj|X = xi) Pr(X = xi) / Pr(Y = yj)

This relation is called the Bayes law. From this equation, we can see that Pr(X = xi|Y = yj) is the a posteriori probability, whereas Pr(X = xi) is the a priori probability.

1.2.1.4. Independence

If two discrete random variables X and Y are independent, then

[1.8] images

and

[1.9] images
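
These relations can be illustrated with a short Python sketch. The joint probability table used below is an arbitrary numerical example (it does not come from this chapter); the marginal and conditional probabilities are computed from it and the Bayes law is checked numerically.

import numpy as np

# arbitrary joint probability table Pr(X = xi, Y = yj), for illustration only
Pxy = np.array([[0.10, 0.30],
                [0.20, 0.40]])

Px = Pxy.sum(axis=1)              # marginal probabilities Pr(X = xi)
Py = Pxy.sum(axis=0)              # marginal probabilities Pr(Y = yj)

P_x_given_y = Pxy / Py            # Pr(X = xi | Y = yj), column-wise division
P_y_given_x = Pxy / Px[:, None]   # Pr(Y = yj | X = xi), row-wise division

# Bayes law: Pr(X = xi | Y = yj) = Pr(Y = yj | X = xi) Pr(X = xi) / Pr(Y = yj)
assert np.allclose(P_x_given_y, P_y_given_x * Px[:, None] / Py)
print(P_x_given_y)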

1.2.2. Continuous random variables

The random variable X is continuous if its cumulative distribution function FX(x) is continuous. FX(x) is related to the probability density in the following way:

The mean of the random variable X is defined as:

[1.11] images

Its Nth moment is equal to:

[1.12] images

1.2.3. Jensen’s inequality

Let us first recall that a function f(x) is convex if, for any x, y and 0 < λ < 1, the following inequality stands:

[1.13] images

Let f be a convex function, (x1, … , xn) a real n-tuple belonging to the domain of definition of f and (p1, … , pn) a positive real n-tuple such that image. Then:

[1.14] images

Jensen’s inequality is obtained by interpreting the pi terms as probabilities: if f(x) is convex, then for any real discrete random variable X:

[1.15] images
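
Jensen's inequality can be checked numerically on a small example. The sketch below uses an arbitrary discrete distribution and the convex function f(x) = x2; it simply compares f(E[X]) with E[f(X)].

import numpy as np

p = np.array([0.2, 0.5, 0.3])    # arbitrary probabilities summing to 1
x = np.array([-1.0, 0.5, 2.0])   # values taken by the discrete random variable X
f = np.square                    # f(x) = x**2 is a convex function

lhs = f(np.dot(p, x))            # f(E[X])
rhs = np.dot(p, f(x))            # E[f(X)]
print(lhs, rhs, lhs <= rhs)      # Jensen's inequality: f(E[X]) <= E[f(X)]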

1.2.4. Random signals

The signals used in digital communications depend on time t.

Signal x(t) is deterministic if the function t images x(t) is perfectly known. If, however, the values taken by x(t) are unknown, the signal follows a random process. At time t, the random variable is denoted by X(t), and an outcome of this random variable is denoted as x(t). The set of all signal values x(t), for any t in the definition domain, is a given outcome of the random process X.

A random process is defined by its probability density and its statistical moments. The probability density is equal to:

[1.16] images

The random process is stationary if its probability density is independent of time: fX(x, t) = fX(x) ∀t. As a consequence, all of its statistical properties are independent of t. Its probability density can thus be obtained from equation [1.10] in the following way:

[1.17] images

mx(t), the mean of the random variable x(t) from the random process X, is defined as:

[1.18] images

The autocorrelation function Rxx(τ) of a random variable is:

[1.19] images

The random process X is second-order stationary or wide-sense stationary if, for any random signal x(t):

  1) its mean mx(t) is independent of t;
  2) its autocorrelation function verifies Rxx(t1, t2) = Rxx(t1 + t, t2 + t) ∀t.

Then it can simply be denoted as:

[1.20] images

In that case, the power spectrum density γxx(f) is obtained by applying the Fourier transform on the autocorrelation function:

[1.21] images

Reciprocally, the autocorrelation function Rxx(τ) is determined from the power spectrum density as follows:

[1.22] images

Generally, the mean and the autocorrelation function of a stationary random process are estimated from a set of outcomes of the signal X(t). When the time average tends to the random process mean, the random process is said to be ergodic: only one outcome of the random process X is then required to evaluate its mean and autocorrelation function. Most random processes considered in digital communications are second-order stationary and ergodic.

For discrete signals (for instance, signals that have been sampled from a continuous random signal x(t) at frequency image) xn = x(nTe), the autocorrelation function Rxx(τ) is only defined at discrete times τ = nTe, and the power spectrum density becomes:

[1.23] images

1.3. Entropy and mutual information

1.3.1. Comments on the concept of information

In everyday language, the notion of information associated with a message exchanged between a source and a receiver is linked, for instance, to the veracity of the message, to the a priori knowledge of the message by the receiver or to the receiver's understanding of it (a problem of language can occur, etc.).

All these considerations are a matter of semantics and are not taken into account by information theory. In information theory, we will only retain a part of the general concept of information: the quantitative measure of information is a measure of the uncertainty associated with an event. This notion of information is fundamental for the study of communication systems.

1.3.2. A logarithmic measure of information

A measure of the information associated with the event X = xi denoted as h(xi) should satisfy the following properties [SHA 48]:

  • h(xi) should be continuous in Pr(X = xi), for Pr(X = xi) between 0 and 1;
  • h(xi) = ∞ if Pr(X = xi) = 0;
  • h(xi) = 0 if Pr(X = xi) = 1: a certain event brings no information;
  • h(xi) > h(yj) if Pr(Y = yj) > Pr(X = xi): the more uncertain an event is, the more information its realization brings;
  • h(xi) + h(yj) = h(xi, yj) if the events Y = yj and X = xi are independent: the realization of two independent events brings a quantity of information equal to the sum of the quantities of information h(xi) and h(yj) of these two events.

In order to satisfy the above properties, the quantity of information h(xi) associated with the realization of the event X = xi should be equal to the logarithm of the inverse of the probability Pr(X = xi). With this definition, a high probability event will carry a smaller quantity of information than a low probability event. When the binary logarithm is used1, the unit of h(xi) is the Shannon (Sh). When the natural logarithm is used, the unit is the natural unit (Nat). In this book, we will consider binary logarithms. Consequently, we have the following relation:

[1.24] images

EXAMPLE 1.1.– Let a discrete source generate bits (0 or 1) with Pr(X = 0) = Pr(X = 1) = 1/2. The quantity of information associated with the realization of the event X = 0 or X = 1 is equal to:

[1.25] images

If the source generates a sequence composed of n independent bits, we have 2n different sequences. Each of these sequences can appear with the probability 2−n. The quantity of information associated with the realization of a specific sequence is equal to:

[1.26] images

Let us consider the realization of two events X = xi and Y = xj. The associated quantity of information is:

[1.27] images

where Pr(X = xi, Y = yj) is the joint probability of the two events.

The quantity of information associated with the realization of the event X = xi conditionally to the event Y = yj is as follows:

[1.28] images

From relation [1.7], we deduce:

[1.29] images

or also:

EXAMPLE 1.2.– A card is randomly drawn from a standard deck of 32 cards (4 suits: heart, spade, diamond, club – 8 values: 7, 8, 9, 10, Jack, Queen, King and Ace). Let x be the event “the drawn card is the club ace” and y be the event “the drawn card is a club”. We can compute h(x), h(y) and h(x|y). Since we have:

[1.31] images

we obtain:

[1.32] images
[1.33] images
[1.34] images

1.3.3. Mutual information

We define the mutual information as the quantity of information that the realization of the event Y = yj gives about the event X = xi. In other words, the mutual information is the difference between the quantity of information associated with the realization of the event X = xi and the quantity of information associated with the realization of the event X = xi conditionally to the event Y = yj. This quantity of information is given by:

[1.35] images

If the two events are independent, we have Pr(X = xi|Y = yj) = Pr(X = xi) and consequently i(xi; yj) = 0. On the contrary, if the event X = xi is equivalent to the event Y = yj, then we have Pr(X = xi|Y = yj) = 1 and i(xi; yj) = h(xi).

Since we have the following relation:

images

The quantity of information that the realization of the event Y = yj gives about the event X = xi is therefore the same as the quantity of information that the realization of the event X = xi gives about the event Y = yj:

[1.36] images

We also have the following relations:

images

Unlike h(xi), the mutual information i(xi; yj) can be negative.

EXAMPLE 1.3.– (cont.) Compute i(x; y):

[1.37] images

The quantity of information that the realization of the event “the drawn card is a club” gives about the event “the drawn card is a club ace” is equal to 2 Sh.
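
The computations of examples 1.2 and 1.3 can be verified with a few lines of Python:

import math

# example 1.2: a card is randomly drawn from a standard deck of 32 cards
p_x = 1 / 32           # Pr("the drawn card is the club ace")
p_y = 8 / 32           # Pr("the drawn card is a club")
p_x_given_y = 1 / 8    # Pr("club ace" given "club")

h_x = -math.log2(p_x)                  # 5 Sh
h_y = -math.log2(p_y)                  # 2 Sh
h_x_given_y = -math.log2(p_x_given_y)  # 3 Sh
i_xy = h_x - h_x_given_y               # mutual information of example 1.3: 2 Sh
print(h_x, h_y, h_x_given_y, i_xy)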

We will see in section 1.7 that the mutual information is important for communications when we associate X with the input of the transmission channel and Y with its output.

1.3.4. Entropy and average mutual information

After having considered individual events, we will now compute the entropy of a source described by the random variable X with sample space image = {x1, x2, … , xn} and associated probabilities image = {p1, p2, … , pn}, where n is the size of the sample space. The average quantity of information, or entropy, of the source is the mean of the quantity of information associated with each possible realization of the event X = xi:

[1.38] images

H(X) is a measure of the uncertainty on X.

The entropy H(X) has the following properties:

[1.39] images
[1.40] images

The entropy is maximum when all the probabilities pi are the same.

EXAMPLE 1.4.– Let a source have two states x0 and x1 with probabilities p0 = p and p1 = 1 − p. The entropy of this source is the following:

[1.42] images

The function H2(p) defined above will often be used in this chapter to simplify the notations. It is plotted in Figure 1.1.

image

Figure 1.1. Entropy of a binary source

The entropy is maximum (1 Sh/bit) when p = 0.5.
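
The binary entropy function H2(p) will be used repeatedly in this chapter; the following minimal Python sketch evaluates it for a few values of p.

import math

def H2(p):
    """Entropy of a binary source with Pr(x0) = p, in Shannon."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.25, 0.5, 0.9):
    print(f"H2({p}) = {H2(p):.4f} Sh")   # the maximum value 1 Sh is reached for p = 0.5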

We define two random variables X and Y with state spaces image = {x1, x2, … , xn} and image = {y1, y2, … , ym}, respectively. The joint entropy H(X, Y) is defined as follows:

[1.43] images

If the two random variables X and Y are independent, the joint entropy is equal to the sum of the entropies H(X) and H(Y).

We can also compute the conditional entropy H(X|Y), corresponding to the average quantity of information of X given the observation of Y, from h(xi|yj):

[1.44] images

The relations [1.7] and [1.30] enable us to write the joint entropy as a function of the entropy and the conditional entropy:

[1.45] images

The uncertainty on X and Y is equal to the sum of the uncertainty on X and the uncertainty on Y given X.

The mutual information associated with the realization of an event can be extended to the random variable X and Y. The average mutual information between the random variables X and Y is given by:

[1.46] images

Consequently, we have also the following relations:

[1.47] images

The average mutual information I(X; Y) measures the average quantity of information on X (or reduction of average uncertainty) due to the observation of Y. Figure 1.2 shows graphically the relations between the different entropies and the average mutual information.

image

Figure 1.2. Relations between entropies and average mutual information

While i(xi; yj) can be negative, we always have I(X; Y) ≥ 0.
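
These relations between the entropies and the average mutual information can be checked numerically. The joint distribution used in the sketch below is an arbitrary example, not taken from this chapter.

import numpy as np

# arbitrary joint distribution Pr(X = xi, Y = yj), for illustration only
Pxy = np.array([[0.25, 0.25],
                [0.05, 0.45]])

Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)

Hx = -np.sum(Px * np.log2(Px))       # entropy H(X)
Hy = -np.sum(Py * np.log2(Py))       # entropy H(Y)
Hxy = -np.sum(Pxy * np.log2(Pxy))    # joint entropy H(X, Y)
Hx_given_y = Hxy - Hy                # conditional entropy H(X|Y)
Ixy = Hx - Hx_given_y                # average mutual information I(X; Y)

print(f"H(X) = {Hx:.3f}  H(Y) = {Hy:.3f}  H(X,Y) = {Hxy:.3f}")
print(f"H(X|Y) = {Hx_given_y:.3f}  I(X;Y) = {Ixy:.3f} Sh")   # I(X;Y) >= 0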

1.3.5. Kullback–Leibler divergence

Let X and Y be two random variables with state spaces image = {x1, x2, … , xn} and image = {y1, y2, … , yn} and with associated probabilities image = {p1, p2, … , pn} and image = {q1, q2, … , qn}, respectively.

We define the relative entropy or Kullback–Leibler divergence as follows:

[1.48] images

While the Kullback–Leibler divergence is often considered as a measure of distance between two probability distributions, it is not strictly speaking a distance. For example, it is generally not symmetric: image.

We can prove that image and that image if and only if pi = qi ∀i = 1, … , n.

For continuous random variables P and Q with probability densities p(x) and q(x), respectively, the Kullback–Leibler divergence is defined by the integral:

[1.49] images
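
As an illustration of the discrete definition [1.48], the sketch below computes the Kullback–Leibler divergence for two arbitrary distributions and shows that it is not symmetric.

import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in Shannon (all qi assumed non-zero)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.4, 0.4, 0.2]     # arbitrary distributions, for illustration only
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values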

1.3.6. Differential entropy

The differential entropy is a generalization of the entropy to the case of continuous random variables. Let a continuous random variable X be defined by the probability density p(x). The differential entropy HD(X) of X is equal to:

[1.50] images

The differential entropy characterizes the quantity of information of the continuous random variable X.

Let us compute the differential entropy HD(X) of a random variable X with a centered Gaussian probability distribution of variance image. We have:

[1.51] images
[1.52] images

Since log2 image does not depend on x and

images

we can extract the first term from the integral. We then obtain:

[1.53] images

By definition,

[1.54] images

Finally, we have:

[1.55] HD(X) = ½ log2 (2πeσx2)
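
The differential entropy of a Gaussian random variable can be checked numerically by a Monte Carlo estimation of the mean of −log2 p(X); the sketch below uses an arbitrary seed and sample size.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, 1_000_000)

# differential entropy estimated as the empirical mean of -log2 p(x)
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
hd_monte_carlo = np.mean(-np.log2(p))
hd_theory = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(hd_monte_carlo, hd_theory)   # both close to 2.05 Sh for sigma = 1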

1.3.7. Typical and jointly typical sequences

1.3.7.1. Typical sequences

The set of all sequences of a random variable can be divided into two disjoint sets: the set of typical sequences and the set of non-typical sequences. Typical sequences are an important tool for the proof of the fundamental theorems of information theory.

Let X = (X1, X2, … , XN) be a sequence of independent and identically distributed (i.i.d.) random variables with state space image, and let x be a realization of X.

Since the variables Xi are independent, the terms log2(Pr(Xi)) are also independent and we have the following relation:

[1.56] − (1/N) log2 Pr(x1, x2, … , xN) = − (1/N) Σ_{i=1}^{N} log2 Pr(xi)

This relation converges to H(X) when N is large enough. This property is called the asymptotic equipartition principle (AEP). Among the set of all the possible sequences image, we define the typical sequences as those whose probability of occurrence is close to 2−N(H(X)). The set of typical sequences image is described as follows:

From the relations [1.56] and [1.57], we can easily prove the following properties:

  • 1) The probability that a sequence x is typical converges to 1 when N → ∞.
  • 2) The number of typical sequences |image| is upper bounded by:
[1.58] images

To summarize, for a sequence of random variables X with N large enough, the realizations x belong almost surely to the set image composed of around 2N(H(X)) sequences, each having an occurrence probability close to 2−N(H(X)).

In order to illustrate the typical sequences, let us consider a simple example. Let X be a binary memoryless random variable with occurrence probabilities p1 = image and p2 = image. The entropy of X is equal to 0.9183 Sh/symb. For a vector of length N, we have image sequences2 composed of n symbols x1 and N − n symbols x2, each with occurrence probability image. In Figure 1.3, we have plotted the probability distribution of the sequences for N = 100 (continuous line) and the distribution of the typical sequence set (dashed line). We can see that, for N = 100, the probability distribution is close enough to that of the set composed of about 4.4 × 10^27 typical sequences.

image

Figure 1.3. Density probability and typical sequences set
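
The size and the probability of the typical set of this example can be evaluated exactly by enumerating the sequence compositions. The sketch below assumes the probabilities p1 = 1/3 and p2 = 2/3 (consistent with the entropy of 0.9183 Sh/symb quoted above) and an arbitrary tolerance ε = 0.05.

import math

p1, p2 = 1/3, 2/3       # assumed probabilities, consistent with H(X) = 0.9183 Sh/symb
N, eps = 100, 0.05      # sequence length and (arbitrary) tolerance
H = -(p1 * math.log2(p1) + p2 * math.log2(p2))

count, prob_mass = 0, 0.0
for n in range(N + 1):                 # n = number of symbols x1 in the sequence
    p_seq = p1**n * p2**(N - n)        # probability of one such sequence
    if 2**(-N * (H + eps)) <= p_seq <= 2**(-N * (H - eps)):
        count += math.comb(N, n)
        prob_mass += math.comb(N, n) * p_seq

print(f"H(X) = {H:.4f} Sh/symb")
print(f"number of typical sequences ~ {count:.2e}  (2**(N*H) ~ {2**(N * H):.2e})")
print(f"probability of the typical set = {prob_mass:.3f}")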

1.3.7.2. Jointly typical sequences

The concept of typical sequences can be extended to the case of sequences of couples of random variables. The set of jointly typical sequences image is described as follows:

Let (x, y) be a sequence of couples of random variables, i.i.d. according to the following joint probability:

[1.60] images

We can easily prove the following properties:

  • – the probability that a sequence of couples (x, y) is jointly typical tends to 1 when N → ∞;
  • – the number of sequences of couples that are jointly typical |image| is upper bounded by:
images
  • – let x′ and y′ be two independent realizations of X and Y, respectively; the probability that (x′, y′) belongs to the set of jointly typical sequences of couples is:
[1.61] images

The first two properties are obtained from the relations [1.56] and [1.59]. The proof of the third property is the following:

[1.62] images

Since we have about 2N(H(X)) typical sequences x, 2N(H(Y)) typical sequences y and only 2N(H(X,Y)) jointly typical sequences of couples, the probability of choosing a jointly typical sequence of couples among the set of couples of independently selected typical sequences is approximately equal to:

[1.63] images

1.4. Lossless source coding theorems

1.4.1. Introduction

Source coding is divided into two classes: lossless source coding and lossy source coding. We will first study lossless source coding, also called entropy coding. The aim of lossless source coding is to describe the digital sequence delivered by the source with the shortest possible sequence of symbols, while allowing the source decoder to reconstruct it exactly.

As in most of the classical books on digital communications, in this chapter, we will study the techniques of source coding by ignoring the effect of the transmission channel and using Shannon’s paradigm. Consequently, we assume that the output of the source coder is directly connected to the input of the source decoder as shown in Figure 1.4.

1.4.2. Entropy and source redundancy

We consider a discrete and stationary source whose output symbols are Q-ary (the size of the alphabet is equal to Q). The output of this source is described by the random variable X. Consequently, the entropy H(X) is the average quantity of information per symbol at the output of the source.

image

Figure 1.4. Block diagram of the studied chain for source coding

A discrete source is memoryless if the output symbols are independent. Otherwise, we will say that the source is with memory.

If the source is memoryless, H(X) can be computed as previously:

[1.64] images

The entropy H(X) is maximum if the symbols are equiprobable. For Q-ary symbols, the maximum entropy is HMAX = log2 Q.

If the source is with memory, then the entropy per symbol H(X) can be computed by:

[1.65] images

where HJ(X) is the entropy per group of J symbols.

The redundancy of the source Rred characterizes the difference between the quantity of information of the source and the quantity of information of a source with equiprobable symbols. We have:

[1.66] images

The range of Rred is between 0 (the source symbols are independent and equiprobable) and 1 (the entropy of this source is zero).

1.4.3. Fundamental theorem of source coding

The fundamental theorem of source coding stated by Shannon [SHA 48] is the following:

THEOREM 1.1.– Let ε > 0. For any stationary source with entropy per symbol H(X), there is a binary source coding method that associates with each message x of length N a binary word of average length NRmoy such that:

[1.67] images

Consequently, we can associate, on average, NH(X) bits with each message x.

PROOF.– We have seen previously that the typical sequences allow us to divide the set of sequences of random variables into two disjoint sets. The total set of the |image|N sequences x can be divided into two sets: the set image of typical sequences with image and the set image of the non-typical sequences.

In order to distinguish these two sets, we can add a prefix “0” for the typical sequences and “1” for the non-typical sequences. The typical sequences can thus be encoded using N(H(X) + ε) + 2 bits (the additional bit takes into account the fact that N(H(X) + ε) is not necessarily an integer). The non-typical sequences can be encoded using at most N log2 |image| + 2 bits.

We can bound the rate or average bits per realization Rmoy as follows:

[1.68] images
images

image can be made as small as we want by choosing a large enough value of N.

1.4.4. Lossless source coding

1.4.4.1. Introduction

From a general point of view, the source coder associates with each message delivered by the source a word composed of q-ary symbols while trying to minimize the average number of these symbols. A message, depending on the context, will be a Q-ary symbol from the source or a set of J Q-ary symbols.

image

Figure 1.5. Source coding

We will restrict ourselves to the case where the symbols at the output of the source coder are bits (q = 2). The generalization to other alphabet sizes is possible without any particular difficulties.

The source coding should satisfy the two following criteria:

  • – unique coding: each message should be coded with a different word;
  • – unique decoding: each word should be distinguishable without ambiguity. This criterion can be satisfied using:
    • - coding with fixed-length words,
    • - coding using a distinguishable separation symbol (the Morse system for example),
    • - coding with variable-length words.

Since source coding with a separation symbol is a suboptimal and obsolete solution, in the rest of this chapter we will focus only on fixed and variable length source coding.

A source code is instantaneous if no word is the beginning of another one. This condition, called the prefix condition, facilitates the decoding.

1.4.4.2. Variable length coding

EXAMPLE 1.5.– Let a discrete source generate four different messages a1, a2, a3 and a4 with respective probabilities Pr(a1) = image, Pr(a2) = image and Pr(a3) = Pr(a4) = image. One word is associated with each message as described in Table 1.1.

Table 1.1. Variable length code of example 1

Message Word
a1 1
a2 00
a3 01
a4 10

We can check that this source code satisfies the criterion of unique coding but does not allow a unique decoding. Indeed, it is for example not possible to decode the message a1, a2, a1 coded by the sequence 1001: at the receiver, we have no way to decide whether the transmitted message was a1, a2, a1 or a4, a3. Consequently, this code is unusable.

EXAMPLE 1.6.– One word is associated with each message as described in Table 1.2.

We can check that this source code satisfies both unique coding and unique decoding. However, this code is not instantaneous. Indeed, the word associated with a3 is the beginning of the word associated with a4. After the sequence 11, it is necessary to determine the parity of the number of following zeros in order to be able to decode the rest of the transmitted message. Consequently, the decoding is more complex.

Table 1.2. Variable length code of example 2

Message Word
a1 00
a2 10
a3 11
a4 110

EXAMPLE 1.7.– One word is associated with each message as described in Table 1.3.

Table 1.3. Variable length code of example 3

Message Word
a1 0
a2 10
a3 110
a4 111

This source code satisfies both unique coding and unique decoding. It is also instantaneous, as we can check in Figure 1.6, which shows the tree associated with this source code.

image

Figure 1.6. Tree associated with the source code of example 3

1.4.4.3. Kraft inequality

THEOREM 1.2.– An instantaneous code composed of Q binary words of respective lengths n1, n2, … , nQ, with n1 ≤ n2 ≤ … ≤ nQ, must satisfy the following inequality:

[1.69] images

PROOF.– An instantaneous code can be graphically represented using a complete binary tree of depth nQ. Each leaf of the final tree is associated with one of the source messages. Each word is the sequence of labels along the path going from the root of the tree to the corresponding leaf.

image

Figure 1.7. Kraft inequality

A complete tree is composed of image leaves. Let us choose a node of depth n1 as the leaf associated with the first word C1; this choice eliminates image leaves. Among the remaining nodes, we choose a node of depth n2 as the leaf associated with the second word C2. This choice eliminates image leaves. We continue this procedure until the last word. The necessary condition to guarantee an instantaneous decoding is the following:

images

By dividing both sides by image, we obtain the Kraft inequality.
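
The Kraft inequality and the prefix condition can be checked directly on the codes of Tables 1.1, 1.2 and 1.3 with the following minimal Python sketch:

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality for binary codewords."""
    return sum(2**(-n) for n in lengths)

codes = {
    "Table 1.1": ["1", "00", "01", "10"],
    "Table 1.2": ["00", "10", "11", "110"],
    "Table 1.3": ["0", "10", "110", "111"],
}
for name, words in codes.items():
    s = kraft_sum(len(w) for w in words)
    prefix = all(not w2.startswith(w1)
                 for w1 in words for w2 in words if w1 != w2)
    print(f"{name}: Kraft sum = {s:.3f}, instantaneous (prefix) code: {prefix}")

The code of Table 1.1 violates the inequality, while the code of Table 1.2 satisfies it without being instantaneous: the Kraft inequality is a condition on the codeword lengths, not on the codewords themselves.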

1.4.4.4. Fundamental theorem of source coding

We first consider a memoryless source with entropy per symbol H(X). We will prove that, for this source, it is possible to build an instantaneous code for which the average length of the words Rmoy satisfies the following inequality:

[1.70] images

with

[1.71] images

PROOF.– We choose ni, the length of the word associated with the ith message, as follows:

[1.72] images

⌈x⌉ is the smallest integer not less than x.

Let us verify that such a source code is instantaneous or, equivalently, that it satisfies the Kraft inequality:

[1.73] images

since ⌈− log2 pi⌉ ≥ − log2 pi.

Consequently, we have:

[1.74] images
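
The construction [1.72] can be illustrated numerically; the source distribution used below is an arbitrary example, not taken from the book.

import math

p = [0.4, 0.2, 0.2, 0.1, 0.1]                 # arbitrary source probabilities
H = -sum(pi * math.log2(pi) for pi in p)      # entropy per symbol
n = [math.ceil(-math.log2(pi)) for pi in p]   # word lengths given by [1.72]

Rmoy = sum(pi * ni for pi, ni in zip(p, n))   # average word length
kraft = sum(2**(-ni) for ni in n)             # must be <= 1 (Kraft inequality)

print(f"H(X) = {H:.3f} Sh, Rmoy = {Rmoy:.3f} bits, Kraft sum = {kraft:.3f}")
# one can check that H(X) <= Rmoy < H(X) + 1, as stated by [1.70]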

The fundamental theorem of source coding can be expressed as follows:

THEOREM 1.3.– For any stationary source with entropy per symbol H(X), there is a source coder that can encode the messages into binary words with an average length Rmoy as close as desired to H(X):

[1.75] images

We consider again a memoryless source with entropy per symbol H(X). By grouping the symbols of this source into messages composed of J symbols, we obtain a new source that can also be encoded using an instantaneous code. The average length of the words of this new code RJmoy satisfies the following inequality:

[1.76] images

Dividing the terms by J, we obtain:

[1.77] images

Rmoy is the average number of bits associated with one symbol of the source image. By increasing J, Rmoy can asymptotically reach H(X):

[1.78] images

This result can be directly generalized to the case of sources with memory.

1.4.4.5. Entropy rate

We consider a stationary and discrete source X with entropy per symbol H(X) Sh/symbol generating DS symbols per second. We define the entropy rate DI as follows:

[1.79] images

The binary data rate at the output of the coder D′B is the product of the symbol rate DS by the average number of bits per symbol Rmoy:

[1.80] images

At the output of the binary source encoder, we define H′(X), the entropy per bit, as follows:

[1.81] images

The entropy rate D′I at the output of the source coder is then given by:

[1.82] images

As expected, the entropy rate is not changed by the source coding. From the theorem of source coding, we have:

[1.83] images

Multiplying both sides by DS, we obtain:

[1.84] images

Consequently, DI, the entropy rate of the source, is the lower bound on the binary data rate obtained after source coding. In case of equality, one bit will carry the quantity of information of one Shannon. If the redundancy of the output sequence is not zero, then one bit will carry less than one Shannon.

image

Figure 1.8. Entropy rate

1.5. Theorem for lossy source coding

1.5.1. Introduction

We will now study lossy source coding. When the source generates a continuous real-valued signal, we obtain real samples after sampling. Compared to the previous section, the size of the state space is infinite and it is not possible to describe the source exactly using a finite number of bits. The aim of lossy source coding is to minimize a fidelity criterion, such as the mean square error or a subjective quality measure, under a binary rate constraint, or to minimize the binary rate under a given fidelity constraint. While the mean square error is not always the best criterion, it is still the most widely used and we will consider it in this section. In the next chapter, we will study different practical solutions to implement lossy source coding such as scalar quantization with or without prediction, vector quantization, transform coding, etc.

In this section, we will introduce the concepts of distortion and rate-distortion function. Then we will give the fundamental theorem of lossy source coding.

1.5.2. Definitions

As previously mentioned, we will assume that the output of the source coder is directly connected to the input of the source decoder. The lossy source coder associates a binary word of length R bits with each sequence x ∈ χ and the source decoder associates a sequence image with each of the 2R possible binary words. The sequence image is called the quantized or estimated sequence.

image

Figure 1.9. Block diagram of the coder-decoder

DEFINITION 1.1.– The distortion per dimension between the sequences x and image of dimension N is defined by:

[1.85] images

DEFINITION 1.2.– The average distortion per dimension of the coder-decoder is defined by:

[1.86] images

where f(x) is the probability density of x.

DEFINITION 1.3.– A pair (R, D) is said to be achievable if there is a coder-decoder such that:

[1.87] images

In his famous 1959 paper [SHA 59b], Shannon introduced the rate-distortion function, which gives the minimum theoretical rate required under a given distortion constraint.

DEFINITION 1.4.– For a given memoryless source, the rate distortion function R(D) is defined as follows:

[1.88] images

1.5.3. Lossy source coding theorem

THEOREM 1.4.– The number of bits per dimension R required to describe a sequence of real samples with a given average distortion D must be higher than or equal to R(D).

[1.89] images

The proof of this theorem [COV 91] is based on the jointly typical sequences.

If the source is Gaussian, it can be proved that the following relation is always true:

[1.90] R(D) = ½ log2 (σx2/D) if 0 ≤ D ≤ σx2, and R(D) = 0 otherwise

where σx2 is the source variance. By introducing the distortion-rate function D(R), the relation [1.90] can also be written as:

[1.91] D(R) = σx2 2−2R

As for the lossless source coding theorem, this theorem only gives a theoretical limit. For example, a simple uniform quantization does not generally allow us to reach this limit. Indeed, in the above theorem, it is assumed that the dimension N of the vector x to be coded tends to infinity. The rate-distortion function can be generalized to the case of sources with memory.

In Figure 1.10, we have plotted the distortion-rate given by the relation [1.91] for image = 1.

image

Figure 1.10. Shannon distortion-rate for a Gaussian source of unitary variance

In Figure 1.11, we have shown the optimal values yi obtained using a uniform quantization (illustrated by “*”) and a non-uniform quantization (illustrated by “o”) for L = 8 and a Gaussian source.

image

Figure 1.11. Uniform and non-uniform quantization L = 8 for a Gaussian source with variance σx = 1

In this simple case (R = 3 bits/sample and σx = 1), the average distortions of the uniform and non-uniform quantizations are −14.27 dB and −14.62 dB, respectively. The Shannon distortion-rate, in that case, is 10 log10(2−6) = −18.06 dB. Since the probabilities associated with each interval are not the same, we can apply a lossless source coding after the quantization in order to reduce the binary data rate. This lossless source coding brings an additional gain of 2 dB. However, in order to reach the Shannon theoretical limit, we will have to perform vector quantization, which means that we will have to combine several samples together. These different techniques will be developed in Chapter 2.
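
Assuming the Gaussian distortion-rate function D(R) = σx2 2−2R of relation [1.91], the theoretical limit quoted above can be recomputed with a few lines of Python:

import math

sigma2 = 1.0                          # source variance (sigma_x = 1 as in Figure 1.10)
for R in range(1, 6):                 # rate in bits per sample
    D = sigma2 * 2**(-2 * R)          # Gaussian distortion-rate function
    print(f"R = {R} bit/sample : D(R) = {D:.5f} = {10 * math.log10(D):.2f} dB")
# for R = 3 bits/sample, D(R) = 2**(-6), i.e. -18.06 dB as indicated above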

1.6. Transmission channel models

1.6.1. Binary symmetric channel

The binary symmetric channel is the simplest transmission channel: its input and output are binary and it is defined using a single parameter.

This channel can be described using a graph as shown in Figure 1.12 on which we list the possible values of the input and output. The labels on the branches are the conditional probabilities Pr(Y = y|X = x).

image

Figure 1.12. Binary symmetric channel

This channel is characterized by the four following conditional probabilities:

[1.92] images

p is often called the inversion probability. We also define the probabilities associated with the input bits, also called a priori probabilities: Pr(X = 0) = q and Pr(X = 1) = 1 − q.

The error probability for this channel can be easily calculated as follows:

[1.93] images

The binary symmetric channel is memoryless: let x and y be respectively the input and output sequences composed of n bits: x = [x0, x1, … , xn−1], and y = [y0, y1, … , yn−1]. Since the channel is memoryless, we have the following relation:

images

The joint conditional probability is the product of the n conditional probabilities Pr(Y = yi|X = xi).

Using the Bayes rule, we can compute the probabilities Pr(X, Y), Pr(Y) and Pr(X|Y) from Pr(Y|X) and Pr(X) as given in Table 1.4.

Table 1.4. Probabilities Pr(X, Y), Pr(Y) and Pr(X|Y) for the binary symmetric channel

Pr(X, Y) Y = 0 Y = 1
X = 0 q(1 – p) qp
X = 1 (1 – q)p (1 – q)(1 – p)
Pr(Y)  
Y = 0 q(1 – p)+(1 – q)p
Y = 1 qp + (1 – q)(1 – p)
Pr(X|Y) Y = 0 Y = 1
X = 0 images images
X = 1 images images

From Pr(Y|X), we can compute the conditional entropy H(Y|X). We have:

[1.94] images

If q = 0.5, we also have H(X|Y) = H2(p).

In Figure 1.13, we have plotted the curves H(X|Y) = f(q) for a binary symmetric channel with p =0.1, 0.2 and 0.5.

image

Figure 1.13. Conditional entropy H(X|Y) versus q and p

1.6.2. Discrete channels without memory

The binary symmetric channel is a specific case of the class of discrete channels without memory. The input symbols of a discrete channel are M-ary symbols and the output symbols are L-ary symbols. The channel is described using a set of LM conditional probabilities Pr(Y = yj|X = xi) = pij, together with the a priori probabilities Pr(X = xi) = qi. As in the binary case, we can easily show that the conditional probabilities satisfy the following relation:

[1.95] images

These channels are described using a graph as shown in Figure 1.14.

image

Figure 1.14. Discrete channel without memory

The symbol error probability Pe in this channel is given by:

[1.96] images

1.6.3. Binary erasure channel

The binary erasure channel, introduced by Elias in 1955 [ELI 55], is a channel in which some bits can be lost or erased. Compared to the binary symmetric channel, we add an event Y = ε corresponding to the case where a transmitted bit has been erased. This channel is characterized by the following conditional probabilities:

[1.97] images

p is the erasure probability. This transmission channel model is often used to model the packet loss in high level communication protocols. This channel is described by the diagram in Figure 1.15.

image

Figure 1.15. Erasure channel

Using the Bayes rule, we can compute Pr(X, Y), Pr(Y) and Pr(X|Y) from Pr(Y|X) and Pr(X) as shown in Table 1.5.

The conditional entropy H(X|Y) is equal to pH2(q).

1.6.4. Additive white Gaussian noise channel

The additive white Gaussian noise (AWGN) channel is the most important channel model. It allows us to model transmission channels for which the predominant noise is thermal noise. We assume that the bandwidth of the transmitted baseband signal is B and that the noise added to the transmitted signal is stationary, white, Gaussian, with a unilateral power spectral density N0. The noise power N is equal to:

[1.98] images

Table 1.5. Probabilities Pr(X,Y), Pr(Y) and Pr(X|Y) for the erasure channel

Pr(X,Y) Y = 0 Y = ε Y = 1
X = 0 q(1 − p) qp 0
X = 1 0 (1 − q)p (1 − q)(1 − p)
Pr(Y)
Y = 0 q(1 − p)
Y = ε p
Y = 1 qp + (1 − q)(1 − p)
Pr(X|Y) Y = 0 Y = ε Y = 1
X = 0 1 q 0
X = 1 0 1 − q 1

The sampling theorem says that 2BT samples are enough to represent a band-limited signal (fmax = B) of duration T. We will show in Volume 2 of this book [PIS 15] that the optimal demodulator is composed of a matched filter that limits the noise bandwidth. After matched filtering and sampling, the relation between the input symbol xi, drawn from a discrete alphabet, and the output symbol yi at time i can be written for the AWGN channel as follows:

[1.99] images

ni is the white noise sample with a centered Gaussian probability density:

[1.100] images

The variance image of the noise sample ni is equal to:

images

Consequently, the probability density of yi conditionally to xi is:

[1.101] images

1.7. Capacity of a transmission channel

1.7.1. Introduction

We would like to solve the following problem: let us assume that we transmit equiprobable bits Pr(X = 0) = Pr(X = 1) = ½ at a binary rate of 1000 bits per second through a binary symmetric channel with parameter p = 0.1. What is the maximum information rate that can be transmitted? We could imagine that this rate is 900 Sh/s, by subtracting the number of errors per second. However, this is not a good idea since we do not know the error positions. For example, when p = 0.5, we have on average 500 errors per second and no information is transmitted. In the following, we will answer this question.

1.7.2. Capacity of a transmission channel

We denote by X and Y the random variables associated with the input and the output of the channel, respectively.

DEFINITION 1.5.– We define the capacity of a transmission channel as follows:

[1.102] C = max I(X; Y)

The capacity of a transmission channel is the maximum of average mutual information. The maximization is performed over all the possible sources. If the channel is memoryless, the maximization is done over the set of all possible distributions p(x) of the input symbols x.

We first consider that the transmission channel is noiseless. In that case, the capacity is the average quantity of information that the input symbols can carry.

The capacity C is defined in Shannon/symbol. It is also possible to express the capacity in Shannon/second (we will refer to capacity per time unit instead of capacity per symbol). To distinguish it from the capacity per symbol, we will denote it by C′. We have:

[1.103] images

When the channel is noiseless, the capacity C is equal to log2 Q. Indeed, the average quantity of information is maximized when the source entropy is maximized, that is to say, when all the Q-ary input symbols are equiprobable. Then we have:

[1.104] C = HMAX(X) = log2 Q

When the channel is noisy, we have C < HMAX(X). In order to compute the capacity of a transmission channel, we have to calculate the average quantity of information that is lost in the channel. We have seen previously that H(X|Y) is the measure of the residual uncertainty on X knowing Y. In order to perform a good transmission, it is desirable that this quantity is equal to zero or negligible.

H(X|Y) corresponds to the average quantity of information lost in the channel. When the channel is noiseless, we have H(X|Y) = H(X|X) = 0 and consequently C = HMAX(X) corresponding to the relation [1.104].

When the channel is so noisy that X and Y are independent, we have H(X|Y) = H(X). In that case, the capacity is zero: C = 0. These two cases are illustrated in Figures 1.16 and 1.17.

image

Figure 1.16. Case C = HMAX(X)

image

Figure 1.17. Case C = 0

When the source entropy is equal to HMAX(X), H(X|Y) only depends on the transmission channel. If H(X|Y) is non-negligible (case of a noisy channel), it is not possible to perform a transmission without errors by simply connecting the source to the input of the transmission channel. We have to add an element called a channel coder between the source and the transmission channel. Figure 1.18 illustrates the new communication chain including the channel coder, the noisy channel and the channel decoder.

image

Figure 1.18. Communication system with channel coding

For this new system, we can define the average mutual information I(U;V) = H(U) − H(U|V). The role of channel coding is to make the average quantity of information H(U|V) as low as desired. It is then possible to transmit through the noisy channel an average quantity of information H(U) with the desired quality criterion. Of course, we have H(U) < H(X) due to the redundancy added by the channel coding.

We will now state the fundamental theorem of channel coding.

1.7.3. Fundamental theorem of channel coding

There exists a channel coding scheme guaranteeing a communication with an error rate as low as desired, under the condition that the average quantity of information entering the channel coder–channel–channel decoder block is less than the capacity C of the channel [SHA 48]:

[1.105] images

Thus, the capacity of a transmission channel as defined in equation [1.102] is equal to the highest number of information bits that can be transmitted through the channel with an error rate as low as desired.

Multiplying the two terms of this inequality by Ds (the data rate of the source), we obtain an inequality between the maximum binary information rate Db and the capacity per time unit C′:

[1.106] images

The proof of the fundamental theorem of channel coding is based on the principle of random coding and on the properties of jointly typical sequences of couples. Shannon proposed to use the following channel coder and channel decoder to prove the theorem: the codewords associated with the information words are randomly generated among a set of 2NR codewords. The decoder checks whether there exists a unique codeword jointly typical with the received word. If it exists, the decoding is successful. In the opposite case (no codeword or several codewords satisfy the test), the decoding fails. Since the probability that another codeword is jointly typical with the received word is equal to 2−N(I(X;Y)), if we limit the number of codewords to 2N(I(X;Y)), we can guarantee with a high probability that there will be no confusion between the transmitted codeword and the other codewords. We will see later that the optimal decoding is to choose the closest codeword according to the Euclidean distance; however, the decoding scheme proposed by Shannon facilitates the proof of the theorem.

The work of Shannon does not give us a practical solution (i.e. with a reasonable complexity) for the realization of the channel coder and the channel decoder. Since 1948, researchers have proposed practical error correcting codes and decoding algorithms to approach this theoretical limit. It is only in 1993, with the discovery of the so-called turbo codes [BER 93], and then the rediscovery of LDPC codes in 1995 [MAC 99], that it became possible to approach this limit within 1 dB.

1.7.4. Capacity of the binary symmetric channel

We have seen that the binary symmetric channel is described by the probability of error p. Since H(Y|X) = H2(p), the average mutual information I(X; Y) of the binary symmetric channel is given by:

[1.107] images

Since H2(p) is independent of q, in order to maximize I(X;Y), it is necessary to maximize H(Y). We have seen in equation [1.41] that to maximize H(Y), we should have Pr(Y = 0) = Pr(Y = 1) = 1/2 i.e. Pr(X = 0) = Pr(X = 1) = q = 1/2. Then the capacity of the binary symmetric channel is given by:

[1.108] images
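
The following minimal Python sketch evaluates this capacity and, incidentally, answers the question raised in section 1.7.1 for p = 0.1 and a rate of 1000 transmitted bits per second.

import math

def H2(p):
    """Binary entropy function in Shannon."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel in Sh/symbol."""
    return 1 - H2(p)

p = 0.1
C = bsc_capacity(p)
print(f"C = {C:.3f} Sh/symbol")                       # about 0.531 Sh/symbol
print(f"C' = {1000 * C:.0f} Sh/s at 1000 bits per second")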

In Figure 1.19, we plot the curves I(X; Y) = f(q) for a binary symmetric channel with p = 0.0, 0.1, 0.2 and 0.5. We can see on this figure that indeed the average mutual information is maximized when Pr(X = 0) = Pr(X = 1) = q = 1/2.

image

Figure 1.19. Mutual information I(X;Y) versus q and p

In Figure 1.20, we plot the curve C = f(p) for a binary symmetric channel with q = 0.5. As expected, the capacity is maximum when p = 0 and zero when p = 0.5.

image

Figure 1.20. Capacity of the binary symmetric channel versus p

1.7.5. Capacity of erasure channel

For the erasure channel, we have seen that the conditional entropy H(X|Y) is equal to pH2(q) and the average mutual information I(X; Y) is the following:

[1.109] images

The average mutual information is maximum for q = 0.5 i.e. H2(q) = 1. Consequently, the capacity of this channel is:

[1.110] images

1.7.6. Capacity of additive white Gaussian noise channel

To determine the capacity of the AWGN channel, we will first compute the average mutual information I(X; Y).

We have introduced in section 1.6.4 the relation yi = xi + ni between the samples yi at the output of the AWGN channel and the input samples xi and noise samples ni. The samples xi, yi and ni can be seen as realizations of three random variables X, Y and Z. Consequently, the average mutual information can be written as follows:

[1.112] images

since Z is the random variable associated with the noise and is independent of X.

In section 1.3.6, we have computed the differential entropy HD(X) of X, a Gaussian random variable with variance image. From equation [1.55], we have:

[1.113] images

It is possible to prove that the maximum of I(X; Y) is reached when the probability density of X is Gaussian, centered, with variance image. The variance of the noise Z is equal to image.

Let us compute the variance of Y. We have:

[1.114] images

From [1.111], we can derive the capacity of the AWGN channel as:

[1.115] C = ½ log2 (1 + σx2/σn2)

It should be noted that the capacity of the AWGN channel has been computed here under an average power constraint at the transmitter. Other constraints, such as a peak power constraint, would give a different expression of the capacity.

When the noise is not white, the capacity is obtained using the water-filling technique, which generalizes the relation [1.115].

Let us introduce P as the average power of the signal x(t). When considering a sampling frequency of 2B, we have 2BT samples for a duration T. Then the power is derived as follows:

[1.116] images

We finally obtain from [1.115] the classical expression of the capacity of the AWGN channel:

[1.117] C = ½ log2 (1 + P/N)

The capacity C is the capacity per real symbol, i.e. per dimension. Some authors express it in Shannon/dimension.

Figure 1.21 gives the curve of the capacity C of the AWGN channel versus the signal to noise ratio SNR = image. We can observe that for SNR > 5 dB, the capacity C is well approximated by C ≈ ½ log2(SNR).

image

Figure 1.21. Capacity of the additive white Gaussian noise channel

When multiplying [1.117] by the sampling frequency 2B, we finally obtain the expression of the capacity per time unit:

[1.118] images

where N = BN0 is the noise power.
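
A minimal sketch evaluating the capacity per dimension [1.117] and the capacity per time unit [1.118] is given below; the numerical values of P, N0 and B are arbitrary.

import math

def awgn_capacity_per_dimension(snr):
    """C = 1/2 log2(1 + SNR) in Sh/dimension."""
    return 0.5 * math.log2(1 + snr)

def awgn_capacity_per_second(P, N0, B):
    """C' = B log2(1 + P/(N0 B)) in Sh/s."""
    return B * math.log2(1 + P / (N0 * B))

snr_db = 10.0
snr = 10**(snr_db / 10)
print(f"C  = {awgn_capacity_per_dimension(snr):.3f} Sh/dimension at SNR = {snr_db} dB")
print(f"C' = {awgn_capacity_per_second(P=1.0, N0=1e-6, B=1e5):.3e} Sh/s")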

When the signal to noise ratio is high, we can approximate the capacity of the AWGN channel as follows:

images

1.7.7. Graphical representation

It is also possible to geometrically demonstrate that, in order to ensure a transmission without error, the average quantity of information H(U) should not be higher than image.

Let us recall the relation between the transmitted vector x and the received vector y of dimension D.

[1.119] images

n = (n1, n2, … , nD) is the noise vector composed of D independent Gaussian samples of variance image. The probability density of the vector n can be written as:

[1.120] images

For D tending to infinity, we have shown in Appendix A that the noise vector3 is concentrated at the surface of a D-dimensional sphere with radius image.

The transmitted vector x is randomly generated with a variance image per dimension and a Gaussian probability distribution in order to maximize the capacity:

[1.121] images

For the same reason as above, the vector x is concentrated at the surface of a sphere of radius image. Since the power of the received signal is the sum image, the vector associated with the received signal lies on the surface of the D-dimensional sphere with radius image.

We wish to perform a transmission without error of an average quantity of information H(U) = image, where M = 2DH(U) is the number of possible transmitted signals. To meet this goal, all the spheres of noise should be disjoint. Consequently, the volume of the M spheres of noise must be less than the volume of the sphere of radius image. Let us recall that the volume of a D-dimensional sphere with radius r is given by:

[1.122] images

where Γ(.) is the Gamma function4.

As a consequence, we should have:

[1.123] images

This idea is illustrated in Figure 1.22.

image

Figure 1.22. Spheres of noise illustration

The expression can be simplified as follows:

[1.124] images

Then we obtain the following inequality:

[1.125] images

This inequality can be rewritten as:

[1.126] images

Finally, since the capacity C is the highest possible value of the average information quantity H(U), we find again the formula of the capacity of the AWGN channel:

[1.127] images

When the bandwidth is limited, the dimension D is equal to D = 2BT, with B the bandwidth of the system and T the duration of the transmission. The noise power is equal to N = N0B = 2Bimage and the average power of the signal X is equal to P = 2Bimage. Then the capacity C′ per time unit is given by:

[1.128] images

Let Eb be the average energy per information bit and Es be the average energy per symbol. We have:

[1.129] images

where Ts and Tb are the symbol and information bit durations, respectively (assuming an M-ary modulation with M = 2g and a code rate R, we have Ts = gRTb).

We have the following relation between the signal to noise ratio P/N and the ratio Eb/N0:

[1.130] images

η is the spectral efficiency in bits/sec/Hz:

[1.131] images

The spectral efficiency η is maximum when the bandwidth is minimum, i.e. Bmin = 1/Ts. We then have:

[1.132] images

When considering that the binary rate is equal to the channel capacity (Db = C′), the maximum spectral efficiency ηmax can also be written as follows:

[1.133] images

This equation can also be rewritten as:

[1.134] images

The minimum value of Eb/N0 for a communication without error is obtained when the maximum spectral efficiency tends to zero (the bandwidth tends to infinity). We then obtain:

[1.135] images
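
Assuming the standard form Eb/N0 = (2^ηmax − 1)/ηmax implied by relation [1.134], the required Eb/N0 can be tabulated as a function of the maximum spectral efficiency; the limit of −1.59 dB given by [1.135] is recovered when ηmax tends to zero.

import math

def ebn0_min_db(eta):
    """Minimum Eb/N0 (in dB) for error-free transmission at spectral efficiency eta,
    assuming Eb/N0 = (2**eta - 1) / eta."""
    return 10 * math.log10((2**eta - 1) / eta)

for eta in (0.001, 0.5, 1, 2, 4, 8):
    print(f"eta = {eta:5} bit/s/Hz -> Eb/N0 >= {ebn0_min_db(eta):6.2f} dB")
# as eta tends to 0, Eb/N0 tends to ln(2), i.e. -1.59 dB (the Shannon limit)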

Figure 1.23 shows the curve of the maximum spectral efficiency versus Eb/N0. We have also plotted the required ratio Eb/N0, for a bit error rate of 10−5, of systems using BPSK and QPSK digital modulations without coding. These modulations will be studied in Volume 2 of this book [PIS 15]. The performance of these communication systems is at 9.5 dB and 7.75 dB, respectively, from the Shannon limit. Adding a convolutional code (133,171) of rate R = 1/2 to a system using BPSK modulation gives a 5.1 dB gain compared to the system without coding. The concatenation of this convolutional code with a Reed–Solomon code (255,223), as proposed by Forney [FOR 66], allows us to be 2.5 dB from the Shannon limit. The last point corresponds to the performance obtained using the turbo codes of Berrou et al. [BER 93]. We will study these error correcting codes in Chapters 3, 4 and 5.

image

Figure 1.23. Maximum spectral efficiency of an additive white Gaussian noise channel

Figure 1.24 shows the maximum spectral efficiency for different digital modulations. For a modulation with M possible states, the spectral efficiency is always limited by log2(M). These curves can be obtained numerically from the mathematical expression of the average mutual information given by equation [1.111], by taking into account the distribution of y, which is no longer Gaussian but depends on the considered modulation.

image

Figure 1.24. Spectral efficiency versus Eb/N0

1.8. Exercises

1.8.1. Exercise 1: calculation of the entropy for a discrete channel

A source X can generate three different symbols a1, a2 and a3 with the associated probabilities Pr(X = a1) = 0.25, Pr(X = a2) = 0.5 and Pr(X = a3) = 0.25

This source is connected to a discrete channel defined by the following conditional probabilities:

images

pij is the probability of receiving aj when ai is transmitted.

1) Draw this transmission channel graphically.

2) Compute the probabilities Pr(Y = aj) for j ∈ {1,2,3} and the conditional probabilities Pr(X = ai|Y = aj).

3) Compute the entropies H(X) and H(Y), the joint entropy H(X, Y) and the conditional entropy H(Y|X).

4) Check that H(X, Y) = H(Y|X) + H(X) = H(X|Y) + H(Y).

1.8.2. Exercise 2: computation of the mutual information [BAT 97]

We draw four cards randomly from a standard deck of 32 cards (4 suits: heart, spade, diamond, club – 8 values: 7, 8, 9, 10, Jack, Queen, King and Ace).

Let us define the following events:

  • – E1: the event “the hand contains no 7, 8, 9 and 10”;
  • – E2: the event “the hand contains no Jack, Queen and King”;
  • – E3: the event “the hand contains four cards of the same value”.

1) Compute the information h(E1), h(E2) and h(E3).

2) Compute the mutual information i(E1; E2) and i(E1; E3).

1.8.3. Exercise 3: capacity of the additive white Gaussian noise channel

Determine the capacity of the AWGN channel assuming that the signal power is 10 W, the bandwidth is 1 MHz and the noise power spectral density ½N0 is equal to 10−9 W/Hz.

1.8.4. Exercise 4: binary symmetric channel

We consider a binary symmetric channel. The source X generates equiprobable bits p(X = 0) = p(X = 1) = 0.5.

Determine H(Y), H(X|Y) and I(X;Y) as a function of p. Carry out the numerical application for p = 0.11.

1.8.5. Exercise 5: Z channel

Let us define the Z channel as follows:

image

The source X generates equiprobable bits p(X = 0) = p(X = 1) = 0.5.

1) Determine p(Y = yi), p(X = xi, Y = yj) and p(X = xi|Y = yj) as a function of p.

2) Determine H(X|Y) and I(X;Y) as a function of p. Carry out the numerical application for p = 0.5.

1.8.6. Exercise 6: discrete channel

Let us consider the discrete channel with input X and output Y with state space A = {0,1, 2, 3, 4} and conditional probabilities as follows:

images

1) Draw the channel graphically.

2) Compute H(X|Y) and I(X;Y) assuming that the input symbols are equiprobable.

3) Show that it is possible to transmit at the rate 1 bit/symb with a zero error probability.

4) Is it possible to increase the rate by grouping the symbols two by two?

Notes
