2
Source Coding

2.1. Introduction

We have seen in sections 1.4 and 1.5 that there exist two classes of source coding: lossless source coding and lossy source coding.

The aim of lossless source coding or entropy coding is to describe the digital sequence delivered by the source with the shortest sequence of symbols, usually bits. This sequence should be designed in order to guarantee the perfect reconstruction of the initial sequence by the source decoder. In lossy source coding, the aim is to minimize a fidelity criterion such as the mean square error or a subjective quality under a constraint on the binary rate.

In this chapter, we will first review in section 2.2 the different solutions to implement lossless source coding such as Huffman’s algorithm, arithmetic coding and Lempel–Ziv coding. The remainder of this chapter will focus on lossy source coding. In section 2.3, we will study the scalar quantization and the vector quantization and their respective performances. The coding techniques for memory sources such as linear prediction, scalar quantization with prediction and transform coding and subband coding will be detailed in section 2.4. Finally, we will give some application examples such as still image compression, audio compression and speech compression in section 2.5.

2.2. Algorithms for lossless source coding

2.2.1. Run length coding

The run length coding (RLC) is a simple algorithm exploiting the repetition between consecutive symbols. It is efficient when the source sequence is composed of many identical successive symbols. Instead of coding each symbol independently, we evaluate couples (number of identical consecutive symbols, symbol). Let us consider the example of the following sequence of data coded with 8 bits:

[2.1] images

A more efficient solution is to add a prefix only when the number of identical consecutive symbols is higher than 1:

[2.2] images

In this case, it is necessary to add an additional symbol to indicate the position of the repeated symbols. For example, we can add an octet every eight symbols to inform of the presence of a prefix. In the previous example, since the repeated symbols are the first, second and seventh, we have:

[2.3] images

This algorithm is particularly effective in the case of an image composed of black and white pixels such as the fax machine.
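The couple form of [2.1] is straightforward to implement. Below is a minimal sketch in Python; the input sequence is our own illustrative data, since the original example values are not reproduced here:

def rlc_encode(symbols):
    """Run length coding: describe the sequence as couples
    (number of identical consecutive symbols, symbol)."""
    pairs = []
    i = 0
    while i < len(symbols):
        run = 1
        while i + run < len(symbols) and symbols[i + run] == symbols[i]:
            run += 1
        pairs.append((run, symbols[i]))
        i += run
    return pairs

# Hypothetical sequence of 8-bit values
print(rlc_encode([90, 90, 90, 12, 12, 51, 51, 51, 51]))
# [(3, 90), (2, 12), (4, 51)]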

2.2.2. Huffman’s algorithm

In 1952, Huffman [HUF 52] proposed a variable length source coding algorithm suitable for a source generating L different messages. This algorithm allows us to obtain the minimum average number of bits per word while guaranteeing that the code is uniquely decodable and instantaneous.

We will now describe the algorithm to build the encoding table. We start the procedure by ordering the list of the L messages from the top to the bottom following a decreasing probability order (each message is associated with a node).

1) Choose the two nodes with the lowest probabilities.
2) Connect these two nodes together: the upper branch is labeled with 0, while the lower branch is labeled with 1.
3) Sum the two probabilities and associate the result with the new node.
4) Suppress the two nodes/messages previously selected and return to step 1.

We repeat this procedure until all the messages have been selected. The constructed tree graphically describes the source coding. The words are obtained by following the different paths from the right of the tree to the left.

EXAMPLE 2.1.– Let us consider a discrete source composed of eight messages: a1, a2, a3, a4, a5, a6, a7, a8, with the associated probabilities {0.16; 0.15; 0.01; 0.05; 0.26; 0.11; 0.14; 0.12}. The entropy of this source is H(X) = 2.7358.

The Huffman coding of this source is given in Figure 2.1.

image

Figure 2.1. Example of Huffman’s encoding

The encoding table is given in Table 2.1. The average number of bits per word Rmoy is equal to 2.8 and is slightly higher than the entropy of the source.

Table 2.1. Huffman’s encoding table

message word ni
a5 00 2
a1 010 3
a2 011 3
a7 100 3
a8 101 3
a6 110 3
a4 1110 4
a3 1111 4
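As an illustration, here is a short Python sketch of the construction procedure applied to Example 2.1. The branch labeling convention (which of the two merged nodes receives 0) is a free choice, so the individual words may differ from Table 2.1, but the word lengths and the average number of bits per word (2.8) are the same:

import heapq

def huffman_code(probs):
    """Build a Huffman code from a dict message -> probability."""
    # Heap entries: (probability, tie-breaker, {message: partial word})
    heap = [(p, i, {m: ""}) for i, (m, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # node with the lowest probability
        p1, _, c1 = heapq.heappop(heap)   # node with the second lowest probability
        merged = {m: "0" + w for m, w in c0.items()}
        merged.update({m: "1" + w for m, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))  # new node = sum of probabilities
        count += 1
    return heap[0][2]

probs = {"a1": 0.16, "a2": 0.15, "a3": 0.01, "a4": 0.05,
         "a5": 0.26, "a6": 0.11, "a7": 0.14, "a8": 0.12}
code = huffman_code(probs)
print(sum(probs[m] * len(w) for m, w in code.items()))  # 2.8 bits/word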

For memoryless sources, the Huffman algorithm builds an optimal source code; the average length reaches the entropy only when the probabilities of the messages are negative powers of 2 (1/2, 1/4, …). However, when the successive symbols are correlated, it is necessary to group many symbols together to constitute the messages. In this case, the complexity of the source coding becomes very high (the number of messages is Q^J assuming that the messages are composed of J Q-ary symbols).

The Huffman algorithm has often been used for image compression or audio compression in combination with lossy algorithms (Joint Photographic Experts Group (JPEG), MP3, etc.). From a practical point of view, the construction of the tree can be semi-adaptive or adaptive. The semi-adaptive method consists of performing two passes on the input sequence: during the first pass, the message probabilities are calculated and the tree is built, while during the second pass the encoding is performed. In the adaptive method, the Huffman tree is dynamically constructed simultaneously at the coder and the decoder. These methods are fully described in [SAL 07].

2.2.3. Evaluation of the entropy of a written text

To evaluate the entropy of a written text, we limit the size of the alphabet to the 26 letters and the symbols comma, full stop, quotation mark and space. Consequently, we will assume that the alphabet is composed of 30 different symbols. The probabilities of occurrence of these characters in an English literary text are given in Table 2.2. If the probability of occurrence of each character were equal, the entropy per character would be equal to log2(30) = 4.9 Sh/character.

Table 2.2. Probabilities of occurrence of characters

ai pi ai pi
a 0.0659 p 0.0143
b 0.0117 q 0.0006
c 0.0177 r 0.0500
d 0.0378 s 0.0494
e 0.1049 t 0.0702
f 0.0181 u 0.0235
g 0.0164 v 0.0072
h 0.0485 w 0.0192
i 0.0469 x 0.0008
j 0.0005 y 0.0145
k 0.0064 z 0.0008
l 0.0333 space 0.1744
m 0.0210 ' 0.0033
n 0.0580 , 0.0172
o 0.0587 . 0.0089

Using equation [1.64], we obtain an entropy of 4.2 Sh/character. However, this calculation does not take into account the correlation between consecutive characters. To illustrate this correlation, we have grouped the characters by pairs and shown in Figure 2.2 the probability of occurrence of each of these pairs. The characters associated with the columns and rows correspond, respectively, to the first and second character of the pairs.

By grouping the characters 2 by 2, we obtain an entropy per character of 3.6 Sh/character, which is slightly lower than the entropy previously calculated. Different studies have shown that for a literary text, the entropy is much lower: about 1 Sh/character.
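The single-character figure is easy to reproduce numerically. A minimal Python check of the entropy computed from the probabilities of Table 2.2:

import math

# Probabilities of Table 2.2: letters a to z, then space, quotation mark,
# comma and full stop
p = [0.0659, 0.0117, 0.0177, 0.0378, 0.1049, 0.0181, 0.0164, 0.0485,
     0.0469, 0.0005, 0.0064, 0.0333, 0.0210, 0.0580, 0.0587, 0.0143,
     0.0006, 0.0500, 0.0494, 0.0702, 0.0235, 0.0072, 0.0192, 0.0008,
     0.0145, 0.0008, 0.1744, 0.0033, 0.0172, 0.0089]

H = -sum(pi * math.log2(pi) for pi in p)
print(round(H, 1))   # 4.2 Sh/character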

To illustrate the redundancy of the English language, we perform the following experiment [MAC 03, COV 91]: the aim is for the candidate to determine, letter after letter, an unknown sentence. Once he/she finds a correct letter, we count the number of attempts required to guess this letter. The candidate then searches for the next letter. We present the two results obtained with the following sentence:

image

image

Figure 2.2. Probabilities of occurrence of the characters

It should be observed that in many cases, the candidate determines the next letter at the first attempt. Except at the beginning of the words and the syllables, the other letters can be easily obtained. We can imagine a very efficient source coding exploiting these properties: instead of coding the symbols, we code the number of attempts, which makes it possible to significantly reduce the number of bits required to transmit this sentence. It implies that the decoder has to perform the reverse procedure using complex decoding tables. This system illustrates the principles used in arithmetic coding [MAC 03, COV 91, BAT 97, RIS 76] and in the Lempel–Ziv algorithm [ZIV 78]. We will now describe these two classes of algorithms.

2.2.4. Arithmetic coding

Arithmetic coding was introduced by Rissanen [RIS 76] and Pasco [PAS 76]. It allows us to perform source coding without any a priori knowledge of the statistics of the source (memoryless or with memory). As for adaptive Huffman coding, the probability distribution is estimated as the source symbols occur. One of the main principles of arithmetic coding is to associate with each binary sequence an interval on the segment [0; 1[ as shown in Figure 2.3. For example, the sequence 0111 corresponds to the interval [0.0111; 0.1000[ in binary or [0.4375; 0.5[ in decimal. The longer the sequence, the smaller the associated interval. Compared to the Huffman algorithm, it is possible to associate a non-integer number of bits with a symbol depending on its probability of occurrence.

image

Figure 2.3. Example of partitioning

Assuming that the probabilities of the Q source symbols are known, we can then perform partitioning of the segment [0; 1[ into Q intervals or partitions Si of length Pr(xi). We can continue the partitioning by dividing the partitions Si into subpartitions Sij of length Pr(xi,xj) = Pr(xi)Pr(xj|xi) and so on. The principle of arithmetic coding consists of associating each sequence of symbols from the source with an interval that will be related to a binary word.

Let us consider an example to better understand the principle. We suppose that a discrete source is generating two symbols a and b with the probabilities of occurrence image and image. We wish to code the sequence aaba. We first divide the segment [0; 1[ into two intervals image and image, respectively, of length image and image. Since the first symbol of the sequence is a, the interval Sa is then divided into two intervals image and Sab=image, respectively, of length image and image since Pr(aa) = Pr(a)Pr(a | a) = image and image. Saa is then divided into two intervals image and image, then finally the interval Saab is divided into two intervals image and image.

The sequence aaba associated with the interval Saaba = [0.297; 0.395[ will be encoded by the binary word 0101 corresponding to the interval [0.0101; 0.0110[ in binary or [0.3125; 0.375[ in decimal. Indeed, it is the shortest word for which the interval is included in the interval Saaba.

These different operations are illustrated in Figure 2.4.

image

Figure 2.4. Example of arithmetic coding

The decoder performs the same operations as the coder in order to recover the sequence of symbols transmitted by the source. Practically, it is not necessary to know the conditional probabilities of the Q symbols of the source. These probabilities will be estimated as the source symbols occur. For example, in the case considered above, we can use the Laplace’s law of succession to estimate this probability as follows:

[2.4] Pr(a) = (c + 1)/(s + 2)

where c is the number of past realizations of the symbol a and s is the total number of past realizations. An adaptive version of arithmetic coding, the context adaptive binary arithmetic coding (CABAC), is used in the H.264/AVC video standard.
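The interval narrowing combined with the estimator [2.4] can be sketched in a few lines of Python. Note that the intervals below are computed with the adaptive Laplace estimate, not with the fixed probabilities used in Figure 2.4 (which are not reproduced here), so the final interval differs from Saaba:

def adaptive_intervals(sequence):
    """Narrow the segment [0, 1[ symbol by symbol for a binary source
    ('a'/'b'), estimating Pr(a) with Laplace's law [2.4]: (c + 1)/(s + 2)."""
    low, high = 0.0, 1.0
    c = s = 0                        # c: past count of 'a', s: total past symbols
    for sym in sequence:
        p_a = (c + 1) / (s + 2)      # current estimate of Pr(a)
        split = low + (high - low) * p_a
        if sym == "a":
            high = split             # keep the sub-interval associated with 'a'
            c += 1
        else:
            low = split              # keep the sub-interval associated with 'b'
        s += 1
    return low, high

print(adaptive_intervals("aaba"))    # approximately (0.25, 0.3)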

2.2.5. Algorithm LZ78

This algorithm, proposed in 1978 by Lempel and Ziv, is independent of the statistical properties of the source. It uses a dictionary that is completed as the symbols to be coded occur. Each element of the dictionary is a pair composed of a pointer or index on a previous element of the dictionary and a symbol. As a result, each element of the dictionary will be related to a string of symbols.

For example, let us consider a binary sequence delivered by a source:

0010000011000100100000101100010000010000110001010100001000001100000101100000011

We first find the shortest string that we have not yet found starting from the left.

0,01000001100010010000010110001000001000011

The second string different from 0 is 01

0,01,000001100010010000010110001000001000011

The third string different from 0 and 01 is 00

0,01,00,0001100010010000010110001000001000011

Finally, the sequence can be decomposed as follows:

0, 01, 00, 000, 1, 10, 001, 0010, 0000, 101, 100, 010, 00001, 000011

The encoding is done string-by-string from left to right. The strings are described from the strings already memorized. We obtain Table 2.3.

Table 2.3. Dictionary of strings

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ø 0 01 00 000 1 10 001 0010 0000 101 100 010 00001 000011
0-0 1-1 1-0 3-0 0-1 5-0 3-1 7-0 4-0 6-1 6-0 2-0 9-1 13-1

The first line of the table corresponds to the index of the strings in the dictionary, the second line corresponds to the strings and the third line corresponds to the pair index-symbol used for the encoding of these strings. For example, the string 0010 is encoded by 7-0 since it is built from the string 001 (index 7 in the dictionary) to which we add the symbol 0. In order to initialize the procedure of the dictionary construction, the first entry (index 0) will be associated with the empty string.

In this example, the encoded sequence will be:

0-0, 1-1, 1-0, 3-0, 0-1, 5-0, 3-1, 7-0, 4-0, 6-1, 6-0, 2-0, 9-1, 13-1

The tree of all strings memorized in the dictionary is given in Figure 2.5. Each node corresponds to a string obtained by adding a 0 or a 1 (label on the branch) to a previous string. This graphical representation facilitates the construction of the dictionary, coding and encoding of the sequence.

image

Figure 2.5. Tree associated with the strings memorized in the dictionary

Finally, the binary encoded sequence is as follows:

0-0, 1-1, 01-0, 11-0, 000-1, 101-0, 011-1, 111-0, 0100-0, 0110-1, 0110-0, 0010-0, 1001-1, 1101-1

We should emphasize that the characters ‘,’ and ‘-’ are only here to help the understanding of the algorithm and are not used in practice. In order to allow the decoding, the number of bits used to encode the index keeps increasing: 2 strings with an index of length 1, then 2 other strings with an index of length 2, 2^2 strings with an index of length 3, 2^3 strings with an index of length 4, etc.

The decoder will perform the same algorithm to reconstruct the initial sequence. The main drawback of this algorithm is that the size of the dictionary is always increasing. A solution is to define a maximum number of entries of the dictionary. This algorithm can also be used with non-binary sources such as ASCII characters [SAL 07].
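The LZ78 parsing is compact to express in Python. The following sketch reproduces the index-symbol pairs of Table 2.3 on the 42-bit prefix decomposed above:

def lz78_encode(bits):
    """LZ78: parse the sequence into new strings and encode each one
    as (index of the longest memorized prefix, new symbol)."""
    dictionary = {"": 0}             # index 0 is the empty string
    pairs = []
    phrase = ""
    for b in bits:
        if phrase + b in dictionary:
            phrase += b              # extend the current string
        else:
            pairs.append((dictionary[phrase], b))
            dictionary[phrase + b] = len(dictionary)
            phrase = ""
    return pairs

seq = "001000001100010010000010110001000001000011"
print(lz78_encode(seq))
# [(0,'0'), (1,'1'), (1,'0'), (3,'0'), (0,'1'), (5,'0'), (3,'1'),
#  (7,'0'), (4,'0'), (6,'1'), (6,'0'), (2,'0'), (9,'1'), (13,'1')]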

2.2.6. Lempel–Ziv–Welch (LZW) algorithm

Proposed in 1984 by Terry Welch, this algorithm is a popular evolution of the LZ78 algorithm. As in the LZ78 algorithm, the source sequence is divided into short strings of different lengths. These strings, called prototype strings, are memorized in an initially empty dictionary. A new string is added to the dictionary if it is different from the strings already memorized. Furthermore, a prototype string is kept in the dictionary only if at least one of the two strings obtained by appending a bit 0 or 1 to it has not itself been memorized; otherwise it is removed.

EXAMPLE 2.2.– Let us consider again the binary source sequence of the previous example. This sequence is divided into strings like for the LZ78 algorithm. We have:

0, 01, 00, 000, 1, 10, 001, 0010, 0000, 101, 100, 010, 00001, 000011, 0001, 0101, 000010, 0000110, 0000101, 1000, 00011

The prototype strings in bold correspond to the 16 prototype strings memorized in the dictionary. For example, the string 00001 has been removed from the dictionary since the strings 000010 and 000011 are in the dictionary. Table 2.4 gives the list of the 16 prototype strings for this example. Each prototype string will be coded using 4 bits.

The tree of the memorized prototype strings is given in Figure 2.6.

Finally, the binary source sequence is decomposed using the prototype strings memorized in the dictionary:

0010,0000110,0010,010,0000101, 1000, 1000,0010,00011,0001,0101, 000010, 0000110, 0000101, 1000, 00011

Table 2.4. List of prototype strings

position prototype string code word
1 1 0000
2 01 0001
3 001 0010
4 010 0011
5 100 0100
6 101 0101
7 0000 0110
8 0001 0111
9 0010 1000
10 0101 1001
11 1000 1010
12 00011 1011
13 000010 1100
14 000011 1101
15 0000101 1110
16 0000110 1111

The output of the source coder is the following:

1000 1111 1000 0011 1110 1010 1010 1000 1011 0111 1001 1101 1111 1110 1010 1011

Again, the source decoder will use the same algorithm to reconstruct the dictionary and then decode the sequence and obtain the initial source sequence.

In this example, the Lempel–Ziv–Welch algorithm encodes the 79 bits of the source sequence into 16 words of 4 bits, i.e. 64 bits in total. Although the source sequence is short, the algorithm already slightly reduces the number of bits. In practice, the content of the dictionary is adapted dynamically depending on the evolution of the source properties. There are other algorithms such as the Tunstall code, which associates with each variable-length message a word composed of a fixed number of q-ary symbols. The Lempel–Ziv algorithm and its evolutions are used for file compression. It can be proven that they asymptotically reach the entropy H(X).

image

Figure 2.6. Tree of the prototype strings

2.3. Sampling and quantization

2.3.1. Review of the sampling theorem

Let x(t) be a signal with limited bandwidth B generated by an analog source. Using the sampling theorem given in Chapter 1, we can show that this signal can be written from the sequence x(kT) as follows:

[2.5] x(t) = Σk x(kT) sinc(2B(t − kT))

with T = 1/(2B) and sinc(t) = sin(πt)/(πt).

image

Figure 2.7. Sampling and reconstruction

The discrete sequence x(kT) is obtained by sampling x(t) at times t = kT = k/2B (sampling frequency 2B samples per second). So, the output of the analog source can be converted into a discrete time signal without loss of information. In order to have a digital discrete sequence, we will also have to quantize the amplitude of the discrete sequence using a finite number of levels.

2.3.2. Scalar quantization

The scalar quantization is the process of mapping the signal amplitude to a set of L values. When the L values are regularly spaced, we say that this quantization is uniform. In the opposite case, the quantization is non-uniform.

The value of the quantized sample x̂ will be the closest (according to the Euclidean distance) to the value of the sample x. If L is a power of 2 (L = 2^R), then each quantized sample x̂ can be described using a binary word of R bits (coding operation):

[2.6] R = log2 L

DEFINITION 2.1.– Let S = {S1, S2,…, SL} be the set of cells and Y = {y1, y2, …, yL} be the set of quantized values. The quantization is mathematically defined by the following relation:

[2.7] x̂ = Q(x) = yi if x ∈ Si

Each interval or cell Si is bounded by two thresholds denoted ai−1 and ai. Consequently, the width of Si is equal to ai − ai−1. The quantization is uniform if all the intervals have the same width Δ. Δ is called the quantization step size or the resolution.

An example of uniform quantization is given in Figure 2.8 for L = 8.

image

Figure 2.8. Uniform quantization L = 8

The relation between the sample amplitude x and the quantized sample amplitude image is given in Figure 2.9 for L = 8.

A classical analog to digital converter (ADC) performs the sampling, uniform quantization and binary encoding. For an ADC with R = 8 bits/sample, we have L = 256.

An example of non-uniform quantization is given in Figures 2.10 and 2.11 for L = 8.

The quality of a quantizer can be measured using the squared error (x − x̂)² between the quantized signal and the original signal.

image

Figure 2.9. Uniform quantization L = 8

image

Figure 2.10. Non-uniform quantization L = 8

DEFINITION 2.2.– From the squared error, we define the mean square error (MSE) or mean distortion D:

[2.8] D = E[(x − x̂)²] = ∫ (x − Q(x))² f(x) dx

where f(x) is the probability density of x.

image

Figure 2.11. Non-uniform quantization L = 8

So, when the probability density f(x) is known, the aim of the quantization is to code the source samples with a given number of bits in order to minimize the mean distortion D. The quantization introduces a quantization error q between the amplitude x and the quantized amplitude x̂. We have the following relation:

[2.9] x̂ = x + q

For a uniform quantization, the quantization error q lies between −Δ/2 and +Δ/2.

When considering that the amplitude of the signal is significantly higher than the quantization step size Δ, the probability density of q is well approximated using a uniform distribution:

f(q) = 1/Δ for −Δ/2 ≤ q ≤ Δ/2, and 0 otherwise

Let us compute the mean square error D:

[2.10] D = E[q²] = (1/Δ) ∫ q² dq = Δ²/12 (integration over [−Δ/2; +Δ/2])

If the uniform quantization is performed over L = 2^R levels and if the dynamic range of the source signal is equal to A with A = ΔL = Δ2^R, the MSE is given by:

[2.11] D = Δ²/12 = (A²/12) 2^(−2R)

The signal-to-noise ratio in dB can be written as:

[2.12] SNR = 10 log10(σX²/D) = 6.02 R + 10 log10(12 σX²/A²) dB

where σX² is the source variance. We can notice that one additional bit improves the signal-to-noise ratio by about 6 dB.

If we assume that the signal x is sinusoidal with peak-to-peak amplitude A (i.e. peak amplitude of A/2), then σX² = A²/8 and from relation [2.12], we obtain:

[2.13] SNR = 6.02 R + 1.76 dB

If the signal x is uniformly distributed between −A/2 and +A/2, we have σX² = A²/12 and the distortion as a function of the rate R is equal to:

[2.14] D(R) = σX² 2^(−2R)

and the signal-to-noise ratio is the following:

[2.15] SNR = 6.02 R dB
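Relations [2.14] and [2.15] are easy to verify by simulation. A short Python sketch with a uniformly distributed source and a uniform mid-rise quantizer (the range A and the rates tested are our own illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
A = 2.0                                    # dynamic range of the source
x = rng.uniform(-A / 2, A / 2, 100_000)    # uniformly distributed samples

for R in (4, 6, 8):
    delta = A / 2 ** R                     # quantization step size
    xq = delta * (np.floor(x / delta) + 0.5)
    D = np.mean((x - xq) ** 2)             # close to delta**2 / 12
    snr = 10 * np.log10(np.var(x) / D)
    print(R, round(snr, 2))                # close to 6.02 R dB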

2.3.2.1. Pulse coded modulation and logarithmic quantization

Pulse coded modulation (PCM) is the simplest coding technique. It is composed of two different functions: the scalar quantization and the coding operation. After sampling, the samples are just quantized and then encoded. Figure 2.12 shows the block diagram of a PCM coder (without the binary coder). This technique is well adapted to memoryless sources.

image

Figure 2.12. Block diagram of the PCM coder

We have the following relation between the input and output of the PCM coder:

[2.16] x̂n = xn + qn

where qn is the quantization error.

In many applications such as speech coding, for example, the low amplitude signals appear more frequently than high amplitude signals. The uniform scalar quantization does not take into account this non-uniform distribution and is consequently suboptimal. This is why different non-uniform scalar quantizers have been proposed to improve the performance. The non-uniform quantizer can be seen as the association of a compressor which performs a nonlinear transformation and a uniform scalar quantizer. Usually, this nonlinear transformation is implemented using a look up table.

One of the criteria to determine the nonlinear function consists of forcing the signal-to-noise ratio to be constant over the full range of the source signal by adapting the quantization step size as a function of the input signal amplitude (Δ(x)/|x| ≈ cst). Consequently, the nonlinear functions are logarithmic and, for speech coding, two compression laws are mainly used (ITU-T G.711 standard) [ITU 89]:

A law (European system):

[2.17] y = sgn(x) A|x|/(1 + ln A) for |x| ≤ 1/A
y = sgn(x) (1 + ln(A|x|))/(1 + ln A) for 1/A ≤ |x| ≤ 1

where A = 87.6 in the G.711 standard.

For A law, the inverse function can be written as:

[2.18] x = sgn(y) |y|(1 + ln A)/A for |y| ≤ 1/(1 + ln A)
x = sgn(y) exp(|y|(1 + ln A) − 1)/A for 1/(1 + ln A) ≤ |y| ≤ 1

μ law (American system):

[2.19] y = sgn(x) ln(1 + μ|x|)/ln(1 + μ) for |x| ≤ 1

where μ = 255 in the G.711 standard.

For μ law, the inverse function can be written as:

[2.20] x = sgn(y) ((1 + μ)^|y| − 1)/μ

In both laws, the natural logarithm is used. For a classical speech signal, the A law reduces the quantization noise power by 24 dB compared to uniform quantization.
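As an illustration, here is a minimal Python sketch of the μ law [2.19] and its inverse [2.20], with the standard value μ = 255 and x normalized in [−1; 1]:

import numpy as np

def mu_law_compress(x, mu=255.0):
    """mu law [2.19] for x normalized in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse mu law [2.20]."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 11)
y = mu_law_compress(x)
print(np.allclose(mu_law_expand(y), x))   # True: [2.20] inverts [2.19]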

In Figure 2.13, we have plotted the A and μ laws. We can notice that the two curves are almost the same.

The non-uniform quantization can be implemented using a uniform quantization over 12 bits. The law is then approximated using 13 segments. The look up table is given in Table 2.5.

image

Figure 2.13. A law and μ law. For a color version of the figure, see www.iste.co.uk/leruyet/communications1.zip

Table 2.5. Look up table for the implementation of the non-uniform quantization

image

2.3.3. Optimal scalar quantization

2.3.3.1. Conditions for optimality

The optimal scalar quantization is the set of cells S = {S1, S2,…, SL} and quantized values Y = {y1, y2, …, yL} which minimize the MSE D. There is no simple solution to determine it. However, two conditions are necessary to guarantee the optimality and have been used to design practical methods of construction:

– nearest neighbor rule: let {y1, y2, …, yL} be a set of quantized values, where the cells must satisfy:

[2.21] Si = {x : (x − yi)² ≤ (x − yj)² ∀ j ≠ i}

– centroid condition: let {S1, S2, …, SL} be a set of cells, where the quantized values must be calculated as follows:

[2.22] yi = E[x | x ∈ Si] = ∫Si x f(x) dx / ∫Si f(x) dx

2.3.3.2. Limit case

We have seen in Chapter 1 that Shannon introduced the asymptotic distortion rate function given by equation [1.91]. In this section, we will derive different effective distortion rate functions depending on the solutions implemented.

When the number of cells L is high, the width of the cells is small and we can assume that the probability density of x is constant in each cell Si: f(x) ≈ f(yi) for x ∈ Si. Let Δi = ai − ai−1 be the width of the interval Si. We have:

[2.23] D = Σi ∫ (x − yi)² f(x) dx (integration over each cell [ai−1; ai])

And consequently,

[2.24] D ≈ Σi f(yi) ∫ (x − yi)² dx (integration over [ai−1; ai])

By integrating the right term, we obtain:

[2.25] D ≈ (1/3) Σi f(yi) [(ai − yi)³ − (ai−1 − yi)³]

To minimize the distortion, the term (ai − yi)³ should be as close as possible to −(ai−1 − yi)³. We deduce that the quantized value yi should be selected as the center of the cell: yi ≈ (ai−1 + ai)/2.

Finally, the MSE can be written as:

[2.26] D ≈ (1/12) Σi f(yi) Δi³

In the case of uniform quantization, we find again the result of equation [2.10].

It is possible to approximate D(R) as a function of the distribution of the input x. Using the relation f(yi)Δi³ = (f(yi)^(1/3) Δi)³, from [2.26] we can write:

[2.27] D ≈ (1/12) Σi (f(yi)^(1/3) Δi)³

Using Hölder's inequality (Σi ui vi ≤ (Σi ui^p)^(1/p) (Σi vi^q)^(1/q) with 1/p + 1/q = 1), we have:

[2.28] Σi f(yi)^(1/3) Δi ≤ L^(2/3) (Σi (f(yi)^(1/3) Δi)³)^(1/3)

we obtain:

[2.29] D ≥ (1/(12L²)) (Σi f(yi)^(1/3) Δi)³

In the limit case of small width Δi, we can replace the summation by an integration as follows:

[2.30] D ≥ (1/(12L²)) (∫ f(x)^(1/3) dx)³

Finally, using the relation L^(−2) = 2^(−2R), we obtain:

[2.31] D(R) = (1/12) 2^(−2R) (∫ f(x)^(1/3) dx)³

This formula is called the Panter and Dite approximation. In the case of a source with uniform distribution, we find the same relation as [2.14]. Using this approximation, we can show that for a Gaussian source with mean 0 and variance σX², we have:

[2.32] D(R) = (√3 π/2) σX² 2^(−2R)

When the samples at the output of the scalar quantizer are not equiprobable, we can perform an entropy coding (lossless source coding) after the scalar quantization so that the rate R equals the entropy of the discrete source composed of the samples at the output of the scalar quantizer. For a uniform quantizer with a small step size Δ, this rate is:

[2.33] R = H(X̂) ≈ HD(X) − log2 Δ

where HD(X) is the differential entropy of the source X. By using the Jensen inequality and equation [2.26], we obtain:

[2.34] D ≥ (1/12) 2^(2HD(X)) 2^(−2R)

The asymptotic effective distortion rate function when the scalar quantizer is followed by an entropy coder is:

[2.35] D(R) = (1/12) 2^(2HD(X)) 2^(−2R)

Let us compute this distortion rate function for a Gaussian source. In equation [1.55], we have shown that HD(X) = (1/2) log2(2πe σX²). Consequently, we obtain the following function:

[2.36] D(R) = (πe/6) σX² 2^(−2R)

In the case of a Gaussian source, the gain achieved by adding an entropy coder after the scalar quantization is equal to 10 log10(3√3/e) (2.8 dB). We can show that this asymptotic distortion rate function is 10 log10(πe/6) (1.53 dB) higher than the Shannon distortion rate function.

2.3.4. Vector quantization

To perform a vector quantization, we first have to combine N real samples to form a vector x = [x1, x2, …, xN]. The main interest of the vector quantization is its ability to exploit the correlation between the samples when the source is with memory. Assuming a number of bits per dimension R, the number of vectors in the dictionary will be equal to:

[2.37] L = 2^(NR)

The vector quantizer clearly generalizes the scalar quantizer: it is defined by a set of cells or Voronoï regions S = {S1, S2,…, SL} and a dictionary composed of L vectors of dimension N with yi = [yi1, yi2, …, yiN].

The vector quantization is mathematically defined with the following relation:

[2.38] x̂ = Q(x) = yi if x ∈ Si

From definition [1.86], the MSE or average distortion per dimension DN is equal to:

[2.39] DN = (1/N) E[‖x − Q(x)‖²] = (1/N) ∫ ‖x − Q(x)‖² f(x) dx

where f(x) is the probability density of x.

The two criteria for the optimality (nearest neighbor rule and centroid condition) can be directly generalized from the scalar case.

2.3.4.1. Lloyd's algorithm

Lloyd’s algorithm [LLO 82] allows us to build iteratively a vector quantizer by taking into account these two criteria. The algorithm is described as follows:

  • Step 1: initialization
• - generate a training sequence consisting of a huge number of samples representative of the source {x1, x2,…, xM} with M ≫ L,
    • - initialize the dictionary with vectors [y1,y2, … , yL] selected randomly;
  • Step 2: nearest neighbor rule
    • - using the dictionary, we associate each sample of the training sequence with its closest vector:
[2.40] Q(xk) = yi if ‖xk − yi‖² ≤ ‖xk − yj‖² ∀ j
  • Step 3: centroid condition

    using the set of samples xk belonging to the i-th region Si, we calculate the mean that will become the new quantized value of Si:

[2.41] yi = (1/|Si|) Σ xk (sum over the xk belonging to Si, where |Si| is the number of these samples)
  • Return to step 2 until convergence.

Although the algorithm always converges, we cannot guarantee that the obtained quantizer is optimal (with respect to the minimization of the MSE criterion). Lloyd's algorithm is also known as the K-means algorithm and is used for data partitioning. There is a second algorithm to build a quantizer, namely the Lloyd 2 algorithm, also called the Lloyd–Max algorithm.
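A compact Python sketch of Lloyd's algorithm on a training sequence of N-dimensional samples; the correlated Gaussian training data at the end is our own illustrative choice:

import numpy as np

def lloyd(training, L, iterations=50, seed=0):
    """Lloyd's algorithm: alternate the nearest neighbor rule [2.40]
    and the centroid condition [2.41] on a training sequence."""
    rng = np.random.default_rng(seed)
    codebook = training[rng.choice(len(training), L, replace=False)]
    for _ in range(iterations):
        # Step 2: associate each sample with its closest quantized vector
        dist = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        # Step 3: replace each quantized vector by the mean of its region
        for i in range(L):
            if np.any(nearest == i):
                codebook[i] = training[nearest == i].mean(axis=0)
    return codebook

rng = np.random.default_rng(1)
train = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], 10_000)
print(lloyd(train, L=16).round(2))   # 16 vectors concentrated along a line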

We consider a simple example to illustrate the interest of the vector quantization to quantize a source with memory. The signal to be quantized is a synthetic sequence generated by an autoregressive model of order 2 whose input is a white centered Gaussian noise. This source is with memory as shown in Figure 2.14. By constructing two-dimensional vectors x(n) = [x(n) x(n + 1)]T and plotting these vectors in a plane as shown in Figure 2.15, we can see that there is a correlation between the consecutive samples since the vectors are concentrated along a line. Compared to scalar quantization, vector quantization is able to exploit this correlation.

In Figures 2.16 and 2.17, we show the quantized values and cells obtained after scalar and vector quantization. In both cases, the quantized values have been obtained using Lloyd's algorithm. For the scalar quantization, the number of quantized values L is equal to 4, while for the vector quantization L = 16 since the chosen number of dimensions is N = 2. The two quantizations use 2 bits per sample. The MSE per dimension is 0.0267 and 0.015 for the scalar and vector quantizations, respectively.

image

Figure 2.14. Example of realization of a source with memory function of the time

image

Figure 2.15. Example of realization of a source with memory projected on a plane

image

Figure 2.16. Example of scalar quantization. For a color version of the figure, see www.iste.co.uk/leruyet/communications1.zip

image

Figure 2.17. Example of vector quantization. For a color version of the figure, see www.iste.co.uk/leruyet/communications1.zip

2.3.5. Optimal vector quantization

Let M(Si, yi) be the normalized moment of inertia of a Voronoï cell Si:

[2.42] M(Si, yi) = (1/N) ∫Si ‖x − yi‖² dx / V(Si)^(1+2/N)

where V(Si) is the volume of the cell Si.

When the number of cells L is high, Gersho [GER 92] has shown that it is reasonable to assume that the normalized moment of inertia is constant whatever the cell: M(Si, yi) ≈ MN. The distortion DN can then be approximated as a function of the input distribution as follows:

[2.43] DN ≈ MN βN 2^(−2R)

with

[2.44] βN = (∫ f(x)^(N/(N+2)) dx)^((N+2)/N)

In the case N = 1 (scalar quantization), M1 = 1/12 and β1 = (∫ f(x)^(1/3) dx)³, i.e. D1 ≈ (1/12) 2^(−2R) (∫ f(x)^(1/3) dx)³. We find again the Panter and Dite approximation [2.31].

Zamir and Feder [ZAM 96] have shown that when N tends toward infinity, MN tends to 1/(2πe). Let us compute the gain obtained by increasing the number of dimensions in the case of an i.i.d. Gaussian source. For N → ∞, we have β∞ = 2πe σX² and M∞ = 1/(2πe), i.e. M∞β∞ = σX², which corresponds to the Shannon distortion rate given by relation [1.91]. Consequently, the asymptotic gain is equal to:

[2.45] G = 10 log10(M1β1/(M∞β∞)) = 10 log10(√3 π/2) ≈ 4.35 dB

To conclude, for a Gaussian source, the vector quantization brings an asymptotical gain of 4.35 dB with respect to the scalar quantization.

The direct implementation of the vector quantization can be quite complex since it is necessary to perform an exhaustive search to find the quantized vector. Different authors have proposed vector quantization with a specific structure such as:

• – the multistage vector quantizer [JUA 82];
• – the tree-structured vector quantizer [GRA 82, LIN 80];
• – the shape gain vector quantizer [SAB 84];
• – the lattice vector quantizer [GIB 88, CON 92].

This section has only introduced the basic principles of vector quantization. For a more in-depth analysis, we recommend [GER 92] or the overview paper [GRA 98].

2.4. Coding techniques for analog sources with memory

2.4.1. Introduction

In the previous section, we have seen that it was possible for i.i.d. sources to almost reach the distortion rate function by using scalar quantization associated with entropy coding or by using vector quantization. In most of the applications, the source is with memory. In this case, it is important to exploit the correlation between the samples in order to improve the efficiency of the source coding.

We can distinguish three main families of techniques to exploit the correlation in order to reduce the number of coded bits:

  • – the direct coding techniques based on a time waveform such as the delta modulation, PCM, differential pulse coded modulation (DPCM) and adaptive differential pulse coded modulation;
  • – the techniques based on source model such as the linear prediction coding (LPC and RPE-LTP) or the vector quantization (vector sum excited linear prediction (VSELP), code excited linear predictive (CELP) and algebraic CELP (ACELP)) used for low-rate speech coding;
• – the transform coding techniques (discrete cosine transform, wavelets, filterbanks, etc.).

2.4.2. Linear prediction

Linear prediction is a statistical estimation technique to estimate the current and future samples of a random process from the past samples [MAK 75, VAI 08].

Let us suppose that the signal xn can be modeled with a stationary random process with an autocorrelation function Rxx(m) = E[xnxn–m]. We want to estimate x̂n from the P past observations [xn–1, xn–2, … xn–P]. The block diagram of a P order predictor is given in Figure 2.18. The aim of the linear prediction is to search the vector a that minimizes the MSE E[(xn − x̂n)²].

image

Figure 2.18. Block diagram of the P order predictor

Let

[2.46] x̂n = Σi ai xn–i = aT xn–1 (sum over i = 1, …, P)

[2.47] with a = [a1, a2, …, aP]T and xn–1 = [xn–1, xn–2, …, xn–P]T

We can determine the P prediction coefficients ai which minimize the MSE E[(xn − x̂n)²]:

[2.48] E[(xn − x̂n)²] = Rxx(0) − 2aT r + aT R a

with

r = [Rxx(1), Rxx(2), …, Rxx(P)]T and [R]ij = Rxx(i − j)

where Rxx(m) = E[xnxn–m] is the autocorrelation function of the samples xn.

Let us derive E[(xn − x̂n)²] with respect to a:

[2.49] ∂E[(xn − x̂n)²]/∂a = −2r + 2Ra

By canceling this expression, we obtain the following solution:

[2.50] a = R^(−1) r

Replacing the vector a obtained above in equation [2.48], we obtain:

[2.51] Emin = Rxx(0) − rT R^(−1) r = Rxx(0) − aT r

This relation is equivalent to solving a linear system using a set of equations called the Yule–Walker or Wiener–Hopf equations:

[2.52] R a = r

This system can be solved using the Gauss algorithm but the complexity is in O(P³). The Levinson–Durbin algorithm avoids the inversion of the matrix R and requires only O(P²) operations. It allows us to recursively build the order P predictor by computing the order m + 1 predictor from the order m predictor. We will describe the Levinson–Durbin algorithm without demonstration. Let a1(m),…, am(m) be the coefficients of the order m predictor and Em be the prediction error variance at order m. The algorithm is the following:

Initialization: E0 = Rxx(0)
For m = 0, …, P − 1:
  km+1 = (Rxx(m + 1) − Σi ai(m) Rxx(m + 1 − i))/Em (sum over i = 1, …, m)
  am+1(m + 1) = km+1
  ai(m + 1) = ai(m) − km+1 am+1–i(m) for i = 1, …, m
  Em+1 = (1 − k²m+1) Em

The coefficients ai are equal to the coefficients ai(P) ∀i = 1,…, P obtained at the end of the algorithm. The coefficients ki, i = 1,…, P are called the partial correlation coefficients and their modulus is less than 1. The residual MSE is given by:

[2.53] EP = Rxx(0) Πi (1 − ki²) (product over i = 1, …, P)

The coefficients ai can be calculated at the beginning of the transmission or periodically adjusted. In practice, the autocorrelation function Rxx(m) = E[xnxn–m] is estimated from a finite number of observed samples N ≫ P:

[2.54] R̂xx(m) = (1/N) Σn xn xn+m (sum over n = 0, …, N − 1 − m)

The prediction gain is defined as the ratio of the error powers without and with prediction. Using relations [2.51] and [2.53], we have:

[2.55] GP = Rxx(0)/EP = 1/Πi (1 − ki²)

Since the modulus of the partial correlation coefficients ki is less than 1, the gain is an increasing function of P. The prediction gain is all the greater as |ki| is close to 1.
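A direct Python transcription of the Levinson–Durbin recursion; the AR(1) test at the end is our own check (for Rxx(m) = ρ^m the optimal order 2 predictor is a = [ρ, 0] with E2 = 1 − ρ²):

import numpy as np

def levinson_durbin(r, P):
    """Solve the Yule-Walker equations R a = r [2.52] for an order P
    predictor; r[m] is the autocorrelation Rxx(m), m = 0, ..., P."""
    a = np.zeros(P + 1)                  # a[1..m]: coefficients at order m
    E = r[0]                             # prediction error variance E0
    k_all = []
    for m in range(1, P + 1):
        k = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / E   # partial correlation
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]
        a = a_new
        E = (1.0 - k ** 2) * E           # error variance update, as in [2.53]
        k_all.append(k)
    return a[1:], E, k_all

rho = 0.9
r = np.array([1.0, rho, rho ** 2])
print(levinson_durbin(r, P=2))           # a = [0.9, 0.0], E = 0.19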

2.4.3. Predictive scalar quantization

2.4.3.1. Delta modulation

We know that analog sources such as audio and video signals are very correlated and this redundancy cannot be exploited by a simple PCM modulation. When the source is with memory, the variation between the consecutive samples can be rather small. Consequently, by quantizing the difference between the consecutive samples, it is possible to reduce the rate at the output of the coder.

The principle of the delta modulation is to quantize the amplitude difference en between the input sample xn and the reconstructed sample x̃n:

[2.56] en = xn − x̃n

The quantization uses only two levels: ên = Δ sgn(en) = ±Δ. Figure 2.19 shows the block diagram of a delta modulator. The accumulator performs the following operation:

[2.57] x̃n+1 = x̃n + ên

If at time n, we have xn ≥ x̃n, then ên = +Δ and the output of the accumulator at time n + 1 is increased by Δ.

image

Figure 2.19. Delta modulator

If at time n, we have xn < x̃n, then ên = −Δ and the output of the accumulator at time n + 1 is decreased by Δ.

The block diagram of the delta demodulator is shown in Figure 2.20.

image

Figure 2.20. Delta demodulator

The quantization error qn due to this quantization is given by:

[2.58] qn = ên − en

It is then possible to derive the expression of the reconstructed sample x̃n+1 from xn and the quantization error:

[2.59] x̃n+1 = x̃n + ên = x̃n + en + qn = xn + qn

Consequently, the reconstructed sample x̃n is equal to the previous input sample xn–1 plus the quantization error qn–1. An example of behavior is given in Figure 2.21.

image

Figure 2.21. Example of behavior of x̃n in a delta modulator

We can observe two kinds of errors. The first is the slope overload, due to the slope of x̃(t) being limited to Δ/T, where T is the sampling period. In order to reduce the slope overload, the sampling frequency should be 4−5 times the minimum sampling frequency. Another solution is to increase the value of Δ. The second kind of error is the granular noise that happens even if the input signal is constant. Indeed, the reconstructed samples x̃n alternate between two levels (peak-to-peak noise of Δ). A solution is to reduce the value of Δ. Consequently, the choice of Δ is a compromise between these two kinds of errors. An efficient solution is to adapt the step size Δ depending on the signal variation. This is the principle used in the continuously variable slope delta (CVSD) modulation.
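The coder loop [2.56]-[2.57] fits in a few lines of Python. The sinusoidal test signal and the step size below are our own illustrative choices:

import numpy as np

def delta_modulate(x, delta):
    """Delta modulation: 1 bit per sample, accumulator as in [2.57]."""
    rec = np.zeros(len(x) + 1)               # reconstructed samples
    bits = np.zeros(len(x), dtype=int)
    for n in range(len(x)):
        bits[n] = 1 if x[n] >= rec[n] else 0
        e_hat = delta if bits[n] else -delta  # quantized error +/- delta
        rec[n + 1] = rec[n] + e_hat
    return bits, rec[1:]

t = np.arange(400)
x = np.sin(2 * np.pi * t / 200)
bits, rec = delta_modulate(x, delta=0.05)
# Too small a step -> slope overload; too large -> granular noise
print(round(np.max(np.abs(x - rec)), 3))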

2.4.3.2. Differential pulse coded modulation

The basic principle of DPCM is to quantize the amplitude difference en between the input sample xn and the predicted sample x̂n. en = xn − x̂n is called the prediction error. Let us suppose that the predicted sample x̂n at the receiver side can be perfectly fed back to the transmitter as shown in Figure 2.22.

image

Figure 2.22. Ideal block diagram of a DPCM transmission chain

In practice, x̂n is obtained using a P order predictor as seen previously. However, it is not possible to use xn as the input of the predictor. Indeed, in this case, it would not be possible to exactly add x̂n at the receiver side (xn is not known at the receiver) and we would have a reconstruction error higher than the quantization error (see the exercise on DPCM modulation). The proposed solution is to use a closed loop predictive quantization, i.e. to include a local decoder at the transmitter side. By doing this, the prediction will also be performed using the reconstructed signal x̃n. The block diagram of the DPCM coder using the closed loop predictive quantization is given in Figure 2.23.

The output of the predictor is given by:

[2.60] x̂n = Σi ai x̃n–i (sum over i = 1, …, P)

where the input of the predictor is the reconstructed sample x̃n = x̂n + ên.

image

Figure 2.23. Block diagram of the DPCM coder

We can check that the quantization error qn = ên − en is also the difference between the reconstructed sample x̃n and the input sample xn:

[2.61] x̃n − xn = x̂n + ên − xn = ên − en = qn

Figure 2.24 gives a block diagram of the DPCM decoder.

image

Figure 2.24. DPCM decoder

The coefficients of the predictor should be periodically transmitted to the receiver. In this case, the DPCM decoder will use exactly the same predictor as the transmitter (assuming no transmission errors). Consequently, we can reconstruct the samples x̃n.

Using this scheme, the samples en are decorrelated. Their amplitude is small and consequently they require a limited number of bits for the quantization. We can notice that the delta modulation is a simplified version of the DPCM modulation. Indeed, for the delta modulation, the quantization is performed using one bit and the predictor is replaced by a simple filter with transfer function z^(−1). There are more complex versions of the DPCM modulation using two predictors. It is also possible to adapt the quantization step size dynamically as a function of the variance of the source samples. A popular version is the adaptive DPCM (ADPCM) modulator. DPCM modulation is used for speech coding in the ITU G.721, G.722, G.723 and G.726 standards [ITU 90].
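A minimal closed-loop DPCM coder in Python, following Figure 2.23: the prediction is computed from the reconstructed samples of the local decoder. The order 1 source and the predictor coefficient below are our own illustrative choices:

import numpy as np

def dpcm_encode(x, a, delta):
    """Closed-loop DPCM: order P prediction [2.60] from reconstructed
    samples, uniform quantization of the prediction error."""
    P = len(a)
    rec = np.zeros(len(x) + P)                 # local decoder output
    eq = np.zeros(len(x))                      # quantized prediction errors
    for n in range(len(x)):
        past = rec[n:n + P][::-1]              # reconstructed x[n-1] ... x[n-P]
        pred = np.dot(a, past)                 # predicted sample
        eq[n] = delta * np.round((x[n] - pred) / delta)
        rec[n + P] = pred + eq[n]              # local decoder: prediction + ê
    return eq, rec[P:]

rng = np.random.default_rng(0)
x = np.zeros(2000)
for n in range(1, 2000):                       # order 1 source with memory
    x[n] = 0.9 * x[n - 1] + rng.normal(0, 0.1)
eq, rec = dpcm_encode(x, a=np.array([0.9]), delta=0.02)
print(round(np.var(x), 4), round(np.var(eq), 4))  # error variance much smaller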

2.4.4. Transform coding

Like linear prediction or vector quantization, transform coding exploits the source correlation in order to decrease the amount of data needed to code the source. The principle of transform coding consists, after having combined N input samples into an N dimension vector, of applying a linear transform in order to reduce the correlation between the samples and concentrate the energy over a few terms [GOY 01]. We then quantize those significant terms using a scalar quantization and apply entropy coding.

A first advantage of transform coding is its low complexity: it is possible to process large vectors due to the linearity. Another advantage is that, after transform coding, the transform domain can be more adapted to the human perception (audio or visual depending on the application).

The block diagram of the transform coding is given in Figure 2.25.

For the coding, the linear transform, defined by the matrix A, modifies the input vector x = [x0, x1,…, xN–1] composed of N samples into a vector of transform coefficients y = [y0, y1,…, yN–1] as follows:

[2.62] y = Ax

We then apply a scalar quantization αk to each transform coefficient yk. The indices are finally coded using an entropy coder γ. For the decoding, after entropy decoding γ–1, we apply an inverse quantization βk to the indices and then the vector x̂ is calculated by inverse linear transform of the vector ŷ:

[2.63] x̂ = Bŷ

where B is the inverse transform matrix.

image

Figure 2.25. Block diagram of the transform coding

Due to their low complexity and interesting properties, we will mainly use orthogonal transforms. In this case, we have:

[2.64] B = A^(−1) = AT

The orthogonal transform preserves the norm and, more importantly, it preserves the distortion, i.e. ‖x − x̂‖² = ‖y − ŷ‖². We can show that the transform that minimizes the distortion is the Karhunen–Loève transform. The columns ai of the transform matrix A should satisfy the following relation:

[2.65] R ai = λi ai

where λi are the eigenvalues of the covariance matrix R = E[xxT] of vector x. Consequently, the columns ai should be the eigenvectors of R.

The Karhunen–Loève transform is also called the Hotelling transform or the principal component analysis (PCA). The main drawback of the Karhunen–Loève transform is its computational complexity. Indeed, it is necessary to update the transform matrix with the covariance matrix if the signal is not stationary. This is why in practice, we often prefer to use the discrete Fourier transform (DFT) or the discrete cosine transform (DCT).

The coefficients akn of the matrix A when using the DCT are given by:

[2.66] akn = √(1/N) for k = 0, 0 ≤ n ≤ N − 1

and

akn = √(2/N) cos((2n + 1)kπ/2N) for 1 ≤ k ≤ N − 1, 0 ≤ n ≤ N − 1

The DCT is used in many standards of still image or video compression such as JPEG and Motion Picture Expert Group (MPEG) standards.
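The matrix of equation [2.66] can be generated directly; here is a short Python sketch checking the orthogonality property [2.64]:

import numpy as np

def dct_matrix(N):
    """DCT matrix A with the coefficients akn of equation [2.66]."""
    A = np.zeros((N, N))
    A[0, :] = np.sqrt(1.0 / N)
    for k in range(1, N):
        for n in range(N):
            A[k, n] = np.sqrt(2.0 / N) * np.cos((2 * n + 1) * k * np.pi / (2 * N))
    return A

A = dct_matrix(8)
print(np.allclose(A @ A.T, np.eye(8)))   # True: A is orthogonal (B = A^T)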

One last problem to solve is the determination of the best bit allocation in order to minimize the MSE for a given global rate R:

[2.67] min D = (1/N) Σi Di(Ri) under the constraint R = (1/N) Σi Ri

where Ri and Di(Ri) are, respectively, the rate and distortion associated with the coefficient yi.

From equation [2.43] we have, in the scalar case, Di(Ri) ≈ M1 β1(i) 2^(−2Ri), where β1(i) is proportional to the variance σi² of the coefficient yi. It can be shown [SEG 76] that the coding rate Ri minimizing the distortion D is the following:

[2.68] Ri = R + (1/2) log2(σi²/(Πj σj²)^(1/N))

The main drawback of this theoretical solution is that the obtained coding rates are not integers. To solve this issue, some heuristic algorithms have been proposed for which the rates are obtained by adding one bit at each iteration.

image

Figure 2.26. Example of scalar quantization after a Karhunen–Loève transform. For a color version of the figure, see www.iste.co.uk/leruyet/communications1.zip

In Figure 2.26, we show the vectors and cells obtained using Karhunen–Loève transform and scalar quantization using the same example as in section 2.3.4. We can observe that after the transformation, the energy is concentrated along the horizontal axis. Consequently, we can allocate 3 bits for this dimension and only 1 bit for the other dimension, i.e. 2 bits on average per sample. The MSE is equal to 0.0136 on the first dimension and 0.0199 for the second dimension. Consequently, the MSE per dimension is equal to 0.0168, very close to the one of two-dimension (2D) vector quantization.

2.4.5. Subband coding

Subband coding consists of splitting the input signal into many narrowband signals centered around different frequencies and then coding each narrowband signal separately. Subband coding is the first stage of many audio and video coding techniques. The implementation is simplified by using a filterbank composed of N filters as shown in Figure 2.27.

image

Figure 2.27. Subband coding

The decimation operation (↓ N) consists of removing N − 1 samples out of N and the upsampling operation (↑ N) consists of inserting N − 1 zeros between consecutive samples. The prototype filter G is a filter with a finite impulse response g of length L, for which the z transform is:

[2.69] G(z) = Σn g(n) z^(−n) (sum over n = 0, …, L − 1)

The N filters with z transforms Gi(z) and frequency responses Gi(f) can be deduced from the frequency response of the prototype filter G after a frequency shift of (2i + 1)/4N.

[2.70] images

So, the normalized central frequency of each filter is (2i + 1)/4N and the normalized bandwidth is 1/2N. The N z transforms Gi(z) can be obtained from the impulse response g:

[2.71] images

The practical implementation is done by exploiting the polyphase decomposition of filterbanks [VAI 93].

2.5. Application to the image and sound compression

2.5.1. Application to the still image compression

In this section, we will briefly present the JPEG standard [JPE 90] adopted in 1992. JPEG (Joint Photographic Experts Group) is a compression standard for still images. In the JPEG standard, the image pixels are processed by blocks of 8 × 8 pixels. Each pixel is coded using 8 bits. We will consider only the luminance part, but a similar process is applied to the two chrominance signals. For each block, the steps of the compression are as follows:

• – application of a 2D DCT to each block of 8 × 8 pixels by performing a first one-dimensional (1D) DCT on the rows and then a 1D DCT on the columns;
• – subtraction of the value 128;
• – quantization from a predefined table by applying to each coefficient the following operation:
[2.72] ŷ(k, l) = round(y(k, l)/q(k, l))

where q(k, l) is the corresponding entry of the quantization table.

An example of table is given in Table 2.6.

Table 2.6. Example of JPEG quantization table

16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
  • – the data of each block are then serialized (zigzag) as shown in Figure 2.28;
  • – finally, the data are encoded using Huffman’s coding.

A good compromise between the quality of the reconstructed image and the number of transmitted bits is obtained. In the JPEG standard, the DC elements (mean values) are separately encoded using predictive coding.
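The DCT and quantization steps can be sketched in Python using the table of Table 2.6. The test block is our own illustrative data; the 128 level shift is applied here before the DCT, which only changes the DC coefficient since the transform is linear:

import numpy as np

Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])

# Orthogonal 8x8 DCT matrix (equation [2.66])
k = np.arange(8)[:, None]
n = np.arange(8)[None, :]
A = np.sqrt(2.0 / 8.0) * np.cos((2 * n + 1) * k * np.pi / 16.0)
A[0, :] = np.sqrt(1.0 / 8.0)

def jpeg_quantize(block):
    """2D DCT (1D DCT on rows then on columns) and quantization [2.72]."""
    y = A @ (block - 128.0) @ A.T
    return np.round(y / Q).astype(int)

block = np.full((8, 8), 128.0) + np.arange(8)   # hypothetical gradient block
print(jpeg_quantize(block))                      # mostly zeros after quantization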

image

Figure 2.28. Zigzag serialization

In the JPEG 2000 standard [TAU 02], the wavelet transform is used instead of the DCT. After quantization, the image is decomposed into subimages corresponding to passband filtered and downsampled versions of the original image. The wavelet transform can be implemented efficiently using a filterbank. The video compression standards MPEG (MPEG-1, MPEG-2, MPEG-4 Part 2 and MPEG-4 AVC/H.264) exploit both time correlation and spatial correlation and use transform coding based on the DCT [SAL 07]. In the H.265 high-efficiency video coding (HEVC) standard published in 2013 [SUL 12], the size of the blocks processed is adapted depending on the content of the image (16 × 16, 32 × 32 and 64 × 64). This standard is very well adapted to high-resolution video coding (2K, 4K, 8K, etc.) and to mobile applications.

2.5.2. Application to speech coding

2.5.2.1. Characteristic of speech signal

Speech is the result of the coordinated action of the respiratory and masticatory organs. The lungs push the air during the expiration phase, through the trachea. At the level of the larynx, the air pressure is modulated before being applied to the vocal tract composed of the pharyngeal cavity and then the nasal and oral cavities in parallel. The speech signal is a complex signal due to the different elements involved in its generation. It is locally stationary when the duration of the analysis window is about 20 ms.

The speech signal can be divided mainly into two families of produced sounds:

• – the voiced sounds such as the vowels, the liquid consonant “l” and the nasal consonant “m”: these sounds are produced by the flow of air from the lungs and by adjusting the tension of the vocal cords so that they vibrate in a relaxation oscillation. The transmitted signal is a periodic or quasi-periodic signal characterized by a fundamental frequency f1, called pitch frequency or simply pitch, which gives the modulation (80−150 Hz for men and 120−300 Hz for women), and by secondary frequencies f2, f3, etc. The resonances are referred to as formants and their frequency locations as the formant frequencies. The characteristics of the vocal tract directly impact the produced sound. The frequency is controlled by the tongue and lips, while the intensity is adjusted by adapting the amount of air pushed by the lungs;
  • – the unvoiced sound such as the fricatives “f”, “s” and “ch”: they are produced when the airflow leaves from the mouth or nose without vibration of the vocal cords. These sounds have no specific periodicity and can be modeled from a white Gaussian source.

NOTE.− There exist some sounds combining the two modes of generation such as the voiced fricative consonants (for example, “v”, “z” and “j”).

The vocal tract can be modeled using a linear filter with an input signal composed of periodical pulses or white Gaussian noise. The characteristics of the filter are adapted depending on the generated sounds. Different techniques have been proposed to perform speech coding and can be classified into four main categories:

  • – the direct coding (PCM, DPCM and ADPCM);
  • – the linear prediction (adaptive predictive coding (APC) and LPC);
  • – the vector quantization (VSELP, CELP and ACELP);
  • – the subband coding (Mixed-excitation linear prediction (MELP)).

The important parameters to evaluate a speech coder are the following:

  • – the binary rate ranging from 1.2 to 64 kb/s;
  • – the quality, understanding and robustness to background noise. For the evaluation of the subjective quality, the mean opinion score (MOS) grade ranging from 1 (bad) to 5 (excellent) is commonly used for the standardization tests;
  • – the complexity of the coder (ranging from 2 to 30 million instructions per second (MIPS)) depending on the technique;
  • – the latency of the coder/decoder chain. It is usually related to the frame duration (from 0.125 to 30 ms).
image

Figure 2.29. Example of speech signal (“this is”) in the time domain

2.5.2.2. Direct coding

We have studied the predictive scalar quantization in section 2.4.3. Table 2.7 gives a comparison of different techniques of direct speech coding (PCM, DPCM and ADPCM) when considering a sampling frequency of 8 kHz. The selected parameters are the most commonly used in practice.

2.5.2.3. Linear predictive coder

Linear predictive coder (LPC) [CAM 86] is an obsolete coder based on linear prediction. However, it is the basis of many current speech coders. Figure 2.30 shows a block diagram of this speech coder.

Technique Quantizer Number of bits Rate
PCM Uniform 12 bits 96 kb/s
log PCM Logarithmic 8 bits 64 kb/s
DPCM Logarithmic 4−6 bits 32−48 kb/s
ADPCM Adaptive 3−4 bits 24−32 kb/s

Table 2.7. Comparison of modulations for speech coding

Figure 2.30. Simplified block diagram of the LPC coder

image

For unvoiced sounds, we will assume that the speech signal can be modeled using an autoregressive random process of order P0 and consequently there exists a whitening filter with transfer function A(z) of order P ≥ P0. In this case, we will determine the filter H(z) = 1/A(z) which reconstructs as well as possible the input sequence x from a white random excitation signal e with power spectral density γEE(f) = σ². We have:

[2.73] H(z) = 1/A(z) = 1/(1 − Σi ai z^(−i)) (sum over i = 1, …, P)

The power spectral density of the reconstructed signal x̂ at the receiver is given by:

[2.74] γX̂X̂(f) = σ²|H(f)|² = σ²/|A(f)|²

The coefficients ai of the filter are obtained using the Levinson–Durbin algorithm. In the case of a voiced sound, the excitation signal e is an impulse train with a period equal to the pitch of the input sequence.

In order to reduce the number of bits associated with the P coefficients ai that will be transmitted, we will use the line spectrum pairs (LSP). From A(z), we construct two polynomials P(z) and Q(z) as follows:

[2.75] P(z) = A(z) + z^(−(P+1)) A(z^(−1)) and Q(z) = A(z) − z^(−(P+1)) A(z^(−1))

If P is even, each of the two polynomials has P/2 conjugate roots on the unit circle, which allows us to rewrite the polynomials as a product given by:

[2.76] images
[2.77] images

From the P coefficients ai, we can compute the coefficients wi with the following property:

[2.78] 0 < w1 < w2 < … < wP < π

Under the condition that the ordering in equation [2.78] is kept after quantization, the LSP representation guarantees the stability of the filter used for the decoding.

We will briefly describe the parameters of the LPC10 coder using 10 coefficients and introduced by the American Department of Defense (Federal Standard 1015) [CAM 86]. Since the duration of the frame is 22.5 ms and the sampling frequency is 8 kHz, the frame is composed of 180 samples. The binary rate is 2.4 kbit/s and consequently only 54 bits are used to code a frame. Among these 54 bits, 42 bits are used for the coding of the LPC coefficients using the LSP representation, 6 bits for the pitch, 1 bit for the voiced/unvoiced selection and finally 6 bits for the gain g.

2.5.2.4. Code excited linear predictive

The CELP coder or its variations such as the multi-pulse excitation (MPE), the regular-pulse excitation (RPE) and the VSELP are used in most of the speech coders with rate less than or equal to 16 kbit/s. While the LPC coder only tries to reproduce the power spectral density of the speech signal, the CELP coder directly searches for the reconstructed signal closest to the input signal [SCH 85]. This approach is called analysis by synthesis (AbS). Figure 2.31 shows the block diagram of this speech coder. An equivalent representation is given in Figure 2.32. The CELP coder is composed of two predictive filters which decorrelate the input sequence x and provide a residual prediction error. The first filter, called long term predictor (LTP), takes into account the periodicity of the voiced sounds (pitch). The second filter, called short term predictor (STP), deals more particularly with the unvoiced sounds. A shape gain vector quantization is used for the coding of the residual error. The aim of the coder is to determine the coefficients of the predictive filters as well as the indices of the excitation vector and gain selected from the two dictionaries.

image

Figure 2.31. Block diagram of the CELP coder

Since the joint search of all the parameters is too complex, we will start by the computation of the filters’ coefficients by applying a white random signal at the input of the filters. We will then determine the best excitation sequence and the gain. A frequency prefiltering not shown in the figures is usually added to concentrate the error in the non-perceptual zone by exploiting the masking effect that will be introduced in the next section. Depending on the standard, the dictionary of excitation vectors is predefined or adapted to the input signal statistics.

image

Figure 2.32. Another version of the CELP coder

We will now give some standards using the CELP coder and its variations:

  • – ITU G.723.1 standard: two coders are normalized multi-pulse maximum likelihood quantization (MP-MLQ) (6.3 kbit/s)/ACELP (5.3 kbit/s). This standard defines the speech coder for teleconference over the public switched telephone network (PSTN) and for voice over IP (VOIP). The STP and LTP filters are of order 10 and 5, respectively;
  • – ITU G.728 standard low delay (LD) CELP (16 kbit/s): the frame duration is very short (0.625 ms). There is no LTP filter and the STP filter coefficients are not transmitted to the decoder. The indices of the excitation and gain vectors are coded using 10 bits in total (10 bits/0.625 ms= 16 kbit/s);
  • – ITU G.729 standard conjugate structure algebraic CELP (CS-CELP) (8 kbit/s): the excitation vectors are coded sequences with a conjugate algebraic structure. The non-zero elements of the excitation vectors are only 1 or −1 at regular positions [SAL 98]. The rate is divided as follows: coefficients of the STP filter of order 10 (1.8 kbit/s), coefficients of the LTP filter (2 kbit/s) and indices of the excitation and gain vectors (4.2 kbit/s);
  • – ETSI GSM 06.20 half rate (HR) standard (in Europe) and TIA IS-54 standard (in North America) VSELP (6.3 kbit/s): the dictionary of excitation vectors is highly structured;
  • – ETSI GSM full rate (FR) standard (13 kbit/s) uses a regular pulse excitation – long term prediction (RPE-LTP) coder: the excitation sequence is a concatenation of sequences composed of regularly spaced pulses. The rate is divided as follows: coefficients of the STP filter of order 8 (1.8 kbit/s), coefficients of the LTP filter (1.8 kbit/s) and indices of the excitation and gain vectors (9.4 kbit/s);
  • – ETSI GSM enhanced full rate (EFR) standard (12.2 kbit/s) uses the algorithm ACELP;
  • – TIA IS-96 standard QCELP (1.2–9.6 kbit/s), used in CDMA cellular communication: no LTP filter;
  • – FS 1016 standard CELP (4.8 kbit/s): developed by the American Department of Defense (DoD) for the third generation of secure telephones (STU-III).

A more detailed description of the different speech coding techniques and standards is given in [CHU 04].

2.5.3. Application to audio compression

While the speech coding techniques that we have previously presented exploit the characteristics of the voice signal produced by humans, audio compression takes into account the psychoacoustic properties of the human auditory system.

2.5.3.1. Psychoacoustic properties

The human ear can be decomposed into three parts: the outer ear (pinna and canal), the middle ear (eardrum and ossicles) and the inner ear (cochlea), each playing a specific role in the human auditory system. The sound waves reach the eardrum as pressure variations received by the outer ear. These pressure variations are transformed into mechanical vibrations by the ossicles and transmitted to the cochlea. The cochlea is filled with a liquid in which the 3 cm long basilar membrane is immersed. The basilar membrane converts the mechanical energy into electrical energy and is connected to the auditory nerve. Studies of the psychoacoustic properties of the ear were carried out by Fletcher and Zwicker in the 1960s [ZWI 61]. These studies have established the absolute threshold, i.e. the minimum detectable level of a sound, as shown in Figure 2.33. The maximum sensitivity is between 2 and 5 kHz.

image

Figure 2.33. Absolute threshold curve

Another important property for audio compression is masking. Masking is the process by which the detectability of one sound is impaired by the presence of another sound when their respective frequencies are close enough. This can be explained by the structure of the basilar membrane, which can be divided into different segments, each dealing with a given frequency band. These frequency bands are called the critical bands and, consequently, the human auditory system can be modeled by a filterbank composed of 24 frequency bands. The commonly admitted list of these critical bands using the Bark scale is given in Table 2.8.

Figure 2.34 illustrates masking in the presence of four primary tones, or tonal components, at frequencies 1,940 Hz, 4.5 kHz, 5.2 kHz and 19 kHz, respectively, in an audio signal. The masking curve associated with each of these tones defines the level required to hear a secondary tone as a function of frequency. These results show that masking is effective in a narrow band around the tonal components. It should be noticed that the slopes of the masking curves are steeper toward lower frequencies than toward higher frequencies.

Table 2.8. List of the 24 critical bands using the Bark’s scale

Index Band (Hz) Width (Hz) Index Band (Hz) Width (Hz)
1 20–100 80 13 1,720–2,000 280
2 100–200 100 14 2,000–2,320 320
3 200–300 100 15 2,320–2,700 380
4 300–400 100 16 2,700–3,150 450
5 400–510 110 17 3,150–3,700 550
6 510–630 120 18 3,700–4,400 700
7 630–770 140 19 4,400–5,300 900
8 770–920 150 20 5,300–6,400 1,100
9 920–1,080 160 21 6,400–7,700 1,300
10 1,080–1,270 190 22 7,700–9,500 1,800
11 1,270–1,480 210 23 9,500–12,000 2,500
12 1,480–1,720 240 24 12,000–15,500 3,500
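These band edges are well approximated by Zwicker's analytical formula for the Bark scale, a commonly used approximation from the psychoacoustics literature (not taken from this chapter), shown here only to illustrate the table:

```python
import math

def bark(f):
    """Zwicker's analytical approximation of the Bark scale (an assumption
    of this example; the standard tables are the reference)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# The upper edge of critical band b should fall near bark(f) = b:
for b, f in ((1, 100), (5, 510), (9, 1080), (13, 2000), (18, 4400), (24, 15500)):
    print(b, f, round(bark(f), 1))
```

Each printed value falls close to the index of the critical band whose upper edge is f, which is how the 24 bands of Table 2.8 partition the audible range.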

2.5.3.2. MPEG audio coding

The MPEG standard is an international ISO standard for audio and video compression. The MPEG 1 standard was finalized in 1992, while the MPEG 2 standard was published in 1994 [PAN 95]. The MPEG audio coder exploits the limitations of the human auditory system. More precisely, it eliminates the frequency components that cannot be detected by the ear, which significantly reduces the number of bits required to code the audio signal. Depending on the application, three different layers of the audio coding system can be applied. The rates required for each layer to obtain the same subjective quality as a stereo compact disk are the following:

  • – layer 1: 384 kbit/s (compression rate of 3.6);
  • – layer 2: 256 kbit/s (compression rate of 5.4);
  • – layer 3: 128 kbit/s (compression rate of 10.8).

A compact disk requires a binary rate of 2 × 44,100 × 16 = 1,411,200 bit/s ≈ 1,378 kbit/s (sampling frequency of 44,100 Hz, 16-bit scalar quantization, two channels; here 1 kbit = 1,024 bits), which explains the compression rates given above.
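The following snippet (plain arithmetic, added only as a check) reproduces these compression rates:

```python
# Stereo CD: 2 channels x 44,100 samples/s x 16 bits per sample
cd_rate = 2 * 44100 * 16                  # 1,411,200 bit/s
for layer, rate in ((1, 384), (2, 256), (3, 128)):
    ratio = cd_rate / 1024 / rate         # with 1 kbit = 1,024 bits
    print(f"layer {layer}: {ratio:.1f}")  # -> 3.6, 5.4, 10.8
```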

image

Figure 2.34. Masking level in frequency. For a color version of the figure, see www.iste.co.uk/leruyet/communications1.zip

In this section, we will provide some brief information on the MPEG 1 layer 1 standard. This standard allows three different sampling frequencies: 32, 44.1 and 48 kHz.

The general block diagram of an MPEG audio coder is given in Figure 2.35.

This MPEG audio coder is composed of four main parts:

  • – the filterbank, which divides the audio signal into N = 32 narrowband signals with the same bandwidth;
  • – the spectral estimator, which estimates the power spectral density of the audio signal inside an analysis window;
  • – the psychoacoustic model, which determines the masking level and the signal-to-mask ratio for each subband;
  • – the bit allocation unit, which calculates the number of bits allocated to each subband from the signal-to-mask ratio.
image

Figure 2.35. Block diagram of an MPEG audio coder

In the MPEG 1 layer 1 standard, from a frame composed of 512 samples, the filterbank generates N = 32 signals subsampled by a factor of 32. Let us express the 32 signals at the output of the filterbank as a function of the input signal and the impulse responses of the different filters:

[2.79] s_t[i] = ∑_{n=0}^{511} x[n] · h_i[n]

where x[n] is the input signal with 0 ≤ n ≤ 511 and:

[2.80] h_i[n] = h[n] · cos((2i + 1)(n − 16)π/64), i = 0, 1, …, 31

with h[n] = −C[n] if ⌊n/64⌋ is odd and h[n] = C[n] otherwise, where the C[n] are the coefficients of the filtering window defined in the standard. The impulse response of the prototype filter h[n] is given in Figure 2.36.

This filterbank is implemented using a polyphase network followed by a modified DCT. The 32 signals at the output of the filterbank can be calculated as follows:

[2.81] s_t[i] = ∑_{k=0}^{63} M[i][k] · ∑_{j=0}^{7} C[k + 64j] · x[k + 64j]

where i is the subband index ranging from 0 to 31 and s_t[i] is the output of the filter for subband i at time t, where t is an integer multiple of 32. x[n] is the input signal composed of 512 samples stored in a circular buffer, the values of which are renewed in blocks of 32. M[i][k] are the coefficients of the analysis matrix given by:

[2.82] M[i][k] = cos((2i + 1)(k − 16)π/64)
image

Figure 2.36. Impulse response of the prototype filter

The frequency response of the filterbank is represented in Figure 2.37. The bandwidth of each subband is 22/32 ≈ 0.7 kHz.
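As an illustration, here is a compact sketch (assuming numpy, and relying on the reconstruction of equations [2.81] and [2.82] given above) of one step of the polyphase analysis filterbank:

```python
import numpy as np

def analysis_filterbank(x, C):
    """One step of the polyphase analysis filterbank (sketch, following
    the reconstruction of equations [2.81]-[2.82] above).

    x: the current 512-sample buffer of the input signal
    C: the 512 coefficients of the filtering window defined in the standard
    Returns the 32 subband samples s_t[i] for this block."""
    # Partial sums: y[k] = sum_{j=0}^{7} C[k + 64j] x[k + 64j], k = 0..63
    y = (C * x).reshape(8, 64).sum(axis=0)
    # Analysis matrix M[i][k] = cos((2i + 1)(k - 16) pi / 64)
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)
    return M @ y
```

The 512-tap windowing is thus reduced to 64 partial sums followed by a 32 × 64 matrix product, which is the point of the polyphase implementation.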

For each block of 32 processed samples, the output is composed of one sample per subband; the signal is consequently downsampled by a factor of 32. One of the main drawbacks of this scheme is that all the subbands have the same bandwidth and therefore do not match the critical bands of the human auditory system.

image

Figure 2.37. Frequency response of the filterbank in the bandwidth [0; 22 kHz]

As previously mentioned, the psychoacoustic model determines the signal-to-mask ratio for each subband. Two models have been proposed in the standard. The procedure is rather complex and depends on the chosen model. For model 1, the procedure consists of first localizing the tonal components corresponding to the highest energy peaks in the power spectral density of the audio signal. After processing these tonal components, the remaining energy in each critical band is calculated and associated with a non-tonal component. The frequency of this non-tonal component is equal to the geometric mean calculated over the considered critical band. Then, the masking threshold and the signal-to-mask ratios are computed for each critical band. These signal-to-mask ratios are used for the bit allocation in each subband, as illustrated below.
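To illustrate this last step, here is a minimal greedy bit-allocation sketch. It is a simplification, not the exact procedure of the standard: each additional bit is assumed to improve the SNR of a subband by about 6 dB, and each new bit goes to the subband with the worst mask-to-noise ratio.

```python
import numpy as np

def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy bit allocation sketch (simplifying assumptions, see lead-in).
    smr_db: signal-to-mask ratio (dB) of each subband."""
    smr_db = np.asarray(smr_db, dtype=float)
    bits = np.zeros(smr_db.size, dtype=int)
    for _ in range(total_bits):
        mnr = 6.02 * bits - smr_db         # mask-to-noise ratio per subband
        for i in np.argsort(mnr):          # worst subband first
            if bits[i] < max_bits:
                bits[i] += 1
                break
    return bits

# Example: 8 subbands, 40 bits to distribute
print(allocate_bits([20, 18, 15, 10, 6, 3, 0, -5], 40))
```

Subbands whose signal lies entirely below the masking threshold (negative SMR) end up with few or no bits, which is precisely how the coder discards inaudible components.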

2.6. Exercises

2.6.1. Exercise 1: entropy and Huffman’s coding

Let X be a source with symbol rate D = 1,000 symbols/s using four messages A, B, C, D with occurrence probabilities Pr(A) = 0.6, Pr(B) = 0.2, Pr(C) = 0.15 and Pr(D) = 0.05.

1) Compute the entropy H(X) and entropy rate DI of the source X.

2) We code the symbols as follows: A → 00; B → 01; C → 10; D → 11.

Determine the binary rate D’ and entropy rate H’ after encoding.

3) Implement Huffman’s algorithm to efficiently encode the symbols. Determine the binary rate D" and entropy rate H" after encoding.

4) Propose a method to improve the efficiency of the coding.
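For readers who want to check their answers numerically, here is a small sketch of Huffman's algorithm (the function name and output convention are illustrative) that returns the codeword lengths, from which the average length follows:

```python
import heapq

def huffman_lengths(probs):
    """Sketch of Huffman's algorithm: returns the codeword length of each
    message; probs is a dict mapping message -> probability."""
    heap = [(p, [m]) for m, p in probs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    while len(heap) > 1:
        p1, g1 = heapq.heappop(heap)   # the two least probable groups
        p2, g2 = heapq.heappop(heap)
        for m in g1 + g2:              # each merge adds one bit to every
            lengths[m] += 1            # codeword of the merged group
        heapq.heappush(heap, (p1 + p2, g1 + g2))
    return lengths

probs = {"A": 0.6, "B": 0.2, "C": 0.15, "D": 0.05}
L = huffman_lengths(probs)
print(L, sum(probs[m] * L[m] for m in probs))  # average length in bit/message
```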

2.6.2. Exercise 2: entropy and Huffman’s coding for a source with memory

Let us consider a source X using three messages A, B, C with occurrence probabilities Pr(A) = 0.5, Pr(B) = 0.25 and Pr(C) = 0.25.

1) Calculate the entropy H(X) of the source.

2) Use Huffman’s algorithm to encode the source. Compute the average length of the words.

3) We now assume that the symbols are correlated two-by-two: Pr(AA) = 0.25; Pr(AB) = 0.2; Pr(AC) = 0.05; Pr(BA) = 0.05; Pr(BB) = 0.05; Pr(BC) = 0.15; Pr(CA) = 0.2; Pr(CB) = 0; Pr(CC) = 0.05. Calculate H(X). Propose an efficient encoding for this source and compute the average length of the words.

2.6.3. Exercise 3: LZ78 encoding and decoding

Let us consider the following binary message:

1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0

1) Encode this message using the LZ78 method.

2) Decode the encoded sequence.
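As for the previous exercises, a short sketch can be used to check the encoding (illustrative output convention: (index, symbol) pairs, with index 0 for the empty phrase):

```python
def lz78_encode(s):
    """LZ78 encoding sketch: emits (index, symbol) pairs."""
    dictionary, out, phrase = {}, [], ""
    for c in s:
        if phrase + c in dictionary:
            phrase += c                 # extend the current phrase
        else:
            out.append((dictionary.get(phrase, 0), c))
            dictionary[phrase + c] = len(dictionary) + 1
            phrase = ""
    if phrase:                          # flush a final, already-known phrase
        out.append((dictionary[phrase], ""))
    return out

print(lz78_encode("10110010100001100"))
```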

2.6.4. Exercise 4: scalar quantization and Huffman’s coding

Let us consider a source generating real samples. The probability density of the samples p(x) is given in Figure 2.38.

image

Figure 2.38. Probability density

1) We quantize the samples using eight levels and a step size equal to 2. The decision thresholds are −6, −4, −2, 0, +2, +4 and +6 and the set of quantized values is {−7, −5, −3, −1, +1, +3, +5, +7}. Determine the probability of each quantized value x̂.

2) Compute the entropy of the discrete sequence obtained after quantization.

3) Use Huffman’s algorithm to efficiently encode the quantized samples. Determine the average number of bits per sample.

2.6.5. Exercise 5: scalar quantization and distortion

Let us consider a source generating real samples. We quantize the samples using four levels and a step size of 1. The threshold levels are −1, 0, +1 and the set of quantized values is {−1.5, −0.5, +0.5, +1.5}.

1) Plot the transfer function of the quantizer.

2) Quantize the following sequence: 0.1; 1.8; −0.3; −0.9; 0.5 and compute the square error for each sample.

3) The probability density of this source is given in Figure 2.39. We use the same quantizer as previously. Plot the function d = f(x) with d = (x − x̂)².

image

Figure 2.39. Probability density

4) Determine the literal expression of the MSE or mean distortion D and compute it in our case.

5) We know that in the case of a Gaussian source with variance σx², the theoretical limit for D is D = σx² · 2^(−2R), where R is the number of bits per sample. Compute this value and compare it with the previously calculated mean distortion.

6) In which case do we have a mean square distortion equal to (step size)²/12?
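A small sketch of the quantizer of this exercise (numpy assumed; illustrative, useful to check questions 2 and 4 numerically):

```python
import numpy as np

def quantize(x, step=1.0, levels=4):
    """Uniform mid-rise quantizer sketch: with step=1 and levels=4 the
    thresholds are -1, 0, +1 and the outputs {-1.5, -0.5, +0.5, +1.5}."""
    idx = np.clip(np.floor(np.asarray(x, dtype=float) / step),
                  -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

x = np.array([0.1, 1.8, -0.3, -0.9, 0.5])
xq = quantize(x)
print(xq, (x - xq) ** 2)   # quantized values and per-sample square errors
```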

2.6.6. Exercise 6: DPCM coding and decoding

Let us consider a predictive coding system composed of a coder and a decoder as shown in Figure 2.40.

image

Figure 2.40. Block diagram of the predictive coding system

1) Give the transfer function (using the z transform) of this chain, assuming that the predictor is a digital filter with transfer function W(z). Deduce the global error between the input and the output of the system.

We modify the coder as follows:

image

Figure 2.41. Block diagram of the modified coder

2) Write E(z) as a function of X(z) and Q(z). Show that the input signal and the quantization error q(n) are filtered by the same filter with transfer function 1 − W(z).

Let us consider the DPCM coder given in Figure 2.23.

3) Show that the quantization error q(n) = e(n) − eq(n) is equal to x(n) − x̂(n), where x̂(n) is the reconstructed signal.

4) Show that the DPCM coder is equivalent to the previous scheme.

5) The structure of the DPCM decoder is given in Figure 2.24. Justify this association.
