Chapter 9
Audio Coding

U. Zölzer and P. Bhattacharya

For transmission and storage of audio signals, different methods for compressing data have been investigated in addition to the pulse-code modulation (PCM) representation. The requirements of different applications have resulted in a variety of audio coding methods which have become international standards. In this chapter, the basic principles of audio coding are introduced and then the most important audio coding standards are discussed. Audio coding can be divided into two types: lossless and lossy audio coding. Lossless audio coding is based on a statistical model of the signal amplitudes for coding the audio signal (audio coder). The reconstruction of the audio signal at the receiver allows a lossless resynthesis of the signal amplitudes of the original audio signal (audio decoder). Lossy audio coding, by contrast, makes use of a psychoacoustic model of human acoustic perception for quantizing and coding the audio signal. In this case, only the acoustically relevant parts of the signal are coded and reconstructed at the receiver; the samples of the original audio signal are not exactly reconstructed. The objective of both audio coding methods is a data-rate reduction, or data compression, for transmission or storage compared with the original PCM signal.

9.1 Lossless Audio Coding

Lossless audio coding is based on linear prediction followed by entropy coding [Jay84], as shown in Fig. 9.1.

  • Linear Prediction. A quantized set of predictor coefficients $P$ for a block of $M$ samples is determined, which leads to an estimate $\hat{x}(n)$ of the input sequence $x(n)$. The aim is to minimize the power of the difference signal $d(n)$ without any additional quantization errors, i.e., the word length of the signal $\hat{x}(n)$ must be equal to the word length of the input. An alternative approach [Han98, Han01] quantizes the prediction signal $\hat{x}(n)$ such that the word length of the difference signal $d(n)$ remains the same as the input signal word length. Figure 9.2 shows a signal block $x(n)$ and the corresponding spectrum $|X(f)|$. Filtering the input signal with the predictor filter transfer function $P(z)$ delivers the estimate $\hat{x}(n)$. Subtracting the prediction signal from the input yields the prediction error $d(n)$, which is also shown in Fig. 9.2 and which has a considerably lower power compared with the input power. The spectrum of this prediction error is nearly white (see Fig. 9.2, bottom right). The prediction can be represented as a filter operation with the analysis transfer function $H_A(z) = 1 - P(z)$ on the coder side.
  • Entropy Coding. The difference signal $d(n)$ is coded according to the probability density function of the block: samples $d(n)$ of higher probability are coded with shorter data words, whereas samples $d(n)$ of lower probability are coded with longer data words [Huf52].
  • Frame Packing. The frame combines the quantized and coded difference signal with the coded $M$ coefficients of the predictor filter $P(z)$ of order $M$.

    Figure 9.1 Lossless audio coding based on linear prediction and entropy coding.


    Figure 9.2 Signals and spectra for linear prediction.

  • Decoder. On the decoder side, the inverse synthesis transfer function $H_S(z) = H_A^{-1}(z) = [1 - P(z)]^{-1}$ reconstructs the input signal from the coded difference samples and the $M$ filter coefficients. The frequency response of this synthesis filter represents the spectral envelope shown in the top right part of Fig. 9.2. The synthesis filter shapes the white spectrum of the difference (prediction error) signal with the spectral envelope of the input spectrum.
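The analysis/synthesis pair above can be sketched in a few lines. This is a simplification for illustration only: it assumes a fixed first-order predictor $P(z) = z^{-1}$ instead of the adaptive order-$M$ predictor described in the text, but it shows the essential property of lossless coding, namely bit-exact reconstruction in integer arithmetic.

```python
# Sketch of lossless prediction with a fixed first-order integer predictor
# (an assumed simplification; real coders adapt an order-M predictor per block).
# Analysis:  d(n) = x(n) - x_hat(n) with x_hat(n) = x(n-1), i.e. H_A(z) = 1 - z^-1.
# Synthesis: the inverse filter H_S(z) = 1/(1 - z^-1) undoes this exactly.

def analyze(x):
    """Prediction error d(n) = x(n) - x(n-1); lossless for integer input."""
    prev = 0
    d = []
    for sample in x:
        d.append(sample - prev)
        prev = sample
    return d

def synthesize(d):
    """Reconstruct x(n) by accumulating the difference samples."""
    prev = 0
    x = []
    for diff in d:
        prev = prev + diff
        x.append(prev)
    return x

x = [3, 5, 6, 6, 4, 1, -2, -3]
d = analyze(x)
assert synthesize(d) == x      # bit-exact reconstruction
print(d)                       # smaller-magnitude values for smooth signals
```

For slowly varying signals the prediction error has smaller magnitude than the input, which is exactly what makes the subsequent entropy coding effective.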

The attainable compression rates depend on the statistics of the audio signal and allow a compression rate of up to two [Bra92, Cel93, Rob94, Cra96, Cra97, Pur97, Han98, Han01, Lie02, Raa02, Sch02]. Figure 9.3 illustrates examples of the necessary word length for lossless audio coding [Blo95, Sqa88]. In addition to the local entropy of the signal (entropy computed over a block length of 256), results for linear prediction followed by Huffman coding [Huf52] are presented. Huffman coding is carried out with a fixed code table [Pen93] and a power-controlled choice of adapted code tables. It is observed from Fig. 9.3 that for high signal powers, a reduction of word length is possible if the choice is made from several adapted code tables. Lossless compression methods are used for storage media with limited word length (16 bit), which are used for recording audio signals of higher word lengths (> 16 bit). Further applications are in the transmission and archiving of audio signals.
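The local entropy mentioned above can be estimated directly from the sample histogram of a block. The following sketch computes the empirical entropy in bits per sample for one block (the text uses a block length of 256); the particular test signals are illustrative assumptions.

```python
# Empirical (local) entropy of one block of samples, in bits/sample.
# This is the lower bound on the average code-word length an entropy
# coder such as a Huffman code [Huf52] can achieve for that block.
import math
from collections import Counter

def block_entropy(samples):
    """H = -sum p_i log2 p_i over the amplitude histogram of the block."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A nearly constant block needs far fewer bits/sample than a spread-out one:
flat = [0] * 250 + [1] * 6
spread = list(range(256))
print(block_entropy(flat), block_entropy(spread))
```

A Huffman code built per block (or chosen from adapted code tables, as in the text) approaches this bound from above.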


Figure 9.3 Lossless audio coding (Mozart, Stravinsky): word length in bits versus time (entropy ‐ ‐, linear prediction with Huffman coding —).

9.2 Lossy Audio Coding

Significantly higher compression rates (by a factor of four to eight) can be obtained with lossy coding methods. Psychoacoustic phenomena of human hearing are used for signal compression. The fields of application cover a wide range, from professional audio, such as source coding for audio transmission, to home entertainment applications.

The coding methods [Bra94] are standardized in the international specification ISO/IEC 11172-3 [ISO92], which is based on the following processing steps (see Fig. 9.4).

  • Sub-band decomposition with filter banks of short latency time.
  • Calculation of psychoacoustic model parameters based on a short-time fast Fourier transform (FFT).
  • Dynamic bit allocation according to the psychoacoustic model parameters (signal-to-mask ratio, SMR).
  • Quantization and coding of sub-band signals.
  • Multiplex and frame packing.

Figure 9.4 Lossy audio coding based on sub‐band coding and psychoacoustic models.

Because of the lossy nature of the coding, post-processing of such signals, as well as repeated coding and decoding steps, is associated with additional problems. The high compression rates nevertheless justify the use of lossy audio coding techniques in applications like transmission.

9.3 Psychoacoustics

In this section, the basic principles of psychoacoustics are presented. The results of psychoacoustic investigations by Zwicker [Zwi82, Zwi90] form the basis for audio coding based on models of human perception. These coded audio signals have a significantly reduced data rate compared with the linearly quantized PCM representation. The human auditory system analyzes broadband signals in so-called critical bands. The aim of psychoacoustic coding of audio signals is to decompose the broadband audio signal into sub-bands, which are matched to the critical bands, and then perform quantization and coding of these sub-band signals [Joh88a, Joh88b, Thei88]. Because the perception of sound below the absolute threshold of hearing is not possible, sub-band signals below this threshold need to be neither coded nor transmitted. In addition to the perception in critical bands and the absolute threshold, the effects of signal masking in human perception play an important role in signal coding. These will be explained in the following and their application to psychoacoustic coding will be discussed.

9.3.1 Critical Bands and Absolute Threshold

Critical Bands. Critical bands, as investigated by Zwicker [Zwi82], are listed in Table 9.1.

Table 9.1 Critical bands as given by Zwicker. Modified from [Zwi82].

z/Bark   f_l/Hz   f_u/Hz   f_B/Hz   f_c/Hz
  0          0      100      100       50
  1        100      200      100      150
  2        200      300      100      250
  3        300      400      100      350
  4        400      510      110      450
  5        510      630      120      570
  6        630      770      140      700
  7        770      920      150      840
  8        920     1080      160     1000
  9       1080     1270      190     1170
 10       1270     1480      210     1370
 11       1480     1720      240     1600
 12       1720     2000      280     1850
 13       2000     2320      320     2150
 14       2320     2700      380     2500
 15       2700     3150      450     2900
 16       3150     3700      550     3400
 17       3700     4400      700     4000
 18       4400     5300      900     4800
 19       5300     6400     1100     5800
 20       6400     7700     1300     7000
 21       7700     9500     1800     8500
 22       9500    12000     2500    10500
 23      12000    15500     3500    13500
 24      15500

A transformation of the linear frequency scale into a hearing-adapted scale is given by Zwicker [Zwi90] (units of $z$ in Bark)

(9.1) $\dfrac{z}{\mathrm{Bark}} = 13\arctan\left(0.76\,\dfrac{f}{\mathrm{kHz}}\right) + 3.5\arctan\left(\left(\dfrac{f}{7.5\,\mathrm{kHz}}\right)^{2}\right).$

The individual critical bands have the following bandwidths:

(9.2) $\Delta f_B = 25 + 75\left(1 + 1.4\left(\dfrac{f}{\mathrm{kHz}}\right)^{2}\right)^{0.69}.$
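Equations (9.1) and (9.2) are easy to evaluate numerically; the following sketch implements both and checks them against Table 9.1 near 1 kHz.

```python
# Bark transformation, Eq. (9.1), and critical bandwidth, Eq. (9.2) [Zwi90].
import math

def hz_to_bark(f_hz):
    """Critical-band rate z in Bark as a function of frequency in Hz."""
    f_khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def critical_bandwidth(f_hz):
    """Critical bandwidth Delta f_B in Hz."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# At 1 kHz (center of band 8 in Table 9.1, edges 920-1080 Hz):
print(hz_to_bark(1000.0))        # about 8.5 Bark
print(critical_bandwidth(1000.0))  # about 162 Hz, close to the tabulated 160 Hz
```

The analytic formulas smooth the stepped values of Table 9.1, which is why the computed bandwidth does not match the table exactly.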

Absolute Threshold. The absolute threshold $L_{T_q}$ (threshold in quiet) denotes the sound pressure level $L$ [Zwi82] versus frequency at which a sinusoidal tone is just perceptible. The absolute threshold is given by [Ter79]

(9.3) $\dfrac{L_{T_q}}{\mathrm{dB}} = 3.64\left(\dfrac{f}{\mathrm{kHz}}\right)^{-0.8} - 6.5\exp\left(-0.6\left(\dfrac{f}{\mathrm{kHz}} - 3.3\right)^{2}\right) + 10^{-3}\left(\dfrac{f}{\mathrm{kHz}}\right)^{4}.$

Below the absolute threshold, no perception of signals is possible. Figure 9.5 shows the absolute threshold versus frequency. Band splitting in critical bands and the absolute threshold allow for the calculation of an offset between the signal level and the absolute threshold for every critical band. This offset is responsible for choosing appropriate quantization steps per critical band.
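The threshold-in-quiet formula (9.3) can be sketched directly; the chosen sample frequencies are illustrative only.

```python
# Absolute threshold (threshold in quiet), Eq. (9.3) [Ter79].
import math

def threshold_in_quiet_db(f_hz):
    """L_Tq in dB as a function of frequency in Hz."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The ear is most sensitive around 3-4 kHz, where the curve dips below 0 dB,
# while the threshold rises steeply toward low and very high frequencies:
for f in (100.0, 1000.0, 3300.0, 10000.0):
    print(f, threshold_in_quiet_db(f))
```

Evaluating this curve per critical band gives the per-band absolute threshold against which the signal level is compared.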


Figure 9.5 Absolute threshold (threshold in quiet).

9.3.2 Masking

For audio coding, the use of sound perception in critical bands and absolute threshold only is not sufficient for high compression rates. The bases for further data reduction are the masking effects investigated by Zwicker [Zwi82, Zwi90]. For band‐limited noise or a sinusoidal signal, frequency‐dependent masking thresholds can be given. These thresholds perform masking of frequency components if these components are below a masking threshold (see Fig. 9.6). The application of masking for perceptual coding is described in the following.


Figure 9.6 Masking threshold of band‐limited noise.

Calculation of Signal Power in Band $i$. First, the sound pressure level within a critical band is calculated. The short-time spectrum $X(k) = \mathrm{DFT}[x(n)]$ is used to calculate the power density spectrum

(9.4) $S_p(e^{j\Omega}) = S_p\!\left(e^{j2\pi k/N}\right) = X_R^2\!\left(e^{j2\pi k/N}\right) + X_I^2\!\left(e^{j2\pi k/N}\right)$
(9.5) $S_p(k) = X_R^2(k) + X_I^2(k), \qquad 0 \le k \le N - 1,$

with the help of an $N$-point FFT. The signal power in band $i$ is calculated by the sum

(9.6) $S_p(i) = \sum_{\Omega = \Omega_{l_i}}^{\Omega_{u_i}} S_p(k)$

from the lower frequency $\Omega_{l_i}$ up to the upper frequency $\Omega_{u_i}$ of critical band $i$. The sound pressure level in band $i$ is given by $L_S(i) = 10\log_{10} S_p(i)$.
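Equations (9.4)-(9.6) can be sketched as follows. The band edges and the test tone are illustrative; a real coder would iterate over all 25 critical bands of Table 9.1.

```python
# Per-band signal power from an N-point FFT, Eqs. (9.4)-(9.6).
import numpy as np

def band_levels_db(x, band_edges_hz, fs):
    """Sound pressure level L_S(i) = 10 log10 S_p(i) for each band
    (f_l, f_u) in band_edges_hz, from the power density spectrum
    S_p(k) = X_R^2(k) + X_I^2(k) of one block x."""
    N = len(x)
    X = np.fft.fft(x)
    Sp = X.real ** 2 + X.imag ** 2            # Eq. (9.5)
    freqs = np.arange(N) * fs / N
    levels = []
    for f_l, f_u in band_edges_hz:
        mask = (freqs >= f_l) & (freqs < f_u)
        power = Sp[mask].sum() + 1e-12        # Eq. (9.6), guarded against log(0)
        levels.append(10 * np.log10(power))
    return levels

fs, N = 44100, 512
n = np.arange(N)
x = np.sin(2 * np.pi * 1000 * n / fs)  # 1-kHz tone lies in band 8 (920-1080 Hz)
print(band_levels_db(x, [(920, 1080), (1720, 2000)], fs))
```

The band containing the tone shows a much higher level than a distant band, which only picks up spectral leakage.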

Absolute Threshold. The absolute threshold is set such that a 4-kHz signal with peak amplitude $\pm 1$ LSB for a 16-bit representation lies at the lower limit of the absolute threshold curve. Every masking threshold calculated in individual critical bands which lies below the absolute threshold is set to a value equal to the absolute threshold in the corresponding band. Because the absolute threshold within a critical band varies for low and high frequencies, it is necessary to make use of the mean absolute threshold within a band.

Masking Threshold. The offset between the signal level and the masking threshold in critical band $i$ (see Fig. 9.7) is given by [Hel72]

(9.7) $\dfrac{O(i)}{\mathrm{dB}} = \alpha\,(14.5 + i) + (1 - \alpha)\,a_v,$

Figure 9.7 Offset between signal level and masking threshold.

where $\alpha$ denotes the tonality index and $a_v$ the masking index. The masking index [Kap92] is given by

(9.8) $a_v = -2 - 2.05\arctan\left(\dfrac{f}{4\,\mathrm{kHz}}\right) - 0.75\arctan\left(\dfrac{f^2}{2.56\,\mathrm{kHz}^2}\right).$

As an approximation,

(9.9) $\dfrac{O(i)}{\mathrm{dB}} = \alpha\,(14.5 + i) + (1 - \alpha)\,5.5$

can be used [Joh88a, Joh88b]. If a tone is masking a noise-like signal ($\alpha = 1$), the threshold is set $14.5 + i$ dB below the value of $L_S(i)$. If a noise-like signal is masking a tone ($\alpha = 0$), the threshold is set 5.5 dB below $L_S(i)$. To decide whether a signal is tonal or noise-like within a certain number of samples, the spectral flatness measure (SFM) is estimated. The SFM is defined as the ratio of the geometric to the arithmetic mean value of the power density spectrum according to

(9.10) $\mathrm{SFM} = 10\log_{10}\dfrac{\left[\prod_{k=1}^{N/2} S_p\!\left(e^{j2\pi k/N}\right)\right]^{1/(N/2)}}{\dfrac{1}{N/2}\sum_{k=1}^{N/2} S_p\!\left(e^{j2\pi k/N}\right)}.$

The SFM is compared with the SFM of a sinusoidal signal (defined as $\mathrm{SFM}_{\max} = -60$ dB) and the tonality index is calculated [Joh88a, Joh88b] by

(9.11) $\alpha = \min\left(\dfrac{\mathrm{SFM}}{\mathrm{SFM}_{\max}},\ 1\right).$

An SFM of 0 dB corresponds to a noise-like signal and leads to $\alpha = 0$, whereas an SFM of $-75$ dB gives a tone-like signal ($\alpha = 1$). With the sound pressure level $L_S(i)$ and the offset $O(i)$, the masking threshold is given by

(9.12) $T(i) = 10^{[L_S(i) - O(i)]/10}.$
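The chain from spectral flatness to tonality to offset, Eqs. (9.9)-(9.11), can be sketched as follows. The two synthetic spectra are illustrative extremes; real spectra fall between them.

```python
# Tonality from the spectral flatness measure, Eqs. (9.10)-(9.11),
# and the resulting offset O(i), Eq. (9.9).
import numpy as np

def tonality_index(Sp, sfm_max_db=-60.0):
    """alpha = min(SFM/SFM_max, 1), with SFM the dB ratio of geometric
    to arithmetic mean of the power density spectrum Sp."""
    Sp = np.asarray(Sp, dtype=float) + 1e-12   # guard against log(0)
    sfm_db = 10 * np.log10(np.exp(np.mean(np.log(Sp))) / np.mean(Sp))
    return min(sfm_db / sfm_max_db, 1.0)

def offset_db(alpha, i):
    """Offset between signal level and masking threshold, Eq. (9.9)."""
    return alpha * (14.5 + i) + (1.0 - alpha) * 5.5

noise = np.ones(256)                   # flat spectrum -> SFM = 0 dB -> alpha = 0
tone = np.zeros(256); tone[40] = 1e6   # single line -> strongly negative SFM
print(tonality_index(noise), tonality_index(tone))
print(offset_db(tonality_index(tone), i=8))  # tonal masker in band 8
```

A tonal masker in band 8 yields an offset of $14.5 + 8 = 22.5$ dB, i.e., the masking threshold lies 22.5 dB below the masker level.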

Masking Across Critical Bands. Masking across critical bands can be carried out with the help of the Bark scale. The masking threshold has a triangular form which decreases with $S_1$ dB per Bark on the lower slope and with $S_2$ dB per Bark on the upper slope, depending on the sound pressure level $L_S(i)$ and the center frequency $f_{c_i}$ of band $i$ (see [Ter79]), according to

(9.13) $S_1 = 27\ \mathrm{dB/Bark},$
(9.14) $S_2 = \left(24 + 0.23\left(\dfrac{f_{c_i}}{\mathrm{kHz}}\right)^{-1} - 0.2\,\dfrac{L_S(i)}{\mathrm{dB}}\right)\mathrm{dB/Bark}.$

An approximation of the minimum masking within a critical band can be made using Fig. 9.8 [Thei88, Sauv90]. Masking at the upper frequency $f_{u_i}$ of critical band $i$ masks the quantization noise by approximately 32 dB when the lower masking threshold, which decreases by 27 dB/Bark, is used. The upper slope has a steepness which depends on the sound pressure level and is lower than the steepness of the lower slope. Masking across critical bands is presented in Fig. 9.9. The masking signal in critical band $i - 1$ contributes to masking the quantization noise in critical band $i$, in addition to the masking signal in critical band $i$ itself. This kind of masking across critical bands further reduces the required number of quantization steps within critical bands.


Figure 9.8 Masking within a critical band. Based on [Thei88] and [Sauv90].


Figure 9.9 Masking across critical bands.

An analytical expression for masking across critical bands [Schr79] is given by

(9.15) $10\log_{10} B(\Delta i) = 15.81 + 7.5\,(\Delta i + 0.474) - 17.5\sqrt{1 + (\Delta i + 0.474)^2}\ \ \mathrm{dB}.$

Here, $\Delta i$ denotes the distance between two critical bands in Bark. Expression (9.15) is called the spreading function. With the help of this spreading function, the masking of critical band $i$ by critical band $j$ can be calculated [Joh88a, Joh88b] for $\lvert i - j\rvert \le 25$ such that

(9.16) $S_m(i) = \sum_{j=0}^{24} B(i - j)\,S_p(j).$


Figure 9.10 Stepwise calculation of psychoacoustic model.

The masking across critical bands can therefore be expressed as a matrix operation given by

(9.17) $\begin{bmatrix} S_m(0) \\ S_m(1) \\ \vdots \\ S_m(24) \end{bmatrix} = \begin{bmatrix} B(0) & B(-1) & B(-2) & \cdots & B(-24) \\ B(1) & B(0) & B(-1) & \cdots & B(-23) \\ \vdots & \vdots & \vdots & & \vdots \\ B(24) & B(23) & B(22) & \cdots & B(0) \end{bmatrix} \begin{bmatrix} S_p(0) \\ S_p(1) \\ \vdots \\ S_p(24) \end{bmatrix}.$

A renewed calculation of the masking threshold with Eq. (9.16) leads to the global masking threshold

(9.18) $T_m(i) = 10^{\log_{10} S_m(i) - O(i)/10}.$
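The matrix operation above can be sketched directly, assuming the Schroeder spreading function [Schr79] for $B(\Delta i)$. The single-masker test vector is illustrative.

```python
# Masking across critical bands as a matrix product, Eq. (9.17),
# with the spreading function of [Schr79] on a linear power scale.
import numpy as np

def spreading_db(delta_i):
    """Spreading function in dB for a masker-maskee distance of delta_i Bark."""
    return (15.81 + 7.5 * (delta_i + 0.474)
            - 17.5 * np.sqrt(1.0 + (delta_i + 0.474) ** 2))

def spread_band_powers(Sp):
    """S_m = B * S_p over the 25 critical bands, with matrix entries
    B(i - j) converted from dB to linear power."""
    n = len(Sp)
    idx = np.arange(n)
    B = 10.0 ** (spreading_db(idx[:, None] - idx[None, :]) / 10.0)
    return B @ np.asarray(Sp, dtype=float)

Sp = np.zeros(25)
Sp[10] = 1.0                  # a single masker in band 10
Sm = spread_band_powers(Sp)
print(Sm[9], Sm[10], Sm[11])  # energy spreads into the neighboring bands
```

The asymmetry of the spreading function is visible in the result: the band above the masker receives more spread energy than the band below it, matching the shallower upper slope of the masking threshold.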

To clarify the individual steps of psychoacoustic-based audio coding, we summarize the operations with exemplary analysis results.

  • Calculation of the signal power $S_p(i)$ in critical bands → $L_S(i)$ in dB (Fig. 9.10a).
  • Calculation of masking across critical bands $T_m(i)$ → $L_{T_m}(i)$ in dB (Fig. 9.10b).
  • Masking with tonality index → $L_{T_m}(i)$ in dB (Fig. 9.10c).
  • Calculation of the global masking threshold with respect to the threshold in quiet $L_{T_q}$ → $L_{T_{m,\mathrm{abs}}}(i)$ in dB (Fig. 9.10d).

With the help of the global masking threshold $L_{T_{m,\mathrm{abs}}}(i)$, we calculate the signal-to-mask ratio (SMR)

(9.19) $\mathrm{SMR}(i) = L_S(i) - L_{T_{m,\mathrm{abs}}}(i)$ in dB

per Bark band. This SMR defines the necessary number of bits per critical band, such that masking of the quantization noise is achieved. For the given example, the signal power and the global masking threshold are shown in Fig. 9.11a. The resulting signal-to-mask ratio $\mathrm{SMR}(i)$ is shown in Fig. 9.11b. As soon as $\mathrm{SMR}(i) > 0$, one has to allocate bits to critical band $i$. For $\mathrm{SMR}(i) < 0$, the corresponding critical band will not be transmitted. Figure 9.12 shows the masking thresholds in critical bands for a sinusoid of 440 Hz. Compared to the first example, the influence of masking thresholds across critical bands is easier to observe and to interpret.
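The step from SMR to a bit count can be sketched with the common rule of thumb that each quantizer bit buys about 6 dB of signal-to-noise ratio; this 6-dB-per-bit heuristic is an assumption for illustration, not the standard's exact allocation procedure.

```python
# Rough bit allocation from the signal-to-mask ratio: quantization noise
# stays below the masking threshold when 6.02 dB * bits >= SMR(i).
# Bands with SMR(i) <= 0 need no bits at all.
import math

def bits_needed(smr_db):
    """Minimum bits per sample so that 6.02 * bits covers the SMR."""
    if smr_db <= 0.0:
        return 0
    return math.ceil(smr_db / 6.02)

for smr in (-3.0, 10.0, 40.0):
    print(smr, bits_needed(smr))
```

In the standard, the actual allocation is iterative and also respects the total bit-rate budget across all sub-bands; the sketch only captures the per-band masking condition.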


Figure 9.11 Calculation of the signal‐to‐mask ratio SMR.


Figure 9.12 Calculation of psychoacoustic model for a pure sinusoid with 440 Hz.

9.4 ISO‐MPEG1 Audio Coding

In this section, the coding method for digital audio signals specified in the standard ISO/IEC 11172-3 [ISO92] is described. The filter banks used for sub-band decomposition, the psychoacoustic models, dynamic bit allocation, and coding are discussed. A simplified block diagram of the coder for implementing layers I and II of the standard is shown in Fig. 9.13. The corresponding decoder is shown in Fig. 9.14. It uses the information from the ISO-MPEG1 frame and feeds the decoded sub-band signals to a synthesis filter bank for reconstructing the broadband PCM signal. The complexity of the decoder is significantly lower than that of the coder. Prospective improvements of the coding method are therefore made entirely on the coder side.


Figure 9.13 Simplified block diagram of an ISO‐MPEG1 coder.


Figure 9.14 Simplified block diagram of an ISO‐MPEG1 decoder.

9.4.1 Filter Banks

The sub-band decomposition is done with a pseudo-quadrature mirror filter (QMF) bank (see Fig. 9.15). The theoretical background is found in the related literature [Rot83, Mas85, Vai93]. The broadband signal is decomposed into $M$ uniformly spaced sub-bands. The sub-band signals are processed further after a sampling rate reduction by a factor of $M$. The implementation of an ISO-MPEG1 coder is based on $M = 32$ frequency bands. The individual bandpass filters $H_0(z) \ldots H_{M-1}(z)$ are designed as frequency-shifted versions of a prototype lowpass filter $H(z)$. The frequency shifting of the prototype with cutoff frequency $\pi/2M$ is done by modulating the impulse response $h(n)$ with a cosine term [Bos02] according to

(9.20) $h_k(n) = h(n)\cdot\cos\left(\dfrac{\pi}{32}\,(k + 0.5)(n - 16)\right),$
(9.21) $f_k(n) = 32\cdot h(n)\cdot\cos\left(\dfrac{\pi}{32}\,(k + 0.5)(n + 16)\right),$

with $k = 0,\ldots,31$ and $n = 0,\ldots,511$. The bandpass filters have bandwidth $\pi/M$. For the synthesis filter bank, the corresponding filters $F_0(z) \ldots F_{M-1}(z)$ give outputs which are added together, resulting in a broadband PCM signal. The prototype impulse response with 512 taps, the modulated bandpass impulse responses, and the corresponding magnitude responses are shown in Fig. 9.16. The magnitude responses of all 32 bandpass filters are also shown. The overlap of neighboring bandpass filters is limited to the directly adjacent bands and reaches up to the center frequencies of the neighboring bands. The resulting aliasing after downsampling in each sub-band is canceled in the synthesis filter bank. The pseudo-QMF bank can be implemented by the combination of a polyphase filter structure followed by a discrete cosine transform [Rot83, Vai93, Kon94].
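The cosine modulation of Eqs. (9.20) and (9.21) can be sketched as follows. The windowed-sinc prototype below is only a stand-in: the standard tabulates its own 512-tap prototype coefficients, which this sketch does not reproduce.

```python
# Cosine modulation of a 512-tap prototype lowpass into M = 32 bandpass
# filters, Eqs. (9.20)/(9.21). The prototype here is an assumed
# windowed-sinc with cutoff pi/(2M), not the standard's tabulated window.
import numpy as np

M, L = 32, 512
n = np.arange(L)
h = np.sinc((n - (L - 1) / 2) / (2 * M)) * np.hanning(L)
h /= h.sum()  # unity DC gain for the prototype

k = np.arange(M)[:, None]  # sub-band index as a column for broadcasting
h_k = h * np.cos(np.pi / 32 * (k + 0.5) * (n - 16))       # analysis filters
f_k = 32 * h * np.cos(np.pi / 32 * (k + 0.5) * (n + 16))  # synthesis filters
print(h_k.shape)  # (32, 512): one bandpass impulse response per sub-band
```

Each row of `h_k` is one bandpass impulse response centered at $(k + 0.5)\pi/32$, matching the uniform band spacing of the pseudo-QMF bank.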


Figure 9.15 Pseudo‐QMF bank.


Figure 9.16 Impulse responses and magnitude responses of pseudo‐QMF bank.

For increasing the frequency resolution, layer III of the standard decomposes each of the 32 sub-bands further into a maximum of 18 uniformly spaced sub-bands (see Fig. 9.17). The decomposition is carried out with the help of an overlapped transform of windowed sub-band samples. The method is based on a modified discrete cosine transform (MDCT), also known as the TDAC (time domain aliasing cancellation) filter bank or MLT (modulated lapped transform). An exact description is found in [Pri87, Mal92]. This extended filter bank is denoted as the polyphase/MDCT hybrid filter bank [Bra94]. The higher frequency resolution enables a higher coding gain but has the disadvantage of a worse time resolution, which becomes apparent for impulse-like signals. To minimize these artifacts, the number of MDCT sub-bands within each of the 32 sub-bands can be switched from 18 down to 6. Sub-band decompositions that are matched to the signal can be obtained by specially designed window functions with overlapping transforms [Edl89, Edl95]. The equivalence of overlapped transforms and filter banks is shown in [Mal92, Glu93, Vai93, Edl95, Vet95].


Figure 9.17 Polyphase/MDCT hybrid filter bank.

9.4.2 Psychoacoustic Models

Two psychoacoustic models have been developed for layers I to III of the ISO‐MPEG1 standard. Both models can be used independently of each other for all three layers. Psychoacoustic model 1 is used for layers I and II whereas model 2 is used for layer III. Owing to the numerous applications of layers I and II, we will discuss psychoacoustic model 1 in the following.

Psychoacoustic Model 1. Bit allocation in each of the 32 sub-bands is carried out using the signal-to-mask ratio $\mathrm{SMR}(i)$. This is based on the minimum masking threshold and the maximum signal level within a sub-band. To calculate this ratio, the power density spectrum is estimated with the help of a short-time FFT in parallel with the analysis filter bank. As a consequence, the power density spectrum is estimated with a higher frequency resolution than that provided by the 32-band analysis filter bank. The signal-to-mask ratio for every sub-band is determined as follows.

  1. Calculating the power density spectrum of a block of $N$ samples using the FFT. After windowing a block of $N = 512$ ($N = 1024$ for layer II) input samples, the power density spectrum
    (9.22) $X(k) = 10\log_{10}\left|\dfrac{1}{N}\sum_{n=0}^{N-1} h(n)\,x(n)\,e^{-jnk2\pi/N}\right|^2$ in dB
    is calculated. After this, the window $h(n)$ is displaced by 384 ($12\cdot 32$) samples and the next block is processed.
  2. Determination of the sound pressure level in every sub-band. The sound pressure level is derived from the calculated power density spectrum and from a scaling factor in the corresponding sub-band, as given by
    (9.23) $L_S(i) = \max\left[X(k),\ 20\log_{10}\left(\mathrm{SCF}_{\max}(i)\cdot 32768\right) - 10\right]$ in dB.

    For $X(k)$, the maximum of the spectral lines in a sub-band is used. The scaling factor $\mathrm{SCF}(i)$ for sub-band $i$ is calculated from the absolute value of the maximum of 12 consecutive sub-band samples. A nonlinear quantization to 64 levels is carried out (layer I). For layer II, the sound pressure level is determined by choosing the largest of the three scaling factors from $3\cdot 12$ sub-band samples.

  3. Considering the absolute threshold. The absolute threshold $LT_q(m)$ is specified for different sampling rates in [ISO92]. The frequency index $m$ is based on a reduction of the $N/2$ relevant frequencies of the FFT with index $k$ (see Fig. 9.18). The sub‐band index is still $i$.

    Figure 9.18 Nomenclature of frequency indices.

  4. Calculating tonal $X_{tm}(k)$ or non‐tonal $X_{nm}(k)$ masking components and determining the relevant masking components (for details, see [ISO92]). These masking components are denoted as $X_{tm}[z(j)]$ and $X_{nm}[z(j)]$, where the index $j$ labels the tonal and non‐tonal masking components. The variable $z(m)$ is listed for the reduced frequency indices $m$ in [ISO92]. It allows a finer resolution of the 24 critical bands with the frequency group index $z$.
  5. Calculating the individual masking thresholds. For the tonal and non‐tonal masking components $X_{tm}[z(j)]$ and $X_{nm}[z(j)]$, the masking thresholds are calculated as
    (9.24) $LT_{tm}[z(j), z(m)] = X_{tm}[z(j)] + a_{v_{tm}}[z(j)] + v_f[z(j), z(m)]$ in dB,
    (9.25) $LT_{nm}[z(j), z(m)] = X_{nm}[z(j)] + a_{v_{nm}}[z(j)] + v_f[z(j), z(m)]$ in dB.

    The masking index for tonal masking components is given by

    (9.26) $a_{v_{tm}} = -1.525 - 0.275 \cdot z(j) - 4.5$ in dB

    and the masking index for non‐tonal masking components is

    (9.27) $a_{v_{nm}} = -1.525 - 0.175 \cdot z(j) - 0.5$ in dB.

    The masking function $v_f[z(j), z(m)]$ with the distance $\Delta z = z(m) - z(j)$ is given by

    $$v_f = \begin{cases} 17 \cdot (\Delta z + 1) - (0.4 \cdot X[z(j)] + 6), & -3 \le \Delta z < -1, \\ (0.4 \cdot X[z(j)] + 6) \cdot \Delta z, & -1 \le \Delta z < 0, \\ -17 \cdot \Delta z, & 0 \le \Delta z < 1, \\ -(\Delta z - 1) \cdot (17 - 0.15 \cdot X[z(j)]) - 17, & 1 \le \Delta z < 8, \end{cases}$$

    in dB, with $\Delta z$ in Bark.

    This masking function v Subscript f Baseline left-bracket z left-parenthesis j right-parenthesis comma z left-parenthesis m right-parenthesis right-bracket describes the masking of the frequency index z left-parenthesis m right-parenthesis by the masking component z left-parenthesis j right-parenthesis.

  6. Calculating the global masking threshold. For frequency index $m$, the global masking threshold is calculated as the sum of all contributing masking components according to
    (9.28) $LT_g(m) = 10 \log_{10}\left[10^{LT_q(m)/10} + \sum_{j=1}^{T_m} 10^{LT_{tm}[z(j), z(m)]/10} + \sum_{j=1}^{R_m} 10^{LT_{nm}[z(j), z(m)]/10}\right]$ in dB.

    The total numbers of tonal and non‐tonal masking components are denoted as $T_m$ and $R_m$, respectively. For a given sub‐band $i$, only masking components that lie in the range $-8$ to $+3$ Bark are considered. Masking components outside this range are neglected.

  7. Determination of the minimum masking threshold in every sub‐band:
    (9.29) $LT_{min}(i) = \mathrm{MIN}\left[LT_g(m)\right]$ in dB.

    Several masking thresholds $LT_g(m)$ can occur in a sub‐band, as long as $m$ lies within the sub‐band $i$.

  8. Calculation of the signal‐to‐mask ratio $SMR(i)$ in every sub‐band:
    (9.30) $SMR(i) = L_S(i) - LT_{min}(i)$ in dB.

The signal‐to‐mask ratio determines the dynamic range that has to be quantized in the particular sub‐band so that the level of quantization noise lies below the masking threshold. The signal‐to‐mask ratio is the basis for the bit allocation procedure for quantizing the sub‐band signals.
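The steps above can be condensed into a small numerical sketch. The following Python fragment is illustrative only: the normative windows, tables, and thresholds are those of [ISO92], and the function names are chosen here for the example. It implements the power density spectrum of Eq. (9.22) with a Hann window, the piecewise masking function of step 5, and the final signal‐to‐mask ratio of Eq. (9.30):

```python
import numpy as np

def spl_spectrum(x, N=512):
    """Power density spectrum in dB of one windowed block (cf. Eq. 9.22)."""
    h = np.hanning(N)                       # window h(n); [ISO92] defines the exact window
    X = np.fft.fft(h * x[:N]) / N
    return 10 * np.log10(np.abs(X[:N // 2]) ** 2 + 1e-12)

def masking_function(dz, Xj):
    """Spreading function v_f in dB for the distance dz = z(m) - z(j) in Bark
    and the masker level Xj = X[z(j)] in dB (piecewise definition of step 5)."""
    if -3 <= dz < -1:
        return 17 * (dz + 1) - (0.4 * Xj + 6)
    if -1 <= dz < 0:
        return (0.4 * Xj + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz < 8:
        return -(dz - 1) * (17 - 0.15 * Xj) - 17
    return -np.inf                          # outside the defined range: no contribution

def smr(levels, min_thresholds):
    """Eq. (9.30): SMR(i) = L_S(i) - LT_min(i) per sub-band."""
    return np.asarray(levels) - np.asarray(min_thresholds)
```

For instance, `masking_function(0.5, 60.0)` evaluates the third branch to $-8.5$ dB.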

9.4.3 Dynamic Bit Allocation and Coding

Dynamic Bit Allocation. Dynamic bit allocation determines the number of bits that are necessary for the individual sub‐bands so that transparent perception is possible. The minimum number of bits in sub‐band $i$ can be determined from the difference between the scaling factor $SCF(i)$ and the absolute threshold $LT_q(i)$ as $b(i) = SCF(i) - LT_q(i)$. With this, the quantization noise remains below the masking threshold. Masking across critical bands is used for the implementation of the ISO‐MPEG1 coding method.

For a given transmission rate, the maximum possible number of bits $B_m$ for coding the sub‐band signals and scaling factors is calculated as

(9.31) $B_m = \sum_{i=1}^{32} \left[b(i) + SCF(i)\right] + \text{additional information}.$

The bit allocation is performed within an allocation frame consisting of 12 sub‐band samples ($384 = 12 \cdot 32$ PCM samples) for layer I and of 36 sub‐band samples ($1152 = 36 \cdot 32$ PCM samples) for layer II.

The dynamic bit allocation for the sub‐band signals is carried out as an iterative procedure. At the beginning, the number of bits per sub‐band is set to zero. First, the mask‐to‐noise ratio,

(9.32) $MNR(i) = SNR(i) - SMR(i),$

is determined for every sub‐band. The signal‐to‐mask ratio $SMR(i)$ is the result of the psychoacoustic model. The signal‐to‐noise ratio $SNR(i)$ is defined by a table in [ISO92], which specifies a corresponding signal‐to‐noise ratio for every number of bits. The number of bits must be increased as long as the mask‐to‐noise ratio $MNR(i)$ is less than zero.

The iterative bit allocation is performed by the following steps.

  1. Determination of the minimum $MNR(i)$ over all sub‐bands.
  2. Increasing the number of bits of this sub‐band to the next quantizer stage of the MPEG1 standard; 6 bits are allocated for the scaling factor when the number of bits of a sub‐band is increased for the first time.
  3. Recalculation of $MNR(i)$ in this sub‐band.
  4. Calculation of the number of bits for all sub‐bands and scaling factors and comparison with the maximum number $B_m$. If the number of bits is smaller than the maximum number, the iteration starts again with step 1.
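The iteration above can be sketched as a greedy loop. In the following Python fragment, the SNR‐per‐word‐length table is a hypothetical stand‐in (the real values and quantizer stages are tabulated in [ISO92], and layer II additionally groups scaling factors), so the sketch shows the control flow only:

```python
# Hypothetical SNR values per word length; the real table is given in [ISO92].
SNR_TABLE = {0: 0.0, 2: 7.0, 3: 16.0, 4: 25.3, 5: 31.6, 6: 37.8, 7: 44.0, 8: 50.4}
BIT_STEPS = sorted(SNR_TABLE)

def allocate_bits(smr, b_max, samples_per_band=12):
    """Greedy version of steps 1-4: raise the word length in the sub-band with
    the worst mask-to-noise ratio MNR(i) = SNR(i) - SMR(i) (Eq. 9.32) until the
    bit budget b_max would be exceeded."""
    bits = [0] * len(smr)
    scf_bits = [0] * len(smr)            # 6 bits per scaling factor, once a band is active
    while True:
        mnr = [SNR_TABLE[b] - s for b, s in zip(bits, smr)]        # step 1
        i = min(range(len(smr)), key=lambda j: mnr[j])
        step = BIT_STEPS.index(bits[i])
        if step + 1 == len(BIT_STEPS):
            break                         # worst band is already at the finest stage
        new_bits = BIT_STEPS[step + 1]                             # step 2
        new_scf = 6 if bits[i] == 0 else scf_bits[i]
        total = sum(samples_per_band * b for b in bits) + sum(scf_bits)
        total += samples_per_band * (new_bits - bits[i]) + (new_scf - scf_bits[i])
        if total > b_max:                                          # step 4
            break
        bits[i], scf_bits[i] = new_bits, new_scf                   # steps 2/3
    return bits
```

The loop always improves the perceptually worst sub‐band first, so bands with a high $SMR(i)$ end up with more bits.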

Quantization and Coding of Sub‐band Signals. The quantization of the sub‐band signals is done with the allocated bits for the corresponding sub‐band. The 12 (36 for layer II) sub‐band samples are divided by the corresponding scaling factor and then linearly quantized and coded (for details, see [ISO92]). This is followed by a frame packing. In the decoder, the procedure is reversed. The decoded sub‐band signals with different word lengths are reconstructed to a broadband PCM signal with a synthesis filter bank (see Fig. 9.14). MPEG‐1 audio coding offers a one‐ or two‐channel (stereo) mode with sampling frequencies of 32, 44.1, and 48 kHz and a bit rate of 128 kbit/s per channel.
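A strongly simplified sketch of this quantization step follows. The standard's quantizers are nonlinear with odd numbers of levels per [ISO92]; a plain uniform quantizer is used here only to illustrate the divide‐by‐scaling‐factor structure:

```python
import numpy as np

def quantize_subband(samples, scf, bits):
    """Divide the sub-band samples by their scaling factor and quantize the
    result uniformly with the allocated word length (simplified sketch)."""
    normalized = np.asarray(samples) / scf          # roughly within [-1, 1)
    levels = 2 ** (bits - 1)
    return np.clip(np.round(normalized * levels), -levels, levels - 1).astype(int)

def dequantize_subband(codes, scf, bits):
    """Decoder side: undo the uniform quantization and rescale."""
    return np.asarray(codes) / 2 ** (bits - 1) * scf
```

With 8 allocated bits the reconstruction error per sample stays below half a quantization step of the normalized signal.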

9.5 MPEG‐2 Audio Coding

The aim of the introduction of MPEG‐2 audio coding was the extension of MPEG‐1 to lower sampling frequencies and to multichannel coding [Bos97]. Backward compatibility with existing MPEG‐1 systems is achieved through the version MPEG‐2 BC (Backward Compatible), and the extension toward lower sampling frequencies of 16, 22.05, and 24 kHz through the version MPEG‐2 LSF (Lower Sampling Frequencies). The bit rate for a five‐channel MPEG‐2 BC coding with full bandwidth of all channels is 640–896 kbit/s.

9.6 MPEG‐2 Advanced Audio Coding

To improve the coding of mono, stereo, and multichannel audio signals, the MPEG‐2 AAC (Advanced Audio Coding) standard was specified. This coding standard is not backward compatible with the MPEG‐1 standard and forms the kernel for newer extended coding standards such as MPEG‐4. The achievable bit rate for a five‐channel coding is 320 kbit/s. In the following, the main signal processing steps of MPEG‐2 AAC will be introduced and their principal functionalities explained. An extensive explanation can be found in [Bos97, Bra98, Bos02]. An MPEG‐2 AAC coder is shown in Fig. 9.19. The corresponding decoder performs the functional units in reverse order with corresponding decoder functionalities.


Figure 9.19 MPEG‐2 AAC coder and decoder.

Pre‐processing: the input signal will be band limited according to the sampling frequency. This step is used only in the scalable sampling rate profile [Bos97, Bra98, Bos02].

Filter bank: the time‐frequency decomposition into $M = 1024$ sub‐bands with an overlapped MDCT [Pri86, Pri87] is based on blocks of $N = 2048$ input samples. A step‐by‐step explanation of the implementation is given below; a graphical representation of the steps is depicted in Fig. 9.20. The single steps are as follows.

  1. Partitioning of the input signal $x(n)$ with time index $n$ into overlapped blocks
    (9.33) $x_m(r) = x(mM + r), \quad r = 0, \ldots, N-1; \quad -\infty < m < \infty,$
    of length $N$ with an overlap (hop size) of $M = \frac{N}{2}$. The time index inside a block is denoted by $r$, and $m$ denotes the block index.
  2. Windowing of the blocks with the window function $w(r)$: $x_m(r) \rightarrow x_m(r) \cdot w(r)$.
  3. MDCT
    (9.34) $X(m, k) = \sqrt{\frac{2}{M}} \sum_{r=0}^{N-1} x_m(r)\, w(r) \cos\left(\frac{\pi}{M}\left(k + \frac{1}{2}\right)\left(r + \frac{M+1}{2}\right)\right), \quad k = 0, \ldots, M-1,$
    yields, for every $M$ input samples, $M = N/2$ spectral coefficients from $N$ windowed input samples.
  4. Quantization of the spectral coefficients $X(m, k)$ based on a psychoacoustic model leads to the quantized spectral coefficients $X_Q(m, k)$.
  5. IMDCT
    (9.35) $\hat{x}_m(r) = \sqrt{\frac{2}{M}} \sum_{k=0}^{M-1} X_Q(m, k) \cos\left(\frac{\pi}{M}\left(k + \frac{1}{2}\right)\left(r + \frac{M+1}{2}\right)\right), \quad r = 0, \ldots, N-1,$
    yields, for every $M$ input samples, $N$ output samples in the block $\hat{x}_m(r)$.
  6. Windowing of the inverse transformed block $\hat{x}_m(r)$ with the window function $w(r)$.
  7. Reconstruction of the output signal $y(n)$ by the overlap‐add operation
    (9.36) $y(n) = \sum_{m=-\infty}^{\infty} \hat{x}_m(n - mM)\, w(n - mM)$
    with overlap $M$, where only the blocks with $0 \le n - mM \le N-1$ contribute.

Figure 9.20 Time‐frequency decomposition with MDCT/inverse modified discrete cosine transform (IMDCT).


Figure 9.21 Signals of MDCT/IMDCT.

To explain the procedural steps, we consider the MDCT/IMDCT of a sine pulse shown in Fig. 9.21. The left column shows, from top to bottom, the input signal and the partitions of the input signal into blocks of length $N = 256$. The window function is a sine window. The corresponding MDCT coefficients of length $M = 128$ are shown in the middle column. The IMDCT delivers the signals in the right column. One can notice that the inverse transforms with the IMDCT do not exactly reconstruct the single input blocks. Instead, each output block consists of an input block and a superposition of a time‐reversed and circularly shifted (by $M = N/2$) version of the input block, which is denoted as time‐domain aliasing [Pri86, Pri87, Edl89]. The overlap‐add operation of the single output blocks perfectly recovers the input signal, as shown in the top signal of the right column (Fig. 9.21). For a perfect reconstruction of the output signal, the window function of the analysis and synthesis steps has to fulfill the condition $w^2(r) + w^2(r + M) = 1, \; r = 0, \ldots, M-1$. The Kaiser–Bessel‐derived window [Bos02] and the sine window $h(n) = \sin\left(\left(n + \frac{1}{2}\right)\frac{\pi}{N}\right)$ with $n = 0, \ldots, N-1$ [Mal92] are applied. Figure 9.22 shows both window functions with $N = 2048$ and the corresponding magnitude responses for a sampling frequency of $f_S = 44100$ Hz. The sine window has a smaller passband width but more slowly falling side lobes. In contrast, the Kaiser–Bessel‐derived window shows a wider passband and a faster decay of the side lobes.
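The analysis/synthesis chain and the perfect reconstruction condition can be verified numerically. The following sketch is a direct, unoptimized implementation with the sine window (fast MDCT algorithms exist, and the block length is reduced here for the example); the interior of the signal is reconstructed exactly, while the first and last half blocks lack an overlap partner:

```python
import numpy as np

def sine_window(N):
    # w(n) = sin((n + 1/2) * pi / N); satisfies w^2(r) + w^2(r + M) = 1
    n = np.arange(N)
    return np.sin((n + 0.5) * np.pi / N)

def mdct(block, w):
    # Eq. (9.34): N windowed samples -> M = N/2 spectral coefficients
    N = len(block); M = N // 2
    r = np.arange(N); k = np.arange(M)[:, None]
    C = np.cos(np.pi / M * (k + 0.5) * (r + (M + 1) / 2))
    return np.sqrt(2 / M) * (C @ (block * w))

def imdct(X, w):
    # Eq. (9.35): M coefficients -> N time-aliased samples, then windowed again
    M = len(X); N = 2 * M
    r = np.arange(N)[:, None]; k = np.arange(M)
    C = np.cos(np.pi / M * (k + 0.5) * (r + (M + 1) / 2))
    return np.sqrt(2 / M) * (C @ X) * w

def mdct_analysis_synthesis(x, N=64):
    """Overlapped MDCT/IMDCT with hop size M = N/2 and overlap-add."""
    M = N // 2
    w = sine_window(N)
    y = np.zeros(len(x) + N)
    for m in range(len(x) // M - 1):
        block = x[m * M:m * M + N]
        y[m * M:m * M + N] += imdct(mdct(block, w), w)   # aliasing cancels in the overlap
    return y[:len(x)]
```

The time‐domain aliasing of each block cancels between neighboring blocks, so the overlap‐add output equals the input wherever two windowed blocks overlap.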
To demonstrate the filter bank properties and, in particular, the frequency decomposition of the MDCT, we derive the modulated bandpass impulse responses of the window function (prototype impulse response $w(n) = h(n)$) according to

(9.37) $h_k(n) = 2 \cdot h(n) \cdot \cos\left(\frac{\pi}{M}\left(k + \frac{1}{2}\right)\left(n + \frac{M+1}{2}\right)\right),$
(9.38) $k = 0, \ldots, M-1; \quad n = 0, \ldots, N-1.$

Figure 9.23 shows the normalized prototype impulse response of the sine window and the first two modulated bandpass impulse responses $h_0(n)$ and $h_1(n)$, together with the corresponding magnitude responses. In addition to the increased frequency resolution with $M = 1024$ bandpass filters, a reduced stopband attenuation can be observed. A comparison of these magnitude responses of the MDCT with the frequency resolution of the pseudo‐QMF bank with $M = 32$ in Fig. 9.16 points out the different properties of the two sub‐band decompositions.
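The modulated bandpass impulse responses of Eq. (9.37) can be generated directly; a minimal sketch with the sine window as prototype (a smaller $N$ than the 2048 of the figure is sufficient for illustration):

```python
import numpy as np

def bandpass_impulse_responses(N):
    """Eq. (9.37): cosine-modulated bandpass filters h_k(n) of the MDCT
    filter bank with the sine window as prototype impulse response."""
    M = N // 2
    n = np.arange(N)
    h = np.sin((n + 0.5) * np.pi / N)      # prototype h(n) = w(n)
    k = np.arange(M)[:, None]
    return 2 * h * np.cos(np.pi / M * (k + 0.5) * (n + (M + 1) / 2))
```

The center frequency of band $k$ grows linearly with $k$, namely $\Omega_k = \frac{\pi}{M}\left(k + \frac{1}{2}\right)$.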


Figure 9.22 Kaiser–Bessel‐derived window and sine window for $N = 2048$ and magnitude responses of the normalized window functions.


Figure 9.23 Normalized impulse responses of the sine window for $N = 2048$, modulated bandpass impulse responses, and magnitude responses.

To adjust the time and frequency resolution to the properties of an audio signal, several methods have been investigated. Signal‐adaptive audio coding based on the wavelet transform can be found in [Sin93, Ern00]. Window switching can be applied to achieve a time‐variant time–frequency resolution for MDCT and IMDCT applications. For stationary signals, a high frequency resolution and a low time resolution are necessary, which leads to long windows with $N = 2048$. Coding of the attacks of instruments needs a high time resolution (reduction of the window length to $N = 256$) and thus a reduced frequency resolution (fewer spectral coefficients). A detailed description of switching the time–frequency resolution with the MDCT/IMDCT can be found in [Edl89, Bos97, Bos02]. Examples of switching between different window functions and windows of different lengths are shown in Fig. 9.24.


Figure 9.24 Switching of window functions.

Temporal Noise Shaping. A further method for adapting the time–frequency resolution of a filter bank, here the MDCT/IMDCT, to the signal characteristic is based on linear prediction along the spectral coefficients in the frequency domain [Her96, Her99]. This method is called temporal noise shaping (TNS) and results in a weighting of the temporal envelope of the time‐domain signal. Such a weighting of the temporal envelope is demonstrated in Fig. 9.25.


Figure 9.25 Attack of castanet and spectrum.

Figure 9.25a shows a signal from a castanet attack. Making use of the discrete cosine transform (DCT, [Rao90]),

(9.39) $X^{C(2)}(k) = \sqrt{\frac{2}{N}}\, c_k \sum_{n=0}^{N-1} x(n) \cos\left(\frac{(2n+1)k\pi}{2N}\right), \quad k = 0, \ldots, N-1,$

and the IDCT,

(9.40) $x(n) = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} c_k\, X^{C(2)}(k) \cos\left(\frac{(2n+1)k\pi}{2N}\right), \quad n = 0, \ldots, N-1, \qquad \text{with } c_k = \begin{cases} 1/\sqrt{2}, & k = 0, \\ 1, & \text{otherwise}, \end{cases}$

the spectral coefficients of the DCT of this castanet attack are represented in Fig. 9.25b. After quantization of these spectral coefficients $X(k)$ to 4 bits (Fig. 9.25d) and IDCT of the quantized spectral coefficients, the time‐domain signal in Fig. 9.25c and the difference signal between input and output in Fig. 9.25e result. One can notice in the output and difference signals that the error is spread along the entire block length. This means that the error signal of the block is perceptible before the attack of the castanet happens. The time‐domain masking, the so‐called pre‐masking [Zwi90], is not sufficient. Ideally, the spreading of the error signal should follow the time‐domain envelope of the signal itself. From forward linear prediction in the time domain, it is known that the power spectral density of the error signal after coding and decoding is weighted by the envelope of the power spectral density of the input signal [Var06]. Performing a forward linear prediction along the frequency axis in the frequency domain, followed by quantization and coding, leads to an error signal in the time domain whose temporal envelope follows the time‐domain envelope of the input signal [Her96]. To point out the temporal weighting of the error signal, we consider the forward prediction in the time domain using Fig. 9.26a. For coding, the input signal $x(n)$ is predicted by an impulse response $p(n)$. The output of the predictor is subtracted from the input signal $x(n)$ and delivers the signal $d(n)$, which is then quantized to a reduced word length.
The quantized signal $d_Q(n) = x(n) * a(n) + e(n)$ is the sum of the convolution of $x(n)$ with the impulse response $a(n)$ and the additive quantization error $e(n)$. The power spectral density of the coder output is $S_{D_Q D_Q}(e^{j\Omega}) = S_{XX}(e^{j\Omega}) \cdot |A(e^{j\Omega})|^2 + S_{EE}(e^{j\Omega})$. The decoding operation performs the convolution of $d_Q(n)$ with the impulse response $h(n)$ of the system inverse to the coder. Therefore, $a(n) * h(n) = \delta(n)$ must hold and thus $H(e^{j\Omega}) = 1/A(e^{j\Omega})$.
Hereby, the output signal $y(n) = x(n) + e(n) * h(n)$ is derived, with the corresponding discrete Fourier transform $Y(k) = X(k) + E(k) \cdot H(k)$. The power spectral density of the decoder output signal is given by $S_{YY}(e^{j\Omega}) = S_{XX}(e^{j\Omega}) + S_{EE}(e^{j\Omega}) \cdot |H(e^{j\Omega})|^2$. Herein, one can notice the spectral weighting of the quantization error with the spectral envelope of the input signal, which is represented by $|H(e^{j\Omega})|$. The same kind of forward prediction will now be applied in the frequency domain to the spectral coefficients $X(k) = \mathrm{DCT}[x(n)]$ for a block of input samples $x(n)$, as shown in Fig. 9.26b.
The output of the decoder is then given by $Y(k) = X(k) + E(k) * H(k)$ with $A(k) * H(k) = \delta(k)$. Thus, the corresponding time‐domain signal is $y(n) = x(n) + e(n) \cdot h(n)$, where the temporal weighting of the quantization error with the temporal envelope of the input signal is clearly evident. The temporal envelope is represented by the absolute value $|h(n)|$ of the impulse response $h(n)$. The relation between the temporal signal envelope (absolute value of the analytic signal) and the autocorrelation function of the analytic spectrum is discussed in [Her96]. The dualities between forward linear prediction in the time and frequency domains are summarized in Table 9.2. Figure 9.27 demonstrates the operations for temporal noise shaping in the coder, where the prediction is performed along the spectral coefficients. The coefficients of the forward predictor have to be transmitted to the decoder, where the inverse filtering is performed along the spectral coefficients.
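The coder/decoder duality can be checked with a small experiment: predict along a block of spectral coefficients, keep the residual, and invert the filtering in the decoder. The sketch below uses a simple autocorrelation‐method predictor instead of the Burg method quoted in the text, and the helper names are chosen for the example:

```python
import numpy as np

def lpc_coeffs(X, order):
    """Forward predictor from the normal equations (autocorrelation method);
    the text uses the Burg method instead."""
    r = np.correlate(X, X, mode='full')[len(X) - 1:len(X) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])

def tns_encode(X, p):
    """Prediction error along frequency: D(k) = X(k) - sum_i p_i X(k-1-i)."""
    D = np.copy(X)
    for k in range(len(X)):
        for i, pi in enumerate(p):
            if k - 1 - i >= 0:
                D[k] -= pi * X[k - 1 - i]
    return D

def tns_decode(D, p):
    """Inverse (all-pole) filtering along the spectral coefficients."""
    X = np.zeros_like(D)
    for k in range(len(D)):
        X[k] = D[k] + sum(pi * X[k - 1 - i]
                          for i, pi in enumerate(p) if k - 1 - i >= 0)
    return X
```

Without quantization of the residual, the decoder recovers the spectral coefficients exactly; for a smooth coefficient sequence (the spectrum of an impulse‐like signal), the residual power is far below the coefficient power, which is the prediction gain exploited by TNS.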


Figure 9.26 Forward prediction in time and frequency domains.


Figure 9.27 Temporal noise shaping with forward prediction in the frequency domain.

Table 9.2 Forward prediction in the time and frequency domains.

Prediction in time domain               Prediction in frequency domain
$y(n) = x(n) + e(n) * h(n)$             $y(n) = x(n) + e(n) \cdot h(n)$
$Y(k) = X(k) + E(k) \cdot H(k)$         $Y(k) = X(k) + E(k) * H(k)$

The temporal weighting is finally demonstrated in Fig. 9.28, where the corresponding signals for the forward prediction in the frequency domain are shown. Figure 9.28a/b shows the castanet signal $x(n)$ and the corresponding spectral coefficients $X(k)$ of the applied DCT. The forward prediction delivers $D(k)$ in Fig. 9.28d and the quantized signal $D_Q(k)$ in Fig. 9.28f. After the decoder, the signal $Y(k)$ in Fig. 9.28h is reconstructed by the inverse transfer function. The IDCT of $Y(k)$ finally results in the output signal $y(n)$ in Fig. 9.28e. The difference signal $x(n) - y(n)$ in Fig. 9.28g demonstrates the temporal weighting of the error signal with the temporal envelope from Fig. 9.28c. For this example, the order of the predictor is chosen to be 20 [Bos97] and the prediction along the spectral coefficients $X(k)$ is performed by the Burg method. The prediction gain for this signal in the frequency domain is $G_p = 16$ dB (see Fig. 9.28d).
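The prediction along spectral coefficients can be sketched as follows. This is a minimal illustration, not the MPEG-2 AAC TNS implementation: it uses a plain Burg recursion and a type-II DCT from SciPy, the order of 20 follows the example above, and the synthetic transient signal is an assumption for demonstration.

```python
import numpy as np
from scipy.fft import dct

def burg_lpc(x, order):
    """Estimate the prediction-error filter a = [1, a1, ..., ap] with the Burg method."""
    f = x.astype(float).copy()   # forward prediction errors
    b = f.copy()                 # backward prediction errors
    a = np.array([1.0])
    for _ in range(order):
        ff, bb = f[1:], b[:-1]
        k = -2.0 * np.dot(ff, bb) / (np.dot(ff, ff) + np.dot(bb, bb))
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([a, [0.0]])[::-1]
        f, b = ff + k * bb, bb + k * ff
    return a

# Synthetic castanet-like transient: a noise burst with a sharp attack envelope
rng = np.random.default_rng(0)
n = np.arange(1024)
env = np.where(n < 256, 0.0, np.exp(-(n - 256) / 100.0))
x = env * rng.standard_normal(1024)

X = dct(x, type=2, norm='ortho')      # spectral coefficients X(k)
a = burg_lpc(X, order=20)             # predict along the frequency axis
D = np.convolve(X, a)[:len(X)]        # prediction error D(k), analysis filter over k
Gp = 10 * np.log10(np.var(X) / np.var(D))
print(f"prediction gain along frequency: {Gp:.1f} dB")
```

Because the transient concentrates the signal energy in a short time span, the spectral coefficients have a smooth envelope along $k$, which the predictor exploits; the printed prediction gain is positive for such signals.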

Schematic illustration of temporal noise shaping: attack of castanet and spectrum.

Figure 9.28 Temporal noise shaping: attack of castanet and spectrum.

Frequency Domain Prediction. A further compression of the bandpass signals is possible by using linear prediction. A backward prediction [Var06] of the bandpass signals is applied on the coder side (see Fig. 9.29). With backward prediction, the predictor coefficients need not be coded and transmitted to the decoder, because the estimate of the input sample is based on the quantized signal. The decoder derives the predictor coefficients $p(n)$ in the same way from the quantized input. A second-order predictor is sufficient, because the bandwidth of the bandpass signals is very low [Bos97].
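A minimal sketch of backward-adaptive prediction follows. The second-order NLMS update on the reconstructed signal is an assumed adaptation rule chosen for illustration (not necessarily the one in [Var06]); the point is that coder and decoder run the identical update on identical reconstructed samples, so no coefficients need to be transmitted.

```python
import numpy as np

def backward_predictive_codec(x, delta=0.05, mu=0.1, order=2):
    """Encode with backward-adaptive prediction; return the quantized residual
    symbols and the decoder's reconstruction (which mirrors the coder exactly)."""
    w_enc = np.zeros(order)          # coder predictor coefficients
    w_dec = np.zeros(order)          # decoder predictor coefficients (same update)
    hist_enc = np.zeros(order)       # past *reconstructed* samples at the coder
    hist_dec = np.zeros(order)
    symbols, y = [], np.zeros(len(x))
    for n, xn in enumerate(x):
        # --- coder ---
        xhat = w_enc @ hist_enc
        q = int(round((xn - xhat) / delta))      # uniform residual quantizer
        symbols.append(q)
        recon = xhat + q * delta
        w_enc += mu * (recon - xhat) * hist_enc / (hist_enc @ hist_enc + 1e-9)
        hist_enc = np.concatenate([[recon], hist_enc[:-1]])
        # --- decoder (derives the same coefficients from the quantized signal) ---
        xhat_d = w_dec @ hist_dec
        y[n] = xhat_d + q * delta
        w_dec += mu * (y[n] - xhat_d) * hist_dec / (hist_dec @ hist_dec + 1e-9)
        hist_dec = np.concatenate([[y[n]], hist_dec[:-1]])
    return np.array(symbols), y

x = np.sin(2 * np.pi * 0.01 * np.arange(500))    # narrowband test signal
symbols, y = backward_predictive_codec(x)
```

Since the decoder state stays bit-identical to the coder state, the reconstruction error is bounded by half the quantizer step.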

Schematic illustration of backward prediction of bandpass signals.

Figure 9.29 Backward prediction of bandpass signals.

Mono/Side Coding. Coding of stereo signals with left and right signals $x_L(n)$ and $x_R(n)$ can be achieved by coding a mono signal (M) $x_M(n) = (x_L(n) + x_R(n))/2$ and a side (S, difference) signal $x_S(n) = (x_L(n) - x_R(n))/2$ (M/S coding). Because the power of the side signal is reduced for highly correlated left and right signals, a reduction of the bit rate for this signal can be achieved. The decoder can reconstruct the left signal $x_L(n) = x_M(n) + x_S(n)$ and the right signal $x_R(n) = x_M(n) - x_S(n)$, if no quantization and coding is applied to the mono and side signals. This M/S coding is carried out for MPEG-2 AAC [Bra98, Bos02] with the spectral coefficients of a stereo signal (see Fig. 9.30).
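The M/S matrixing is a two-line operation; this sketch verifies the perfect reconstruction and the power reduction of the side signal for correlated channels (the test signals are assumptions for illustration):

```python
import numpy as np

def ms_encode(xl, xr):
    """M/S matrixing: mono (mid) and side signals."""
    return (xl + xr) / 2, (xl - xr) / 2

def ms_decode(xm, xs):
    """Inverse matrixing; lossless if M and S are not quantized."""
    return xm + xs, xm - xs

rng = np.random.default_rng(1)
common = rng.standard_normal(1000)              # strongly correlated content
xl = common + 0.1 * rng.standard_normal(1000)
xr = common + 0.1 * rng.standard_normal(1000)

xm, xs = ms_encode(xl, xr)
yl, yr = ms_decode(xm, xs)
print("side/mid power ratio:", np.var(xs) / np.var(xm))
```

The side signal carries only the (small) uncorrelated part, so it can be coded with far fewer bits than either original channel.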

Schematic illustration of M/S coding in the frequency domain.

Figure 9.30 M/S coding in the frequency domain.

Intensity Stereo Coding. For intensity stereo (IS) coding, a mono signal $x_M(n) = x_L(n) + x_R(n)$ and two temporal envelopes $e_L(n)$ and $e_R(n)$ of the left and right signals are coded and transmitted. On the decoding side, the left signal is reconstructed by $y_L(n) = x_M(n) \cdot e_L(n)$ and the right signal by $y_R(n) = x_M(n) \cdot e_R(n)$. This reconstruction is lossy. The intensity stereo coding of MPEG-2 AAC [Bra98] is performed by summing the spectral coefficients of both signals and by coding scale factors which represent the temporal envelopes of both signals (see Fig. 9.31). This type of stereo coding is only useful for higher frequency bands, because human perception is insensitive to phase shifts at frequencies above 2 kHz.
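A blockwise sketch of intensity stereo coding (the block length and the RMS-based envelope estimate are assumptions for illustration; AAC operates on spectral coefficients with scale factors rather than on time-domain blocks):

```python
import numpy as np

def is_encode(xl, xr, block=64):
    """Transmit the sum signal plus per-block RMS envelopes of left and right."""
    xm = xl + xr
    nb = len(xl) // block
    el = np.array([np.sqrt(np.mean(xl[i*block:(i+1)*block]**2)) for i in range(nb)])
    er = np.array([np.sqrt(np.mean(xr[i*block:(i+1)*block]**2)) for i in range(nb)])
    return xm, el, er

def is_decode(xm, el, er, block=64):
    """Lossy reconstruction: scale the normalized sum by each channel envelope."""
    nb = len(el)
    yl, yr = np.zeros(nb * block), np.zeros(nb * block)
    for i in range(nb):
        seg = xm[i*block:(i+1)*block]
        rms = np.sqrt(np.mean(seg**2)) + 1e-12
        yl[i*block:(i+1)*block] = seg / rms * el[i]
        yr[i*block:(i+1)*block] = seg / rms * er[i]
    return yl, yr

t = np.arange(1024) / 44100
xl = 0.8 * np.sin(2 * np.pi * 5000 * t)   # in-phase high-frequency content
xr = 0.3 * np.sin(2 * np.pi * 5000 * t)
yl, yr = is_decode(*is_encode(xl, xr))
```

For fully coherent channels, as here, only the level difference matters and the reconstruction is exact; for incoherent content the phase information is lost, which is why IS is restricted to the higher bands.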

Schematic illustration of intensity stereo coding in the frequency domain.

Figure 9.31 Intensity stereo coding in the frequency domain.

Quantization and Coding. During the last coding step, the quantization and coding of the spectral coefficients take place. The quantizers used in the figures for prediction along spectral coefficients in the frequency direction (Fig. 9.27) and prediction in the frequency domain along bandpass signals (Fig. 9.29) are now combined into a single quantizer per spectral coefficient. This quantizer performs nonlinear quantization similar to the floating-point quantizer presented in Chapter 2, such that a nearly constant SNR over a wide amplitude range is achieved. This floating-point quantization with a so-called scale factor is applied to several frequency bands, in which several spectral coefficients use a common scale factor derived from an iteration loop (see Fig. 9.19). Finally, a Huffman coding of the quantized spectral coefficients is performed. An extensive presentation can be found in [Bos97, Bra98, Bos02].
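The nonlinear quantizer can be sketched as the power-law companding used in AAC-style coders (an exponent of 3/4 together with a scale factor; the concrete scale factor and test values here are assumptions, and the iterative scale-factor search of Fig. 9.19 is omitted):

```python
import numpy as np

def quantize(x, scale):
    """Power-law quantization of spectral coefficients with a common scale factor."""
    return np.sign(x) * np.round(np.abs(x / scale) ** 0.75)

def dequantize(q, scale):
    """Inverse companding at the decoder."""
    return np.sign(q) * np.abs(q) ** (4.0 / 3.0) * scale

x = np.array([0.5, -3.0, 40.0, -700.0])   # coefficients spanning a wide range
scale = 0.1
q = quantize(x, scale)
xhat = dequantize(q, scale)
rel_err = np.abs(x - xhat) / np.abs(x)
print("relative errors:", rel_err)
```

The companding compresses large magnitudes before uniform rounding, so the relative error stays moderate over several orders of magnitude instead of growing for small coefficients as it would with a uniform quantizer.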

9.7 MPEG‐4 Audio Coding

The MPEG‐4 audio coding standard consists of a family of audio and speech coding methods for different bit rates and a variety of multimedia applications [Bos02, Her02]. In addition to a higher coding efficiency, new functionalities such as scalability, object‐oriented representation of signals, and interactive synthesis of signals at the decoder are integrated. The MPEG‐4 coding standard is based on the following speech and audio coders.

  • Speech coders
    1. – CELP: Code‐excited linear prediction (bit rate 4–24 kBit/s).
    2. – HVXC: Harmonic vector excitation coding (bit rate 1.4–4 kBit/s).
  • Audio coders
    1. – Parametric audio: representation of a signal as a sum of sinusoids, harmonic components, and residual components (bit rate 4–16 kBit/s).
    2. – Structured audio: synthetic signal generation at the decoder (extension of the MIDI standard; bit rate 0.2–4 kBit/s).
    3. – Generalized audio: extension of MPEG‐2 AAC with additional methods in the time‐frequency domain. The basic structure is depicted in Fig. 9.19 (bit rate 6–64 kBit/s).

The basics of speech coders can be found in [Var06]. The specified audio coders allow coding with lower bit rates (parametric audio and structured audio) as well as coding with higher quality at lower bit rates compared with MPEG-2 AAC.

Schematic illustration of MPEG-4 parametric coder.

Figure 9.32 MPEG‐4 parametric coder.

Compared with the coding methods introduced so far, such as MPEG-1 and MPEG-2, parametric audio coding is of special interest as an extension to the filter-bank methods [Pur99, Edl00]. A parametric audio coder is shown in Fig. 9.32. The analysis of the audio signal leads to a decomposition into sinusoidal, harmonic, and noise-like signal components, and the quantization and coding of these signal components is based on psychoacoustics [Pur02a]. According to the analysis/synthesis approach [McA86, Ser89, Smi90, Geo92, Geo97, Rod97, Mar00a] shown in Fig. 9.33, the audio signal is represented in the parametric form

(9.40) $x(n) = \sum_{i=1}^{M} A_i(n) \cos\!\left(2\pi \frac{f_i(n)}{f_A}\, n + \varphi_i(n)\right) + x_n(n).$

The first term describes a sum of sinusoids with time-varying amplitudes $A_i(n)$, frequencies $f_i(n)$, and phases $\varphi_i(n)$. The second term is a noise-like component $x_n(n)$ with a time-varying temporal envelope. This noise-like component $x_n(n)$ is derived by subtracting the synthesized sinusoidal components from the input signal. With the help of a further analysis step, components with a fundamental frequency and multiples of this fundamental frequency are identified and grouped into harmonic components. The extraction of deterministic and stochastic components from an audio signal can be found in [Alt99, Hai03, Kei01, Kei02, Mar00a, Mar00b, Lag02, Lev98, Lev99, Pur02b]. In addition to the extraction of sinusoidal components, the modeling of noise-like and transient components is of specific importance [Lev98, Lev99]. Figure 9.34 exemplifies the decomposition of an audio signal into a sum of sinusoids $x_s(n)$ and a noise-like signal $x_n(n)$. The spectrogram shown in Fig. 9.35 represents the short-time spectra of the sinusoidal components. The extraction of the sinusoids has been achieved by a modified FFT method [Mar00a] with an FFT length of $N = 2048$ and an analysis hop size of $R_A = 512$.
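The sinusoidal part of Eq. (9.40) can be synthesized directly in the time domain as written; the sketch below evaluates the sum of partials with time-varying amplitudes (the two-partial example with a fading second harmonic is an assumption for illustration, and the noise-like term $x_n(n)$ is omitted):

```python
import numpy as np

def synth_sinusoids(A, f, phi, fA):
    """Synthesize the sinusoidal term of Eq. (9.40):
    x(n) = sum_i A_i(n) * cos(2*pi*f_i(n)/fA * n + phi_i(n)).
    A, f, phi are arrays of shape (M, N): one row per partial."""
    M, N = A.shape
    n = np.arange(N)
    x = np.zeros(N)
    for i in range(M):
        x += A[i] * np.cos(2 * np.pi * f[i] / fA * n + phi[i])
    return x

fA = 44100
N = 4096
# two partials at 440 Hz and 880 Hz; the second fades out over the block
A = np.vstack([np.ones(N), np.linspace(1.0, 0.2, N)])
f = np.vstack([440.0 * np.ones(N), 880.0 * np.ones(N)])
phi = np.zeros((2, N))
x = synth_sinusoids(A, f, phi, fA)
```

In a full decoder the amplitude, frequency, and phase tracks are interpolated between analysis frames before this synthesis step.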

Schematic illustration of parameter extraction with analysis/synthesis.

Figure 9.33 Parameter extraction with analysis/synthesis.

Schematic illustration of original signal, sum of sinusoids, and noise-like signal.

Figure 9.34 Original signal, sum of sinusoids, and noise‐like signal.

Schematic illustration of spectrogram of sinusoidal components.

Figure 9.35 Spectrogram of sinusoidal components.

The corresponding parametric MPEG‐4 decoder is shown in Fig. 9.36 [Edl00, Mei02]. The synthesis of the three signal components can be achieved by inverse FFT and overlap-add methods or can be performed directly by time-domain methods [Rod97, Mei02]. A significant advantage of parametric audio coding is the direct access at the decoder to the three main signal components, which allows effective post-processing for the generation of a variety of audio effects [Zöl11]. Effects such as time and pitch scaling, virtual sources in three-dimensional spaces, and cross-synthesis of signals (karaoke) are just a few examples of interactive sound design on the decoding side.

Schematic illustration of MPEG-4 parametric decoder.

Figure 9.36 MPEG‐4 parametric decoder.

9.8 Spectral Band Replication

To further reduce the bit rate, an extension of MPEG-1 Layer 3 with the name MP3pro was introduced [Die02, Zie02]. The underlying method, called spectral band replication (SBR), performs a lowpass and highpass decomposition of the audio signal, where the lowpass filtered part is coded by a standard coding method (e.g. MPEG-1 Layer 3) and the highpass part is represented by a spectral envelope and a difference signal [Eks02, Zie03]. Figure 9.37 shows the functional units of an SBR coder. For the analysis of the difference signal, the highpass part (HP generator) is reconstructed from the lowpass part and compared with the actual highpass part. The difference is coded and transmitted. For decoding (see Fig. 9.38), the decoded lowpass part of a standard decoder is used by the HP generator to reconstruct the highpass part. An additionally coded difference signal is added at the decoder. An equalizer provides the spectral envelope shaping for the highpass part. The spectral envelope of the highpass signal can be obtained by a filter bank and computing the root-mean-square values of each bandpass signal [Eks02, Zie03]. The reconstruction of the highpass part (HP generator) can also be achieved by a filter bank and substituting the bandpass signals by the lowpass parts [Schu96, Her98]. To code the difference signal of the highpass part, additive sinusoidal models can be applied, such as the parametric methods of the MPEG-4 coding approach.
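The decoder-side band replication can be sketched in the FFT domain. This is a deliberate simplification: [Eks02] uses a complex-valued QMF bank, and the band split, the number of envelope bands, and the test signal below are assumptions for illustration.

```python
import numpy as np

def sbr_reconstruct(x_low, env_hp, n_env_bands=8):
    """Reconstruct a fullband signal by copying the low band upwards (HP generator)
    and shaping the copy with the transmitted highpass envelope (equalizer)."""
    N = len(x_low)
    X = np.fft.rfft(x_low)
    half = len(X) // 2
    # HP generator: replicate the lower half of the spectrum into the upper half
    X[half:2 * half] = X[:half]
    # Equalizer: impose the per-band RMS envelope on the replicated part
    band = half // n_env_bands
    for b in range(n_env_bands):
        sl = slice(half + b * band, half + (b + 1) * band)
        rms = np.sqrt(np.mean(np.abs(X[sl]) ** 2)) + 1e-12
        X[sl] *= env_hp[b] / rms
    return np.fft.irfft(X, n=N)

rng = np.random.default_rng(2)
x_low = rng.standard_normal(1024)      # stand-in for the decoded lowpass signal
env_hp = np.linspace(1.0, 0.1, 8)      # transmitted highpass envelope (RMS per band)
y = sbr_reconstruct(x_low, env_hp)
```

After replication, the upper bands carry the fine spectral structure of the low band but the coarse envelope dictated by the transmitted side information, which is the essence of Fig. 9.39c.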

Schematic illustration of SBR coder.

Figure 9.37 SBR coder.

Schematic illustration of SBR decoder.

Figure 9.38 SBR decoder. Based on [Edl00] and [Mei02].

Schematic illustration of functional units of the SBR method.

Figure 9.39 Functional units of the SBR method.

Figure 9.39 shows the functional units of the SBR method in the frequency domain. First, the short‐time spectrum is used to calculate the spectral envelope (Fig. 9.39a). The spectral envelope can be derived from an FFT, a filter bank, the cepstrum, or by linear prediction [Zöl11]. The band‐limited lowpass signal can be downsampled and coded by a standard coder which operates at a reduced sampling rate. In addition, the spectral envelope has to be coded (Fig. 9.39b). On the decoding side, the reconstruction of the upper spectrum is achieved by frequency shifting of the lowpass part, or even specific lowpass parts, and applying the spectral envelope onto this artificial highpass spectrum (Fig. 9.39c). An efficient implementation of a time‐varying spectral envelope computation (at the coder side) and spectral weighting of the highpass signal (at the decoder side) with a complex‐valued QMF bank is described in [Eks02].

9.9 Constrained Energy Lapped Transform – Gain and Shape Coding

Another lossy audio coding system, using an overlapped MDCT and a gain-shape quantization scheme in the frequency domain, was introduced in [Val09, Val10]. The block diagram in Fig. 9.40 shows a simple version of the so-called constrained energy lapped transform (CELT) encoder and decoder. In this section, a simplified version of the CELT codec is described that excludes some refinements, such as pre- and post-filtering, bit allocation for the pitch period filter, and the filter gain, which are described in [Val13].

Schematic illustration of a simplified illustration of the CELT codec.

Figure 9.40 A simplified illustration of the CELT codec.

In the CELT codec, the MDCT with a given window length $N$ is performed on a pre-processed audio input to generate the coefficients $\mathbf{c}$. From the coefficients, the energies in 20 Bark bands are computed as

(9.41) $E_i = \sqrt{\mathbf{c}_i^T \cdot \mathbf{c}_i},$

where $i = 0, 1, \ldots, 19$ denotes the band index. These band energies are called gain coefficients and are given by $g_i = E_i$. Division of the Bark band coefficient vectors $\mathbf{c}_i$ by these gain coefficients $g_i$ leads to the shape coefficient vectors given by

(9.42) $\mathbf{s}_i = \frac{\mathbf{c}_i}{g_i}.$

The mapping from DCT coefficients $\mathbf{c}$ to Bark bands $\mathbf{c}_i$ is described in Table IV of [Val10]. The corresponding frequency for each DCT coefficient $c_k$ is given by $f_k = \frac{k}{2N} \cdot f_S$ in Hz with $k = 0, 1, \ldots, 255$. The sequential operations for computing gain and shape are given by

(9.43) $\mathbf{c} = \begin{bmatrix} \mathbf{c}_0 = (c_0, c_1, c_2)^T \\ \mathbf{c}_1 = (c_3, c_4, c_5)^T \\ \vdots \\ \mathbf{c}_{19} = (c_{180}, \ldots, c_{232})^T \end{bmatrix} \rightarrow \mathbf{g} = \begin{bmatrix} g_0 = \sqrt{\mathbf{c}_0^T \cdot \mathbf{c}_0} \\ g_1 = \sqrt{\mathbf{c}_1^T \cdot \mathbf{c}_1} \\ \vdots \\ g_{19} = \sqrt{\mathbf{c}_{19}^T \cdot \mathbf{c}_{19}} \end{bmatrix} \rightarrow \mathbf{s} = \begin{bmatrix} \mathbf{s}_0 = \mathbf{c}_0 / g_0 \\ \mathbf{s}_1 = \mathbf{c}_1 / g_1 \\ \vdots \\ \mathbf{s}_{19} = \mathbf{c}_{19} / g_{19} \end{bmatrix}.$

Figure 9.41 shows the individual spectrograms representing the energies per Bark band in dB, the corresponding gain coefficients in dB, and the shape coefficient vectors. The values of the elements in each shape vector lie between $-1$ and $1$. In the next step, quantization of the gain and shape coefficients takes place. The gain coefficients use scalar quantization and the shape coefficients use a form of vector quantization, which usually performs a dimension reduction and leads to compression.
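Eqs. (9.41)–(9.43) can be sketched as follows. The toy band layout with three coefficients per band is an assumption for compactness; the actual Bark mapping is given in Table IV of [Val10].

```python
import numpy as np

def gain_shape(c, bands):
    """Split coefficients into bands and factor each band into gain and shape."""
    gains, shapes = [], []
    for lo, hi in bands:
        ci = c[lo:hi]
        gi = np.sqrt(ci @ ci)        # E_i = sqrt(c_i^T c_i), Eq. (9.41)
        gains.append(gi)
        shapes.append(ci / gi)       # s_i = c_i / g_i, Eq. (9.42)
    return np.array(gains), shapes

rng = np.random.default_rng(3)
c = rng.standard_normal(60)                          # stand-in MDCT coefficients
bands = [(3 * i, 3 * (i + 1)) for i in range(20)]    # toy layout: 3 bins per band
g, s = gain_shape(c, bands)
```

Each shape vector is a unit vector, so the band energy is carried entirely by the scalar gain; multiplying gain and shape back together recovers the band coefficients exactly.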

Schematic illustration of CELT spectrograms for a) energies per Bark band, b) gain coefficients in dB, and c) shape coefficients.

Figure 9.41 CELT spectrograms for a) energies per Bark band, b) gain coefficients in dB, and c) shape coefficients.

9.9.1 Gain Quantization

CELT quantizes the gain in the $\log_2$ domain in two steps: coarse and fine quantization. For coarse quantization, the gain coefficients of each band could be quantized directly with a pre-defined resolution. However, direct quantization of the coefficients results in a high number of bits during encoding if the variance among the coefficients, or the entropy of their distribution, is high. When CELT is used for real-time audio coding over a network as part of the Opus codec [Val12], the available bandwidth governs the bit allocation. Hence, the CELT encoder eliminates the redundancy between the gain coefficients across time and frequency by performing a prediction. The corresponding details of the predictive method are provided in [Val10]. The method leads to a set of predictive gain coefficients, denoted by $\mathbf{g}^{\mathrm{p}}$, with lower variance or entropy than the original gain coefficients. Figure 9.42 shows the mean gain vector compared with the mean predictive gain vector, averaged across all bands for 100 overlapping frames of an audio signal. The variance of the mean gain vector is 0.088, while the variance of the mean predictive gain vector is 0.017. The coarse quantization (CQ) is then performed as

(9.44) $\mathbf{g}^{\mathrm{p}}_{\mathrm{cq}} = \mathrm{CQ}(\mathbf{g}^{\mathrm{p}}_{\mathrm{db}}) = \mathrm{CQ}(\log_2(\mathbf{g}^{\mathrm{p}})) = \begin{bmatrix} \mathrm{CQ}(\log_2(g^{\mathrm{p}}_0)) \\ \mathrm{CQ}(\log_2(g^{\mathrm{p}}_1)) \\ \vdots \\ \mathrm{CQ}(\log_2(g^{\mathrm{p}}_{19})) \end{bmatrix}$

and the fine quantization (FQ) is performed as

(9.45) $\mathbf{g}^{\mathrm{p}}_{\mathrm{fq}} = \mathrm{FQ}(\mathbf{g}^{\mathrm{p}}_{\mathrm{db}} - \mathbf{g}^{\mathrm{p}}_{\mathrm{cq}}) = \mathrm{FQ}(\mathbf{e}_{\mathrm{cq}}),$

where $\mathbf{g}^{\mathrm{p}}_{\mathrm{db}}$ denotes the predictive gain coefficients expressed in decibels and $\mathbf{e}_{\mathrm{cq}}$ denotes the coarse quantization error. The range of the rounded values after coarse quantization varies between frames and depends on the quantization resolution $q_r$. The quantized values or indices are transformed into positive integers for encoding. The coarse quantization symbols are usually distributed in such a way that they can be modeled by a generalized Gaussian or Laplace distribution. An example of the real symbol distribution over 100 overlapping frames is provided in Fig. 9.43, along with the ideal Laplace distribution that can be used during entropy coding. The coarse quantization error usually lies between $-0.5\, q_r$ and $0.5\, q_r$. Fine quantization maps the error value to an integer symbol which represents a sub-interval of the error range. The resolution for fine quantization depends on its bit allocation. The fine quantization symbols per frame are usually not distributed well enough to be modeled by any standard distribution. However, if many frames are taken into account, their distribution approaches a uniform distribution, as shown in Fig. 9.44.
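A minimal sketch of the two-step gain quantization follows. The resolution $q_r$, the number of fine bits, and the example gains are assumptions; the prediction across time and frequency from [Val10] is omitted, so the gains are quantized directly.

```python
import numpy as np

def coarse_quantize(g_db, qr):
    """Coarse quantization: round log-domain gains to a grid of resolution qr."""
    return np.round(g_db / qr).astype(int)

def fine_quantize(e_cq, qr, bits):
    """Map the coarse error in [-qr/2, qr/2) to one of 2**bits sub-intervals."""
    levels = 2 ** bits
    return np.clip(np.floor((e_cq / qr + 0.5) * levels), 0, levels - 1).astype(int)

def dequantize(sym_cq, sym_fq, qr, bits):
    """Rebuild the log-domain gain from the coarse grid point and the
    center of the fine sub-interval."""
    levels = 2 ** bits
    return sym_cq * qr + ((sym_fq + 0.5) / levels - 0.5) * qr

g = np.array([0.3, 1.7, 5.0, 12.4])   # band gains (linear domain)
g_db = np.log2(g)                     # CELT quantizes in the log2 domain
qr, bits = 1.0, 3
sym_cq = coarse_quantize(g_db, qr)
sym_fq = fine_quantize(g_db - sym_cq * qr, qr, bits)
g_db_hat = dequantize(sym_cq, sym_fq, qr, bits)
```

The coarse symbols carry the bulk of the dynamic range and are entropy coded; the fine symbols refine the result within one coarse step, bounding the total error by half a fine sub-interval.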

Schematic illustration of mean gain and mean predictive gain coefficients, averaged over all bands.

Figure 9.42 Mean gain and mean predictive gain coefficients, averaged over all bands.

Schematic illustration of actual and ideal distribution of the coarse quantization symbols for 100 frames.

Figure 9.43 Actual and ideal distribution of the coarse quantization symbols for 100 frames.

Schematic illustration of distribution of the fine quantization symbols for 100 frames.

Figure 9.44 Distribution of the fine quantization symbols for 100 frames.

9.9.2 Shape Quantization

Quantization of the shape vectors is performed by pyramid vector quantization (PVQ) [Fis86, Say12, Val13]. In PVQ, a codebook containing reference vectors of length $N$ is created, where $N$ denotes the length of the shape vector to be quantized. The codebook is made of integer vectors whose absolute values add up to a predefined constant $K$, referred to as the number of pulses. The codebook vectors are normalized to produce unit vectors, which are ordered and assigned numeric indices. The shape vector to be encoded receives the index that corresponds to its nearest unit vector in the codebook. The $L_1$ distance is used to find the appropriate unit vector in a codebook search. Each band is uniquely encoded based on its bit allocation, and the low- and mid-frequency bands usually have a better bit resolution. A shape vector $\mathbf{s}_i$ of the $i$th band is mapped to a codebook index $s_{\mathrm{pvq},i}$ according to

(9.46) $\mathbf{s}_{\mathrm{pvq}} = \mathrm{PVQ}(\mathbf{s}) = \begin{bmatrix} s_{\mathrm{pvq},0} = \mathrm{PVQ}(\mathbf{s}_0) \\ s_{\mathrm{pvq},1} = \mathrm{PVQ}(\mathbf{s}_1) \\ \vdots \\ s_{\mathrm{pvq},19} = \mathrm{PVQ}(\mathbf{s}_{19}) \end{bmatrix}.$

The distribution of the shape quantization symbols per frame is usually sparse and, similar to the fine quantization symbols, cannot be modeled by any standard distribution. When many frames are considered, it approaches a uniform distribution for each band. The inverse operation maps the codebook index $s_{\mathrm{pvq},i}$ to a codebook vector $\hat{\mathbf{s}}_i$ as given by

(9.47) $\hat{\mathbf{s}} = \mathrm{PVQ}^{-1}(\mathbf{s}_{\mathrm{pvq}}) = \begin{bmatrix} \hat{\mathbf{s}}_0 = \mathrm{PVQ}^{-1}(s_{\mathrm{pvq},0}) \\ \hat{\mathbf{s}}_1 = \mathrm{PVQ}^{-1}(s_{\mathrm{pvq},1}) \\ \vdots \\ \hat{\mathbf{s}}_{19} = \mathrm{PVQ}^{-1}(s_{\mathrm{pvq},19}) \end{bmatrix}.$
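The PVQ search can be sketched with a greedy pulse placement, one common search strategy; the enumeration of codebook indices for transmission is omitted, and the vector length and pulse count below are assumptions for illustration.

```python
import numpy as np

def pvq_quantize(s, K):
    """Greedy PVQ: place K signed unit pulses so that the normalized integer
    vector best matches the unit shape vector s (maximizing the correlation)."""
    N = len(s)
    y = np.zeros(N)
    for _ in range(K):
        best, best_metric = 0, -1.0
        for j in range(N):
            cand = y.copy()
            cand[j] += np.sign(s[j]) if s[j] != 0 else 1.0
            metric = (s @ cand) ** 2 / (cand @ cand)
            if metric > best_metric:
                best, best_metric = j, metric
        y[best] += np.sign(s[best]) if s[best] != 0 else 1.0
    return y.astype(int)

def pvq_dequantize(y):
    """Decoder side: normalize the integer codebook vector to a unit vector."""
    return y / np.linalg.norm(y)

s = np.array([0.8, -0.5, 0.3, 0.1])
s = s / np.linalg.norm(s)          # shape vectors are unit vectors
y = pvq_quantize(s, K=8)
s_hat = pvq_dequantize(y)
```

Because the absolute values of the integer vector always sum to exactly $K$, the codebook size, and hence the bit cost per band, is fixed in advance by the bit allocation.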

9.9.3 Range Coding

Range coding [Nig79, Say12] is an entropy coding method that performs a progressive encoding of symbols by narrowing down an initial range, based on the probability distribution of the symbols, until the final range and codeword are determined. The final codeword is decoded with the help of the probability distribution of the symbols as well. It is very similar to arithmetic coding, except that the encoding is done with digits in any base instead of with bits. However, CELT uses range coding as a bit packer. It performs an iterative compression of a sequence of 20 symbols, corresponding to the 20 bands, with a pre-defined probability density function and outputs a single codeword. From this codeword, the iterative decompression delivers the 20 input symbols. Range coding and decoding is usually performed for the coarse gain symbols only, while the shape and fine quantization symbols are encoded as raw bits and decoded directly. However, if the standalone CELT codec is used for non-real-time compression of audio with a look-ahead of multiple frames, entropy coding is able to compress the shape and fine quantization symbols as well, using a uniform distribution function.

9.9.4 CELT Decoding

Figure 9.40 shows a simplified operation of the CELT decoder. Once the bit streams are available to the decoder, range decoding is applied to retrieve the coarse quantization symbols, while the raw bits are decoded to retrieve the fine and shape quantization symbols. The inverse quantization of the symbols takes place in the next step, denoted by $\mathrm{FQ}^{-1}$, $\mathrm{GQ}^{-1}$, and $\mathrm{PVQ}^{-1}$, to obtain the respective coefficients. The reconstructed gain error is added to the reconstructed predictive gain coefficients, and subsequently the reconstructed gain coefficients $\hat{\mathbf{g}}_{\mathrm{db}}$ are calculated based on the inverse prediction method. The gain coefficients are transformed to the linear domain as $\hat{\mathbf{g}} = 2^{\hat{\mathbf{g}}_{\mathrm{db}}}$ and the frequency-domain coefficients are reconstructed as

(9.48) $\hat{\mathbf{c}}_i = \hat{g}_i \cdot \hat{\mathbf{s}}_i.$

The coefficients are transformed using the IMDCT to get the reconstructed time-domain signal. Figure 9.45 shows an original and a reconstructed signal, along with the reconstruction error, for 100 frames of a test audio file using a quantization resolution of 6 dB. For this example, CELT encoding results in an average rate of 348 bits per frame, while the average rate for coarse quantization after entropy coding is nearly 26 bits per frame. The overall compression achieved for the corresponding audio snippet is nearly 25 % of the original.

Schematic illustration of performance of the CELT codec on an audio snippet.

Figure 9.45 Performance of the CELT codec on an audio snippet.

9.10 JS Applet – Psychoacoustics

The applet shown in Fig. 9.46 demonstrates psychoacoustic audio masking effects [Gui05]. It is designed for a first insight into the perceptual experience of masking a sinusoidal signal with band‐limited noise.

You can choose between two predefined audio files from our web server (“Audio1.wav” or “Audio2.wav”). These are band‐limited noise signals with different frequency ranges. A sinusoidal signal is generated by the applet, and two sliders can be used to control its frequency and magnitude values.

Schematic illustration of JS applet - psychoacoustics.

Figure 9.46 JS applet – psychoacoustics.

9.11 Exercises

1. Psychoacoustics

  1. Human hearing
    1. In which frequency range is a human able to perceive sounds?
    2. What is the frequency range of speech?
    3. In the above specified range, where is human hearing most sensitive?
    4. How is the absolute threshold of hearing determined?
  2. Masking
    1. What is frequency‐domain masking?
    2. What is a critical band and why is it needed for frequency masking phenomena?
    3. Consider $a_i$ and $f_i$ to be respectively the amplitude and the frequency of a partial at index $i$, and $V(a_i)$ to be the corresponding volume in dB. The difference between the level of the masker and the masking threshold is $-10$ dB. The masking curves toward lower and higher frequencies are described respectively by a left slope (27 dB/Bark) and a right slope (15 dB/Bark). Explain the main steps of frequency masking in this case and show with plots how this masking phenomenon is achieved.
    4. What are the psychoacoustic parameters used for lossy audio coding?
    5. How can we explain temporal masking and what is its duration after stopping the active masker?

2. Audio coding

  1. How do a lossless coder and decoder work?
  2. What is the achievable compression factor for lossless coding?
  3. How do an MPEG‐1 Layer 3 coder and decoder work?
  4. How do an MPEG‐2 AAC coder and decoder work?
  5. What is temporal noise shaping?
  6. How do an MPEG‐4 coder and decoder work?
  7. What is the benefit of SBR?

References

  1. [Alt99] R. Althoff, F. Keiler, U. Zölzer: Extracting Sinusoids from Harmonic Signals, Proc. DAFX‐99 Workshop on Digital Audio Effects, pp. 97–100, Trondheim, Norway, 1999.
  2. [Blo95] T. Block: Untersuchung von Verfahren zur verlustlosen Datenkompression von digitalen Audiosignalen, Studienarbeit, TU Hamburg‐Harburg, 1995.
  3. [Bos97] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, Y. Oikawa: ISO/IEC MPEG‐2 Advanced Audio Coding, J. Aud. Eng. Soc., Vol.  45, No. 10, pp. 789–814, October 1997.
  4. [Bos02] M. Bosi, R.E. Goldberg: Introduction to Digital Audio Coding and Standards, Kluwer Academic Publishers, 2002.
  5. [Bra92] K. Brandenburg, J. Herre: Digital Audio Compression for Professional Applications, Proc. 92nd AES Convention, Preprint No. 3330, Vienna 1992.
  6. [Bra94] K. Brandenburg, G. Stoll: The ISO‐MPEG‐1 Audio: A Generic Standard for Coding of High Quality Digital Audio, J. Aud. Eng. Soc., Vol.  42, No. 10, pp. 780–792, October 1994.
  7. [Bra98] K. Brandenburg: Perceptual Coding of High Quality Digital Audio, in M. Kahrs, K. Brandenburg: Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic Publishers, 1998.
  8. [Cel93] C. Cellier, P. Chenes, M. Rossi: Lossless Audio Data Compression for Real‐Time Applications, Proc. 95th AES Convention, Preprint No. 3780, New York 1993.
  9. [Cra96] P. Craven, M. Gerzon: Lossless coding for audio discs, J. Audio Eng. Soc., Vol.  44, No. 9, pp.706–720, 1996.
  10. [Cra97] P. Craven, M. Law, and J. Stuart: Lossless compression using IIR prediction, Proc. 102nd AES Convention, Preprint No. 4415, Munich, Germany, 1997.
  11. [Die02] M. Dietz, L. Liljeryd, K. Kjörling, O. Kunz: Spectral Band Replication: A Novel Approach in Audio Coding, Proc. 112th AES Convention, Preprint No. 5553, Munich 2002.
  12. [Edl89] B. Edler: Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen, Frequenz, Vol.  43, pp. 252–256, 1989.
  13. [Edl00] B. Edler, H. Purnhagen: Parametric Audio Coding, 5th International Conference on Signal Processing (ICSP 2000), Beijing, August 2000.
  14. [Edl95] B. Edler: Äquivalenz von Transformation und Teilbandzerlegung in der Quellencodierung, Dissertation, Universität Hannover, 1995.
  15. [Eks02] P. Ekstrand: Bandwidth Extension of Audio Signals by Spectral Band Replication, Proc. 1st IEEE Benelux Workshop on Model‐based Processing and Coding of Audio (MPCA‐2002), Leuven, Belgium, 2002.
  16. [Ern00] M. Erne: Signal Adaptive Audio Coding Using Wavelets and Rate Optimization, Dissertation, ETH Zürich, 2000.
  17. [Fis86] T.R. Fischer: A Pyramid Vector Quantizer, IEEE Trans. on Information Theory, Vol. 32, No. 4, pp. 568–583, July 1986.
  18. [Geo92] E.B. George, M.J.T. Smith: Analysis‐by‐Synthesis/Overlap‐Add Sinusoidal Modeling applied to the Analysis and Synthesis of Musical Tones, Journal of the Audio Engineering Society, Vol.  40, No. 6, pp. 497–516, June 1992.
  19. [Geo97] E.B. George, M.J.T. Smith: Speech Analysis/Synthesis and Modification using an Analysis‐by‐Synthesis/Overlap‐Add Sinusoidal Model, IEEE Transactions on Speech and Audio Processing, Vol.  5, No. 5, pp.389–406, September 1997.
  20. [Glu93] R. Gluth: Beiträge zur Beschreibung und Realisierung digitaler, nichtrekursiver Filterbänke auf der Grundlage linearer diskreter Transformationen, Dissertation, Ruhr‐Universität Bochum, 1993.
  21. [Gui05] M. Guillemard, C. Ruwwe, U. Zölzer: J‐DAFx ‐ Digital Audio Effects in Java, Proc. 8th Int. Conference on Digital Audio Effects (DAFx‐05), Madrid, Spain, pp.161–166, 2005.
  22. [Hai03] S. Hainsworth, M. Macleod: On Sinusoidal Parameter Estimation, Proc. DAFX‐03 Conference on Digital Audio Effects, London, UK, September 2003.
  23. [Han98] M. Hans: Optimization of Digital Audio for Internet Transmission, PhD Thesis, Georgia Inst. Technol., Atlanta, 1998.
  24. [Han01] M. Hans, R.W. Schafer: Lossless Compression of Digital Audio, IEEE Signal Processing Magazine, Vol.  18, No. 4, pp. 21–32, July 2001.
  25. [Hel72] R.P. Hellman: Asymmetry in Masking Between Noise and Tone, Perception and Psychophys., Vol. 11, pp. 241–246, 1972.
  26. [Her96] J. Herre, J.D. Johnston: Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS), Proc. 101st AES Convention, Preprint No. 4384, Los Angeles, 1996.
  27. [Her98] J. Herre, D. Schultz: Extending the MPEG‐4 AAC Codec by Perceptual Noise Substitution, Proc. 104th AES Convention, Preprint No. 4720, Amsterdam, 1998.
  28. [Her99] J. Herre: Temporal Noise Shaping, Quantization and Coding Methods In Perceptual Audio Coding: A Tutorial Introduction, Proc. AES 17th International Conference on High Quality Audio Coding, Florence, September 1999.
  29. [Her02] J. Herre, B. Grill: Overview of MPEG‐4 Audio and its Applications in Mobile Communications, Proc. 112th AES Convention, Preprint No. 5553, Munich 2002.
  30. [Huf52] D.A. Huffman: A Method for the Construction of Minimum‐Redundancy Codes, Proc. of the IRE, pp. 1098–1101, 1952.
  31. [ISO92] ISO/IEC 11172‐3: Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to 1.5 Mbit/s ‐ Audio Part, International Standard, 1992.
  32. [Jay84] N.S. Jayant, P. Noll: Digital Coding of Waveforms, Prentice‐Hall, New Jersey, 1984.
  33. [Joh88a] J.D. Johnston: Transform Coding of Audio Signals Using Perceptual Noise Criteria, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, pp. 314–323, February 1988.
  34. [Joh88b] J.D. Johnston: Estimation of Perceptual Entropy Using Noise Masking Criteria, Proc. ICASSP‐88, pp. 2524–2527, 1988.
  35. [Kap92] R. Kapust: A Human Ear Related Objective Measurement Technique Yields Audible Error and Error Margin, Proc. 11th Int. AES Conference ‐ Test&Measurement, Portland, pp. 191–202, 1992.
  36. [Kei01] F. Keiler, U. Zölzer: Extracting Sinusoids from Harmonic Signals, Journal of New Music Research, Special Issue: “Musical Applications of Digital Signal Processing”, Guest Editor: Mark Sandler, Vol.  30, No. 3, pp. 243–258, September 2001.
  37. [Kei02] F. Keiler, S. Marchand: Survey on Extraction of Sinusoids in Stationary Sounds, Proc. DAFX‐02 Conference on Digital Audio Effects, pp. 51–58, Hamburg, 2002.
  38. [Kon94] K. Konstantinides: Fast Subband Filtering in MPEG Audio Coding, IEEE Signal Processing Letters, Vol. 1, No. 2, pp. 26–28, February 1994.
  39. [Lag02] M. Lagrange, S. Marchand, J.‐B. Rault: Sinusoidal Parameter Extraction and Component Selection in a Non Stationary Model, Proc. DAFX‐02 Conference on Digital Audio Effects, pp. 59–64, Hamburg, 2002.
  40. [Lev98] S. Levine: Audio Representations for Data Compression and Compressed Domain Processing, PhD Thesis, Stanford University, 1998.
  41. [Lev99] S. Levine, J.O. Smith: Improvements to the Switched Parametric & Transform Audio Coder, Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, October 1999.
  42. [Lie02] T. Liebchen: Lossless Audio Coding Using Adaptive Multichannel Prediction, Proc. 113th AES Convention, Preprint No. 5680, Los Angeles, 2002.
  43. [Mar00a] S. Marchand: Sound Models for Computer Music, PhD Thesis, University of Bordeaux, October 2000.
  44. [Mar00b] S. Marchand: Compression of Sinusoidal Modeling Parameters, Proc. DAFX‐00 Conference on Digital Audio Effects, pp. 273–276, Verona, Italy, December 2000.
  45. [Mal92] H.S. Malvar: Signal Processing with Lapped Transforms, Artech House, Norwood, 1992.
  46. [McA86] R. McAulay, T. Quatieri: Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Trans. Acoustics, Speech, Signal Processing, Vol. 34, No. 4, pp. 744–754, 1986.
  47. [Mas85] J. Masson, Z. Picel: Flexible Design of Computationally Efficient Nearly Perfect QMF Filter Banks, Proc. ICASSP‐85, pp. 541–544, 1985.
  48. [Mei02] N. Meine, H. Purnhagen: Fast Sinusoid Synthesis For MPEG‐4 HILN Parametric Audio Decoding, Proc. DAFX‐02 Conference on Digital Audio Effects, pp. 239–244, Hamburg, September 2002.
  49. [Nig79] G.N.N. Martin: Range Encoding: An Algorithm for Removing Redundancy from a Digitized Message, Proc. Video and Data Recording Conference, October 1979.
  50. [Pen93] W.B. Pennebaker, J.L. Mitchell: JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
  51. [Pri86] J.P. Princen, A.B. Bradley: Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol.  34, No. 5, pp. 1153–1161, October 1986.
  52. [Pri87] J.P. Princen, A.W. Johnston, A.B. Bradley: Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation, Proc. ICASSP‐87, pp. 2161–2164, 1987.
  53. [Pur97] M. Purat, T. Liebchen, and P. Noll: Lossless Transform Coding of Audio Signals, Proc. 102nd AES Convention, Preprint No. 4414, Munich, Germany, 1997.
  54. [Pur99] H. Purnhagen: Advances in Parametric Audio Coding, Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, October 1999.
  55. [Pur02a] H. Purnhagen, N. Meine, B. Edler: Sinusoidal Coding Using Loudness‐Based Component Selection, Proc. ICASSP‐2002, May 13‐17, Orlando, USA, 2002.
  56. [Pur02b] H. Purnhagen: Parameter Estimation and Tracking For Time‐Varying Sinusoids, Proc. 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio, Leuven, Belgium, November 2002.
  57. [Raa02] M. Raad, A. Mertins: From Lossy to Lossless Audio Coding Using SPIHT, Proc. DAFX‐02 Conference on Digital Audio Effects, pp. 245–250, Hamburg, 2002.
  58. [Rao90] K.R. Rao, P. Yip: Discrete Cosine Transform – Algorithms, Advantages, Applications, Academic Press, Inc., San Diego, 1990.
  59. [Rob94] T. Robinson: SHORTEN: Simple lossless and near‐lossless waveform compression, Technical Report CUED/F‐INFENG/TR.156, Cambridge University Engineering Department, Cambridge, UK, December 1994.
  60. [Rod97] X. Rodet: Musical Sound Signals Analysis/Synthesis: Sinusoidal+Residual and Elementary Waveform Models, Proceedings of the IEEE Time‐Frequency and Time‐Scale Workshop (TFTS‐97), University of Warwick, Coventry, UK, August 1997.
  61. [Rot83] J.H. Rothweiler: Polyphase Quadrature Filters ‐ A New Subband Coding Technique, Proc. ICASSP‐83, pp. 1280–1283, 1983.
  62. [Sauv90] U. Sauvagerd: Bitratenreduktion hochwertiger Musiksignale unter Verwendung von Wellendigitalfiltern, VDI‐Verlag, Düsseldorf 1990.
  63. [Say12] K. Sayood: Introduction to Data Compression. Morgan Kaufmann, 2012.
  64. [Schr79] M.R. Schroeder, B.S. Atal, J.L. Hall: Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear, J. Acoust. Soc. Am., Vol. 66, No. 6, pp. 1647–1652, December 1979.
  65. [Sch02] G.D.T. Schuller, Bin Yu, Dawei Huang, B. Edler: Perceptual Audio Coding Using Adaptive Pre‐ and Post‐Filters and Lossless Compression, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 6, pp. 379–390, Sept. 2002.
  66. [Schu96] D. Schulz: Improving Audio Codecs by Noise Substitution, J. Aud. Eng. Soc., Vol.  44, No. 7/8, pp. 593–598, July/August 1996.
  67. [Ser89] X. Serra: A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition, PhD Thesis, Stanford University, 1989.
  68. [Sin93] D. Sinha, A.H. Tewfik: Low bit rate Transparent Audio Compression Using Adapted Wavelets, IEEE Transactions on Signal Processing, Vol. 41, pp.3463–3479, 1993.
  69. [Smi90] J.O. Smith, X. Serra: Spectral Modeling Synthesis: A Sound Analysis/Synthesis System based on a Deterministic plus Stochastic Decomposition, Computer Music Journal, Vol.  14, No. 4, pp. 12–24, 1990.
  70. [Sqa88] EBU‐SQAM: Sound Quality Assessment Material, Recordings for Subjective Tests, Compact Disc, 1988.
  71. [Ter79] E. Terhardt: Calculating Virtual Pitch, Hearing Res., Vol. 1, pp. 155–182, 1979.
  72. [Thei88] G. Theile, G. Stoll, M. Link: Low Bit‐Rate Coding of High‐quality Audio Signals, EBU Review, No. 230, pp. 158–181, August 1988.
  73. [Vai93] P.P. Vaidyanathan: Multirate Systems and Filter Banks, Prentice‐Hall, Englewood Cliffs, 1993.
  74. [Val09] J.‐M. Valin, T.B. Terriberry, G. Maxwell: A Full‐Bandwidth Audio Codec with Low Complexity and Very Low Delay, Proc. EUSIPCO 2009, pp. 1254–1258, 2009.
  75. [Val10] J.‐M. Valin, T.B. Terriberry, C. Montgomery, G. Maxwell: A High‐Quality Speech and Audio Codec with Less than 10 ms Delay, IEEE Trans. Audio, Speech and Language Processing, Vol. 18, No. 1, pp. 58–67, 2010.
  76. [Val12] J.‐M. Valin, K. Vos, and T. Terriberry: Definition of the Opus Audio Codec. RFC 6716, Sep 2012.
  77. [Val13] J.‐M. Valin, T.B. Terriberry, C. Montgomery, K. Vos: High‐Quality, Low‐Delay Music Coding in the Opus Codec, Proc. 135th AES Convention, October 2013.
  78. [Var06] P. Vary, R. Martin: Digital Speech Transmission. Enhancement, Coding and Error Concealment, J. Wiley & Sons, Chichester, 2006.
  79. [Vet95] M. Vetterli, J. Kovacevic: Wavelets and Subband Coding, Prentice‐Hall, Englewood Cliffs, 1995.
  80. [Zie02] T. Ziegler, A. Ehret, P. Ekstrand, M. Lutzky: Enhancing mp3 with SBR: Features and Capabilities of the new mp3PRO Algorithm, Proc. 112th AES Convention, Preprint No. 5560, Munich 2002.
  81. [Zie03] T. Ziegler, M. Dietz, K. Kjörling, A. Ehret: aacPlus‐Full Bandwidth Audio Coding for Broadcast and Mobile Applications, International Signal Processing Conference, Dallas, 2003.
  82. [Zöl11] U. Zölzer (Ed.): DAFX ‐ Digital Audio Effects, 2nd edition, J. Wiley & Sons, Chichester, 2011.
  83. [Zwi82] E. Zwicker: Psychoakustik, Springer‐Verlag, Berlin, 1982.
  84. [Zwi90] E. Zwicker, H. Fastl: Psychoacoustics, Springer‐Verlag, Berlin, 1990.

Note

  1. http://www.midi.org/