CHAPTER 8

Digital Audio Principles

CHAPTER CONTENTS

Digital and Analog Recording Contrasted

Binary for Beginners

The Digital Audio Signal Chain

Analog-to-Digital Conversion

A basic example

Introduction to audio A/D conversion

Audio sampling

Filtering and aliasing

Sampling frequency and sound quality

Quantizing

Quantizing resolution and sound quality

Use of dither

Oversampling in A/D conversion

Noise shaping in A/D conversion

D/A Conversion

A basic D/A convertor

Oversampling in D/A conversion

Direct Stream Digital (DSD)

Changing the Resolution of an Audio Signal (Requantization)

Introduction to Digital Signal Processing

Gain changing (level control)

Mixing

Digital filters and equalization

Digital reverberation and other effects

Dynamics processing

Sample rate conversion

Pitch Shifting and Time Stretching

Audio Data Reduction

Why reduce the data rate?

Lossless and lossy coding

MPEG — an example of lossy coding

Parametric audio coding

Surround coding formats

Spatial audio object coding

High resolution data-reduced formats

 

This chapter contains an introduction to the main principles of digital audio, described in a relatively non-mathematical way. Further reading recommendations at the end of this chapter are given for those who want to study the subject in more depth. Subsequent chapters deal with digital recording and editing systems and with digital audio applications.

DIGITAL AND ANALOG RECORDING CONTRASTED

In analog recording, as described in the previous chapters, sound is recorded by converting continuous variations in sound pressure into continuous variations in electrical voltage, using a microphone. This varying voltage is then converted into a varying pattern of magnetization on a tape, or, alternatively, into a pattern of light and dark areas on an optical-film soundtrack, or a groove of varying deviation on an LP.

Because the physical characteristics of analog recordings relate closely to the sound waveform, replaying them is a relatively simple matter. Variations in the recorded signal can be converted directly into variations in sound pressure using a suitable collection of transducers and amplifiers. The replay system, however, is unable to tell the difference between wanted signals and unwanted signals. Unwanted signals might be distortions, noise and other forms of interference introduced by the recording process. For example, a record player cannot distinguish between the stylus movement it experiences because of a scratch on a record (unwanted) and that caused by a loud transient in the music (wanted). Imperfections in the recording medium are reproduced as clicks, crackles and other noises.

Digital recording, on the other hand, converts the electrical waveform from a microphone into a series of binary numbers, each of which represents the amplitude of the signal at a unique point in time, recording these numbers in a coded form which allows the system to detect whether the replayed signal is correct or not. A reproducing device is then able to distinguish between the wanted and the unwanted signals introduced above, and is thus able to reject all but the wanted original information in most cases. Digital audio can be engineered to be more tolerant of a poor recording channel than analog audio. Distortions and imperfections in the storage or transmission process need not affect the sound quality of the signal provided that they remain within the design limits of the system and that timing and data errors are corrected. These issues are given further coverage in Fact File 8.1.

Digital audio has made it possible for sound engineers to take advantage of developments in the computer industry, and this is particularly beneficial because the size of that industry results in mass production (and therefore cost savings) on a scale not possible for audio products alone. Today it is common for sound to be recorded, processed and edited on relatively low-cost desktop computer equipment, and this is a trend likely to continue.

FACT FILE 8.1 ANALOG AND DIGITAL INFORMATION

Analog information is made up of a continuum of values, which at any instant may have any value between the limits of the system. For example, a rotating knob may have one of an infinite number of positions — it is therefore an analog controller (see the diagram below). A simple switch, on the other hand, can be considered as a digital controller, since it has only two positions — off or on. It cannot take any value in between. The brightness of light that we perceive with our eyes is analog information and as the sun goes down the brightness falls gradually and smoothly, whereas a household light without a dimmer may be either on or off — its state is binary (that is, it has only two possible states).

image

Electrically, analog information may be represented as a varying voltage or current. If a rotary knob is used to control a variable resistor connected to a voltage supply, its position will affect the output voltage as shown below. This, like the knob’s position, may occupy any value between the limits — in this case anywhere between zero volts and +V. The switch could be used to control a similar voltage supply and in this case the output voltage could only be either zero volts or +V. In other words the electrical information that resulted would be binary. The high (+V) state could be said to correspond to a binary one and the low state to binary zero (although in many real cases it is actually the other way around).

Binary information is inherently more resilient to noise and interference than analog information, as shown in the diagram below. If noise is added to an analog signal it becomes very difficult to tell what is the wanted signal and what is the unwanted noise, as there is no means of distinguishing between the two. If noise is added to a binary signal it is possible to extract the important information at a later stage. By comparing the signal amplitude with a fixed decision point it is possible for a receiver to treat everything above the decision point as ‘high’ and everything below it as ‘low’. For any noise or interference to influence the state of a digital signal it must be at least large enough in amplitude to cause a high level to be interpreted as ‘low’, or vice versa.

image

The timing of digital signals may also be corrected to some extent, giving digital signals another advantage over analog ones. This is because digital information has a discrete time structure in which the intended sample instants are known. If the timing of bits in a digital message becomes unstable, such as after having been passed over a long cable with its associated signal distortions, resulting in timing ‘jitter’, the signal may be reclocked at a stable rate.

image
image

FIGURE 8.1
(a) A binary number (word or ‘byte’) consists of bits. (b) Each bit represents a power of two. (c) Binary numbers can be represented electrically in pulse code modulation (PCM) by a string of high and low voltages.

BINARY FOR BEGINNERS

First, we introduce the basics of the binary number system, because nearly all digital audio systems are based on it.

In the decimal number system each digit of a number represents a power of ten. In a binary system each digit or bit represents a power of two (see Figure 8.1). It is possible to calculate the decimal equivalent of a binary integer (whole number) by using the method shown. Negative numbers need special treatment, as described in Fact File 8.2. A number made up of more than 1 bit is called a binary ‘word’, and an 8 bit word is called a ‘byte’ (from ‘by eight’). Four bits is called a ‘nibble’. The more bits there are in a word the larger the number of states it can represent, with 8 bits allowing 256 (2⁸) states and 16 bits allowing 65536 (2¹⁶). The bit with the lowest weight (2⁰) is called the least significant bit or LSB and that with the greatest weight is called the most significant bit or MSB. The term kilobyte or Kbyte is used to mean 1024 or 2¹⁰ bytes and the term megabyte or Mbyte represents 1024 Kbytes.
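
The weighting scheme is easy to verify computationally. The following short Python sketch (illustrative only) evaluates a binary word bit by bit and shows how the number of available states grows with word length:

```python
# Decimal value of a binary integer: each bit contributes its power of two.
word = "10110011"                 # an 8 bit word (one byte)

value = 0
for bit in word:                  # MSB first
    value = value * 2 + int(bit)  # shift left, then add the new bit

print(value)                      # 179
print(int(word, 2))               # Python's built-in conversion agrees: 179

# The number of states a word can represent grows as 2 to the power n:
for n in (4, 8, 16, 24):
    print(n, "bits:", 2 ** n, "states")
```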

Electrically it is possible to represent a binary word in either serial or parallel form. In serial communication only one connection need be used and the word is clocked out one bit at a time using a device known as a shift register. The shift register is previously loaded with the word in parallel form (see Figure 8.2). The rate at which the serial data is transferred depends on the rate of the clock. In parallel communication each bit of the word is transferred over a separate connection.
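
A rough software model of the serial case is sketched below in Python; it mimics a shift register clocking a 4 bit word out one bit per clock pulse (MSB first here, though real devices may be configured either way):

```python
# Minimal model of a shift register: load a word in parallel, then clock
# it out one bit at a time.
def shift_out(word: int, n_bits: int):
    for i in range(n_bits - 1, -1, -1):  # one iteration per clock pulse
        yield (word >> i) & 1            # present the next bit, MSB first

sample = 0b1011                          # 4 bit word loaded in parallel
print(list(shift_out(sample, 4)))        # [1, 0, 1, 1], the serial output
```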

FACT FILE 8.2 NEGATIVE NUMBERS

Negative integers are usually represented in a form known as ‘two’s complement’. Negative values are represented by taking the positive equivalent, inverting all the bits and adding a one. Thus to obtain the 4 bit binary equivalent of decimal minus five (−5₁₀) in binary two’s complement form:

image

Two’s complement numbers have the advantage that the MSB represents the sign (1 = negative, 0 = positive) and that arithmetic may be performed on positive and negative numbers giving the correct result:

image

The carry bit that may result from adding the two MSBs is ignored.
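
The invert-and-add-one rule, and the behavior of the ignored carry bit, can be checked with a few lines of Python (a sketch using a 4 bit mask to mimic the fixed word length):

```python
# Two's complement of -5 in 4 bits: take +5, invert all the bits, add one.
n_bits = 4
mask = (1 << n_bits) - 1              # 0b1111 keeps results within 4 bits

plus_five = 0b0101
minus_five = (~plus_five & mask) + 1  # invert: 1010, add one: 1011
print(format(minus_five, "04b"))      # 1011

# Arithmetic gives the correct result: 7 + (-5) = 2,
# with the carry out of the MSB discarded by the mask.
result = (0b0111 + minus_five) & mask
print(format(result, "04b"), "=", result)  # 0010 = 2
```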

An example is shown here of 4 bit, two’s complement numbers arranged in a circular fashion. It will be seen that the binary value changes from all zeros to all ones as it crosses the zero point and that the maximum positive value is 0111 whilst the maximum negative value is 1000, so the values wrap around from maximum positive to maximum negative.

image
image

FIGURE 8.2 A shift register is used to convert a parallel binary word into a serial format. The clock is used to shift the bits one at a time out of the register, and its frequency determines the bit rate. The data may be clocked out of the register either MSB or LSB first, depending on the device and its configuration.

Table 8.1 Hexadecimal and decimal equivalents to binary numbers

Binary Hexadecimal Decimal
0000 0 0
0001 1 1
0010 2 2
0011 3 3
0100 4 4
0101 5 5
0110 6 6
0111 7 7
1000 8 8
1001 9 9
1010 A 10
1011 B 11
1100 C 12
1101 D 13
1110 E 14
1111 F 15
image

FIGURE 8.3 This 16 bit binary number may be represented in hexadecimal as shown, by breaking it up into 4 bit nibbles and representing each nibble as a hex digit.

Because binary numbers can become fairly unwieldy when they get long, various forms of shorthand are used to make them more manageable. The most common of these is hexadecimal. The hexadecimal system represents decimal values from 0 to 15 using the 16 symbols 0–9 and A–F, according to Table 8.1. Each hexadecimal digit corresponds to 4 bits or one nibble of the binary word. An example showing how a long binary word may be written in hexadecimal (hex) is shown in Figure 8.3 — it is simply a matter of breaking the word up into 4 bit chunks and converting each chunk to hex. Similarly, a hex word can be converted to binary by using the reverse process.
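
The nibble-by-nibble conversion is entirely mechanical, as the short Python sketch below illustrates (the 16 bit example word is arbitrary):

```python
# Convert a 16 bit binary word to hexadecimal by taking it one 4 bit
# nibble at a time, as described for Figure 8.3.
word = "1101001010001111"

nibbles = [word[i:i + 4] for i in range(0, len(word), 4)]
hex_digits = "".join(format(int(nibble, 2), "X") for nibble in nibbles)

print(nibbles)     # ['1101', '0010', '1000', '1111']
print(hex_digits)  # D28F
```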

Logical operations can be carried out on binary numbers, which enables various forms of mathematics to be done in binary form, as introduced in Fact File 8.3.

Fixed-point binary numbers are often used in digital audio systems to represent sample values. These are usually integer values represented by a number of bytes (2 bytes for 16 bit samples, 3 bytes for 24 bit samples, etc.). In some applications it is necessary to represent numbers with a very large range, or in a fractional form. Here floating-point representation may be used. A typical floating-point binary number might consist of 32 bits, arranged as 4 bytes, as shown in Figure 8.4. Three bytes are used to represent the mantissa and 1 byte the exponent (although the number of bits allocated to the exponent and mantissa varies with the application). The mantissa is the main part of the numerical value and the exponent determines the power of two by which the mantissa must be multiplied. The MSB of the exponent represents its sign, and likewise for the mantissa.
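
A much simplified Python model of this idea follows. It is a sketch of the mantissa-exponent principle only, not of the IEEE 754 format that most practical systems actually use (which adds normalization and an exponent bias):

```python
# Simplified floating-point model: a signed mantissa and a signed
# exponent, where value = mantissa * 2**exponent.
def decode(mantissa: int, exponent: int) -> float:
    return mantissa * 2.0 ** exponent

# The same mantissa can represent a small fraction or a huge value,
# depending on the exponent:
print(decode(mantissa=5_000_000, exponent=-23))  # ~0.596
print(decode(mantissa=5_000_000, exponent=10))   # 5120000000.0
```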

FACT FILE 8.3 LOGICAL OPERATION

Most of the apparently complicated processing operations that occur within a computer are actually just a fast sequence of simple logical operations. The apparent power of the computer and its ability to perform complex tasks are really due to the speed with which simple operations are performed.

The basic family of logical operations is shown here in the form of a truth table next to the electrical symbol that represents each ‘logic gate’. The AND operation gives an output only when both its inputs are true; the OR operation gives an output when either of its inputs are true; and the XOR (exclusive OR) gives an output only when one of its inputs is true. The inverter or NOT gate gives an output which is the opposite of its input and this is often symbolized using a small circle on inputs or outputs of devices to indicate inversion.

image
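
The same truth tables can be generated with Python's bitwise operators, as in this short sketch:

```python
# Truth tables for the basic logic gates, using bitwise operators
# on single bits.
print("a b | AND OR XOR")
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} {b} |  {a & b}   {a | b}   {a ^ b}")

# The NOT gate (inverter) has a single input:
print("NOT: 0 ->", 1 - 0, " 1 ->", 1 - 1)
```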

It is normally more straightforward to perform arithmetic processing operations on fixed-point numbers than on floating-point numbers, but signal processing devices are available in both forms.

image

FIGURE 8.4 An example of floating-point number representation in a binary system.

image

FIGURE 8.5 Block diagram of the typical digital recording or broadcasting signal chain.

THE DIGITAL AUDIO SIGNAL CHAIN

Figure 8.5 shows the signal chain involved in a typical digital recording or broadcasting system. First, the analog audio signal (a time-varying electrical voltage) is passed through an analog-to-digital (A/D) convertor where it is transformed from a continuously varying voltage into a series of ‘samples’, which are ‘snapshots’ of the analog signal taken many thousand times per second. Each sample is represented by a number. If the system uses some form of data reduction (see below) this will be carried out here, after A/D conversion and before channel coding. The resulting sequence of audio data is coded into a form that makes it suitable for recording or broadcasting (a process known as coding or channel coding), and the signal is then recorded or transmitted. Upon replay or reception the signal is decoded and subjected to error correction, and it is this latter process which works out what damage has been done to the signal since it was coded. The channel coding and error detection/correction processes are usually integral to the recording or transmission system and modern disk-based recording systems often rely on the built-in processes of generic computer mass storage systems to deal with this. After decoding, any errors in timing or value of the samples are corrected if possible and the result is fed to a digital-to-analog (D/A) convertor, which turns the numerical data back into a time-continuous analog audio signal.

In the following sections each of the main processes involved in this chain will be explained, followed by a discussion of the implementation of this technology in real audio systems.

image

FIGURE 8.6 A rotary knob’s position could be measured against a numbered scale such as the decimal scale shown. Quantizing the knob’s position would involve deciding which of the limited number of values (0–9) most closely represented the true position.

ANALOG-TO-DIGITAL CONVERSION

A basic example

In order to convert analog information into digital information it is necessary to measure its amplitude at specific points in time (called ‘sampling’) and to assign a binary digital value to each measurement (called ‘quantizing’). A simple example of the process can be taken from control technology, in which the position of a rotary knob is converted into a digital control signal that can be used by a computer. This concept can then be extended to the conversion of audio signals.

The diagram in Figure 8.6 shows such a rotary knob against a fixed scale running from 0 to 9. The position of the control should be measured or ‘sampled’ at regular intervals to register changes. The rate at which switches and analog controls are sampled depends on how important it is that they are updated regularly. Some older audio mixing consoles sampled the positions of automated controls once per television frame (40 ms in Europe), whereas some modern digital mixers sample controls as often as once per audio sample period (roughly 20 μs). Clearly the more regularly a control’s position is sampled the more data will be produced, since there will be one binary value per sample. Regular sampling ensures a smooth representation of changing control movements.

To quantize the position of the knob it is necessary to determine which point of the scale it is nearest at each sampling instant and assign a binary number that is equivalent to its position. Unless the pointer is at exactly one of the increments the quantizing process involves a degree of error. The maximum error is plus or minus half of an increment, because once the pointer is more than halfway between one increment and the next it should be quantized to the next.
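
A minimal Python sketch of the knob example (with hypothetical sampled positions) confirms that the quantizing error never exceeds half an increment:

```python
# Quantize sampled knob positions (continuous, 0.0-9.0) to the nearest
# of the ten scale marks, and show the resulting quantizing error.
positions = [0.2, 3.7, 5.8, 8.9]  # hypothetical sampled positions

for p in positions:
    q = round(p)                  # nearest scale value, 0-9
    print(f"true {p:.1f} -> quantized {q}, error {q - p:+.1f}")
```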

Introduction to audio A/D conversion

The process of A/D conversion is of paramount importance in determining the inherent sound quality of a digital audio signal. The technical quality of the audio signal, once converted, can never be made any better, only worse. Some applications deal with audio purely in the digital domain, in which case A/D conversion is not an issue, but most operations involve the acquisition of audio material from the analog world at one time or another. The quality of convertors varies very widely in digital audio workstations and their peripherals because the price range of such workstations is also great. Some stand-alone professional convertors can easily cost as much as the complete digital audio hardware and software for a desktop computer. One can find audio A/D convertors built into many multimedia desktop computers now, but these are often rather low performance devices when compared with the best available. As will be seen below, the sampling rate and the number of bits per sample are the main determinants of the quality of a digital audio signal, but the design of the convertors determines how closely the sound quality approaches the theoretical limits.

Despite the above, it must be admitted that to the undiscerning ear one 16 bit convertor sounds very much like another and that there is a law of diminishing returns when one compares the increased cost of good convertors with the perceivable improvement in quality. Convertors are very much like wine in this respect.

Audio sampling

An analog audio signal is a time-continuous electrical waveform and the A/D convertor’s task is to turn this signal into a time-discrete sequence of binary numbers. The sampling process employed in an A/D convertor involves the measurement or ‘sampling’ of the amplitude of the audio waveform at regular intervals in time (see Figure 8.7). From this diagram it will be clear that the sample pulses represent the instantaneous amplitudes of the audio signal at each point in time. The samples can be considered as instantaneous ‘still frames’ of the audio signal which together and in sequence form a representation of the continuous waveform, rather like the still frames that make up a movie film give the impression of a continuously moving picture when played in quick succession.

In order to represent the fine detail of the signal it is necessary to take a large number of these samples per second. The mathematical sampling theorem proposed by Shannon indicates that at least two samples must be taken per audio cycle if the necessary information about the signal is to be conveyed. This means that the sampling frequency must be at least twice as high as the highest audio frequency to be handled by the system (this is known as the Nyquist criterion).

image

FIGURE 8.7 An arbitrary audio signal is sampled at regular intervals of time t to create short sample pulses whose amplitudes represent the instantaneous amplitude of the audio signal at each point in time.

Another way of visualizing the sampling process is to consider it in terms of modulation, as shown in Figure 8.8. The continuous audio waveform is used to modulate a regular chain of pulses. The frequency of these pulses is the sampling frequency. Before modulation all these pulses have the same amplitude (height), but after modulation the amplitude of the pulses is modified according to the instantaneous amplitude of the audio signal at that point in time. This process is known as pulse amplitude modulation (PAM). Fact File 8.4 describes a frequency domain view of this process.

Filtering and aliasing

It can be seen from Figure 8.9 that if too few samples are taken per cycle of the audio signal then the samples may be interpreted as representing a wave other than that originally sampled. This is one way of understanding the phenomenon known as aliasing. An ‘alias’ is an unwanted representation of the original signal that arises when the sampled signal is reconstructed during D/A conversion.

It is relatively easy to see why the sampling frequency must be at least twice the highest baseband audio frequency from Figure 8.10. It can be seen that an extension of the baseband above the Nyquist frequency results in the lower sideband of the first spectral repetition overlapping the upper end of the baseband and appearing within the audible range that would be reconstructed by a D/A convertor. Two further examples are shown to illustrate the point — the first in which a baseband tone has a low enough frequency for the sampled sidebands to lie above the audio frequency range, and the second in which a much higher frequency tone causes the lower sampled sideband to fall well within the baseband, forming an alias of the original tone that would be perceived as an unwanted component in the reconstructed audio signal.
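
The sideband arithmetic can be verified with a trivial calculation. The following Python sketch reproduces the two examples described in Figure 8.10(c) and (d):

```python
# A component at frequency f, sampled at fs, produces sidebands at
# fs - f and fs + f (and around multiples of fs). Any sideband falling
# below fs/2 appears as an alias in the reconstructed audio.
def first_sidebands(f, fs):
    return fs - f, fs + f

print(first_sidebands(1_000, 30_000))   # (29000, 31000): above audio range
print(first_sidebands(17_000, 30_000))  # (13000, 47000): 13 kHz falls in
                                        # the baseband, so it is an alias
```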

image

FIGURE 8.8 In pulse amplitude modulation, the instantaneous amplitude of the sample pulses is modulated by the audio signal amplitude (positive only values shown).

The aliasing phenomenon can be seen in the case of the well-known ‘spoked-wheel’ effect on films, since moving pictures are also an example of a sampled signal. In film, still pictures (image samples) are normally taken at a rate of 24 per second. If a rotating wheel with a marker on it is filmed it will appear to move round in a forward direction as long as the rate of rotation is much slower than the rate of the still photographs, but as its rotation rate increases it will appear to slow down, stop, and then appear to start moving backwards. The virtual impression of backwards motion gets faster as the rate of rotation of the wheel gets faster and this backwards motion is the aliased result of sampling at too low a rate. Clearly the wheel is not really rotating backwards, it just appears to be. Perhaps ideally one would arrange to filter out moving objects that were rotating faster than half the frame rate of the film, but this is hard to achieve in practice and visible aliasing does not seem to be as annoying subjectively as audible aliasing.

FACT FILE 8.4 SAMPLING — FREQUENCY DOMAIN

Before modulation the audio signal has a frequency spectrum extending over the normal audio range, known as the baseband spectrum (upper diagram). The shape of the waveform and its equivalent spectrum is not significant in this diagram — it is just an artist’s impression of a complex audio signal such as music. The sampling pulses, before modulation, have a line spectrum at multiples of the sampling frequency, which is much higher than the highest audio frequency (middle diagram). The frequency spectrum of the pulse-amplitude-modulated (PAM) signal is as shown in the lower diagram. In addition to the ‘baseband’ audio signal (the original audio spectrum before sampling) there are now a number of additional images of this spectrum, each centered on multiples of the sampling frequency. Sidebands have been produced either side of the sampling frequency and its multiples, as a result of the amplitude modulation, and these extend above and below the sampling frequency and its multiples to the extent of the base bandwidth. In other words these sidebands are pairs of mirror images of the audio baseband.

image
image

FIGURE 8.9
In example (a) many samples are taken per cycle of the wave. In example (b) less than two samples are taken per cycle, making it possible for another lower-frequency wave to be reconstructed from the samples. This is one way of viewing the problem of aliasing.

image

FIGURE 8.10 Aliasing viewed in the frequency domain. In (a) the audio baseband extends up to half the sampling frequency (the Nyquist frequency fn) and no aliasing occurs. In (b) the audio baseband extends above the Nyquist frequency and consequently overlaps the lower sideband of the first spectral repetition, giving rise to aliased components in the shaded region. In (c) a tone at 1 kHz is sampled at a sampling frequency of 30kHz, creating sidebands at 29 and 31 kHz (and at 59 and 61 kHz, etc.). These are well above the normal audio frequency range, and will not be audible. In (d) a tone at 17kHz is sampled at 30kHz, putting the first lower sideband at 13kHz — well within the normal audio range. The 13kHz sideband is said to be an alias of the original wave.

image

FIGURE 8.11
In simple A/D convertors an analog anti-aliasing filter is used prior to conversion, which removes input signals with a frequency above the Nyquist limit.

If audio signals are allowed to alias in digital recording one hears the audible equivalent of the backwards-rotating wheel — that is, sound components in the audible spectrum that were not there in the first place, moving downwards in frequency as the original frequency of the signal increases. In basic convertors, therefore, it is necessary to filter the baseband audio signal before the sampling process, as shown in Figure 8.11, so as to remove any components having a frequency higher than half the sampling frequency. It is therefore clear that in practice the choice of sampling frequency governs the high frequency limit of a digital audio system.

In real systems, and because filters are not perfect, the sampling frequency is usually made higher than twice the highest audio frequency to be represented, allowing for the filter to roll off more gently. The filters incorporated into both D/A and A/D convertors have a pronounced effect on sound quality, since they determine the linearity of the frequency response within the audio band, the slope with which it rolls off at high frequency and the phase linearity of the system. In a non-oversampling convertor, the filter must reject all signals above half the sampling frequency with an attenuation of at least 80 dB. Steep filters tend to have an erratic phase response at high frequencies and may exhibit ‘ringing’ due to the high ‘Q’ of the filter. Steep filters also have the added disadvantage that they are complicated to produce. Although filter effects are unavoidable to some extent, manufacturers have made considerable improvements to analog anti-aliasing and reconstruction filters and these may be retro-fitted to many existing systems with poor filters. A positive effect is normally noticed on sound quality.

The process of oversampling and the use of higher sampling frequencies (see below) have helped to ease the problems of such filtering. Here the first repetition of the baseband is shifted to a much higher frequency, allowing the use of a shallower anti-aliasing filter and consequently fewer audible side effects.

Sampling frequency and sound quality

The choice of sampling frequency determines the maximum audio bandwidth available. There is a strong argument for choosing a sampling frequency no higher than is strictly necessary, in other words not much higher than twice the highest audio frequency to be represented. This often starts arguments over what is the highest useful audio frequency and this is an area over which heated debates have raged. Conventional wisdom has it that the audio frequency band extends up to 20 kHz, implying the need for a sampling frequency of just over 40 kHz for high quality audio work. There are in fact two standard sampling frequencies between 40 and 50 kHz: the Compact Disc rate of 44.1 kHz and the so-called ‘professional’ rate of 48 kHz. These are both allowed in the original AES-5 standard of 1984, which sets down preferred sampling frequencies for digital audio equipment. Fact File 8.5 shows commonly encountered sampling frequencies.

The 48 kHz rate was originally specified for professional use because it left a certain amount of leeway for downward varispeed in tape recorders. When many digital recorders are varispeeded, the sampling frequency changes proportionately and the result is a shifting of the first spectral repetition of the audio baseband. If the sampling frequency is reduced too far aliased components may become audible. Most professional digital tape recorders allowed for only around ±12.5% of varispeed for this reason. It is possible now, though, to avoid such problems using digital low-pass filters whose cut-off frequency varies with the sampling frequency, or by using digital signal processing to vary the pitch of audio without varying the output sampling frequency.

The 44.1 kHz frequency had been established earlier on for the consumer Compact Disc and is very widely used in the industry. In fact in many ways it has become the sampling rate of choice for most professional recordings. It allows for full use of the 20 kHz audio band and oversampling convertors allow for the use of shallow analog anti-aliasing filters which avoid phase problems at high audio frequencies. It also generates around 8% less data per second than the 48 kHz rate, making it more economical from a storage point of view.

FACT FILE 8.5 AUDIO SAMPLING FREQUENCIES

The table shows commonly encountered sampling frequencies and their applications.

Frequency (kHz) Application
8 Telephony (speech quality). ITU-T G711 standard.
16 Used in some telephony applications. ITU-T G722 data reduction.
~22.05 Roughly half the CD frequency (22.05 kHz). Used in some older computer applications. The original Apple Macintosh audio sampling frequency was 22 254.5454 … Hz.
32 Used in some broadcast coding systems, e.g. NICAM. DAT long play mode. AES-5 secondary rate.
44.056 A slight modification of the 44.1 kHz frequency used in some older equipment to synchronize digital audio with the NTSC television frame rate of 29.97 frames per second. Such ‘pull-down’ rates are sometimes still encountered in video sync situations.
44.1 CD sampling frequency. AES-5 secondary rate.
47.952 Occasionally encountered when 48kHz equipment is used in NTSC video operations. Another ‘pull-down’ rate, ideally to be avoided.
48 AES-5 primary rate for professional applications. Basic rate for Blu-Ray disk (which no longer specifies 44.1 kHz as an option).
88.2 Twice the CD sampling frequency. Optional for DVD-Audio.
96 AES-5-1998 secondary rate for high bandwidth applications. Optional for DVD-Video, DVD-Audio and Blu-Ray disks.
176.4 and 192 Four times the basic standard rates. Optional in DVD-Audio. 192 kHz is the highest sampling frequency allowed on Blu-Ray audio disks.
2.8224 MHz DSD sampling frequency. A highly oversampled rate used in 1 bit coding systems such as Super Audio CD.

A rate of 32 kHz is used in some broadcasting applications, such as NICAM 728 stereo TV transmissions, and in some radio distribution systems. Television and FM radio sound bandwidth is limited to 15 kHz and a considerable economy of transmission bandwidth is achieved by the use of this lower sampling rate. The majority of important audio information lies below 15 kHz in any case and little is lost by removing the top 5 kHz of the audio band. Some professional audio applications offer this rate as an option, but it is not common. It is used for the long play mode of some DAT machines, for example.

Arguments for the standardization of higher sampling rates have become stronger in recent years, quoting evidence from sources claiming that information above 20 kHz is important for higher sound quality, or at least that the avoidance of steep filtering must be a good thing. The DVD standards, for example, incorporate such sampling frequencies as standard features. AES-5-1998 (a revision of the AES standard on sampling frequencies) now allows 96 kHz as an optional rate for applications in which the audio bandwidth exceeds 20 kHz or where relaxation of the anti-alias filtering region is desired. Doubling the sampling frequency leads to a doubling in the overall data rate of a digital audio system and a consequent halving in storage time per megabyte. It also means that any signal processing algorithms need to process twice the amount of data and alter their algorithms accordingly. It follows that these higher sampling rates should be used only after careful consideration of the merits.

Low sampling frequencies such as those below 30 kHz are sometimes encountered for lower quality sound applications such as the storage and transmission of speech, the generation of computer sound effects and so forth. Multimedia applications may need to support these rates because such applications often involve the incorporation of sounds of different qualities. There are also low sampling frequency options for data reduction codecs, as discussed below.

Quantizing

After sampling, the modulated pulse chain is quantized. In quantizing a sampled audio signal the range of sample amplitudes is mapped onto a scale of stepped binary values, as shown in Figure 8.12. The quantizer determines which of a fixed number of quantizing intervals (of size Q) each sample lies within and then assigns it a value that represents the mid-point of that interval. This is done in order that each sample amplitude can be represented by a unique binary number in pulse code modulation (PCM). (PCM is the designation for the form of modulation in which signals are represented as a sequence of sampled and quantized binary data words.) In linear quantizing each quantizing step represents an equal increment of signal voltage and most high-quality audio systems use linear quantizing.

Quantizing error is an inevitable side effect in the process of A/D conversion and the degree of error depends on the quantizing scale used. Considering binary quantization, a 4 bit scale offers 16 possible steps, an 8 bit scale offers 256 steps, and a 16 bit scale 65536. The more bits, the more accurate the process of quantization. The quantizing error magnitude will be a maximum of plus or minus half the amplitude of one quantizing step and a greater number of bits per sample will therefore result in a smaller error (see Figure 8.13), provided that the analog voltage range represented remains the same.
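
As a rough numerical check, the Python sketch below (assuming NumPy is available, and using a simple rounding quantizer) quantizes a signal spanning ±1 V with different word lengths and confirms that the worst-case error is about half a quantizing interval:

```python
import numpy as np

v = np.linspace(-1.0, 1.0, 100_000)  # 'analog' input values in volts

for n_bits in (3, 4, 8, 16):
    q = 2.0 / 2 ** n_bits            # size of one quantizing interval
    quantized = np.round(v / q) * q  # map each value to the nearest interval
    max_err = np.max(np.abs(quantized - v))
    print(f"{n_bits:2d} bits: max error = {max_err:.6f} (Q/2 = {q / 2:.6f})")
```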

image

FIGURE 8.12
When a signal is quantized, each sample is mapped to the closest quantizing interval Q, and given the binary value assigned to that interval. (Example of a 3 bit quantizer shown.) On D/A conversion each binary value is assumed to represent the voltage at the mid point of the quantizing interval.

Figure 8.14 shows the binary number range covered by digital audio signals at different resolutions using the usual two’s complement hexadecimal representation. It will be seen that the maximum positive sample value of a 16 bit signal is &7FFF, whilst the maximum negative value is &8000. The sample value changes from all zeros (&0000) to all ones (&FFFF) as it crosses the zero point. The maximum digital signal level is normally termed 0dBFS (FS = full scale).

The quantized output of an A/D convertor can be represented in either serial or parallel form, as shown in Fact File 8.6.

Quantizing resolution and sound quality

The quantizing error may be considered as an unwanted signal added to the wanted signal, as shown in Figure 8.15. Unwanted signals tend to be classified either as distortion or noise, depending on their characteristics, and the nature of the quantizing error signal depends very much upon the level and nature of the related audio signal. Here are a few examples, the illustrations for which have been prepared in the digital domain for clarity, using 16-bit sample resolution.

image

FIGURE 8.13 In (a) a 3 bit scale is used and only a small number of quantizing intervals covers the analog voltage range, making the maximum quantizing error quite large. The second sample in this picture, for example, will be assigned the value 010, whose corresponding voltage is somewhat higher than that of the sample. During D/A conversion the binary sample values from (a) would be turned into pulses with the amplitudes shown in (b), where many samples have been forced to the same level owing to quantizing. In (c) the 4 bit scale means that a larger number of intervals are used to cover the same range and the quantizing error is reduced. (Expanded positive range only shown for clarity.)

image
image

FIGURE 8.14
Binary number ranges (in hexadecimal) related to analog voltage ranges for different convertor resolutions, assuming two’s complement representation of negative values. (a) 8 bit quantizer, (b) 16 bit quantizer, (c) 20 bit quantizer.

First, consider a very low-level sine wave signal, sampled then quantized, having a level only just sufficient to turn the least significant bit of the quantizer on and off at its peak (see Figure 8.16a). Such a signal would have a quantizing error that was periodic, and strongly correlated with the signal, resulting in harmonic distortion. Figure 8.16b shows the frequency spectrum of such a signal, analyzed in the digital domain, clearly showing the distortion products (predominantly odd harmonics) in addition to the original fundamental. Once the signal falls below the level at which it just turns on the LSB there is no modulation at all. The audible result of fading such a signal down to silence is therefore that of an increasingly distorted signal suddenly disappearing. A higher-level sine wave signal would cross more quantizing intervals and result in more non-zero sample values. As signal level rises the quantizing error, still with a maximum value of ±0.5Q, becomes increasingly small as a proportion of the total signal level and gradually loses its correlation with the signal.

FACT FILE 8.6 PARALLEL AND SERIAL REPRESENTATION

Electrically it is possible to represent the quantized binary signal in either serial or parallel form. When each bit of the audio sample is carried on a separate wire, the signal is said to be in a parallel format, so a 16 bit convertor would have 16 single bit outputs. If the data is transmitted down a single wire or channel, one bit after the other, the data is said to be in serial format. In serial communication the binary word is clocked out one bit at a time using a device known as a shift register. The shift register is previously loaded with the word in parallel form as shown in the diagram. The rate at which the serial data is transferred depends on the rate of the clock.

Serial form is most useful for transmission over interconnects or transmission links that might cover substantial distances or where the bulk and cost of the interconnect limits the number of paths available. Parallel form tends to be used internally, within high speed digital systems, although serial forms are increasingly used here as well. Most digital audio interfaces are serial, for example, although the Tascam TDIF interface uses a parallel representation of the audio data.

image
image

FIGURE 8.15
Quantizing error depicted as an unwanted signal added to the original sample values. Here the error is highly correlated with the signal and will appear as distortion. (Courtesy of Allen Mornington West.)

image

FIGURE 8.16 (a) A 1 kHz sine wave at very low level (amplitude ±1 LSB) just turns the least significant bit of the quantizer on and off. Analyzed in the digital domain with sample values shown in hex on the vertical axis and time in ms on the horizontal axis. (b) Frequency spectrum of this quantized sine wave, showing distortion products.

Consider now a music signal of reasonably high level. Such a signal has widely varying amplitude and spectral characteristics and consequently the quantizing error is likely to have a more random nature. In other words it will be more noise-like than distortion-like, hence the term quantizing noise that is often used to describe the audible effect of quantizing error. An analysis of the power of the quantizing error, assuming that it has a noise-like nature, shows that it has an r.m.s. amplitude of Q/√12, where Q is the voltage increment represented by one quantizing interval. Consequently the signal-to-noise ratio of an ideal n bit quantized signal can be shown to be:

S/N (dB) = 6.02n + 1.76

This implies a theoretical S/N ratio that approximates to just over 6 dB per bit. So a 16 bit convertor might be expected to exhibit an S/N ratio of around 98 dB, and an 8 bit convertor around 50 dB. This assumes an undithered convertor, which is not the normal case, as described below. If a convertor is undithered there will only be quantizing noise when a signal is present, but there will be no quiescent noise floor in the absence of a signal. Issues of dynamic range with relation to human hearing are discussed further in Fact File 8.7.
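
The formula is easily tabulated. The short Python sketch below prints the theoretical undithered figures for some common word lengths (matching the approximate values quoted above):

```python
# Theoretical S/N ratio of an ideal, undithered n bit quantizer:
# S/N (dB) = 6.02n + 1.76, i.e. just over 6 dB per bit.
for n in (8, 16, 20, 24):
    print(f"{n:2d} bits: {6.02 * n + 1.76:.1f} dB")
# Prints approximately 49.9, 98.1, 122.2 and 146.2 dB respectively.
```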

The dynamic range of a digital audio system is limited at high signal levels by the point at which the quantizing range of the convertor has been ‘used up’ (in other words, when there are no more bits available to represent a higher-level signal). At this point the waveform will be hard clipped (see Figure 8.17) and will become very distorted. This point will normally be set to occur at a certain electrical input voltage, such as +24 dBu in some professional systems. (The effect is very different from that encountered in analog tape recorders which tend to produce gradually more distortion as the recording level increases. Digital recorders remain relatively undistorted as the recording level rises until the overload point is reached, at which point very bad distortion occurs.)

FACT FILE 8.7 DYNAMIC RANGE AND PERCEPTION

It is possible with digital audio to approach the limits of human hearing in terms of sound quality. In other words, the unwanted artefacts of the process can be controlled so as to be close to or below the thresholds of perception. It is also true, though, that badly engineered digital audio can sound poor and that the term ‘digital’ does not automatically imply high quality. The choice of sampling parameters and noise-shaping methods, as well as more subtle aspects of convertor design, affect the frequency response, distortion and perceived dynamic range of digital audio signals.

The human ear’s capabilities should be regarded as the standard against which the quality of digital systems is measured, since it could be argued that the only distortions and noises that matter are those that can be heard. Work carried out by Louis Fielder and Elizabeth Cohen attempted to establish the dynamic range requirements for high-quality digital audio systems by investigating the extremes of sound pressure available from acoustic sources and comparing these with the perceivable noise floors in real acoustic environments. Using psychoacoustic theory, Fielder was able to show what was likely to be heard at different frequencies in terms of noise and distortion, and where the limiting elements might be in a typical recording chain. He determined a dynamic range requirement of 122 dB for natural reproduction. Taking into account microphone performance and the limitations of consumer loudspeakers, this requirement dropped to 115 dB for consumer systems.

image

FIGURE 8.17 Signals exceeding peak level in a digital system are hard-clipped, since no more digits are available to represent the sample value.

The number of bits per sample therefore dictates the signal-to-noise ratio of a linear PCM digital audio system. Fact File 8.8 summarizes the applications for different quantizing resolutions. For many years 16 bit linear PCM was considered the norm for high-quality audio applications. This is the CD standard and is capable of offering an S/N ratio of over 90 dB. For most purposes this is adequate, but it fails to reach the psychoacoustic ideal of 122 dB for subjectively noise-free reproduction in professional systems. To achieve such a performance requires a convertor resolution of around 21 bits, which is achievable with today’s convertor technology, depending on how the specification is interpreted. So-called 24 bit convertors are indeed available today, but their audio performance is strongly dependent upon the stability of the timing clock, electrical environment, analog stages, grounding and other issues.

For professional recording purposes one may need a certain amount of ‘headroom’ — in other words some unused dynamic range above the normal peak recording level which can be used in unforeseen circumstances such as when a signal overshoots its expected level. This can be particularly necessary in live recording situations where one is never quite sure what is going to happen with recording levels. This is another reason why many professionals feel that a resolution of greater than 16 bits is desirable for original recording. 20 and 24 bit recording formats are becoming increasingly popular for this reason, with mastering engineers then optimizing the finished recording for 16 bit media (such as CD) using noise-shaped requantizing processes.

FACT FILE 8.8 QUANTIZING RESOLUTIONS

The table shows some commonly encountered quantizing resolutions and their applications.

Bits per sample Approx. dynamic range with dither (dB) Application
8 44 Low-moderate quality for older PC internal sound generation. Some older multimedia applications. Usually in the form of unsigned binary numbers.
12 68 Older Akai samplers, e.g. S900.
14 80 Original EIAJ format PCM adaptors, such as Sony PCM-100.
16 92 CD standard. DAT standard. Commonly used high quality resolution for consumer media, some professional recorders and multimedia PCs. Usually two’s complement (signed) binary numbers.
20 116 High quality professional audio recording and mastering applications.
24 140 Maximum resolution of most recent professional recording systems, also of AES 3 digital interface. Dynamic range exceeds psychoacoustic requirements. Hard to convert accurately at this resolution.

Use of dither

The use of dither in A/D conversion, as well as in conversion between one sample resolution and another, is now widely accepted as correct. It has the effect of linearizing a normal convertor (in other words it effectively makes each quantizing interval the same size) and turns quantizing distortion into a random, noise-like signal at all times. This is desirable for a number of reasons. First, because white noise at a very low level is less subjectively annoying than distortion; second, because it allows signals to be faded smoothly down without the sudden disappearance noted above; and third, because it often allows signals to be reconstructed even when their level is below the noise floor of the system. Undithered audio signals begin to sound ‘grainy’ and distorted as the signal level falls. Quiescent hiss will disappear if dither is switched off, making a system seem quieter, but a small amount of continuous hiss is considered preferable to low-level distortion. The resolution of modern high-resolution convertors is such that the noise floor is normally inaudible in any case.

Dithering a convertor involves the addition of a very low-level signal to the audio whose amplitude depends upon the type of dither employed (see Fact File 8.9). The dither signal is usually noise, but may also be a waveform at half the sampling frequency or a combination of the two. A signal that has not been correctly dithered during the A/D conversion process cannot thereafter be dithered with the same effect, because the signal will have been irrevocably distorted. How then does dither perform the seemingly remarkable task of removing quantizing distortion?

FACT FILE 8.9 TYPES OF DITHER

Research has shown that certain dither signals are more suitable than others for high quality audio work. Dither noise is often characterized in terms of its probability distribution, which is a statistical method of showing the likelihood of the signal having a certain amplitude. A simple graph is used to indicate the shape of the distribution. The probability is the vertical axis and the amplitude in terms of quantizing steps is the horizontal axis.

Such probability distributions can be understood simply by thinking of the way in which dice fall when thrown (see the diagram). A single throw has a rectangular probability distribution function (RPDF), as shown in (a), because there is an equal chance of the throw being between 1 and 6. The total value of a pair of dice, on the other hand, has a roughly triangular probability distribution function (TPDF), as shown in (b), with the peak grouped on values from 6 to 8, because there are more combinations that make these totals than there are combinations making 2 or 12. Going back to digital electronics, one could liken the dice to random number generators and see that RPDF dither could be created using a single random number generator, and that TPDF dither could be created by adding the outputs of two RPDF generators.

RPDF dither has equal likelihood that the amplitude of the noise will fall anywhere between zero and maximum, whereas TPDF dither has greater likelihood that the amplitude will be zero than that it will be maximum. Although RPDF and TPDF dither can have the effect of linearizing a digital audio system and removing distortion, RPDF dither tends to result in noise modulation at low signal levels. The most suitable dither noise is found to be TPDF with a peak-to-peak amplitude of 2Q. If RPDF dither is used it should have a peak-to-peak amplitude of 1Q. Analog white noise has Gaussian probability, whose shape is like a normal distribution curve. With Gaussian noise, the optimum r.m.s. amplitude for the dither signal is 0.5Q, at which level noise modulation is minimized but not altogether absent. Dither at this level has the effect of reducing the undithered dynamic range by about 6dB, making the dithered dynamic range of an ideal 16 bit convertor around 92dB.
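
The ‘two dice’ construction translates directly into code. The following Python sketch (assuming NumPy is available) generates RPDF and TPDF dither and confirms their peak-to-peak amplitudes:

```python
import numpy as np

rng = np.random.default_rng()
Q = 1.0                                 # one quantizing step
n = 1_000_000

# RPDF: one 'die', uniform over +/-Q/2 (1Q peak-to-peak).
rpdf = rng.uniform(-Q / 2, Q / 2, n)

# TPDF: the sum of two independent RPDF generators (two 'dice'),
# giving a triangular distribution with 2Q peak-to-peak amplitude.
tpdf = rng.uniform(-Q / 2, Q / 2, n) + rng.uniform(-Q / 2, Q / 2, n)

print("RPDF peak-to-peak ~", round(rpdf.max() - rpdf.min(), 3))  # ~1Q
print("TPDF peak-to-peak ~", round(tpdf.max() - tpdf.min(), 3))  # ~2Q
```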

image
image

FIGURE 8.18
(a) Dither noise added to a sine wave signal prior to quantization. (b) Post-quantization the error signal is now random and noise-like. (Courtesy of Allen Mornington West.)

It was stated above that the distortion was a result of the correlation between the signal and the quantizing error, making the error periodic and subjectively annoying. Adding noise, which is a random signal, to the audio has the effect of randomizing the quantizing error and making it noise-like as well (shown in Figure 8.18a and b). If the noise has an amplitude similar in level to the LSB (in other words, one quantizing step) then a signal lying exactly at the decision point between one quantizing interval and the next may be quantized either upwards or downwards, depending on the instantaneous level of the dither noise added to it. Over time this random effect is averaged, leading to a noise-like quantizing error and a fixed noise floor in the system.
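
The decorrelating effect can be demonstrated numerically. The following Python sketch (assuming NumPy, with Q equal to 1 LSB) quantizes a 1 LSB sine wave with and without TPDF dither and compares how strongly the quantizing error correlates with the signal:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48_000) / 48_000
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz sine, amplitude = 1 LSB

undithered = np.round(x)          # quantize without dither
dither = rng.uniform(-0.5, 0.5, x.size) + rng.uniform(-0.5, 0.5, x.size)
dithered = np.round(x + dither)   # add TPDF dither, then quantize

# The error is noticeably correlated with the signal without dither
# (periodic, heard as distortion) and near zero with dither (noise-like).
for name, y in (("undithered", undithered), ("dithered", dithered)):
    err = y - x
    print(name, "error/signal correlation:",
          round(float(np.corrcoef(err, x)[0, 1]), 3))
```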

Dither is also used in digital processing devices such as mixers, but in such cases it is introduced in the digital domain as a random number sequence (the digital equivalent of white noise). In this context it is used to remove low-level distortion in signals whose gains have been altered and to optimize the conversion from high resolution to lower resolution during post-production.

Oversampling in A/D conversion

Oversampling involves sampling audio at a higher frequency than strictly necessary to satisfy the Nyquist criterion. Normally, though, this high rate is reduced to a lower rate in a subsequent digital filtering process, in order that no more storage space is required than for conventionally sampled audio. It works by trading off quantizing resolution against sampling rate, based on the principle that the information carrying capacity of a channel is related to the product of these two factors. Samples at a high rate with low resolution can be converted into samples at a lower rate with higher resolution, with no overall loss of information. Oversampling has now become so popular that it is the norm in most high-quality audio convertors.

Although oversampling A/D convertors often quote very high sampling rates of up to 128 times the basic rates of 44.1 or 48 kHz, the actual rate at the digital output of the convertor is reduced to a basic rate or a small multiple thereof (e.g. 48, 96 or 192 kHz). Samples acquired at the high rate are quantized to only a few bits’ resolution and then digitally filtered to reduce the sampling rate, as shown in Figure 8.19. The digital low-pass filter limits the bandwidth of the signal to half the basic sampling frequency in order to avoid aliasing, and this is coupled with ‘decimation’. Decimation reduces the sampling rate by dropping samples from the oversampled stream. A result of the low-pass filtering operation is to increase the word length of the samples very considerably. This is not simply an arbitrary extension of the word length, but an accurate calculation of the correct value of each sample, based on the values of surrounding samples. Although oversampling convertors quantize samples initially at a low resolution, the output of the decimator consists of samples at a lower rate with more bits of resolution. The sample resolution can then be shortened as necessary (see ‘Changing the Resolution of an Audio Signal (Requantization)’, below) to produce the desired word length.
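
The filter-and-decimate stage can be sketched with standard signal-processing tools. The example below (assuming NumPy and SciPy are available, with an arbitrary 8 times oversampling ratio) low-pass filters a high-rate signal and reduces it to the basic rate:

```python
import numpy as np
from scipy.signal import decimate

fs_target = 48_000
ratio = 8
fs_high = fs_target * ratio       # 'oversampled' rate, 384 kHz

t = np.arange(fs_high) / fs_high
x = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone captured at the high rate

# decimate() combines digital low-pass filtering with sample dropping.
y = decimate(x, ratio, ftype="fir")
print(len(x), "->", len(y))       # 384000 -> 48000 samples
```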

image

FIGURE 8.19
Block diagram of oversampling A/D conversion process.

image

FIGURE 8.20 (a) Oversampling in A/D conversion initially creates spectral repetitions that lie a long way from the top of the audio baseband. The dotted line shows the theoretical extension of the baseband and the potential for aliasing, but the audio signal only occupies the bottom part of this band. (b) Decimation and digital low-pass filtering limits the baseband to half the sampling frequency, thereby eliminating any aliasing effects, and creates a conventional collection of spectral repetitions at multiples of the sampling frequency.

Oversampling brings with it a number of benefits and is the key to improved sound quality at both the A/D and D/A ends of a system. Because the initial sampling rate is well above the audio range (often tens or hundreds of times the nominal rate) the spectral repetitions resulting from PAM are a long way from the upper end of the audio band (see Figure 8.20). The analog anti-aliasing filter used in conventional convertors is replaced by a digital decimation filter. Such filters can be made to have a linear phase response if required, resulting in higher sound quality. If oversampling is also used in D/A conversion the analog reconstruction filter can have a shallower roll-off. This can have the effect of improving phase linearity within the audio band, which is known to improve audio quality. In oversampled D/A conversion, basic rate audio is up-sampled to a higher rate before conversion and reconstruction filtering. Oversampling also makes it possible to introduce so-called ‘noise shaping’ into the conversion process, which allows quantizing noise to be shifted out of the most audible parts of the spectrum.

Oversampling without subsequent decimation is a fundamental principle of Sony’s Direct Stream Digital system, described below.

Noise shaping in A/D conversion

Noise shaping is a means by which noise within the most audible parts of the audio frequency range is reduced at the expense of increased noise at other frequencies, using a process that ‘shapes’ the spectral energy of the quantizing noise. It is possible because of the high sampling frequencies used in oversampling convertors. A high sampling frequency extends the frequency range over which quantizing noise is spread, putting much of it outside the audio band.

Quantizing noise energy extends over the whole baseband, up to the Nyquist frequency. Oversampling spreads the quantizing noise energy over a wider spectrum, because in oversampled convertors the Nyquist frequency is well above the upper limit of the audio band. This has the effect of reducing the in-band noise by around 3 dB per octave of oversampling (in other words, a system oversampling at twice the Nyquist rate would see the noise power within the audio band reduced by 3 dB).

In oversampled noise-shaping A/D conversion an integrator (low-pass filter) is introduced before the quantizer, and a D/A convertor is incorporated into a negative feedback loop, as shown in Figure 8.21. This is the so-called ‘sigma-delta convertor’. Without going too deeply into the principles of such convertors, the result is that the quantizing noise (introduced after the integrator) is given a rising frequency response at the input to the decimator, whilst the input signal is passed with a flat response. There are clear parallels between such a circuit and analog negative-feedback circuits.
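
A first-order sigma-delta loop reduces to a few lines of code. The following Python sketch (assuming NumPy) is a highly simplified model of the principle shown in Figure 8.21, not of any practical convertor design:

```python
import numpy as np

def sigma_delta(x):
    """First-order sigma-delta modulator producing a +/-1 bit stream."""
    out = np.empty_like(x)
    integrator = 0.0
    feedback = 0.0
    for i, sample in enumerate(x):
        integrator += sample - feedback              # integrate the difference
        feedback = 1.0 if integrator >= 0 else -1.0  # 1 bit quantizer
        out[i] = feedback                            # fed back, as in the diagram
    return out

t = np.arange(64) / 64
bits = sigma_delta(0.5 * np.sin(2 * np.pi * t))
print(bits[:16])  # a 1 bit stream whose short-term average tracks the input
```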

Without noise shaping, the energy spectrum of quantizing noise is flat up to the Nyquist frequency, but with first-order noise shaping this energy spectrum is made non-flat, as shown in Figure 8.22. With second-order noise shaping the in-band reduction in noise is even greater, such that the in-band noise is well below that achieved without noise shaping.
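
A first-order noise shaper can be sketched in Python using the equivalent error-feedback structure, in which each sample’s quantizing error is subtracted from the next input sample; the 4 bit step size is an arbitrary choice for illustration:

    import numpy as np

    def quantize(v, step):
        return step * np.round(v / step)

    def first_order_noise_shaper(x, step):
        # Error feedback: the quantizing error of each sample is
        # subtracted from the next input, so the error spectrum is
        # shaped by (1 - z^-1), i.e. it rises towards high frequencies.
        y = np.empty_like(x)
        err = 0.0
        for n, sample in enumerate(x):
            v = sample - err
            y[n] = quantize(v, step)
            err = y[n] - v
        return y

    x = 0.5 * np.sin(2 * np.pi * np.arange(4096) / 64)   # test tone
    y = first_order_noise_shaper(x, step=2.0 / 15)       # 4 bit steps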

image

FIGURE 8.21 Block diagram of a noise shaping delta-sigma A/D convertor.

image

FIGURE 8.22 Frequency spectra of quantizing noise. In a non-oversampled convertor, as shown in (a), the quantizing noise is constrained to lie within the audio band. In an oversampling convertor, as shown in (b), the quantizing noise power is spread over a much wider range, thus reducing its energy in the audio band. (c) With noise shaping the noise power within the audio band is reduced still further, at the expense of increased noise outside that band.

D/A CONVERSION

A basic D/A convertor

The basic D/A conversion process is shown in Figure 8.23. Audio sample words are converted back into a staircase-like chain of voltage levels corresponding to the sample values. This is achieved in simple convertors by using the states of bits to turn current sources on or off, making up the required pulse amplitude by the combination of outputs of each of these sources. This staircase is then ‘resampled’ to reduce the width of the pulses before they are passed through a low-pass reconstruction filter whose cut-off frequency is half the sampling frequency. The effect of the reconstruction filter is to join up the sample points to make a smooth waveform. Resampling is necessary to avoid any discontinuities in signal amplitude at sample boundaries and because otherwise the averaging effect of the filter would result in a reduction in the amplitude of high-frequency audio signals (the so-called ‘aperture effect’). Aperture effect may be reduced by limiting the width of the sample pulses to perhaps one-eighth of the sample period. Equalization may be required to correct for aperture effect.
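
The aperture effect follows a sinc-shaped roll-off determined by the pulse width, which can be estimated with a short Python calculation (the one-eighth duty cycle matches the figure quoted above):

    import numpy as np

    def aperture_loss_db(f, fs, duty=1.0):
        # A held pulse of width duty/fs rolls off as sinc(f * duty / fs);
        # narrowing the pulse (resampling) flattens the response.
        return 20 * np.log10(np.abs(np.sinc(f * duty / fs)))

    print(aperture_loss_db(20000, 44100, duty=1.0))    # about -3.2 dB
    print(aperture_loss_db(20000, 44100, duty=0.125))  # about -0.05 dB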

Oversampling in D/A conversion

Oversampling may be used in D/A conversion, as well as in A/D conversion. In the D/A case additional samples must be created in between the Nyquist rate samples in order that conversion can be performed at a higher sampling rate. These are produced by sample rate conversion of the PCM data. These samples are then converted back to analog at the higher rate, again avoiding the need for steep analog filters. Noise shaping may also be introduced at the D/A stage, depending on the design of the convertor, to reduce the subjective level of the noise.
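
A minimal sketch of the up-sampling step in Python/SciPy (the eight-times factor is an assumption for illustration):

    import numpy as np
    from scipy import signal

    fs = 48000
    x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)   # basic rate audio

    # Interpolate new samples digitally between the originals; the analog
    # reconstruction filter after the convertor can then roll off gently
    # instead of cutting off sharply just above the audio band.
    y = signal.resample_poly(x, up=8, down=1)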

image

FIGURE 8.23 Processes involved in D/A conversion (positive sample values only shown).

A number of advanced D/A convertor designs exist which involve oversampling at a high rate, creating samples with only a few bits of resolution. The extreme version of this approach involves very high rate conversion of single bit samples (so-called ‘bit stream conversion’), with noise shaping to optimize the noise spectrum of the signal. The theory of these convertors is outside the scope of this book.

DIRECT STREAM DIGITAL (DSD)

Direct Stream Digital (DSD) is Sony’s proprietary name for its 1-bit digital audio coding system that uses a very high sampling frequency (usually 2.8224 MHz, i.e. 64 times 44.1 kHz). This system is used for audio representation on the consumer Super Audio CD (SACD) and in various items of professional equipment used for producing SACD material. It is not directly compatible with conventional PCM systems although DSD signals can be down-sampled and converted to multibit PCM if required.

DSD signals are the result of delta-sigma conversion of the analog signal, a technique used at the front end of some oversampling convertors described above. As shown in Figure 8.24, a delta-sigma convertor employs a comparator and a feedback loop containing a low-pass filter, which effectively quantizes the difference between the current sample and the accumulated value of previous samples: if the input is higher a ‘1’ results, if it is lower a ‘0’ results. This creates a 1 bit output that simply alternates between one and zero in a pattern that depends on the original signal waveform, as shown in Figure 8.24. Conversion back to analog can be as simple a matter as passing the bit stream through a low-pass filter, but it is usually somewhat more sophisticated, involving noise shaping and higher-order filtering.
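
A crude model of the process in Python (a first-order loop only, whereas real DSD convertors use higher-order noise shaping; output values of +1/−1 stand in for the ones and zeros):

    import numpy as np

    def one_bit_modulator(x):
        # Integrate the difference between the input and the fed-back
        # 1 bit output; the comparator outputs +1 or -1 on the sign.
        out = np.empty_like(x)
        integ = 0.0
        fed_back = 0.0
        for n, sample in enumerate(x):
            integ += sample - fed_back
            out[n] = 1.0 if integ >= 0.0 else -1.0
            fed_back = out[n]
        return out

    fs = 2822400                                  # 64 x 44.1 kHz
    t = np.arange(fs // 100) / fs                 # 10 ms of signal
    bits = one_bit_modulator(0.5 * np.sin(2 * np.pi * 1000 * t))

    # A simple low-pass filter (here a moving average) recovers an
    # approximation of the original waveform from the pulse density.
    audio = np.convolve(bits, np.ones(128) / 128, mode='same')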

Although one would expect 1 bit signals to have an appalling signal-to-noise ratio, the exceptionally high sampling frequency spreads the noise over a very wide frequency range leading to lower noise within the audio band. Additionally, high-order noise shaping is used to reduce the noise in the audio band at the expense of that at much higher (inaudible) frequencies, as discussed earlier. A dynamic range of around 120 dB is therefore claimed, as well as a frequency response extending smoothly to over 100 kHz.

image

FIGURE 8.24 Direct Stream Digital bitstream generation. (a) Typical binary representation of a sine wave. (b) Pulse density modulation. (c) DSD signal chain.

CHANGING THE RESOLUTION OF AN AUDIO SIGNAL (REQUANTIZATION)

There may be points in an audio production when the need arises to change the resolution of a signal. A common example of this in high-quality audio is when mastering 16 bit consumer products from 20 or 24 bit recordings, but it also occurs within signal processors of all types because sample word lengths may vary at different stages. It is important that this operation is performed correctly because incorrect requantization results in unpleasant distortion, just like undithered quantization in A/D conversion. Dynamic range enhancement can also be employed when requantizing for consumer media, as shown in Fact File 8.10.

FACT FILE 8.10 DYNAMIC RANGE ENHANCEMENT

It is possible to maximize the subjective dynamic range of digital audio signals during the process of requantization. This is particularly useful when mastering high resolution recordings for CD because the reduction to 16 bit word lengths would normally result in increased quantizing noise. It is in fact possible to retain most of the dynamic range of a higher resolution recording, even though it is being transferred to a 16 bit medium. This remarkable feat is achieved by a noise-shaping process similar to that described earlier.

During requantization digital filtering is employed to shape the spectrum of the quantizing noise so that as much of it as possible is shifted into the least audible parts of the spectrum. This usually involves moving the noise away from the 4 kHz region where the ear is most sensitive and increasing it at the high-frequency end of the spectrum. The result is often quite high levels of noise at high frequency, but still lying below the audibility threshold. In this way CDs can be made to sound almost as if they had the dynamic range of 20 bit recordings. Some typical weighting curves used in a commercial mastering processor from Meridian are shown in the diagram, although many other shapes are in use. Some approaches allow the mastering engineer to choose from a number of ‘shapes’ of noise until one is found that is subjectively the most pleasing for the type of music concerned, whereas others stick to one theoretically derived ‘correct’ shape.

image
image

FIGURE 8.25
Truncation of audio samples results in distortion. (a) Shows the spectrum of a 1 kHz signal generated and analyzed at 20 bit resolution. In (b) the signal has been truncated to 16 bit resolution and the distortion products are clearly noticeable.

If the length of audio samples needs to be reduced then the worst possible solution is simply to remove unwanted LSBs. Taking the example of a 20 bit signal being reduced to 16 bits, one should not simply remove the four LSBs and expect everything to be all right. By removing the LSBs one would be creating a similar effect to not using dither in A/D conversion — in other words one would introduce low-level distortion components. Low-level signals would sound grainy and would not fade smoothly into noise. Figure 8.25 shows a 1 kHz signal at a level of −90 dBFS that originally began life at 20 bit resolution but has been truncated to 16 bits. The harmonic distortion is clearly visible.

image

FIGURE 8.26
The correct order of events when requantizing an audio signal at a lower resolution is shown here.

The correct approach is to redither the signal for the target resolution by adding dither noise in the digital domain. This digital dither should be at an appropriate level for the new resolution and the LSB of the new sample should then be rounded up or down depending on the total value of the LSBs to be discarded, as shown in Figure 8.26. It is worrying to note how many low-cost digital audio applications fail to perform this operation satisfactorily, leading to complaints about sound quality. Many professional quality audio workstations allow for audio to be stored and output at a variety of resolutions and may make dither user-selectable. They also allow the level of the audio signal to be changed in order that maximum use may be made of the available bits. It is normally important, for example when mastering a CD from a 20 bit recording, to ensure that the highest level signal on the original recording is adjusted during mastering so that it peaks close to the maximum level before requantizing and redithering at 16 bit resolution. In this way as much as possible of the original low-level information is preserved and quantizing noise is minimized. This applies in any requantizing operation, not just CD mastering. A number of applications are available that automatically scale the audio signal so that its level is optimized in this way, allowing the user to set a peak signal value up to which the highest level samples will be scaled. Since some overload detectors on digital meters and CD mastering systems look for repeated samples at maximum level to detect clipping, it is perhaps wise to set peak levels so that they lie just below full modulation. This will ensure that master tapes are not rejected by duplication plants for a suspected recording fault, and that subsequent users do not complain of ‘over’ levels.
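
A minimal sketch of the redither-and-round operation in Python (TPDF dither at plus/minus one new LSB is assumed; samples are taken to be integers at the original resolution):

    import numpy as np

    def requantize(samples, bits_in=20, bits_out=16):
        step = 2 ** (bits_in - bits_out)      # size of the new LSB
        # Triangular (TPDF) dither spanning +/- one new LSB, added
        # before rounding so that low-level detail becomes benign
        # noise rather than distortion.
        d = (np.random.uniform(-0.5, 0.5, len(samples)) +
             np.random.uniform(-0.5, 0.5, len(samples))) * step
        codes = np.round((samples + d) / step).astype(np.int64)
        return codes * step                   # back at the original scale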

INTRODUCTION TO DIGITAL SIGNAL PROCESSING

Just as processing operations like equalization, fading and compression can be performed in the analog domain, so they can in the digital domain. Indeed it is often possible to achieve certain operations in the digital domain with fewer side effects such as phase distortion. It is possible to perform operations in the digital domain that are either very difficult or impossible in the analog domain. High-quality, authentic-sounding artificial reverberation is one such example, in which the reflection characteristics of different halls and rooms can be accurately simulated. Digital signal processing (DSP) involves the high-speed manipulation of the binary data representing audio samples. It may involve changing the values and timing order of samples and it may involve the combining of two or more streams of audio data. DSP can affect the sound quality of digital audio in that it can add noise or distortion, although one must assume that the aim of good design is to minimize any such degradation in quality.

In the sections that follow an introduction will be given to some of the main applications of DSP in audio workstations without delving into the mathematical principles involved. In some cases the description is an oversimplification of the process, but the aim has been to illustrate concepts, not to tackle the detailed design considerations involved.

Gain changing (level control)

It is relatively easy to change the level of an audio signal in the digital domain. The simplest change is a shift of 6 dB, since this involves shifting the whole sample word one step to the left or right (see Figure 8.27); effectively the original value has been multiplied or divided by a factor of two. More precise gain control is obtained by multiplying the audio sample value by some other factor representing the increase or decrease in gain. The number of bits in the multiplication factor determines the accuracy of gain adjustment. The result of multiplying two binary numbers together is to create a new sample word which may have many more bits than the original, and it is common to find that digital mixers have internal structures capable of handling 32 bit words, even though their inputs and outputs may handle only 20. Because of this, redithering is usually employed in mixers at points where the sample resolution has to be shortened, such as at any digital outputs or conversion stages, in order to preserve sound quality as described above.

image

FIGURE 8.27 The gain of a sample may be changed by 6 dB simply by shifting all the bits one step to the left or right.

The values used for multiplication in a digital gain control may be derived from any user control such as a fader, rotary knob or on-screen representation, or they may be derived from stored values in an automation system. A simple ‘old-fashioned’ way of deriving a digital value from an ‘analog’ fader is to connect the fader to a fixed voltage supply and connect the fader wiper to an A/D convertor, although it is quite common now to find controls capable of providing a direct binary output relating to their position. The ‘law’ of the fader (the way in which its gain is related to its physical position) can be determined by creating a suitable look-up table of values in memory which are then used as multiplication factors corresponding to each physical fader position.
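
In Python, the ideas of the last two paragraphs look something like this (the fourth-power fader law is just one plausible taper for illustration, not a standard):

    import numpy as np

    samples = np.array([1000, -2000, 3000], dtype=np.int32)

    louder = samples << 1        # one bit left: times two, i.e. +6 dB
    quieter = samples >> 1       # one bit right: half, i.e. -6 dB

    def gain_factor(gain_db):
        return 10.0 ** (gain_db / 20.0)

    attenuated = np.round(samples * gain_factor(-3.5)).astype(np.int32)

    # A fader law as a look-up table: physical position -> multiplier
    positions = np.linspace(0.0, 1.0, 128)
    law = positions ** 4                     # assumed audio-taper shape
    output = np.round(samples * law[100]).astype(np.int32)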

Mixing

Mixing is the summation of independent data streams representing the different audio channels. Time coincident samples from each input channel are summed to produce a single output channel sample. Clearly it is possible to have many mix ‘buses’ by having a number of separate summing operations for different output channels. The result of summing a lot of signals may be to increase the overall level considerably and the architecture of the mixer must allow enough headroom for this possibility. In the same way as an analog mixer, the gain structure within a digital mixer must be such that there is an appropriate dynamic range window for the signals at each point in the chain, also allowing for operations such as equalization that change the signal level.
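
A sketch of a digital mix bus in Python, using a wide accumulator for headroom (the 24 bit output width and the crude scaling are assumptions for illustration):

    import numpy as np

    def mix(channels, out_bits=24):
        # Sum time-coincident samples into a wide (64 bit) accumulator
        # so that the bus has headroom for the summation.
        bus = np.sum([ch.astype(np.int64) for ch in channels], axis=0)
        limit = 2 ** (out_bits - 1) - 1
        peak = int(np.max(np.abs(bus)))
        if peak > limit:                  # crude gain reduction to fit
            bus = bus * limit // peak
        return bus.astype(np.int32)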

Crossfading is a combination of gain changing and mixing, as described in Fact File 8.11.

FACT FILE 8.11 CROSSFADING

Crossfading is employed widely in audio workstations at points where one section of sound is to be joined to another (edit points). It avoids the abrupt change of waveform that might otherwise result in an audible click and allows one sound to take over smoothly from the other. The process is illustrated conceptually here. It involves two signals each undergoing an automated fade (binary multiplication), one downwards and the other upwards, followed by an addition of the two signals. By controlling the rates and coefficients involved in the fades one can create different styles of crossfade for different purposes.

image
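
In code, the conceptual process above reduces to a few lines (Python; linear ramps are assumed here, though other coefficient curves give other crossfade styles):

    import numpy as np

    def crossfade(a, b, length):
        # Fade one signal down while the other fades up, then sum.
        ramp = np.linspace(0.0, 1.0, length)
        return a[:length] * (1.0 - ramp) + b[:length] * ramp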

Digital filters and equalization

Digital filtering is something of a ‘catch-all’ term, and is often used to describe DSP operations that do not at first sight appear to be filtering. A digital filter is essentially a process that involves the time delay, multiplication and recombination of audio samples in all sorts of configurations, from the simplest to the most complex. Using digital filters one can create low- and high-pass filters, peaking and shelving filters, echo and reverberation effects, and even adaptive filters that adjust their characteristics to affect different parts of the signal.

image

FIGURE 8.28 Examples of (a) the frequency response of a simple filter, and (b) the equivalent time domain impulse response.

To understand the basic principle of digital filters it helps to think about how one might emulate a certain analog filtering process digitally. Filter responses can be modeled in two main ways — one by looking at their frequency domain response and the other by looking at their time domain response. (There is another approach involving the so-called z-plane transform, but this is not covered here.) The frequency domain response shows how the amplitude of the filter’s output varies with frequency, whereas the time domain response is usually represented in terms of an impulse response (see Figure 8.28). An impulse response shows how the filter’s output responds to stimulation at the input by a single short impulse. Every frequency response has a corresponding impulse (time) response because the two are directly related. If you change the way a filter responds in time you also change the way it responds in frequency. A mathematical process known as the Fourier transform is often used as a means of transforming a time domain response into its equivalent frequency domain response. They are simply two ways of looking at the same thing.

Digital audio is time discrete because it is sampled. Each sample represents the amplitude of the sound wave at a certain point in time. It is therefore normal to create certain filtering characteristics digitally by operating on the audio samples in the time domain. In fact if it were desired to emulate a certain analog filter characteristic digitally one would theoretically need only to measure its impulse response and model this in the digital domain. The digital version would then have the same frequency response as the analog version, and one can even envisage the possibility for favorite analog filters to be recreated for the digital workstation. The question, though, is how to create a particular impulse response characteristic digitally, and how to combine this with the audio data.

image

FIGURE 8.29 A simple FIR filter (transversal filter). N = multiplication coefficient for each tap. The response shown below consists of successive output samples, multiplied by decreasing coefficients.

As mentioned earlier, all digital filters involve delay, multiplication and recombination of audio samples, and it is the arrangement of these elements that gives a filter its impulse response. A simple filter model is the finite impulse response (FIR) filter, or transversal filter, shown in Figure 8.29. As can be seen, this filter consists of a tapped delay line with each tap being multiplied by a certain coefficient before being summed with the outputs of the other taps. Each delay stage is normally a one sample period delay. An impulse arriving at the input would result in a number of separate versions of the impulse being summed at the output, each with a different amplitude. It is called a finite impulse response filter because a single impulse at the input results in a finite output sequence determined by the number of taps. The more taps there are the more intricate the filter’s response can be made, although a simple low-pass filter only requires a few taps.
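
A transversal filter is only a few lines of Python; here the taps form a crude moving-average low-pass of the kind mentioned above:

    import numpy as np

    def fir_filter(x, taps):
        # Each output sample is the sum of the current and delayed
        # input samples, each multiplied by its tap coefficient.
        return np.convolve(x, taps)[:len(x)]

    y = fir_filter(np.random.randn(1000), np.ones(5) / 5)  # five equal taps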

The other main type is the infinite impulse response (IIR) filter, which is also known as a recursive filter because there is a degree of feedback between the output and the input (see Figure 8.30). The response of such a filter to a single impulse is an infinite output sequence, because of the feedback. IIR filters are often used in audio equipment because they involve fewer elements for most variable equalizers than equivalent FIR filters, and they are useful in effects devices. They are unfortunately not phase linear, though, whereas FIR filters can be made phase linear.
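
The recursive structure can be sketched just as briefly (the feedback coefficient of 0.8 is taken from the caption of Figure 8.30):

    def iir_filter(x, n=0.8):
        # One delay stage fed back to the input: a single impulse
        # produces an indefinitely decaying output 1, n, n^2, ...
        y, prev = [], 0.0
        for sample in x:
            prev = sample + n * prev
            y.append(prev)
        return y

    impulse_response = iir_filter([1.0] + [0.0] * 9)  # 1, 0.8, 0.64, ...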

image

FIGURE 8.30
A simple IIR filter (recursive filter). The output impulses continue indefinitely but become very small. N in this case is about 0.8. A similar response to the previous FIR filter is achieved but with fewer stages.

Digital reverberation and other effects

It can probably be seen that the IIR filter described in the previous section forms the basis for certain digital effects, such as reverberation. The impulse response of a typical room looks something like Figure 8.31, that is, an initial direct arrival of sound from the source, followed by a series of early reflections, followed by a diffuse ‘tail’ of densely packed reflections decaying gradually to almost nothing. Using a number of IIR filters, perhaps together with a few FIR filters, one could create a suitable pattern of delayed and attenuated versions of the original impulse to simulate the decay pattern of a room. By modifying the delays and amplitudes of the early reflections and the nature of the diffuse tail one could simulate different rooms.
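
As a toy illustration, a handful of recirculating (comb) delays in Python can generate a crude diffuse tail; the delay times and gains here are illustrative only and a long way from a production reverberation algorithm:

    import numpy as np

    def comb(x, delay, gain):
        # A recirculating delay line: each pass round the loop adds a
        # quieter repeat, building a decaying train of reflections.
        y = np.copy(x)
        for i in range(delay, len(y)):
            y[i] += gain * y[i - delay]
        return y

    def toy_reverb(x, fs=48000):
        # x: float samples; incommensurate delays give a denser tail
        delays = [int(fs * s) for s in (0.0297, 0.0371, 0.0411, 0.0437)]
        gains = (0.77, 0.74, 0.72, 0.70)
        return sum(comb(x, d, g) for d, g in zip(delays, gains)) / len(gains)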

image

FIGURE 8.31 The impulse response of a typical reflective room.

image

FIGURE 8.32 A simple digital dynamics processing operation.

The design of convincing reverberation algorithms is a skilled task, and the difference between crude approaches and good ones is very noticeable. Some audio workstations offer limited reverberation effects built into the basic software package, but these often sound rather poor because of the limited DSP power available (often processed on the computer’s own CPU) and the crude algorithms involved. More convincing reverberation processors are available which exist either as stand-alone devices or as optional plug-ins for the workstation, having access to more DSP capacity and tailor-made software.

Other simple effects can be introduced without much DSP capacity, such as double-tracking and phasing/flanging effects. These often involve only very simple delaying and recombination processes. Pitch shifting can also be implemented digitally, and this involves processes similar to sample rate conversion, as described below. High-quality pitch shifting requires considerable processing power because of the number of calculations involved.

Dynamics processing

Digital dynamics processing involves gain control that depends on the instantaneous level of the audio signal. A simple block diagram of such a device is shown in Figure 8.32. A side chain produces coefficients corresponding to the instantaneous gain change required, which are then used to multiply the delayed audio samples. First, the r.m.s. level of the signal must be determined, after which it needs to be converted to a logarithmic value in order to determine the level change in decibels. Only samples above a certain threshold level will be affected, so a constant factor must be added to the values obtained, after which they are multiplied by a factor to represent the compression slope. The coefficient values are then anti-logged to produce linear coefficients by which the audio samples can be multiplied.
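
A much-simplified version of that side chain in Python (block-based r.m.s. with no attack/release smoothing; the threshold and ratio values are arbitrary):

    import numpy as np

    def gain_change_db(level_db, threshold_db=-20.0, ratio=4.0):
        # Above the threshold the output level rises at 1/ratio of the
        # input rate, so the required gain change (in dB) is negative.
        if level_db <= threshold_db:
            return 0.0
        return (threshold_db - level_db) * (1.0 - 1.0 / ratio)

    def compress(x, block=512):
        # x: float samples in the range -1 to 1
        y = np.copy(x)
        for s in range(0, len(x) - block + 1, block):
            rms = np.sqrt(np.mean(x[s:s + block] ** 2)) + 1e-12
            g = gain_change_db(20.0 * np.log10(rms))   # side chain, in dB
            y[s:s + block] *= 10.0 ** (g / 20.0)       # anti-log, multiply
        return y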

Sample rate conversion

Sample rate conversion is necessary whenever audio is to be transferred between systems operating at different rates. The aim is to convert the audio to the new rate without any change in pitch or addition of distortion or noise. These days sample rate conversion can be a very high-quality process, although it is never entirely transparent because it involves modifying the sample values and timings. As with requantizing algorithms, it is fairly common to encounter poorly implemented sample rate conversion on low-cost digital audio workstations, often depending very much on the specific software application rather than the hardware involved.

The easiest way to convert from one rate to another is by passing through the analog domain and resampling at the new rate, but this may introduce a small amount of extra noise. The most basic form of digital rate conversion involves the translation of samples at one fixed rate to a new fixed rate, related by a simple fractional ratio. Fractional-ratio conversion involves the mathematical calculation of samples at the new rate based on the values of samples at the old rate. Digital filtering is used to calculate the amplitudes of the new samples from the surrounding original samples, after low-pass filtering at the Nyquist frequency of the lower of the two sampling rates. A clock rate common to both sample rates is used to control the interpolation process. Using this method, some output samples will coincide with input samples, but only a limited number of possibilities exist for the interval between input and output samples.
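
For example, converting 44.1 kHz material to 48 kHz is a fractional-ratio conversion of 160:147, which SciPy’s polyphase resampler performs directly:

    import numpy as np
    from scipy import signal

    fs_in, fs_out = 44100, 48000
    x = np.sin(2 * np.pi * 1000 * np.arange(fs_in) / fs_in)

    # 48000/44100 reduces to 160/147: resample_poly up-samples by 160,
    # applies the interpolation/anti-aliasing low-pass filter, and
    # decimates by 147 in one polyphase operation.
    y = signal.resample_poly(x, up=160, down=147)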

If the input and output sampling rates have a variable or non-simple relationship the above does not hold true, since output samples may be required at any interval in between input samples. This requires an interpolator with many more clock phases than for fractional-ratio conversion, the intention being to pick a clock phase that most closely corresponds to the desired output sample instant at which to calculate the necessary coefficient. There will clearly be an error, which may be made smaller by increasing the number of possible interpolator phases. The audible result of the timing error is equivalent to the effects of jitter on an audio signal (see above), and should be minimized in design so that the effects of sample rate conversion are below the noise floor of the signal resolution in hand. If the input sampling rate is continuously varied (as it might be in variable-speed searching or cueing) the position of interpolated samples in relation to original samples must vary also. This requires real-time calculation of the filter phase.

Many workstations now include sample rate conversion as either a standard or optional feature, so that audio material recorded and edited at one rate can be reproduced at another. It is important to ensure that the quality of the sample rate conversion is high enough not to affect the sound quality of your recordings, and it should only be used if it cannot be avoided. Poorly implemented applications sometimes omit to use correct low-pass filtering to avoid aliasing, or incorporate very basic digital filters, resulting in poor sound quality after rate conversion.

Sample rate conversion is also useful as a means of synchronizing an external digital source to a standard sampling frequency reference, when it is outside the range receivable by a workstation.

PITCH SHIFTING AND TIME STRETCHING

Pitch alteration is now a common feature of many digital audio processing systems. It can be used either to modify the musical pitch of notes in order to create alternative versions, including harmony lines, or it can be used to ‘correct’ the pitch of musical lines such as vocals that were not sung in tune. The most basic way of doing this is to alter the sampling frequency of the audio signal (slowing down the sampling frequency without doing any sophisticated rate conversion will cause the perceived pitch to drop), but this also changes the speed and duration, and the resulting sample data is then at a non-standard sampling frequency. Resampling the original signal in the digital domain at a different frequency, followed by replay at the original sampling frequency, is an alternative, but the speed will be similarly affected. Modern pitch alteration algorithms are usually much more sophisticated than this and can alter the pitch without altering the speed. Time stretching is a related application, and involves altering the duration of a clip without altering its pitch. Pitch correction algorithms attempt to identify the fundamental pitch of individual notes in a phrase and quantize them to a fixed pitch scale in order to force melodies into tune. This can be done with varying degrees of severity and musicality, leading to results anywhere on a range from crude effect to subtle correction of tuning.
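
The naive approach can be demonstrated in a few lines of Python, which also shows its drawback (the duration changes along with the pitch):

    import numpy as np
    from scipy import signal

    fs = 48000
    x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # one second at 440 Hz

    # Resample to fewer samples and replay at the original rate: the
    # pitch rises by the ratio, but the clip also gets shorter.
    ratio = 2 ** (2 / 12)                              # two semitones up
    shifted = signal.resample(x, int(round(len(x) / ratio)))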

Both effects can be achieved by transforming a signal into the frequency domain, modifying it and resynthesizing it in the time domain. Techniques based on phase vocoding or spectral modeling are sometimes used. Both approaches succeed to some extent in enabling pitch and time information to be analyzed and modified independently, although with varying side effects depending on the content of the signal and the parameters of the processing. The signal is transformed to the frequency domain in overlapping blocks using a short time Fourier transform (STFT). It then becomes possible to modify the signal in the frequency domain, for example by scaling certain spectral components, before performing an inverse transform to return it to the time domain with modified pitch. Alternatively the original spectral components can be resynthesized with a new time scale at the stage of the inverse transform in order to change the duration. In the time domain, time stretch processing typically involves identifying the fundamental period of the wave and extracting or adding individual cycles with crossfades, to shorten or lengthen the clip. It may also involve removing or adding samples in silent gaps between notes or phrases. (The latter is particularly used in algorithms that attempt to fit overdubbed speech dynamically to a guide track, for movie sound applications.)

In general, both time stretching and pitch shifting only work successfully over a limited range of 30% or so either way, although occasionally they can be made to work over wider ranges depending on the nature of the signal and the sophistication of the algorithm. One problem with simple pitch shifting is the well-known ‘Pinky and Perky’ effect that can be noticed when shifting musical sounds (particularly voices) too far from their original frequency, making them sound unnatural. This is because real musical sounds and human voices have so-called formants, which are peaks and troughs in the spectrum that are due to influences like resonances in the instrument or vocal tract. These give a voice its unique character, for example. When the pitch is shifted the formant structure can become shifted too, so that the peaks and troughs are no longer in the right place in the frequency spectrum for the voice in question. Some sophisticated pitch shifting algorithms therefore employ advanced signal processing methods that can identify the so-called spectral envelope of an instrument or voice (its pattern of peaks and troughs in the frequency spectrum), and attempt to retain this even when the pitch is shifted. In this way a voice can be made to sound like the original singer even when shifted over quite a wide range.

AUDIO DATA REDUCTION

Conventional PCM audio has a high data rate, and there are many applications for which it would be an advantage to have a lower data rate without much (or any) loss of sound quality. 16 bit linear PCM at a sampling rate of 44.1 kHz (‘CD quality digital audio’) results in a data rate of about 700 kbit/s per channel. For multimedia applications, broadcasting, communications and some consumer purposes (e.g. streaming over the Internet) the data rate may be reduced to a fraction of this with minimal effect on the perceived sound quality. At very low rates the effect on sound quality is traded off against the bit rate required. Simple techniques for reducing the data rate, such as reducing the sampling rate or the number of bits per sample, would have a very noticeable effect on sound quality, so most modern low bit rate coding works by exploiting the phenomenon of auditory masking to ‘hide’ the increased noise resulting from bit rate reduction in parts of the audio spectrum where it will hopefully be inaudible. There are a number of types of low bit rate coding used in audio systems, working on similar principles, and used for applications such as consumer disk and tape systems (e.g. Sony ATRAC), digital cinema sound (e.g. Dolby Digital, Sony SDDS, DTS) and multimedia applications (e.g. MPEG).

Why reduce the data rate?

Nothing is inherently wrong with linear PCM from a sound quality point of view; indeed it is probably the best thing to use. The problem is simply that the data rate is too high for a number of applications. Two channels of linear PCM require a rate of around 1.4 Mbit/s, whereas applications such as Digital Audio Broadcasting (DAB) or Digital Radio need it to be more like 128 kbit/s (or perhaps lower for some applications) in order to fit sufficient channels into the radio frequency spectrum — in other words around a tenth of the data per second. Some Internet streaming applications need it to be even lower than this, with rates down in the low tens of kilobits per second for mobile communications.

The efficiency of mass storage media and data networks is related to their data transfer rates. The more data can be moved per second, the more audio channels may be handled simultaneously; the faster a disk can be copied, the faster a sound file can be transmitted across the world. In reducing the data rate that each audio channel demands, one also reduces the requirement for such high specifications from storage media and networks, or alternatively one can obtain greater functionality from the same specification. A network connection capable of handling eight channels of linear PCM simultaneously could be made to handle, say, 48 channels of data-reduced audio, without unduly affecting sound quality.

Although this sounds like magic and makes it seem as if there is no point in continuing to use linear PCM, it must be appreciated that the data reduction is achieved by throwing away data from the original audio signal. The more data is thrown away the more likely it is that unwanted audible effects will be noticed. The design aim of most of these systems is to try to retain as much as possible of the sound quality whilst throwing away as much data as possible, so it follows that one should always use the least data reduction necessary, where there is a choice.

image

FIGURE 8.33 (a) In lossless coding the original data is reconstructed perfectly upon decoding, resulting in no loss of information. (b) In lossy coding the decoded information is not the same as that originally coded, but the coder is designed so that the effects of the process are minimal.

Lossless and lossy coding

There is an important distinction to be made between the type of data reduction used in some computer applications and the approach used in many audio coders. The distinction is really between ‘lossless’ coding and coding which involves some loss of information (see Figure 8.33). It is quite common to use data compression on computer files in order to fit more information onto a given disk or tape, but such compression is usually lossless in that the original data is reconstructed bit for bit when the file is decompressed. A number of tape backup devices for computers have a compression facility for increasing the apparent capacity of the medium, for example. Methods are used which exploit redundancy in the information, such as coding a string of 80 zeros by replacing them with a short message stating the value of the following data and the number of bytes involved. This is particularly relevant in single-frame bit-mapped picture files where there may be considerable runs of black or white in each line of a scan, where nothing in the image is changing. One may expect files compressed using off-the-shelf PC data compression applications to be reduced to perhaps 25–50% of their original size, but it must be remembered that they are often dealing with static data, and do not have to work in real time. Also, it is not normally acceptable for decompressed computer data to be anything but the original data.
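
Run-length coding of the kind described is trivial to sketch in Python:

    def run_length_encode(data):
        # Replace each run of identical values with a (value, count) pair
        out = []
        prev, count = data[0], 1
        for v in data[1:]:
            if v == prev:
                count += 1
            else:
                out.append((prev, count))
                prev, count = v, 1
        out.append((prev, count))
        return out

    run_length_encode([0] * 80 + [7, 7, 3])   # [(0, 80), (7, 2), (3, 1)]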

It is possible to use lossless coding on audio signals. Lossless coding allows the original PCM data to be reconstructed perfectly by the decoder and is therefore ‘noiseless’ since there is no effect on audio quality. The data reduction obtained using these methods ranges from nothing to about 2.5:1 and is variable depending on the program material. This is because audio signals have an unpredictable content, do not make use of a standard limited character set, and do not spend long periods of time in one binary state or the other. Although it is possible to perform this reduction in real time, the coding gains are not sufficient for many applications. Nonetheless, a halving in the average audio data rate is certainly a useful saving. A form of lossless data reduction known as Direct Stream Transfer (DST) can be used for Super Audio CD in order to fit the required multichannel audio data into the space available. A similar system was designed for DVD-Audio, called MLP (Meridian Lossless Packing), which evolved into Dolby TrueHD, used as the lossless coding scheme for the Blu-Ray disc.

‘Noisy’ or lossy coding methods make possible a far greater degree of data reduction, but require the designer and user to arrive at a compromise between the degree of data reduction and potential effects on sound quality. Here data reduction is achieved by coding the signal less accurately than in the original PCM format (using fewer bits per sample), thereby increasing quantizing noise, but with the intention that increases in noise will be ‘masked’ (made inaudible) by the signal. The original data is not reconstructed perfectly on decoding. The success of such techniques therefore relies on being able to model the characteristics of the human hearing process in order to predict the masking effect of the signal at any point in time — hence the common term ‘perceptual coding’ for this approach. Using detailed psychoacoustic models it is possible to code high-quality audio at rates under 100 kbit/s per channel with minimal effects on audio quality. Higher data rates, such as 192 kbit/s, can be used to obtain an audio quality that is demonstrably indistinguishable from the original PCM.

MPEG — an example of lossy coding

The following is a very brief overview of how one approach works, based on the technology involved in the MPEG (Moving Pictures Expert Group) standards.

As shown in Figure 8.34, the incoming digital audio signal is filtered into a number of narrow frequency bands. Parallel to this a computer model of the human hearing process (an auditory model) analyzes a short portion of the audio signal (a few milliseconds). This analysis is used to determine what parts of the audio spectrum will be masked, and to what degree, during that short time period. In bands where there is a strong signal, quantizing noise can be allowed to rise considerably without it being heard, because one signal is very efficient at masking another lower level signal in the same band as itself (see Figure 8.35). Provided that the noise is kept below the masking threshold in each band it should be inaudible.

image

FIGURE 8.34
Generalized block diagram of a psychoacoustic low bit rate coder.

image

FIGURE 8.35
Quantizing noise lying under the masking threshold will normally be inaudible

Blocks of audio samples in each narrow band are scaled (low-level signals are amplified so that they use more of the most significant bits of the range) and the scaled samples are then reduced in resolution (requantized) by reducing the number of bits available to represent each sample — a process that results in increased quantizing noise. The output of the auditory model is used to control the requantizing process so that the sound quality remains as high as possible for a given bit rate. The greatest number of bits is allocated to frequency bands where noise would be most audible, and the fewest to those bands where the noise would be effectively masked by the signal. Control information is sent along with the blocks of bit rate-reduced samples to allow them to be reconstructed at the correct level and resolution upon decoding.
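
A heavily simplified sketch of the scale-and-requantize step for one sub-band block (Python; real coders use standardized scale factor tables and bit allocation strategies, which are omitted here):

    import numpy as np

    def encode_block(block, bits):
        # Scale the block up to full range, then requantize it to the
        # number of bits allocated by the auditory model. The scale
        # factor travels as side information for the decoder.
        scale = float(np.max(np.abs(block))) or 1.0
        levels = 2 ** (bits - 1) - 1
        return np.round(block / scale * levels).astype(int), scale

    def decode_block(codes, scale, bits):
        levels = 2 ** (bits - 1) - 1
        return codes / levels * scale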

The above process is repeated every few milliseconds, so that the masking model is constantly being updated to take account of changes in the audio signal. Carefully implemented, such a process can result in a reduction of the original data rate to anything from about one-quarter to less than one-tenth. A decoder uses the control information transmitted with the bit rate-reduced samples to restore the samples to their correct level and can determine how many bits were allocated to each frequency band by the encoder, reconstructing linear PCM samples and then recombining the frequency bands to form a single output (see Figure 8.36). A decoder can be much less complex, and therefore cheaper, than an encoder, because it does not need to contain the auditory model.

image

FIGURE 8.36 Generalized block diagram of an MPEG-Audio decoder.

Table 8.2 MPEG-1 layers

Layer   Complexity   Min. delay   Bit rate range    Target
1       Low          19 ms        32–448 kbit/s     192 kbit/s
2       Moderate     35 ms        32–384 kbit/s*    128 kbit/s
3       High         59 ms        32–320 kbit/s     64 kbit/s

* In Layer 2, bit rates of 224 kbit/s and above are for stereo modes only.

A standard known as MPEG-1, published by the International Standards Organization (ISO 11172-3), defines a number of ‘layers’ of complexity for low bit rate audio coders as shown in Table 8.2. Each of the layers can be operated at any of the bit rates within the ranges shown (although some of the higher rates are intended for stereo modes) and the user must make appropriate decisions about what sound quality is appropriate for each application. The lower the data rate, the lower the sound quality that will be obtained. At high data rates the encoding-decoding process has been judged by many to be audibly ‘transparent’ — in other words listeners cannot detect that the coded and decoded signal is different from the original input. The target bit rates were for ‘transparent’ coding.

For many people ‘MP3’ is the name associated with downloading music files from the Internet. The term has caused some confusion: it is short for MPEG-1 Layer 3, but MP3 has virtually become a generic term for the system used for receiving compressed audio from the Internet. There is also MPEG-2, which can handle multichannel surround, and further developments in this and later systems will be briefly touched upon.

MPEG-2 BC (Backwards Compatible with MPEG-1) additionally supports sampling frequencies of 16 kHz, 22.05 kHz and 24 kHz, at bit rates from 32 to 256 kbit/s for Layer 1. For Layers 2 and 3, bit rates are from 8 to 160 kbit/s. Developments intended to supersede MPEG-2 BC have included MPEG-2 AAC (Advanced Audio Coding). This defines a standard for multichannel coding of up to 48 channels, with sampling rates from 8 kHz to 96 kHz. It also incorporates a Modified Discrete Cosine Transform (MDCT) system as used in the MiniDisc coding format (ATRAC). MPEG-2 AAC was not, however, designed to be backwards compatible with MPEG-1.

MPEG-4 ‘natural audio coding’ is based on the standards outlined for MPEG-2 AAC; it includes further coding techniques for reducing transmission bandwidth and it can scale the bit rate according to the complexity of the decoder. This is used in Apple’s iPod, for example. There are also intermediate levels of parametric representation in MPEG-4 such as used in speech coding, whereby speed and pitch of basic signals can be altered over time. One has access to a variety of methods of representing sound at different levels of abstraction and complexity, all the way from natural audio coding (lowest level of abstraction), through parametric coding systems based on speech synthesis and low-level parameter modification (see below), to fully synthetic audio objects.

When audio signals are described in the form of ‘objects’ and ‘scenes’, it requires that they be rendered or synthesized by a suitable decoder. Structured Audio (SA) in MPEG-4 enables synthetic sound sources to be represented and controlled at very low bit rates (less than 1 kbit/s). An SA decoder can synthesize music and sound effects. SAOL (Structured Audio Orchestra Language), as used in MPEG-4, was developed at MIT and is an evolution of CSound (a synthesis language used widely in the electroacoustic music and academic communities). It enables ‘instruments’ and ‘scores’ to be downloaded. The instruments define the parameters of a number of sound sources that are to be rendered by synthesis (e.g. FM, wavetable, granular, additive) and the ‘score’ is a list of control information that governs what those instruments play and when (represented in the SASL or Structured Audio Score Language format). This is rather like a more refined version of the established MIDI control protocol, and indeed MIDI can be used if required for basic music performance control. This is discussed further in Chapter 14.

Sound scenes, as distinct from sound objects, are usually made up of two elements — that is, the sound objects and the environment within which they are located. Both elements are integrated within one part of MPEG-4. This part of MPEG-4 uses so-called BIFS (Binary Format for Scenes) for describing the composition of scenes (both visual and audio). The objects are known as nodes and are based on VRML (virtual reality modeling language). So-called Audio BIFS can be post-processed and represents parametric descriptions of sound objects. Advanced Audio BIFS also enables virtual environments to be described in the form of perceptual room acoustics parameters, including positioning and directivity of sound objects. MPEG-4 audio scene description distinguishes between physical and perceptual representation of scenes, rather like the low- and high-level description information mentioned above.

Parametric audio coding

One variant on lossy low bit rate coding involves the encoding of audio signals in the form of a ‘core’ signal alongside a sparse stream of ‘parameters’ that describe one or more features needed to reconstruct an approximation of the original signals. The idea is to code a basic version of the original audio signal, which can be decoded by compatible decoders, and to transmit spatial or spectral enhancements in the form of much lower bit rate ‘side’ information. For example, parametric stereo coding transmits a basic mono channel plus side parameters that represent the inter-channel intensity, time and correlation differences that enable the spatial impression of the original two-channel stereo to be reconstructed at the decoder. MPEG Surround transmits a mono or stereo downmix of the original surround, plus side information to enable the surround spatial impression to be approximated upon decoding (see Figure 8.37). The downmix is encoded using a ‘legacy’ or conventional stereo coder such as MP3. The additional bit rate required for the side information is usually only a few kilobits per second, as opposed to the few hundred that might be needed to transmit the surround information as conventionally coded audio. This enables convincing surround to be transmitted at bit rates as low as 64 kbit/s.

image

FIGURE 8.37 Block diagram of MPEG Surround process showing (a) encoder, and (b) decoder. (Courtesy of Jeroen Breebaart)

Another example is MPEG AAC+ or HE-AAC, which employs a method known as Spectral Band Replication (SBR) to reconstruct the (untransmitted) high-frequency part of the spectrum in the decoder. This achieves greater coding efficiency by transmitting low bit rate side information to describe only the shape and tonality of the high frequency part of the original signal’s spectrum. The decoder attempts a synthesis of the missing high frequency content based on lower frequency parts of the spectrum and information in the side parameters. Psychoacoustically the brain does not seem to be as fussy about the precise detail of the very highest frequencies as it is about the lower parts of the spectrum, so this can sound quite convincing while saving a lot of bits.

Surround coding formats

Dolby Digital or AC-3 encoding was developed as a means of delivering 5.1-channel surround to cinemas or the home without the need for analog matrix encoding. The AC-3 coding algorithm can be used for a wide range of different audio signal configurations and bit rates, from 32 kbit/s for a single mono channel up to 640 kbit/s for surround signals. It is used widely for the distribution of digital sound tracks on 35 mm movie films, the data being stored optically in the space between the sprocket holes on the film.

It is sufficient to say here that the process involves a number of techniques by which the data representing audio from the source channels is transformed into the frequency domain and requantized to a lower resolution, relying on the masking characteristics of the human hearing process to hide the increased quantizing noise that results from this process. A common bit pool is used so that channels requiring higher data rates than others can trade their bit rate requirements provided that the overall total bit rate does not exceed the constant rate specified.

Aside from the representation of surround sound in a compact digital form, Dolby Digital includes a variety of operational features that enhance system flexibility and help adapt replay to a variety of consumer situations. These include dialog normalization (‘dialnorm’) and the option to include dynamic range control information alongside the audio data for use in environments where background noise prevents the full dynamic range of the source material being heard. Downmix control information can also be carried alongside the audio data in order that a two-channel version of the surround sound material can be reconstructed in the decoder. As a rule, Dolby Digital data is stored or transmitted with the highest number of channels needed for the end product to be represented and any compatible downmixes are created in the decoder. This differs from some other systems where a two-channel downmix is carried alongside the surround information.

Dialnorm indication can be used on broadcast and other material to ensure that the dialog level remains roughly constant from program to program. It is assumed that dialog level is the main factor governing the listening level used in people’s homes, and that they do not want to keep changing this as different programs come on the air (e.g. from advertising to news programs). The dialnorm level is the average dialog level over the duration of the program compared to the maximum level that would be possible, measured using an A-weighted LEQ reading (this averages the level linearly over time). So, for example, if the dialog level averaged at 70 dBA over the program, and the SPL corresponding to peak recording level was 100 dBA, the dialnorm setting would be −30 dB.

The original DTS (Digital Theater Systems) ‘Coherent Acoustics’ system is another digital signal coding format that can be used to deliver surround sound in consumer or professional applications, using low bit rate coding techniques to reduce the data rate of the audio information. The DTS system can accommodate a wide range of bit rates from 32 kbit/s up to 4.096 Mbit/s (somewhat higher than Dolby Digital), with up to eight source channels and with sampling rates up to 192 kHz. Variable bit rate and lossless coding are also optional. Downmixing and dynamic range control options are provided in the system. Because the maximum data rate is typically somewhat higher than that of Dolby Digital or MPEG, a greater margin can be engineered between the signal and any artefacts of low bit rate coding, leading to potentially higher sound quality. Such judgments, though, are obviously up to the individual and it is impossible to make blanket statements about comparative sound quality between systems.

SDDS stands for Sony Dynamic Digital Sound, and was the third of the main competing formats for digital film sound. Using Sony’s ATRAC data reduction system (also used on MiniDiscs), it too encodes audio data with a substantial saving in bit rate compared with the original PCM (about 5:1 compression). It is not as widely encountered as the other two and does not appear to be available in consumer decoders.

Of the MPEG multichannel coding formats, the MPEG-2 BC (backwards compatible) version worked by encoding a matrixed downmix of the surround channels and the center channel into the left and right channels of an MPEG-1 compatible frame structure. Although MPEG-2 BC was originally intended for use with DVD releases in Region 2 countries (primarily Europe), this requirement was dropped in favor of Dolby Digital. MPEG-2 AAC, on the other hand, is a more sophisticated algorithm that codes multichannel audio to create a single bit stream that represents all the channels, in a form that cannot be decoded by an MPEG-1 decoder. Having dropped the requirement for backwards compatibility, the bit rate can now be optimized by coding the channels as a group and taking advantage of interchannel redundancy if required. The MPEG-2 AAC system contained contributions from a wide range of different manufacturers. The parametric MPEG Surround standard was described briefly in the previous section.

Spatial audio object coding

The Spatial Audio Object Coding (SAOC) standard was published as MPEG-D Part 2 (ISO/IEC 23003-2) in 2010 (Part 1 of MPEG-D is MPEG Surround; Part 3 is Unified Speech and Audio Coding). It describes a user-controllable rendering of multiple audio objects based on transmission of a mono or stereo downmix of the object signals. SAOC encodes Object Level Differences (OLD), Inter-Object Cross Coherences (IOC) and Downmix Channel Level Differences (DCLD) into a parameter bitstream, and so does not encode the input audio signals discretely. At the decoder the user can place the audio objects in the desired positions and at different levels, or attenuate and replace them. An increase in level and/or repositioning of an object can also improve intelligibility with certain speaker layouts and environments. The SAOC bitstream is independent of loudspeaker configuration, and a default downmix option ensures backwards compatibility.

High resolution data-reduced formats

The increased interest in high resolution or ‘HD’ (high definition) audio distribution to consumers, either on Blu-Ray Disc (BD) or as downloads, has given rise to a number of data-reduced coding formats that are designed specifically for the purpose. Some are lossy (but not very lossy) and some are lossless.

Dolby’s TrueHD, based on Meridian Lossless Packing (MLP), is a lossless codec resulting in decoded quality that is identical to the studio master. It enables 7.1-channel playback on BD although it has the capacity to support more than 16 channels of audio. Operating at data rates of up to 18 Mbit/s, it supports the BD standard’s requirement for eight full range channels at 96 kHz/24 bits and up to 5.1 channels at 192 kHz/24 bits. An entirely separate artistic stereo mix can be carried if desired. Dolby makes a client-server-based encoder for its HD audio options as well as a standalone version.

Dolby Digital Plus is an extension to AC-3 (Dolby Digital), with higher data-rate options and shorter frames if required. It is designed to offer enhanced quality compared with Dolby Digital, running at data rates up to 6 Mbit/s, although the typical data rate on HD optical disks is said to be between 768 kbit/s and 1.5 Mbit/s. The data stream can be decoded by legacy receivers, which will only decode the Dolby Digital core at up to 640 kbit/s.

DTS offers two codecs that can be used for higher resolution audio on optical disks. Both are backwards compatible with the original DTS Digital Surround decoder because they are based on a lossy core plus extension model. Some other lossless formats take a similar form, for backwards compatibility, whereas others are lossless from the bottom up. DTS-HD High Resolution Audio offers data rates from 2 to 6 Mbit/s, giving quality that is not identical to the studio master but claimed to be close (it is still a lossy coding format). This version allows for a maximum of 7.1 channels at 96 kHz in a CBR (constant bit rate) stream. DTS-HD Master Audio operates at data rates up to 24.5 Mbit/s in a variable bit rate (VBR) stream, offering 7.1 channels at 96 kHz, or 5.1 at 192 kHz. This version is lossless, and therefore bit-for-bit compatible with the original master. The core coding, which works at up to 1509 kbit/s with 6.1 channels, is at a higher bit rate than typical DVD audio data rates, so non-HD players still get a quality increase. This data stream can be routed to legacy AV receivers using a SPDIF connection. A tool is available (Neural Upmix) that enables one to upmix creatively from 5.1 to surround formats with higher numbers of channels. The encoder enables one to set the downmix coefficients from surround to stereo. There is also a QC tool that enables one to hear the effect of converting 5.1 material to different loudspeaker layouts, such as nonstandard 7.1 speaker positions where there are side and rear pairs.

The Free Lossless Audio Codec (FLAC) is an open-source lossless coding option with data-reduction performance that is very similar to other codecs covered by IP rights. It is claimed to offer fast encoding and decoding with low complexity and is implemented in many software and hardware players used with downloaded audio files. Not all players will decode FLAC files at sampling frequencies above 48 kHz, and only a limited number will handle 192 kHz.

High Definition AAC (HD AAC) has a lossy core accompanied by a lossless extension that enables decoding to provide bit-for-bit compatibility with the original master recording. The AAC core part is compatible with existing decoders in mobile devices such as the iPod and iTunes. It can operate at sampling rates up to 192 kHz and at 24-bit resolution.

RECOMMENDED FURTHER READING

Bosi, M., Goldberg, R., 2003. Introduction to Digital Audio Coding and Standards. Springer.

Pohlmann, K., 2010. Principles of Digital Audio. TAB Electronics.

Watkinson, J., 2001. The Art of Digital Audio, third edition. Focal Press.

Zölzer, U., 2008. Digital Audio Signal Processing. Wiley Blackwell.
