2  Digital audio principles

This chapter explains the fundamental principles of digital audio as they apply in computers. The aim is to aid understanding of the inner workings of equipment so that appropriate operational and technical decisions can be made.

2.1 Analog and digital information

The human senses deal mainly with analog information but computers deal internally with digital information, resulting in a need for conversion between one domain and the other at various points.

Analog information is made up of a continuum of values of some physical quantity, which at any instant may have any value between the limits of the system. For example, a rotating knob may have one of an infinite number of positions – it is therefore an analog controller (see Figure 2.1). A simple switch, on the other hand, can be considered as a digital controller, since it has only two positions – off or on. It cannot take any value in between. The brightness of light that we perceive with our eyes is analog information and as the sun goes down the brightness falls gradually and smoothly, whereas a household light without a dimmer may be either on or off – its state is binary (that is, it has only two possible states). A single unit of binary information is called a bit (binary digit) and a bit can only have the value one or zero (corresponding, say, to high and low, or on and off states of the electrical signal).

Electrically, analog information may be represented as a varying voltage or current. If the rotary knob of Figure 2.1 is used to control a variable resistor connected to a voltage supply, its position will affect the output voltage (see Figure 2.2). This, like the knob’s position, may occupy any value between the limits – in this case anywhere between 0 V and +V. The switch could be used to control a similar voltage supply and in this case the output voltage could only be either 0 V or +V. In other words the electrical information that resulted would be binary. The high (+V) state could be said to correspond to a binary one and the low state to binary zero (although in many real cases it is actually the other way around). One switch can represent only one binary digit (or bit) but most digital information is made up of more than one bit, allowing digital representations of a number of fixed values.

image

Figure 2.1 (a) A continuously variable control such as a rotary knob is an analog controller. (b) A two-way switch is a digital controller

image

Figure 2.2 Electrical representation of analog and digital information. The rotary controller of Figure 2.1(a) could adjust a variable resistor, producing a voltage anywhere between the limits of 0 and +V, as shown in (a). The switch connected as shown in (b) allows the selection of either 0 or +V states at the output

image

Figure 2.3 When noise is added to an analog signal, as shown at (a), it is not possible for a receiver to know what is the original signal and what is the unwanted noise. With the binary signal, as shown at (b), it is possible to extract the original information even when noise has been added. Everything above the decision level is high and everything below it is low

Analog information in an electrical form can be converted into a digital electrical form using a device known as an analog-to-digital (A/D) convertor. This must be done if the information is to be handled by any logical system such as a computer. The process will be described later. The output of an A/D convertor is a series of binary numerical values representing the analog voltage as accurately as possible, at discrete points in time (sampling instants).

Digital information made up of binary digits is inherently more resilient to noise and interference than analog information, as shown in Figure 2.3. If noise is added to an analog signal it becomes very difficult to tell what is the wanted signal and what is the unwanted noise, as there is no means of distinguishing between the two. If noise is added to a binary signal it is possible to extract the important information at a later stage, as it is known that only two states matter – the high and low, or one and zero states. By comparing the signal amplitude with a fixed decision point it is possible for a receiver to treat everything above the decision point as ‘high’ and everything below it as ‘low’. Any levels in between can be classified in the nearest direction. For any noise or interference to influence the state of a digital signal it must be at least large enough in amplitude to cause a high level to be interpreted as ‘low’, or vice versa.

The timing of digital signals may also be corrected to some extent, giving digital signals another advantage over analog ones. This arises because digital information has a discrete time structure in which the intended sample instants are known. If the timing of bits in a digital message becomes unstable, such as after having been passed over a long cable with its associated signal distortions, resulting in timing ‘jitter’, the signal may be re-clocked at a stable rate. There is no equivalent way of removing unwanted speed or timing distortions from analog signals because they have a time-continuous structure.

2.2 Binary number systems

2.2.1 Basic binary

In the decimal number system each digit of a number represents a power of ten. In a binary system each digit or bit represents a power of two (see Figure 2.4). It is possible to calculate the decimal equivalent of a binary integer (whole number) by using the method shown. A number made up of more than one bit is called a binary ‘word’, and an 8-bit word is called a ‘byte’ (from ‘by eight’). Four bits is called a ‘nibble’. The more bits there are in a word the larger the number of states it can represent, with 8 bits allowing 256 (2⁸) states and 16 bits allowing 65 536 (2¹⁶). The bit with the lowest weight (2⁰) is called the least significant bit or LSB and that with the greatest weight is called the most significant bit or MSB. The term kilobyte or Kbyte is used to mean 1024 or 2¹⁰ bytes and the term megabyte or Mbyte represents 1024 Kbytes.
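As an illustration of the method shown in Figure 2.4, the following minimal Python sketch computes the decimal equivalent of a binary word by accumulating power-of-two weights:

```python
# Decimal equivalent of a binary word: each bit contributes its power-of-two weight.
def binary_to_decimal(bits):
    # bits is a string such as '1011', written MSB first
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)   # shift left one place, then add the new bit
    return value

print(binary_to_decimal('1011'))   # 1x8 + 0x4 + 1x2 + 1x1 = 11
print(2 ** 8, 2 ** 16)             # states available to 8- and 16-bit words: 256 65536
```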

image

Figure 2.4 (a) A binary number (word or ‘byte’) consists of bits. (b) Each bit represents a power of two. (c) Binary numbers can be represented electrically in pulse code modulation (PCM) by a string of high and low voltages

Electrically it is possible to represent a binary word in either serial or parallel form. In serial communication only one connection need be used and the word is clocked out one bit at a time using a device known as a shift register. The shift register is previously loaded with the word in parallel form (see Figure 2.5). The rate at which the serial data is transferred depends on the rate of the clock. In parallel communication each bit of the word is transferred over a separate connection.

Because binary numbers can become fairly unwieldy when they get long, various forms of shorthand are used to make them more manageable. The most common of these is hexadecimal. The hexadecimal system represents decimal values from 0 to 15 using the sixteen symbols 0–9 and A–F, according to Table 2.1. Each hexadecimal digit corresponds to four bits or one nibble of the binary word. An example showing how a long binary word may be written in hexadecimal (hex) is shown in Figure 2.6 – it is simply a matter of breaking the word up into 4-bit chunks and converting each chunk to hex. Similarly, a hex word can be converted to binary by using the reverse process.
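Converting to hex amounts to looking up each nibble in Table 2.1, which can be sketched in a few lines:

```python
# Convert a binary word to hexadecimal, one 4-bit nibble at a time (Table 2.1).
def binary_to_hex(bits):
    assert len(bits) % 4 == 0, "pad the word to a whole number of nibbles"
    return ''.join('0123456789ABCDEF'[int(bits[i:i + 4], 2)]
                   for i in range(0, len(bits), 4))

print(binary_to_hex('1010111100000011'))   # -> 'AF03'
```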

image

Figure 2.5 A shift register is used to convert a parallel binary word into a serial format. The clock is used to shift the bits one at a time out of the register, and its frequency determines the bit rate. The data may be clocked out of the register either MSB or LSB first, depending on the device and its configuration

Table 2.1 Hexadecimal and decimal equivalents to binary numbers

Binary Hexadecimal Decimal
0000 0 0
0001 1 1
0010 2 2
0011 3 3
0100 4 4
0101 5 5
0110 6 6
0111 7 7
1000 8 8
1001 9 9
1010 A 10
1011 B 11
1100 C 12
1101 D 13
1110 E 14
1111 F 15

image

Figure 2.6 This 16-bit binary number may be represented in hexadecimal as shown, by breaking it up into 4-bit nibbles and representing each nibble as a hex digit

Hexadecimal numbers are often labelled with the prefix ‘&’ to distinguish them from other forms of notation.

2.2.2 Negative numbers

Negative integers are usually represented in a form known as ‘twos complement’. Negative values are represented by taking the positive equivalent, inverting all the bits and adding a one. Thus to obtain the 4-bit binary equivalent of decimal minus five (–5₁₀) in binary twos complement form:

5₁₀ = 0101₂

–5₁₀ = 1010 + 0001 = 1011₂

Twos complement numbers have the advantage that the MSB represents the sign (1 = negative, 0 = positive) and that arithmetic may be performed on positive and negative numbers giving the correct result:

e.g. (in decimal):

    5
+ (–3)
 =  2

or (in binary):

  0101
+ 1101
= 0010

The carry bit that may result from adding the two MSBs is ignored.
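The invert-and-add-one rule, and the addition above, are easy to verify in code. A minimal Python sketch using 4-bit words, where masking with 0b1111 confines results to the word length and discards the carry:

```python
BITS = 4
MASK = (1 << BITS) - 1            # 0b1111: keeps results within a 4-bit word

def negate(n):
    # Twos complement negation: invert all the bits, then add one.
    return (~n + 1) & MASK

print(format(negate(0b0101), '04b'))            # -5 represented as 1011
print(format((0b0101 + 0b1101) & MASK, '04b'))  # 5 + (-3) -> 0010, carry discarded
```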

Figure 2.7 shows an example of 4-bit twos complement numbers arranged in a circular fashion. It will be seen that the binary value changes from all zeros to all ones as it crosses the zero point and that the maximum positive value is 0111 whilst the maximum negative value is 1000, so the values wrap around from maximum positive to maximum negative.

image

Figure 2.7 Negative numbers represented in twos complement form create a continuum of values where maximum positive wraps round to maximum negative, and bits change from all zeros to all ones at the zero crossing point

image

Figure 2.8 An example of floating point number representation in a binary system

2.2.3 Fixed- and floating-point representation

Fixed-point binary numbers are often used in digital audio systems to represent sample values. These are usually integer values represented by a number of bytes (2 bytes for 16-bit samples, 3 bytes for 24-bit samples, etc.). In some applications it is necessary to represent numbers with a very large range, or in a fractional form. Here floating-point representation may be used. A typical floating-point binary number might consist of 32 bits, arranged as 4 bytes, as shown in Figure 2.8. Three bytes are used to represent the mantissa and one byte the exponent (although the choice of the number of bits for the exponent and mantissa is open to variation depending on the application). The mantissa is the main part of the numerical value and the exponent determines the power of two by which the mantissa is multiplied. The MSB of the exponent is used to represent its sign, and the same applies to the mantissa.

It is normally more straightforward to perform arithmetic processing operations on fixed-point numbers than on floating-point numbers, but signal processing devices are available in both forms.
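The principle of a mantissa scaled by a power-of-two exponent can be illustrated with Python’s math.frexp. This is only a sketch of the idea: practical 32-bit formats pack the fields differently (IEEE 754, for example, uses a biased exponent rather than a separate exponent sign bit).

```python
import math

value = -12.375
mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent
print(mantissa, exponent)                # -0.7734375 4
print(math.ldexp(mantissa, exponent))    # reconstructs -12.375
```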

2.2.4 Logical operations

Most of the apparently complicated processing operations that occur within a computer are actually just a fast sequence of simple logical operations. The apparent power of the computer and its ability to perform complex tasks are really due to the speed with which simple operations are performed.

image

Figure 2.9 Symbols and truth tables for basic logic functions. The inverter shown on the right has an output which is always the opposite of the input. The circle shown on the inverter’s output is used to signify inversion on any input or output of a logic gate

The basic family of logical operations is shown in Figure 2.9 in the form of a truth table next to the electrical symbol that represents each ‘logic gate’. The AND operation gives an output only when both its inputs are true; the OR operation gives an output when either of its inputs is true; and the XOR (exclusive OR) gives an output only when exactly one of its inputs is true. The inverter or NOT gate gives an output which is the opposite of its input, and inversion is often symbolised using a small circle on the inputs or outputs of devices.
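These truth tables map directly onto bitwise operators, as the following sketch shows:

```python
# Truth tables for the basic logic functions of Figure 2.9.
print("A B | AND OR XOR NOT(A)")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "|", a & b, " ", a | b, " ", a ^ b, " ", 1 - a)
```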

2.3 Basic A/D and D/A conversion of control information

In order to convert analog information into digital information it is necessary to measure its amplitude at specific points in time (called ‘sampling’) and to assign a binary value to each measurement (called ‘quantising’). The diagram in Figure 2.10 shows a rotary knob against a fixed scale running from 0 to 9. If one were to quantise the position of the knob it would be necessary to determine which point of the scale it was nearest to, and unless the pointer was exactly at one of the increments the quantising process would involve a degree of error. It will be seen that the maximum error is actually plus or minus half of an increment, since once the pointer is more than halfway between one increment and the next it should be quantised to the next.

Quantising error is an inevitable side effect in the process of A/D conversion and the degree of error depends on the quantising scale used. Considering binary quantisation, a 4-bit scale offers 16 possible steps, an 8-bit scale offers 256 steps, and a 16-bit scale 65 536. The more bits, the more accurate the process of quantisation.
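A minimal sketch of the knob example, quantising a position (expressed as a fraction of full travel) to the nearest of the ten scale values of Figure 2.10:

```python
def quantise_position(position, scale_points=10):
    # Map a position in the range 0..1 to the nearest scale value (0-9).
    step = 1.0 / (scale_points - 1)
    nearest = round(position / step)
    error = position - nearest * step     # never exceeds +/- half an increment
    return nearest, error

print(quantise_position(0.37))   # -> (3, ~0.037): quantised to scale value 3
```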

image

Figure 2.10 A rotary knob’s position could be measured against a numbered scale such as the decimal scale shown. Quantising the knob’s position would involve deciding which of the limited number of values (0–9) most closely represented the true position

image

Figure 2.11 In older equipment a control’s position was digitised by sampling and quantising an analog voltage derived from a variable resistor connected to the control knob

In older systems, the position of an analog control was first used to derive an analog voltage (as shown earlier in Figure 2.2), then that voltage was converted into a digital value using an A/D convertor (see Figure 2.11). More recent controls may be in the form of binary encoders whose output is immediately digital. Unlike analog controls, switches do not need the services of an A/D convertor for their outputs to be useable by a computer – a switch’s output is normally binary in the first place. Only one bit is needed to represent the position of a simple switch.

The rate at which switches and analog controls are sampled depends very much on how important it is that they are updated regularly. Some older audio mixing consoles sampled the positions of automated controls once per television frame (40 ms in Europe), whereas some modern digital mixers sample controls as often as once per audio sample period (roughly 20 μs). Clearly the more regularly a control is sampled the more data will be produced, since there will be one binary value per sample.

Digital-to-analog conversion is the reverse process and involves taking the binary value that represents one sample and converting it back into an electrical voltage. In a control system this voltage could then be used to alter the gain of a voltage-controlled amplifier (VCA), for example, as shown in Figure 2.12. Alternatively it may not be necessary to convert the word back to an analog voltage at all. Many systems are entirely digital and can use the binary value derived from a control’s position as a multiplier in a digital signal processing operation. A signal processing operation may be designed to emulate an analog control process.

image

Figure 2.12 A D/A convertor could be used to convert a binary value representing a control’s position into an analog voltage. This could then be used to alter the gain of a voltage-controlled amplifier (VCA)

2.4 A/D conversion of audio signals

The process of A/D conversion is of paramount importance in determining the inherent sound quality of a digital audio signal. The technical quality of the audio signal, once converted, can never be made any better, only worse. Some applications deal with audio purely in the digital domain, in which case A/D conversion is not an issue, but most operations involve the acquisition of audio material from the analog world at one time or another. The quality of convertors varies very widely in digital audio workstations and their peripherals because the price range of such workstations is also great. Some stand-alone professional convertors can easily cost as much as the complete digital audio hardware and software for a desktop computer. Audio A/D convertors are now built into many multimedia desktop computers, but these are often rather low-performance devices when compared with the best available. As will be seen below, the sampling rate and the number of bits per sample are the main determinants of the quality of a digital audio signal, but the design of the convertors determines how closely the sound quality approaches the theoretical limits.

Despite the above, it must be admitted that to the undiscerning ear one 16-bit convertor sounds very much like another and that there is a law of diminishing returns when one compares the increased cost of good convertors with the perceivable improvement in quality. Convertors are very much like wine in this respect.

2.4.1 Audio sampling

An analog audio signal is a time-continuous electrical waveform and the A/D convertor’s task is to turn this signal into a time-discrete sequence of binary numbers. The sampling process employed in an A/D convertor involves the measurement or ‘sampling’ of the amplitude of the audio waveform at regular intervals in time (see Figure 2.13). From this diagram it will be clear that the sample pulses represent the instantaneous amplitudes of the audio signal at each point in time. The samples can be considered as like instantaneous ‘still frames’ of the audio signal which together and in sequence form a representation of the continuous waveform, rather as the still frames that make up a movie film give the impression of a continuously moving picture when played in quick succession.

In order to represent the fine detail of the signal it is necessary to take a large number of these samples per second. The mathematical sampling theorem proposed by Shannon indicates that at least two samples must be taken per audio cycle if the necessary information about the signal is to be conveyed. It can be seen from Figure 2.14 that if too few samples are taken per cycle of the audio signal then the samples may be interpreted as representing a wave other than that originally sampled. This is one way of understanding the phenomenon known as aliasing. An ‘alias’ is an unwanted representation of the original signal that arises when the sampled signal is reconstructed during D/A conversion.

image

Figure 2.13 An arbitrary audio signal is sampled at regular intervals of time t to create short sample pulses whose amplitudes represent the instantaneous amplitude of the audio signal at each point in time

image

Figure 2.14 In the upper example many samples are taken per cycle of the wave. In the lower example fewer than two samples are taken per cycle, making it possible for another lower-frequency wave to be reconstructed from the samples. This is one way of viewing the problem of aliasing

image

Figure 2.15 In pulse amplitude modulation, the instantaneous amplitude of the sample pulses is modulated by the audio signal amplitude (positive values only shown)

Another way of visualising the sampling process is to consider it in terms of modulation, as shown in Figure 2.15. The continuous audio waveform is used to modulate a regular chain of pulses. The frequency of these pulses is the sampling frequency. Before modulation all these pulses have the same amplitude (height), but after modulation the amplitude of the pulses is modified according to the instantaneous amplitude of the audio signal at that point in time. This process is known as pulse amplitude modulation (PAM). The frequency spectrum of the modulated signal is as shown in Figure 2.16. It will be seen that in addition to the ‘baseband’ audio signal (the original audio spectrum before sampling) there are now a number of additional images of this spectrum, each centred on multiples of the sampling frequency. Sidebands have been produced either side of the sampling frequency and its multiples, as a result of the amplitude modulation, and these extend above and below the sampling frequency and its multiples to the extent of the base bandwidth. In other words these sidebands are pairs of mirror images of the audio baseband.

image

Figure 2.16 The frequency spectrum of a PAM signal consists of a number of repetitions of the audio baseband signal reflected on either side of multiples of the sampling frequency

2.4.2 Filtering and aliasing

From Figure 2.17 it is relatively easy to see why the sampling frequency must be at least twice the highest baseband audio frequency. It can be seen that an extension of the baseband above the Nyquist frequency results in the lower sideband of the first spectral repetition overlapping the upper end of the baseband and thus appearing within the audible range that would be reconstructed by a D/A convertor. Two further examples are shown to illustrate the point – the first in which a baseband tone has a low enough frequency for the sampled sidebands to lie above the audio frequency range, and the second in which a much higher frequency tone causes the lower sampled sideband to fall well within the baseband, forming an alias of the original tone that would be perceived as an unwanted component in the reconstructed audio signal.
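The sideband positions follow from simple arithmetic: sampling places image components at n·fs ± f, and any component falling below fs/2 appears as an audible alias. A quick check using the values from Figure 2.17:

```python
fs = 30000.0                    # sampling frequency used in Figure 2.17 (Hz)
for f in (1000.0, 17000.0):     # the two example tones
    lower, upper = fs - f, fs + f
    verdict = "alias in the audio band" if lower < fs / 2 else "above the audio band"
    print(f, "Hz -> first sidebands at", lower, "and", upper, "Hz:", verdict)
# 1 kHz gives 29/31 kHz (inaudible); 17 kHz gives an audible 13 kHz alias
```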

image

Figure 2.17 Aliasing viewed in the frequency domain. In (a) the audio baseband extends up to half the sampling frequency (the Nyquist frequency fn) and no aliasing occurs. In (b) the audio baseband extends above the Nyquist frequency and consequently overlaps the lower sideband of the first spectral repetition, giving rise to aliased components in the shaded region. In (c) a tone at 1 kHz is sampled at a sampling frequency of 30 kHz, creating sidebands at 29 and 31 kHz (and at 59 and 61 kHz, etc.). These are well above the normal audio frequency range, and will not be audible. In (d) a tone at 17 kHz is sampled at 30 kHz, putting the first lower sideband at 13 kHz – well within the normal audio range. The 13 kHz sideband is said to be an alias of the original wave

The aliasing phenomenon can be seen in the case of the well-known ‘spoked-wheel’ effect in films, since moving pictures are also an example of a sampled signal. In film, still pictures (image samples) are normally taken at a rate of 24 per second. If a rotating wheel with a marker on it is filmed it will appear to move round in a forward direction as long as the rate of rotation is much slower than the rate of the still photographs, but as its rotation rate increases it will appear to slow down, stop, and then appear to start moving backwards. This illusion of backwards motion gets faster as the rate of rotation of the wheel increases, and it is the aliased result of sampling at too low a rate. Clearly the wheel is not really rotating backwards, it just appears to be. Perhaps ideally one would arrange to filter out moving objects that were rotating faster than half the frame rate of the film, but this is hard to achieve in practice and visible aliasing does not seem to be as annoying subjectively as audible aliasing.

If audio signals are allowed to alias in digital recording one hears the audible equivalent of the backwards-rotating wheel – that is, sound components in the audible spectrum that were not there in the first place, moving downwards in frequency as the original frequency of the signal increases. In basic convertors, therefore, it is necessary to filter the baseband audio signal before the sampling process, as shown in Figure 2.18, so as to remove any components having a frequency higher than half the sampling frequency. It is therefore clear that in practice the choice of sampling frequency governs the high frequency limit of a digital audio system.

In real systems, and because filters are not perfect, the sampling frequency is usually made higher than twice the highest audio frequency to be represented, allowing for the filter to roll off more gently. The filters incorporated into both D/A and A/D convertors have a pronounced effect on sound quality, since they determine the linearity of the frequency response within the audio band, the slope with which it rolls off at high frequency and the phase linearity of the system. In a non-oversampling convertor, the filter must reject all signals above half the sampling frequency with an attenuation of at least 80 dB. Steep filters tend to have an erratic phase response at high frequencies and may exhibit ‘ringing’ due to the high ‘Q’ of the filter. Steep filters also have the added disadvantage that they are complicated to produce. Although filter effects are unavoidable to some extent, manufacturers have made considerable improvements to analog anti-aliasing and reconstruction filters and these may be retro-fitted to many existing systems with poor filters. A positive effect is normally noticed on sound quality.

image

Figure 2.18 In simple A/D convertors an analog anti-aliasing filter is used prior to conversion, which removes input signals with a frequency above the Nyquist limit

The process of oversampling and the use of higher sampling frequencies (see below) has helped to ease the problems of such filtering. Here the first repetition of the baseband is shifted to a much higher frequency, allowing the use of a shallower anti-aliasing filter and consequently fewer audible side effects.

2.4.3 Quantisation

After sampling, the modulated pulse chain is quantised. In quantising a sampled audio signal the range of sample amplitudes is mapped onto a scale of stepped values, as shown in Figure 2.19. The quantiser determines which of a fixed number of quantising intervals (of size Q) each sample lies within and then assigns it a value that represents the mid-point of that interval. This is done in order that each sample amplitude can be represented by a unique binary number in pulse code modulation (PCM), the term for the form of modulation in which signals are represented as a sequence of sampled and quantised binary data words. In linear quantising each quantising step represents an equal increment of signal voltage.

The quantising error magnitude will be a maximum of plus or minus half the amplitude of one quantising step and a greater number of bits per sample will therefore result in a smaller error (see Figure 2.20), provided that the analog voltage range represented remains the same.
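A minimal sketch of a linear quantiser, assuming a full-scale range of ±1 V and twos complement output codes; note the hard clip at the ends of the number range, which anticipates the 0 dBFS behaviour described below:

```python
import numpy as np

def quantise(samples, bits):
    # Linear quantisation of voltages in -1..+1 to n-bit twos complement codes.
    q = 2.0 / (1 << bits)                          # one quantising interval Q
    codes = np.round(np.asarray(samples) / q).astype(int)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    codes = np.clip(codes, lo, hi)                 # hard clip at full scale
    return codes, codes * q                        # codes and reconstructed voltages

codes, recon = quantise([0.50, -0.26, 0.999], bits=3)
print(codes)   # [ 2 -1  3]: errors at most +/- Q/2 unless the signal clips
print(recon)   # [ 0.5  -0.25  0.75]
```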

image

Figure 2.19 When a signal is quantised, each sample is mapped to the closest quantising interval Q, and given the binary value assigned to that interval. (Example of a 3-bit quantiser shown.) On D/A conversion each binary value is assumed to represent the voltage at the mid point of the quantising interval

image

image

Figure 2.20 In (a) a 3-bit scale is used and only a small number of quantising intervals cover the analog voltage range, making the maximum quantising error quite large. The second sample in this picture will be assigned the value 010, for example, the corresponding voltage of which is somewhat higher than that of the sample. During D/A conversion the binary sample values from (a) would be turned into pulses with the amplitudes shown in (b), where many samples have been forced to the same level owing to quantising. In (c) the 4-bit scale means that a larger number of intervals is used to cover the same range and the quantising error is reduced (expanded positive range only shown for clarity)

Figure 2.21 shows the binary number range covered by digital audio signals at different resolutions using the usual twos complement hexadecimal representation. It will be seen that the maximum positive sample value of a 16-bit signal is &7FFF, whilst the maximum negative value is &8000. The sample value changes from all zeros (&0000) to all ones (&FFFF) as it crosses the zero point. The maximum digital signal level is normally termed 0 dBFS (FS = full scale). Signals rising above this level are normally hard-clipped, resulting in severe distortion, as shown in Figure 2.22.

image

Figure 2.21 Binary number ranges (in hexadecimal) related to analog voltage ranges for different convertor resolutions, assuming twos complement representation of negative values. (a) 8-bit quantiser, (b) 16-bit quantiser, (c) 20-bit quantiser

image

Figure 2.22 Signals exceeding peak level in a digital system are hard-clipped, since no more digits are available to represent the sample value

2.4.4 Relationship between sample resolution and sound quality

The quantising error may be considered as an unwanted signal added to the wanted signal, as shown in Figure 2.23. Unwanted signals tend to be classified either as distortion or noise, depending on their characteristics, and the nature of the quantising error signal depends very much upon the level and nature of the related audio signal. Here are a few examples, the illustrations for which have been prepared in the digital domain for clarity, using 16-bit sample resolution.

First consider a very low level sine wave signal, sampled then quantised, having a level only just sufficient to turn the least significant bit of the quantiser on and off at its peak (see Figure 2.24(a)). Such a signal would have a quantising error that was periodic, and strongly correlated with the signal, resulting in harmonic distortion. Figure 2.24(b) shows the frequency spectrum, analysed in the digital domain, of such a signal, showing clearly the distortion products (predominantly odd harmonics) in addition to the original fundamental. Once the signal falls below the level at which it just turns on the LSB there is no modulation. The audible result, therefore, of fading such a signal down to silence is that of an increasingly distorted signal suddenly disappearing. A higher-level sine wave signal would cross more quantising intervals and result in more non-zero sample values. As the signal level rises the quantising error, still with a maximum value of ±0.5Q, becomes increasingly small as a proportion of the total signal level and the error gradually loses its correlation with the signal.

image

Figure 2.23 Quantising error depicted as an unwanted signal added to the original sample values. Here the error is highly correlated with the signal and will appear as distortion. (Courtesy of Allen Mornington West)

Consider now a music signal of reasonably high level. Such a signal has widely varying amplitude and spectral characteristics and consequently the quantising error is likely to have a more random nature. In other words it will be more noise-like than distortion-like, hence the term quantising noise that is often used to describe the audible effect of quantising error. An analysis of the power of the quantising error, assuming that it has a noise-like nature, shows that it has an r.m.s. amplitude of Q/√12, where Q is the voltage increment represented by one quantising interval. Consequently the signal-to-noise ratio of an ideal n-bit quantised signal can be shown to be:

6.02n + 1.76 dB

This implies a theoretical S/N ratio that approximates to just over 6 dB per bit. So a 16-bit convertor might be expected to exhibit a S/N ratio of around 98 dB, and an 8-bit convertor around 50 dB. This assumes an undithered convertor, which is not the normal case, as described below. If a convertor is undithered there will only be quantising noise when a signal is present, but there will be no quiescent noise floor in the absence of a signal. Issues of dynamic range with relation to human hearing are discussed further in Section 2.6.
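A worked check of this rule, comparing the r.m.s. level of a full-scale sine wave (Q·2ⁿ/(2√2)) with the Q/√12 quantising noise floor:

```python
import math

def ideal_snr_db(bits):
    # 20*log10( full-scale sine r.m.s. / quantising noise r.m.s. ) = 6.02n + 1.76 dB
    return 20 * math.log10((2 ** bits / (2 * math.sqrt(2))) * math.sqrt(12))

for n in (8, 16, 20, 24):
    print(n, "bits:", round(ideal_snr_db(n), 1), "dB")
# 8 -> 49.9, 16 -> 98.1, 20 -> 122.2, 24 -> 146.3
```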

2.4.5 Use of dither

The use of dither in A/D conversion, as well as in conversion between one sample resolution and another, is now widely accepted as correct. It has the effect of linearising a normal convertor (in other words it effectively makes each quantising interval the same size) and turns quantising distortion into a random, noise-like signal at all times. This is desirable for a number of reasons. Firstly because white noise at very low level is less subjectively annoying than distortion; secondly because it allows signals to be faded smoothly down without the sudden disappearance noted above; and thirdly because it often allows signals to be reconstructed even when their level is below the noise floor of the system. Undithered audio signals begin to sound ‘grainy’ and distorted as the signal level falls. Quiescent hiss will disappear if dither is switched off, making a system seem quieter, but a small amount of continuous hiss is considered preferable to low level distortion. The resolution of modern high resolution convertors is such that the noise floor is normally inaudible in any case.

image

Figure 2.24 (a) A 1 kHz sine wave at very low level (amplitude ±1 LSB) just turns the least significant bit of the quantiser on and off. Analysed in the digital domain with sample values shown in hex on the vertical axis and time in ms on the horizontal axis. (b) Frequency spectrum of this quantised sine wave, showing distortion products

Dithering a convertor involves the addition of a very low level signal to the audio whose amplitude depends upon the type of dither employed (see below). The dither signal is usually noise, but may also be a waveform at half the sampling frequency or a combination of the two. A signal that has not been correctly dithered during the A/D conversion process cannot thereafter be dithered with the same effect, because the signal will have been irrevocably distorted. How then does dither perform the seemingly remarkable task of removing quantising distortion?

It was stated above that the distortion was a result of the correlation between the signal and the quantising error, making the error periodic and subjectively annoying. Adding noise, which is a random signal, to the audio has the effect of randomising the quantising error and making it noise-like as well (shown in Figure 2.25(a) and (b)). If the noise has an amplitude similar in level to the LSB (in other words, one quantising step) then a signal lying exactly at the decision point between one quantising interval and the next may be quantised either upwards or downwards, depending on the instantaneous level of the dither noise added to it. Over time this random effect is averaged, leading to a noise-like quantising error and a fixed noise floor in the system.
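The averaging effect can be demonstrated numerically. In this sketch (assuming a 48 kHz sampling rate and TPDF dither of 2Q peak-to-peak, described in the next section) a sine wave whose peak lies below half a quantising interval vanishes entirely without dither, yet survives the quantiser when dither is added:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = 1.0                                            # one quantising interval
t = np.arange(8000) / 48000.0                      # assumed 48 kHz timebase
signal = 0.3 * Q * np.sin(2 * np.pi * 1000 * t)    # peak level below Q/2

undithered = np.round(signal / Q) * Q              # every sample rounds to zero
dither = rng.uniform(-Q / 2, Q / 2, t.size) + rng.uniform(-Q / 2, Q / 2, t.size)
dithered = np.round((signal + dither) / Q) * Q

print(np.abs(undithered).max())        # 0.0: the undithered signal has disappeared
ref = np.sin(2 * np.pi * 1000 * t)
print(2 * np.mean(dithered * ref))     # ~0.3: the 1 kHz component survives as signal plus noise
```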

image

Figure 2.25 (a) Dither noise added to a sine wave signal prior to quantisation. (b) Post-quantisation the error signal is now random and noise-like. (Courtesy of Allen Mornington West)

Figure 2.26(a) shows the same low-level sine wave as in Figure 2.24, but this time with dither noise added. The quantised signal retains the cyclical pattern of the 1 kHz sine wave but is now modulated much more frequently between states, and a random element has been added. The frequency spectrum of this signal, Figure 2.26(b), shows a single sine wave component accompanied by a flat noise floor. Figure 2.26(c) and (d) show the waveform and spectrum of a dithered sine wave at a level that would be impossible to represent in an undithered 16-bit system. In this case the LSB is in the zero state much more frequently than the one state, but an element of the original 1 kHz period can still be seen in its modulation pattern if studied carefully. The duty cycle of the LSB modulation (ratio between time on and time off) varies with the instantaneous amplitude of the original signal. When this is passed through a D/A convertor and reconstruction filter the result is a pure sine wave signal plus noise, as can be seen from the spectrum analysis.

image

image

Figure 2.26 (a) 1 kHz sine wave, amplitude ±1 LSB, with dither added, analysed in the digital domain. (b) Spectrum of this dithered low level sine wave showing lack of distortion and flat noise floor. (c) 1 kHz sine wave at a level of –104 dBFS with dither, showing occasional modulation of LSB. (d) Spectrum of this signal showing that it is still possible to discern the original signal. An undithered 16-bit system would be incapable of representing a signal below about –97 dBFS

Dither is also used in digital processing devices such as mixers, but in such cases it is introduced in the digital domain as a random number sequence (the digital equivalent of white noise). In this context it is used to remove low-level distortion in signals whose gains have been altered and to optimise the conversion from high resolution to lower resolution during post-production (see below).

2.4.6 Types of dither

Research has shown that certain types of dither signal are more suitable than others for high quality audio work. Dither noise is often characterised in terms of its probability distribution, which is a statistical method of showing the likelihood of the signal having a certain amplitude. A simple graph such as that shown in Figure 2.27 is used to indicate the shape of the distribution. The probability is the vertical axis and the amplitude in terms of quantising steps is the horizontal axis.

Logical probability distributions can be understood simply by thinking of the way in which dice fall when thrown (see Figure 2.28). A single die throw has a rectangular probability distribution function (RPDF), because there is an equal chance of the throw being between 1 and 6 (unless the die is weighted!). The total value of a pair of dice, on the other hand, has a roughly triangular probability distribution function (TPDF) with the peak grouped on values from 6 to 8, because there are more combinations that make these totals than there are combinations making 2 or 12. Going back to digital electronics, one could liken the dice to random number generators and see that RPDF dither could be created using a single random number generator, and that TPDF dither could be created by adding the outputs of two RPDF generators.

RPDF dither has equal likelihood that the amplitude of the noise will fall anywhere between zero and maximum, whereas TPDF dither has greater likelihood that the amplitude will be zero than that it will be maximum. Analog white noise has Gaussian probability, whose shape is slightly more unusual than either of the logically generated dithers. Although RPDF, TPDF and Gaussian dither can have the effect of linearising conversion and removing distortion, RPDF dither tends to result in noise modulation at low signal levels. The most suitable dither noise is found to be TPDF with a peak-to-peak amplitude of 2Q (see Figure 2.29). If RPDF dither is used it should have a peak-to-peak amplitude of 1Q.
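Following the dice analogy, TPDF dither is generated by summing two independent RPDF sources. A sketch using the recommended amplitudes:

```python
import numpy as np

Q = 1.0                                   # one quantising interval
rng = np.random.default_rng(0)
n = 100_000

rpdf = rng.uniform(-Q / 2, Q / 2, n)      # RPDF dither, peak-to-peak amplitude Q
tpdf = (rng.uniform(-Q / 2, Q / 2, n)     # TPDF dither: the sum of two RPDF
        + rng.uniform(-Q / 2, Q / 2, n))  # sources, peak-to-peak amplitude 2Q

print(round(rpdf.min(), 2), round(rpdf.max(), 2))  # close to -0.5 and +0.5
print(round(tpdf.min(), 2), round(tpdf.max(), 2))  # close to -1.0 and +1.0; values near 0 most likely
```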

image

Figure 2.27 A probability distribution curve for dither shows the likelihood of the dither signal having a certain amplitude, averaged over a long time period

image

Figure 2.28 Probability distributions of dice throws. (a) A single die throw shows a rectangular PDF. (b) A pair of thrown dice added together has a roughly triangular PDF (in fact it is stepped)

image

Figure 2.29 Most suitable digital dither signals for audio. (a) TPDF dither with a peak-to-peak amplitude of 2Q. (b) RPDF dither with an amplitude of 1Q

Whilst it is easy to generate ideal logical PDFs in the digital domain, it is likely that the noise source present in many convertors will be analog and therefore Gaussian in nature. With Gaussian noise, the optimum r.m.s. amplitude for the dither signal is 0.5Q, at which level noise modulation is minimised but not altogether absent. Dither at this level has the effect of reducing the undithered dynamic range by about 6 dB, making the dithered dynamic range of an ideal 16-bit convertor around 92 dB.

2.4.7 Oversampling in A/D conversion

Oversampling involves sampling audio at a higher frequency than strictly necessary to satisfy the Nyquist criterion. Normally, though, this high rate is reduced to a lower rate in a subsequent digital filtering process, in order that no more storage space is required than for conventionally sampled audio. It works by trading off sample resolution against sampling rate, based on the principle that the information-carrying capacity of a channel is related to the product of these two factors. Samples at a high rate with low resolution can be converted into samples at a lower rate with higher resolution, with no overall loss of information (and hence no loss of potential sound quality). Oversampling has now become so popular that it is the norm in most high-quality audio convertors.

Although oversampling A/D convertors often quote very high sampling rates of up to 128 times the basic rates of 44.1 or 48 kHz, the actual rate at the digital output of the convertor is reduced to a basic rate or a small multiple thereof (e.g. 48, 96 or 192 kHz). Samples acquired at the high rate are quantised to only a few bits’ resolution and then digitally filtered to reduce the sampling rate, as shown in Figure 2.30. The digital low-pass filter limits the bandwidth of the signal to half the basic sampling frequency in order to avoid aliasing, and this is coupled with ‘decimation’. Decimation reduces the sampling rate by dropping samples from the oversampled stream. A result of the low-pass filtering operation is to increase the word length of the samples very considerably. This is not simply an arbitrary extension of the word length, but an accurate calculation of the correct value of each sample, based on the values of surrounding samples (see Section 2.11 on digital signal processing). Although oversampling convertors quantise samples initially at a low resolution, the output of the decimator consists of samples at a lower rate with more bits of resolution. The sample resolution can then be shortened as necessary (see Section 2.8 on requantising) to produce the desired word length.
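A behavioural sketch of the decimation stage, assuming a 4× oversampled input and a short windowed-sinc FIR low-pass filter (a real convertor would use a much longer filter and a higher initial rate):

```python
import numpy as np

def decimate(x, factor=4, taps=63):
    # Digital low-pass at the new Nyquist frequency, then drop samples.
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) / factor           # ideal low-pass, cut-off at fs/(2*factor)
    h *= np.hamming(taps)                      # window to make the filter realisable
    filtered = np.convolve(x, h, mode='same')  # word length grows in this step
    return filtered[::factor]                  # keep every factor-th sample

fs = 4 * 48000                                 # hypothetical 4x oversampled input rate
t = np.arange(1920) / fs
x = np.sin(2 * np.pi * 1000 * t)               # 1 kHz test tone
y = decimate(x)                                # output at the basic 48 kHz rate
print(len(x), "->", len(y))                    # 1920 -> 480
```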

Oversampling brings with it a number of benefits and is the key to improved sound quality at both the A/D and D/A ends of a system. Because the initial sampling rate is well above the audio range (often tens or hundreds of times the nominal rate) the spectral repetitions resulting from PAM are a long way from the upper end of the audio band (see Figure 2.31). The analog anti-aliasing filter used in conventional convertors is replaced by a digital decimation filter. Such filters can be made to have a linear phase response if required, resulting in higher sound quality. If oversampling is also used in D/A conversion the analog reconstruction filter can have a shallower roll-off. This can have the effect of improving phase linearity within the audio band, which is known to improve audio quality. In oversampled D/A conversion, basic rate audio is up-sampled to a higher rate before conversion and reconstruction filtering. Oversampling also makes it possible to introduce so-called ‘noise shaping’ into the conversion process, which allows quantising noise to be shifted out of the most audible parts of the spectrum.

image

Figure 2.30 Block diagram of oversampling A/D conversion process

image

Figure 2.31 (a) Oversampling in A/D conversion initially creates spectral repetitions that lie a long way from the top of the audio baseband. The dotted line shows the theoretical extension of the baseband and the potential for aliasing, but the audio signal only occupies the bottom part of this band. (b) Decimation and digital low pass filtering limits the baseband to half the sampling frequency, thereby eliminating any aliasing effects, and creates a conventional collection of spectral repetitions at multiples of the sampling frequency

Oversampling without subsequent decimation is a fundamental principle of Sony’s Direct Stream Digital system, described in Section 2.7.

2.4.8 Noise shaping in A/D conversion

Noise shaping is a means by which noise within the most audible parts of the audio frequency range is reduced at the expense of increased noise at other frequencies, using a process that ‘shapes’ the spectral energy of the quantising noise. It is possible because of the high sampling rates used in oversampling convertors. A high sampling rate extends the frequency range over which quantising noise is spread, putting much of it outside the audio band.

image

Figure 2.32 Block diagram of a noise shaping delta-sigma A/D convertor

image

Figure 2.33 Frequency spectra of quantising noise. In a non-oversampled convertor, as shown in (a), the quantising noise is constrained to lie within the audio band. In an oversampling convertor, as shown in (b), the quantising noise power is spread over a much wider range, thus reducing its energy in the audio band. (c) With noise shaping the noise power within the audio band is reduced still further, at the expense of increased noise outside that band

Quantising noise energy extends over the whole baseband, up to the Nyquist frequency. Oversampling spreads the quantising noise energy over a wider spectrum, because in oversampled convertors the Nyquist frequency is well above the upper limit of the audio band. This has the effect of reducing the in-band noise by around 3 dB per octave of oversampling (in other words, a system oversampling at twice the Nyquist rate would see the noise power within the audio band reduced by 3 dB).

In oversampled noise-shaping A/D conversion an integrator (low-pass filter) is introduced before the quantiser, and a D/A convertor is incorporated into a negative feedback loop, as shown in Figure 2.32. This is the so-called ‘sigma-delta convertor’. Without going too deeply into the principles of such convertors, the result is that the quantising noise (introduced after the integrator) is given a rising frequency response at the input to the decimator, whilst the input signal is passed with a flat response. There are clear parallels between such a circuit and analog negative-feedback circuits.

Without noise shaping, the energy spectrum of quantising noise is flat up to the Nyquist frequency, but with first-order noise shaping this energy spectrum is made non-flat, as shown in Figure 2.33. With second-order noise shaping the in-band reduction in noise is even greater, such that the in-band noise is well below that achieved without noise shaping.
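The shaping can be demonstrated with a first-order error-feedback quantiser, a simplification of the sigma-delta loop of Figure 2.32 (the step size and test signal here are arbitrary assumptions):

```python
import numpy as np

def noise_shape(x, q=0.1):
    # First-order error feedback: the previous quantising error is subtracted
    # from the next input, shaping the error spectrum by (1 - z^-1).
    y = np.empty_like(x)
    e = 0.0
    for i, s in enumerate(x):
        u = s - e                      # subtract the previous quantising error
        y[i] = np.round(u / q) * q     # coarse quantiser
        e = y[i] - u                   # error introduced at this step
    return y

fs = 48000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)
err = noise_shape(x) - x
spec = np.abs(np.fft.rfft(err))
print(spec[1:200].mean() / spec[-200:].mean())   # << 1: noise pushed up the spectrum
```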

2.5 D/A conversion

2.5.1 A basic D/A convertor

The basic D/A conversion process is shown in Figure 2.34. Audio sample words are converted back into a staircase-like chain of voltage levels corresponding to the sample values. This is achieved in simple convertors by using the states of bits to turn current sources on or off, making up the required pulse amplitude by combining the outputs of these sources. This staircase is then ‘resampled’ to reduce the width of the pulses before they are passed through a low-pass reconstruction filter whose cut-off frequency is half the sampling frequency. The effect of the reconstruction filter is to join up the sample points to make a smooth waveform. Resampling is necessary because otherwise the averaging effect of the filter would result in a reduction in the amplitude of high-frequency audio signals (the so-called ‘aperture effect’). Aperture effect may be reduced by limiting the width of the sample pulses to perhaps one-eighth of the sample period. Equalisation may be required to correct for aperture effect.
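The aperture loss follows a sin(x)/x law: a sample pulse of width τ attenuates frequency f by sinc(f·τ). A quick check of the one-eighth figure, assuming 48 kHz sampling:

```python
import numpy as np

fs = 48000.0
f = fs / 2                        # worst case, at the top of the audio band
for width in (1.0, 1 / 8):        # pulse width as a fraction of the sample period
    loss = np.sinc(f * width / fs)            # np.sinc(x) = sin(pi*x)/(pi*x)
    print(width, "->", round(20 * np.log10(loss), 2), "dB")
# full-width pulses lose about 3.9 dB at fs/2; 1/8-width pulses only about 0.06 dB
```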

image

Figure 2.34 Processes involved in D/A conversion (positive sample values only shown)

2.5.2 Oversampling in D/A conversion

Oversampling may be used in D/A conversion, as well as in A/D conversion. In the D/A case additional samples must be created in between the Nyquist rate samples in order that conversion can be performed at a higher sampling rate. These are produced by sample rate conversion of the PCM data. (Sample rate conversion is introduced in Section 2.11.7.) These samples are then converted back to analog at the higher rate, again avoiding the need for steep analog filters. Noise shaping may also be introduced at the D/A stage, depending on the design of the convertor, to reduce the subjective level of the noise.

A number of advanced D/A convertor designs exist which involve oversampling at a high rate, creating samples with only a few bits of resolution. The extreme version of this approach involves very high rate conversion of single bit samples (so-called ‘bit stream conversion’), with noise shaping to optimise the noise spectrum of the signal. The theory of these convertors is outside the scope of this book.

2.6 Sound quality versus sample rates and resolutions

The question often arises as to what sample rate and resolution is necessary for a certain quality of audio. What are the effects of selecting certain values? Are there standards? This section aims to provide some guidelines in this area, with reference to the capabilities of human hearing that must be considered the ultimate arbiter in this matter.

2.6.1 Psychoacoustic limitations

It is possible with digital audio to approach the limits of human hearing in terms of sound quality. In other words, the unwanted artefacts of the process can be controlled so as to be close to or below the thresholds of perception. It is also true, though, that badly engineered digital audio can sound poor and that the term ‘digital’ does not automatically imply high quality. The choice of sampling parameters and noise shaping methods, as well as more subtle aspects of convertor design, affect the frequency response, distortion and perceived dynamic range of digital audio signals.

The human ear’s capabilities should be regarded as the standard against which the quality of digital systems is measured, since it could be argued that the only distortions and noises that matter are those that can be heard. It might be considered wise to design a convertor whose noise floor was tailored to the low level sensitivity of the ear, for example. Figure 2.35 shows a typical low level hearing sensitivity curve, indicating the sound pressure level (SPL) required for a sound just to be audible. It will be seen that the ear is most sensitive in the middle frequency range, around 4 kHz, and that the response tails off towards the low and high frequency ends of the spectrum. This curve is often called the ‘minimum audible field (MAF)’ or ‘threshold of hearing’. It has an SPL of 0 dB (ref. 20 μPa) at 1 kHz. It is worth remembering, though, that the thresholds of hearing of the human ear are not absolute but probabilistic. In other words, when trying to determine what can and cannot be perceived one is dealing with statistical likelihood of perception. This is important for any research which attempts to establish criteria for audibility, since there are certain sounds which, although as much as 10 dB below the accepted thresholds, have a statistical likelihood of perception which may approach certainty in some cases. Also, some listeners are known to be more sensitive than others.

image

Figure 2.35 Hearing threshold curve

Dynamic range could be said to be equal to the range between the MAF and the loudest sound tolerable. The loudest sound tolerable depends very much on the person, but the threshold of ‘pain’ is usually said to occur between 130 and 140 dB SPL. The absolute maximum dynamic range of human hearing is therefore around 140 dB at 1 kHz, but quite a lot less than that at low and high frequencies. Whether or not it is desirable to be able to record and reproduce such a wide dynamic range is debatable.

Work carried out by Louis Fielder and Elizabeth Cohen attempted to establish the dynamic range requirements for high quality digital audio systems by investigating the extremes of sound pressure available from acoustic sources and comparing these with the perceivable noise floors in real acoustic environments. Using psychoacoustic theory, Fielder was able to show what was likely to be heard at different frequencies in terms of noise and distortion, and where the limiting elements might be in a typical recording chain. Having defined dynamic range as ‘the ratio between the r.m.s. maximum undistorted sine wave level producing peak levels equal to a particular peak level and the r.m.s. level of 20 kHz band-limited white noise that has the same apparent loudness as a particular audio chain’s equipment noise in the absence of a signal’, he proceeded to show that the just audible level of a 20 kHz bandwidth noise signal was about 4 dB SPL, and that a number of musical performances reached levels of between 120 and 129 dB SPL in favoured listening positions. From this he determined a dynamic range requirement of 122 dB for natural reproduction. Taking into account microphone performance and the limitations of consumer loudspeakers, this requirement dropped to 115 dB for consumer systems.

2.6.2 Sampling rate

The choice of sampling rate determines the maximum audio bandwidth available. There is a strong argument for choosing a sampling rate no higher than is strictly necessary, in other words not much higher than twice the highest audio frequency to be represented. This often starts arguments over what is the highest useful audio frequency and this is an area over which heated debates have raged. Conventional wisdom has it that the audio frequency band extends up to 20 kHz, implying the need for a sampling frequency of just over 40 kHz for high quality audio work. There are in fact two standard sampling frequencies between 40 and 50 kHz: the compact disc rate of 44.1 kHz and the so-called ‘professional’ rate of 48 kHz. These are both allowed in the original AES5 standard of 1984, which sets down preferred sampling frequencies for digital audio equipment. Table 2.2 is an attempt to summarise the variety of sampling frequencies in existence and their applications.

The 48 kHz rate was originally included because it left a certain amount of leeway for downward varispeed in tape recorders. When many digital recorders are varispeeded, their sampling rate changes proportionately and the result is a shifting of the first spectral repetition of the audio baseband. If the sampling rate is reduced too far, aliased components may become audible. Most professional digital tape recorders allowed for only around ±12.5 per cent of varispeed for this reason. It is possible now, though, to avoid such problems using digital low pass filters whose cut-off frequency varies with the sampling frequency, or by using digital signal processing to vary the pitch of audio without varying the output sampling frequency.

Table 2.2 Common sampling rates encountered in digital audio applications

Frequency (kHz) Application
8 Telephony (speech quality). ITU-T G711 standard.
16 Used in some telephony applications. ITU-T G722 data reduction.
18.9 CD-ROM/XA and CD-I standard for low–moderate quality audio using ADPCM to extend playing time.
~22.05 Approximately half the CD frequency. Used in some moderate quality computer applications. The original Apple Macintosh audio sampling frequency was 22254.5454… Hz.
32 Used in some broadcast coding systems, e.g. NICAM. DAT long play mode. AES 5 secondary rate.
37.8 CD-ROM/XA and CD-I standard for intermediate quality audio using ADPCM.
44.056 A slight modification of the 44.1 kHz frequency used in some older equipment to synchronise digital audio with the NTSC television frame rate of 29.97 frames per second. Such ‘pulldown’ rates are sometimes still encountered in video sync situations.
44.1 CD sampling frequency. AES 5 secondary rate.
47.952 Occasionally encountered when 48 kHz equipment is used in NTSC video operations. Another ‘pull-down’ rate, ideally to be avoided.
48 AES 5 primary rate for professional applications.
88.2 Twice the CD sampling frequency. Optional for DVD-Audio.
96 AES 5-1998 secondary rate for high bandwidth applications. Optional for DVD-Video and DVD-Audio.
176.4 and 192 Four times the basic standard rates, optional in DVD-Audio.
2822.4 DSD sampling frequency (2.8224 MHz).

The 44.1 kHz frequency had been established earlier on for the consumer compact disc and is very widely used in the industry. In fact in many ways it has become the sampling rate of choice for most professional recordings. It allows for full use of the 20 kHz audio band and oversampling convertors allow for the use of shallow analog anti-aliasing filters which avoid phase problems at high audio frequencies. It also generates around 8 per cent less data per second than the 48 kHz rate, making it economical from a storage point of view.

A rate of 32 kHz is used in some broadcasting applications, such as NICAM 728 stereo TV transmissions, and in some radio distribution systems. Television and FM radio sound bandwidth is limited to 15 kHz and a considerable economy of transmission bandwidth is achieved by the use of this lower sampling rate. The majority of important audio information lies below 15 kHz in any case and little is lost by removing the top 5 kHz of the audio band. Some professional audio applications offer this rate as an option, but it is not common. It is used for the long play mode of some DAT machines, for example.

Arguments for the standardisation of higher sampling rates have become stronger in recent years, citing evidence that information above 20 kHz is important for higher sound quality, or at least that the avoidance of steep filtering must be a good thing. Many sound engineers seem to be in favour of such moves, claiming to be able to distinguish clearly between the high rates and the conventional ones. One Japanese professor has presented evidence that frequencies above 20 kHz stimulate the production of so-called alpha waves in the brain, corresponding with a state of satisfaction and relaxation. It is certainly true that the ear’s frequency response does not cut off completely at 20 kHz, but there is very limited properly supported evidence that listeners can repeatably distinguish between signals containing higher frequencies and those that do not. Whatever the difficulties of arriving at convincing evidence, sufficient people believe that it matters: manufacturers have been falling over themselves to implement high rate options in their equipment, and the new DVD standards incorporate such sampling frequencies as standard features. AES 5–1998 (a revision of the AES standard on sampling frequencies) now allows 96 kHz as an optional rate for applications in which the audio bandwidth exceeds 20 kHz or where relaxation of the anti-alias filtering region is desired.

Doubling the sampling frequency doubles the overall data rate of a digital audio system and consequently halves the storage time per megabyte. It also means that signal processing must handle twice as much data, requiring algorithms to be redesigned accordingly, so the move to higher rates is not taken lightly in large mixing console design, for example. It follows that these higher sampling rates should be used only after careful consideration of their merits.
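
To put figures on this, the data rate arithmetic can be sketched in a few lines of Python (a minimal illustration assuming linear PCM, with no allowance for formatting or error correction overheads):

def pcm_data_rate(fs_hz, bits, channels):
    # Linear PCM data rate = sampling rate x bits per sample x channels
    return fs_hz * bits * channels            # bits per second

stereo_cd = pcm_data_rate(44100, 16, 2)       # ~1.41 Mbit/s
stereo_96k = pcm_data_rate(96000, 24, 2)      # ~4.61 Mbit/s at 96 kHz/24-bit
seconds_per_mb = 8_000_000 / stereo_cd        # ~5.7 s of stereo CD-quality audio per megabyte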

Low sampling frequencies such as those below 30 kHz are sometimes encountered in PC workstations for lower quality sound applications such as the storage of speech samples, the generation of internal sound effects and so forth. Multimedia applications may need to support these rates because such applications often involve the incorporation of sounds of different qualities. There are also low sampling frequency options for data reduction codecs, as discussed in Section 2.12.

2.6.3 Quantising resolution

The number of bits per sample dictates the signal-to-noise ratio or dynamic range of a digital audio system. For the time being only linear PCM systems will be considered, because the situation is different when considering systems that use non-uniform quantisation or data reduction. Table 2.3 attempts to summarise the applications for different sample resolutions.

For many years 16-bit linear PCM was considered the norm for high-quality audio applications. This is the CD standard and is capable of offering a good dynamic range of over 90 dB. For most purposes this is adequate, but it fails to reach Fielder’s ideal (quoted above) of 122 dB for subjectively noise-free reproduction in professional systems. To achieve such a dynamic range requires a convertor resolution of around 21 bits, which is achievable with today’s convertor technology, depending on how the specification is interpreted. Some early designs employed two convertors with a gain offset, using digital signal processing to combine the outputs of the two in the range where they overlapped, achieving a significant increase in the perceived dynamic range. Others used two convertors in parallel with independent dither, summing their outputs so that the signal rose by 6 dB but the noise by only 3 dB. So-called 24-bit convertors are indeed available today, but exactly what this means in terms of technical specification is quite hard to define. Twenty-four active bits are certainly produced at the output of such devices but their audio performance is strongly dependent upon the stability of the timing clock, electrical environment, analog stages, grounding and other issues.

Table 2.3 Linear quantising resolution

Bits per sample Approx. dynamic range with dither (dB) Application
8 44 Low–moderate quality for older PC internal sound generation. Some older multimedia applications. Usually in the form of unsigned binary numbers.
12 68 Older Akai samplers, e.g. S900.
14 80 Original EIAJ format PCM adaptors, such as Sony PCM-100.
16 92 CD standard. DAT standard. Commonly used high quality resolution for consumer media, some professional recorders and multimedia PCs. Usually twos complement (signed) binary numbers.
20 116 High-quality professional audio recording and mastering applications.
24 140 Maximum resolution of most recent professional recording systems, also of AES 3 digital interface. Dynamic range exceeds psychoacoustic requirements. Hard to convert accurately at this resolution.
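
The figures in the table follow the familiar rule of roughly 6 dB of dynamic range per bit. A minimal sketch of the arithmetic in Python (the 6 dB deduction for dither is an approximation chosen to match the table, not a precise specification):

def dynamic_range_db(bits, dithered=True):
    # 6.02 dB per bit plus 1.76 dB for a full-scale sine wave;
    # dither raises the noise floor, costing roughly 6 dB here
    return 6.02 * bits + 1.76 - (6.0 if dithered else 0.0)

for n in (8, 12, 14, 16, 20, 24):
    print(n, round(dynamic_range_db(n)))      # reproduces the values in Table 2.3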

It is often the case that for professional recording purposes one needs a certain amount of ‘headroom’ – in other words some unused dynamic range above the normal peak recording level which can be used in unforeseen circumstances such as when a signal overshoots its expected level. This can be particularly necessary in live recording situations where one is never quite sure what is going to happen with recording levels. This is another reason why many professionals feel that a resolution of greater than 16 bits is desirable for original recording. For this reason, 20- and 24-bit recording formats are becoming increasingly popular, with mastering engineers then optimising the finished recording for 16-bit media (such as CD) using noise-shaped requantising processes.

At the lower quality end, older PC sound cards and internal sound generators operated at resolutions as low as 4 bits. Eight-bit resolution also used to be quite common in desktop computers, proving just about adequate for moderate quality sound through the PC’s internal loudspeakers. It gave a dynamic range of nearly 50 dB undithered. Modern multimedia PCs and sound cards generally offer 16-bit resolution as standard. Some early MIDI samplers operated at 8-bit resolution, and some more recent models at 12-bit, but it is now common for MIDI samplers to offer 16- or 20-bit resolution.

2.7 Direct Stream Digital (DSD)

DSD is Sony’s proprietary name for its 1-bit digital audio coding system that uses a very high sampling frequency (2.8224 MHz as a rule). This system is used for audio representation on the consumer Super Audio CD (SACD) and in various items of professional equipment used for producing SACD material. The company is trying to establish a following for this approach, for use in high-quality digital audio applications, and a number of other manufacturers are beginning to produce products that are capable of handling DSD signals. It is not directly compatible with conventional PCM systems although DSD signals can be down-sampled and converted to multibit PCM if required.

DSD signals are the result of delta-sigma conversion of the analog signal, a technique used at the front end of some oversampling convertors described above. As shown in Figure 2.36, a delta-sigma convertor employs a comparator and a feedback loop containing a low pass filter, effectively quantising the difference between the current sample and the accumulated value of previous samples. If the input is higher than the accumulated value a ‘1’ results; if it is lower, a ‘0’ results. This creates a one-bit output that alternates between one and zero in a pattern that depends on the original signal waveform. Conversion to analog can be as simple a matter as passing the bit stream through a low pass filter, but is usually somewhat more sophisticated, involving noise shaping and higher order filtering.

Although one would expect one-bit signals to have an appalling signal-to-noise ratio, the exceptionally high sampling frequency spreads the noise over a very wide frequency range leading to lower noise within the audio band. Additionally, high-order noise shaping is used to reduce the noise in the audio band at the expense of that at much higher (inaudible) frequencies, as discussed earlier. A dynamic range of around 120 dB is therefore claimed, as well as a frequency response extending smoothly to over 100 kHz.
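
A first-order modulator of this kind can be sketched in a few lines of Python. This is purely conceptual: real DSD convertors use much higher order loops and sophisticated noise shaping.

import math

def delta_sigma_1bit(samples):
    # First-order delta-sigma modulation of samples in the range -1..+1.
    # The integrator accumulates the error between input and fed-back output.
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        integrator += x - feedback
        feedback = 1.0 if integrator >= 0.0 else -1.0
        bits.append(1 if feedback > 0.0 else 0)
    return bits

# One cycle of a 1 kHz tone sampled at the DSD rate of 2.8224 MHz
fs = 2822400
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs // 1000)]
bitstream = delta_sigma_1bit(tone)   # the density of ones follows the waveform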

image

Figure 2.36 A simple example of the DSD conversion process

2.8 Changing the resolution of an audio signal (requantisation)

There may be points in an audio production when the need arises to change the resolution of a signal. A common example of this in high-quality audio is when mastering 16-bit consumer products from 20- or 24-bit recordings, but it also occurs within signal processors of all types because sample wordlengths may vary at different stages. It is important that this operation is performed correctly because incorrect requantisation results in unpleasant distortion, just like undithered quantisation in A/D conversion.

If the length of audio samples needs to be reduced then the worst possible solution is simply to remove the unwanted LSBs. Taking the example of a 20-bit signal being reduced to 16 bits, one should not simply remove the 4 LSBs and expect all to be well. Removing the LSBs creates a similar effect to not using dither in A/D conversion – in other words it introduces low-level distortion components. Low-level signals would sound grainy and would not fade smoothly into noise. Figure 2.37 shows a 1 kHz signal at a level of –90 dBFS that originally began life at 20-bit resolution but has been truncated to 16 bits. The harmonic distortion is clearly visible.

The correct approach is to redither the signal for the target resolution by adding dither noise in the digital domain. This digital dither should be at an appropriate level for the new resolution and the LSB of the new sample should then be rounded up or down depending on the total value of the LSBs to be discarded, as shown in Figure 2.38. It is worrying to note how many low-cost digital audio applications fail to perform this operation satisfactorily, leading to complaints about sound quality. Many professional quality audio workstations allow for audio to be stored and output at a variety of resolutions and may make dither user selectable. They also allow the level of the audio signal to be changed in order that maximum use may be made of the available bits.

It is normally important, for example, when mastering a CD from a 20-bit recording, to ensure that the highest level signal on the original recording is adjusted during mastering so that it peaks close to the maximum level before requantising and redithering at 16-bit resolution. In this way as much as possible of the original low-level information is preserved and quantising noise is minimised. This applies in any requantising operation, not just CD mastering. A number of applications are available that automatically scale the audio signal so that its level is optimised in this way, allowing the user to set a peak signal value up to which the highest level samples will be scaled. Since some overload detectors on digital meters and CD mastering systems look for repeated samples at maximum level to detect clipping, it is perhaps wise to set peak levels so that they lie just below full modulation. This will ensure that master tapes are not rejected for a suspected recording fault by duplication plants, and that subsequent users do not complain of ‘over’ levels.
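
The dither-and-round sequence of Figure 2.38 can be sketched as follows, reducing 20-bit samples to 16 bits with triangular (TPDF) dither. The function name and rounding details are illustrative, not taken from any particular product:

import random

def requantise(sample, drop_bits=4):
    # Add TPDF dither scaled to the new LSB, then round to the
    # shorter wordlength instead of simply truncating.
    step = 1 << drop_bits                    # one new LSB, in old-LSB units
    dither = (random.randint(0, step - 1)
              + random.randint(0, step - 1) - (step - 1))   # triangular PDF
    return (sample + dither + step // 2) >> drop_bits       # rounded shorter word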

image

Figure 2.37 Truncation of audio samples results in distortion. (a) shows the spectrum of a 1 kHz signal generated and analysed at 20-bit resolution. In (b) the signal has been truncated to 16-bit resolution and the distortion products are clearly noticeable

image

Figure 2.38 The correct order of events when requantising an audio signal at a lower resolution is shown here

2.9 Dynamic range enhancement

It is possible to maximise the subjective dynamic range of digital audio signals during the process of requantisation described above. This is particularly useful when mastering high-resolution recordings for CD, because the reduction to 16-bit wordlengths would normally result in increased quantising noise. It is in fact possible to retain most of the dynamic range of a higher resolution recording, even though it is being transferred to a 16-bit medium. This remarkable feat is achieved by a noise shaping process similar to that described earlier.

During requantisation digital filtering is employed to shape the spectrum of the quantising noise so that as much of it as possible is shifted into the least audible parts of the spectrum. This usually involves moving the noise away from the 4 kHz region where the ear is most sensitive and increasing it at the HF end of the spectrum. The result is often quite high levels of noise at HF, but still lying below the audibility threshold. In this way CDs can be made to sound almost as if they had the dynamic range of 20-bit recordings. Some typical weighting curves used in a commercial mastering processor from Meridian are shown in Figure 2.39, although many other shapes are in use.

This is the principle employed in mastering systems such as Sony’s Super Bit Mapping (SBM). Some approaches allow the mastering engineer to choose from a number of ‘shapes’ of noise until he finds one which is subjectively the most pleasing for the type of music concerned, whereas others stick to one theoretically derived ‘correct’ shape.
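
The principle can be illustrated with a first-order error-feedback loop, sketched below in Python. Commercial shapers such as those of Figure 2.39 use up to ninth-order psychoacoustically weighted filters and add dither as well, so this shows only the idea:

def requantise_noise_shaped(samples, drop_bits=4):
    # Feed each sample's quantising error back into the next sample,
    # pushing the quantising noise towards high frequencies.
    # Dither is omitted here for brevity (see Figure 2.38).
    step = 1 << drop_bits
    error = 0
    out = []
    for s in samples:
        v = s - error
        q = ((v + step // 2) >> drop_bits) << drop_bits   # requantised value
        error = q - v                                     # error to be shaped
        out.append(q >> drop_bits)                        # shorter wordlength
    return out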

image

Figure 2.39 Examples of noise weighting curves used in the Meridian 518 mastering processor. Note linear frequency scale. Shape A: flat dither, 2nd order shaper. Shape B: flat dither, 9th order shaper (MAP). Shape C: flat dither, 9th order shaper (MAF). Shape D: high pass dither, 9th order shaper (MAF). Shape E: High pass dither. MAP = minimum audible pressure, MAF = minimum audible field. (Courtesy of J. R. Stuart and R. J. Wilson, Meridian Audio)

2.10 Error correction

Since this book is concerned with digital audio for workstations the topic of error correction will only be touched upon briefly. Although dedicated audio recording formats need specially designed systems to protect against the effects of data errors, systems that use computer mass storage media do not. The reason for this is that mass storage media are formatted in such a way as to make them essentially error free. When, for example, a computer disk drive is formatted at a low level, the formatting application attempts to write data to each location and read it back. If the location proves to be damaged or gives erroneous replay it is noted as a ‘bad block’, after which it is never used for data storage. In addition, disk and tape drives look after their own error detection and correction by a number of means that are normally transparent to the digital audio system. If a data error is detected when reading data then the block of data is normally re-read a few times to see if the data can be retrieved. The only effect of this is to slow down transfer slightly.

This differs greatly from the situation with dedicated audio formats such as DAT, where there are many levels of error protection, some of which allow errors to be completely corrected (no effect on sound quality) and others that allow the audible effects of more serious errors to be minimised. A process known as interpolation, for example, allows missing samples to be ‘guessed’ by estimating the level of the missing sample based on those around it (see Figure 2.40). Computer systems, on the other hand, cannot allow this type of error concealment because it is assumed that data is either correct or it is useless. When reading a financial spreadsheet, for example, it would not be acceptable for an erroneous figure to be guessed by looking at those on either side!
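
At its simplest such concealment amounts to averaging the neighbouring samples, as in this sketch (practical interpolators fit more sophisticated curves over several samples):

def conceal_missing(prev_sample, next_sample):
    # Estimate a lost sample as the mean of its neighbours (see Figure 2.40)
    return (prev_sample + next_sample) // 2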

The result is that computer mass storage media are treated as raw, error-free data storage capacity, without the need to add an overhead for error correction data once formatted. This does not mean that such media are infallible and will never give errors – they do fail occasionally – but audio workstations do not normally use any additional procedures on top of those already in place. The downside is that if an unavoidable error does arise in the replay of a sound file from a digital workstation it often results in a total inability to play that file. The file is assumed to be corrupt and the computer will not read it. The user does not have the option of deciding whether the error is audible, and must instead resort to one of the various computer file ‘rescue packages’ that attempt to rebuild the corrupted information using proprietary techniques.

image

Figure 2.40 Interpolation is a means of hiding the audible effects of missing samples, as shown here

2.11 Introduction to digital audio signal processing

Just as processing operations like equalisation, fading and compression can be performed in the analog domain, so they can in the digital domain. Indeed it is often possible to achieve certain operations in the digital domain with fewer side effects such as phase distortion. It is possible to perform operations in the digital domain that are either very difficult or impossible in the analog domain. High quality, authentic-sounding artificial reverberation is one such example, in which the reflection characteristics of different halls and rooms can be accurately simulated. Digital signal processing (DSP) involves the high-speed manipulation of the binary data representing audio samples. It may involve changing the values and timing order of samples and it may involve the combining of two or more streams of audio data. DSP can affect the sound quality of digital audio in that it can add noise or distortion, although one must assume that the aim of good design is to minimise any such degradation in quality.

In the sections that follow an introduction will be given to some of the main applications of DSP in audio workstations without delving into the mathematical principles involved. In some cases the description is an over-simplification of the process, but the aim has been to illustrate concepts not to tackle the detailed design considerations involved.

2.11.1 Gain changing (level control)

It is relatively easy to change the level of an audio signal in the digital domain. The easiest change is a gain shift of 6 dB, since this involves shifting the whole sample word one step to the left or right (see Figure 2.41), effectively multiplying or dividing the original value by a factor of two. More precise gain control is obtained by multiplying the audio sample value by some other factor representing the increase or decrease in gain. The number of bits in the multiplication factor determines the accuracy of gain adjustment. Multiplying two binary numbers together creates a new sample word which may have many more bits than the original, and it is common to find that digital mixers have internal structures capable of handling 32-bit words, even though their inputs and outputs may handle only 20. Because of this, redithering is usually employed in mixers at points where the sample resolution has to be shortened, such as at any digital outputs or conversion stages, in order to preserve sound quality as described above.

image

Figure 2.41 The gain of a sample may be changed by 6 dB simply by shifting all the bits one step to the left or right
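
In Python the two techniques described above might be sketched as follows (treating sample words as integers; an illustration, not production DSP code):

def gain_6db(sample, up=True):
    # A one-bit shift doubles or halves the value: +6 dB or -6 dB
    return sample << 1 if up else sample >> 1

def apply_gain(sample, factor):
    # General gain control: multiply by a coefficient, e.g. 0.5 for -6 dB.
    # The product is longer than either operand, which is why mixers use
    # wide internal words and redither before shortening them again.
    return int(sample * factor)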

The values used for multiplication in a digital gain control may be derived from any user control such as a fader, rotary knob or on-screen representation, or they may be derived from stored values in an automation system. A simple ‘old-fashioned’ way of deriving a digital value from an ‘analog’ fader is to connect the fader to a fixed voltage supply and connect the fader wiper to an A/D convertor, although it is quite common now to find controls capable of providing a direct binary output relating to their position. The ‘law’ of the fader (the way in which its gain is related to its physical position) can be determined by creating a suitable look-up table of values in memory which are then used as multiplication factors corresponding to each physical fader position.
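
A fader law held as a look-up table might be sketched like this; the 128-position resolution and –60 dB range are illustrative assumptions:

# Position 0 is silence; positions 1-127 span -60 dB to 0 dB
FADER_LAW = [0.0] + [10 ** ((-60 + 60 * p / 127) / 20) for p in range(1, 128)]

def faded(sample, position):
    # Use the coefficient corresponding to the current fader position
    return int(sample * FADER_LAW[position])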

2.11.2 Crossfading

Crossfading is employed widely in audio workstations at points where one section of sound is to be joined to another (edit points). It avoids the abrupt change of waveform that might otherwise result in an audible click and allows one sound to take over smoothly from the other. The process is illustrated conceptually in Figure 2.42. It involves two signals each undergoing an automated fade (binary multiplication), one downwards and the other upwards, followed by an addition of the two signals. By controlling the rates and coefficients involved in the fades one can create different styles of crossfade for different purposes.
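
A linear equal-gain crossfade of this kind might be sketched as follows (assuming the two signals are already aligned at the edit point and that the crossfade is at least two samples long):

def crossfade(outgoing, incoming, length):
    # Fade one signal down while the other fades up, then sum (Figure 2.42)
    result = []
    for n in range(length):
        f = n / (length - 1)          # coefficient rises from 0.0 to 1.0
        result.append(int(outgoing[n] * (1.0 - f) + incoming[n] * f))
    return result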

image

Figure 2.42 Conceptual block diagram of the crossfading process, showing two audio signals multiplied by changing coefficients, after which they are added together

2.11.3 Mixing

Mixing is the summation of independent data streams representing the different audio channels. Time coincident samples from each input channel are summed to produce a single output channel sample. Clearly it is possible to have many mix ‘buses’ by having a number of separate summing operations for different output channels. The result of summing a lot of signals may be to increase the overall level considerably and the architecture of the mixer must allow enough headroom for this possibility. In the same way as an analog mixer, the gain structure within a digital mixer must be such that there is an appropriate dynamic range window for the signals at each point in the chain, also allowing for operations such as equalisation that change the signal level.
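
Conceptually the whole operation reduces to a summation of time-coincident samples, as in this sketch (Python's unbounded integers stand in for the extra headroom bits a real mixer must provide):

def mix_bus(channels):
    # Sum time-coincident samples from each input channel onto one bus
    return [sum(frame) for frame in zip(*channels)]

bus = mix_bus([[100, -50, 3], [200, 10, -3]])   # -> [300, -40, 0]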

2.11.4 Digital filters and equalisation

Digital filtering is something of a ‘catch-all’ term, and is often used to describe DSP operations that do not at first sight appear to be filtering. A digital filter is essentially a process that involves the time delay, multiplication and recombination of audio samples in all sorts of configurations, from the simplest to the most complex. Using digital filters one can create low- and high-pass filters, peaking and shelving filters, echo and reverberation effects, and even adaptive filters that adjust their characteristics to affect different parts of the signal.

To understand the basic principle of digital filters it helps to think about how one might emulate a certain analog filtering process digitally. Filter responses can be modelled in two main ways – one by looking at their frequency domain response and the other by looking at their time domain response. (There is another approach involving the z-transform, but this is not covered here.) The frequency domain response shows how the amplitude of the filter’s output varies with frequency, whereas the time domain response is usually represented in terms of an impulse response (see Figure 2.43). An impulse response shows how the filter’s output responds to stimulation at the input by a single short impulse. Every frequency response has a corresponding impulse (time) response because the two are directly related: if you change the way a filter responds in time you also change the way it responds in frequency. A mathematical process known as the Fourier transform is often used to transform a time domain response into its equivalent frequency domain response. They are simply two ways of looking at the same thing.

image

Figure 2.43 Examples of (a) the frequency response of a simple filter, and (b) the equivalent time domain impulse response

Digital audio is time discrete because it is sampled. Each sample represents the amplitude of the sound wave at a certain point in time. It is therefore normal to create certain filtering characteristics digitally by operating on the audio samples in the time domain. In fact, if it were desired to emulate a certain analog filter characteristic digitally one would theoretically need only to measure its impulse response and model this in the digital domain. The digital version would then have the same frequency response as the analog version, and one can even envisage favourite analog filters being recreated for the digital workstation. The question, though, is how to create a particular impulse response characteristic digitally, and how to combine this with the audio data.

As mentioned earlier, all digital filters involve delay, multiplication and recombination of audio samples, and it is the arrangement of these elements that gives a filter its impulse response. A simple filter model is the finite impulse response (FIR) filter, or transversal filter, shown in Figure 2.44. As can be seen, this filter consists of a tapped delay line with each tap being multiplied by a certain coefficient before being summed with the outputs of the other taps. Each delay stage is normally a one sample period delay. An impulse arriving at the input would result in a number of separate versions of the impulse being summed at the output, each with a different amplitude. It is called a finite impulse response filter because a single impulse at the input results in a finite output sequence determined by the number of taps. The more taps there are the more intricate the filter’s response can be made, although a simple low pass filter only requires a few taps.
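
A transversal filter like that of Figure 2.44 reduces to a few lines of Python (a conceptual sketch rather than an optimised implementation):

def fir_filter(x, coefficients):
    # Each tap delays the input by one further sample, multiplies it
    # by its coefficient and contributes to the summed output.
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coefficients):
            if n - k >= 0:
                acc += c * x[n - k]
        y.append(acc)
    return y

# A single impulse returns the finite response: the coefficients themselves
print(fir_filter([1, 0, 0, 0, 0], [0.5, 0.3, 0.15, 0.05]))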

image

Figure 2.44 A simple FIR filter (transversal filter). N = multiplication coefficient for each tap. Response shown below indicates successive output samples multiplied by decreasing coefficients

image

Figure 2.45 A simple IIR filter (recursive filter). The output impulses continue indefinitely but become very small. N in this case is about 0.8. A similar response to the previous FIR filter is achieved but with fewer stages

The other main type is the infinite impulse response (IIR) filter, also known as a recursive filter because there is a degree of feedback between the output and the input (see Figure 2.45). The response of such a filter to a single impulse is an infinite output sequence, because of the feedback. IIR filters are often used in audio equipment because, for most variable equalisers, they involve fewer elements than equivalent FIR filters, and they are useful in effects devices. Unlike FIR filters, however, they cannot be made phase linear.
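
The recursive filter of Figure 2.45 can be sketched in the same style, with a feedback coefficient of about 0.8 as in the figure:

def iir_filter(x, feedback=0.8):
    # The output feeds back to the input, so a single impulse
    # produces an output sequence that decays indefinitely.
    y = []
    previous = 0.0
    for sample in x:
        previous = sample + feedback * previous
        y.append(previous)
    return y

print(iir_filter([1, 0, 0, 0, 0]))   # 1.0, 0.8, 0.64, 0.512, 0.4096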

2.11.5 Digital reverberation and other effects

It can probably be seen that the IIR filter described in Section 2.11.4 forms the basis for certain digital effects, such as reverberation. The impulse response of a typical room looks something like Figure 2.46, that is an initial direct arrival of sound from the source, followed by a series of early reflections, followed by a diffuse ‘tail’ of densely packed reflections decaying gradually to almost nothing. Using a number of IIR filters, perhaps together with a few FIR filters, one could create a suitable pattern of delayed and attenuated versions of the original impulse to simulate the decay pattern of a room. By modifying the delays and amplitudes of the early reflections and the nature of the diffuse tail one could simulate different rooms.
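
One classic building block for such algorithms is the recirculating comb filter: essentially the IIR structure above with a delay of many samples rather than one. The delay and gain values below are illustrative only:

def comb(x, delay=1500, gain=0.7):
    # Feedback comb filter: y[n] = x[n] + gain * y[n - delay].
    # Several combs with different delays, combined with allpass
    # filters, form the basis of classic reverberator designs.
    buffer = [0.0] * delay
    y = []
    for n, s in enumerate(x):
        out = s + gain * buffer[n % delay]
        buffer[n % delay] = out
        y.append(out)
    return y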

image

Figure 2.46 The impulse response of a typical reflective room

image

Figure 2.47 A simple digital dynamics processing operation

The design of convincing reverberation algorithms is a skilled task, and the difference between crude approaches and good ones is very noticeable. Some audio workstations offer limited reverberation effects built into the basic software package, but these often sound rather poor because of the limited DSP power available (often processed on the computer’s own CPU) and the crude algorithms involved. More convincing reverberation processors are available which exist either as stand-alone devices or as optional plug-ins for the workstation, having access to more DSP capacity and tailor-made software.

Other simple effects can be introduced without much DSP capacity, such as double-tracking and phasing/flanging effects. These often only involve very simple delaying and recombination processes. Pitch shifting can also be implemented digitally, and this involves processes similar to sample rate conversion, as described below. High-quality pitch shifting requires quite considerable horsepower because of the number of calculations required.

2.11.6 Dynamics processing

Digital dynamics processing involves gain control that depends on the instantaneous level of the audio signal. A simple block diagram of such a device is shown in Figure 2.47. A side chain produces coefficients corresponding to the instantaneous gain change required, which are then used to multiply the delayed audio samples. First the r.m.s. level of the signal must be determined, after which it needs to be converted to a logarithmic value in order to determine the level change in decibels. Only samples above a certain threshold level will be affected, so a constant factor must be added to the values obtained, after which they are multiplied by a factor to represent the compression slope. The coefficient values are then antilogged to produce linear coefficients by which the audio samples can be multiplied.
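
The side chain arithmetic might be sketched as follows, using a static gain law and omitting the r.m.s. detection, delay line and attack/release time constants of a real design (the threshold and ratio values are illustrative):

import math

def compress(samples, threshold_db=-20.0, ratio=4.0, full_scale=32767):
    out = []
    for s in samples:
        level_db = 20 * math.log10(max(abs(s) / full_scale, 1e-9))  # to log domain
        over = max(level_db - threshold_db, 0.0)     # amount above threshold
        gain_db = -over * (1.0 - 1.0 / ratio)        # apply the compression slope
        out.append(int(s * 10 ** (gain_db / 20.0)))  # antilog back to linear
    return out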

2.11.7 Sample rate conversion

Sample rate conversion is necessary whenever audio is to be transferred between systems operating at different rates. The aim is to convert the audio to the new rate without any change in pitch or addition of distortion or noise. These days sample rate conversion can be of very high quality, although it is never entirely transparent because it involves modifying the sample values and timings. As with requantising algorithms, it is fairly common to encounter poorly implemented sample rate conversion on low-cost digital audio workstations, the quality often depending on the specific software application rather than the hardware involved.

The easiest way to convert from one rate to another is to pass through the analog domain and resample at the new rate, but this may introduce a small amount of extra noise. The most basic form of digital rate conversion involves the translation of samples at one fixed rate to a new fixed rate related by a simple fractional ratio. Fractional-ratio conversion involves the mathematical calculation of samples at the new rate based on the values of samples at the old rate. Digital filtering is used to calculate the amplitudes of the new samples so that they correctly represent the original waveform, after low pass filtering with an upper limit at the Nyquist frequency of the original sampling rate. A clock rate common to both sample rates is used to control the interpolation process. Using this method, some output samples will coincide with input samples, but only a limited number of possibilities exist for the interval between input and output samples.

If the input and output sampling rates have a variable or non-simple relationship the above does not hold true, since output samples may be required at any interval in between input samples. This requires an interpolator with many more clock phases than for fractional-ratio conversion, the intention being to pick a clock phase that most closely corresponds to the desired output sample instant at which to calculate the necessary coefficient. There will clearly be an error, which may be made smaller by increasing the number of possible interpolator phases. The audible result of the timing error is equivalent to the effects of jitter on an audio signal (see above), and should be minimised in design so that the effects of sample rate conversion are below the noise floor of the signal resolution in hand. If the input sampling rate is continuously varied (as it might be in variable-speed searching or cueing) the position of interpolated samples with relation to original samples must vary also. This requires real-time calculation of filter phase.
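
The crudest possible interpolator uses only the two adjacent input samples, as in the sketch below. Real rate convertors use long low pass interpolation filters with many phases, so this serves only to show the timing principle:

def resample_linear(x, in_rate, out_rate):
    # Calculate output samples at instants between input samples
    # by linear interpolation between the two nearest neighbours.
    out = []
    t = 0.0
    step = in_rate / out_rate
    while t < len(x) - 1:
        i = int(t)
        frac = t - i
        out.append(x[i] * (1.0 - frac) + x[i + 1] * frac)
        t += step
    return out

converted = resample_linear(list(range(100)), 48000, 44100)   # 48 kHz -> 44.1 kHz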

Many workstations now include sample rate conversion as either a standard or optional feature, so that audio material recorded and edited at one rate can be reproduced at another. It is important to ensure that the quality of the sample rate conversion is high enough not to affect the sound quality of your recordings, and it should be used only when it cannot be avoided. Poorly implemented applications sometimes omit the correct low pass filtering needed to avoid aliasing, or incorporate very basic digital filters, resulting in poor sound quality after rate conversion.

Sample rate conversion is also useful as a means of synchronising an external digital source to a standard sampling frequency reference, when it is outside the range receivable by a workstation.

2.12 Audio data reduction

Conventional PCM audio has a high data rate, and there are many applications for which it would be an advantage to have a lower data rate without much (or any) loss of sound quality. Sixteen-bit linear PCM at a sampling rate of 44.1 kHz (‘CD quality digital audio’) results in a data rate of about 700 kbit s–1 per channel. For multimedia applications, broadcasting, communications and some consumer purposes (e.g. streaming over the Internet) the data rate may be reduced to a fraction of this with minimal effect on the perceived sound quality. At very low rates the effect on sound quality is traded off against the bit rate required. Simple techniques for reducing the data rate, such as reducing the sampling rate or the number of bits per sample, would have a very noticeable effect on sound quality, so most modern low bit rate coding works by exploiting the phenomenon of auditory masking to ‘hide’ the increased noise resulting from bit rate reduction in parts of the audio spectrum where it will hopefully be inaudible. There are a number of types of low bit rate coding used in audio systems, working on similar principles, for applications such as consumer disk and tape systems (e.g. Sony ATRAC), digital cinema sound (e.g. Dolby Digital, Sony SDDS, DTS) and multimedia applications (e.g. MPEG).

2.12.1 Why reduce the data rate?

Nothing is inherently wrong with linear PCM from a sound quality point of view, indeed it is probably the best thing to use. The problem is simply that the data rate is too high for a number of applications. Two channels of linear PCM require a rate of around 1.4 Mbit s–1, whereas applications such as digital audio broadcasting (DAB) or digital radio need it to be more like 128 kbit s–1 (or perhaps lower for some applications) in order to fit sufficient channels into the radio frequency spectrum – in other words more than ten times less data per second. Some Internet streaming applications need it to be even lower than this, with rates down in the low tens of kilobits per second for modem-oriented connections or mobile communications.

The efficiency of mass storage media and data networks is related to their data transfer rates. The more data can be moved per second, the more audio channels may be handled simultaneously, the faster a disk can be copied, the faster a sound file can be transmitted across the world. In reducing the data rate that each audio channel demands, one also reduces the requirement for such high specifications from storage media and networks, or alternatively one can obtain greater functionality from the same specification. A network connection capable of handling eight channels of linear PCM simultaneously could be made to handle, say, 48 channels of data-reduced audio, without unduly affecting sound quality.

Although this sounds like magic and makes it seem as if there is no point in continuing to use linear PCM, it must be appreciated that the data reduction is achieved by throwing away data from the original audio signal. The more data is thrown away the more likely it is that unwanted audible effects will be noticed. The design aim of most of these systems is to try to retain as much as possible of the sound quality whilst throwing away as much data as possible, so it follows that one should always use the least data reduction necessary, where there is a choice.

2.12.2 Lossless and lossy coding

There is an important distinction to be made between the type of data reduction used in some computer applications and the approach used in many audio coders. The distinction is really between ‘lossless’ coding and coding that involves some loss of information (see Figure 2.48). It is quite common to use data compression on computer files in order to fit more information onto a given disk or tape, but such compression is usually lossless in that the original data are reconstructed bit for bit when the file is decompressed. A number of tape backup devices for computers have a compression facility for increasing the apparent capacity of the medium, for example. The methods used exploit redundancy in the information, such as coding a string of eighty zeros by replacing it with a short message stating the value of the following data and the number of bytes involved. This is particularly relevant in single-frame bit-mapped picture files, where there may be considerable runs of black or white in each line of a scan because nothing in the image is changing. One may expect files compressed using off-the-shelf PC data compression applications to be reduced to perhaps 25–50 per cent of their original size, but it must be remembered that they are often dealing with static data and do not have to work in real time. Also, it is not normally acceptable for decompressed computer data to be anything but the original data.

It is possible to use lossless coding on audio signals. Lossless coding allows the original PCM data to be reconstructed perfectly by the decoder and is therefore ‘noiseless’ since there is no effect on audio quality. The data reduction obtained using these methods ranges from nothing to about 2.5:1 and is variable depending on the program material. This is because audio signals have an unpredictable content, do not make use of a standard limited character set, and do not spend long periods of time in one binary state or the other. Although it is possible to perform this reduction in real time, the coding gains are not sufficient for many applications. Nonetheless, a halving in the average audio data rate is certainly a useful saving. A form of lossless data reduction known as Direct Stream Transfer (DST) can be used for Super Audio CD (see Section 2.7) in order to fit the required multichannel audio data into the space available. A similar system is available for DVD-Audio, called MLP (Meridian Lossless Packing), discussed further in Chapter 8.

image

Figure 2.48 (a) In lossless coding the original data is reconstructed perfectly upon decoding, resulting in no loss of information. (b) In lossy coding the decoded information is not the same as that originally coded, but the coder is designed so that the effects of the process are minimal

‘Noisy’ or lossy coding methods make possible a far greater degree of data reduction, but require the designer and user to arrive at a compromise between the degree of data reduction and potential effects on sound quality. Here data reduction is achieved by coding the signal less accurately than in the original PCM format (using fewer bits per sample), thereby increasing quantising noise, but with the intention that increases in noise will be ‘masked’ (made inaudible) by the signal. The original data is not reconstructed perfectly on decoding. The success of such techniques therefore relies on being able to model the characteristics of the human hearing process in order to predict the masking effect of the signal at any point in time – hence the common term ‘perceptual coding’ for this approach. Using detailed psychoacoustic models it is possible to code high-quality audio at rates under 100 kbit s–1 per channel with minimal effects on audio quality. Higher data rates, such as 192 kbit s–1, can be used to obtain an audio quality that is demonstrably indistinguishable from the original PCM.

2.12.3 MPEG – an example of lossy coding

The following is a very brief overview of how one approach works, based on the technology involved in the MPEG (Moving Pictures Expert Group) standards.

As shown in Figure 2.49, the incoming digital audio signal is filtered into a number of narrow frequency bands. Parallel to this a computer model of the human hearing process (an auditory model) analyses a short portion of the audio signal (a few milliseconds). This analysis is used to determine what parts of the audio spectrum will be masked, and to what degree, during that short time period. In bands where there is a strong signal, quantising noise can be allowed to rise considerably without it being heard, because one signal is very efficient at masking another lower level signal in the same band as itself (see Figure 2.50). Provided that the noise is kept below the masking threshold in each band it should be inaudible.

image

Figure 2.49 Generalised block diagram of a psychoacoustic low bit rate coder

image

Figure 2.50 Quantising noise lying under the masking threshold will normally be inaudible

Blocks of audio samples in each narrow band are scaled (low level signals are amplified so that they use more of the most significant bits of the range) and the scaled samples are then reduced in resolution (requantised) by reducing the number of bits available to represent each sample – a process that results in increased quantising noise. The output of the auditory model is used to control the requantising process so that the sound quality remains as high as possible for a given bit rate. The greatest number of bits is allocated to frequency bands where noise would be most audible, and the fewest to those bands where the noise would be effectively masked by the signal. Control information is sent along with the blocks of bit-rate-reduced samples to allow them to be reconstructed at the correct level and resolution upon decoding.
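
The scaling and requantising of a single subband block might be sketched like this; the bit allocation itself, driven by the auditory model, is far more involved and is not shown:

def code_subband_block(block, allocated_bits):
    # Scale the block so its peak uses the full range, then requantise
    # each sample to the allocated number of bits. The scale factor is
    # sent with the block so the decoder can restore the correct level.
    peak = max(abs(s) for s in block) or 1.0
    half_range = ((1 << allocated_bits) - 1) // 2
    quantised = [round(s / peak * half_range) for s in block]
    return quantised, peak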

The above process is repeated every few milliseconds, so that the masking model is constantly being updated to take account of changes in the audio signal. Carefully implemented, such a process can result in a reduction of the data rate to anything from about one-quarter to less than one-tenth of the original data rate. A decoder uses the control information transmitted with the bit-rate-reduced samples to restore the samples to their correct level and can determine how many bits were allocated to each frequency band by the encoder, reconstructing linear PCM samples and then recombining the frequency bands to form a single output (see Figure 2.51). A decoder can be much less complex, and therefore cheaper, than an encoder, because it does not need to contain the auditory model.

A standard known as MPEG-1, published by the International Standards Organisation (ISO/IEC 11172-3), defines a number of ‘layers’ of complexity for low bit rate audio coders, as shown in Table 2.4. Each of the layers can be operated at any of the bit rates within the ranges shown (although some of the higher rates are intended for stereo modes) and the user must make appropriate decisions about what sound quality is appropriate for each application. The lower the data rate, the lower the sound quality that will be obtained. At high data rates the encoding-decoding process has been judged by many to be audibly ‘transparent’ – in other words listeners cannot detect that the coded and decoded signal is different from the original input. The target bit rates shown were those intended to achieve such ‘transparent’ coding.

image

Figure 2.51 Generalised block diagram of an MPEG-Audio decoder

Table 2.4 MPEG-1 layers

image

‘MP3’ will be for many people the name associated with downloading music files from the Internet. The term MP3 has caused some confusion; it is short for MPEG-1 Layer 3, but MP3 has virtually become a generic term for the system used for receiving compressed audio from the Internet. There is also MPEG-2 which can handle multichannel surround, and further developments in this and later systems will be briefly touched upon.

MPEG-2 BC (Backwards Compatible with MPEG-1) additionally supports sampling frequencies of 16, 22.05 and 24 kHz, at bit rates from 32 to 256 kbit s–1 for Layer 1 and from 8 to 160 kbit s–1 for Layers 2 and 3. Developments intended to supersede MPEG-2 BC have included MPEG-2 AAC (Advanced Audio Coding). This defines a standard for multichannel coding of up to 48 channels, with sampling rates from 8 kHz to 96 kHz. It also incorporates a modified discrete cosine transform (MDCT) system, as used in the MiniDisc coding format (ATRAC). MPEG-2 AAC was not, however, designed to be backwards compatible with MPEG-1.

MPEG-4 ‘natural audio coding’ is based on the standards outlined for MPEG-2 AAC; it includes further coding techniques for reducing transmission bandwidth and it can scale the bit rate according to the complexity of the decoder. There are also intermediate levels of parametric representation in MPEG-4, such as those used in speech coding, whereby the speed and pitch of basic signals can be altered over time. One has access to a variety of methods of representing sound at different levels of abstraction and complexity, all the way from natural audio coding (lowest level of abstraction), through parametric coding systems based on speech synthesis and low level parameter modification, to fully synthetic audio objects.

When audio signals are described in the form of ‘objects’ and ‘scenes’, it requires that they be rendered or synthesised by a suitable decoder. Structured Audio (SA) in MPEG-4 enables synthetic sound sources to be represented and controlled at very low bit rates (less than 1 kbit s–1). An SA decoder can synthesise music and sound effects. SAOL (Structured Audio Orchestra Language), as used in MPEG-4, was developed at MIT and is an evolution of CSound (a synthesis language used widely in the electroacoustic music and academic communities). It enables ‘instruments’ and ‘scores’ to be downloaded. The instruments define the parameters of a number of sound sources that are to be rendered by synthesis (e.g. FM, wavetable, granular, additive) and the ‘score’ is a list of control information that governs what those instruments play and when (represented in the SASL or Structured Audio Score Language format). This is rather like a more refined version of the established MIDI control protocol, and indeed MIDI can be used if required for basic music performance control. This is discussed further in Chapter 4.

Sound scenes, as distinct from sound objects, are usually made up of two elements – that is the sound objects and the environment within which they are located. Both elements are integrated within one part of MPEG-4. This part of MPEG-4 uses so-called BIFS (Binary Format for Scenes) for describing the composition of scenes (both visual and audio). The objects are known as nodes and are based on VRML (virtual reality modelling language). So-called Audio BIFS can be post-processed and represent parametric descriptions of sound objects. Advanced Audio BIFS also enable virtual environments to be described in the form of perceptual room acoustics parameters, including positioning and directivity of sound objects. MPEG-4 audio scene description distinguishes between physical and perceptual representation of scenes, rather like the low- and high-level description information mentioned above.

2.12.4 Other data-reduced formats

Dolby Digital or AC-3 encoding was developed as a means of delivering 5.1-channel surround to cinemas or the home without the need for analog matrix encoding. The AC-3 coding algorithm can be used for a wide range of different audio signal configurations and bit rates from 32 kbit s–1 for a single mono channel up to 640 kbit s–1 for surround signals. It is used widely for the distribution of digital sound tracks on 35 mm movie films, the data being stored optically in the space between the sprocket holes on the film.

It is sufficient to say here that the process involves a number of techniques by which the data representing audio from the source channels is transformed into the frequency domain and requantised to a lower resolution, relying on the masking characteristics of the human hearing process to hide the increased quantising noise that results from this process. A common bit pool is used so that channels requiring higher data rates than others can trade their bit rate requirements provided that the overall total bit rate does not exceed the constant rate specified.

Aside from the representation of surround sound in a compact digital form, Dolby Digital includes a variety of operational features that enhance system flexibility and help adapt replay to a variety of consumer situations. These include dialogue normalisation (‘dialnorm’) and the option to include dynamic range control information alongside the audio data, for use in environments where background noise prevents the full dynamic range of the source material from being heard. Downmix control information can also be carried alongside the audio data in order that a two-channel version of the surround sound material can be reconstructed in the decoder. As a rule, Dolby Digital data is stored or transmitted with the highest number of channels needed for the end product and any compatible downmixes are created in the decoder. This differs from some other systems, where a two-channel downmix is carried alongside the surround information.

The DTS (Digital Theater Systems) ‘Coherent Acoustics’ system is another digital signal coding format that can be used to deliver surround sound in consumer or professional applications, using low bit rate coding techniques to reduce the data rate of the audio information. The DTS system can accommodate a wide range of bit rates from 32 kbit s–1 up to 4.096 Mbit s–1 (somewhat higher than Dolby Digital), with up to eight source channels and with sampling rates up to 192 kHz. Variable bit rate and lossless coding are also optional. Downmixing and dynamic range control options are provided in the system. Because the maximum data rate is typically somewhat higher than that of Dolby Digital or MPEG, a greater margin can be engineered between the signal and any artefacts of low bit rate coding, leading to potentially higher sound quality. Such judgements, though, are obviously up to the individual and it is impossible to make blanket statements about comparative sound quality between systems.

SDDS stands for Sony Dynamic Digital Sound, and is the third of the main competing formats for digital film sound. Using Sony’s ATRAC data reduction system (also used on MiniDiscs), it too encodes audio data with a substantial saving in bit rate compared with the original PCM (about 5:1 compression).

Real Networks has been developing data reduction for Internet streaming applications for a number of years and specialises in squeezing the maximum quality possible out of very low bit rates. It has recently released ‘Real Audio with ATRAC 3’ which succeeds the earlier Real Audio G2 standard. Audio can be coded at rates between 12 and 352 kbit s–1, occupying only 63 per cent of the bandwidth previously consumed by G2.

Further reading

Bosi, M. and Goldberg, R. (2003) Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers.

Watkinson, J. (2001) The Art of Digital Audio, third edition. Focal Press.
