WHAT IS SOUND?

Physics can tell us the mechanism by which disturbances propagate through the air. If this is our definition of sound, we have the problem that in physics there are no limits to the frequencies and levels that must be considered. Biology can tell us that the ear responds to only a certain range of frequencies provided a threshold level is exceeded. This is a better definition of sound; reproduction is easier because it is necessary only to reproduce that range of levels and frequencies that the ear can detect.

Psycho-acoustics can describe how our hearing has finite resolution in both time and frequency domains such that what we perceive is an inexact impression. Some aspects of the original disturbance are inaudible to us and are said to be masked. If our goal is the highest quality, we can design our imperfect equipment so that the shortcomings are masked. Conversely if our goal is economy we can use compression and hope that masking will disguise the inaccuracies it causes.

By definition, the sound quality of a perceptive coder can be assessed only by human hearing. Equally, a useful perceptive coder can be designed only with a good knowledge of the human hearing mechanism.1 The acuity of the human ear is astonishing. The frequency range is extremely wide, covering some 10 octaves (an octave is a doubling of pitch or frequency) without interruption. It can detect tiny amounts of distortion and will accept an enormous dynamic range. If the ear detects a different degree of impairment between two codecs having the same bit rate in properly conducted tests, we can say that one of them is superior. Thus quality is completely subjective and can be checked only by listening tests. However, any characteristic of a signal that can be heard can also be measured by a suitable instrument. The subjective tests can tell us how sensitive the instrument should be. Then the objective readings from the instrument give an indication of how acceptable a signal is in respect to that characteristic. Instruments for assessing the performance of codecs are currently extremely rare and there remains much work to be done.

THE EAR

The sense we call hearing results from acoustic, mechanical, hydraulic, nervous, and mental processes in the ear/brain combination, leading to the term psycho-acoustics. It is possible to introduce the subject here only briefly.2

Figure 7.1 shows that the structure of the ear is traditionally divided into the outer, middle, and inner ear. The outer ear works at low impedance, the inner ear works at high impedance, and the middle ear is an impedance-matching device. The visible part of the outer ear is called the pinna, which plays a subtle role in determining the direction of arrival of sound at high frequencies. It is too small to have any effect at low frequencies. Incident sound enters the auditory canal or meatus. The pipe-like meatus causes a small resonance at around 4 kHz. Sound vibrates the eardrum or tympanic membrane, which seals the outer ear from the middle ear. The inner ear or cochlea works by sound travelling through a fluid. Sound enters the cochlea via a membrane called the oval window. If airborne sound were to be incident on the oval window directly, the serious impedance mismatch would cause most of the sound to be reflected. The middle ear remedies that mismatch by providing a mechanical advantage. The tympanic membrane is linked to the oval window by three bones known as ossicles, which act as a lever system such that a large displacement of the tympanic membrane results in a smaller displacement of the oval window but with greater force. Figure 7.2 shows that the malleus applies a tension to the tympanic membrane, rendering it conical in shape. The malleus and the incus are firmly joined together to form a lever. The incus acts upon the stapes through a spherical joint. As the area of the tympanic membrane is greater than that of the oval window, there is a further multiplication of the available force. Consequently small pressures over the large area of the tympanic membrane are converted to high pressures over the small area of the oval window.

image

FIGURE 7.1

The structure of the human ear. See text for details.

image

FIGURE 7.2

The malleus tensions the tympanic membrane into a conical shape. The ossicles provide an impedance-transforming lever system between the tympanic membrane and the oval window.

The middle ear is normally sealed, but ambient pressure changes will cause static pressure on the tympanic membrane, which is painful. The pressure is relieved by the Eustachian tube, which opens involuntarily during swallowing. The Eustachian tubes open into the cavities of the head and must normally be closed to avoid one's own speech seeming deafeningly loud.

The ossicles are located by minute muscles, which are normally relaxed. However, the middle ear reflex is an involuntary tightening of the tensor tympani and stapedius muscles, which reduces the ability of the tympanic membrane and the stapes to transmit sound, attenuating it by about 12 dB at frequencies below 1 kHz. The main function of this reflex is to reduce the audibility of one's own speech. However, loud sounds will also trigger this reflex, which takes some 60–120 ms to occur, too late to protect against transients such as gunfire.

The cochlea, shown in Figure 7.3a, is a tapering spiral cavity within bony walls, which is filled with fluid. The widest part, near the oval window, is called the base and the distant end is the apex. Figure 7.3b shows that the cochlea is divided lengthwise into three volumes by Reissner's membrane and the basilar membrane. The scala vestibuli and the scala tympani are connected by a small aperture at the apex of the cochlea known as the helicotrema. Vibrations from the stapes are transferred to the oval window and become fluid pressure variations, which are relieved by the flexing of the round window. Effectively the basilar membrane is in series with the fluid motion and is driven by it except at very low frequencies, at which the fluid flows through the helicotrema, bypassing the basilar membrane.

image

FIGURE 7.3

(a) The cochlea is a tapering spiral cavity. (b) The cross section of the cavity is divided by Reissner's membrane and the basilar membrane. (c) The basilar membrane tapers so its resonant frequency changes along its length.

Figure 7.3c shows that the basilar membrane is not uniform, but tapers in width and varies in thickness in the opposite sense to the taper of the cochlea. The part of the basilar membrane that resonates in response to an applied sound is a function of the frequency. High frequencies cause resonance near to the oval window, whereas low frequencies cause resonance farther away. More precisely, the distance from the apex at which the maximum resonance occurs is a logarithmic function of the frequency. Consequently tones spaced apart in octave steps will excite evenly spaced resonances in the basilar membrane. The prediction of resonance at a particular location on the membrane is called place theory. Essentially the basilar membrane is a mechanical frequency analyser. A knowledge of the way it operates is essential to an understanding of musical phenomena such as pitch discrimination, timbre, consonance, and dissonance and to auditory phenomena such as critical bands, masking, and the precedence effect.

The vibration of the basilar membrane is sensed by the organ of Corti, which runs along the centre of the cochlea. The organ of Corti is active in that it contains elements that can generate vibration as well as sense it. These are connected in a regenerative fashion so that the Q factor, or frequency selectivity of the ear, is higher than it would otherwise be. The deflection of hair cells in the organ of Corti triggers nerve firings and these signals are conducted to the brain by the auditory nerve.

Nerve firings are not a perfect analog of the basilar membrane motion. A nerve firing appears to occur at a constant phase relationship to the basilar vibration, a phenomenon called phase locking; but firings do not necessarily occur on every cycle. At higher frequencies firings are intermittent, yet each is in the same phase relationship.

The resonant behaviour of the basilar membrane is not observed at the lowest audible frequencies below 50 Hz. The pattern of vibration does not appear to change with frequency and it is possible that the frequency is low enough to be measured directly from the rate of nerve firings.

LEVEL AND LOUDNESS

At its best, the ear can detect a sound pressure variation of only 2 × 10−5 pascals rms and so this figure is used as the reference against which sound pressure level (SPL) is measured. The sensation of loudness is a logarithmic function of SPL and consequently a logarithmic unit, the decibel, is used in audio measurement.
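As an illustration of the arithmetic, the minimal Python sketch below converts an rms sound pressure to dB SPL using the 2 × 10−5 Pa reference just given; the example pressures are illustrative values only.

```python
import math

P_REF = 2e-5  # reference rms sound pressure in pascals (threshold of hearing)

def spl_db(pressure_pa):
    """Sound pressure level in dB relative to 2e-5 Pa rms."""
    return 20.0 * math.log10(pressure_pa / P_REF)

print(spl_db(2e-5))  # 0 dB SPL: the threshold of hearing
print(spl_db(1.0))   # ~94 dB SPL: the level of a typical acoustic calibrator
print(spl_db(63.2))  # ~130 dB SPL: the top of the ear's dynamic range
```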

The dynamic range of the ear exceeds 130 dB, but at the extremes of this range, the ear is either straining to hear or in pain. Neither of these cases can be described as pleasurable or entertaining, and it is hardly necessary to produce audio of this dynamic range because, among other things, the consumer is unlikely to have anywhere sufficiently quiet to listen to it. On the other hand, extended listening to music whose dynamic range has been excessively compressed is fatiguing.

The frequency response of the ear is not at all uniform and it also changes with SPL. The subjective response to level is called loudness and is measured in phons. The phon scale and the SPL scale coincide at 1 kHz, but at other frequencies the phon scale deviates because it displays the actual SPLs judged by a human subject to be as loud as a given level at 1 kHz. Figure 7.4 shows the so-called equal loudness contours, which were originally measured by Fletcher and Munson and subsequently by Robinson and Dadson. Note the irregularities caused by resonances in the meatus at about 4 and 13 kHz.

Usually, people's ears are at their most sensitive between about two and five kHz, and although some people can detect 20 kHz at high level, there is much evidence to suggest that most listeners cannot tell if the upper frequency limit of sound is 20 or 16 kHz.3,4 For a long time it was thought that frequencies below about 40 Hz were unimportant, but it is now clear that reproduction of frequencies down to 20 Hz improves reality and ambience.5 The generally accepted frequency range for high-quality audio is 20–20,000 Hz, although for broadcasting an upper limit of 15,000 Hz is often applied. The most dramatic effect of the curves of Figure 7.4 is that the bass content of reproduced sound is disproportionately reduced as the level is turned down.

Loudness is a subjective reaction and is almost impossible to measure. In addition to the level-dependent frequency response problem, the listener uses the sound not for its own sake but to draw some conclusion about the source. For example, most people hearing a distant motorcycle will describe it as being loud. Clearly at the source, it is loud, but the listener has compensated for the distance.

The best that can be done is to make some compensation for the level-dependent response using weighting curves. Ideally there should be many, but in practice the A, B, and C weightings were chosen, in which the A curve is based on the 40-phon response. The measured level after such a filter is in units of dBA. The A curve is almost always used because it most nearly relates to the annoyance factor of distant noise sources.

image

FIGURE 7.4

Contours of equal loudness showing that the frequency response of the ear is highly level-dependent (solid line, age 20; dashed line, age 60).

CRITICAL BANDS

Figure 7.5 shows an uncoiled basilar membrane with the apex on the left so that the usual logarithmic frequency scale can be applied. The envelope of displacement of the basilar membrane is shown for a single frequency in Figure 7.5a. The vibration of the membrane in sympathy with a single frequency cannot be localized to an infinitely small area, and nearby areas are forced to vibrate at the same frequency with an amplitude that decreases with distance. Note that the envelope is asymmetrical because the membrane is tapering and because of frequency-dependent losses in the propagation of vibrational energy down the cochlea. If the frequency is changed, as in Figure 7.5b, the position of maximum displacement will also change. As the basilar membrane is continuous, the position of maximum displacement is infinitely variable, allowing extremely good pitch discrimination of about one-twelfth of a semitone, which is determined by the spacing of hair cells.

In the presence of a complex spectrum, the finite width of the vibration envelope means that the ear fails to register energy in some bands when there is more energy in a nearby band. Within those areas, other frequencies are mechanically excluded because their amplitude is insufficient to dominate the local vibration of the membrane. Thus the Q factor of the membrane is responsible for the degree of auditory masking, defined as the decreased audibility of one sound in the presence of another.

image

FIGURE 7.5

The basilar membrane symbolically uncoiled. (a) Single frequency causes the vibration envelope shown. (b) Changing the frequency moves the peak of the envelope.

The term used in psycho-acoustics to describe the finite width of the vibration envelope is “critical bandwidth.” Critical bands were first described by Fletcher.6 The envelope of basilar vibration is a complicated function. It is clear from the mechanism that the area of the membrane involved will increase as the sound level rises. Figure 7.6 shows the bandwidth as a function of level.

As was shown in Chapter 3, transform theory teaches that the higher the frequency resolution of a transform, the worse the time accuracy. As the basilar membrane has finite frequency resolution measured in the width of a critical band, it follows that it must have finite time resolution. This also follows from the fact that the membrane is resonant, taking time to start and stop vibrating in response to a stimulus. There are many examples of this. Figure 7.7 shows the impulse response. Figure 7.8 shows that the perceived loudness of a tone burst increases with duration up to about 200 ms due to the finite response time.

The ear has evolved to offer intelligibility in reverberant environments, which it does by averaging all received energy over a period of about 30 ms. Reflected sound that arrives within this time is integrated to produce a louder sensation, whereas reflected sound that arrives after that time can be temporally discriminated and is perceived as an echo. Our simple microphones have no such ability, which is why we often need to have acoustic treatment in areas where microphones are used.

image

FIGURE 7.6

The critical bandwidth changes with SPL.

image

FIGURE 7.7

Impulse response of the ear showing slow attack and decay due to resonant behaviour.

image

FIGURE 7.8

Perceived level of tone burst rises with duration as resonance builds up.

image

FIGURE 7.9

Effective rectangular bandwidth of the critical band is much wider than the resolution of the pitch discrimination mechanism.

A further example of the finite time discrimination of the ear is the fact that short interruptions to a continuous tone are difficult to detect. Finite time resolution means that masking can take place even when the masking tone begins after and ceases before the masked sound. This is referred to as forward and backward masking.7

As the vibration envelope is such a complicated shape, Moore and Glasberg8 have proposed the concept of the equivalent rectangular bandwidth (ERB) to simplify matters. The ERB is the bandwidth of a rectangular filter that passes the same power as a critical band. Figure 7.9a shows the expression they have derived linking the ERB with frequency. This is plotted in Figure 7.9b, in which it will be seen that one-third of an octave is a good approximation. This is about 30 times broader than the pitch discrimination also shown in Figure 7.9b.
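The text defers the actual expression to Figure 7.9a; the widely quoted Moore–Glasberg fit, assumed here for illustration, is ERB = 24.7(4.37F + 1) Hz with F in kilohertz. A minimal sketch:

```python
def erb_hz(f_hz):
    """Equivalent rectangular bandwidth after the commonly cited
    Moore-Glasberg fit; the fit expresses frequency F in kHz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

for f in (100, 500, 1000, 4000, 10000):
    print(f"{f:6d} Hz -> ERB about {erb_hz(f):6.1f} Hz")
```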

STEREO AND SURROUND SOUND

The human listener can determine reasonably well where a sound is coming from. An understanding of the mechanisms of direction sensing is important for the successful implementation of spatial illusions such as stereophonic sound. As Figure 7.10 shows, having a pair of spaced ears allows a number of mechanisms. In Figure 7.10a a phase shift will be apparent between the two versions of a tone picked up at the two ears unless the source of the tone is dead ahead (or behind). In (b) the distant ear is shaded by the head, resulting in reduced response compared to the nearer ear. In (c) a transient sound arrives later at the more distant ear.

If the phase-shift mechanism (Figure 7.10a) is considered, then it will be clear that there will be considerable variation in this effect with frequency. At a low frequency such as 30 Hz, the wavelength is around 11.5 m. Even if heard from the side, the ear spacing of about 0.2 m will result in a phase shift of only 6° and so this mechanism must be quite weak at low frequencies. At high frequencies such as 10 kHz, the ear spacing is many wavelengths, and variations in the pathlength difference will produce a confusing and complex phase relationship. The problem with tones or single frequencies is that they produce a sinusoidal waveform, one cycle of which looks much like another, leading to ambiguities in the time between two versions. This is shown in Figure 7.11a.
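The numbers in this paragraph follow directly from the wavelength. Assuming a speed of sound of 343 m/s (an assumption; the text quotes only the wavelength) and the 0.2 m ear spacing given above, a short sketch reproduces them:

```python
C = 343.0            # assumed speed of sound in air, m/s
EAR_SPACING = 0.2    # approximate interaural path difference from the side, m

def phase_shift_deg(f_hz):
    """Interaural phase shift of a tone heard from directly to one side."""
    wavelength = C / f_hz
    return 360.0 * EAR_SPACING / wavelength

print(C / 30.0)               # wavelength at 30 Hz: about 11.4 m
print(phase_shift_deg(30))    # about 6 degrees: too small to be useful
print(phase_shift_deg(1500))  # over 300 degrees: nearly a full ambiguous cycle
```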

Pure tones are extremely difficult to localize, especially as they will often excite standing waves in the room, which give a completely misleading impression of the location of the sound source. Consequently the phase-comparison mechanism must be restricted to frequencies at which the wavelength is short enough to give a reasonable phase shift, but not so short that complete cycles of shift are introduced. This suggests a frequency limit of around 1500 Hz, which has been confirmed by experiment.

The shading mechanism of Figure 7.10b will be a function of the directivity factor, suggesting that at low and middle frequencies sound will diffract round the head sufficiently well that there will be no significant difference between the level at the two ears. Only at high frequencies does sound become directional enough for the head to shade the distant ear, causing what is called an interaural intensity difference. At very high frequencies, the shape of the pinnae must have some effect on the sound that is a function of direction. It is thought that the pinnae allow some discrimination in all axes.

Phase differences are useful only at low frequencies and shading works only at high frequencies. Fortunately real-world sounds are timbral or broadband and often contain transients, especially those sounds that indicate a potential hazard. Timbral, broadband, and transient sounds differ from tones in that they contain many different frequencies. A transient has a unique aperiodic waveform, which Figure 7.11b shows has the advantage that there can be no ambiguity in the interaural delay (IAD) between two versions. Figure 7.12 shows the time difference for different angles of incidence for a typical head.

Note that a 1° change in sound location causes an IAD of around 10μs. The smallest detectable IAD is a remarkable 6μs. The basilar membrane is a frequency analysis device, which produces nerve impulses from different physical locations according to which frequencies are present in the incident sound. Clearly when a timbral or transient sound arrives from one side, many frequencies will be excited simultaneously in the nearer ear, resulting in a pattern of nerve firings. This will be closely followed by a similar excitation pattern in the further ear.
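The figures quoted here can be approximated with the common spherical-head (Woodworth) model, IAD = (r/c)(θ + sin θ); the head radius of 0.09 m and speed of sound of 343 m/s are assumptions made for illustration, not values from the text:

```python
import math

C = 343.0   # assumed speed of sound, m/s
R = 0.09    # assumed head radius, m

def iad_us(angle_deg):
    """Interaural arrival-time difference in microseconds for a source
    at angle_deg from straight ahead (spherical-head approximation)."""
    theta = math.radians(angle_deg)
    return 1e6 * (R / C) * (theta + math.sin(theta))

print(iad_us(1.0))    # ~9 us: close to the 10 us per degree quoted above
print(iad_us(90.0))   # ~675 us: the maximum, with the source to one side
```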

image

image

FIGURE 7.10

Having two spaced ears allows a number of mechanisms. (a) Off-centre sounds result in phase difference. (b) Distant ear is shaded by head, producing loss of high frequencies. (c) Distant ear detects transient later.

image

FIGURE 7.11

(a) Pure tones cause ambiguity in timing differences. (b) Transients have no ambiguity and are easier to localize.

image

FIGURE 7.12

The IAD for various arrival directions.

Shading may change the relative amplitudes of the higher frequencies, but it will not change the pattern of frequency components present. A timbral waveform is periodic at the fundamental frequency but the presence of harmonics means that a greater number of nerve firings can be compared between the two ears. As the statistical deviation of nerve firings with respect to the incoming waveform is about 100μs, the only way an IAD of 6μs can be perceived is if the timing of many nerve firings is correlated in some way.

image

FIGURE 7.13

Configuration used for stereo listening.

The broader the range of frequencies in the sound source, the more effective this process will be. Analysis of the arrival time of transients is a most effective lateral direction-sensing mechanism. This is easily demonstrated by wearing a blindfold and having a helper move around the room making a variety of noises. The helper will be easier to localize when making clicking noises than when humming. It is easy to localize a double bass despite the low fundamental as it is a harmonically rich instrument.

It must be appreciated that human hearing can locate a number of different sound sources simultaneously. The hearing mechanism must be constantly comparing excitation patterns from the two ears with different delays. Strong correlation will be found where the delay corresponds to the interaural delay for a given source. This is apparent in the binaural threshold of hearing, which is 3–6 dB better than monaural at around four kHz. This delay-varying mechanism will take time and it is to be expected that the ear would then be slow to react to changes in source direction. This is indeed the case and experiments have shown that oscillating sources can be tracked only up to 2–3 Hz.

The ability to locate bursts of noise improves with burst duration up to about 700 ms. The interaural phase, delay, and level mechanisms vary in their effectiveness depending on the nature of the sound to be located. A fixed response to each mechanism would be ineffective. For example, on a low-frequency tone, the time-of-arrival mechanism is useless, whereas on a transient it excels. The different mechanisms are quite separate on one level, but at some point in the brain's perception a fuzzy logic or adaptive decision has to be made as to how the outcome of these mechanisms will be weighted to make the final judgment of direction.

The most popular technique for giving an element of spatial realism in sound is stereophony, nowadays abbreviated to stereo, based on two simultaneous audio channels feeding two spaced loudspeakers. Figure 7.13 shows that the optimum listening arrangement for stereo is one in which the speakers and the listener are at the corners of a triangle that is almost equilateral. Stereophony works by creating differences of phase and time of arrival of sound at the listener's ears. It was shown above that these are the most powerful hearing mechanisms for determining direction. Figure 7.14a shows that this time-of-arrival difference is achieved by producing the same waveform at each speaker simultaneously, but with a difference in the relative level, rather than phase. Each ear picks up sound from both loudspeakers and sums the waveforms.

image

FIGURE 7.14

(a) Stereo illusion is obtained by producing the same waveform at both speakers, but with different amplitudes. (b) As both ears hear both speakers, but at different times, relative level causes apparent time shift at the listener.

The sound picked up by the ear on the same side as the speaker is in advance of the same sound picked up by the opposite ear. When the level emitted by the left loudspeaker is greater than that emitted by the right, it will be seen from Figure 7.14b that the sum of the signals received at the left ear is a waveform that is phase advanced with respect to the sum of the waveforms received at the right ear. If the waveforms concerned are transient the result will be a time-of-arrival difference. These differences are interpreted as being due to a sound source left of centre.

The stereophonic illusion works properly only if the two loudspeakers are producing in-phase signals. In the case of an accidental phase reversal, the spatial characteristic will be ill-defined and lack images. At low frequencies the two loudspeakers are in one another's near field and so anti-phase connection results in bass cancellation. As the apparent position of a sound source between the two speakers can be controlled solely by the relative level of the sound emitted by each one, stereo signals in this format are called intensity stereo. In intensity stereo it is possible to “steer” a monophonic signal from a single microphone into a particular position in a stereo image using a form of differential gain control.

Figure 7.15 shows that this device, known as a panoramic potentiometer, or panpot for short, will produce equal outputs when the control is set to the centre. If the panpot is moved left or right, one output will increase and the other will reduce, moving or panning the stereo image to one side. If the system is perfectly linear, more than one sound source can be panned into a stereo image, with each source heard in a different location. This is done using a stereo mixer, shown in Figure 7.16, in which monophonic inputs pass via panpots to a stereo mix bus.

image

FIGURE 7.15

The panpot distributes a monophonic microphone signal into two stereo channels, allowing the sound source to be positioned anywhere in the image.
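The text does not specify the gain law of the panpot; a common choice is the constant-power sine/cosine law sketched below, which keeps the summed acoustic power roughly constant as the image is panned. This is an illustrative assumption, not a description of any particular console.

```python
import math

def panpot(mono, pan):
    """Constant-power pan law: pan = -1 is full left, 0 centre, +1 full right.
    Returns the (left, right) signals derived from the mono input."""
    angle = (pan + 1.0) * math.pi / 4.0   # maps pan to 0 .. pi/2
    return mono * math.cos(angle), mono * math.sin(angle)

print(panpot(1.0, 0.0))    # centre: both channels at 0.707 (-3 dB)
print(panpot(1.0, -1.0))   # full left: (1.0, 0.0)
print(panpot(1.0, 0.5))    # image panned part way to the right
```

With a linear mix bus, several such panned signals simply add, which is the stereo mixer of Figure 7.16.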

image

FIGURE 7.16

Multichannel mixing technique pans multiple sound sources into one stereo image.

Whilst the stereo illusion is very rewarding when well executed, the development of digital transmission allows an arbitrary number of sound channels to accompany video. In this way surround-sound techniques developed for the cinema can be enjoyed in the home.

Figure 7.17 shows the 5.1 channel system proposed for advanced television sound applications. In addition to the conventional L and R stereo speakers at the front, a centre speaker is used. When normal L–R stereo is heard from off-centre, the image will be pulled toward the nearer speaker. The centre speaker is primarily to pull central images back for the off-centre listener.

In most television applications it is only the dialogue that needs to be centralized and consequently the centre speaker need not be capable of the full frequency range. Rear L and R speakers are also provided, making a total of five channels. A narrow bandwidth subwoofer channel is also provided to produce low frequencies for the inevitable earthquakes and explosions. The restricted bandwidth means that six full channels are not required, hence the term 5.1. Such systems require the separate channels to be carried individually to the viewer.

It should be appreciated that surround sound is something of a misnomer. Consider four speakers arranged in a square, with the listener in the middle facing the front pair. Anywhere between the front pair of speakers virtual sound sources can be presented. Anywhere between the rear pair of speakers more virtual sound sources can be located. Unfortunately there is no mechanism to place sound sources at the sides, between the front and the rear speakers. This is simply because, with respect to a pair of speakers to his or her side, the ears of a forward-facing listener are in tandem and the stereo illusion simply cannot function. Thus 4.0 or 5.0 surround sound isn't surround sound at all, but front and rear stereo. The only way any sound will approach the listener from the side is if the room is reverberant and if the loudspeakers are capable of launching accurate sound in more than just the one direction. In order to mix 4.0 or 5.0 sound, the traditional acoustically dead room approach has to be abandoned and monitoring must be performed in surroundings more representative of the consumer environment. If sound from the sides is essential, a seven-channel system that actually has side speakers will be needed.

image

FIGURE 7.17

A 5.1-channel surround-sound system.

CHOICE OF SAMPLING RATE FOR AUDIO

Sampling theory is only the beginning of the process that must be followed to arrive at a suitable sampling rate. The finite slope of realizable filters will compel designers to raise the sampling rate. For consumer products, the lower the sampling rate, the better, because the cost of the medium is directly proportional to the sampling rate; thus sampling rates near to twice 20 kHz are to be expected. For professional products, there is a need to operate at variable speed for pitch correction. When the speed of a digital recorder is reduced, the reproduced sampling rate falls, and Figure 7.18 shows that with a minimal sampling rate the first image frequency can become low enough to pass the reconstruction filter. If the sampling frequency is raised without changing the response of the filters, the speed can be reduced without this problem.

image

FIGURE 7.18

(a) At normal speed, the reconstruction filter correctly prevents images entering the baseband. (b) When speed is reduced, the sampling rate falls, and a fixed filter will allow part of the lower sideband of the sampling frequency to pass. (c) If the sampling rate of the machine is raised, but the filter characteristic remains the same, the problem can be avoided.

In the early days of digital audio research, the necessary bit rate of about one megabit per second per audio channel was difficult to store. Disk drives then had the bandwidth but not the capacity for long recording time, so attention turned to video recorders. These were adapted to store audio samples by creating a pseudo-video waveform, which could convey binary as black and white levels. The sampling rate of such a system is constrained to relate simply to the field rate and field structure of the television standard used, so that an integer number of samples can be stored on each usable TV line in the field. Such a recording can be made on a monochrome recorder, and these recordings are made in two standards, 525 lines at 60 Hz and 625 lines at 50 Hz. Thus it is possible to find a frequency that is a common multiple of the two and also suitable for use as a sampling rate.

The allowable sampling rates in a pseudo-video system can be deduced by multiplying the field rate by the number of active lines in a field (blanked lines cannot be used) and again by the number of samples in a line. By careful choice of parameters it is possible to use either 525/60 or 625/50 video with a sampling rate of 44.1 kHz.

In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame, or 245 lines per field for samples. If three samples are stored per line, the sampling rate becomes

60 × 245 × 3 = 44.1 kHz.

In 50 Hz video, there are 37 lines of blanking, leaving 588 active lines per frame, or 294 per field, so the same sampling rate is given by

50 × 294 × 3 = 44.1 kHz.
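The two derivations are easily checked; the sketch below simply repeats the arithmetic of the text:

```python
rate_60 = 60 * 245 * 3   # 60 Hz fields, 245 active lines, 3 samples per line
rate_50 = 50 * 294 * 3   # 50 Hz fields, 294 active lines, 3 samples per line
print(rate_60, rate_50)  # both print 44100
```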

The sampling rate of 44.1 kHz came to be that of the Compact Disc. Even though a CD has no video circuitry, the equipment used to make CD masters was video based and determined the sampling rate.

For landlines to FM stereo broadcast transmitters having a 15 kHz audio bandwidth, the sampling rate of 32 kHz is more than adequate and has been in use for some time in the United Kingdom and Japan. This frequency is also in use in the NICAM 728 stereo TV sound system and in DVB. It is also used for the Sony NT format minicassette. The professional sampling rate of 48 kHz was proposed as having a simple relationship to 32 kHz, being far enough above 40 kHz for variable-speed operation.

Although in a perfect world the adoption of a single sampling rate might have had virtues, for practical and economic reasons digital audio now has essentially three rates to support: 32 kHz for broadcast, 44.1 kHz for CD and its mastering equipment, and 48 kHz for “professional” use.9 A rate of 48 kHz is used extensively in television production, in which it can be synchronised to both U.S. and European line standards relatively easily. The currently available DVTR formats offer 48 kHz audio sampling. A number of formats can operate at more than one sampling rate although not all available hardware implements every possibility of the format. Most hard disk recorders will operate at a range of rates. Although higher sampling rates than 48 kHz are available, these appear to be offered because they have become technically possible rather than because there is a demonstrated need. Certainly there is nothing in psycho-acoustics to justify their use.

BASIC DIGITAL-TO-ANALOG CONVERSION

This direction of conversion will be discussed first, because ADCs often use embedded DACs in feedback loops. The purpose of a digital-to-analog convertor is to take numerical values and reproduce the continuous waveform that they represent. Figure 7.19 shows the major elements of a conventional conversion subsystem, i.e., one in which oversampling is not employed. The jitter in the clock needs to be removed with a VCO (voltage-controlled oscillator) or VCXO (voltage-controlled crystal oscillator). Sample values are buffered in a latch and fed to the convertor element, which operates on each cycle of the clean clock. The output is then a voltage proportional to the number for at least a part of the sample period. A resampling stage may be found next, to remove switching transients, reduce the aperture ratio, or allow the use of a convertor, which takes a substantial part of the sample period to operate. The resampled waveform is then presented to a reconstruction filter, which rejects frequencies above the audio band.

image

FIGURE 7.19

The components of a conventional convertor. A jitter-free clock drives the voltage conversion, whose output may be resampled prior to reconstruction.

This section is primarily concerned with the implementation of the convertor element. There are two main ways of obtaining an analog signal from PCM (pulse code modulation) data. One is to control binary-weighted currents and sum them; the other is to control the length of time a fixed current flows into an integrator. The two methods are contrasted in Figure 7.20. They appear simple, but are of no use for audio in these forms because of practical limitations. In Figure 7.20c, the binary code is about to have a major overflow, and all the low-order currents are flowing. In (d), the binary input has increased by one, and only the most significant current flows. This current must equal the sum of all the others plus one. The accuracy must be such that the step size is within the required limits. In this simple four-bit example, if the step size needs to be a rather casual 10 percent accurate, the necessary accuracy is only one part in 160, but for a 16-bit system it would become one part in 655,360, or about 1.5 ppm. This degree of accuracy is almost impossible to achieve, let alone maintain in the presence of ageing and temperature change.

image

FIGURE 7.20

Elementary conversion: (a) weighted current DAC, (b) timed integrator DAC,

The integrator-type convertor in this four-bit example is shown in Figure 7.20e; it requires a clock for the counter, which allows it to count up to the maximum in less than one sample period. This will be more than 16 times the sampling rate. However, in a 16-bit system, the clock rate would need to be 65,536 times the sampling rate, or about three GHz. Whilst there may be a market for a CD player that can defrost a chicken, clearly some refinements are necessary to allow either of these convertor types to be used in audio applications.
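Both practical limitations can be put into numbers. The sketch below, using only figures from the text, computes the current accuracy demanded of a binary-weighted DAC and the counter clock demanded of an integrator DAC:

```python
def weighted_current_accuracy(bits, step_tolerance=0.1):
    """Relative accuracy required of the MSB current if the step at the
    major overflow is to stay within step_tolerance of one quantizing step."""
    return step_tolerance / (2 ** bits)

print(1 / weighted_current_accuracy(4))    # one part in 160 for 4 bits
print(1 / weighted_current_accuracy(16))   # one part in 655,360 (~1.5 ppm)

def integrator_clock_hz(bits, sampling_rate=48_000):
    """Counter clock needed to reach full scale inside one sample period."""
    return (2 ** bits) * sampling_rate

print(integrator_clock_hz(4))    # 768 kHz for the four-bit example
print(integrator_clock_hz(16))   # about 3.1 GHz: the chicken-defrosting clock
```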

One method of producing currents of high relative accuracy is dynamic element matching.10,11 Figure 7.21 shows a current source feeding a pair of nominally equal resistors. The two will not be the same due to manufacturing tolerances and drift, and thus the current is only approximately divided between them. A pair of change-over switches places each resistor in series with each output. The average current in each output will then be identical, provided that the duty cycle of the switches is exactly 50 percent. This is readily achieved in a divide-by-2 circuit. The accuracy criterion has been transferred from the resistors to the time domain in which accuracy is more readily achieved. Current averaging is performed by a pair of capacitors that do not need to be of any special quality. By cascading these divide-by-2 stages, a binary-weighted series of currents can be obtained, as in Figure 7.22.
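A toy numerical example shows why the duty cycle, rather than the resistors, sets the accuracy. The 2 percent resistor mismatch assumed below is an illustrative figure only:

```python
# Two nominally equal current shares with an assumed 2% mismatch.
i_a, i_b = 1.02, 0.98

# Each resistor spends exactly half its time in each output path,
# so each output's average is the mean of the two shares.
out_1 = 0.5 * i_a + 0.5 * i_b
out_2 = 0.5 * i_b + 0.5 * i_a
print(out_1, out_2)   # both exactly 1.0: the mismatch has cancelled
```

Any residual error now comes only from the departure of the duty cycle from 50 percent, which a divide-by-2 circuit controls very tightly.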

image

FIGURE 7.20

(c) current flow with 0111 input, (d) current flow with 1000 input, (e) integrator ramps up for 15 cycles of clock for input 1111.

image

FIGURE 7.21

Dynamic element matching. (a) Each resistor spends half its time in each current path. (b) Average current of both paths will be identical if duty cycle is 50 percent accurate.

In practice, a reduction in the number of stages can be obtained by using a more complex switching arrangement. This generates currents of ratio 1:1:2 by dividing the current into four paths and feeding two of them to one output, as shown in Figure 7.23. A major advantage of this approach is that no trimming is needed in manufacture, making it attractive for mass production. Freedom from drift is a further advantage.

image

FIGURE 7.21

(c) Typical monolithic implementation. Note that clock frequency is arbitrary.

image

FIGURE 7.22

Cascading the current dividers of Figure 7.21 produces a binary-weighted series of currents.

To prevent interaction between the stages in weighted-current convertors, the currents must be switched to ground or into the virtual earth by change-over switches. The on-resistance of these switches is a source of error, particularly for the MSB switch, which passes the most current. A solution in monolithic convertors is to fabricate switches whose area is proportional to the weighted current, so that the voltage drops of all the switches are the same. The error can then be removed with a suitable offset. The layout of such a device is dominated by the MSB switch because, by definition, it is as big as all the others put together.

image

FIGURE 7.23

A more complex dynamic element-matching system. Four drive signals (1, 2, 3, 4) of 25 percent duty cycle close switches of corresponding numbers. Two signals (5, 6) have 50 percent duty cycle, resulting in two current shares going to right-hand output. Division is thus into 1:1:2.

The practical approach to the integrator convertor is shown in Figures 7.24 and 7.25, in which two current sources, whose ratio is 256:1, are used; the larger is timed by the high byte of the sample and the smaller is timed by the low byte. The necessary clock frequency is reduced by a factor of 256. Any inaccuracy in the current ratio will cause one quantizing step in every 256 to be of the wrong size, as shown in Figure 7.26, but current tracking is easier to achieve in a monolithic device. The integrator capacitor must have low dielectric leakage and relaxation, and the operational amplifier must have low bias current as this will have the same effect as leakage.

image

FIGURE 7.24

Simplified diagram of Sony CX-20017. The high-order and low-order current sources (IH and IL) and associated timing circuits can be seen. The necessary integrator is external.

image

FIGURE 7.25

In an integrator convertor, the output level is stable only when the ramp finishes. An analog switch is necessary to isolate the ramp from subsequent circuits. The switch can also be used to produce a PAM (pulse amplitude-modulated) signal, which has a flatter frequency response than a zero-order hold (staircase) signal.

The output of the integrator will remain constant once the current sources are turned off, and the resampling switch will be closed during the voltage plateau to produce the pulse amplitude-modulated output. Clearly this device cannot produce a zero-order hold output without an additional sample-hold stage, so it is naturally complemented by resampling. Once the output pulse has been gated to the reconstruction filter, the capacitor is discharged with a further switch in preparation for the next conversion. The conversion count must take place in rather less than one sample period to permit the resampling and discharge phases. A clock frequency of about 20 MHz is adequate for a 16-bit 48 kHz unit, which permits the ramp to complete in 12.8 μs, leaving 8 μs for resampling and reset.

image

FIGURE 7.26

Imprecise tracking in a dual-slope convertor results in the transfer function shown here.

BASIC ANALOG-TO-DIGITAL CONVERSION

A conventional analog-to-digital subsystem is shown in Figure 7.27. Following the anti-aliasing filter there will be a sampling process. Many of the ADCs described here will need a finite time to operate, whereas an instantaneous sample must be taken from the input. The solution is to use a track/hold circuit. Following sampling the sample voltage is quantized. The number of the quantized level is then converted into a binary code, typically two's complement. This section is concerned primarily with the implementation of the quantizing step.

The flash convertor is probably the simplest technique available for PCM and DPCM (differential pulse code modulation) conversion. The principle was shown in Chapter 4. Although the device is simple in principle, it contains a lot of circuitry and can be practicably implemented only on a chip. A 16-bit device would need a ridiculous 65,535 comparators, and thus these convertors are not practicable for direct audio conversion, although they will be used to advantage in the DPCM and oversampling convertors described later in this chapter. The analog signal has to drive a lot of inputs, which results in a significant parallel capacitance, and a low-impedance driver is essential to avoid restricting the slewing rate of the input. The extreme speed of a flash convertor is a distinct advantage in oversampling. Because computation of all bits is performed simultaneously, no track/hold circuit is required, and droop is eliminated.

image

FIGURE 7.27

A conventional analog-to-digital subsystem. Following the anti-aliasing filter there will be a sampling process, which may include a track-hold circuit. Following quantizing, the number of the quantized level is then converted to a binary code, typically two's complement.

Reduction in component complexity can be achieved by quantizing serially. The most primitive method of generating different quantized voltages is to connect a counter to a DAC as in Figure 7.28. The resulting staircase voltage is compared with the input and used to stop the clock to the counter when the DAC output has just exceeded the input. This method is painfully slow, and is not used, as a much faster method that is only slightly more complex exists. Using successive approximation, each bit is tested in turn, starting with the MSB. If the input is greater than half-range, the MSB will be retained and used as a base to test the next bit, which will be retained if the input exceeds three-quarters range, and so on. The number of decisions is equal to the number of bits in the word, in contrast to the number of quantizing intervals, which was the case in the previous example. A drawback of the successive approximation convertor is that the least significant bits are computed last, when droop is at its worst. Figures 7.29 and 7.30 show that droop can cause a successive approximation convertor to make a significant error under certain circumstances.
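A minimal sketch of the successive-approximation loop, with an idealized DAC and comparator folded into one test, makes the bit-by-bit decision process concrete:

```python
def sar_adc(vin, vref, bits):
    """Successive approximation: test each bit from the MSB down, keeping
    it whenever the trial DAC output does not exceed the input voltage."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if trial * vref / (1 << bits) <= vin:   # idealized DAC + comparator
            code = trial
    return code

print(sar_adc(0.8, 1.0, 4))   # 12, i.e., 1100: found in only 4 comparisons
```

Note that only four decisions are needed for four bits, against the fifteen levels a simple ramp convertor might have to count through.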

image

FIGURE 7.28

A simple-ramp ADC compares the output of the DAC with the input. The count is stopped when the DAC output just exceeds the input. This method, although potentially accurate, is much too slow for digital audio.

Analog-to-digital conversion can also be performed using the dual-current-source type DAC principle in a feedback system; the major difference is that the two current sources must work sequentially rather than concurrently. Figure 7.31 shows a 16-bit application in which the capacitor of the track-hold circuit is also used as the ramp integrator. The system operates as follows. When the track-hold FET switches off, the capacitor C will be holding the sample voltage. Two currents of ratio 128:1 are capable of discharging the capacitor. As a result of this ratio, the smaller current will be used to determine the seven least significant bits, and the larger current will determine the nine most significant bits. The currents are provided by current sources of ratio 127:1. When both run together, the current produced is 128 times that from the smaller source alone. This approach means that the current can be changed simply by turning off the larger source, rather than by attempting a change-over.

With both current sources enabled, the high-order counter counts up until the capacitor voltage has fallen below the reference of −128Q supplied to comparator 1. At the next clock edge, the larger current source is turned off. Waiting for the next clock edge is important, because it ensures that the larger source can run only for entire clock periods, which will discharge the integrator by integer multiples of 128Q. The integrator voltage will overshoot the 128Q reference, and the remaining voltage on the integrator will be less than 128Q and will be measured by counting the number of clocks for which the smaller current source runs before the integrator voltage reaches zero. This process is termed residual expansion. The break in the slope of the integrator voltage gives rise to the alternative title of gear-change convertor. Following ramping to ground in the conversion process, the track-hold circuit must settle in time for the next conversion. In this 16-bit example, the high-order conversion needs a maximum count of 512, and the low order needs 128: a total of 640. Allowing 25 percent of the sample period for the track-hold circuit to operate, a 48 kHz convertor would need to be clocked at some 40 MHz. This is rather faster than the clock needed for the DAC using the same technology.
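The clock estimate at the end of this paragraph is easily reproduced:

```python
sampling_rate = 48_000
conversion_time = 0.75 / sampling_rate   # 25% of the period kept for track-hold
max_counts = 512 + 128                   # 9-bit coarse ramp plus 7-bit fine ramp
print(max_counts / conversion_time)      # about 41 MHz counter clock
```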

image

FIGURE 7.29

Successive approximation tests each bit in turn, starting with the most significant. The DAC output is compared with the input. If the DAC output is below the input, the bit is made 1; if the DAC output is above the input, the bit is made 0.

image

FIGURE 7.30

Two drooping track-hold signals (solid and dashed lines) that differ by one quantizing interval Q are shown here to result in conversions that are 4Q apart. Thus droop can destroy the monotonicity of a convertor. Low-level signals (near the midrange of the number system) are especially vulnerable.

image

FIGURE 7.31

Dual-ramp ADC using track-hold capacitor as integrator.

ALTERNATIVE CONVERTORS

Although PCM audio is universal because of the ease with which it can be recorded and processed numerically, there are several alternative related methods of converting an analog waveform to a bitstream. The output of these convertor types is not Nyquist rate PCM, but this can be obtained from them by appropriate digital processing. In advanced conversion systems it is possible to adopt an alternative convertor technique specifically to take advantage of a particular characteristic. The output is then digitally converted to Nyquist rate PCM to obtain the advantages of both.

Conventional PCM has already been introduced. In PCM, the amplitude of the signal depends only on the number range of the quantizer and is independent of the frequency of the input. Similarly, the amplitude of the unwanted signals introduced by the quantizing process is also largely independent of input frequency.

Figure 7.32 introduces the alternative convertor structures. The top half of the diagram shows convertors that are differential. In differential coding the value of the output code represents the difference between the current sample voltage and that of the previous sample. The lower half of the diagram shows convertors that are PCM. In addition, the left side of the diagram shows single-bit convertors, whereas the right side shows multibit convertors.

In DPCM, shown at top right, the difference between the previous absolute sample value and the current one is quantized into a multibit binary code. It is possible to produce a DPCM signal from a PCM signal simply by subtracting successive samples; this is digital differentiation. Similarly the reverse process is possible by using an accumulator or digital integrator (see Chapter 3) to compute sample values from the differences received. The problem with this approach is that it is very easy to lose the baseline of the signal if it commences at some arbitrary time. A digital high-pass filter can be used to prevent unwanted offsets.
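In code, the differentiation and accumulation are each a few lines; the sketch below also shows the lost-baseline problem when the accumulator starts from the wrong value:

```python
def pcm_to_dpcm(samples, previous=0):
    """Digital differentiation: each output is the change since the last sample."""
    out = []
    for s in samples:
        out.append(s - previous)
        previous = s
    return out

def dpcm_to_pcm(diffs, start=0):
    """Digital integration (an accumulator) recovers the waveform."""
    out, acc = [], start
    for d in diffs:
        acc += d
        out.append(acc)
    return out

pcm = [0, 3, 5, 4, 1]
d = pcm_to_dpcm(pcm)
print(d)                      # [0, 3, 2, -1, -3]
print(dpcm_to_pcm(d))         # [0, 3, 5, 4, 1]: original recovered
print(dpcm_to_pcm(d, 10))     # offset by 10 throughout: the lost baseline
```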

Differential convertors do not have an absolute amplitude limit. Instead there is a limit to the maximum rate at which the input signal voltage can change. They are said to be slew rate limited, and thus the permissible signal amplitude falls at six dB per octave. As the quantizing steps are still uniform, the quantizing error amplitude has the same limits as PCM. As input frequency rises, the available signal amplitude therefore ultimately falls to the level of the quantizing error.

image

FIGURE 7.32

The four main alternatives to simple PCM conversion are compared here. Delta modulation is a one-bit case of differential PCM and conveys the slope of the signal. The digital output of both can be integrated to give PCM. Sigma-delta (Σ-Δ) is a one-bit case of sigma-DPCM. The application of an integrator before the differentiator makes the output true PCM, but tilts the noise floor; hence these can be referred to as “noise-shaping” convertors.

If DPCM is taken to the extreme case in which only a binary output signal is available, then the process is described as delta modulation (top left in Figure 7.32). The meaning of the binary output signal is that the current analog input is above or below the accumulation of all previous bits. The characteristics of the system show the same trends as DPCM, except that there is severe limiting of the rate of change of the input signal. A DPCM decoder must accumulate all the difference bits to provide a PCM output for conversion to analog, but with a one-bit signal the function of the accumulator can be performed by an analog integrator.

If an integrator is placed in the input to a delta modulator, the integrator's amplitude response loss of six dB per octave parallels the convertor's amplitude limit of six dB per octave; thus the system amplitude limit becomes independent of frequency. This integration is responsible for the term sigma-delta modulation, because in mathematics sigma (Σ) is used to denote summation. The input integrator can be combined with the integrator already present in a delta modulator by a slight rearrangement of the components (bottom left in Figure 7.32). The transmitted signal is now the amplitude of the input, not the slope; thus the receiving integrator can be dispensed with, and all that is necessary after the DAC is a low-pass filter to smooth the bits. The removal of the integration stage at the decoder now means that the quantizing error amplitude rises at six dB per octave, ultimately meeting the level of the wanted signal.

The principle of using an input integrator can also be applied to a true DPCM system and the result should perhaps be called sigma-DPCM (bottom right in Figure 7.32). The dynamic range improvement over delta-sigma modulation is six dB for every extra bit in the code. Because the level of the quantizing error signal rises at six dB per octave in both delta-sigma modulation and sigma-DPCM, these systems are sometimes referred to as “noise-shaping” convertors, although the word “noise” must be used with some caution. The output of a sigma-DPCM system is again PCM, and a DAC will be needed to receive it, because it is a binary code.
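A first-order one-bit sigma-delta modulator can be sketched in a few lines. The loop integrates the error between the input and the fed-back one-bit output, so the density of 1s in the bitstream tracks the input amplitude rather than its slope. This is a behavioural sketch only; practical convertors add dither and higher-order loops:

```python
def sigma_delta(samples):
    """First-order one-bit sigma-delta modulator (behavioural model).
    Inputs are assumed to lie in the range -1 .. +1."""
    integrator, out, bits = 0.0, 1.0, []
    for x in samples:
        integrator += x - out          # integrate the error vs. the feedback
        out = 1.0 if integrator >= 0.0 else -1.0
        bits.append(out)
    return bits

bits = sigma_delta([0.6] * 1000)
print(sum(bits) / len(bits))   # ~0.6: the bit density encodes the amplitude
```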

As the differential group of systems suffers from a wanted signal that converges with the unwanted signal as frequency rises, all the systems in the group must use very high sampling rates.12 It is possible to convert from sigma-DPCM to conventional PCM by reducing the sampling rate digitally. When the sampling rate is reduced in this way, the reduction of bandwidth excludes a disproportionate amount of noise because the noise shaping concentrated it at frequencies beyond the audio band. The use of noise shaping and oversampling is the key to the high resolution obtained in advanced convertors.

OVERSAMPLING AND NOISE SHAPING

It was seen in Chapter 4 that oversampling has a number of advantages for video conversion and the same is true for audio. Although it can be used alone, the advantages of oversampling in audio are better realized when it is used in conjunction with noise shaping. Thus in practice the two processes are generally used together and the terms are often used loosely, as if they were synonymous. For a detailed and quantitative analysis of audio oversampling with exhaustive references, the serious reader is referred to Hauser.13

Under Basic Digital-to-Analog Conversion, where dynamic element matching was described, it was seen that component accuracy was traded for accuracy in the time domain. Oversampling is another example of the same principle. Oversampling permits the use of a convertor element of shorter word length, making it possible to use a flash convertor for audio conversion. The flash convertor is capable of working at very high frequency and so large oversampling factors are easily realized. The flash convertor needs no track-hold system as it works instantaneously. The drawbacks of track-hold set out under Basic Analog-to-Digital Conversion are thus eliminated.

If the sigma-DPCM convertor structure of Figure 7.32 is realized with a flash convertor element, it can be used with a high oversampling factor. This class of convertor has a rising noise floor. If the highly oversampled output is fed to a digital low-pass filter that has the same frequency response as an analog anti-aliasing filter used for Nyquist rate sampling, the result is a disproportionate reduction in noise because the majority of the noise is outside the audio band. A high-resolution convertor can be obtained using this technology without requiring unattainable component tolerances.

Noise shaping dates from the work of Cutler14 in the 1950s. It is a feedback technique applicable to quantizers and requantizers in which the quantizing process of the current sample is modified in some way by the quantizing error of the previous sample.

When used with requantizing, noise shaping is an entirely digital process, which is used, for example, following word extension due to the arithmetic in digital mixers or filters to return to the required word length. It will be found in this form in oversampling DACs. When used with quantizing, part of the noise-shaping circuitry will be analog. As the feedback loop is placed around an ADC it must contain a DAC. When used in convertors, noise shaping is primarily an implementation technology. It allows processes that are conveniently available in integrated circuits to be put to use in audio conversion. Once integrated circuits can be employed, complexity ceases to be a drawback and low-cost mass production is possible.

It has been stressed throughout this chapter that a series of numerical values or samples is just another analog of an audio waveform. Chapter 3 showed that all analog processes such as mixing, attenuation, or integration have exact numerical parallels. It has been demonstrated that digitally dithered requantizing is no more than a digital simulation of analog quantizing. It should be no surprise that in this section noise shaping will be treated in the same way. Noise shaping can be performed by manipulating analog voltages or numbers representing them or both. If the reader is content to make a conceptual switch between the two, many obstacles to understanding fall, not just in this topic, but in digital audio and video in general.

The term “noise shaping” is idiomatic and in some respects unsatisfactory because not all devices that are called noise shapers produce true noise. The caution that was given when treating quantizing error as noise is also relevant in this context. Whilst “quantizing-error-spectrum shaping” is a bit of a mouthful, it is useful to keep in mind that noise shaping means just that in order to avoid some pitfalls. Some noise shaper architectures do not produce a signal-decorrelated quantizing error and need to be dithered.

Figure 7.33a shows a requantizer using a simple form of noise shaping. The low-order bits that are lost in requantizing are the quantizing error. If the value of these bits is added to the next sample before it is requantized, the quantizing error will be reduced. The process is somewhat like the use of negative feedback in an operational amplifier except that it is not instantaneous, but encounters a one-sample delay. With a constant input, the mean or average quantizing error will be brought to zero over a number of samples, achieving one of the goals of additive dither. The more rapidly the input changes, the greater the effect of the delay and the less effective the error feedback will be. Figure 7.33b shows the equivalent circuit seen by the quantizing error, which is created at the requantizer and subtracted from itself one sample period later. As a result the quantizing error spectrum is not uniform, but has the shape of a raised sine wave shown in Figure 7.33c, hence the term noise shaping. The noise is very small at DC and rises with frequency, peaking at the Nyquist frequency at a level determined by the size of the quantizing step. If used with oversampling, the noise peak can be moved outside the audio band.
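The action of Figure 7.33a can be summarized in a few lines of code. The following Python fragment is an illustrative sketch, not taken from any real device: it requantizes integer samples by discarding low-order bits and feeding the truncation error of each sample into the next.

def noise_shaped_requantize(samples, bits_removed):
    # Sketch of the error-feedback requantizer of Figure 7.33(a).
    step = 1 << bits_removed       # the coarser quantizing interval
    error = 0                      # quantizing error of the previous sample
    out = []
    for x in samples:
        v = x + error              # add the previous error to the next sample
        q = (v // step) * step     # requantize by truncation
        error = v - q              # error made on this sample, fed back next time
        out.append(q)
    return out

The output error of this loop is the difference between successive quantizing errors, which is why its spectrum takes the rising, differentiated shape of Figure 7.33c.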

image

image

FIGURE 7.33

(a) A simple requantizer that feeds back the quantizing error to reduce the error of subsequent samples. The one-sample delay causes the quantizing error to see the equivalent circuit shown in (b), which results in a sinusoidal quantizing error spectrum shown in (c).

image

FIGURE 7.34

By adding the error caused by truncation to the next value, the resolution of the lost bits is maintained in the duty cycle of the output. Here, truncation of 011 by two bits would give continuous 0s, but the system repeats 0111, 0111, which, after filtering, will produce a level of three-quarters of a quantizing interval.

Figure 7.34 shows a simple example in which 2 low-order bits need to be removed from each sample. The accumulated error is controlled by using the bits that were neglected in the truncation, and adding them to the next sample. In this example, with a steady input, the roundoff mechanism will produce an output of 01110111.… If this is low-pass filtered, the three ones and one zero result in a level of three-quarters of a quantizing interval, which is precisely the level that would have been obtained by direct conversion of the full digital input. Thus the resolution is maintained even though two bits have been removed.
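Running the sketch given above on this example reproduces the figures of Figure 7.34:

# Constant input of 3 (binary 011) with two bits removed: on the original
# scale the output repeats 0, 4, 4, 4, i.e., 0111... in the retained bit,
# and its average of 3 preserves the resolution of the truncated bits.
print(noise_shaped_requantize([3] * 8, 2))    # [0, 4, 4, 4, 0, 4, 4, 4]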

Noise shaping can also be used without oversampling. In this case the noise cannot be pushed outside the audio band. Instead the noise floor is shaped or weighted to complement the unequal spectral sensitivity of the ear to noise.15–17 Unless we wish to violate Shannon's theory, this psychoacoustically optimal noise shaping can reduce the noise power at certain frequencies only by increasing it at others. Thus the average log PSD (power spectral density) over the audio band remains the same, although it may be raised slightly by noise induced by imperfect processing.

Figure 7.35 shows noise shaping applied to a digitally dithered requantizer. Such a device might be used when, for example, making a CD master from a 20-bit recording format. The input to the dithered requantizer is subtracted from the output to give the error due to requantizing. This error is filtered (and inevitably delayed) before being subtracted from the system input. The filter is not designed to be the exact inverse of the perceptual weighting curve because this would cause extreme noise levels at the ends of the band. Instead the perceptual curve is levelled off18 such that it cannot fall more than, e.g., 40 dB below the peak.

image

FIGURE 7.35

Perceptual filtering in a requantizer gives a subjectively improved SNR.

image

FIGURE 7.36

The Σ-DPCM convertor of Figure 7.32 is shown here in more detail.

Psycho-acoustically optimal noise shaping can offer nearly three bits of increased dynamic range compared with optimal spectrally flat dither. Enhanced Compact Discs recorded using these techniques are now available. The sigma-DPCM convertor introduced in Figure 7.32 has a natural application here and is shown in more detail in Figure 7.36. The current digital sample from the quantizer is converted back to analog in the embedded DAC. The DAC output differs from the ADC input by the quantizing error. It is subtracted from the analog input to produce an error, which is integrated to drive the quantizer in such a way that the error is reduced. With a constant input voltage the average error will be zero because the loop gain is infinite at DC. If the average error is zero, the mean or average of the DAC outputs must be equal to the analog input. The instantaneous output will deviate from the average in what is called an idling pattern. The presence of the integrator in the error feedback loop makes the loop gain fall with rising frequency. With the feedback falling at six dB per octave, the noise floor will rise at the same rate.
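The behaviour of such a loop is readily simulated. The following Python sketch is illustrative only; the quantizer word length and oversampling factor are arbitrary choices, and the embedded DAC is assumed ideal so that the fed-back value is exactly the quantized output.

import numpy as np

def sigma_dpcm(x, bits=4):
    # First-order sigma-DPCM loop of Figure 7.36 with an ideal embedded DAC.
    step = 2.0 / (1 << bits)           # coarse quantizing interval over -1..+1
    integ = 0.0                        # the integrator state
    fed_back = 0.0                     # previous DAC output
    y = np.zeros(len(x))
    for n, v in enumerate(x):
        integ += v - fed_back          # integrate input minus DAC output
        q = min(max(step * round(integ / step), -1.0), 1.0)
        y[n] = q
        fed_back = q                   # ideal DAC: feedback equals the code
    return y

fs = 48000 * 64                        # 64x oversampling (arbitrary)
t = np.arange(8192) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)
y = sigma_dpcm(x)                      # spectrum of (y - x) rises ~6 dB/octave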

Figure 7.37 shows a simple oversampling system using a sigma-DPCM convertor and an oversampling factor of only four. The sampling spectrum shows that the noise is concentrated at frequencies outside the audio part of the oversampling baseband. Because the scale used here means that noise power is represented by the area under the graph, the area left under the graph after the filter shows the noise-power reduction. Using the relative areas of similar triangles shows that the reduction has been by a factor of 16. The corresponding noise-voltage reduction would be a factor of four or 12 dB, which corresponds to an additional two bits in word length. These bits will be available in the word length extension that takes place in the decimating filter. Due to the rise of 6 dB per octave in the PSD of the noise, the SNR (signal-to-noise ratio) will be 3 dB worse at the edge of the audio band.

image

FIGURE 7.37

In a Σ-DPCM or Σ-Δ convertor, the noise amplitude rises by a factor of 2 (6 dB) per octave, so the noise power rises by a factor of 4 per octave. In this 4× oversampling convertor, the digital filter reduces the bandwidth by a factor of 4, but the noise power is reduced by a factor of 16; the noise voltage falls by a factor of 4, or 12 dB.

One way in which the operation of the system can be understood is to consider that the coarse DAC in the loop defines fixed points in the audio transfer function. The time averaging that takes place in the decimator then allows the transfer function to be interpolated between the fixed points. True signal-independent noise of sufficient amplitude will allow this to be done to infinite resolution. By making the noise primarily outside the audio band the resolution is maintained, but the audio band signal-to-noise ratio can be extended. A first-order noise-shaping ADC of the kind shown can produce signal-dependent quantizing error and requires analog dither. However, this can be outside the audio band and so need not reduce the SNR achieved.

A greater improvement in dynamic range can be obtained if the integrator is replaced by a filter of higher order.19 The filter is in the feedback loop, so the noise will have the opposite response to the filter and will therefore rise more steeply to allow a greater SNR enhancement after decimation. Figure 7.38 shows the theoretical SNR enhancement possible for various loop filter orders and oversampling factors. A further advantage of high-order loop filters is that the quantizing noise can be decorrelated from the signal, making dither unnecessary. High-order loop filters were at one time thought to be impossible to stabilize, but this is no longer the case, although care is necessary. One technique that may be used is to include some feedforward paths as shown in Figure 7.39.
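Figures of the kind plotted in Figure 7.38 follow from the usual linearized model, in which the quantizing error is treated as white noise (an idealization, as the one-bit discussion later in this section makes clear). Under that model the in-band noise power of an order-L loop falls as the oversampling ratio R raised to the power −(2L + 1), and the resulting SNR enhancement can be computed by this sketch:

import math

def snr_gain_db(order, osr):
    # SNR enhancement over Nyquist-rate conversion for an order-L
    # noise-shaping loop at oversampling ratio R (linearized model):
    # gain = 10 log10( (2L+1) / pi^(2L) * R^(2L+1) )
    L = order
    return 10 * math.log10((2 * L + 1) / math.pi ** (2 * L) * osr ** (2 * L + 1))

for L in (1, 2, 3, 4):
    print(L, [round(snr_gain_db(L, R), 1) for R in (16, 64, 256)])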

image

FIGURE 7.38

The enhancement of SNR possible with various filter orders and oversampling factors in noise-shaping convertors.

An ADC with high-order noise shaping was disclosed by Adams,20 and a simplified diagram is shown in Figure 7.40. The comparator outputs of the 128-times-oversampled four-bit flash ADC are fed directly to the DAC, which consists of 15 equal resistors fed by CMOS switches. As with all feedback loops, the transfer characteristic cannot be more accurate than the feedback, and in this case the feedback accuracy is determined by the precision of the DAC.21 Driving the DAC directly from the ADC comparators is more accurate because each input has equal weighting. The stringent MSB tolerance of the conventional binary-weighted DAC is thus avoided. The comparators also drive a 16-to-4 priority encoder to provide the four-bit PCM output to the decimator. The DAC output is subtracted from the analog input at the integrator. The integrator is followed by a pair of conventional analog operational amplifiers having frequency-dependent feedback and a passive network, which gives the loop a fourth-order response overall. The noise floor is thus shaped to rise at 24 dB per octave beyond the audio band. The time constants of the loop filter are optimized to minimize the amplitude of the idling pattern, as this is an indicator of loop stability. The four-bit PCM output is low-pass filtered and decimated to the Nyquist frequency. The high oversampling factor and high-order noise shaping extend the dynamic range of the four-bit flash ADC to 108 dB at the output.

image

FIGURE 7.39

Stabilizing the loop filter in a noise-shaping convertor can be assisted by the incorporation of feedforward paths as shown here.

image

FIGURE 7.40

An example of a high-order noise-shaping ADC. See text for details.

ONE-BIT CONVERTORS

It might be thought that the waveform from a one-bit DAC is simply the same as the digital input waveform. In practice this is not the case. The input signal is a logic signal that needs only to be above or below a threshold for its binary value to be correctly received. It may have a variety of waveform distortions and a duty cycle offset. The area under the pulses can vary enormously. In the DAC output the amplitude needs to be extremely accurate. A one-bit DAC uses only the binary information from the input, but reclocks to produce accurate timing and uses a reference voltage to produce accurate levels. The area of pulses produced is then constant. One-bit DACs will be found in noise-shaping ADCs as well as in the more obvious application of producing analog audio.

Figure 7.41a shows a one-bit DAC that is implemented with MOS field-effect switches and a pair of capacitors. Quanta of charge are driven into or out of a virtual earth amplifier configured as an integrator by the switched capacitor action. Figure 7.41b shows the associated waveforms. Each data bit period is divided into two equal portions: that for which the clock is high and that for which it is low. During the first half of the bit period, pulse P+ is generated if the data bit is a one, or pulse P− is generated if the data bit is a zero. The reference input is a clean voltage corresponding to the gain required.

C1 is discharged during the second half of every cycle by the switches driven from the complemented clock. If the next bit is a one, during the next high period of the clock the capacitor will be connected between the reference and the virtual earth. Current will flow into the virtual earth until the capacitor is charged. If the next bit is not a one, the current through C1 will flow to ground.

image

FIGURE 7.41

(a) The operation of a one-bit DAC relies on switched capacitors. (b) The switching waveforms are shown.

C2 is charged to reference voltage during the second half of every cycle by the switches driven from the complemented clock. On the next high period of the clock, the reference end of C2 will be grounded, and so the op-amp end will assume a negative reference voltage. If the next bit is a zero, this negative reference will be switched into the virtual earth. If not, the capacitor will be discharged.

Thus on every cycle of the clock, a quantum of charge is either pumped into the integrator by C1 or pumped out by C2. The analog output therefore precisely reflects the ratio of ones to zeros.

To overcome the DAC accuracy constraint of the sigma-DPCM convertor, the sigma-delta convertor can be used as it has only one-bit internal resolution. A one-bit DAC cannot be nonlinear by definition, as it defines only two points on a transfer function. It can, however, suffer from other deficiencies such as DC offset and gain error, although these are less offensive in audio. The one-bit ADC is a comparator.

As the sigma-delta convertor is only a one-bit device, clearly it must use a high oversampling factor and high-order noise shaping to have sufficiently good SNR for audio.22 In practice the oversampling factor is limited not so much by the convertor technology as by the difficulty of computation in the decimator. A sigma-delta convertor has the advantage that the filter input “words” are one bit long and this simplifies the filter design as multiplications can be replaced by selection of constants.
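The simplification can be seen in a sketch of a decimating FIR filter for a ±1 bitstream (illustrative only; practical decimators are usually multistage, and the function and parameter names here are invented for the example). Each tap “multiplication” reduces to selecting the coefficient with a positive or negative sign before summation.

import numpy as np

def fir_decimate_one_bit(bits, coeffs, factor):
    # bits: sequence of 0/1 from the modulator; coeffs: low-pass FIR taps
    # (assumed symmetric, so the time reversal of convolution is ignored);
    # factor: decimation ratio.
    signed = np.where(np.asarray(bits) == 1, 1.0, -1.0)
    taps = np.asarray(coeffs)
    out = []
    for n in range(len(taps), len(signed) + 1, factor):
        window = signed[n - len(taps):n]
        # Multiplication by +/-1 is mere selection of +tap or -tap.
        out.append(np.sum(np.where(window > 0, taps, -taps)))
    return np.array(out)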

Conventional analysis of loops falls down heavily in the one-bit case. In particular the gain of a comparator is difficult to quantify, and the loop is highly nonlinear so that considering the quantizing error as additive white noise to use a linear loop model gives rather optimistic results. In the absence of an accurate mathematical model, progress has been made empirically, with listening tests and by using simulation.

Single-bit sigma-delta convertors are prone to long idling patterns because the low resolution in the voltage domain requires more bits in the time domain to be integrated to cancel the error. Clearly the longer the period of an idling pattern, the more likely it is to enter the audio band as an objectionable whistle or “birdie.” They also exhibit threshold effects or deadbands, in which the output fails to react to an input change at certain levels. The problem is reduced by increasing the order of the filter and the word length of the embedded DAC. Second- and third-order feedback loops are still prone to audible idling patterns and threshold effects.23 The traditional approach to linearizing sigma-delta convertors is to use dither. Unlike the dither applied to conventional quantizers, the dither used here was of a frequency outside the audio band and of considerable level. Square-wave dither has been used, and it is advantageous to choose a frequency that is a multiple of the final output sampling rate, as the harmonics will then coincide with the troughs in the stopband ripple of the decimator. Unfortunately the level of dither needed to linearize the convertor is high enough to cause premature clipping of high-level signals, reducing the dynamic range. This problem is overcome by using in-band white-noise dither at low level.24

An advantage of the one-bit approach is that in the one-bit DAC, precision components are replaced by precise timing in switched capacitor networks. The same approach can be used to implement the loop filter in an ADC. Figure 7.42 shows a third-order sigma-delta modulator incorporating a DAC based on the principle of Figure 7.41. The loop filter is also implemented with switched capacitors.

image

FIGURE 7.42

A third-order Σ-Δ modulator using a switched capacitor loop filter.

OPERATING LEVELS IN DIGITAL AUDIO

Analog tape recorders use operating levels that are some way below saturation. The range between the operating level and saturation is called the headroom. In this range, distortion becomes progressively worse and sustained recording in the headroom is avoided. However, transients may be recorded in the headroom as the ear cannot respond to distortion products unless they are sustained. The PPM level meter has an attack time constant, which simulates the temporal distortion sensitivity of the ear. If a transient is too brief to deflect a PPM into the headroom, it will not be heard, either.

Operating levels are used in two ways. On making a recording from a microphone, the gain is increased until distortion is just avoided, thereby obtaining a recording having the best SNR. In post-production the gain will be set to whatever level is required to obtain the desired subjective effect in the context of the program material. This is particularly important to broadcasters who require the relative loudness of different material to be controlled so that the listener does not need to make continuous adjustments to the volume control.

To maintain level accuracy, analog recordings are traditionally preceded by line-up tones at standard operating level. These are used to adjust the gain in various stages of dubbing and transfer along landlines so that no level changes occur to the program material.

Unlike analog recorders, digital recorders do not have headroom, as there is no progressive onset of distortion until convertor clipping, the equivalent of saturation, occurs at zero dBFs. Accordingly many digital recorders have level meters that read in dBFs. The scales are marked with zero at the clipping level and all operating levels are below that. This causes no difficulty provided the user is aware of the consequences.

In the situation in which a digital copy of an analog tape is to be made, however, it is very easy to set the input gain of the digital recorder so that line-up tone from the analog tape reads zero dB. This lines up digital clipping with the analog operating level. When the tape is dubbed, all signals in the headroom suffer convertor clipping.

To prevent such problems, manufacturers and broadcasters have introduced artificial headroom on digital level meters, simply by calibrating the scale and changing the analog input sensitivity so that zero dB analog is some way below clipping. Unfortunately there has been little agreement on how much artificial headroom should be provided, and machines that have it are seldom labelled with the amount. There is an argument that suggests that the amount of headroom should be a function of the sample word length, but this causes difficulties when transferring from one word length to another. The EBU25 concluded that a single relationship between analog and digital level was desirable. In 16-bit working, 12 dB of headroom is a useful figure, but now that 18- and 20-bit convertors are available, the later EBU recommendation specifies 18 dB. Some modern equipment allows the user to specify the amount of artificial headroom.

DIGITAL AUDIO AND VIDEO SYNCHRONISATION

Handling digital audio alongside video is rather more difficult than working in an audio-only environment because the characteristic frame rate of the video must be considered. This first became important with the development of digital video recorders, and the development of hard-disk-based workstations and digital transmission using MPEG transport streams has maintained the importance of synchronisation.

In digital VTRs and workstations, the audio data and the video data are both referenced to timecode for access and editing purposes. Broadcast timecode is always locked to the video standard. It follows that to avoid slippage between video and audio there must be a fixed number of audio samples in a timecode frame. This can be achieved only if the audio sampling rate is derived from the video timing system.
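A short calculation illustrates the point. At 48 kHz there are exactly 1920 samples in each 25 Hz frame, whereas the 29.97 Hz (30000/1001) frame rate yields a non-integer count per frame but exactly 8008 samples per five frames:

from fractions import Fraction

def samples_per_frame(audio_rate, frame_rate):
    # Exact ratio of audio samples to video frames.
    return Fraction(audio_rate) / Fraction(frame_rate)

print(samples_per_frame(48000, 25))                     # 1920 per frame
print(samples_per_frame(48000, Fraction(30000, 1001)))  # 8008/5: 8008 per 5 frames
print(samples_per_frame(44100, 25))                     # 1764 per frame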

In MPEG transport streams, programs, i.e., video and associated sound channels, are carried in a packet-based multiplex. However, for each program there is only one timing system that is video locked. Consequently if the audio sampling rate is not synchronous with video the result will be buffer overflow or underflow in MPEG decoders.

In a practical system the master timing generator for a facility not only will generate video reference signals but also will derive synchronous audio clocks. Alternatively video syncs alone may be distributed and each device must obtain its own synchronous audio sampling rate from the video reference.

MPEG AUDIO COMPRESSION

The ISO (International Standards Organization) and the IEC (International Electrotechnical Commission) recognized that compression would have an important part to play in future digital video products and in 1988 established the ISO/IEC/MPEG (Moving Picture Experts Group) to compare and assess various coding schemes to arrive at an international standard. The terms of reference were extended the same year to include audio, and the MPEG/Audio group was formed.

As part of the Eureka 147 project, a system known as MUSICAM26 (masking pattern adapted universal subband integrated coding and multiplexing) was developed jointly by CCETT in France, IRT in Germany, and Philips in The Netherlands. MUSICAM was designed to be suitable for DAB (digital audio broadcasting).

As a parallel development, the ASPEC27 (adaptive spectral perceptual entropy coding) system was developed from a number of earlier systems as a joint proposal by AT&T Bell Labs, Thomson, the Fraunhofer Society, and CNET. ASPEC was designed for high degrees of compression to allow audio transmission on ISDN.

These two systems were both fully implemented by July 1990, when comprehensive subjective testing took place at the Swedish Broadcasting Corporation.28,29 As a result of these tests, the MPEG/Audio group combined the attributes of both ASPEC and MUSICAM into a draft standard30 having three levels of complexity and performance.

These three different levels are needed because of the number of possible applications. Audio coders can be operated at various compression factors with different quality expectations. Stereophonic classical music requires different quality criteria compared to monophonic speech. As was seen in Chapter 6, the complexity of the coder will be reduced with a smaller compression factor. For moderate compression, a simple codec will be more cost-effective. On the other hand, as the compression factor is increased, it will be necessary to employ a more complex coder to maintain quality.

At each level, MPEG coding allows input sampling rates of 32, 44.1, and 48 kHz and supports output bit rates of 32, 48, 56, 64, 96, 112, 128, 192, 256, and 384 kbps. The transmission can be mono, dual channel (e.g., bilingual), stereo, and joint stereo, which is where advantage is taken of redundancy between the two audio channels.

MPEG Layer I is a simplified version of MUSICAM that is appropriate for mild compression applications at low cost. It is very similar to PASC. Layer II is identical to MUSICAM and is very likely to be used for DAB. Layer III is a combination of the best features of ASPEC and MUSICAM and is mainly applicable to telecommunications, in which high compression factors are required.

The earlier MPEG-1 standard compresses audio and video into about 1.5 Mbps. The audio content of MPEG-1 may be used on its own to encode one or two channels at bit rates up to 448 kbps. MPEG-2 allows the number of channels to increase to five, left, right, centre, left surround, and right surround, plus a low-frequency subwoofer channel. To retain reverse compatibility with MPEG-1, the MPEG-2 coding converts the five-channel input to a compatible two-channel signal, Lo and Ro, by matrixing.31 The data from these two channels are encoded in a standard MPEG-1 audio frame, and this is followed in MPEG-2 by an ancillary data frame, which an MPEG-1 decoder will ignore. The ancillary frame contains data for three further audio channels. An MPEG-2 decoder will extract those three channels in addition to the MPEG-1 frame and then recover all five original channels by an inverse matrix.

In various countries, it has been proposed to use an alternative compression technique for the audio content of DVB. This is the AC-3 system developed by Dolby Laboratories. The MPEG transport stream structure has also been standardised to allow it to carry AC-3-coded audio. The digital video disc can also carry AC-3 or MPEG audio coding.

There are many different approaches to audio compression, each having advantages and disadvantages. MPEG audio coding combines these tools in various ways in the three different coding levels. The approach of this section will be to examine the tools separately before seeing how they are used in MPEG and AC-3.

The simplest coding tool is companding, which is a digital parallel of the analog noise reducers used in tape recording. Figure 7.43a shows that in companding the input signal level is monitored. Whenever the input level falls below maximum, it is amplified at the coder. The gain, which was applied at the coder, is added to the data stream so that the decoder can apply an equal attenuation. The advantage of companding is that the signal is kept as far away from the noise floor as possible. In analog noise reduction this is used to maximize the SNR of a tape recorder, whereas in digital compression it is used to keep the signal level as far as possible above the noises and artifacts introduced by various coding steps.

One common way of obtaining coding gain is to shorten the word length of samples so that fewer bits need to be transmitted. Figure 7.43b shows that when this is done, the noise floor will rise by 6 dB for every bit removed. This is because removing a bit halves the number of quantizing intervals, which then must be twice as large, doubling the noise level. Clearly if this step follows the compander in Figure 7.43a, the audibility of the noise will be minimized. As an alternative to shortening the word length, the uniform quantized PCM signal can be converted to a nonuniform format. In nonuniform coding, shown in Figure 7.43c, the size of the quantizing step rises with the magnitude of the sample so that the noise level is greater when higher levels exist.

Companding is a relative of floating-point coding shown in Figure 7.44, in which the sample value is expressed as a mantissa and a binary exponent, which determines how the mantissa needs to be shifted to have its correct absolute value on a PCM scale. The exponent is the equivalent of the gain setting or scale factor of a compander.

image

FIGURE 7.43

Digital companding. (a) The encoder amplifies the input to maximum level and the decoder attenuates by the same amount. (b) In a companded system, the signal is kept as far as possible above the noise caused by shortening the sample word length. (c) In nonuniform coding, the size of the quantizing step rises with the magnitude of the sample.

image

FIGURE 7.44

In this example of floating-point notation, the radix point can have eight positions determined by the exponent E. The point is placed to the left of the first “1”, and the next four bits to the right form the mantissa M. As the MSB of the mantissa is always 1, it need not always be stored.

Clearly in floating point the signal-to-noise ratio is defined by the number of bits in the mantissa, and as shown in Figure 7.45, this will vary as a sawtooth function of signal level, as the best value, obtained when the mantissa is near overflow, is replaced by the worst value when the mantissa overflows and the exponent is incremented. Floating-point notation is used within DSP chips as it eases the computational problems involved in handling long word lengths. For example, when multiplying floating-point numbers, only the mantissae need to be multiplied. The exponents are simply added.

image

FIGURE 7.45

In this example of an 8-bit mantissa, 3-bit exponent system, the maximum SNR is 6 × 8 = 48 dB, with maximum input of 0 dB. As input level falls by 6 dB, the convertor noise remains the same, so the SNR falls to 42 dB. Further reduction in signal level causes the convertor to shift range (point A in the diagram) by increasing the input analog gain by 6 dB. The SNR is restored, and the exponent changes from 7 to 6 to cause the same gain change at the receiver. The noise modulation would be audible in this simple system. A longer mantissa word is needed in practice.

A floating-point system requires one exponent to be carried with each mantissa and this is wasteful because in real audio material the level does not change so rapidly and there is redundancy in the exponents. A better alternative is floating-point block coding, also known as near-instantaneous companding, in which the magnitude of the largest sample in a block is used to determine the value of an exponent that is valid for the whole block. Sending one exponent per block requires a lower data rate than in true floating point.32
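The principle is easy to express in code. The sketch below is illustrative only; the block size and word lengths are arbitrary choices, not those of any real system. One exponent per block is derived from the largest sample and every mantissa in the block is shifted accordingly.

import numpy as np

def block_compand(samples, block=32, mantissa_bits=10):
    # Near-instantaneous companding: one exponent per block, chosen so
    # that the peak sample just fits the mantissa word length.
    coded = []
    for i in range(0, len(samples), block):
        blk = np.asarray(samples[i:i + block], dtype=np.int64)
        peak = int(np.max(np.abs(blk))) or 1
        exp = max(0, peak.bit_length() - (mantissa_bits - 1))  # sign bit reserved
        coded.append((exp, blk >> exp))   # requantizing raises the noise 6 dB/bit
    return coded

def block_expand(coded):
    # Decoder: shift the mantissas back according to the block exponent.
    return np.concatenate([m << e for e, m in coded])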

In block coding the requantizing in the coder raises the quantizing noise, but it does so over the entire duration of the block. Figure 7.46 shows that if a transient occurs toward the end of a block, the decoder will reproduce the waveform correctly, but the quantizing noise will start at the beginning of the block and may result in a pre-noise (also called pre-echo), where the noise is audible before the transient. Temporal masking may be used to make this inaudible. With a one ms block, the artifacts are too brief to be heard.

image

FIGURE 7.46

If a transient occurs toward the end of a transform block, the quantizing noise will still be present at the beginning of the block and may result in a pre-echo, in which the noise is audible before the transient.

Another solution is to use a variable time window according to the transient content of the audio waveform. When musical transients occur, short blocks are necessary and the coding gain will be low. At other times the blocks become longer, allowing a greater coding gain.

Whilst the above systems used alone do allow coding gain, the compression factor has to be limited because little benefit is obtained from masking. This is because the techniques above produce noise that spreads equally over the entire audio band. If the audio input spectrum is narrow, the noise will not be masked.

Subband coding splits the audio spectrum up into many different frequency bands. Once this has been done, each band can individually be processed. In real audio signals most bands will contain signals at a lower level than the loudest one. Individual companding of each band will be more effective than broadband companding. Subband coding also allows the noise floor to be raised selectively so that noise is added only at frequencies at which spectral masking will be effective.

There is little conceptual difference between a subband coder with a large number of bands and a transform coder. In transform coding, an FFT or DCT of the waveform is computed periodically. Because the transform of an audio signal changes slowly, it needs to be sent much less often than audio samples. The receiver performs an inverse transform. Finally the data may be subject to lossless binary compression using, for example, a Huffman code.

Audio is usually considered to be a time domain waveform as this is what emerges from a microphone. As has been seen in Chapter 3, spectral analysis allows any periodic waveform to be represented by a set of harmonically related components of suitable amplitude and phase. In theory it is perfectly possible to decompose a periodic input waveform into its constituent frequencies and phases and to record or transmit the transform. The transform can then be inverted and the original waveform will be precisely re-created.

Although one can think of exceptions, the transform of a typical audio waveform changes relatively slowly. The slow speech of an organ pipe or a violin string, or the slow decay of most musical sounds, allows the rate at which the transform is sampled to be reduced, and a coding gain results. At some frequencies the level will be below maximum and a shorter word length can be used to describe the coefficient. Further coding gain will be achieved if the coefficients describing frequencies that will experience masking are quantized more coarsely. The transform of an audio signal is computed in the main signal path in a transform coder and has sufficient frequency resolution to drive the masking model directly.

In practice there are some difficulties. Real sounds are not periodic, but contain transients that transformation cannot accurately locate in time. The solution to this difficulty is to cut the waveform into short segments and then to transform each individually. The delay is reduced, as is the computational task, but there is a possibility of artifacts arising because of the truncation of the waveform into rectangular time windows. A solution is to use window functions (see Chapter 3) and to overlap the segments as shown in Figure 7.47. Thus every input sample appears in just two transforms, but with variable weighting depending upon its position along the time axis.
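The following sketch shows such an arrangement with a 50 percent overlapped sine window, whose overlapped squares sum to unity so that the segments splice back together transparently. The block length is an arbitrary choice for the example.

import numpy as np

N = 256                                         # block length (illustrative)
win = np.sin(np.pi * (np.arange(N) + 0.5) / N)  # sine window: win^2 overlaps to 1
x = np.random.randn(4 * N)

starts = range(0, len(x) - N + 1, N // 2)       # 50 percent overlap
blocks = [win * x[s:s + N] for s in starts]     # each sample is in two blocks
# ...each block would be transformed, coded and inverse transformed here...
y = np.zeros_like(x)
for s, b in zip(starts, blocks):
    y[s:s + N] += win * b                       # overlap-add with the same window
# Away from the two ends, y equals x because win^2 + shifted win^2 = 1.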

The DFT (discrete Fourier transform) does not produce a continuous spectrum, but instead produces coefficients at discrete frequencies. The frequency resolution (i.e., the number of different frequency coefficients) is equal to the number of samples in the window. If overlapped windows are used, twice as many coefficients are produced as are theoretically necessary. In addition, the DFT requires intensive computation, due to the requirement to use complex arithmetic to render the phase of the components as well as the amplitude. An alternative is to use discrete cosine transforms.

image

FIGURE 7.47

Transform coding can be practically performed only on short blocks. These are overlapped using window functions to handle continuous waveforms.

Figure 7.48 shows a block diagram of a Layer I coder, which is a simplified version of that used in the MUSICAM system. A polyphase quadrature mirror filter network divides the audio spectrum into 32 equal sub-bands. The output data rate of the filter bank is no higher than the input rate because each band has been heterodyned to a frequency range from DC upward.

Subband compression takes advantage of the fact that real sounds do not have uniform spectral energy. The word length of PCM audio is based on the dynamic range required and this is generally constant with frequency although any pre-emphasis will affect the situation. When a signal with an uneven spectrum is conveyed by PCM, the whole dynamic range is occupied only by the loudest spectral component, and all the other components are coded with excessive headroom. In its simplest form, sub-band coding33 works by splitting the audio signal into a number of frequency bands and companding each band according to its own level. Bands in which there is little energy result in small amplitudes, which can be transmitted with short word length. Thus each band results in variable-length samples, but the sum of all the sample word lengths is less than that of PCM and so a coding gain can be obtained.

image

FIGURE 7.48

A simple sub-band coder. The bit allocation may come from analysis of the sub-band energy, or, for greater reduction, from a spectral analysis in a side chain.

As MPEG audio coding relies on auditory masking, the subbands should preferably be narrower than the critical bands of the ear, hence the large number required. Figure 7.49 shows the critical condition in which the masking tone is at the top edge of the subband. It will be seen that the narrower the subband, the higher the requantizing noise that can be masked. The use of an excessive number of sub-bands will, however, raise complexity and the coding delay.

Constant-size input blocks containing 384 samples are used. At 48 kHz, 384 samples corresponds to a period of eight ms. After the subband filter each band contains 12 samples per block. The block size was based on the premasking phenomenon of Figure 7.46. The samples in each subband block, or bin, are companded according to the peak value in the bin. A six-bit scale factor is used for each subband, which applies to all 12 samples.

image

FIGURE 7.49

In sub-band coding the worst case occurs when the masking tone is at the top edge of the sub-band. The narrower the band, the higher the noise level which can be masked.

If a fixed compression factor is employed, the size of the coded output block will be fixed. The word lengths in each bin will have to be such that the sum of the bits from all of the subbands equals the size of the coded block. Thus some subbands can have long word-length coding if others have short word-length coding. The process of determining the requantization step size, and hence the word length in each subband, is known as bit allocation.

For simplicity, in Layer I the levels in the 32 subbands themselves are used as a crude spectral analysis of the input to drive the masking model. The masking model uses the input spectrum to determine a new threshold of hearing, which in turn determines how much the noise floor can be raised in each subband. When masking takes place, the signal is quantized more coarsely until the quantizing noise is raised to just below the masking level. The coarse quantization requires shorter word lengths and allows a coding gain. The bit allocation may be iterative as adjustments are made to obtain the best NMR within the allowable data rate.
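An illustrative (not normative) form of the iteration is a greedy loop that repeatedly spends one bit of word length in whichever subband currently has the worst noise-to-mask ratio. The function and parameter names here are invented for the sketch.

import numpy as np

def allocate_bits(signal_db, mask_db, budget, max_wl=15):
    # signal_db, mask_db: per-sub-band signal level and masking threshold.
    # budget: total word-length bits to distribute across the sub-bands.
    alloc = np.zeros(len(signal_db), dtype=int)
    for _ in range(budget):
        noise_db = np.asarray(signal_db) - 6.02 * alloc  # ~6 dB per bit
        nmr = noise_db - np.asarray(mask_db)             # noise-to-mask ratio
        nmr[alloc >= max_wl] = -np.inf                   # band already at maximum
        alloc[int(np.argmax(nmr))] += 1                  # quieten the worst band
    return alloc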

The samples of differing word length in each bin are then assembled into the output coded block. Unlike a PCM block, which contains samples of fixed word length, a coded block contains many different word lengths, and these can vary from one block to the next. To de-serialize the block into samples of various word lengths and demultiplex the samples into the appropriate frequency bins, the decoder has to be told what bit allocations were used when it was packed, and some synchronising means is needed to allow the beginning of the block to be identified.

The compression factor is determined by the bit-allocation system. It is not difficult to change the output block size parameter to obtain a different compression factor. If a larger block is specified, the bit allocator simply iterates until the new block size is filled. Similarly the decoder need correctly de-serialize only the larger block into coded samples and then the expansion process is identical except for the fact that expanded words contain less noise. Thus codecs with varying degrees of compression are available, which can perform different bandwidth/performance tasks with the same hardware.

Figure 7.50 shows the format of the Layer I data stream. The frame begins with a sync pattern, to reset the phase of de-serialization, and a header, which describes the sampling rate and any use of preemphasis. Following this is a block of 32 four-bit allocation codes. These specify the word length used in each subband and allow the decoder to de-serialize the subband sample block. This is followed by a block of 32 six-bit scale factor indices, which specify the gain given to each band during companding. The last block contains 32 sets of 12 samples. These samples vary in word length from one block to the next and can be from zero to 15 bits long. The de-serializer has to use the 32 allocation information codes to work out how to de-serialize the sample block into individual samples of variable length.
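From this description the frame budget can be checked with simple arithmetic. The sync/header size below is an assumption for the sketch; real headers carry further fields.

HEADER_BITS = 32                      # sync pattern plus header (assumed size)
ALLOC_BITS = 32 * 4                   # one 4-bit allocation code per sub-band
SCF_BITS = 32 * 6                     # one 6-bit scale factor per sub-band

def layer1_frame_bits(word_lengths):  # 32 word lengths of 0..15 bits
    return HEADER_BITS + ALLOC_BITS + SCF_BITS + 12 * sum(word_lengths)

# At 48 kHz a 384-sample frame lasts 8 ms, so 192 kbps offers
# 192000 * 0.008 = 1536 bits; the allocator iterates until they are spent.
print(layer1_frame_bits([2] * 32))    # 1120 bits, within the 1536 available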

image

FIGURE 7.50

The Layer I data frame showing the allocation codes, the scale factors and the sub-band samples.

The Layer I MPEG decoder is shown in Figure 7.51. The elementary stream is deserialized using the sync pattern and the variable-length samples are assembled using the allocation codes. The variable-length samples are returned to 15-bit word length by adding zeros. The scale factor indices are then used to determine multiplication factors used to return the waveform in each sub-band to its original level. The 32 sub-band signals are then merged into one spectrum by the synthesis filter. This is a set of bandpass filters, which returns every sub-band to the correct place in the audio spectrum and then adds them to produce the audio output.

MPEG Layer II audio coding is identical to MUSICAM. The same 32-band filter bank and the same block companding scheme as in Layer I is used. Figure 7.52 shows that using the level in a sub-band to drive the masking model is suboptimal because it is not known where in the sub-band the energy lies. As the skirts of the masking curve are asymmetrical, the noise floor can be raised higher if the masker is at the low end of the sub-band than if it is at the high end.

image

FIGURE 7.51

The Layer I decoder. See text for details.

image

FIGURE 7.52

Accurate knowledge of the spectrum allows the noise floor to be raised higher while remaining masked.

To give better spectral resolution than the filter bank, a side-chain FFT having 1024 points is computed, resulting in an analysis of the audio spectrum eight times better than the subband width. The FFT drives the masking model, which controls the bit allocation. To give the FFT sufficient resolution, the block length is increased to 1152 samples. This is three times the block length of Layer I.

The QMF (quadrature mirror filter) band-splitting technique is restricted to bands of equal width. It might be thought that this is a drawback because the critical bands of the ear are nonuniform. In fact this is only a problem when very low bit rates are required. In all cases it is the masking model of hearing that must have correct critical bands. This model can then be superimposed on bands of any width to determine how much masking and therefore coding gain is possible. Uniform-width sub-bands will not be able to obtain as much masking as bands that are matched to critical bands, but for many applications the additional coding gain is not worth the added filter complexity.

The block-companding scheme of Layer II is the same as in Layer I because the 1152-sample block is divided into three 384-sample blocks. However, not all the scale factors are transmitted, because they contain a degree of redundancy on real program material. The difference between the scale factors of successive blocks in the same band exceeds 2 dB less than 10 percent of the time. Layer II analyses the set of three successive scale factors in each subband. On stationary program material these will be the same and only one scale factor of the three is sent. As the transient content increases in a given subband, two or three scale factors will be sent. A scale factor select code must be sent to allow the decoder to determine what has been sent in each subband. This technique effectively halves the scale factor bit rate. The requantized samples in each subband, bit-allocation data, scale factors, and scale factor select codes are multiplexed into the output bitstream.

The Layer II decoder is not much more complex than the Layer I decoder, as the only additional processing is to decode the compressed scale factors to produce one scale factor per 384-sample block. Layer III is the most complex layer of the ISO standard and is really necessary only when the most severe data rate constraints must be met with high quality. It is a transform coder based on the ASPEC system, with certain modifications to give a degree of commonality with Layer II. The original ASPEC coder used a direct MDCT (modified discrete cosine transform) on the input samples. In Layer III this was modified to use a hybrid transform incorporating the existing polyphase 32-band QMF of Layers I and II. In Layer III, the 32 subbands from the QMF are each processed by a 12-band MDCT to obtain 384 output coefficients. Two window sizes are used to avoid pre-echo on transients. The window switching is performed by the psycho-acoustic model. It has been found that pre-echo is associated with the entropy in the audio rising above the average value.

A highly accurate perceptive model is used to take advantage of the high-frequency resolution available. Nonuniform quantizing is used, along with Huffman coding. This is a technique in which the most common code values are allocated the shortest word length.

DOLBY AC-3

Dolby AC-3 is in fact a family of transform coders based on time domain aliasing cancellation (TDAC), which allow various compromises between coding delay and bit rate to be used. In the MDCT,34 windows with 50 percent overlap are used. Thus twice as many coefficients as necessary are produced. These are subsampled by a factor of two to give a critically sampled transform, which results in potential aliasing in the frequency domain. However, by making a slight change to the transform, the alias products in the second half of a given window are equal in size but of opposite polarity to the alias products in the first half of the next window and so will be cancelled on reconstruction. This is the principle of TDAC.
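The essence of TDAC can be demonstrated in a few lines. The sketch below uses a generic MDCT, not the exact AC-3 transform set: each 2N-sample window yields only N coefficients, and the aliasing this causes cancels on overlap-add with a window meeting the Princen-Bradley condition.

import numpy as np

def mdct(x):                           # 2N windowed samples -> N coefficients
    N = len(x) // 2
    n, k = np.arange(2 * N), np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ x

def imdct(X):                          # N coefficients -> 2N aliased samples
    N = len(X)
    n, k = np.arange(2 * N)[:, None], np.arange(N)
    return (2 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ X

N = 256
win = np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))  # Princen-Bradley window
x = np.random.randn(6 * N)
y = np.zeros_like(x)
for s in range(0, len(x) - 2 * N + 1, N):                 # 50 percent overlap
    y[s:s + 2 * N] += win * imdct(mdct(win * x[s:s + 2 * N]))
# Away from the two ends y equals x: the alias in each half of one window
# is cancelled by the equal and opposite alias of its neighbour.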

Figure 7.53 shows the generic block diagram of the AC-3 coder. Input audio is divided into 50 percent overlapped blocks of 512 samples. These are subject to a TDAC transform, which uses alternate modified sine and cosine transforms. The transforms produce 512 coefficients per block, but these are redundant, and after the redundancy has been removed there are 256 coefficients per block. The input waveform is constantly analysed for the presence of transients, and if these are present the block length will be halved to prevent pre-noise. This halves the frequency resolution but doubles the temporal resolution.

The coefficients have high frequency resolution and are selectively combined in subbands that approximate the critical bands. Coefficients in each subband are normalized and expressed in floating-point block notation with common exponents. The exponents in fact represent the logarithmic spectral envelope of the signal and can be used to drive the perceptive model, which operates the bit allocation. The mantissae of the transform coefficients are then requantized according to the bit allocation.

image

FIGURE 7.53

Block diagram of the Dolby AC-3 coder. See text for details.

The output bitstream consists of the requantized coefficients and the log spectral envelope in the shape of the exponents. There is a great deal of redundancy in the exponents. In any block, only the first exponent, corresponding to the lowest frequency, is transmitted absolutely. Remaining coefficients are transmitted differentially. When the input has a smooth spectrum the exponents in several bands will be the same and the differences will then be zero. In this case exponents can be grouped using flags.
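A sketch of that differential scheme follows (illustrative only; it omits the grouping of differences that the flags described above make possible):

import numpy as np

def encode_exponents(exps):
    # Only the lowest-frequency exponent is absolute; the rest are differences.
    exps = np.asarray(exps, dtype=int)
    return int(exps[0]), np.diff(exps)

def decode_exponents(first, diffs):
    return np.concatenate(([first], first + np.cumsum(diffs)))

exps = [12, 12, 12, 11, 11, 10, 10, 10]   # a smooth spectral envelope
first, diffs = encode_exponents(exps)
print(list(diffs))                        # mostly zeros: groupable with flags
assert list(decode_exponents(first, diffs)) == exps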

Further use is made of temporal redundancy. An AC-3 sync frame contains six blocks. The first block of the frame contains absolute exponent data, but where stationary audio is encountered, successive blocks in the frame can use the same exponents.

The receiver uses the log spectral envelope to de-serialize the mantissae of the coefficients into the correct word lengths. The highly redundant exponents are decoded starting with the lowest frequency coefficient in the first block of the frame and adding differences to create the remainder. The exponents are then used to return the coefficients to fixed-point notation. Inverse transforms are then computed, followed by a weighted overlapping of the windows to obtain PCM data.

References

1. Johnston, J.D. Transform coding of audio signals using perceptual noise criteria. IEEE J. Selected Areas in Communications, JSAC-6, 314–323 (1988).

2. Moore, B.C.J. An Introduction to the Psychology of Hearing, London: Academic Press (1989).

3. Muraoka, T., Iwahara, M., and Yamada, Y. Examination of audio bandwidth requirements for optimum sound signal transmission. J. Audio Eng. Soc., 29, 2–9 (1982).

4. Muraoka, T., Yamada, Y., and Yamazaki, M. Sampling frequency considerations in digital audio. J. Audio Eng. Soc., 26, 252–256 (1978).

5. Fincham, L.R. The subjective importance of uniform group delay at low frequencies. Presented at the 74th Audio Engineering Society Convention (New York), Preprint No. 2056 (H-1) (1983).

6. Fletcher, H. Auditory patterns. Rev. Modern Physics, 12, 47–65 (1940).

7. Zwicker, E. Subdivision of the audible frequency range into critical bands. J. Acoust. Soc. Amer., 33, 248 (1961).

8. Moore, B., and Glasberg, B. Formulae describing frequency selectivity as a function of frequency and level, and their use in calculating excitation patterns. Hearing Res., 28, 209–225 (1987).

9. Anonymous. AES recommended practice for professional digital audio applications employing pulse code modulation: preferred sampling frequencies. AES5-1984 (ANSI S4.28-1984). J. Audio Eng. Soc., 32, 781–785 (1984).

10. v.d. Plassche, R.J. Dynamic element matching puts trimless convertors on chip. Electronics, 16 June (1983).

11. v.d. Plassche, R.J., and Goedhart, D. A monolithic 14 bit D/A convertor. IEEE J. Solid-State Circuits, SC-14, 552–556 (1979).

12. Adams, R.W. Companded predictive delta modulation: a low-cost technique for digital recording. J. Audio Eng. Soc., 32, 659–672 (1984).

13. Hauser, M.W. Principles of oversampling A/D conversion. J. Audio Eng. Soc., 39, 3–26 (1991).

14. Cutler, C.C. Transmission systems employing quantization. U.S. Patent No. 2,927,962 (1960).

15. Fielder, L.D. Human auditory capabilities and their consequences in digital audio convertor design. In Audio in Digital Times, New York: Audio Engineering Society (1989).

16. Gerzon, M., and Craven, P.G. Optimal noise shaping and dither of digital signals. Presented at the 87th Audio Engineering Society Convention (New York), Preprint No. 2822 (J-1) (1989).

17. Wannamaker, R.A. Psychoacoustically optimal noise shaping. J. Audio Eng. Soc., 40, 611–620 (1992).

18. Lipshitz, S.P., Wannamaker, R.A., and Vanderkooy, J. Minimally audible noise shaping. J. Audio Eng. Soc., 39, 836–852 (1991).

19. Adams, R.W. Design and implementation of an audio 18-bit A/D convertor using oversampling techniques. Presented at the 77th Audio Engineering Society Convention (Hamburg), Preprint No. 2182 (1985).

20. Adams, R.W. An IC chip set for 20 bit A/D conversion. In Audio in Digital Times, New York: Audio Engineering Society (1989).

21. Richards, M. Improvements in oversampling analogue to digital convertors. Presented at the 84th Audio Engineering Society Convention (Paris), Preprint No. 2588 (D-8) (1988).

22. Inose, H., and Yasuda, Y. A unity bit coding method by negative feedback. Proc. IEEE, 51, 1524–1535 (1963).

23. Naus, P.J., et al. Low signal level distortion in sigma-delta modulators. Presented at the 84th Audio Engineering Society Convention (Paris), Preprint No. 2584 (1988).

24. Stikvoort, E. High order one bit coder for audio applications. Presented at the 84th Audio Engineering Society Convention (Paris), Preprint No. 2583 (D-3) (1988).

25. Moller, L. Signal levels across the EBU/AES digital audio interface. In Proceedings of the 1st NAB Radio Montreux Symposium (Montreux) pp. 16–28 (1992).

26. Wiese, D. MUSICAM: flexible bitrate reduction standard for high quality audio. Presented at the Digital Audio Broadcasting Conference (London) (1992).

27. Brandenburg, K. ASPEC coding. In Proceedings of the 10th Audio Engineering Society International Conference (New York) pp. 81–90 (1991).

28. ISO/IEC JTC1/SC2/WG11 N0030: MPEG/AUDIO test report. Stockholm (1990).

29. ISO/IEC JTC1/SC2/WG11: MPEG 91/010, the SR report on the MPEG/AUDIO subjective listening test. Stockholm (1991).

30. ISO/IEC JTC1/SC2/WG11: Committee draft 11172.

31. Bonicel, P., et al. A real time ISO/MPEG2 multichannel decoder. Presented at the 96th Audio Engineering Society Convention, Preprint No. 3798 (P3.7) (1994).

32. Caine, C.R., English, A.R., and O'Clarey, J.W.H. NICAM-3: near-instantaneous companded digital transmission for high-quality sound programmes. J. IERE, 50, 519–530 (1980).

33. Crochiere, R.E. Sub-band coding. Bell Syst. Tech. J., 60, 1633–1653 (1981).

34. Princen, J.P., Johnson, A., and Bradley, A.B. Sub-band/transform coding using filter bank designs based on time domain aliasing cancellation. Proc. ICASSP, 2161–2164 (1987).
