2

An Introduction to Digital Audio and Video

2.1  What is an Audio Signal?

Actual sounds are converted to electrical signals for convenience of handling, recording and conveying from one place to another. This is the job of the microphone. There are two basic types of microphone: those which measure the variations in air pressure due to sound, and those which measure the air velocity due to sound, although there are numerous practical types which are a combination of both.

The sound pressure or velocity varies with time and so does the output voltage of the microphone, in proportion. The output voltage of the microphone is thus an analog of the sound pressure or velocity.

As sound causes no overall air movement, the average velocity of all sounds is zero, which corresponds to silence. As a result the bi-directional air movement gives rise to bipolar signals from the microphone, where silence is in the centre of the voltage range, and instantaneously negative or positive voltages are possible. Clearly the average voltage of all audio signals is also zero, and so when level is measured, it is necessary to take the modulus of the voltage, which is the job of the rectifier in the level meter. When this is done, the greater the amplitude of the audio signal, the greater the modulus, and so a higher level is displayed.
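
The action of the rectifier is easily demonstrated numerically. The following sketch in Python (the 440 Hz tone and the simple averaging are illustrative assumptions, not a model of any real meter ballistic) shows that the average of the raw bipolar samples is essentially zero whatever the level, whereas the average of their modulus rises with amplitude:

    import numpy as np

    t = np.linspace(0, 1, 48000, endpoint=False)
    quiet = 0.1 * np.sin(2 * np.pi * 440 * t)
    loud = 0.8 * np.sin(2 * np.pi * 440 * t)

    # The raw average is zero for any amplitude: useless as a level reading.
    print(np.mean(quiet), np.mean(loud))

    # Rectifying (taking the modulus) before averaging gives a reading
    # which rises with amplitude, as a level meter requires.
    print(np.mean(np.abs(quiet)), np.mean(np.abs(loud)))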

2.2  Types of Audio Signal

Whilst the nature of an audio signal is very simple, there are many applications of audio, each requiring different bandwidth and dynamic range. Like video signals, audio signals exist in a variety of formats. Audio can mean a single monophonic signal, or several signals can be required to deliver sound with spatial attributes. In general timing and level errors between these signals will be detrimental. In stereo, two channels are required to give a spread of virtual sound sources between a pair of loudspeakers. With more channels, loudspeakers can be provided at the rear and sides to give a feeling of ambience.

A common surround sound format has five channels, where one of these feeds a front centre loudspeaker. This centre channel is not strictly necessary but exists because of the rather different circumstances in which cinema sound has traditionally been produced. In some cases a sixth channel is used for low frequency effects.

In the digital domain it is straightforward to deliver several discrete audio channels, whereas in analog systems it is not. As a result systems such as Dolby Surround were developed to allow some surround effect to be encoded into only two analog channels.

2.3  What is a Video Signal?

The goal of television is to allow a moving picture to be seen at a remote place. The picture is a two-dimensional image, which changes as a function of time. This is a three-dimensional information source where the dimensions are distance across the screen, distance down the screen and progression in time.

Whilst telescopes convey these three dimensions directly, this cannot be done with electrical signals or radio transmissions, which are restricted to a single parameter varying with time.

The solution in film and television is to convert the three-dimensional moving image into a series of still pictures, taken at the frame rate, and then, in television only, the two-dimensional images are scanned as a series of lines to produce a single voltage varying with time which can be recorded or transmitted.

2.4  Types of Video Signal

Figure 2.1 shows some of the basic types of analog colour video; each of these types can, of course, exist in a variety of line standards. Since practical colour cameras generally have three separate sensors, one for each primary colour, an RGB system will exist at some stage in the internal workings of the camera, even if it does not emerge in that form.

Figure 2.1 The major types of analog video. Red, green and blue signals emerge from the camera sensors, needing full bandwidth. If a luminance signal is obtained by a weighted sum of R, G and B, it will need full bandwidth, but the colour difference signals R–Y and B–Y need less bandwidth. Combining R–Y and B–Y in a subcarrier modulation scheme allows colour transmission in the same bandwidth as monochrome.

RGB consists of three parallel signals each having the same spectrum, and is used where the highest accuracy is needed, often for production of still pictures. Examples of this are paint systems and computer-aided design (CAD) displays. RGB is seldom used for real-time video recording; there is no standard RGB recording format for post-production or broadcast. As the red, green and blue signals directly represent part of the image, this approach is known as component video.

Some saving of bandwidth can be obtained by using colour difference working. The human eye relies on brightness to convey detail, and much less resolution is needed in the colour information. R, G and B are matrixed together to form a luminance (and monochrome compatible) signal Y, which has full bandwidth. The matrix also produces two colour difference signals, R–Y and B–Y, but these do not need the same bandwidth as Y; one-half or one-quarter will do depending on the application. In casual parlance, colour difference formats are often called component formats to distinguish them from composite formats.
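
The matrixing can be illustrated in a few lines of Python. The luminance weights below are those specified for standard definition television in ITU-R BT.601; normalized R, G and B values are assumed:

    def rgb_to_colour_difference(r, g, b):
        # Luminance is a weighted sum of R, G and B (Rec. 601 weights).
        y = 0.299 * r + 0.587 * g + 0.114 * b
        # The two colour difference signals need less bandwidth than Y.
        return y, r - y, b - y

    # For any neutral grey (R = G = B) both colour differences vanish,
    # which is what makes Y compatible with monochrome displays.
    print(rgb_to_colour_difference(0.5, 0.5, 0.5))   # (0.5, 0.0, 0.0)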

For colour television broadcast in a single channel, the PAL and NTSC systems interleave into the spectrum of a monochrome signal a subcarrier that carries two colour difference signals of restricted bandwidth. The subcarrier is intended to be invisible on the screen of a monochrome television set. A subcarrier-based colour system is generally referred to as composite video, and the modulated subcarrier is called chroma.

The majority of today’s broadcast standards use 2:1 interlace in order to save bandwidth. Figure 2.2(a) shows that in such a system, there are an odd number of lines in a frame, and the frame is split into two fields. The first field begins with a whole line and ends with a half line, and the second field begins with a half line, which allows it to interleave spatially with the first field. Interlace may be viewed as a crude compression technique which was essentially rendered obsolete by the development of digital image compression techniques such as MPEG.

Figure 2.2(a) 2:1 interlace.

The field rate is intended to determine the flicker frequency, whereas the frame rate determines the bandwidth needed, which is thus halved along with the information rate. Information theory tells us that halving the information rate must reduce quality, and so the saving in bandwidth is accompanied by a variety of effects that are actually compression artefacts. Figure 2.2(b) shows the spatial/temporal sampling points in a 2:1 interlaced system. If an object has a sharp horizontal edge, it will be present in one field but not in the next. The refresh rate of the edge is reduced to frame rate, and the effect becomes visible as twitter. Whilst the vertical resolution of a test card is maintained with interlace, apart from the twitter noted, the ability of an interlaced standard to convey motion is halved. In the light of what is now known [1], interlace causes degradation roughly proportional to bandwidth reduction, and so should not be considered for any future standards.

Figure 2.2(b) In an interlaced system, a given point A is only refreshed at frame rate, causing twitter on fine vertical detail.
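
The loss of temporal resolution on fine vertical detail is easily shown. In this minimal Python sketch (the tiny nine-line frame is purely illustrative) a horizontal edge one line high falls into only one of the two fields, so it is refreshed at frame rate rather than field rate:

    import numpy as np

    frame = np.zeros((9, 4))   # a progressive frame with an odd line count
    frame[4, :] = 1.0          # fine vertical detail: an edge one line high

    field_1 = frame[0::2]      # lines 0, 2, 4, ... form one field
    field_2 = frame[1::2]      # lines 1, 3, 5, ... form the other

    # The detail appears in one field only, so it twitters at frame rate.
    print(field_1.sum(), field_2.sum())   # 4.0 0.0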

A wide variety of frame rates exist. Initially it was thought that a frame rate above the critical flicker frequency (CFF) of human vision at about 50 Hz would suffice, but this is only true for still pictures. When a moving object is portrayed, the eye tracks it and thus sees the background presented as a series of stills each in a different place. It is thus the prevention of background strobing that should be the criterion for frame rate. Needless to say the frame rate of film at 24 Hz is totally inadequate and the 50 and 60 Hz rates of television are suboptimal. Equipment running at 75 Hz or more gives noticeably more fluid and realistic motion.

Originally film ran at 18 fps and to reduce flicker each frame was shown three times using a multi-blade shutter. During the development of ‘talking pictures’ it was found that the linear film speed at 18 Hz provided insufficient audio bandwidth and the frame rate was increased to 24 Hz to improve the sound quality. Thus the frame rate of movie film to this day is based on no considerations of human vision whatsoever. The adoption of digital film production techniques that simply retain this inadequate frame rate shows a lack of vision. A digitized film frame is essentially the same as a progressively scanned television frame, but at 24 or 25 Hz, a video monitor will be unacceptable because of flicker. This gave rise to the idea of a ‘segmented frame’ in which the simultaneously captured frame was displayed as fields containing alternately the odd and even lines. This allows digital film material to be viewed on television production equipment.

European television standards reflect the local AC power frequency and use 50 fields per second. American standards initially used a 60 Hz field rate but on the introduction of colour the rate had to be reduced by 0.1% to 59.94 Hz to prevent chroma and sound interference.

Film cameras are flexible devices and where film is being shot for television purposes the film frame rate may be made 25 Hz or even 30 Hz so that two television fields may be created from each film frame in a telecine machine.

With the move towards high definition television, obviously more lines were required in the picture, with more bandwidth to allow more detail along the line. This, of course, only gives a better static resolution, whereas what was needed was better resolution in the case of motion. This requires an increase in frame rate, but none was forthcoming. High definition formats using large line counts but which retained interlace gave poor results because the higher static resolution made it more obvious how poor the resolution became in the presence of motion.

Those high definition standards using progressive scan gave better results, with 720P easily outperforming interlaced formats. Digital broadcasting systems support the use of progressively scanned pictures. Video interfaces have to accept the shortcomings inherent in the signals they convey. Their job is simply to pass on the signals without any further degradation.

2.5  What is a Digital Signal?

It is a characteristic of analog systems that degradations cannot be separated from the original signal, so nothing can be done about them. At the end of a system a signal carries the sum of all degradations introduced at each stage through which it passed. This sets a limit to the number of stages through which a signal can be passed before it is useless. Alternatively, if many stages are envisaged, each piece of equipment must be far better than necessary so that the signal is still acceptable at the end. The equipment will naturally be more expensive.

One of the vital concepts to grasp is that digital audio and video are simply alternative means of carrying the same information. An ideal digital recorder has the same characteristics as an ideal analog recorder: both of them are totally transparent and reproduce the original applied waveform without error. One need only compare high quality analog and digital equipment side by side with the same signals to realize how transparent modern equipment can be. Needless to say, in the real world, ideal conditions seldom prevail, so analog and digital equipment both fall short of the ideal. Digital equipment simply falls short of the ideal to a smaller extent than does analog and at lower cost, or, if the designer chooses, can have the same performance as analog at much lower cost.

Although there are a number of ways in which audio and video waveforms can be represented digitally, there is one system, known as pulse code modulation (PCM), that is in virtually universal use. Figure 2.3 shows how PCM works. Instead of being continuous, the time axis is represented in a discrete, or stepwise, manner. The waveform is not carried by continuous representation, but by measurement at regular intervals. This process is called sampling and the frequency with which samples are taken is called the sampling rate or sampling frequency Fs. The sampling rate is generally fixed and is not necessarily a function of any frequency in the signal, although in video it may be for convenience. If every effort is made to rid the sampling clock of jitter, or time instability, every sample will be made at an exactly even time step. Clearly if there is any subsequent timebase error, the instants at which samples arrive will be changed and the effect can be detected. If samples arrive at some destination with an irregular timebase, the effect can be eliminated by storing the samples temporarily in a memory and reading them out using a stable, locally generated clock. This process is called timebase correction and all properly engineered digital systems must use it. Clearly timebase error is not simply reduced; it can be totally eliminated. As a result there is little point in measuring the wow and flutter or timebase error of a digital recorder; it doesn’t have any, and any such measurement merely reflects the stability of the crystal clock in the timebase corrector, in other words the stability of the measuring equipment. It should be stressed that sampling is an analog process. Each sample can still vary infinitely in amplitude, just as the original waveform did.

Figure 2.3 The major process in PCM conversion. A/D conversion (top) and D/A conversion (bottom). Note that the quantizing step can be omitted to examine sampling and reconstruction independently of quantizing (dotted arrow).
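
Timebase correction amounts to no more than a memory written with the samples as they arrive and read with a clean local clock. A minimal Python sketch of the principle (not of any real hardware) might be:

    from collections import deque

    buffer = deque()

    def write_sample(value):
        # Samples may arrive with an irregular timebase; only their
        # values are stored, so the timing of arrival is discarded.
        buffer.append(value)

    def read_sample():
        # Called at exactly even intervals by a stable, locally
        # generated clock, so the output timebase is perfect.
        return buffer.popleft()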

Those who are not familiar with digital processes often worry that sampling takes away something from a signal because it is not taking notice of what happened between the samples. This would be true in a system having infinite bandwidth, but no analog signal can have infinite bandwidth. All analog signal sources from microphones, tape decks, cameras and so on have a frequency response limit, as indeed do our ears and eyes. When a signal has finite bandwidth, the rate at which it can change is limited, and the way in which it changes becomes predictable. When a waveform can only change between samples in one way, it is then only necessary to carry the samples and the original waveform can be reconstructed from them.

Figure 2.3 also shows that each sample is discrete, or represented in a stepwise manner. The magnitude of each sample, which is proportional to the voltage of the waveform at that instant, is represented by a whole number. This process is known as quantizing and results in an approximation, but the size of the error can be controlled until it is negligible. If, for example, we were to measure the height of humans to the nearest metre, virtually all adults would register 2 metres high and obvious difficulties would result. These are generally overcome by measuring height to the nearest centimetre. Clearly there is no advantage in going further and expressing our height in a whole number of millimetres or even micrometres. The point is that an appropriate resolution can be found just as readily for audio or video, and greater accuracy is not beneficial. The link between quality and sample resolution is explored later in this chapter. The advantage of using whole numbers is that they are not prone to drift. If a whole number can be carried from one place to another without numerical error, it has not changed at all. By describing waveforms numerically, the original information has been expressed in a way that is better able to resist unwanted changes.

Essentially, digital systems carry the original waveform numerically. The number of the sample is an analog of time, and the magnitude of the sample is an analog of the signal voltage. As both axes of the waveform are discrete, the waveform can be accurately restored from numbers as if it were being drawn on graph paper. If we require greater accuracy, we simply choose paper with smaller squares. Clearly more numbers are required and each one could change over a larger range.

In simple terms, the waveform is conveyed in a digital recorder as if the voltage had been measured at regular intervals with a digital meter and the readings had been written down on a roll of paper. The rate at which the measurements were taken and the accuracy of the meter are the only factors which determine the quality, because once a parameter is expressed as a discrete number, a series of such numbers can be conveyed unchanged. Clearly in this example the handwriting used and the grade of paper have no effect on the information. The quality is determined only by the accuracy of conversion and is independent of the quality of the signal path.

In practical systems, binary numbers are used, as was explained in Chapter 1, in which it was also shown that there are two ways in which binary signals can be used to carry samples. When each digit of the binary number is carried on a separate wire this is called parallel transmission. The states of the signals change at the sampling rate. This approach is used in the parallel video interfaces, as video needs a relatively short word length: eight or ten bits. Using multiple wires is cumbersome where a long word length is in use, and a single wire can be used where successive digits from each sample are sent serially. This is the definition of pulse code modulation. Clearly the clock frequency must now be higher than the sampling rate. Whilst the transmission of audio by such a scheme is advantageous in that noise and timebase error have been eliminated, the penalty is that a single high quality audio channel requires around one million bits per second. Digital audio could only come into use when such a data rate could be handled economically.

As a digital video channel requires of the order of two hundred million bits per second it is not surprising that digital audio equipment became common somewhat before digital video.
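
These figures follow from simple arithmetic. Taking, for example, a 48 kHz sampling rate and 20-bit samples, an audio channel needs 48 000 × 20 = 960 000, or around one million, bits per second. For standard definition component video, sampled as described in section 2.8.7 at 13.5 MHz for luminance and 6.75 MHz for each of two colour difference signals, eight-bit samples give (13.5 + 6.75 + 6.75) × 8 = 216 million bits per second.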

2.6  Why Digital?

There are two main answers to this question, and it is not possible to say which is the more important, as it will depend on one’s standpoint:

(a) The quality of reproduction of a well-engineered digital system is independent of the medium and, in the absence of a compression scheme, depends only on the quality of the conversion processes.

(b) The conversion to the digital domain allows tremendous opportunities that were denied to analog signals.

Someone who is only interested in quality will judge the former the most relevant. If good quality convertors can be obtained, all of the shortcomings of analog recording and transmission can be eliminated to great advantage. One’s greatest effort is expended in the design of convertors, whereas those parts of the system that handle data need only be workmanlike. Wow, flutter, timebase error, vector jitter, crosstalk, particulate noise, print-through, dropouts, modulation noise, HF squashing, azimuth error and interchannel phase errors are all history. When a digital recording is copied, the same numbers appear on the copy: it is not a dub, it is a clone. If the copy is indistinguishable from the original, there has been no generation loss. Digital recordings can be copied indefinitely without loss of quality through a digital interface.

In the real world everything has a cost, and one of the greatest strengths of digital technology is low cost. If copying causes no quality loss, recorders do not need to be far better than necessary in order to withstand generation loss. They need only be of adequate quality on the first generation if that quality is then maintained. There is no need for the great size and extravagant tape consumption of professional analog recorders. When the information to be recorded is discrete numbers, they can be packed densely on the medium without quality loss. Should some bits be in error because of noise or dropout, error correction can restore the original value. Digital recordings take up less space than analog recordings for the same or better quality. Tape costs are far less and storage costs are reduced.

Digital circuitry costs less to manufacture. Switching circuitry handling binary can be integrated more densely than analog circuitry. More functionality can be put in the same chip. Analog circuits are built from a host of different component types having a variety of shapes and sizes and are costly to assemble and adjust. Digital circuitry uses standardized component outlines and is easier to assemble on automated equipment. Little if any adjustment is needed.

Once audio or video signals are in the digital domain, they become data, and apart from the need to be reproduced at the correct sampling rate, are indistinguishable from any other type of data. Systems and techniques developed in other industries for other purposes can be used for audio and video. Computer equipment is available at low cost because the volume of production is far greater than that of professional audio equipment. Disk drives and memories developed for computers can be put to use in audio and video products. A word processor adapted to handle audio samples becomes a workstation. If video data is handled the result is a non-linear editor.

Communications networks developed to handle data can happily carry digital audio and video over indefinite distances without quality loss. Digital audio broadcasting (DAB) makes use of these techniques to eliminate the interference, fading and multipath reception problems of analog broadcasting. At the same time, more efficient use is made of available bandwidth. Digital broadcasting techniques are now being applied to television signals, with DVB being used in Europe and other parts of the world.

Digital equipment can have self-diagnosis programs built in. The machine points out its own failures. Routine, mind-numbing adjustment of analog circuits to counteract drift is no longer needed. The cost of maintenance falls. A small operation may not need maintenance staff at all; a service contract is sufficient. A larger organization will still need maintenance staff, but they will be fewer in number and more highly skilled.

2.7  The Information Content of an Analog Signal

Any analog signal source can be characterized by a given useful bandwidth and signal-to-noise ratio. Video signals have very wide bandwidth extending over several MHz and require only 50 dB or so S/N ratio whereas audio signals require only 20 kHz but need a much better S/N ratio. If a well-engineered digital channel having a wider bandwidth and a greater signal-to-noise ratio is put in series with such a source, it is only necessary to set the levels correctly and the signal is then subject to no loss of information whatsoever.

Provided the digital clipping level is above the largest input signal, the digital noise floor is below the inherent noise in the input signal and the low- and high-frequency response of the digital channel extends beyond the frequencies in the input signal, then the digital channel is a ‘wider window’ than the input signal needs and its extremities cannot be explored by that signal. As a result there is no test known which can reliably tell whether or not the digital system was present, unless, of course, it is deficient in some quantifiable way.

In audio the wider-window effect is obvious on certain Compact Discs made from analog master tapes. The CD player faithfully reproduces the tape hiss, dropouts and HF squashing of the analog master, which render the entire CD mastering and reproduction system transparent by comparison.

On the other hand, if an analog source can be found which has a wider window than the digital system, then the presence of the digital system will be evident either due to the reduction in bandwidth or the reduction in dynamic range. This will be evident on listening to DAB broadcasts that use compression and are inferior to the best analog practice.

2.8  Introduction to Conversion

There are a number of ways in which an audio waveform can be digitally represented, but the most useful and therefore common is PCM, which was introduced in Chapter 1. The input is a continuous-time, continuous-voltage waveform, and this is converted into a discrete-time, discrete-voltage format by a combination of sampling and quantizing. As these two processes are orthogonal (at right angles to one another) they are totally independent and can be performed in either order. Figure 2.4(a) shows an analog sampler preceding a quantizer, whereas (b) shows an asynchronous quantizer preceding a digital sampler. Ideally, both will give the same results; in practice each suffers from different deficiencies. Both approaches will be found in real equipment. In video convertors operating speed is a priority and the flash convertor is universally used. In the flash convertor the quantizing step is performed first and the quantized signal is subsequently sampled. In audio the sample accuracy is a priority and sampling first has the effect of freezing the signal voltage to allow time for an accurate quantizing process.

Figure 2.4 Since sampling and quantizing are orthogonal, the order in which they are performed is not important. At (a) sampling is performed first and the samples are quantized. This is common in audio convertors. At (b) the analog input is quantized into an asynchronous binary code. Sampling takes place when this code is latched on sampling clock edges. This approach is universal in video convertors.

The independence of sampling and quantizing allows each to be discussed quite separately in some detail, prior to combining the processes for a full understanding of conversion.

2.8.1 Sampling and Aliasing

Sampling is no more than periodic measurement, and it will be shown here that there is no theoretical need for sampling to be audible. Practical equipment may, of course, be less than ideal, but, given good engineering practice, the ideal may be approached quite closely.

Sampling must be precisely regular, because the subsequent process of timebase correction assumes a regular original process. The sampling process originates with a pulse train shown in Figure 2.5(a) to be of constant amplitude and period. The signal waveform amplitude-modulates the pulse train in much the same way as the carrier is modulated in an AM radio transmitter. One must be careful to avoid overmodulating the pulse train as shown in (b) and this is helped by applying a DC offset to the analog waveform so that, in audio, silence corresponds to a level half-way up the pulses as in (c). This approach will also be used in colour difference signals where blanking level will be shifted up to the midpoint of the scale. Clipping due to any excessive input level will then be symmetrical.

Figure 2.5 The sampling process requires a constant amplitude pulse train as shown at (a). This is amplitude modulated by the waveform to be sampled. If the input waveform has excessive amplitude or incorrect level, the pulse train clips as shown at (b). For an audio waveform, the greatest signal level is possible when an offset of half the pulse amplitude is used to centre the waveform as shown at (c).

In the same way that AM radio produces sidebands or images above and below the carrier, sampling also produces sidebands, although the carrier is now a pulse train and has an infinite series of harmonics, as can be seen in Figure 2.6(a). The sidebands in (b) repeat above and below each harmonic of the sampling rate.

Figure 2.6 (a) Spectrum of sampling pulses. (b) Spectrum of samples. (c) Aliasing due to sideband overlap. (d) Beat-frequency production. (e) Four-times oversampling.

The sampled signal can be returned to the continuous-time domain simply by passing it into a low-pass filter. This filter has a frequency response that prevents the images from passing, and only the baseband signal emerges, completely unchanged.

If an input is supplied having an excessive bandwidth for the sampling rate in use, the sidebands will overlap, and the result is aliasing, where certain output frequencies are not the same as their input frequencies but become difference frequencies. It will be seen from (c) that aliasing occurs when the input frequency exceeds half the sampling rate, and from this derives the most fundamental rule of sampling, first stated by Shannon in the West and at about the same time by Kotelnikov in Russia. This states that the sampling rate must be at least twice the highest input frequency.

In addition to the low-pass filter needed at the output to return to the continuous-time domain, a further low-pass filter is needed at the input to prevent aliasing. If input frequencies of more than half the sampling rate cannot reach the sampler, aliasing cannot occur.

Whilst aliasing has been described above in the frequency domain, it can equally be described in the time domain. In Figure 2.7(a) the sampling rate is obviously adequate to describe the waveform, but in (b) it is inadequate and aliasing has occurred.

Figure 2.7 At (a) the sampling rate is adequate to reconstruct the original signal. At (b) the sampling rate is inadequate, and reconstruction produces the wrong waveform (dotted). Aliasing has taken place.

Aliasing is commonly seen on television and in the cinema, owing to the relatively low frame rates used. With a frame rate of 24 Hz, a film camera will alias on any object changing at more than 12 Hz. Such objects include the spokes of stagecoach wheels, especially when being chased by Native Americans. When the spoke-passing frequency reaches 24 Hz the wheels appear to stop. Aliasing partly explains the inability of opinion polls to predict the results of elections. Aliasing does, however, have useful applications, including the stroboscope, which makes rotating machinery appear stationary, and the sampling oscilloscope, which can display periodic waveforms of much greater frequency than the sweep speed of the tube normally allows.
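
The folding rule behind all of these examples can be captured in a couple of lines of Python. The function below returns the apparent output frequency for a given input frequency and sampling (or frame) rate:

    def apparent_frequency(f_in, fs):
        # Aliasing folds the input frequency into the band 0 to fs/2.
        f = f_in % fs
        return min(f, fs - f)

    # Spokes passing the camera at 24 Hz, filmed at 24 frames per second:
    print(apparent_frequency(24.0, 24.0))   # 0.0: the wheel appears to stop
    print(apparent_frequency(23.0, 24.0))   # 1.0: it appears to creep round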

In television systems the input image that falls on the camera sensor will be continuous in time, and continuous in two spatial dimensions corresponding to the height and width of the sensor. All three of these continuous dimensions will be sampled in a digital system. There is a direct connection between the concept of temporal sampling, where the input signal changes with respect to time at some frequency and is sampled at some other frequency, and spatial sampling, where an image changes a given number of times per unit distance and is sampled at some other number of times per unit distance. The connection between the two is the process of scanning. Temporal frequency can be obtained by multiplying spatial frequency by the speed of the scan. Figure 2.8 shows a hypothetical image sensor having 1000 discrete sensors across a width of 1 centimetre. The spatial sampling rate of this sensor is thus 1000 per centimetre. If the sensors are measured sequentially during a scan taking 1 millisecond to go across the 1 centimetre width, the result will be a temporal sampling rate of 1 MHz.

Figure 2.8 If the above spatial sampling arrangement of 1000 points per centimetre is scanned in 1 millisecond, the sampling rate will become 1 megahertz.

2.8.2 Reconstruction

If ideal low-pass anti-aliasing and anti-image filters are assumed, having a vertical cut-off slope at half the sampling rate, an ideal spectrum shown at Figure 2.9(a) is obtained. Figure 2.9(b) shows that the impulse response of a phase linear ideal low-pass filter is a sin x/x waveform in the time domain. Such a waveform passes through zero volts periodically. If the cut-off frequency of the filter is one-half of the sampling rate, the impulse passes through zero at the sites of all other samples. Thus at the output of such a filter, the voltage at the centre of a sample is due to that sample alone, since the value of all other samples is zero at that instant. In other words the continuous time output waveform must join up the tops of the input samples. In between the sample instants, the output of the filter is the sum of the contributions from many impulses, and the waveform smoothly joins the tops of the samples. If the time domain is being considered, the anti-image filter of the frequency domain can equally well be called the reconstruction filter. It is a consequence of the band-limiting of the original anti-aliasing filter that the filtered analog waveform could only travel between the sample points in one way. As the reconstruction filter has the same frequency response, the reconstructed output waveform must be identical to the original band-limited waveform prior to sampling.

Figure 2.9 The impulse response of a low-pass filter which cuts off at Fs/2 has zeros at 1/Fs spacing which correspond to the position of adjacent samples, as shown at (b). The output will be a signal which has the value of each sample at the sample instant, but with smooth transitions from sample to sample.

It follows that sampling need not be audible. A rigorous mathematical proof of the above has been available since the 1930s, when PCM was invented, and can also be found in Betts [2].
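
The reconstruction can also be demonstrated directly. In this Python sketch (the 8 Hz sampling rate, 1 Hz tone and short sample run are illustrative; an infinite run of samples would make the match exact) every sample launches a sin x/x impulse response and the sum is evaluated between two sample instants:

    import numpy as np

    fs = 8.0                                  # sampling rate
    ts = np.arange(64) / fs                   # the sample instants
    samples = np.sin(2 * np.pi * 1.0 * ts)    # a 1 Hz tone, well below fs/2

    def reconstruct(t):
        # One sinc per sample; np.sinc(x) = sin(pi x)/(pi x) is zero at
        # every other sample site, so the sample values are preserved.
        return sum(s * np.sinc(fs * t - k) for k, s in enumerate(samples))

    t = 0.4375   # midway between two sample instants
    print(reconstruct(t), np.sin(2 * np.pi * 1.0 * t))   # closely matching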

2.8.3 Filter Design

The ideal filter with a vertical ‘brick-wall’ cut-off slope is difficult to implement. As the slope tends to vertical, the delay caused by the filter goes to infinity: the quality is marvellous but you do not live to measure it. In practice, a filter with a finite slope is accepted as shown in Figure 2.10, and the sampling rate has to be raised a little to prevent aliasing. There is no absolute factor by which the sampling rate must be raised; it depends upon the available filters.

Figure 2.10 As filters with finite slope are needed in practical systems, the sampling rate is raised slightly beyond twice the highest frequency in the baseband.

It is not easy to specify such filters, particularly the amount of stopband rejection needed. The amount of aliasing resulting would depend on, among other things, the amount of out-of-band energy in the input signal. This is seldom a problem in video, but can warrant attention in audio where overspecified bandwidths are sometimes found. As a further complication, an out-of-band signal will be attenuated by the response of the anti-aliasing filter to that frequency, but the residual signal will then alias, and the reconstruction filter will reject it according to its attenuation at the new frequency to which it has aliased.

It could be argued that the reconstruction filter is unnecessary in audio, since all of the images are outside the range of human hearing. However, the slightest non-linearity in subsequent stages would result in gross intermodulation distortion. The possibility of damage to tweeters must also be considered. It would, however, be acceptable to bypass one of the filters involved in a copy from one digital machine to another via the analog domain, although a digital transfer is of course to be preferred. In video the filters are essential to constrain the bandwidth of the signal to the allowable broadcast channel width.

The nature of the filters used has a great bearing on the subjective quality of the system. Entire books have been written about analog filters, so they will only be treated briefly here. Figure 2.11 shows the terminology to be used to describe the common elliptic low-pass filter. These filters are popular because they can be realized with fewer components than other filters of similar response. It is a characteristic of these elliptic filters that there are ripples in the passband and stopband. In much equipment the anti-aliasing filter and the reconstruction filter will have the same specification, so that the passband ripple is doubled with a corresponding increase in dispersion. Sometimes slightly different filters are used to reduce the effect.

Figure 2.11 The important features and terminology of low-pass filters used for anti-aliasing and reconstruction.

It is difficult to produce an analog filter with low distortion. Passive filters using inductors suffer non-linearity at high levels due to the B/H curve of the cores. Active filters can simulate linear inductors using op-amp techniques, but they tend to suffer non-linearity at high frequencies where the falling open-loop gain reduces the effect of feedback. Active filters can also contribute noise, but this is not necessarily a bad thing in controlled amounts, since it can act as a dither source.

It is instructive to examine the phase response of such filters. Since a sharp cut-off is generally achieved by cascading many filter sections that cut at a similar frequency, the phase responses of these sections will accumulate. The phase may start to leave linearity at a fraction of the cut-off frequency and by the time this is reached, the phase may have completed several revolutions. Meyer [3] suggests that these phase errors are audible and that equalization is necessary. In video, phase linearity is essential as otherwise the different frequency components of a contrast step are smeared across the screen. An advantage of linear phase filters is that ringing is minimized, and there is less possibility of clipping on transients.

It is possible to construct a ripple-free phase-linear filter with the required stopband rejection [4,5], but the design effort and component complexity result in expense, and the filter might drift out of specification as components age. The effort may be more effectively directed towards avoiding the need for such a filter. Much effort can be saved in analog filter design by using oversampling. As shown in Figure 2.12 a high sampling rate produces a large spectral gap between the baseband and the first lower sideband. The anti-aliasing and reconstruction filters need only have a gentle roll-off, causing minimum disturbance to phase linearity in the baseband, and the Butterworth configuration, which does not have ripple or dispersion, can be used. The penalty of oversampling is that an excessive data rate results. It is necessary to reduce the rate using a digital low-pass filter (LPF). Digital filters can be made perfectly phase linear and, using LSI, can be inexpensive to construct. The superiority of oversampling convertors means that they have become universal in audio and appear set to do so in video.

Figure 2.12 In this 4-times oversampling system, the large separation between baseband and sidebands allows a gentle roll-off reconstruction filter to be used.
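
The benefit is easily quantified. Taking a 20 kHz audio band edge as an example, at a 48 kHz sampling rate the anti-aliasing filter must fall from passband to stopband between 20 kHz and 28 kHz, the bottom of the first lower sideband: about half an octave. At four-times oversampling (192 kHz) the same filter has from 20 kHz to 172 kHz, more than three octaves, so a gentle roll-off suffices.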

2.8.4 Sampling Clock Jitter

The instants at which samples are taken in an A/D convertor (ADC) and the instants at which D/A convertors (DACs) make conversions must be evenly spaced; otherwise unwanted signals can be added to the waveform. Figure 2.13(a) shows the effect of sampling clock jitter on a sloping waveform. Samples are taken at the wrong times. When these samples have passed through a system, the timebase correction stage prior to the DAC will remove the jitter, and the result is shown in Figure 2.13(b). The magnitude of the unwanted signal is proportional to the slope of the signal waveform and so increases with frequency. The nature of the unwanted signal depends on the spectrum of the jitter. If the jitter is random, the effect is noise-like and relatively benign unless the amplitude is excessive. Clock jitter is, however, not necessarily random. Figure 2.14 shows that one source of clock jitter is crosstalk on the clock signal. The unwanted additional signal changes the time at which the sloping clock signal appears to cross the threshold voltage of the clock receiver. The threshold itself may be changed by ripple on the clock receiver power supply. There is no reason why these effects should be random; they may be periodic and potentially discernible.

Figure 2.13 The effect of sampling timing jitter on noise. (a) A ramp sampled with jitter has an error proportional to the slope. (b) When the jitter is removed by later circuits, the error appears as noise added to samples. The superimposition of jitter may also be considered as a modulation process.

Figure 2.14 Crosstalk in transmission can result in unwanted signals being added to the clock waveform. It can be seen here that a low-frequency interference signal affects the slicing of the clock and causes a periodic jitter.

The allowable jitter is measured in picoseconds in both audio and video signals, as shown in Figure 2.13, and clearly steps must be taken to eliminate it by design. Convertor clocks must be generated from clean power supplies that are well decoupled from the power used by the logic because a convertor clock must have a good signal-to-noise ratio. If an external clock is used, it cannot be used directly, but must be fed through a well-damped phase-locked loop that will filter out the jitter. The external clock signal is sometimes fed into the clean circuitry using an optical coupler to improve isolation.
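
The necessary scale of jitter control can be estimated from a worst-case calculation: the error voltage is the waveform slope multiplied by the timing error, and a sine wave of frequency f and amplitude A has a maximum slope of 2πfA. A Python sketch (the full-scale 20 kHz tone and the jitter figures are illustrative assumptions):

    import numpy as np

    def worst_case_error_db(f, jitter_s):
        # Peak error = slope x timing error = 2*pi*f*A*jitter, with
        # A = 1 (full scale), expressed in dB relative to full scale.
        return 20 * np.log10(2 * np.pi * f * jitter_s)

    print(worst_case_error_db(20e3, 1e-9))      # about -78 dB for 1 ns
    print(worst_case_error_db(20e3, 100e-12))   # about -98 dB for 100 ps

A 16-bit audio system has a noise floor some 98 dB below full scale, which is why the allowable jitter is of the order of 100 picoseconds or less.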

Although it has been documented for many years, attention to control of clock jitter is not as great in actual audio hardware as it might be. It accounts for many of the slight audible differences between convertors reproducing the same data. A well-engineered convertor should substantially reject jitter on an external clock and should sound the same when reproducing the same data irrespective of the source of the data. A remote convertor which sounds different when reproducing, for example, the same Compact Disc via the digital outputs of a variety of CD players is simply not well engineered and should be rejected. Similarly if the effect of changing the type of digital cable feeding the convertor can be heard, the unit is a dud. Unfortunately many consumer external DACs fall into this category, as the steps outlined above have not been taken.

Jitter tends to be less noticeable on digital video signals and is generally not an issue until it becomes great enough to cause data errors.

2.8.5 Aperture Effect

The reconstruction process of Figure 2.9 only operates exactly as shown if the impulses are of negligible duration. In many convertors this is not the case: the analog output is kept constant until a different sample value is input, producing a waveform more like a staircase than a pulse train. In this case the pulses have effectively been extended in width to become equal to the sample period. This is known as a zero-order hold system and has a 100% aperture ratio. Note that the aperture effect is not apparent in a track-hold system; the holding period is for the convenience of the quantizer, which outputs a value corresponding to the input voltage at the instant hold mode was entered.

Whereas pulses of negligible width have a uniform spectrum, which is flat within the audio band, pulses of 100% aperture have a sin x/x spectrum shown in Figure 2.15. The frequency response falls to a null at the sampling rate, and as a result is about 4 dB down at the edge of the baseband. If the pulse width is stable, the reduction of high frequencies is constant and predictable, and an appropriate equalization circuit can render the overall response flat once more. An alternative is to use resampling whose effect is shown in Figure 2.16. Resampling passes the zero-order-hold waveform through a further synchronous sampling stage consisting of an analog switch that closes briefly in the centre of each sample period. The output of the switch will be pulses that are narrower than the original. If, for example, the aperture ratio is reduced to 50% of the sample period, the first response null is now at twice the sampling rate, and the loss at the edge of the audio band is reduced. As the figure shows, the frequency response becomes flatter as the aperture ratio falls. The process should not be carried too far, as with very small aperture ratios there is little energy in the pulses and noise can be a problem. A practical limit is around 12.5% where the frequency response is virtually ideal.

Figure 2.15 Frequency response with 100% aperture nulls at multiples of the sampling rate. The area of interest is up to half the sampling rate.

Figure 2.16 (a) A resampling circuit eliminates transients and reduces aperture ratio. (b) Response of various aperture ratios.
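
The figures quoted above follow directly from the sin x/x spectrum. A short Python sketch evaluating the loss at the edge of the baseband (f = Fs/2) for various aperture ratios:

    import numpy as np

    def aperture_loss_db(aperture_ratio, f_over_fs):
        # A zero-order hold of width (aperture_ratio/Fs) rolls off as
        # sin(x)/x; np.sinc(x) computes sin(pi x)/(pi x).
        return 20 * np.log10(np.abs(np.sinc(aperture_ratio * f_over_fs)))

    for ratio in (1.0, 0.5, 0.125):
        print(ratio, aperture_loss_db(ratio, 0.5))
    # 100% aperture: about -3.9 dB at the band edge; 50%: about -0.9 dB;
    # 12.5%: less than -0.1 dB, which is virtually ideal.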

The aperture effect will show up in many aspects of television. Lenses have finite modulation transfer functions, such that a very small object becomes spread in the image. The image sensor will also have a finite aperture function. In tube cameras, the beam will have a finite radius, and will not necessarily have a uniform energy distribution across its diameter. In CCD cameras, the sensor is split into elements that may almost touch in some cases. The element integrates light falling on its surface, and so will have a rectangular aperture. In both cases there will be a roll-off of higher spatial frequencies.

It is highly desirable to prevent spatial aliasing, since the result is visually irritating. In tube cameras the aliasing will be in the vertical dimension only, since the horizontal dimension is continuously scanned. Such cameras seldom attempt to prevent vertical aliasing. CCD sensors can, however, alias in both horizontal and vertical dimensions, and so an anti-aliasing optical filter is generally fitted between the lens and the sensor. This takes the form of a plate that diffuses the image formed by the lens. Such a device can never have a sharp cut-off nor will the aperture be rectangular. The aperture of the anti-aliasing plate is in series with the aperture effect of the CCD elements, and the combination of the two effectively prevents spatial aliasing, and generally gives a good balance between horizontal and vertical resolution, allowing the picture a natural appearance.

Conventional tube cameras generally have better horizontal resolution, and produce vertical aliasing which has a similar spatial frequency to real picture information, but which lacks realism. With a conventional approach, there are effectively two choices. If aliasing is permitted, the theoretical information rate of the system can be approached. If aliasing is prevented, realizable anti-aliasing filters cannot cut off sharply, and the information conveyed is below system capacity.

These considerations also apply at the television display. The display must filter out spatial frequencies above one-half the sampling rate. In a conventional CRT this means that an optical filter should be fitted in front of the screen to render the raster invisible. Again the aperture of a simply realizable filter would attenuate too much of the wanted spectrum, and so the technique is not used. The technique of spot wobble was most effective in reducing raster visibility in monochrome television sets without affecting horizontal resolution, and its neglect remains a mystery.

As noted, in conventional tube cameras and CRTs the horizontal dimension is continuous, whereas the vertical dimension is sampled. The aperture effect means that the vertical resolution in real systems will be less than sampling theory permits, and to obtain equal horizontal and vertical resolutions a greater number of lines are necessary. The magnitude of the increase is described by the so-called Kell factors [6], although the term factor is a misnomer since it can have a range of values depending on the apertures in use and the methods used to measure resolution [7]. In digital video, sampling takes place in horizontal and vertical dimensions, and the Kell parameter becomes unnecessary. The outputs of digital systems will, however, be displayed on raster scan CRTs, and the Kell parameter of the display will then be effectively in series with the other system constraints.

2.8.6 Choice of Audio Sampling Rate

The Nyquist criterion is only the beginning of the process that must be followed to arrive at a suitable sampling rate. The slope of available filters will compel designers to raise the sampling rate above the theoretical Nyquist rate. For consumer products, the lower the sampling rate the better, since the cost of the medium is directly proportional to the sampling rate: thus sampling rates near to twice 20 kHz are to be expected. For professional products, there is a need to operate at variable speed for pitch correction. When the speed of a digital recorder is reduced, the off-tape sampling rate falls, and Figure 2.17 shows that with a minimal sampling rate the first image frequency can become low enough to pass the reconstruction filter. If the sampling frequency is raised without changing the response of the filters, the speed can be reduced without this problem. It follows that variable-speed recorders must use a higher sampling rate.

Figure 2.17 At normal speed, the reconstruction filter correctly prevents images entering the baseband, as at (a). (b) When speed is reduced, the sampling rate falls, and a fixed filter will allow part of the lower sideband of the sampling frequency to pass. (c) If the sampling rate of the machine is raised, but the filter characteristics remain the same, the problem can be avoided.

In the early days of digital audio research, the necessary bandwidth of about 1 megabit per second per audio channel was difficult to store. Disk drives had the bandwidth but not the capacity for long recording time, so attention turned to video recorders. These were adapted to store audio samples by creating a pseudo-video waveform that could convey binary as black and white levels [8]. The sampling rate of such a system is constrained to relate simply to the field rate and field structure of the television standard used, so that an integer number of samples can be stored on each usable TV line in the field. Such a recording can be made on a monochrome recorder, and these recordings are made in two standards, 525 lines at 60 Hz and 625 lines at 50 Hz. Thus it is possible to find a frequency which is a common multiple of the two and also suitable for use as a sampling rate.

The allowable sampling rates in a pseudo-video system can be deduced by multiplying the field rate by the number of active lines in a field (blanked lines cannot be used) and again by the number of samples in a line. By careful choice of parameters it is possible to use either 525/60 or 625/50 video with a sampling rate of 44.1 kHz.

In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame, or 245 lines per field for samples. If three samples are stored per line, the sampling rate becomes 60×245×3=44.1 kHz.

In 50 Hz video, there are 37 lines of blanking, leaving 588 active lines per frame, or 294 per field, so the same sampling rate is given by 50×294×3=44.1 kHz.

The sampling rate of 44.1 kHz came to be that of the Compact Disc. Even though CD has no video circuitry, the equipment used to make CD masters was originally video based and determined the sampling rate.

For landlines to FM stereo broadcast transmitters having a 15 kHz audio bandwidth, the sampling rate of 32 kHz is more than adequate, and has been in use for some time in the United Kingdom and Japan. This frequency is also in use in the NICAM 728 stereo TV sound system. The professional sampling rate of 48 kHz was proposed as having a simple relationship to 32 kHz, being far enough above 40 kHz for variable-speed operation, and having a simple relationship with PAL video timing which would allow digital video recorders to store the convenient number of 960 audio samples per video field. This is the sampling rate used by all of the professional DVTR formats [9]. The field rate offset of NTSC does not easily relate to any of the above sampling rates, and requires special handling that will be discussed further in this book.

Although in a perfect world the adoption of a single sampling rate might have had virtues, for practical and economic reasons digital audio now has essentially three rates to support: 32 kHz for broadcast, 44.1 kHz for CD, and 48 kHz for professional use [10]. Variations of the DAT format will support all of these rates, although the 32 kHz version is uncommon.

Recently there have been suggestions that extremely high sampling rates such as 96 and even 192 kHz are necessary for audio. Repeatable experiments to verify that this is the case seem to be elusive. It seems unlikely that all the world’s psychoacoustic researchers would have so seriously underestimated the bandwidth of human hearing. It may simply be that the cost of electronic devices and storage media has fallen to the extent that there is little penalty in adopting such techniques, even if greater benefit would be gained by attending to areas in which progress would truly be beneficial, such as microphones and loudspeakers.

2.8.7 Choice of Video Sampling Rate

Component or colour difference signals are used primarily for post-production work where quality and flexibility are paramount. In a digital colour difference system, the analog input video will be converted to the digital domain and will generally remain there whilst being operated on in digital effects units. The finished work will then be returned to the analog domain for transmission. The number of conversions is relatively few.

In contrast, composite digital video recorders are designed to work in an existing analog environment and so the number of conversions a signal passes through will be greater. It follows that composite will require a higher sampling rate than component so that filters with a more gentle slope can be used to reduce the build-up of response ripples.

In colour difference working, the important requirement is for image manipulation in the digital domain. This is facilitated by a sampling rate that is a multiple of line rate because then there is a whole number of samples in a line and samples are always in the same position along the line and can form neat columns. A practical difficulty is that the line period of the 525 and 625 systems is slightly different. The problem was overcome by the use of a sampling clock which is an integer multiple of both line rates. For standard definition working, the rate of 13.5 MHz is sufficient for the requirements of sampling theory, yet allows a whole number of samples in both line standards with a common clock frequency.

The colour difference signals have only half the bandwidth of the luminance and so can be sampled at one-half the luminance rate, i.e. 6.75 MHz. Extended and high definition systems require proportionately higher rates.
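
The arithmetic is easily checked. In the 625/50 system the line rate is 15 625 Hz, and 13.5 MHz ÷ 15 625 Hz = 864 samples per line; in the 525/59.94 system the line rate is approximately 15 734.27 Hz, and 13.5 MHz ÷ 15 734.27 Hz = 858 samples per line. Both are whole numbers, which is the property required.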

In composite video, the most likely process to be undertaken in the digital domain is decoding to components. This is performed, for example, in video recorders running in slow motion. Separation of luminance and chroma in the digital domain is simpler if the sampling rate is a multiple of the subcarrier frequency. Whilst three times the frequency of subcarrier is adequate from a sampling theory standpoint, this does require relatively steep filters, so as a practical matter four times subcarrier is used.

In high definition systems, clearly the sampling rate must rise. Doubling the resolution requires twice as many lines and twice as many pixels in each line, so the sampling rate must be quadrupled. However, a further factor is the adoption of 16:9 aspect ratio instead of 4:3 so that the rate becomes yet higher as each line has more than twice as many pixels. In some cases 74.25 MHz is used. Eventually it became clear that the proliferation of frame rates and image sizes would make it impossible to interface. The solution for HD was to adopt what is essentially a fixed clock rate interface that can support a variety of actual data rates by using variable amounts of blanking. Thus the actual video sampling rate can be whatever is appropriate for the picture size and rate to give square pixels.

2.8.8 Quantizing

Quantizing is the process of expressing some infinitely variable quantity by discrete or stepped values. Quantizing turns up in a remarkable number of everyday guises. Figure 2.18 shows that an inclined ramp allows infinitely variable height, whereas on a stepladder only discrete height is possible. A stepladder quantizes height. When accountants round off sums of money to the nearest pound or dollar they are quantizing.

Figure 2.18 An analog parameter is continuous whereas a quantized parameter is restricted to certain values. Here the sloping side of a ramp can be used to obtain any height whereas a ladder only allows discrete heights.

In audio and video the values to be quantized are infinitely variable voltages from an analog source. Strict quantizing is a process restricted to the signal amplitude domain only. For the purpose of studying the quantizing of a single sample, time is assumed to stand still. This is achieved in practice either by the use of a track-hold circuit or the adoption of a quantizer technology that operates before the sampling stage.

Figure 2.19(a) shows that the process of quantizing divides the signal voltage range into quantizing intervals Q. In applications such as telephony these may be of differing size, but for digital audio and video the quantizing intervals are made as identical as possible. If this is done, the binary numbers which result are truly proportional to the original analog voltage, and the digital equivalents of mixing and gain changing can be performed by adding and multiplying sample values. If the quantizing intervals are unequal this cannot be done. When all quantizing intervals are the same, the term uniform quantizing is used. The term linear quantizing will be found, but this is, like military intelligence, a contradiction in terms.

Figure 2.19 Quantizing assigns discrete numbers to variable voltages. All voltages within the same quantizing interval are assigned the same number which causes the DAC to produce the voltage at the centre of the intervals shown by the dashed lines at (a). This is the characteristic of the mid-tread quantizer shown at (b). An alternative system is the mid-riser system shown at (c). Here 0 volts analog falls between two codes and there is no code for zero. Such quantizing cannot be used prior to signal processing because the number is no longer proportional to the voltage. Quantizing error cannot exceed 0.5Q as shown at (d).

The term LSB (Least Significant Bit) will also be found in place of quantizing interval in some treatments, but this is a poor term because quantizing is not always used to create binary values and because a bit can only have two values. In studying quantizing we wish to discuss values smaller than a quantizing interval, but a fraction of an LSB is a contradiction in terms.

Whatever the exact voltage of the input signal, the quantizer will determine the quantizing interval in which it lies. In what may be considered a separate step, the quantizing interval is then allocated a code value that is typically some form of binary number. The information sent is the number of the quantizing interval in which the input voltage lay. Exactly where that voltage lay within the interval is not conveyed, and this mechanism puts a limit on the accuracy of the quantizer. When the number of the quantizing interval is converted back to the analog domain, it will result in a voltage at the centre of the quantizing interval, as this minimizes the magnitude of the error between input and output. The number range is limited by the word length of the binary numbers used. In a 16-bit system, commonly used for audio, 65 536 different quantizing intervals exist, whereas video systems have typically used eight-bit words, giving 256 quantizing intervals.
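The ideas of interval numbering and centre-of-interval reconstruction can be captured in a few lines. The following Python fragment is a minimal sketch of a uniform mid-tread quantizer and an idealized DAC; the function names are invented for illustration.

```python
import numpy as np

Q = 0.1                                    # quantizing interval, volts

def quantize_mid_tread(v):
    """Assign each voltage the number of the interval in which it lies."""
    return np.round(v / Q).astype(int)     # 0 V sits mid-tread on code 0

def reconstruct(codes):
    """Idealized DAC: output the voltage at the centre of each interval."""
    return codes * Q

v = np.array([0.04, 0.06, -0.23, 0.51])
codes = quantize_mid_tread(v)
error = reconstruct(codes) - v
print(codes, error)                        # [0 1 -2 5]; every error <= 0.5Q
assert np.all(np.abs(error) <= 0.5 * Q + 1e-12)
```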

2.8.9 Quantizing Error

It is possible to draw a transfer function for such an ideal quantizer followed by an ideal DAC, and this is shown in Figure 2.19(b). A transfer function is simply a graph of the output with respect to the input. When the term linearity is used, this generally means the straightness of the transfer function. Linearity is a goal in audio and video, yet it will be seen that an ideal quantizer is anything but.

Figure 2.19(b) shows that the transfer function is somewhat like a staircase, and the voltage corresponding to audio muting or video blanking is half way up a quantizing interval, or on the centre of a tread. This is the so-called mid-tread quantizer universally used in audio and video. Figure 2.19(c) shows the alternative mid-riser transfer function that causes difficulty because it does not have a code value at muting/blanking level and as a result the code value is not proportional to the signal voltage.

Quantizing causes a voltage error in the sample given by the difference between the actual staircase transfer function and the ideal straight line. This is shown in Figure 2.19(d) to be a sawtooth function periodic in Q. The amplitude cannot exceed ±0.5Q unless the input is so large that clipping occurs.

In studying the transfer function it is better to avoid complicating matters with the aperture effect of the DAC. For this reason it is assumed here that output samples are of negligible duration. Then impulses from the DAC can be compared with the original analog waveform and the difference will be impulses representing the quantizing error waveform. As can be seen in Figure 2.20 the quantizing error waveform can be thought of as an unwanted signal that the quantizing process adds to the perfect original. As the transfer function is non-linear, ideal quantizing can cause distortion. As a result practical digital audio devices use non-ideal quantizers to achieve linearity. The quantizing error of an ideal quantizer is a complex function, and it has been researched in great depth11. It is not intended to go into such depth here. The characteristics of an ideal quantizer will only be pursued far enough to convince the reader that such a quantizer cannot be used.

Figure 2.20 At (a) an arbitrary signal is represented to finite accuracy by PAM needles whose peaks are at the centre of the quantizing intervals. The errors caused can be thought of as an unwanted signal (b) added to the original.

As the magnitude of the quantizing error is limited, the effect can be minimized by the use of a larger signal. This will require more quantizing intervals and more bits to express them. The number of quantizing intervals multiplied by their size gives the quantizing range of the convertor. A signal outside the range will be clipped. Clearly if clipping is avoided, the larger the signal the less will be the effect of the quantizing error.

Consider first the case where the input signal exercises the whole quantizing range and has a complex waveform. In audio this might be orchestral music; in video a bright, detailed contrasting scene. In these cases successive samples will have widely varying numerical values and the quantizing error on a given sample will be independent of that on others.

In this case the size of the quantizing error will be distributed with equal probability between the limits. Figure 2.21(a) shows the resultant uniform probability density. In this case the unwanted signal added by quantizing is an additive broadband noise uncorrelated with the signal, and it is appropriate in this case to call it quantizing noise. This is not quite the same as thermal noise which has a Gaussian probability shown in Figure 2.21(b). The subjective difference is slight. Treatments that then assume that quantizing error is always noise give results at variance with reality. Such approaches only work if the probability density of the quantizing error is uniform. Unfortunately at low levels, and particularly with pure or simple waveforms, this is simply not true.

Figure 2.21 (a) The amplitude of a quantizing error needle will be from −0.5Q to +0.5Q with equal probability. (b) White noise in analog circuits generally has Gaussian amplitude distribution.

At low levels, quantizing error ceases to be random, and becomes a function of the input waveform and the quantizing structure. Once an unwanted signal becomes a deterministic function of the wanted signal, it has to be classed as a distortion rather than a noise. We predicted a distortion because of the non-linearity or staircase nature of the transfer function. With a large signal, there are so many steps involved that we must stand well back, and a staircase with many steps appears to be a slope. With a small signal there are few steps and they can no longer be ignored.

The non-linearity of the transfer function results in distortion, which produces harmonics. Unfortunately these harmonics are generated after the anti-aliasing filter, and so any which exceed half the sampling rate will alias. Figure 2.22 shows how this results in anharmonic distortion in audio. These anharmonics result in spurious tones known as birdsinging. When the sampling rate is a multiple of the input frequency the result is harmonic distortion. This is shown in Figure 2.23. Where more than one frequency is present in the input, intermodulation distortion occurs, which is known as granulation.

Figure 2.22 Quantizing produces distortion after the anti-aliasing filter, thus the distortion products will fold back to produce anharmonics in the audio band. Here the fundamental of 15 kHz produces 2nd and 3rd harmonic distortion at 30 and 45 kHz. This results in aliased products at 40 − 30 = 10 kHz and 40 − 45 = (−)5 kHz.
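The folding described in the caption is simple arithmetic. The small helper below is hypothetical, written for illustration, and assumes the 40 kHz sampling rate of the figure.

```python
def alias(f, fs):
    """Fold a frequency f (Hz) into the baseband 0..fs/2."""
    f = f % fs
    return fs - f if f > fs / 2 else f

fs = 40_000                          # sampling rate assumed in Figure 2.22
for f in (15_000, 30_000, 45_000):   # fundamental, 2nd and 3rd harmonics
    print(f, "->", alias(f, fs))
# 15000 -> 15000, 30000 -> 10000 and 45000 -> 5000: the anharmonics
```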

Figure 2.23 Mathematically derived quantizing error waveform for a sine wave sampled at a multiple of itself. The numerous autocorrelations between quantizing errors show that there are harmonics of the signal in the error, and that the error is not random, but deterministic.

As the input signal is further reduced in level, it may remain within one quantizing interval. The output will be silent because the signal is now the quantizing error. In this condition, low-frequency signals such as air-conditioning rumble can shift the input in and out of a quantizing interval so that the quantizing distortion comes and goes, resulting in noise modulation.

In video, quantizing error results in visible contouring on low-key scenes or flat fields. Continuous change of brightness across the screen is replaced by a series of sudden steps between areas of constant brightness.

Needless to say any one of the above effects would prevent the use of an ideal quantizer for high quality work. There is little point in studying the adverse effects further as they can be eliminated completely in practical equipment by the use of dither.

2.8.10 Dither

At high signal level, quantizing error is effectively noise. As the level falls, the quantizing error of an ideal quantizer becomes more strongly correlated with the signal and the result is distortion. If the quantizing error can be decorrelated from the input in some way, the system can remain linear. Dither performs the job of decorrelation by making the action of the quantizer unpredictable.

The first documented use of dither was in picture coding12. In this system, the noise added prior to quantizing was subtracted after reconversion to analog. This is known as subtractive dither. Although subsequent subtraction has some slight advantages11, it suffers from practical drawbacks, since the original noise waveform must accompany the samples or must be synchronously recreated at the DAC. This is virtually impossible in a system where the signal may have been edited. Practical systems use non-subtractive dither where the dither signal is added prior to quantization and no subsequent attempt is made to remove it. The introduction of dither inevitably causes a slight reduction in the signal-to-noise ratio attainable, but this reduction is a small price to pay for the elimination of non-linearity. As linearity is an essential requirement for digital audio and video, the use of dither is equally essential and there is little point in deriving a signal-to-noise ratio for an undithered system as this has no relevance to real applications. Instead, a study of dither will be used to derive the dither amplitude necessary for linearity and freedom from artefacts, and the signal-to-noise ratio available should follow from that.

The ideal (noiseless) quantizer of Figure 2.19 has fixed quantizing intervals and must always produce the same quantizing error from the same signal. In Figure 2.24 it can be seen that an ideal quantizer can be dithered by linearly adding a controlled level of noise either to the input signal or to the reference voltage used to derive the quantizing intervals. There are several ways of considering how dither works, all of which are valid. The addition of dither means that successive samples effectively find the quantizing intervals in different places on the voltage scale. The quantizing error becomes a function of the dither, rather than just a function of the input signal. The quantizing error is not eliminated, but the subjectively unacceptable distortion is converted into broadband noise that is more benign.

Figure 2.24 Dither can be applied to a quantizer in one of two ways. At (a) the dither is linearly added to the analog input signal whereas at (b) it is added to the reference voltages of the quantizer.

That noise may also be frequency shaped to take account of the frequency shaping of the ear and eye responses.

An alternative way of looking at dither is to consider the situation where a low level input signal is changing slowly within a quantizing interval. Without dither, the same numerical code results, and the variations within the interval are lost. Dither has the effect of forcing the quantizer to switch between two or more states. The higher the voltage of the input signal within the interval, the more probable it becomes that the output code will take on a higher value. The lower the input voltage within the interval, the more probable it is that the output code will take the lower value. The dither has resulted in a form of duty cycle modulation, and the resolution of the system has been extended indefinitely instead of being limited by the size of the steps.
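The duty cycle modulation effect is easy to demonstrate numerically. The sketch below assumes rectangular-pdf dither of exactly one quantizing interval peak-to-peak, and shows the long-term average of many dithered conversions converging on a level part-way up an interval.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 1.0                           # quantizing interval
v = 0.3 * Q                       # a level 0.3 of the way up one interval

undithered = np.round(v / Q)      # always code 0: the 0.3 is lost

# Rectangular dither of +/-0.5Q makes the code flip between 0 and 1 with a
# probability set by the position of v within the interval.
dither = rng.uniform(-0.5, 0.5, size=1_000_000) * Q
codes = np.round((v + dither) / Q)
print(undithered, codes.mean())   # 0.0 versus approximately 0.3
```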

Dither can also be understood by considering the effect it has on the transfer function of the quantizer. This is normally a perfect staircase, but in the presence of dither it is smeared horizontally until, when the amplitude has increased sufficiently, the average transfer function becomes straight.

The characteristics of the noise used are rather important for optimal performance, although many suboptimal but nevertheless effective systems are in use. The main parameters of interest are the peak-to-peak amplitude, and the probability distribution of the amplitude. Triangular probability works best and this can be obtained by summing the output of two uniform probability processes.

The use of dither invalidates the conventional calculations of signal-to-noise ratio available for a given word length. This is of little consequence as the rule of thumb that multiplying the number of bits in the word length by 6 dB gives the S/N ratio will be close enough for all practical purposes.

The technique can also be used to convert signals quantized with more intervals to a lower number of intervals.

It has only been possible to introduce the principles of conversion of audio and video signals here. For more details of the operation of convertors the reader is referred elsewhere13,14.

2.9  Binary Codes for Audio

For audio use, the prime purpose of binary numbers is to express the values of the samples representing the original analog sound-pressure waveform. There will be a fixed number of bits in the sample, which determines the number range. In a 16-bit code there are 65 536 different numbers. Each number represents a different analog signal voltage, and care must be taken during conversion to ensure that the signal does not go outside the convertor range, or it will be clipped. In Figure 2.25, it will be seen that in a unipolar system, the number range goes from 0000 hex, which represents the largest negative voltage, through 7FFF hex, which represents the smallest negative voltage, through 8000 hex, which represents the smallest positive voltage, to FFFF hex, which represents the largest positive voltage. Effectively the number range of the convertor has been shifted so that positive and negative voltages in a bipolar signal may be expressed with a positive-only binary number. This approach is called offset binary, and is perfectly acceptable where the signal has been digitized only for recording or transmission from one place to another, after which it will be converted back to analog. Under these conditions it is not necessary for the quantizing steps to be uniform, provided both ADC and DAC are constructed to the same standard. In practice, it is the requirements of signal processing in the digital domain that make both non-uniform quantizing and offset binary unsuitable.

Figure 2.25 Offset binary coding is simple but causes problems in digital audio processing. It is seldom used.

Figure 2.26 shows that an audio signal voltage is referred to midrange. The level of the signal is measured by how far the waveform deviates from midrange, and attenuation, gain and mixing all take place around midrange. It is necessary to add sample values from two or more different sources to perform the mixing function, and adding circuits assume that all bits represent the same quantizing interval so that the sum of two sample values will represent the sum of the two original analog voltages. In non-uniform quantizing this is not the case, and such signals cannot readily be processed. If two offset binary sample streams are added together in an attempt to perform digital mixing, the result will be an offset that may lead to an overflow. Similarly, if an attempt is made to attenuate by, say, 6 dB by dividing all of the sample values by two, Figure 2.27 shows that a further offset results. The problem is that offset binary is referred to one end of the range. What is needed is a numbering system that operates symmetrically about the centre of the range.

Figure 2.26 Attenuation of an audio signal takes place with respect to midrange.

Figure 2.27 If two pure binary data streams are added to simulate mixing, offset or overflow will result.
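The offset problem described above can be seen in a two-line calculation. The following sketch uses illustrative values and a 16-bit word.

```python
OFFSET = 0x8000                        # midrange of a 16-bit offset binary code

def to_offset(v):
    """Encode a bipolar sample value as offset binary."""
    return v + OFFSET

a, b = 1000, -3000                     # two bipolar audio samples to be mixed
sum_of_codes = to_offset(a) + to_offset(b)
correct_code = to_offset(a + b)
print(sum_of_codes - correct_code)     # 32768: a spurious half-scale offset
```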

In the two’s complement system, which has this property, the upper half of the pure binary number range has been defined to represent negative quantities. If a pure binary counter is constantly incremented and allowed to overflow, it will produce all the numbers in the range permitted by the number of available bits, and these are shown for a four-bit example drawn around the circle in Figure 2.28. In two’s complement, however, the number range this represents does not start at zero, but starts on the opposite side of the circle. Zero is midrange, and all numbers with the most significant bit set are considered negative. This system allows two sample values to be added, where the result is referred to the system midrange; this is analogous to adding analog signals in an operational amplifier. A further asset of two’s complement notation is that binary subtraction can be performed using only adding logic. The two’s complement is added to perform a subtraction. This permits a significant reduction of hardware complexity, since only carry logic is necessary and no borrow mechanism need be supported.

Figure 2.28 In this example of a four-bit two’s complement code, the number range is from −8 to +7. Note that the MSB determines polarity.

For these reasons, two’s complement notation is in virtually universal use in digital audio processing, and is accordingly adopted by all the major digital recording formats, and consequently it is specified for the AES/EBU digital audio interface.

Fortunately the process of conversion to two’s complement is simple. The signed binary number to be expressed as a negative number is written down with leading zeros if necessary to occupy the word length of the system. All bits are then inverted, to form the one’s complement, and one is added. To return to signed binary, if the most significant bit of the two’s complement number is set to zero, no action needs to be taken. If the most significant bit is one, the sign is negative, all bits are inverted, and one is added. Figure 2.29 shows some examples of conversion to and from two’s complement, and illustrates how the addition of two’s complement samples simulates the mixing process.

Figure 2.29 (a) Two’s complement conversion from binary. (b) Binary conversion from two’s complement. (c) Some examples of two’s complement arithmetic. (d) Using two’s complement arithmetic, single values from two waveforms are added together with respect to midrange to give a correct mixing function.
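A minimal sketch of these rules, using the four-bit example of Figure 2.28, might look like the following; the helper names are invented for illustration. Note how the subtraction-by-addition property means that a simple masked add performs the mix.

```python
BITS = 4
MASK = (1 << BITS) - 1                     # 0b1111

def to_twos(v):
    """Express a signed value as a four-bit two's complement pattern."""
    return v & MASK

def from_twos(code):
    """Recover the signed value; a set MSB marks a negative number."""
    return code - (1 << BITS) if code & (1 << (BITS - 1)) else code

a, b = 3, -5
mixed = (to_twos(a) + to_twos(b)) & MASK   # adding logic only; carry discarded
print(bin(to_twos(b)), from_twos(mixed))   # 0b1011 -2
```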

Reference to Figure 2.28 will show that inverting the MSB results in a jump to the diametrically opposite code that is the equivalent of an offset of half full scale. Thus it is possible to obtain two’s complement coded data by adding an offset of half full scale to the analog input of a convertor and then inverting the MSB as shown in Figure 2.30.

Figure 2.30 A two’s complement ADC. At (a) an analog offset voltage equal to one-half the quantizing range is added to the bipolar analog signal in order to make it unipolar as at (b). The ADC produces positive only numbers at (c), but the MSB is then inverted at (d) to give a two’s complement output.

2.10 Binary Codes for Video

There are a wide variety of waveform types encountered in video, and all of these can be successfully digitized. All that is needed is for the quantizing range to be a little greater than the useful voltage range of the waveform.

Composite video contains an embedded subcarrier, which has meaningful levels above white and below black. The quantizing range has to be extended to embrace the total possible excursion of luminance plus subcarrier. The resultant range is so close to the overall range of the signal that, in composite working, the whole signal, including syncs, is made to fit into the quantizing range as Figure 2.31 (b) shows. This is particularly useful in PAL because the sampling points are locked to subcarrier, and have a complex relationship to sync. Clearly black level no longer corresponds to digital zero. In eight-bit PAL it is 64 decimal. It is as if a constant of 64 had been added to every sample, which gives rise to the term offset binary.

Figure 2.31 The unipolar quantizing range of an eight-bit pure binary system is shown at (a). The analog input must be shifted to fit into the quantizing range, as shown for PAL at (b). In component, sync pulses are not digitized, so the quantizing intervals can be smaller as at (c). An offset of half scale is used for colour difference signals (d).

In the luminance component signal, the useful waveform is unipolar and can only vary between black and white, since the syncs carry no information that cannot be recreated. Only the active line is coded and the quantizing range is optimized to suit the gamut of unblanked luminance as shown in Figure 2.31 (c). The same approach is used in HD component signals.

Chroma and colour difference signals are bipolar and can have values above and below blanking. In this they have the same characteristics as audio and require to be processed using two’s complement coding as described in the previous section. However, two’s complement is not used in standard digital video interfaces as the codes of all ones and all zeros are reserved for synchronizing and in two’s complement these would appear in the centre of the quantizing range. Instead digital video interfaces use an offset of half full scale to shift blanking level to the centre of the scale as shown in Figure 2.31 (d). All ones and all zeros are then at the ends of the scale. Such an offset binary signal can easily be converted to two’s complement by inverting the MSB as shown in Figure 2.30. If the quantizing range were set to exactly the video signal range or gamut, a slightly excessive gain somewhere in the analog input path would cause clipping. In practice the quantizing range will be a little greater than the nominal signal range and as a result even the unipolar luminance signal has an offset blanking level. A video interface format must specify the numerical values that result from reference analog levels so that all transmissions will result in identical signals at all receiving devices.
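As an illustration of such a specification, the sketch below uses the familiar eight-bit component levels commonly associated with standard-definition component interfaces: black at 16, white at 235, and colour difference signals offset to 128 with an excursion of ±112. These figures are quoted here for illustration; the normative values belong to the interface standard itself.

```python
import numpy as np

BLACK, WHITE = 16, 235         # luminance: blanking/black is offset from zero
CHROMA_ZERO = 128              # colour difference: offset binary midrange

def code_luma(y):              # y: analog luminance, 0.0 = black, 1.0 = white
    return np.clip(np.round(BLACK + y * (WHITE - BLACK)), 1, 254).astype(int)

def code_chroma(c):            # c: analog colour difference, -0.5 .. +0.5
    return np.clip(np.round(CHROMA_ZERO + c * 224), 1, 254).astype(int)

print(code_luma(np.array([0.0, 0.5, 1.0])))     # [ 16 126 235]
print(code_chroma(np.array([-0.5, 0.0, 0.5])))  # [ 16 128 240]
# All ones (255) and all zeros (0) never occur: they are reserved for syncs.
```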

2.11  Requantizing and Digital Dither

The advanced ADC technology now available allows 20 or more bits of resolution to be obtained in audio. The situation then arises that an existing 16-bit device such as a digital recorder needs to be connected to the output of an ADC having greater word length. In a similar fashion digital video equipment has recently moved up from eight- to ten-bit working with the result that ten-bit signals are presented to eight-bit devices. In both cases the words need to be shortened in some way.

When a sample value is attenuated, the extra low-order bits that come into existence below the radix point preserve the resolution of the signal. The dither in the least significant bit(s) linearizes the system. The same word extension will occur in any process involving multiplication, such as mixing or digital filtering. It will subsequently be necessary to shorten the word length. Clearly the high-order bits cannot be discarded in two’s complement as this would cause clipping of positive half cycles and a level shift on negative half cycles due to the loss of the sign bit. Low-order bits must be removed instead. Even if the original conversion was correctly dithered, the random element in the low-order bits will now be some way below the end of the intended word. If the word is simply truncated by discarding the unwanted low-order bits or rounded to the nearest integer the linearizing effect of the original dither will be lost. In audio the result will be low-level distortion; in video the result will be contouring.

Shortening the word length of a sample reduces the number of quantizing intervals available without changing the signal amplitude. As Figure 2.32 shows, the quantizing intervals become larger and the original signal is requantized with the new interval structure. This will introduce requantizing distortion having the same characteristics as quantizing distortion in an ADC. It is then obvious that when shortening the word length of a 20-bit convertor to 16 bits, the four low-order bits must be removed in a way that displays the same overall quantizing structure as if the original convertor had been only of 16-bit word length. It will be seen from Figure 2.32 that truncation cannot be used because it does not meet the above requirement but results in signal dependent offsets because it always rounds in the same direction. Proper numerical rounding is essential in audio and video applications. Rounding in two’s complement is a little more complex than in pure binary. Requantizing by numerical rounding accurately simulates analog quantizing to the new interval size. Unfortunately the 20-bit convertor will have a dither amplitude appropriate to quantizing intervals one-sixteenth the size of a 16-bit unit and the result will be highly non-linear.

Figure 2.32 Shortening the word length of a sample reduces the number of codes which can describe the voltage of the waveform. This makes the quantizing steps bigger, hence the term requantizing. It can be seen that simple truncation or omission of the bits does not give analogous behaviour. Rounding is necessary to give the same result as if the larger steps had been used in the original conversion.

In practice, the word length of samples must be shortened in such a way that the requantizing error is converted to noise rather than distortion. One technique that meets this requirement is to use digital dithering15 prior to rounding. This is directly equivalent to the analog dithering in an ADC.

Digital dither is a pseudo-random sequence of numbers. If it is required to simulate the analog dither signal of Figure 2.24, then it is obvious that the noise must be bipolar so that it can have an average voltage of zero. Two’s complement coding must be used for the dither values to obtain this characteristic.

Figure 2.33 shows a simple digital dithering system (i.e. one without noise shaping) for shortening sample word length. The output of a two’s complement pseudo-random sequence generator of appropriate word length is added to input samples prior to rounding. The most significant of the bits to be discarded is examined in order to determine whether the bits to be removed sum to more or less than half a quantizing interval. The dithered sample is either rounded down, that is, the unwanted bits are simply discarded, or rounded up, that is the unwanted bits are discarded but 1 is added to the value of the new short word. The rounding process is no longer deterministic because of the added dither that provides a linearizing random component.

Figure 2.33 In a simple digital dithering system, two’s complement values from a random number generator are added to low-order bits of the input. The dithered values are then rounded up or down according to the value of the bits to be removed. The dither linearizes the requantizing.
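A minimal model of the system of Figure 2.33 is given below. The triangular-pdf dither is formed, as discussed in the next paragraph, by summing two uniform sequences; the scaling and names are assumptions made for illustration, and noise shaping is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def requantize(samples, bits_removed):
    """Shorten word length using triangular-pdf digital dither then rounding."""
    step = 1 << bits_removed                    # the new, larger interval
    # Summing two uniform sequences gives the triangular pdf.
    dither = (rng.integers(0, step, samples.shape)
              + rng.integers(0, step, samples.shape) - step)
    dithered = samples + dither
    # Round to the nearest multiple of the new interval; low bits vanish.
    return ((dithered + step // 2) >> bits_removed) << bits_removed

x = (30_000 * np.sin(np.linspace(0, 2 * np.pi, 480))).astype(int)  # 16-bit words
y = requantize(x, 8)                  # keep eight bits of resolution
print(np.abs(y - x).max())            # the error behaves as noise of order Q
```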

If this process is compared with that of Figure 2.24 it will be seen that the principles of analog and digital dither are identical; the processes simply take place in different domains using numbers that are rounded or voltages that are quantized as appropriate. In fact quantization of an analog dithered waveform is identical to the hypothetical case of rounding after bipolar digital dither where the number of bits to be removed is infinite, and remains identical for practical purposes when as few as eight bits are to be removed. The probability density of the pseudo-random sequence is important. Vanderkooy and Lipshitz15 found that uniform probability density produced noise modulation, in which the amplitude of the random component varies as a function of the amplitude of the samples. A triangular probability density function obtained by adding together two pseudo-random sequences eliminated the noise modulation to yield a signal-independent white-noise component in the least significant bit. It is vital that such steps are taken when sample word length is to be reduced. More recently, sequences yielding signal-independent noise with a weighted frequency component have been used to improve the subjective effects of the noise component leading to a perceived reduction in noise.

2.12  Introduction to Compression

Compression, bit rate reduction and data reduction are all terms meaning basically the same thing in this context. In essence the same (or nearly the same) information is carried using a smaller quantity or rate of data. In transmission systems, compression allows a reduction in bandwidth and will generally result in a reduction in cost to make possible some process that would be impracticable without it. If a given bandwidth is available to an uncompressed signal, compression allows faster than real-time transmission in the same bandwidth. If a given bandwidth is available, compression allows a better quality signal in the same bandwidth.

Compression is summarized in Figure 2.34. It will be seen in (a) that the data rate is reduced at source by the compressor. The compressed data is then passed through a communication channel and returned to the original rate by the expander. The ratio between the source data rate and the channel data rate is called the compression factor. The term coding gain is also used. Sometimes a compressor and expander in series are referred to as a compander. The compressor may equally well be referred to as a coder and the expander a decoder in which case the tandem pair may be called a codec.

Figure 2.34 In (a) a compression system consists of compressor or coder, a transmission channel and a matching expander or decoder. The combination of coder and decoder is known as a codec. (b) MPEG is asymmetrical since the encoder is much more complex than the decoder.

In audio and video compression, where the encoder is more complex than the decoder the system is said to be asymmetrical as in (b). The encoder needs to be algorithmic or adaptive whereas the decoder is ‘dumb’ and only carries out actions specified by the incoming bitstream. This is advantageous in applications such as broadcasting where the number of expensive complex encoders is small but the number of simple inexpensive decoders is large. In point-to-point applications the advantage of asymmetrical coding is not so great.

Although there are many different coding techniques, all of them fall into one of two categories: lossless and lossy coding. In lossless coding, the data from the expander is identical bit for bit with the original source data. Lossless coding is generally restricted to compression factors of around 2:1. A lossless coder cannot guarantee a particular compression factor and the channel used with it must be able to function with the variable output data rate.

In lossy coding data from the expander is not identical bit for bit with the source data and as a result comparison of the input with the output is bound to reveal a difference. Lossy codecs are not suitable for computer data, but are used in MPEG16 as they allow greater compression factors than lossless codecs. Successful lossy codecs are those in which the errors are arranged so that a human viewer or listener finds them subjectively difficult to detect. Thus lossy codecs must be based on an understanding of psycho-acoustic and psycho-visual perception and are often called perceptive codes.

In perceptive coding, the greater the compression factor required, the more accurately must the human senses be modelled. Perceptive coders can be forced to operate at a fixed compression factor. This is convenient for practical transmission applications where a fixed data rate is easier to handle than a variable rate. Source data that results in poor compression factors on a given codec is described as difficult. The result of a fixed compression factor is that the subjective quality can vary with the ‘difficulty’ of the input material. Perceptive codecs should not be concatenated indiscriminately especially if they use different algorithms.

Although the adoption of digital techniques is recent, compression itself is as old as television. Figure 2.35 shows some of the compression techniques used in traditional television systems.

Figure 2.35 Compression is as old as television. (a) Interlace is a primitive way of halving the bandwidth. (b) Colour difference working invisibly reduces colour resolution. (c) Composite video transmits colour in the same bandwidth as monochrome.

One of the oldest techniques is interlace, which has been used in analog television from the very beginning as a primitive way of reducing bandwidth. Interlace is not without its problems, particularly in motion rendering. MPEG-2 supports interlace simply because legacy interlaced signals exist and there is a requirement to compress them. This should not be taken to mean that it is a good idea.

The generation of colour difference signals from RGB in video represents an application of perceptive coding. The human visual system (HVS) sees no change in quality although the bandwidth of the colour difference signals is reduced. This is because human perception of detail in colour changes is much less than in brightness changes. This approach is sensibly retained in MPEG.

Composite video systems such as PAL, NTSC and SECAM are all analog compression schemes that embed a subcarrier in the luminance signal so that colour pictures are available in the same bandwidth as monochrome. In comparison with a progressive scan RGB picture, interlaced composite video has a compression factor of 6:1.

In a sense MPEG-2 can be considered to be a modern digital equivalent of analog composite video as it has most of the same attributes. For example, the eight-field sequence of PAL subcarrier that makes editing difficult has its equivalent in the GOP (group of pictures) of MPEG.

In a PCM digital system the bit rate is the product of the sampling rate and the number of bits in each sample and this is generally constant. Nevertheless the information rate of a real signal varies. In all real signals, part of the signal is obvious from what has gone before or what may come later and a suitable receiver can predict that part so that only the true information actually has to be sent. If the characteristics of a predicting receiver are known, the transmitter can omit parts of the message in the knowledge that the receiver has the ability to recreate it. Thus all encoders must contain a model of the decoder.

The difference between the information rate and the overall bit rate is known as the redundancy. Compression systems are designed to eliminate as much of that redundancy as practicable or perhaps affordable. One way in which this can be done is to exploit statistical predictability in signals. The information content or entropy of a sample is a function of how different it is from the predicted value. Most signals have some degree of predictability. A sine wave is highly predictable, because all cycles look the same. According to Shannon’s theory, any signal that is totally predictable carries no information. In the case of the sine wave this is clear because it represents a single frequency and so has no bandwidth.

At the opposite extreme a signal such as noise is completely unpredictable and as a result all codecs find noise difficult. There are two consequences of this characteristic. First, a codec designed using the statistics of real material should not be tested with random noise because it is not a representative test. Second, a codec that performs well with clean source material may perform badly with source material containing superimposed noise. Most practical compression units require some form of preprocessing before the compression stage proper and appropriate noise reduction should be incorporated into the pre-processing if noisy signals are anticipated. It will also be necessary to restrict the degree of compression applied to noisy signals.

All real signals fall mid-way between the extremes of total predictability and total unpredictability or noisiness. If the bandwidth (set by the sampling rate) and the dynamic range (set by the word length) of the transmission system are used to delineate an area, this sets a limit on the information capacity of the system. Figure 2.36(a) shows that most real signals only occupy part of that area. The signal may not contain all frequencies, or it may not have full dynamics at certain frequencies.

Figure 2.36 (a) A perfect coder removes only the redundancy from the input signal and results in subjectively lossless coding. If the remaining entropy is beyond the capacity of the channel some of it must be lost and the codec will then be lossy. An imperfect coder will also be lossy as it fails to keep all entropy. (b) As the compression factor rises, the complexity must also rise to maintain quality. (c) High compression factors also tend to increase latency or delay through the system.

Entropy can be thought of as a measure of the actual area occupied by the signal. This is the area that must be transmitted if there are to be no subjective differences or artefacts in the received signal. The remaining area is called the redundancy because it adds nothing to the information conveyed. Thus an ideal coder could be imagined which miraculously sorts out the entropy from the redundancy and only sends the former. An ideal decoder would then recreate the original impression of the information quite perfectly.

As the ideal is approached, the coder complexity and the latency or delay both rise. Figure 2.36(b) shows how complexity increases with compression factor and (c) shows how increasing the codec latency can improve the compression factor. Obviously we would have to provide a channel that could accept whatever entropy the coder extracts in order to have transparent quality. As a result, moderate coding gain that only removes redundancy need not cause artefacts and results in systems described as subjectively lossless.

If the channel capacity is not sufficient for that, then the coder will have to discard some of the entropy and with it useful information. Larger coding gain that removes some of the entropy must result in artefacts. It will also be seen from Figure 2.36 that an imperfect coder will fail to separate the redundancy and may discard entropy instead, resulting in artefacts at a suboptimal compression factor.

A single variable-rate transmission is unrealistic in broadcasting where fixed channel allocations exist. The variable-rate requirement can be met by combining several compressed channels into one constant rate transmission in a way that flexibly allocates data rate between the channels. Provided the material is unrelated, the probability of all channels reaching peak entropy at once is very small and so those channels that are at one instant passing easy material will free up transmission capacity for those channels that are handling difficult material. This is the principle of statistical multiplexing.

Lossless codes are less common for audio and video coding where perceptive codes are permissible. The perceptive codes often obtain a coding gain by shortening the word length of the data representing the signal waveform. This must increase the noise level and the trick is to ensure that the resultant noise is placed at frequencies where human senses are least able to perceive it. As a result although the received signal is measurably different from the source data, it can appear the same to the human listener or viewer at moderate compression factors. As these codes rely on the characteristics of human sight and hearing, they can only fully be tested subjectively.

The compression factor of such codes can be set at will by choosing the bit rate of the compressed data. Whilst mild compression will be undetectable, with greater compression factors, artefacts become noticeable. Figure 2.36 shows this to be inevitable from entropy considerations.

2.13  Introduction to Audio Compression

The human auditory system (HAS) is complex and at various times works in the frequency domain to analyse timbre or in the time domain to localize sound sources. In practice the precision of the HAS is finite in both domains and if this precision is known, it is possible to render signals less precise than their original condition in ways that still seem fully precise to the ear. In practice few audio compression systems are used in this way. Instead lower bit rates are used and the coder attempts to minimize the audible damage.

There are a number of coding tools available, but none of these is appropriate for all circumstances. Consequently practical coders will use combinations of different tools, usually selecting the most appropriate for the type of input.

The simplest coding tool is companding: a digital parallel of the noise reducers used in analog tape recording. Figure 2.37(a) shows that in companding systems the input signal level is monitored. Whenever the input level falls below maximum, it is amplified at the coder. The gain that was applied at the coder is added to the data stream so that the decoder can apply an equal attenuation. The advantage of companding is that the signal is kept as far away from the noise floor as possible. In analog noise reduction this is used to maximize the SNR of a tape recorder, whereas in digital compression it is used to keep the signal level as far as possible above the distortion introduced by various coding steps.

Figure 2.37 Digital companding. In (a) the encoder amplifies the input to maximum level and the decoder attenuates by the same amount. (b) In a companded system, the signal is kept as far as possible above the noise caused by shortening the sample word length.
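A toy model of (a) can restrict the gain to whole 6 dB steps, since each 6 dB corresponds to one bit. The sketch below is illustrative only; real coders use more refined gain coding.

```python
import numpy as np

WORD_BITS = 8

def compand_block(block):
    """Coder: amplify a low-level block to near full scale in 6 dB steps."""
    peak, shift = np.abs(block).max(), 0
    while peak and peak < 0.5:             # each doubling is 6 dB, one bit
        peak, shift = peak * 2, shift + 1
    codes = np.round(block * (1 << shift) * (2 ** (WORD_BITS - 1) - 1))
    return shift, codes.astype(int)        # the gain travels with the data

def expand_block(shift, codes):
    """Decoder: undo the coder gain by an equal attenuation."""
    return codes / (2 ** (WORD_BITS - 1) - 1) / (1 << shift)

quiet = 0.01 * np.sin(np.linspace(0, 2 * np.pi, 32))   # low-level input
shift, codes = compand_block(quiet)
out = expand_block(shift, codes)
print(shift, np.abs(out - quiet).max())    # gain of 2^6; error stays small
```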

One common way of obtaining coding gain is to shorten the word length of samples so that fewer bits need to be transmitted. Figure 2.37(b) shows that when this is done, the distortion will rise by 6 dB for every bit removed. This is because removing a bit halves the number of quantizing intervals which then must be twice as large, doubling the error amplitude.

Clearly if this step follows the compander of (a), the audibility of the distortion will be minimized. As an alternative to shortening the word length, the uniform quantized PCM signal can be converted to a non-uniform format. In non-uniform coding, shown at (c), the size of the quantizing step rises with the magnitude of the sample so that the distortion level is greater when higher levels exist.

In sub-band coding, the audio spectrum is split into many different frequency bands. Once this has been done, each band can be individually processed. In real audio signals many bands will contain lower-level signals than the loudest one. Individual companding of each band will be more effective than broadband companding. Sub-band coding also allows the level of distortion products to be raised selectively so that distortion is created only at frequencies where spectral masking will be effective.

Transform coding is an extreme case of sub-band coding in which the sub-bands have become so narrow that they can be described by one coefficient.

Prediction is a coding tool in which the coder attempts to predict or anticipate the value of a future parameter from those already known to both encoder and decoder. It can be used in the time domain or the frequency domain. When used in the time domain, a predictor attempts to predict the value of the next audio sample. When used in the frequency domain, the predictor attempts to predict the value of the next frequency coefficient from those already sent.

The prediction will be subtracted from the actual value to obtain the prediction error, also known as a residual, which is transmitted. The decoder also contains a predictor that runs from the same signal history as the predictor in the encoder and will thus make the same prediction. By adding the residual, the decoder’s predictor will obtain the correct sample value. Provided the prediction error is transmitted intact, it will be clear that predictive coding is lossless.
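The lossless property is easily verified with the simplest possible time-domain predictor, the previous sample. All names below are illustrative.

```python
import numpy as np

def encode_predictive(x):
    """Transmit only the prediction error (residual)."""
    prediction = np.concatenate(([0], x[:-1]))   # previous-sample predictor
    return x - prediction

def decode_predictive(residual):
    """Run the same predictor from the same history and add the residual."""
    x, prev = np.zeros_like(residual), 0
    for i, e in enumerate(residual):
        x[i] = prev + e                          # prediction plus error
        prev = x[i]
    return x

x = np.round(100 * np.sin(np.arange(64) * 0.2)).astype(int)
residual = encode_predictive(x)
assert np.array_equal(decode_predictive(residual), x)   # bit-for-bit lossless
print(np.abs(x).max(), np.abs(residual).max())   # residual needs fewer bits
```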

In MPEG Layer 1 audio coding, the input is split into 32 bands and individually companded in each. A Layer 1 block is thus quite simple, containing 32 gain factors and 32 word length codes to allow deserialization of the 32 sets of variable length samples.

In MPEG Layer 2 coding, the same number of sub-bands is used, but further coding gain is possible because redundancy in the gain factors is exploited, allowing the same parameter to be used for several blocks.

In MPEG Layer 3 coding, a transform is performed that decomposes the spectrum into 192 or 576 coefficients, depending on the transient content of the signal. These coefficients are then subject to non-uniform quantizing followed by a lossless mathematical packing algorithm.

The MPEG AAC (advanced audio coding) algorithm represents an improvement over the earlier codecs as it uses prediction that can adaptively switch between time and frequency domain to give better quality for a given bit rate.

2.14  Introduction to Video Compression

Video signals exist in four dimensions: these are the attributes of the sample, the horizontal and vertical spatial axes and the time axis. Compression can be applied in any or all of those four dimensions. MPEG-2 assumes eight-bit colour difference signals as the input, requiring rounding if the source is ten bit. The sampling rate of the colour signals is less than that of the luminance. This is done by a down-sampling of the colour samples horizontally and generally vertically as well. Essentially an MPEG-2 system has three parallel simultaneous channels, one for luminance and two for colour difference, which after coding are multiplexed into a single bitstream.
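A crude model of the colour down-sampling is to average each 2 × 2 block of colour difference samples, which quarters their number; this is a sketch only, since practical systems use proper decimation filters.

```python
import numpy as np

def downsample_chroma(cb):
    """Average 2 x 2 blocks of a colour difference plane (4:2:0-style)."""
    h, w = cb.shape
    return cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.arange(16, dtype=float).reshape(4, 4)   # a tiny colour difference plane
print(downsample_chroma(cb).shape)              # (2, 2): one quarter the samples
```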

Figure 2.38(a) shows that spatial redundancy is redundancy within a single image, for example repeated pixel values in a large area of blue sky. Temporal redundancy (b) exists between successive images.

Figure 2.38 (a) Spatial or intra-coding works on individual images. (b) Temporal or inter-coding works on successive images. (c) In MPEG inter-coding is used to create difference images. These are then compressed spatially.

Where temporal compression is used, the current picture is not sent in its entirety; instead the difference between the current picture and the previous picture is sent. The decoder already has the previous picture, and so it can add the difference to make the current picture. A difference picture is created by subtracting every pixel in one picture from the corresponding pixel in another picture.

A difference picture is an image of a kind, although not a viewable one, and so should contain some kind of spatial redundancy. Figure 2.38(c) shows that MPEG-2 takes advantage of both forms of redundancy. Picture differences are spatially compressed prior to transmission. At the decoder the spatial compression is decoded to recreate the difference picture, then this difference picture is added to the previous picture to complete the decoding process.

Whenever objects move they will be in a different place in successive pictures. This will result in large amounts of difference data. MPEG-2 overcomes the problem using motion compensation. The encoder contains a motion estimator that measures the direction and distance of motion between pictures and outputs these as vectors that are sent to the decoder. When the decoder receives the vectors it uses them to shift data in a previous picture to resemble the current picture more closely. Effectively the vectors are describing the optic flow axis of some moving screen area, along which axis the image is highly redundant. Vectors are bipolar codes that determine the amount of horizontal and vertical shift required.

In real images, moving objects do not necessarily maintain their appearance as they move. For example, objects may turn, move into shade or light, or move behind other objects. Consequently motion compensation can never be ideal and it is still necessary to send a picture difference to make up for any shortcomings in the motion compensation.

Figure 2.39 shows how this works. In addition to the motion-encoding system, the coder also contains a motion decoder. When the encoder outputs motion vectors, it also uses them locally in the same way that a real decoder will, and is able to produce a predicted picture based solely on the previous picture shifted by motion vectors. This is then subtracted from the actual current picture to produce a prediction error or residual which is an image of a kind that can be spatially compressed.

Figure 2.39 A motion-compensated compression system. The coder calculates motion vectors which are transmitted as well as being used locally to create a predicted picture. The difference between the predicted picture and the actual picture is transmitted as a prediction error.
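The loop can be modelled in miniature. The sketch below assumes a single whole-picture motion vector and a wrap-around shift, which is far simpler than the block-based estimation a real encoder performs.

```python
import numpy as np

def mc_predict(previous, vector):
    """Shift the previous picture by a motion vector (dy, dx)."""
    return np.roll(previous, shift=vector, axis=(0, 1))

rng = np.random.default_rng(2)
previous = rng.integers(0, 256, (8, 8))
current = np.roll(previous, shift=(1, 2), axis=(0, 1))  # the whole frame moved

vector = (1, 2)                         # what a motion estimator would report
residual = current - mc_predict(previous, vector)       # prediction error

decoded = mc_predict(previous, vector) + residual       # the decoder's job
assert np.array_equal(decoded, current)
print(np.abs(residual).max())           # 0 here: the motion fully compensated
```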

The decoder takes the previous picture, shifts it with the vectors to recreate the predicted picture and then decodes and adds the prediction error to produce the actual picture. Picture data sent as vectors plus prediction error are said to be P coded.

The simple prediction system of Figure 2.39 is of limited use as in the case of a transmission error, every subsequent picture would be affected. Channel switching in a television set would also be impossible. In practical systems a modification is required. The approach used in MPEG is that periodically some absolute picture data is transmitted in place of difference data.

This absolute picture data, known as I or intra pictures, is interleaved with pictures which are created using difference data, known as P or predicted pictures. The I pictures require a large amount of data, whereas the P pictures require fewer data. As a result the instantaneous data rate varies dramatically and buffering has to be used to allow a constant transmission rate.

The I picture and all the P pictures prior to the next I picture are called a group of pictures (GOP). For a high compression factor, a large number of P pictures should be present between I pictures, making a long GOP. However, a long GOP delays recovery from a transmission error. The compressed bitstream can only be edited at I pictures.

Bi-directional coding is shown in Figure 2.40. Where moving objects reveal a background this is completely unknown in previous pictures and forward prediction fails. However, more of the background is visible in later pictures. In the centre of the diagram, a moving object has revealed some background. The previous picture can contribute nothing, whereas the next picture contains all that is required.

Figure 2.40 In bi-directional coding, a number of B pictures can be inserted between periodic forward predicted pictures. See text.

Bi-directional coding uses a combination of motion compensation and the addition of a prediction error. This can be done by forward prediction from a previous picture or backward prediction from a subsequent picture. It is also possible to use an average of both forward and backward prediction. On noisy material this may result in some reduction in bit rate. The technique is also a useful way of portraying a dissolve.

Typically two B pictures are inserted between P pictures or between I and P pictures. As can be seen, B pictures are never predicted from one another, only from I or P pictures. A typical GOP for broadcasting purposes might have the structure IBBPBBPBBPBB. Note that the last B pictures in the GOP require the I picture in the next GOP for decoding and so the GOPs are not truly independent. Independence can be obtained by creating a closed GOP which may contain B pictures but which ends with a P picture.

Bi-directional coding is very powerful. Figure 2.41 is a constant quality curve showing how the bit rate changes with the type of coding. On the left, only I or spatial coding is used, whereas on the right an IBBP structure is used having two bi-directionally coded pictures in between a spatially coded picture (I) and a forward predicted picture (P). Note how for the same quality the system that only uses spatial coding needs two and a half times the bit rate that the bi-directionally coded system needs.

Figure 2.41 Bi-directional coding is very powerful as it allows the same quality with only 40% of the bit rate of intra-coding. However, the encoding and decoding delays must increase. Coding over a longer time span is more efficient but editing is more difficult.

Clearly information in the future has yet to be transmitted and so is not normally available to the decoder. MPEG-2 gets around the problem by sending pictures in the wrong order. Picture reordering requires delay in the encoder and a delay in the decoder to put the order right again. Thus the overall codec delay must rise when bi-directional coding is used. This is quite consistent with Figure 2.36 in which it was shown that as the compression factor rises the latency must also rise.

Figure 2.42 shows that although the original picture sequence is IBBPBBPBBIBB …, this is transmitted as IPBBPBBIBB … so that the future picture is already in the decoder before bi-directional decoding begins. Note that the I picture of the next GOP is actually sent before the last B pictures of the current GOP.

Figure 2.42 Comparison of pictures before and after compression showing sequence change and varying amount of data needed by each picture type. I, P, B pictures use unequal amounts of data.
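The reordering rule can be stated compactly: each B picture is held back until both of its anchors have been sent. The sketch below models this for the sequence quoted above; it is a simplification that ignores closed GOPs and time stamps.

```python
def transmission_order(display):
    """Send each I or P picture, then the B pictures that were waiting on it."""
    out, pending_b = [], []
    for picture in display:
        if picture[0] in "IP":           # an anchor picture
            out.append(picture)
            out.extend(pending_b)        # the B pictures can now follow
            pending_b = []
        else:
            pending_b.append(picture)    # B: hold until the next anchor
    return out + pending_b

display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9", "I10"]
print(transmission_order(display))
# ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'I10', 'B8', 'B9']
```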

Figure 2.42 also shows that the amount of data required by each picture is dramatically different. I pictures have only spatial redundancy and so need a lot of data to describe them. P pictures need fewer data because they are created by shifts of I picture using vectors and then adding a prediction error picture. B pictures need the least data of all because they can be created from I or P.

With pictures requiring a variable length of time to transmit, arriving in the wrong order, the decoder needs some help. This takes the form of picture-type flags and time stamps.

References

1.  Watkinson, J.R., Convergence in Broadcast and Communications Media, Ch. 7, Focal Press, Oxford (2001)

2.  Betts, J.A., Signal Processing Modulation and Noise, Ch. 6, Hodder and Stoughton, Sevenoaks (1970)

3.  Meyer, J., Time correction of anti-aliasing filters used in digital audio systems. J. Audio Eng. Soc., vol. 32, pp. 132–137 (1984)

4.  Blesser, B., Advanced A/D conversion and filtering: data conversion. In Digital Audio, ed. B. A. Blesser, B. Locanthi and T. G. Stockham Jr, pp. 37–53, Audio Engineering Society, New York (1983)

5.  Lagadec, R., Weiss, D. and Greutmann, R., High-quality analog filters for digital audio. Presented at the 67th Audio Engineering Society Convention (New York), preprint 1707 (B-4) (1980)

6.  Hsu, S., The Kell factor: past and present. SMPTE Journal, vol. 95, pp. 206–214 (1986)

7.  Jesty, L.C., The relationship between picture size, viewing distance and picture quality. Proc. IEE, vol. 105B, pp. 425–439 (1958)

8.  Ishida, Y. et al., A PCM digital audio processor for home use VTRs. Presented at 64th Audio Engineering Society Convention (New York), preprint 1528 (1979)

9.  Watkinson, J.R., The Digital Video Tape Recorder, Focal Press, Oxford (1994)

10.  AES, AES recommended practice for professional digital audio applications employing pulse code modulation: preferred sampling frequencies. AES5–1984 (ANSI S4.28–1984), J. Audio Eng. Soc., vol. 32, pp. 781–785 (1984)

11.  Lipshitz, S.P. et al., Quantization and dither: a theoretical survey. J. Audio Eng. Soc., vol. 40, pp. 355–375 (1992)

12.  Roberts, L.G., Picture coding using pseudo-random noise. IRE Trans. Inform. Theory, vol. IT-8, pp. 145–154 (1962)

13.  Watkinson, J.R., The Art of Digital Audio, Third Edn, Focal Press, Oxford (2001)

14.  Watkinson, J.R., The Art of Digital Video, Third Edn, Focal Press, Oxford (2000)

15.  Vanderkooy, J. and Lipshitz, S.P., Digital dither. Presented at the 81st Audio Engineering Society Convention (Los Angeles), preprint 2412 (C-8) (1986)

16.  Watkinson, J.R., The MPEG Handbook, Focal Press, Oxford (2001)
