CHAPTER 18

Sound Quality

 

CHAPTER CONTENTS

What is Sound Quality?

Objective and subjective quality

Sound quality attributes

Quality and fidelity

Quality and naturalness

Quality and liking

Methods of Sound Quality Evaluation

Listening tests

Listeners

Blind tests

Types of listening test

Perceptual models for sound quality

Aspects of Audio System Performance Affecting Sound Quality

Frequency response

Harmonic distortion

Intermodulation (IM) distortion

Dynamic range and signal-to-noise ratio

Wow, flutter and jitter

Sound quality in the digital signal chain

Sound quality in audio codecs

 

This chapter is intended as an introduction to the topic of sound quality — what it means and what affects it. Examples are taken from both the analog and digital domains, and there is also an introduction to the topic of listening tests, including some of the related international standards. Finally there is an introduction to the question of how to measure sound quality and the use of perceptual models.

WHAT IS SOUND QUALITY?

It is possible to talk about sound quality in physical or technical terms, and in perceptual terms. In physical terms it generally relates to certain desirable measured characteristics of audio devices, transmission channels or signals. In perceptual terms, however, it relates to what is heard, interpreted and judged by human listeners. In an ideal world one domain could be related or mapped directly to the other. However, there may be aspects of sound quality that can be perceived, even though they cannot be measured, and some that can be measured but not perceived. One of the goals of perceptual model research, discussed later on, is to find better ways of measuring those aspects of audio signals that predict perceived quality.

Tomasz Letowski distinguishes between sound quality and sound character in his writings on the topic, the former having to do with a value judgment about the superiority of one sound over another and the latter being purely descriptive. Jens Blauert, in his Audio Engineering Society masterclass on sound quality (see ‘Recommended further reading’, below), refers to perceived sound quality in a broader sense as being related to a judgment about the perceived character of a sound in terms of its suitability for a particular task, expectation or pragmatic purpose. In his model it requires comparison of the sound to a reference set defined for the context in question, because otherwise the concept of sound quality ‘floats’ on a scale that has no fixed points and no meaning. The choice of the reference is therefore crucially important in defining the context for sound quality evaluation. Blauert also refers to different levels of abstraction when talking about sound quality, where the low levels are closely related to features of audio signals themselves, while the higher levels are related to ideas, concepts and meanings of sound (see Figure 18.1, which was inspired by Blauert and Jekosch, but not identical to their representation). This suggests that it is important to be careful what one means when talking about sound quality. One should be clear about the conceptual level at which one is talking — whether about signal quality, some descriptive attribute of the sound, or some more abstracted aspect of its overall cognition or perception.

Objective and subjective quality

Sound quality can be defined both in ‘objective’ and ‘subjective’ terms, but this categorization can be misleading and workers in the field have interpreted the meaning of these words differently. The term objective is often related to measurable aspects of sound quality, whereas the term subjective is often related to perceived aspects of sound quality. However, the term objective, in one sense of the word, means ‘free of any bias or prejudice caused by personal feelings’ and it has been shown that some descriptive attributes of sound quality can be evaluated by experienced listeners in a reliable and repeatable way that conforms to this definition. This can result in a form of perceptual measurement that is almost as ‘objective’ as a technical measurement. For this reason it may be more appropriate to differentiate between physically measured and perceived attributes of quality, reserving the term ‘subjective’ for questions such as liking and preference (discussed below). Figure 18.2, adapted from Bech and Zacharov’s Perceptual Audio Evaluation (see Recommended further reading, below) shows this in a graphical form.

image

FIGURE 18.1
Sound quality can be considered at different levels of abstraction from low to high. The lower levels are closely related to audio signals themselves and to associated auditory percepts whereas the higher levels have more cognitive complexity and relate to the meaning or signification of sound. (Adapted from Blauert.)

Sound quality attributes

Perceived sound quality is a many-faceted concept, and a term often used to describe this is ‘multi-dimensional’. In other words, there are many perceptual dimensions or attributes making up human judgments about sound quality. These may be arranged in a hierarchy, with an integrative judgment of quality at the top, and judgments of individual descriptive attributes at the bottom (see Figure 18.3). According to Letowski’s model this ‘tree’ may be divided broadly into spatial and timbral attributes, the spatial attributes referring to the three-dimensional features of sounds such as their location, width and distance, and the timbral attributes referring to aspects of sound color. Effects of non-linear distortion and noise are also sometimes put in the timbral group. The higher one goes up the tree, the more one is usually talking about the acceptability or suitability of the sound for some purpose and in relation to some frame of reference, whereas at the lower levels one may be able to evaluate the attributes concerned in value-free terms. In other words a high-level judgment of quality is an integrative evaluation that takes into account all of the lower-level attributes and weighs up their contribution. As mentioned above, the nature of the reference, the context, and the definition of the task govern the way in which the listener decides which aspects of the sound should be taken into consideration. One conception of such a model suggests that it may be possible to arrive at an overall prediction of quality by some weighted combination of ratings of individual low-level attributes. However, it is highly likely that such weightings are strongly context and task-dependent. (For example, whilst accurately located sources may be crucial for flight simulators, they may be less important in home entertainment systems.)
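To make the idea of a weighted combination concrete, a minimal sketch is given below. It is an illustration only: the attribute names, ratings and weights are invented, and in practice the weights would have to be derived from listening-test data for a specific context and task.

```python
# Minimal sketch of a weighted-sum quality prediction. The attribute
# names, ratings (0-10 scales) and weights are all hypothetical.

attribute_ratings = {
    "timbral_balance": 7.5,
    "spatial_width": 6.0,
    "freedom_from_distortion": 8.0,
}

# In practice these weights would be estimated from listening-test data
# for a particular context (e.g. home listening vs. flight simulation).
weights = {
    "timbral_balance": 0.5,
    "spatial_width": 0.2,
    "freedom_from_distortion": 0.3,
}

overall_quality = sum(weights[a] * r for a, r in attribute_ratings.items())
print(f"Predicted overall quality: {overall_quality:.2f}")
```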

image

FIGURE 18.2 Sound quality evaluation can be modeled as shown, involving perceptual and cognitive ‘filters’ at different stages of the process. Processes to the right can be said to be more ‘subjective’ than those towards the left, as they are likely to vary more between individuals and contexts. (Adapted from Bech and Zacharov.)

image

FIGURE 18.3
Overall sound quality can be subdivided into timbral and spatial quality domains. Each of these consists of contributions from a number of relevant low-level attributes

Quality and fidelity

The concept of fidelity has been of fundamental importance in defining the role of sound recording and reproduction. Fidelity can be defined variously as relating to faithfulness, as well as to accuracy in the description or reporting of facts and their details. In sound recording, it concerns the extent to which technical equipment is capable of accurately capturing, storing and reproducing sounds. The fidelity of two reproductions should really be a measure of their similarity. However, there has been a tendency in the history of sound quality evaluation either explicitly or implicitly to include a value judgment in concepts of fidelity.

Floyd Toole, for example, describes his concept of fidelity in a paper on listening tests, and states that in addition to rating various aspects of sound quality ‘listeners conclude with an overall “fidelity rating” intended to reflect the extent to which the reproduced sound resembles an ideal. With some music and voice the ideal is a recollection of live sound, with other source material the ideal must be what listeners imagine to be the intended sound’ (Toole, 1982: 440). Fidelity is thus defined in relation to a memorized or imagined ideal reference. Toole’s fidelity scale, shown in Figure 18.4, is really a hybrid of a value judgment (e.g. using terms like ‘worse’) and a faithfulness judgment (e.g. ‘faithful to the ideal’). There is not the option of ‘less faithful’ meaning ‘better’ (although this appears to be an unlikely combination it highlights the assumption implicit in this scale). It assumes that listeners know what is correct reproduction, and that what is correct is good. Gabrielsson and Lindström, on the other hand, define fidelity as ‘the similarity of the reproduction to the original sound … the music sounds exactly as you heard it in the same room where it was originally performed’ (1985: 52), but acknowledge the difficulty in judging this when listeners do not know what the music sounded like in reality (such as in studio-manufactured pop music).

Quality and naturalness

Naturalness is another term that crops up frequently in human descriptions of sound quality. It may be related to a higher-level cognitive response whereby the ‘plausibility’ of lower-level factors is somehow weighed in relation to a remembered experience of natural listening conditions. Naturalness and liking or preference are often found to be highly correlated, suggesting that listeners have a built-in preference for ‘natural’ sounds. This suggests that auditory cues in reproduced sound that contradict those encountered in the natural environment, or that are combined in an unnatural way, will be responded to negatively by listeners. Whether this phenomenon will continue to be noticed as people are exposed to more and more artificial sounds is open to question.

image

FIGURE 18.4 Toole’s fidelity scale from 1982 includes elements of value judgment as well as similarity to an ideal

Quality and liking

One of the biggest mistakes that can be made when talking about sound quality is to confuse liking with correctness. One cannot automatically assume that the sound with the least distortion or flattest frequency response will be the most liked, although in many cases it is so. Some people actively like the sound of certain kinds of distortion, and this may have something to do with the preference shown by some for analog recording systems or vinyl LP records, for example. There is nothing wrong with liking distorted recordings, but liking should not be confused with fidelity. Learning and familiarity also play an important role in determining preferred sound quality. In one early experiment, for example, students who had spent a period of time listening to restricted frequency range audio systems demonstrated a preference for those compared with full range systems.

In audio engineering there is a long tradition of the concept of fidelity — the assumption that there is a correct way to reproduce a recording and it is the job of a sound system to reproduce as faithfully as possible an artist’s intention or a natural sound. The idea of a reference reproduction is well embedded in the hearts of most audio engineers. If you like something that is different from what is correct, you are entitled to your opinion but basically you are wrong, so the argument runs. There is not space here to expound this argument in more than a basic way, but an alternative view is inspired by the food and beverage industry, which attempts to optimize products primarily for consumer preference. The most liked product is de facto the best, there being little room for a notion of correct reproduction of an ideal product. There is no ‘correct’ bottle of Shiraz wine, for example, just one that is most liked by a certain group of consumers. There may be no real conflict here in reality because in most cases sound systems are conduits for a creative product (e.g. a recorded song) that someone wants to convey to a consumer. It stands to reason that this conduit should be as transparent as possible, rather as the wine bottle and cork should taint the wine as little as possible. When talking about sound design on the other hand, perhaps there is more of a parallel with the food and beverage industry, as one is likely to be more interested in how to optimize different aspects of the sonic product so that people like the result. In the end it probably depends on defining clearly what the ‘product’ is, and who has control over it. This will certainly become more important in an age of interactive sonic products that can be altered or created on the fly by consumers who have no notion of a correct version of that product, or where multiple ways of rendering the product are available.

It is interesting in this respect to consider an experiment conducted by the author and his colleagues that attempted to discover which aspects of fidelity contributed most to listener preference in surround sound reproduction. It was found that untrained listeners simply did not care about precise stereo imaging in the frontal arc, whereas trained listeners regarded it as highly important. The untrained listeners mainly liked a broadly surrounding effect in the reproduction. This suggests that it may matter whose opinion is sought when evaluating liking for different aspects of sound reproduction. In the food industry, for example, it is common to use subjects representative of different populations to evaluate consumer preference because it is known that this depends on demographic and socio-economic factors. Sean Olive (2003), however, found that there is a fairly consistent relationship between measured loudspeaker performance and listener preference, for both trained and untrained listeners. This suggests that there are at least some contexts in which there is a relationship between technical fidelity of audio products and listeners’ hedonic judgments about sound quality.

METHODS OF SOUND QUALITY EVALUATION

This section is intended as an introduction to methods of sound quality evaluation, although it is by no means a comprehensive exposition. It includes an introduction to the topic of listening tests and also to that of perceptual models in sound quality prediction. Those wishing to study the topic in more detail are referred to the excellent book by Bech and Zacharov (see ‘Recommended further reading’, below).

Listening tests

Listening tests are the most widely used formal method for sound quality evaluation. When properly structured and carried out in a scientific manner with experienced listeners they can provide reliable and repeatable information about a number of perceived aspects of sound quality, as introduced earlier in this chapter. In order to carry out reliable listening tests it is important to be aware of the various difficulties inherent in evaluating sensory information using human subjects.

Listeners

Inevitably human judgment carries with it a degree of uncertainty and variability, and it is the goal of most experimenters to reduce this to a minimum by using experienced listeners who are carefully trained to recognize and grade the quality attributes in question. On the other hand, it can be important to recognize that variability between humans is not always to be regarded as annoying statistical ‘noise’, but may represent a real difference in the way that people interpret what they hear. It should be clear whether one is looking for the opinions or sentiments of different potential user groups, or for a clearly defined objective description of sound quality features. In the former case it may be necessary to engage a pool of listeners representing different consumer types, and quite large numbers are usually needed in order to observe trends. In the latter case sufficient training, experience or familiarization is needed on behalf of a smaller group of subjects to enable them to employ a common language, understanding and reliable scaling method. In some types of listening test it is also common to employ so-called ‘expert listeners’ to detect and rate changes in overall sound quality in a consistent and reliable manner. In such cases the aim is not so much the profiling of consumer opinion but the accurate rating of sound quality changes caused by audio processing.

There is, however, no escaping the fact that a listener’s background, experience and training will bias his or her judgment of sound quality, so it is important to recognize the limitations in generalizability of results.

Blind tests

It is widely accepted that knowing what you are listening to affects your judgment. In an interesting series of experiments by Toole and Olive, for example, in which they compared listener preferences for loudspeakers in both blind and sighted conditions, it was found that being aware of the model of loudspeaker under test affected the results as much as other variables such as listening position or program material. People’s responses were so strongly biased by their expectations of certain brands or models that their judgments of sound quality were sometimes radically altered from their ratings under blind conditions. This applied to experienced listeners as well as to more naïve ones, suggesting that even people who ‘know what sounds good’ can have their opinions altered by their expectation or visual bias. ‘Obviously’, said the authors, ‘listeners’ opinions were more attached to the products that they could see, than they were to the differences in sound associated with program’ (Toole and Olive, 1994: 13).

Because of this tendency, controlled listening tests employ ‘blind’ evaluation, which means that the listener is not allowed to know the identity of the stimuli concerned. ‘Blind’ does not necessarily mean blindfolded, unless there is a specific intention to ensure the removal of a visual bias, but it does mean that listeners should not be told the identity of the things they are to evaluate. Sometimes loudspeakers are hidden behind a curtain, for example. This aims to ensure that the evaluation is based only on what is heard and not biased by expectations, prejudices or other information that might change the result of the test. In the case of double-blind experiments, the test supervisor is also supposed to be unaware of the exact stimuli being presented to the listener at any one time in order that there is no chance of inadvertent biasing.

Types of listening test

A number of types of formal listening test are commonly employed, depending on the aim of the investigation. It is not intended to describe all of them, but only the most commonly encountered. Common to nearly all is the need to define exactly what is to be evaluated, whether this is an overall quality judgment supposed to include all aspects of sound quality, or whether it is one or more specific attributes of sound quality (such as ‘brightness’ or ‘spaciousness’). It is also very common to employ a reference of some sort, which is a signal against which others are compared. This is most likely in cases involving the evaluation of impairments in quality, such as caused by audio codecs or processing devices. Human listeners tend to be quite poor at judging sound quality attributes when they have nothing to compare with, but can be quite reliable when there is one or more ‘anchor’ points on the scale concerned.

Most listening tests require the listener to rate a quality attribute on a scale of some kind, and in some cases the scale is labeled with words or numbers at intervals. The words may describe some level of quality or magnitude of some attribute in terms that are meaningful to listeners (e.g. ‘excellent’ and ‘bad’, or perhaps ‘bright’ and ‘dark’). There is considerable debate about the problems arising from the use of scale labels, as they may not have a universal meaning, nor may the words represent equal perceptual intervals. There is not space here to go into the matter further, but it is sufficient to note that these problems have led some to propose the use of alternative indirect scaling methods, or alternatives involving label-free scales, ‘audible labels’, or even approaches involving the observation of listener behavior when exposed to sound stimuli with different levels of quality.

The method of evaluation often involves presenting listeners with a number of stimuli to evaluate, either in pairs, triplets or multiples, since comparative judgment tends to be much more reliable than one-off grading. The ability to switch freely between stimuli also aids reliable comparison. It is also common practice to employ hidden reference signals in the case of some tests, in order to enable the experimenter to determine whether or not the listener is able to identify any reference signal consistently. This is one way of weeding out unreliable listeners.

The two primary evaluation methods employed in the international standards that deal with listening tests are the ‘triple stimulus with hidden reference’ approach used in ITU-R BS.1116, and the ‘multiple stimulus with hidden reference and anchors’ approach used in ITU-R BS.1534. The latter is commonly known as ‘MUSHRA’ and both are intended for the evaluation of quality impairments, mainly those caused by low bit rate codecs, although they can be adapted for other purposes. The BS.1116 approach presents the listener with three stimuli at a time — A, B and C — where the original unprocessed reference is always A and the other two are randomly assigned to the reference or the impaired signal. The listener has to decide which of B or C is the same as the reference and rate it ‘5’ on the 5-point scale, rating the other stimulus according to its perceived quality. The MUSHRA method presents the listener with a number of stimuli together on the same screen, enabling him to switch between them at will and enter gradings for each one. Hidden among the stimuli are anchor signals representing specific quality levels, and a reference signal. The listener is expected to rate at least one of the stimuli (the hidden reference, one hopes) at the top of the scale. Examples of scales for these test types are shown in Figure 18.5. Scales range from ‘imperceptible’ to ‘very annoying’ (when rating quality impairment) or from ‘excellent’ to ‘bad’ (when rating quality). BS.1116 is intended for cases where the impairments involved are relatively small and hard to detect, whereas BS.1534 is intended for intermediate levels of quality.
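As an illustration of the hidden-reference principle, the following sketch shows how stimuli might be assigned and shuffled for a single MUSHRA-style trial. It is not an implementation of BS.1534 itself: the file names are hypothetical, and playback, level alignment and the grading interface are omitted.

```python
import random

# Simplified sketch of stimulus assignment for one MUSHRA-style trial
# (after ITU-R BS.1534). File names are hypothetical.

reference = "reference.wav"
conditions = ["codec_128k.wav", "codec_64k.wav", "codec_32k.wav"]
anchor = "lowpass_3k5.wav"  # a band-limited anchor signal

# The presented stimuli include a hidden copy of the reference and the
# anchor, in random order, so their identities are unknown to the listener.
stimuli = conditions + [reference, anchor]
random.shuffle(stimuli)

for i, stim in enumerate(stimuli):
    print(f"Stimulus {chr(65 + i)} (identity hidden): {stim}")

# After grading, listeners who fail to rate the hidden reference at or
# near the top of the scale consistently may be excluded as unreliable.
```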

An alternative to these types of test is the so-called ‘ABX’ test, which is intended for determining only whether a difference can be reliably perceived between a pair of stimuli. In an ABX test one or more listeners is presented with a pair of stimuli and asked to identify whether X is the same as A or the same as B. X is randomly assigned to being one or the other. This process is repeated a large number of times. If there is no perceivable difference between the two then on average, over a large number of trials, half of the answers should be correct (simply what could be arrived at by chance). If there is truly a difference between the stimuli then the number of correct answers will be significantly higher than half, and the proportion gives information about the strength of the difference. This is one approach that can be used when trying to find out whether some supposedly inaudible process, such as an audio watermark, can be heard. Or it could be used, for example, if trying to discover whether listeners really can tell the difference between two loudspeaker cables.
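The statistics of an ABX test amount to a one-sided binomial test against chance performance (p = 0.5). A minimal sketch in Python, with invented trial counts, might look like this:

```python
from scipy.stats import binomtest

# If A and B are truly indistinguishable, correct identifications of X
# follow a binomial distribution with p = 0.5. Trial counts are invented.
n_trials = 40
n_correct = 29

# One-sided test: do listeners identify X correctly more often than chance?
result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"{n_correct}/{n_trials} correct, p-value = {result.pvalue:.4f}")

# A small p-value (say below 0.05) suggests that the difference between
# the two stimuli really is audible, rather than a product of guessing.
```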

image

FIGURE 18.5
Scales used in ITU-R listening tests. (a) BS.1116 impairment scale, (b) BS.1534 quality scale. Note that in the latter the labels denote the meaning of intervals rather than points on the scale

Perceptual models for sound quality

Listening tests are time consuming and resource intensive, so there is a strong motivation for the development of perceptual models that aim to predict the human response to different aspects of sound quality. Rather than relying on relatively basic conventional measurements of frequency response and distortion, such models tend to be based on measurements or ‘metrics’ corresponding to a range of dynamic and static features of the audio signals in question, which are known to have an effect on perceived quality. Either by employing a calibrated statistical model or a trained neural network, relationships are established between these metrics and the results of listening tests on selected signals so that predictions can be made about the perceived quality of other signals. Such models either base their predictions on measurements made using program material or they make use of specially prepared test items or probe signals with characteristics designed to stress the perceptual range of interest. A standard model exists for predicting the overall quality of low bit rate audio codecs (commercially known as ‘PEAQ’ — for Perceptual Evaluation of Audio Quality) and another is available for speech coding quality prediction (commercially known as ‘PESQ’ — for Perceptual Evaluation of Speech Quality). Others have attempted to build models for predicting spatial quality, and for individual quality attributes.

A typical perceptual model for sound quality is calibrated in a similar way to that shown in Figure 18.6. Audio signals, usually consisting of a set of reference and impaired versions of chosen program items, are scaled in standard listening tests to generate a database of ‘subjective’ grades. In parallel with this a set of audio features is defined and measured, leading to a set of metrics representing perceptually relevant aspects of the audio signals concerned. These are sometimes termed ‘objective metrics’. The statistical model or neural network is then calibrated or trained based on these data so as to make a more or less accurate prediction of the quality ratings given by the listeners. The output of the prediction is commonly referred to as an ‘objective’ measurement of sound quality, although this may be misleading. It is only as objective as the ‘subjective’ data used in its calibration, the appropriateness of the metrics and test signals used, and the sophistication of the mathematical model employed. The generalizability of such models needs wide testing.
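The calibration stage can be illustrated with a deliberately simplified sketch: a linear least-squares fit mapping a handful of invented ‘objective metrics’ onto mean listening-test grades. Real systems such as PEAQ use many more metrics and a trained neural network rather than a simple linear map.

```python
import numpy as np

# Rows = coded test items; columns = hypothetical objective metrics
# (e.g. a noise-to-mask ratio, bandwidth, a modulation-difference measure).
metrics = np.array([
    [0.2, 18.0, 0.05],
    [0.8, 15.0, 0.20],
    [1.5, 10.0, 0.45],
    [2.4,  7.0, 0.70],
])
subjective_grades = np.array([4.8, 4.1, 2.9, 1.7])  # from listening tests

# Fit a linear model with an intercept term by least squares.
X = np.column_stack([metrics, np.ones(len(metrics))])
coeffs, *_ = np.linalg.lstsq(X, subjective_grades, rcond=None)

# Predict the grade of an unseen item from its measured metrics.
new_item = np.array([1.0, 12.0, 0.30, 1.0])  # final 1.0 is the intercept term
print(f"Predicted grade: {new_item @ coeffs:.2f}")
```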

Such models are typically ‘double-ended’ in that they have access to both the original unimpaired reference signal and the processed, impaired signal (see Figure 18.7). As such they are primarily used as quality impairment models, predicting perceived reductions in quality compared with a known ‘correct’ reference. It is much more difficult to predict quality without a known signal for comparison, although it may be possible to predict individual descriptive quality attributes in this way if the scales are calibrated using known anchor stimuli.

One further use for predictive quality models is in preference mapping. This derives from methods used in the food and beverage industry, where one often wishes to know the relationship between certain descriptive features of quality and consumer preference. In such cases a set of expert-derived quality metrics relating to calibrated sensory attributes of the food are mapped onto preference ratings obtained from different consumer groups. In this way the aim is to ‘engineer’ the sensory attributes of food products (relating to taste, texture and smell) so that they are maximally preferred by the target consumer. As discussed earlier in this chapter, there is still considerable debate about the relevance of such approaches to audio engineering.

image

FIGURE 18.6
A typical approach used in the calibration of sound quality prediction models

image

FIGURE 18.7
Most predictive models for sound quality, such as those standardized by the ITU, compare a reference (unimpaired) and impaired version of the same audio signal, leading to a prediction of the difference in the perceived quality

ASPECTS OF AUDIO SYSTEM PERFORMANCE AFFECTING SOUND QUALITY

In the following sections a number of basic aspects of audio systems will be discussed in relation to sound quality. In particular the technical performance of devices and systems will be related where possible to audible artefacts.

Frequency response

The most commonly quoted specification for a piece of audio equipment is its frequency response. It is a parameter that describes the frequency range handled by the device — that is, the range of frequencies that it can pick up, record, transmit or reproduce. To take a simple view, for high-quality reproduction the device would normally be expected to cover the whole audio-frequency range, which was defined earlier in this book as being from 20 Hz to 20 kHz, although some have argued that a response which extends above the human hearing range has audible benefits. It is not enough, though, simply to consider the range of frequencies reproduced, since this says nothing about the relative levels of different frequencies or the amplitude of signals at the extremes of the range. If further qualification is not given then a frequency response specification of 20 Hz-20 kHz is not particularly useful.

The ideal frequency response for transparent transmission is one which is ‘flat’ — that is, with all frequencies treated equally and none amplified more than others. Technically, this means that the gain of the system should be the same at all frequencies, and this could be verified by plotting the amplitude of the output signal on a graph, over the given frequency range, assuming a constant-level input signal. An example of this is shown in Figure 18.8, where it will be seen that the graph of output level versus frequency is a straight horizontal line between the limits of 20 Hz and 20 kHz — that is, a flat frequency response. Also shown in Figure 18.8 are examples of non-flat responses, and it will be seen that these boost some frequencies and cut others, affecting the balance between different parts of the sound spectrum. Some discussion of the effects of this on sound quality is provided in Fact File 18.1.
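The measurement principle can be sketched in a few lines of code: drive the device with sine tones at a range of frequencies and compare output and input levels. In the sketch below a toy one-pole low-pass filter stands in for the device under test; for a system with a flat response every line would read close to 0 dB.

```python
import numpy as np

fs = 48000  # sample rate, Hz

def device_under_test(x):
    # Toy stand-in for the system being measured: a one-pole low-pass.
    y = np.zeros_like(x)
    a = 0.9
    for n in range(1, len(x)):
        y[n] = (1 - a) * x[n] + a * y[n - 1]
    return y

for f in np.geomspace(20, 20000, num=10):
    t = np.arange(int(fs * 0.1)) / fs          # 100 ms test tone
    x = np.sin(2 * np.pi * f * t)
    y = device_under_test(x)
    # Ratio of steady-state RMS levels gives the gain at this frequency.
    gain_db = 20 * np.log10(np.std(y[len(y) // 2:]) / np.std(x[len(x) // 2:]))
    print(f"{f:8.1f} Hz: {gain_db:+6.2f} dB")
```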

image

FIGURE 18.8
(a) Plot of a flat frequency response from 20 Hz to 20 kHz. (b) Examples of two non-flat responses

Concerning the effects of very low and very high frequencies, close to the limits of human hearing, it can be shown that the reproduction of sounds below 20 Hz does sometimes offer an improved listening experience, since it can cause realistic vibrations of the surroundings. Also, the ear’s frequency response does not cut off suddenly at the extremes, but gradually decreases, and thus it is not true that one hears nothing below 20 Hz and above 20 kHz — one simply hears much less and the amplitude has to be exceptionally high to create a sensation. Recent research suggests that the sound pressure level has to be well above 70 dB for any response to be detected above 20 kHz in the auditory brain stem, for example. Similarly, extended HF responses can sometimes help sound quality, mainly because a gentle HF roll-off above 20 kHz usually implies less steep filtering of the signal, which may have the by-product of improved quality for other reasons. The sensitivity of humans to frequency extremes is also level dependent, as discussed in Fact File 2.2, which affects the perceived frequency balance of reproduced sound. For example, a mix that is made at a very high listening level may seem lacking in bass when played back more quietly. This is one reason why ‘loudness’ controls are provided on some consumer reproduction systems — they boost the bass (and sometimes treble) to compensate for the reduced sensitivity to these regions when replay is quiet. It also helps to explain why even small differences in reproduction level can give rise to noticeable differences in perceived sound quality.

FACT FILE 18.1 EFFECTS OF FREQUENCY RESPONSE ON SOUND QUALITY

Deviations from a flat frequency response will affect perceived sound quality. If the aim is to carry through the original signal without modifying it, then a flat response will ensure that the original amplitude relationships between different parts of the frequency spectrum are not changed. Some forms of modification to the ideal flat response are more acceptable than others. For example, a gentle roll-off at the high-frequency (HF) end of the range is often regarded as quite pleasant in some microphones. FM radio receivers, for example, tend to have an upper limit of around 15 kHz, but are relatively flat below this and thus do not sound unpleasant. Frequency responses that deviate wildly from flat over the audio-frequency range, on the other hand, sound much worse, even if the overall range of reproduction is wider than that of FM radio. Middle frequency peaks and troughs in the response are particularly objectionable, sounding very ‘colored’ and sometimes with a ringing effect. However, extremely narrow troughs in loudspeaker responses may not be noticed under some circumstances. If the frequency response of a system rises at high frequencies then the sibilant components of the sound will be emphasized, music will sound very ‘bright’ and ‘scratchy’, and any background hiss will be emphasized. If the response is down at high frequencies then the sound will become dull and muffled, and any background hiss may appear to be reduced. If the frequency response rises at low frequencies then the sound will be more ‘boomy’, and bass notes will be emphasized. If low frequencies are missing, the sound will be very ‘thin’ and ‘tinny’. A rise in the middle frequency range will result in a somewhat ‘nasal’ or ‘honky’ sound, perhaps having a rather harsh quality, depending on the exact frequency range concerned.

Electronic devices and digital recording systems tend to have a flatter response than analog recording systems or electro-acoustical transducers (microphones or loudspeakers). An amplifier is an example of the former case, and it is unusual to find a well-designed power amplifier, say, that does not have a flat frequency response these days — flat often to within a fraction of a decibel from 5 Hz up to perhaps 100 kHz. (This does not, however, imply that the full power of the amplifier is necessarily available over this whole range, making the frequency response of power amplifiers a potentially misleading specification.) Essentially, however, a flat frequency response is relatively easy to engineer in most electronic audio systems today.

Transducers are the most prone of audio devices to frequency response errors, and some poor loudspeakers exhibit deviations of 10 dB or more from ‘flat’. Since such devices are also affected by the acoustics of rooms it is difficult to divorce a discussion of their own response from a discussion of the way in which they interact with their surroundings. The room in which a loudspeaker is placed has a significant effect on the perceived response, since the room will resonate at certain frequencies, creating pressure peaks and troughs throughout the room. Depending on the location of the listener, some frequencies may be emphasized more than others. (For further in-depth discussion of this topic see Floyd Toole’s book, Sound Reproduction: Loudspeakers and Rooms, listed in ‘Recommended further reading’, below.) A loudspeaker’s response can be measured in so-called ‘anechoic’ conditions, where the room is mostly absorbent and cannot produce significant effects of its own, although other methods now exist which do not require the use of such a room. These alternative methods, such as time-delay spectrometry and maximum length sequence analysis, enable the system to exclude the portion of the time response that includes reflections, concentrating only on the direct sound. This has limitations in terms of the low-frequency range that can be measured correctly, depending on the delay before the first reflection.

FACT FILE 18.2 EFFECTS OF HARMONIC DISTORTION ON SOUND QUALITY

Harmonic distortion is not always unpleasant; indeed, many people find it quite satisfying and link it with such subjective parameters as ‘warmth’ and ‘fullness’ in reproduced sound, calling sound which has less distortion ‘clinical’ and ‘cold’. Since the distortion is harmonically related to the signal that caused it, the effect may not be unmusical and may serve to reinforce the pitch of the fundamental in the case of even-harmonic distortion.

Because distortion tended to increase gradually with increasing recording level on analog tape recordings, the onset of distortion was less noticeable than it is when an amplifier or digital recording system ‘clips’, for example. Many old analog tape recordings contain quite high percentages of odd harmonic distortion that have been deemed acceptable (e.g. 1–5%). Amplifier or digital system clipping, on the other hand, is very sudden and results in a ‘squaring-off’ of the audio waveform when it exceeds a certain level, at which point the harmonic distortion becomes severe. This effect can be heard when the batteries are going flat on a transistor radio, or when a loudspeaker is driven exceedingly hard from a low-powered amplifier. It sounds like a buzzing or ‘fuzz-like’ breaking up of the sound on peaks of the signal. It is no accident that this sounds like the fuzz box effect on a rock electric guitar, as this effect is created by overdriving an amplifier circuit.

A good loudspeaker will have a response that covers the majority of the audio-frequency range, with a tolerance of perhaps ±3 dB, but the LF end is less easy to extend than the HF end because it requires large cabinet volume and driver displacement. Smaller loudspeakers will only have a response that extends to perhaps 50 or 60 Hz. The off-axis response of a loudspeaker has also been found to have an important influence on the perceived quality, since this often governs the frequency content of the reflected sound that makes up a large proportion of what is heard. An interesting paper by Sean Olive relates measured parameters of the loudspeaker response to listener preference ratings, showing the relative contributions of various on- and off-axis measurements (see ‘Recommended further reading’, below).

Some typical frequency response specifications of devices are shown in Table 18.1.

Harmonic distortion

Harmonic distortion is another common parameter used in the specifications of audio systems. Such distortion is the result of so-called ‘non-linearity’ within a device — in other words, when there is not a 1:1 relationship between what comes out of the device and what went in, when looked at over the whole signal amplitude range. In Chapter 1 it was shown that only simple sinusoidal waveforms are completely ‘pure’, consisting of a single frequency without harmonics. More complex repetitive waveforms can be analyzed into a set of harmonic components based on the fundamental frequency of the wave. Harmonic distortion in audio equipment arises when the shape of the sound waveform is changed slightly between input and output, such that harmonics are introduced into the signal which were not originally present, thus modifying the sound to some extent (see Figure 18.9). It is virtually impossible to avoid a small amount of harmonic distortion, since no device carries through a signal entirely unmodified, but it can be reduced to extremely low levels in modern electronic audio equipment.

Table 18.1 Examples of typical frequency responses of audio systems

Device Typical frequency response
Telephone system 300 Hz-3 kHz
AM radio 50 Hz-6 kHz
Consumer cassette machine 40 Hz-15 kHz (±3 dB)
Professional analog tape recorder 30 Hz-25 kHz (±1 dB)
CD player 20 Hz-20 kHz (±0.5 dB)
Good-quality small loudspeaker 60 Hz-20 kHz (±3 dB)
Good-quality large loudspeaker 35 Hz-20 kHz (±3 dB)
Good-quality power amplifier 6 Hz-100 kHz (+0, −1 dB)
Good-quality omni microphone 20 Hz-20 kHz (±3 dB)
image

FIGURE 18.9
A sine wave input signal is subject to harmonic distortion in the device under test. The waveform at the output is a different shape to that at the input, and its equivalent line spectrum contains components at harmonics of the original sine wave frequency

Harmonic distortion is normally quoted as a percentage of the signal which caused it (e.g. THD 0.1% @ 1 kHz), but, as with frequency response, it is important to be specific about what type of harmonic distortion is being quoted, and under what conditions. One should distinguish, for example, between third-harmonic distortion and total harmonic distortion, and unfortunately both can be abbreviated to ‘THD’ (although THD most often refers to total harmonic distortion). Total harmonic distortion is the sum of the contributions from all the harmonic components introduced by the device, assuming that the fundamental has been filtered out, and is normally measured by introducing a 1 kHz sine wave into the device and measuring the resulting distortion at a recognized input level. The level and frequency of the sine wave used depends on the type of device and the test standard used. Third-harmonic distortion, on the other hand, is a measurement of the amplitude of the third harmonic of the input frequency only, and was commonly used in analog tape recorder tests. It can be important to be specific about the level and frequency at which the distortion specification is made, since in many audio devices distortion varies with these parameters.
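The measurement itself is straightforward to sketch: pass a sine wave through the device, take a spectrum of the output, and compare the energy at the harmonics with that at the fundamental. In the following illustration a small polynomial non-linearity stands in for the device under test.

```python
import numpy as np

fs = 48000
f0 = 1000                      # 1 kHz test tone
t = np.arange(fs) / fs         # one second, so FFT bin spacing is 1 Hz
x = np.sin(2 * np.pi * f0 * t)

# Toy non-linearity standing in for the device under test.
y = x + 0.01 * x**2 + 0.002 * x**3

spectrum = np.abs(np.fft.rfft(y)) / len(y)
fundamental = spectrum[f0]                      # bin index equals frequency
harmonics = [spectrum[k * f0] for k in range(2, 6)]

# Total harmonic distortion: RMS sum of harmonic levels over the fundamental.
thd = np.sqrt(sum(h**2 for h in harmonics)) / fundamental
print(f"THD = {100 * thd:.3f}%")
```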

Typical examples of harmonic distortion in audio products are given in Table 18.2.

Intermodulation (IM) distortion

IM distortion results when two or more signals are passed through a non-linear device. Since all audio equipment has some non-linearity there will always be small amounts of IM distortion, but these can be very low. Low IM distortion figures are an important mark of a high-quality system, since such distortion is a major contributor to poor sound quality, but it is less often quoted than THD. Unlike harmonic distortion, IM distortion may not be harmonically related to the frequency of the signals causing the distortion, and thus it is audibly more unpleasant. If two sine-wave tones are passed through a non-linear device, sum and difference tones may arise between them (see Figure 18.10). For example, a tone at f1 = 1100 Hz and a tone at f2 = 1000 Hz might give rise to IM products at f1 − f2 = 100 Hz, and also at f1 + f2 = 2100 Hz, as well as subsidiary products at 2f1 − f2 and so on. The dominant components will depend on the nature of the non-linearity.
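The appearance of sum and difference products can be demonstrated numerically. The sketch below passes the two tones from the example above through a mild second-order non-linearity (an invented transfer characteristic) and reads off the spectrum at the expected IM frequencies.

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs                       # one second, 1 Hz bin spacing
x = np.sin(2 * np.pi * 1100 * t) + np.sin(2 * np.pi * 1000 * t)

y = x + 0.05 * x**2                          # mild second-order non-linearity

spectrum = np.abs(np.fft.rfft(y)) / len(y)
for f in [100, 1000, 1100, 2100]:            # f1-f2, f2, f1, f1+f2
    print(f"{f:5d} Hz: {20 * np.log10(spectrum[f] + 1e-12):7.1f} dB rel.")
```

With a purely second-order characteristic the 100 Hz and 2100 Hz products appear clearly; adding a cubic term to the transfer characteristic would also produce components such as 2f1 − f2.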

Table 18.2 Typical THD percentages

Device % THD
Good power amplifier @ rated power <0.05% (20 Hz-20 kHz)
16 bit digital recorder (via own convertors) <0.05% (−15 dB input level)
Loudspeaker <1% (25 W, 200 Hz)
Professional analog tape recorder <1% (ref. level, 1 kHz)
Professional capacitor microphone <0.5% (1 kHz, 94 dB SPL)
image

FIGURE 18.10
Intermodulation distortion between two input signals in a non-linear device results in low-level sum-and-difference frequency components in the output signal

image

FIGURE 18.11 Signal-to-noise ratio is often quoted as the number of decibels between the reference level and the noise floor. Available dynamic range may be greater than this, and is often quoted as the difference between the peak level and the noise floor

Dynamic range and signal-to-noise ratio

Dynamic range and signal-to-noise (S/N) ratio are often considered to be interchangeable terms for the same thing. This depends on how the figures are arrived at. S/N ratio is normally considered to be the number of decibels between the ‘reference level’ and the noise floor of the system (see Figure 18.11). The noise floor may be weighted according to one of the standard curves which attempts to account for the potential ‘annoyance’ of the noise by amplifying some parts of the frequency spectrum and attenuating others (see Fact Files 1.4 and 18.3). Dynamic range may be the same thing, or it may be the number of decibels between the peak level and the noise floor, indicating the ‘maximum-to-minimum’ range of signal levels that may be handled by the system. Either parameter quoted without qualification is difficult to interpret. It is also difficult to compare S/N ratios between devices measured using different weighting curves. In analog tape recorders, for example, dynamic range is sometimes quoted as the number of decibels between the 3% MOL (maximum output level) and the weighted noise floor. This gives an idea of the available recording ‘window’, since the MOL is often well above the reference level. In digital recorders, the peak recording level is really also the reference level, since there is no point in recording above this point due to the sudden clipping of the signal.
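The arithmetic involved is simple, as the following worked sketch shows (the voltage values are illustrative only):

```python
import math

# S/N ratio from RMS levels: decibels between reference level and noise.
v_reference = 1.0        # RMS volts at reference level
v_noise = 0.0005         # RMS volts of the (unweighted) noise floor

snr_db = 20 * math.log10(v_reference / v_noise)
print(f"S/N ratio = {snr_db:.1f} dB")        # 66.0 dB

# If the peak level sits 12 dB above reference, the dynamic range quoted
# peak-to-noise would be 66.0 + 12 = 78.0 dB.
```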

FACT FILE 18.3 NOISE WEIGHTING CURVES

Weighting filters are used when measuring noise, to produce a figure that more closely represents the subjective annoyance value of the noise. Some examples of regularly used weighting curves are shown in the diagram, and it will be seen that they are similar but not the same. Here 0dB on the vertical axis represents the point at which the gain of the filter is ‘unity’, that is where it neither attenuates nor amplifies the signal. The ‘A’ curve is not normally used for measuring audio equipment noise, since it was designed for measuring acoustic background noise in buildings. The various DIN and CCIR curves are more commonly used in audio equipment specifications.

image

In general a constant level of background hiss is much less subjectively annoying than intermittent noises that draw attention to themselves. Mains-induced hum or buzz is also potentially annoying because of its tonal quality and the harmonics of the fundamental frequency that are often present. Noise that is modulated in some way by the audio signal, such as the digital codec noise described below, can also be subjectively quite annoying.

Wow, flutter and jitter

Wow and flutter are names used to describe speed variations of an analog tape machine or turntable, which are usually translated into frequency fluctuations in the audio signal. Wow means slow variations in speed while flutter means faster variations in speed. A weighting filter (usually to the DIN standard) is used when measuring to produce a figure that closely correlates with one’s perception of the annoyance of speed variations. A machine with poor W&F results will sound most unpleasant, with either uncomfortable deviations in the pitch of notes or a ‘roughness’ in the sound and possibly some intermodulation distortion. Digital systems, on the other hand, tend to suffer more from ‘jitter’ as discussed in Fact File 18.4.

Sound quality in the digital signal chain

A number of basic aspects of digital conversion that affect sound quality were already mentioned in Chapter 8, such as the relationship between quantizing resolution and dynamic range, and between sampling frequency and frequency response. In this section some operational and systems issues will be considered.

Typical audio systems today have a very wide dynamic range that equals or exceeds that of the human hearing system. The distortion and noise inherent in the recording or processing of audio are at exceptionally low levels owing to the use of high resolution A/D convertors, up to 24 bit storage, and wide range floating-point signal processing. The sound quality achievable with modern audio devices is therefore exceptionally high. Of course there will always be those for whom improvements can be made, but technical performance of digital audio systems is no longer really a major issue today.

If one accepts the foregoing argument, the maintenance of sound quality in the digital signal chain comes down more to understanding the operational areas in which quality can be compromised. These include things like ensuring as few A/D and D/A conversions as possible, maintaining audio resolution at 20 bits or more throughout the signal chain (assuming this is possible), redithering appropriately at points where requantizing is done, and avoiding sampling frequency conversions. The rule of thumb should be to use the highest sampling frequency and resolution that one can afford, but no higher than strictly necessary for the purpose, otherwise storage space and signal processing power will be squandered. The scientific merits of exceptionally high sampling frequencies are dubious, although the marketing value may be considerable.

FACT FILE 18.4 JITTER

Happily the phenomenon of wow and flutter is now consigned almost entirely to history, as modern digital recorders do not suffer from it owing to the fact that transport or disk speed variations can be ironed out by reclocking the audio signal before reproduction. The modern equivalent, however, is jitter. Jitter is the term used to describe clock speed or sample timing variations in digital audio systems, and can give rise to effects of a similar technical nature to wow and flutter, but with a different spectral spread and character. It typically only affects sound quality when it interferes with the A/D or D/A conversion process. Because of the typical frequency and temporal characteristics of jitter it tends to manifest itself as a rise in the noise floor or distortion content of the digital signal, leading to a less ‘clean’ sound when jitter is high. If an A/D convertor suffers from jitter there is no way to remove the distortion it creates from the digital signal subsequently, so it pays to use convertors with very low jitter specifications.

The point at which quality can be affected in a digital audio system is at A/D and D/A conversion. In fact the quality of an analog signal is irretrievably fixed at the point of A/D conversion, so this should be done with the best equipment available. There is very little that can be done afterwards to improve the quality of a poorly converted signal. At conversion stages the stability of timing of the sampling clock is crucial, because if it is unstable the audio signal will contain modulation artefacts that give rise to increased distortions and noise of various kinds. This so-called clock jitter (see Fact File 18.4) is one of the biggest factors affecting sound quality in convertors and high-quality external convertors usually have much lower jitter than the internal convertors used on PC sound cards.

The quality of a digital audio signal, provided it stays in the digital domain, is not altered unless the values of the samples are altered. It follows that if a signal is recorded, replayed, transferred or copied without altering sample values then the quality will not have been affected, despite what anyone may say. Sound quality, once in the digital domain, therefore depends entirely on the signal processing algorithms used to modify the program. There is little a user can do about this except choose high quality plug-ins and other software, written by manufacturers that have a good reputation for DSP that takes care of rounding errors, truncation, phase errors and all the other nasties that can arise in signal processing. This is really no different from the problems of choosing good-sounding analog equipment. Certainly not all digital equalizer plug-ins sound the same, for example, because this depends on the filter design. Storage of digital data, on the other hand, does not affect sound quality at all, provided that no errors arise and that the signal is stored at full resolution in its raw PCM form (in other words not using some form of lossy coding).

The sound quality the user hears when listening to the output of a professional system in the studio is not necessarily what the consumer will hear when the resulting program is issued on the release medium. One reason for this is that the sound quality depends on the quality of the D/A convertors used for monitoring. The consumer may hear better or worse, depending on the convertors used, assuming the bit stream is delivered without modification. One hopes that the convertors used in professional environments are better than those used by consumers, but this is not always the case. High-resolution audio may be mastered at a lower resolution for consumer release (e.g. 96 kHz, 24 bit recordings reduced to 44.1 kHz, 16 bits for release on CD), and this can affect sound quality. It is very important that any down-conversion of master recordings be done using the best dithering and/or sampling frequency conversion possible, especially when sampling frequency conversion is of a non-integer ratio.
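A minimal sketch of one such down-conversion step, word-length reduction with TPDF (triangular) redithering, is shown below. It is illustrative only: there is no noise shaping, and the function name is invented.

```python
import numpy as np

def requantize_with_dither(samples_24bit, rng=None):
    # Reduce 24 bit integer samples to 16 bit with TPDF dither.
    if rng is None:
        rng = np.random.default_rng(0)
    scale = 2 ** 8  # 256 24-bit steps per 16-bit step
    # TPDF dither: sum of two uniform variables, spanning +/-1 LSB (16 bit).
    dither = (rng.uniform(-0.5, 0.5, len(samples_24bit))
              + rng.uniform(-0.5, 0.5, len(samples_24bit)))
    return np.round(samples_24bit / scale + dither).astype(np.int16)

# Example: a very quiet 24 bit sine that would distort badly if truncated.
t = np.arange(1000)
x24 = (300 * np.sin(2 * np.pi * 0.01 * t)).astype(np.int32)
x16 = requantize_with_dither(x24)
print(x16[:10])
```

Without the dither such a quiet signal would collapse into a coarse, correlated staircase at the new word length; with it, the quantizing error is converted into benign noise.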

When considering the authoring of interactive media such as games or virtual reality audio, there is a greater likelihood that the engineer, author, programmer and producer will have less control over the ultimate sound quality of what the consumer hears. This is because much of the sound material may be represented in the form of encoded ‘objects’ that will be rendered at the replay stage, as shown in Figure 18.12. Here the quality depends more on the quality of the consumer’s rendering engine, which may involve resynthesis of some elements, based on control data. This is a little like the situation that arises when distributing a song as a MIDI sound file, using General MIDI voices. The audible results, unless one uses downloadable sounds (and even then there is some potential for variation), depends on the method of synthesis and the precise nature of the voices available at the consumer end of the chain.

image

FIGURE 18.12 (a) In conventional audio production and delivery sources are combined and delivered at a fixed quality to the user, who simply has to replay the signal. The quality is limited by the resolution of the delivery link. (b) In some virtual and synthetic approaches the audio information is coded in the form of described objects that are rendered at the replay stage. Here the quality is strongly dependent on the capabilities of the rendering engine and the accuracy of description

Sound quality in audio codecs

It is increasingly common for digital audio signals to be processed using one or more perceptual audio codecs (see Chapter 8 for further details). This is typically done in order to limit the bit rate for transmission, internet delivery, or storage on portable devices such as phones and iPods. In mobile telephony and communications audio coding is often used to transmit speech signals at very low bit rates. Such coding normally has an effect on sound quality because the systems involved achieve the reduction in bit rate by allowing an increase in noise and distortion. However, the effect is a dynamic one and depends on the current signal and the perceptual processing algorithms employed to optimize the bit rate. Therefore the perceived effects on sound quality are often more difficult to describe, evaluate and measure than when a signal is affected by the simpler processes described in the previous section.

The aim of perceptual audio coding is to achieve the highest possible perceived quality at the bit rate in question. At the highest bit rates it is possible to achieve results that are often termed ‘transparent’ because they appear to convey the audio signal without audible changes in noise or distortion. As the bit rate is reduced there is an increasing likelihood that some of the effects of audio coding will be noticed perceptually, and these effects are normally referred to as coding artefacts. In the following section some of the most common coding artefacts will be considered, along with suggestions about how to keep them under control. Some tips to help minimize coding artefacts are included in Fact File 18.5.

Although the nature of coding artefacts depends to some extent on the type of codec, there are enough similarities between codecs to be able to generalize about this to some degree. This makes it possible to say that the most common artefacts can normally be grouped into categories involving coding noise of various types, bandwidth limitation, temporal smearing and spatial effects. Coding noise is generally increased quantizing noise that is modulated by various features of the audio signal. At high bit rates it is often possible to ensure that most or all of the noise resulting from requantization of signals is constrained so that it lies underneath the masking threshold of the audio signal, but at lower bit rates there may be regions of the spectrum where it becomes audible at certain times. The ‘swooshing’ effect of coding noise is the result of its changing spectrum, as shaped by the perceptual model employed, and this can be easily heard if the original uncoded signal is subtracted from a version that has been coded and decoded. It is remarkable to hear just how much noise has been added to the original signal by the encoding process when performing this revealing operation, yet to realize that the majority of it remains masked by the audio signal.
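This subtraction experiment is easy to reproduce in code, provided the decoded signal is first time-aligned with the original (codecs normally introduce a delay). A minimal sketch follows, in which the sample arrays and the delay value are hypothetical:

```python
import numpy as np

def coding_residual(original, decoded, delay_samples):
    # Return original minus time-aligned decoded signal: the coding noise.
    d = decoded[delay_samples:]
    n = min(len(original), len(d))
    return original[:n].astype(float) - d[:n].astype(float)

# With hypothetical sample arrays `orig` and `dec` and a known codec
# delay, e.g.:
#   residual = coding_residual(orig, dec, delay_samples=1105)
# Writing `residual` to a file and auditioning it reveals how much
# normally masked noise the encoder has added.
```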

FACT FILE 18.5 MINIMIZING CODING ARTEFACTS

As a rule, coding artefacts can be minimized by using a high-quality codec at as high a bit rate as possible. Joint stereo coding is one way of helping to reduce the impact of coding artefacts when using MP3 codecs at lower bit rates, but it does this by simplifying the spatial information. It is therefore partly a matter of trading off one type of distortion against another. (It seems to be generally true that timbral distortions are perceived as more objectionable than spatial distortions by the majority of listeners.)

Avoidance of long chains of codecs (‘cascading’) will also help to maintain high audio quality, as each generation of coding and decoding will add artefacts of its own that are compounded by the next stage. This danger is particularly prevalent in broadcasting systems where an audio signal may have been recorded in the field on a device employing some form of perceptual encoding, transmitted to the broadcasting center over a link employing another codec, and then coded again for final transmission.
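The generation-loss effect of cascading is easy to demonstrate with a command-line encoder. The sketch below uses ffmpeg (which must be installed) with its MP3 encoder; the file names, the number of generations and the choice of 128 kbit/s are arbitrary:

import subprocess

src = 'master.wav'  # hypothetical source recording
for gen in range(1, 5):
    mp3 = f'gen{gen}.mp3'
    wav = f'gen{gen}.wav'
    # Encode the current generation to MP3, then decode back to WAV.
    subprocess.run(['ffmpeg', '-y', '-i', src, '-codec:a', 'libmp3lame',
                    '-b:a', '128k', mp3], check=True)
    subprocess.run(['ffmpeg', '-y', '-i', mp3, wav], check=True)
    src = wav  # the next generation encodes the previous one's output

Comparing gen1.wav with gen4.wav by ear will usually reveal how artefacts compound with each pass.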

Limiting the bandwidth of audio signals to be encoded is another way of reducing annoying coding artefacts, although many codecs tend to do this automatically as the first ‘tool in the armory’ when trying to maximize quality at low bit rates. (A fixed limitation in bandwidth is generally perceived as less annoying than time-varying coding noises.) Under conditions of very low bit rate, such as when using streaming codecs for mobile or Internet applications, the audio signal may require some artificial enhancement such as compression or equalization to make it sound acceptable. This should be considered as a form of mastering that aims to optimize sound quality, taking into account the limitations of the transmission medium.

Bandwidth limitation can be either static or dynamic and gives rise to various perceivable artefacts. Static bandwidth limitation is sometimes used to restrict the frequency range over which a codec has to operate, in order to allow the available bits to be used more efficiently. Some codecs therefore operate at reduced overall bandwidths when low bit rates are needed, leading to the slightly muffled or ‘dull’ sound that results from a loss of very high frequencies. This is often done in preference to allowing the more unpleasant coding artefacts that would result from maintaining full audio bandwidth, because a lot of audio program material, particularly speech, does not suffer unduly from having the highest frequencies removed. The so-called ‘birdies’ artefact, on the other hand, is related to a dynamic form of spectral change that sounds like twittering birds in the high-frequency range. It tends to result from the changing presence or lack of energy in individual frequency bands, as determined by the perceptual algorithms employed in the codec.
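Static bandwidth limitation of the kind just described amounts to low-pass filtering before (or inside) the encoder. A minimal sketch follows, assuming a 16-bit WAV input; the 11 kHz cutoff and filter order are arbitrary examples, not values from any particular codec:

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, x = wavfile.read('input.wav')
# Eighth-order Butterworth low-pass, applied forwards and backwards.
sos = butter(8, 11000, btype='lowpass', fs=fs, output='sos')
y = sosfiltfilt(sos, x.astype(np.float64), axis=0)
wavfile.write('bandlimited.wav', fs, y.astype(np.int16))

The result exhibits the slightly ‘dull’ character mentioned above, but without the time-varying artefacts that more aggressive coding would introduce.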

Temporal smearing is mainly the result of audio being coded in blocks of a number of milliseconds at a time. This can have the effect of spreading noise over a period of time so that it becomes more audible, particularly at transients (when the signal energy rises or falls rapidly), or causing pre- and post-echoes (small repetitions of the signal either before or after its intended time). The main perceptual effect of this tends to be a blurring of transients and a dulling or ‘fuzzing’ of attacks in musical signals, which is most noticeable on percussive sounds. A well-known test signal that tends to reveal this effect is a recording of solo castanets, which can be found on the EBU SQAM (Sound Quality Assessment Material) test disk (see ‘Recommended further reading’, below).
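The block-based origin of this effect can be caricatured in code. In the toy model below (not a real codec), quantizing noise is shaped only block by block, so noise accompanying a transient spreads across the whole block and appears before the attack as a pre-echo:

import numpy as np

fs, block = 48000, 1024
x = np.zeros(16384)
x[7000] = 1.0                       # an isolated click (transient)

rng = np.random.default_rng(0)
y = x.copy()
for start in range(0, len(x), block):
    seg = x[start:start + block]
    # Noise level is tied to the energy of the whole block, as if set
    # by a per-block bit allocation.
    level = 0.05 * np.sqrt(np.mean(seg ** 2))
    y[start:start + block] += level * rng.standard_normal(len(seg))

# Samples of y that precede index 7000 but lie inside the same
# 1024-sample block now contain noise: the pre-echo.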

Many codecs act on stereophonic signals involving two or more channels, and often the available bits are shared between the channels using a variety of algorithms that minimize the information needed to transmit spatial information. These algorithms include intensity panning, M-S coding (see Chapter 16), and parametric spatial audio coding. Intensity panning and parametric spatial coding both simplify the spatial information used for representing source positions and diffuse spatial impression, boiling it down to a set of instructions about the interchannel relationships in terms of time, intensity or correlation. A simplified ‘downmix’, either to mono or two-channel stereo, can be transmitted and accompanied by the parameters needed to enable the original interchannel relationships to be reconstructed, as shown in Figure 18.13. This is the basic principle of MPEG spatial audio coding, for example. The degree to which this succeeds perceptually depends on the rate at which the spatial information can be transmitted and the accuracy with which the interchannel relationships can be reconstructed. When the bit rate is low, compromises have to be made, leading to potential distortions of the perceived spatial scene, which can include narrowing of the stereo image, blurring or movement of source locations, and a reduction in the sense of ‘spaciousness’ or envelopment arising from reverberation or other diffuse background sounds.
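The downmix-plus-parameters principle can be sketched as follows. This is a deliberate oversimplification of parametric stereo: a mono downmix plus one broadband level-difference parameter per block, whereas real codecs work per frequency band and also transmit time and correlation parameters:

import numpy as np

def encode(left, right, block=1024):
    """Transmit a mono downmix plus a per-block L/R level ratio."""
    mid = 0.5 * (left + right)                 # the downmix to transmit
    params = []
    for s in range(0, len(left), block):
        el = np.sum(left[s:s + block] ** 2) + 1e-12
        er = np.sum(right[s:s + block] ** 2) + 1e-12
        params.append(np.sqrt(el / er))        # amplitude ratio L/R
    return mid, params

def decode(mid, params, block=1024):
    """Rebuild a stereo approximation from downmix and parameters."""
    left = np.zeros_like(mid)
    right = np.zeros_like(mid)
    for i, g in enumerate(params):
        s = i * block
        seg = mid[s:s + block]
        # Split the downmix so the reconstructed L/R ratio equals g
        # while preserving the overall level (gains sum to 2).
        left[s:s + block] = seg * (2 * g / (1 + g))
        right[s:s + block] = seg * (2 / (1 + g))
    return left, right

The coarser the blocks and the fewer the parameters, the cheaper the transmission but the greater the spatial distortion, which is exactly the trade-off described above.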


FIGURE 18.13 Block diagram of a typical spatial audio codec.

RECOMMENDED FURTHER READING

AES, 2002. Perceptual audio coders: what to listen for. A tutorial CD-ROM on coding artefacts. Available from: http://www.aes.org/publications.

Blauert, J., 2008. Masterclass: Concepts in Sound Quality [online]. Audio Engineering Society, New York. Available at: www.aes.org/tutorials (Accessed 10 November 2008).

Bech, S., Zacharov, N., 2006. Perceptual Audio Evaluation: Theory, Method and Application. John Wiley.

EBU Tech. 3253, 1988. Sound Quality Assessment Material (SQAM). European Broadcasting Union, Geneva. (Most of the audio material on the SQAM disk can be downloaded freely at: http://www.ebu.ch/en/technical/publications/tech3000_series/tech3253/index.php).

Gabrielsson, A., Lindström, B., 1985. Perceived sound quality of high fidelity loudspeakers. J. Audio Eng. Soc. 33(1), 33–53.

ITU-R Recommendation BS.1116-1, 1994. Methods for Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems. International Telecommunications Union, Geneva.

ITU-R Recommendation BS.1534-1, 2003. Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems. International Telecommunications Union, Geneva.

Letowski, T., 1989. Sound quality assessment: cardinal concepts. Presented at the 87th Audio Engineering Society Convention, New York. Preprint 2825.

Olive, S., 2003. Differences in performance and preference of trained versus untrained listeners in loudspeaker tests: a case study. J. Audio Eng. Soc. 51(9), 806–825.

Toole, F., 1982. Listening tests: turning opinion into fact. J. Audio Eng. Soc. 30(6), 431–445.

Toole, F., Olive, S., 1994. Hearing is believing vs. believing is hearing: blind vs. sighted listening tests, and other interesting things. Presented at the 97th Audio Engineering Society Convention, San Francisco. Preprint 3894.

Toole, F., 2008. Sound Reproduction: Loudspeakers and Rooms. Focal Press.
