Chapter 9

Sound Field

Rozenn Nicol

Introduction

In general, stereophony and multichannel surround sound can be defined as loudspeaker- and listener-centric channel-based methods, wherein sound reproduction is based on a specific set of audio channels associated with a given loudspeaker setup. With these systems, each channel contributes to a focused sound image for a listener located in the sweet spot. In contrast to stereophony and multichannel surround sound, the sound field approach is based on a non-speaker-centric physical representation of the sound waves.

The term sound field refers to the capture, reproduction and description of sound waves. This is in contrast to binaural, stereo or surround sound systems, where the objective is to create perceived object(s), or auditory event(s), whose directional information is interpreted as spatial properties by the auditory system. With the sound field approach, the properties that are controlled to create or to reproduce sounds are the physical properties of sound waves, whereas with binaural, stereo or surround sound techniques, the properties under control are at the perceptual level. Sound field properties are linked to all of the acoustic phenomena encountered by the sound wave from its point of origin to its point of observation.

Free-field propagation is the simplest case, where the sound wave propagates in a straight line (at least for atmospheric propagation over a reasonably short distance). The source directivity and the distance of propagation, which causes a delay in the arrival time and a decrease in amplitude, are the most critical properties. These properties define the direct wave. Inside a room, or in the presence of any obstacle, the sound wave is affected by acoustic reflections, diffusion, scattering and/or diffraction, which result in the addition of a set of modified and delayed copies of the direct wave. Thus, a sound field is the superposition of all these components (i.e. direct wave, reflected wave, diffuse wave, diffracted wave, etc.). These components can be captured together by measuring the impulse response of a room. Each component is characterized by several parameters including arrival time, frequency content and incidence angle. These parameters describe the acoustic and geometric properties of the sound source(s) and its environment.

This chapter is divided into five parts. First, general ideas about the sound field approach and its development, starting from the coincident stereo microphone recording techniques introduced by Blumlein up to Ambisonics and Higher Order Ambisonics (HOA), are presented. Then capture methods, recording formats and reproduction of sound fields are discussed in further detail. The physical and mathematical tools connected with sound fields are given last, as further insight.

Development of the Sound Field

Starting Point: The X/Y and M/S Techniques

The initial development of the sound field approach can be traced to the pioneering work of Blumlein on the X/Y and M/S techniques for stereophonic recording (Blumlein, 1931). The X/Y pair is composed of two directional microphones, which are arranged perpendicularly. The M/S pair is composed of one microphone facing forward (the “Mid” or M signal) and one figure-of-eight (i.e. bidirectional) microphone oriented along the left/right axis (the “Side” or S signal). In each case, the two microphones are theoretically coincident, but in practice one is placed above the other. Since a cardioid microphone can be seen as the combination of one pressure microphone (i.e. omnidirectional) and one figure-of-eight microphone, it is easily shown that an X/Y pair of cardioid microphones and an M/S configuration (with a cardioid for the M microphone) are equivalent (Hibbing, 1989).

The Mid microphone picks up the frontal components of the sound field, whereas the Side microphone picks up the lateral components. By convention the left components have a positive phase, and the right ones a negative phase. Thus the two microphones achieve a kind of amplitude and phase encoding of the direction of the sound source. At the reproduction stage, this spatial encoding is used to restore the location of all the components of the sound field. Since the left components picked up by the S microphone are in phase with the frontal components, summing the outputs of the M and S microphones extracts the left part of the sound field. Conversely, since the right components are in opposite phase with the frontal ones, they are eliminated. In the same way, if the difference of the M and S outputs is computed instead of their sum, the right components are extracted and the left ones eliminated. The left (L) and right (R) signals for a stereophonic reproduction are thus derived from the M and S signals by the following matrix process:

L = M + S
R = M - S
(9.1)

This equation summarizes how spatial information (namely left/right separation) can be extracted from the M/S signals. The M/S pair can be understood as a first level of sound field recording, which is restricted to the horizontal plane. For a full discussion on the X/Y and M/S techniques, see Chapter 3.
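As an illustration, this matrixing reduces to a few lines of code. The following Python sketch (function names are illustrative; scaling conventions vary between implementations, some of which apply a factor of 1/2 or 1/√2 to avoid level build-up) converts an M/S pair into left/right signals and back:

```python
import numpy as np

def ms_to_lr(m, s):
    """Matrix an M/S pair into stereo left/right signals (Equation 9.1)."""
    return m + s, m - s

def lr_to_ms(left, right):
    """Inverse matrixing: recover the M/S pair from left/right signals."""
    return 0.5 * (left + right), 0.5 * (left - right)

# Toy example: a source slightly to the left yields in-phase M and S content,
# so the sum (left channel) is louder than the difference (right channel).
t = np.arange(48000) / 48000.0
m = np.sin(2 * np.pi * 440 * t)          # mid (frontal) component
s = 0.3 * np.sin(2 * np.pi * 440 * t)    # side component, positive phase = left
left, right = ms_to_lr(m, s)
```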

Ambisonics

Based on the previous work of Cooper, Shiga and Bruck, Michael Gerzon presented a system that would treat all directional sounds equally, representing both horizontal and vertical sounds. Gerzon (1973) believed that this could be achieved by recording values of sound pressure on the surface of a sphere equally from all directions. Realizing the practical limitations of the recording, broadcast and playback systems of the day, Gerzon suggested that a minimal number of channels and speakers could be used to fulfill the basic psychoacoustic requirements to perceive both horizontal and vertical sound by placing the listener in a space bounded by loudspeakers arranged in a cube (see Figure 9.1). He acknowledged that by increasing the number of spherical harmonics, the directional resolution of the system is improved; however, he desired a system that was practical. Gerzon later introduced Ambisonics technology as an alternative to channel-based stereo and surround systems. He defined a 4-channel system (using signals X, Y, Z and W discussed later in this chapter) that was compatible with stereo and quadraphonic surround playback systems yet could achieve, in his words, a “full spherical portrayal of directionality” (Gerzon, 1985, p. 859).

First Order Sound Field Capture

Acoustic pressure and particle velocity are two physical variables that are essential to fully describe a sound wave. The former is recorded by a pressure microphone, the latter typically by a ribbon figure-of-eight microphone. Sound field recording techniques lead to a full and accurate representation of sound waves.1 The first consequence is that the input signals of the loudspeakers are not always the discrete output signals of the microphones. A processing step (for example using an M/S matrix or an Ambisonics decoder) is required to extract the loudspeaker signals from the microphone signals. A second consequence is that sound reproduction is not limited to one single point, but can be extended to a large area, the boundary of which is determined by the accuracy of the sound field information recorded. Another important difference between sound field and conventional multichannel surround sound is that for sound field systems, all directions are equally considered without any frontal focus.

The tetrahedral microphone was proposed by Craven and Gerzon (Craven & Gerzon, 1977; Farrar, 1979a,b). It is composed of four cardioid2 microphones arranged in a tetrahedron (see Figure 9.2). The fundamental objective of the tetrahedral microphone is to acquire the components (W, X, Y, Z), which represent the 0th and 1st order components of the Spherical Harmonics expansion of the sound field. The most intuitive way to record these signals is to use one omnidirectional microphone (for the 0th order component W) and three figure-of-eight microphones (for the 1st order components X, Y and Z), which are pointed respectively along the x-, y- and z-axis. Four cardioid microphones arranged in a tetrahedron can be used instead, especially when the coincident setup of four microphones (three figure-of-eights and one omnidirectional) is not practical. The tetrahedral microphone (Craven & Gerzon, 1977) provides an elegant solution. It should be noted that the tetrahedral microphone does not directly deliver the components (W, X, Y, Z). A matrixing process (expressed by Equation 9.2) is required to derive these latter components from its output signals (LF, RF, LB, RB). This step is commonly referred to as the encoding.

Similar to the X/Y pair, each cardioid microphone can be decomposed into one omnidirectional and one figure-of-eight capsule. By appropriate recombination of the resulting elements, it is shown that the tetrahedral microphone is equivalent to one omnidirectional capsule (W component) and three figure-of-eight capsules (X, Y and Z components), oriented respectively along the x-axis, the y-axis and the z-axis (Farrar, 1979a); see Figure 9.3. All the capsules are theoretically coincident. The four cardioid microphones are referred to as LF (for Left Front), RF (for Right Front), LB (for Left Back) and RB (for Right Back), and the components (W, X, Y, Z) are expressed as a linear sum of the signals (LF, RF, LB, RB), as follows:

W = LF + LB + RF + RB
X = LF - LB + RF - RB
Y = LF + LB - RF - RB
Z = LF - LB - RF + RB
(9.2)

The components (W, X, Y, Z) allow us to reformulate the spatial analysis performed by the tetrahedral microphone. The W component corresponds to an omnidirectional recording of the sound field, and is a kind of “0th-order” analysis of spatial information. The X, Y and Z components provide respectively front/back, left/right and top/bottom separation. In the same way, the X/Y and M/S setups can be recomposed as the combination of one omnidirectional capsule and two figure-of-eight capsules, oriented along the x-axis and the y-axis respectively. Therefore the X/Y and M/S techniques can be considered as the 2D restriction of sound field recording.
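A minimal sketch of this A-format to B-format encoding (Equation 9.2) is given below in Python; the function name is illustrative, and the filters that practical encoders apply to compensate for the non-coincidence of the capsules are omitted:

```python
import numpy as np

def a_to_b_format(lf, rf, lb, rb):
    """Encode the four cardioid capsule signals (A-format) into first order
    Ambisonics components (B-format), following Equation 9.2."""
    w = lf + lb + rf + rb    # omnidirectional component (0th order)
    x = lf - lb + rf - rb    # front/back figure-of-eight
    y = lf + lb - rf - rb    # left/right figure-of-eight
    z = lf - lb - rf + rb    # top/bottom figure-of-eight
    return w, x, y, z
```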

As previously highlighted, a sound field has some specific features. First, it is composed of many components (i.e. direct and reflected waves), which all have temporal, frequency and spatial properties. Second, a sound field is defined intrinsically over an extensive area. As mentioned earlier, a minimum of two microphones is required to extract some spatial differential. More generally, a sound field can be captured by spatial sampling using a microphone array. However, this solution raises several problems. Analogous to time sampling, the minimum sampling rate is determined by the maximal spatial frequency contained in the sound field. For time sampling, the Shannon sampling theorem requires at least two samples per period. In the same way, for spatial sampling, two samples are needed per wavelength. For instance, at a frequency of 17 kHz, assuming that the speed of sound is 340 m/s, the wavelength is 2 cm. Thus, to properly sample frequencies as high as 17 kHz, a microphone spacing of at most 1 cm is required. Consequently, covering an extensive area would require a huge number of microphones. For these reasons, ideal spatial sampling of a sound field is hardly ever feasible.
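The spacing rule can be checked with a one-line computation (a sketch; the speed of sound value follows the text):

```python
SPEED_OF_SOUND = 340.0  # m/s, as assumed in the text

def max_spacing(f_max_hz):
    """Largest microphone spacing (in metres) satisfying the spatial Nyquist
    criterion of two samples per wavelength at the frequency f_max_hz."""
    wavelength = SPEED_OF_SOUND / f_max_hz
    return wavelength / 2

print(max_spacing(17000.0))  # -> 0.01 m, i.e. the 1 cm quoted in the text
```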

A promising alternative is sparse sampling by an irregular array of microphones. Methods of source separation and source localization are then required to extract all the sound components and the associated spatial information from the microphone signals (Gallo, Tsingos & Lemaitre, 2007). Current research indicates that the required post-processing is likely to cause audible artifacts. Most of the time, the audio quality of these systems does not meet the requirements of commercial sound engineers, but the related signal processing methods are constantly being improved.

As introduced in the previous section, the tetrahedral microphone, which is essentially a 3D extension of the X/Y and M/S microphone techniques, offers an effective and practical solution to record a sound field. Its concept can be explained in several ways. Intuitively, since the tetrahedral microphone is composed of four cardioid microphones regularly distributed over a sphere, the direction of each sound component is encoded by the differences in amplitude between the four capsules. In a more physical way, it is shown that the microphone outputs can be recombined to form the (W, X, Y, Z) signals, which correspond to the first components of the Spherical Harmonics expansion of the sound field (see Equations 9.39 and 9.41). As will be explained in the next section, with Higher Order Ambisonics (HOA), this concept is generalized to spherical arrays composed of a higher number of microphones, leading to the extraction of the Spherical Harmonics components of higher orders.

First Order Sound Field Reproduction

In the case of the M/S technique, we have seen that a matrix process (see Equation 9.1) is required to obtain the loudspeaker signals. A similar processing is needed for Ambisonics reproduction. The loudspeaker inputs are obtained as a weighted sum of the components (W, X, Y, Z). This matrix process is called decoding in Ambisonics terminology. Because Ambisonics is not a channel-centric reproduction system, it is not constrained to one single and standardized setup of loudspeakers. On the contrary, various loudspeaker layouts can be used, both in terms of the number and the position of the loudspeakers. In addition, for each configuration, there are several ways to decode the components (W, X, Y, Z) into the loudspeaker signals. This flexibility of sound field reproduction is a major advantage of Ambisonics. Ambisonics signals are even compatible with monophonic or stereophonic reproduction. The W component is equivalent to a monophonic recording of the sound field and is therefore useful for monophonic reproduction.

For stereophonic reproduction, there are several ways to matrix the signals (W, X, Y, Z) into plausible stereophonic signals. One solution is to derive the signals (Mv , Sv) corresponding to a virtual M/S pair:

Mv = W/√2 + X
Sv = Y
(9.3)

Then Equation 9.1 is used to compute stereophonic signals like for a real M/S recording. In the same way, a virtual X/Y recording can be simulated as:

Xv = (X + Y)/√2
Yv = (X − Y)/√2
(9.4)

The signals (Xv, Yv) directly feed the left and right loudspeakers.

For explicit Ambisonics reproduction, the decoding is less intuitive and depends on the loudspeaker layout. Even though the number and the positions of the loudspeakers can be freely chosen, some rules (mainly common sense) must be satisfied. First, since spatial information is represented by four signals (i.e. the components W, X, Y, Z), at least four loudspeakers are required (as shown in Figure 9.1). Second, the more regular the layout is, the simpler (and more robust) the decoding is. To reproduce the full 3D spatial information, a 3D array of loudspeakers is needed. A regular layout means that the loudspeakers are arranged as the vertices of a regular polyhedron, which corresponds to a regular sampling of a sphere. The simplest example is a cube. It is also possible to focus on spatial information in the horizontal plane, which leads to 2D Ambisonics reproduction, and for which only the (W, X, Y) components are needed. In that case, a regular layout corresponds to a circular array of equidistant loudspeakers. The simplest example is a square.

To illustrate the decoding process, we will examine the following two examples (see Figure 9.4): 2D reproduction over a square setup (i.e. four loudspeakers arranged in a square centered around the listener) and 3D reproduction over a cube setup (i.e. eight loudspeakers arranged as the vertices of a cube). The general idea of Ambisonics reproduction is that the spatial information is remapped over the loudspeaker array, so that the sum of the contributions of all the loudspeakers properly reconstructs the sound field in the listening area. The input signal of each loudspeaker is therefore a weighted sum of the components (W, X, Y, Z). The weights are defined as a function of the loudspeaker position. Several methods have been proposed to compute the decoding matrix (Gerzon, 1992; Daniel, 2001), which derives the loudspeaker signals from the components (W, X, Y, Z). Each method corresponds to specific properties of the sound field reconstruction. These aspects will be detailed later. Here we will present only one method, which is called basic decoding and in which the weights are simply the spatial coordinates of the loudspeakers. The loudspeaker located in the direction (ϕl, θl) is thus fed by the signal:

Ll = (1/NL) [W/√2 + X cos ϕl cos θl + Y sin ϕl cos θl + Z sin θl] (9.5)

For a square setup, the four loudspeaker signals are obtained as:

LRF = ½ (W + X − Y)
LLF = ½ (W + X + Y)
LLB = ½ (W − X + Y)
LRB = ½ (W − X − Y)
(9.6)

where the signals LRF , LLF , LLB, LRB refer respectively to the right front, the left front, the left back and the right back loudspeakers. In the same way, for a cube setup, the eight loudspeakers are fed by:

LRFU = ½ W + ½ (X − Y) + ½ Z
LLFU = ½ W + ½ (X + Y) + ½ Z
LLBU = ½ W + ½ (−X + Y) + ½ Z
LRBU = ½ W + ½ (−X − Y) + ½ Z
LRFD = ½ W + ½ (X − Y) − ½ Z
LLFD = ½ W + ½ (X + Y) − ½ Z
LLBD = ½ W + ½ (−X + Y) − ½ Z
LRBD = ½ W + ½ (−X − Y) − ½ Z
(9.7)

where the letters “U” and “D” refer respectively to the upward and downward loudspeakers.
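The basic decoding of Equation 9.5 can be sketched as follows in Python. The names are illustrative, and the W scaling follows Equation 9.5 as written here; other conventions scale W differently, so the feeds match Equations 9.6 and 9.7 only up to an overall gain:

```python
import numpy as np

def basic_decode(w, x, y, z, speaker_dirs):
    """Basic Ambisonics decoding (Equation 9.5): each loudspeaker feed is a
    weighted sum of (W, X, Y, Z), the weights being the direction cosines
    of that loudspeaker. speaker_dirs: list of (azimuth, elevation), radians."""
    n = len(speaker_dirs)
    return [(w / np.sqrt(2)
             + x * np.cos(az) * np.cos(el)
             + y * np.sin(az) * np.cos(el)
             + z * np.sin(el)) / n
            for az, el in speaker_dirs]

# Square layout of Figure 9.4: loudspeakers at azimuths +/-45 and +/-135 degrees.
square = [(np.pi / 4, 0.0), (3 * np.pi / 4, 0.0),
          (-3 * np.pi / 4, 0.0), (-np.pi / 4, 0.0)]
feeds = basic_decode(1.0, 0.5, 0.2, 0.0, square)
```

With the square layout above, the four feeds come out proportional to those of Equation 9.6.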

Two localization criteria, introduced by Gerzon, are used to estimate the perceived direction of the virtual sound sources reproduced by a loudspeaker array: the velocity vector and the energy vector (Gerzon, 1992). These criteria are derived from the localization model of Makita for a stereophonic system (Makita, 1962). If the unit vector xl refers to the direction of the l-th loudspeaker and s(l, ω) to its signal, the velocity (V) and energy (E) vectors are respectively defined by:

$$\mathbf{V}=\frac{\sum_{l=1}^{N_l}s(l,\omega)\,\mathbf{x}_l}{\sum_{l=1}^{N_l}s(l,\omega)}=r_V\,\mathbf{x}_V,\qquad \mathbf{E}=\frac{\sum_{l=1}^{N_l}\left|s(l,\omega)\right|^{2}\,\mathbf{x}_l}{\sum_{l=1}^{N_l}\left|s(l,\omega)\right|^{2}}=r_E\,\mathbf{x}_E\qquad(9.8)$$

The direction of each vector corresponds to the average direction of the energy arrival and can be interpreted as the spatial barycenter of the reproduced sound field. The norm of the vector reflects its spatial spread: if the norm is close to one, the sound energy is focused on only a few loudspeakers. The velocity vector is a low-frequency criterion, in the sense that the phase of the loudspeaker signals is taken into account, which is relevant only at low frequencies. The energy vector may be seen as its high-frequency counterpart, in which the energy of the loudspeaker signals is considered instead, since the auditory system is not sensitive to phase at high frequencies.
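A possible computation of these two criteria for a horizontal (2D) layout is sketched below in Python (names are illustrative; the gains are assumed real, as for a frequency-independent decoder):

```python
import numpy as np

def gerzon_vectors(gains, azimuths):
    """Velocity and energy vectors (Equation 9.8) for a horizontal layout.
    gains: real loudspeaker gains for a given virtual source.
    azimuths: loudspeaker azimuths in radians."""
    g = np.asarray(gains, dtype=float)
    u = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)  # unit vectors x_l
    v = (g[:, None] * u).sum(axis=0) / g.sum()                  # velocity vector
    e = ((g**2)[:, None] * u).sum(axis=0) / (g**2).sum()        # energy vector
    return v, np.linalg.norm(v), e, np.linalg.norm(e)

# Square layout, virtual source straight ahead between the two front speakers:
az = np.radians([45.0, 135.0, 225.0, 315.0])
v, r_v, e, r_e = gerzon_vectors([1.0, 0.0, 0.0, 1.0], az)
```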

The reproduction of a sound field raises some specific issues. The system has to create a complex acoustic wave, characterized by various time, frequency and spatial properties, over an extensive area. To do this, loudspeaker arrays are required. Each loudspeaker can be seen as a secondary source, which emits a wavelet, so that the sum of all the loudspeaker contributions leads to the reconstruction of the target sound field at any point in the reproduction area. The loudspeakers are called secondary sources in the sense that they create only a synthetic copy of the target sound field, as opposed to the real or virtual sound source which would have created the original sound field, and which is called the primary source for this reason.

Both the amplitude and the phase of the signal feeding each secondary source are controlled to accurately create the proper features of the reproduced sound field. To compute the loudspeaker signals, sound field synthesis, which is defined by either Equation 9.34 or 9.35, can be used. One example is the Wave Field Synthesis (WFS) method (discussed in Chapter 10).

Another more general method is sound field control, where the loudspeaker signals are obtained as the solution of a constrained optimization. More precisely, the reproduction system is composed of L loudspeakers and M error sensors, which are distributed at control points over the reproduction area. The error is computed as the difference between the target sound field and the sound field synthesized by the loudspeaker array. An error vector is defined as the set of errors evaluated at the locations of the M sensors. Ideally the error should be null. The objective is therefore to minimize the error vector with respect to the loudspeaker amplitudes (Gauthier & Berry, 2006). The loudspeaker signals are then obtained as the solution that minimizes a given cost function, which is expressed as a function of the quadratic error and may include a regularization term to prevent ill-conditioning problems.

Higher Order Ambisonics (HOA)

The tetrahedral microphone provides a tool allowing the capture of a full 3D sound field. However, since it is composed of only four capsules, its spatial resolution is low, which means that the discrimination between the sound components is not accurate. Intuitively, it seems relevant to generalize its concept by using a spherical array of microphones with a higher number of sensors. To ground this generalization, Bamford and Daniel pointed out the link between Ambisonics and Spherical Harmonics (Bamford, 1995; Daniel, 2001). Spherical Harmonics are spatial functions which allow one to represent any sound wave as a linear sum of directional components (see Equations 9.25 and 9.34). The omnidirectional component (W) is the Spherical Harmonic of 0th order, whereas the bidirectional ones (i.e. figure-of-eight, namely X, Y, Z) are the three Spherical Harmonics of 1st order. The components of orders higher than 1 have more complex directivities (see Figure 9.5). The (W, X, Y, Z) representation of a sound field can thus be extended by including Spherical Harmonics of higher orders. As the number of microphones increases, it is less and less possible to arrange them in a coincident setup. Furthermore, directional microphones corresponding to the directivity of the Spherical Harmonics of the highest orders do not exist. In a similar way as for the tetrahedral microphone, a convenient solution for spatial sampling of the sound field is a spherical array of cardioid microphones. The spatial information is extracted from the microphone outputs by a proper matrixing process, close to Equation 9.2, leading to the Higher Order Ambisonics (HOA) representation of the sound field, which is the extension of 1st order Ambisonics (i.e. the W, X, Y, Z components) to higher orders.

The number of microphones in the spherical array determines the maximal order that can be extracted. One example is the Eigenmike® (see Figure 9.6), which is composed of 32 microphones and allows HOA encoding up to the 4th order. For sound reproduction, a second step of appropriate decoding is needed to correctly map the spatial information contained in the HOA components onto the loudspeaker array, in order to compute the loudspeaker input signals.

In its original definition, Ambisonics is based on the Spherical Harmonics expansion limited to the 0th and 1st order components. HOA generalizes this concept by including components of order m greater than 1, as shown in (Bamford, 1995) and (Daniel, 2001). If the Spherical Harmonics expansion is truncated at the order m = M, the HOA representation of the acoustic pressure is composed of (M + 1)² components, which are the coefficients Bmnσ(ω) of the Spherical Harmonics expansion (see Equation 9.34). These HOA components convey spatial variation as a function of the azimuth and elevation angles (see Figure 9.5). Each order m is composed of (2m + 1) components with various directivities. It should be noted that some components are characterized by a null response in the horizontal plane. The consequence is that they do not contribute any horizontal spatial information. By contrast, the directivity of the remaining components is symmetrical about the horizontal plane. These latter components are referred to as the “2D Ambisonics components”, in the sense that, if the sound field reproduction is restricted to the horizontal plane (i.e. the loudspeaker setup is limited to the horizontal plane), only these components must be considered. On the contrary, if a full 3D reproduction is expected, all the components are used and the reproduction setup requires both horizontal and elevated loudspeakers to render height information.

The component of 0th order corresponds to the spatial equivalent of the DC component and is characterized by no spatial variation. In other words the 0th order component, W, is the monophonic recording of a sound field by a pressure microphone. The three 1st order components are characterized by a figure-of-eight variation (i.e. cosine or sine function). As the order increases, the spatial variation as a function of the angle becomes faster and faster, as illustrated in Figure 9.5. A first benefit of including components of higher order is therefore to enhance the spatial accuracy and the spatial definition (resolution) of the sound field representation. This is due to the increase of the high-frequency cutoff of the associated spatial spectrum.3 The resulting effect on the reproduced sound field is complex: both the size of the listening area and the bandwidth of the “time spectrum” are affected. Indeed, 1st order Ambisonics reproduction is penalized by the “sweet spot” phenomenon: the sound field is correctly reproduced only in the close vicinity of the center of the loudspeaker setup. In addition, for a given reproduction area, low frequencies, which are linked to large wavelengths and therefore to slow spatial variations, are better reconstructed than high frequencies. Adding Ambisonics components of order higher than M = 1 increases both the size of the listening area and the high-frequency cutoff of the time spectrum. Small movements of the listener are then allowed. In Figure 9.7, the sound field reproduced by Ambisonics systems of various orders is illustrated in the case of a plane wave. It is observed that a low-frequency plane wave (f = 250 Hz) is well reconstructed over a wide area by a 4th order system. If the frequency increases up to 1 kHz, the area of accurate reproduction shrinks considerably. An upgrade to a 19th order system is needed to achieve a listening area whose size is equivalent to that obtained by the 4th order system at f = 250 Hz. Thus, if the reproduced sound field is observed over a fixed area, the high-frequency cutoff increases with the maximal Ambisonics order M. In the same way, if the sound field is observed at a fixed frequency, the size of the area of accurate reproduction increases with the maximal Ambisonics order M. In (Ward & Abhayapala, 2001), a rule of thumb was proposed to estimate the reproduction order as a function of the wave number k and the radius r of the reproduction sphere, so that the truncation error stays below a threshold of 4%. The order M is obtained as:

M = ⌈kr⌉

where ⌈·⌉ denotes rounding up to the nearest integer. For instance, if we consider a radius of the reproduction sphere equal to 8.5 cm, which is close to the average radius of a human head, 1st order Ambisonics achieves valid reconstruction of the sound field (i.e. truncation error lower than 4%) only up to 637 Hz. To increase the frequency cutoff up to 16 kHz for the same area, HOA components up to order M = 25 must be included.
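Inverting this rule gives a small helper for the examples above (a sketch; since M = ⌈kr⌉, an order-M system remains valid as long as kr ≤ M):

```python
import math

def max_valid_frequency(order_m, radius_m, c=340.0):
    """Invert the rule of thumb M = ceil(kr): for a reproduction sphere of the
    given radius, an order-M system keeps the truncation error below roughly
    4% as long as kr <= M, i.e. up to the frequency returned here."""
    return order_m * c / (2 * math.pi * radius_m)

print(max_valid_frequency(1, 0.085))   # ~637 Hz for a head-sized sphere
print(max_valid_frequency(25, 0.085))  # ~15.9 kHz, i.e. roughly 16 kHz
```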

HOA Microphones

The concept of HOA is to represent the sound field by a series of signals Bmnσ(ω), which are the coefficients of the Spherical Harmonics expansion of the sound field (see Equation 9.34). The tetrahedral microphone is the most convenient solution to record the 0th and 1st order components Bmnσ(ω), but how can we record the components of order greater than 1 to upgrade to HOA? The solution is very close to the strategy adopted for the tetrahedral microphone. Intuitively, the Bmnσ(ω) components could be recorded by a set of directional microphones, the directivity of which is defined by the directivity of the Spherical Harmonics (see Figure 9.5). This solution, which is already impractical for the 0th and 1st order components, is even less feasible for the higher-order components, because the number of coincident microphones is higher and their directivities more complex. For HOA, it is suggested to use instead a spherical array of microphones (as the tetrahedral arrangement already is for the 1st order) and then to derive the Ambisonics components from the microphone outputs by appropriate matrixing (see Equation 9.2). This is the inspiration for the general concept of HOA microphones. The recording system is conceptually a spherical array of microphones, by which the acoustic pressure and the pressure gradient (or the acoustic velocity) over a sphere are captured. In addition, the microphone array is coupled to an encoding matrixing process to obtain the HOA components.

More precisely, the solution is based on the mathematical definition of the HOA components. They are defined as the coefficients of the Spherical Harmonics expansion. Any sound field can be developed as a linear and weighted sum of Spherical Harmonics, since Spherical Harmonics are the eigenfunctions of the acoustic wave equation, in the same way as any time function can be expressed as a linear and weighted sum of sine and cosine functions, which is called a “Fourier series expansion”. In other words, Spherical Harmonics are the equivalent of sines and cosines for spatial variations. Therefore, by definition of eigenfunctions, the coefficients of the Spherical Harmonics expansion are computed from the projection of the sound field over the orthonormal basis of Spherical Harmonics (see Equation 9.27). If Umnσ(ω) is the result of the projection of the acoustic pressure p(r, ϕ, θ, ω) on the Spherical Harmonic Ymnσ(ϕ,θ) over the sphere (see Equation 9.29), it is shown that the component Bmnσ(ω) is given by:

Bmnσ(ω)=Eq(m, kr)Umnσ(ω) (9.9)

where the term Eq(m, kr) = 1/(i^m jm(kr)) can be interpreted as an equalization. Thus, to obtain the Ambisonics components, the first step is to measure the acoustic pressure over the sphere of radius r, from which the signals Umnσ(ω) are computed. Then Equation 9.9 gives the Bmnσ(ω) signals.

Challenges of HOA

However, this process raises two main problems: one is the zeros of the spherical Bessel function of the first kind (i.e. jm(kr)), the other is the spatial sampling of the sound field. Indeed, whenever the function jm(kr) is equal to zero, the equalization term Eq(m, kr) no longer exists. Moreover, as soon as the function jm(kr) is close to zero, the equalization term Eq(m, kr) increases dramatically, which leads to a considerable amplification of the signals, and consequently to a degradation of the audio quality because of the resulting amplification of the microphone noise. One solution is to replace the acoustic pressure by another acoustic variable to describe the sound field, in order to modify the equalization term. For instance, instead of pressure microphones, cardioid microphones can be used. As already mentioned, the latter may be seen as a linear sum of a pressure microphone and a pressure gradient microphone. Thus the equalization term becomes (Moreau, Daniel & Bertet, 2006):

$$E_c(m,kr)=\frac{1}{i^{m}\left[\,j_m(kr)-\dfrac{i}{k}\dfrac{\partial j_m(kr)}{\partial r}\right]}\qquad(9.10)$$

In that case, the denominator is never null. More generally, the directivity function of the microphone can be modified (e.g. by introducing acoustic diffraction through a solid structure) in order to design an equalization term Eq(m, kr) in accordance with expected properties (Epain & Daniel, 2008).
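The contrast between the two equalization terms can be checked numerically. The sketch below assumes the cardioid weighting is written as jm(kr) − i jm′(kr), with the derivative taken with respect to the argument kr, which is equivalent to the form of Equation 9.10:

```python
import numpy as np
from scipy.special import spherical_jn

def eq_pressure(m, kr):
    """Equalization for pressure capsules: Eq(m, kr) = 1 / (i^m jm(kr)).
    Blows up near the zeros of the spherical Bessel function jm."""
    return 1.0 / (1j**m * spherical_jn(m, kr))

def eq_cardioid(m, kr):
    """Equalization for cardioid capsules (Equation 9.10), written here as
    1 / (i^m [jm(kr) - i jm'(kr)]), the derivative being taken with respect
    to the argument kr. The denominator never vanishes."""
    denom = spherical_jn(m, kr) - 1j * spherical_jn(m, kr, derivative=True)
    return 1.0 / (1j**m * denom)

kr = np.pi  # j0(pi) = 0: pressure equalization explodes, cardioid stays finite
print(abs(eq_pressure(0, kr)), abs(eq_cardioid(0, kr)))
```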

As for the second issue, ideally, the Ambisonics components should be derived from the knowledge of the continuous sound field over the sphere of radius r. However, in practice, the acoustic signals cannot be measured at any point, but only at a finite set of locations defined by a microphone array, which involves spatial sampling. The microphones are distributed over a sphere and form a spherical array. The main question is to choose their locations properly in order to optimally capture the sound field information, so that the signals Bmnσ(ω) can be accurately estimated. The solution is the best compromise between the following constraints: minimizing the estimation error of the signals Bmnσ(ω), using a minimal number of microphones and achieving a feasible geometry of the microphone array. By analogy with time sampling, spatial sampling is possible under the assumption that the sound field is spatially band-limited, i.e. that the components Bmnσ(ω) are null for any order m greater than a maximal value mmax. If this condition is satisfied, the sound field can be correctly sampled and the signals Bmnσ(ω) exactly estimated, provided that the azimuth and elevation angles are regularly and separately sampled (Driscoll & Healy, 1994). The drawback of this solution is a high number of microphones: for instance, to record the sound field up to the order M, at least 4(M + 1)² microphones are required. To decrease the number Nc of microphones, an approximation is proposed. Assuming that cardioid microphones are used, the output of the qth cardioid microphone located at (r, ϕq, θq) is given, in accordance with the theoretical definition of the cardioid directivity, by:

$$c(q,\omega)\equiv c(r,\phi_q,\theta_q,\omega)=p(r,\phi_q,\theta_q,\omega)-\frac{i}{k}\,\frac{\partial p(r,\phi_q,\theta_q,\omega)}{\partial n},\qquad q\in[1,\dots,N_c]\qquad(9.11)$$

Instead of computing the signals Bmnσ(ω) from the projection over the orthonormal basis of Spherical Harmonics, the Ambisonics components are derived by replacing the acoustic pressure by its Spherical Harmonics expansion (see Equation 9.34) in Equation 9.11:

$$c(q,\omega)=\sum_{m=0}^{M}i^{m}\left[\,j_m(kr)-\frac{i}{k}\frac{\partial j_m(kr)}{\partial r}\right]\sum_{n=0}^{m}\sum_{\sigma=\pm1}B_{mn}^{\sigma}(\omega)\,Y_{mn}^{\sigma}(\phi_q,\theta_q)\qquad(9.12)$$

This equation defines a system of linear equations, which can be reformulated in a matrix form:

C = YcWcB (9.13)

where the terms C, B, Yc and Wc are given by:

$$\mathbf{C}=\begin{bmatrix}c(1,\omega)\\ c(2,\omega)\\ \vdots\\ c(N_c,\omega)\end{bmatrix},\qquad \mathbf{B}=\begin{bmatrix}B_{00}^{+1}(\omega)\\ B_{11}^{+1}(\omega)\\ \vdots\\ B_{MM}^{-1}(\omega)\end{bmatrix},\qquad \mathbf{Y}_c=\begin{bmatrix}Y_{00}^{+1}(\phi_1,\theta_1)&Y_{11}^{+1}(\phi_1,\theta_1)&\cdots&Y_{MM}^{-1}(\phi_1,\theta_1)\\ Y_{00}^{+1}(\phi_2,\theta_2)&Y_{11}^{+1}(\phi_2,\theta_2)&\cdots&Y_{MM}^{-1}(\phi_2,\theta_2)\\ \vdots&\vdots&\ddots&\vdots\\ Y_{00}^{+1}(\phi_{N_c},\theta_{N_c})&Y_{11}^{+1}(\phi_{N_c},\theta_{N_c})&\cdots&Y_{MM}^{-1}(\phi_{N_c},\theta_{N_c})\end{bmatrix},$$

$$\mathbf{W}_c=\begin{bmatrix}j_0(kr)-\frac{i}{k}\frac{\partial j_0(kr)}{\partial r}&0&\cdots&0\\ 0&i\left[j_1(kr)-\frac{i}{k}\frac{\partial j_1(kr)}{\partial r}\right]&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&i^{M}\left[j_M(kr)-\frac{i}{k}\frac{\partial j_M(kr)}{\partial r}\right]\end{bmatrix}$$

Provided that the number of microphones Nc (i.e. the number of equations) is greater than or equal to the number of unknowns (i.e. the number of Ambisonics components, (M + 1)²), Equation 9.13 can be solved by using the Moore-Penrose pseudoinverse of Yc:

$$\mathbf{B}=\mathbf{E}_c\left(\mathbf{Y}_c^{t}\mathbf{Y}_c\right)^{-1}\mathbf{Y}_c^{t}\,\mathbf{C}\qquad(9.14)$$

where Ec = diag(Ec(0, kr), Ec(1, kr), …, Ec(M, kr)) is the diagonal matrix of equalization terms and Yc^t refers to the transpose conjugate of Yc. It should be kept in mind that this solution is just an approximate estimate of the Ambisonics components Bmnσ(ω). Errors are introduced for instance by the internal noise, or by possible mispositioning of the microphones. The matrix Ec can cause instability, which is minimized by a regularization method (Moreau et al., 2006). Besides, in Equation 9.14, the term Yc^t Yc is of particular interest: if the spatial sampling of the sound field (i.e. the geometry of the microphone array) preserves the orthonormality property of Spherical Harmonics (see Equation 9.27), this term simplifies to Yc^t Yc = I (where I is the identity matrix). In that case, Equation 9.14 turns out to be the sampled version of Equation 9.9, in which cardioid microphones are considered instead of pressure microphones. If the geometry of the microphone array is arbitrarily chosen, the orthonormality property is generally not satisfied, and the deviation of Yc^t Yc from the identity matrix quantifies the amount of spatial aliasing (i.e. the orthonormality error). However it is difficult to find a geometry for which Yc^t Yc = I holds exactly. Regular and semi-regular polyhedrons provide solutions which are valid only up to a maximal order mmax (Moreau et al., 2006). Thus the issue of the microphone array geometry must be carefully examined.
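A regularized version of Equation 9.14 can be sketched in a few lines; the Tikhonov term and its value are illustrative stand-ins for the regularization method of Moreau et al. (2006):

```python
import numpy as np

def encode_hoa(capsule_spectra, Y_c, E_c, reg=1e-3):
    """Estimate the HOA components at one frequency bin by a regularized
    pseudoinverse, a sketch of Equation 9.14.
    capsule_spectra: (Nc,) complex capsule signals.
    Y_c: (Nc, (M+1)^2) spherical harmonics sampled at the capsule directions.
    E_c: ((M+1)^2,) equalization terms (the order-m value repeated for each
         of its 2m+1 components).
    reg: Tikhonov term guarding against ill-conditioning of Y_c^t Y_c."""
    Yt = Y_c.conj().T
    gram = Yt @ Y_c + reg * np.eye(Y_c.shape[1])      # regularized Y_c^t Y_c
    U = np.linalg.solve(gram, Yt @ capsule_spectra)   # projection coefficients
    return E_c * U                                    # equalized HOA components
```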

HOA in Practice

In practice, when designing an HOA microphone, the first step is to choose the desired maximal order M of the Spherical Harmonics expansion. The value of M then imposes the minimal number of microphones: Nc = (M + 1)². The third step is to find an array geometry (preferably a semi-regular polyhedron) composed of at least Nc elements, which minimizes the orthonormality error. The fourth step concerns the radius r of the microphone array, which affects the equalization term of Equation 9.10. The optimal radius is a compromise between minimizing spatial aliasing, which favors a small r, and keeping a reliable estimation of the Ambisonics components at low frequencies, which favors a large r (Moreau et al., 2006). The tetrahedral microphone is one example of such a design for M = 1. It is composed of Nc = (M + 1)² = 4 microphones arranged in a tetrahedron, which is the simplest regular polyhedron. This setup implicitly provides a coarse spatial sampling of the sound field and thus allows one to estimate the 0th and 1st order Ambisonics components.

An alternative solution was proposed in (Zotkin, Duraiswami & Gumerov, 2010). Broadly speaking, the idea is to find the set of plane waves that best explains the sound pressure collected by the set of microphones. The process is very close to Ambisonics encoding: a weighting matrix is applied to the microphone signals to derive the coefficients of the plane wave expansion.

HOA: A Promising Format for Sound Field Description

The HOA representation of the sound field (i.e. the signals Bmnσ) is an attractive format to describe a sound scene. This format has three valuable properties: it is generic, universal and scalable.

First, the matrixing process to compute the loudspeaker signals for a given layout may be interpreted as a transcoding of Ambisonics components into the domain of the loudspeakers (see Equations 9.15 and 9.19). Thus the HOA representation is really a generic format, where generic means that it is independent of both the recording format (i.e. microphone signals) and the reproduction format (i.e. loudspeaker signals). The advantage is that any editing or post-production of the sound scene (e.g. spatial transformations such as rotation, or angular distortions like forward dominance) is preferably done in the HOA domain (Daniel, 2009), and thus does not require computing a new set of loudspeaker signals. The loudspeaker input signals are decoded only once, during sound field reproduction. HOA signals are therefore relevant for data storage.

Second, the HOA format is universal in that it is an exact representation of the acoustic wave, provided that all the terms of the Spherical Harmonics expansion are kept up to infinity.

Additionally, it is valid for any acoustic wave, whatever its spatial and propagation properties. Errors come only from the limitation of the Spherical Harmonics expansion to a finite order M, and from the estimation of the Bmnσ(ω) signals by HOA microphones.

Third, scalability means that the HOA components of the lowest orders already convey a full description of the sound scene. At the extreme, the 0th order HOA component (i.e. the W component, which is equivalent to a monophonic recording) is a full representation of the sound field (though with absolutely no spatial information), in the sense that a listener is able to listen to it and to interpret it as a coherent sound scene composed of various acoustic sources.

Adding components of higher orders improves only the spatial definition of the sound field. Consequently, at any time, it is possible to discard the highest-order components, in order to adapt the maximal order M of the HOA signals to the available bitrate of transmission/storage, or to the configuration of the listening setup (number of loudspeakers).

Ambisonics and HOA Reproduction

HOA reproduction aims at synthesizing the original sound field p with a loudspeaker array. For this, the Ambisonics components Bmnσ(ω) have to be mixed together to build the proper input signal for each loudspeaker, so that the resulting synthetic sound field p̂ matches the original one as closely as possible. The input loudspeaker signals are derived from the Bmnσ(ω) signals by using an Nl × (M + 1)² decoding matrix D, defined by:

S = DB (9.15)

where S is the vector composed of the Nl loudspeaker signals: S = [s(1, ω) … s(l, ω) … s(Nl, ω)]^t. The decoding matrix D is computed by equating the synthetic sound field to the original one. The former is given by:

$$\hat{p}(\mathbf{r},\omega)=\sum_{l=1}^{N_l}s(l,\omega)\,p_l(\mathbf{r},\omega)\qquad(9.16)$$

where pl refers to the elementary acoustic wave emitted by the l-th loudspeaker. This elementary wave can be developed over the Spherical Harmonics basis in accordance with Equation 9.34, which leads to:

$$\hat{p}(\mathbf{r},\omega)=\sum_{l=1}^{N_l}s(l,\omega)\sum_{m=0}^{+\infty}i^{m}j_m(kr)\sum_{n=0}^{m}\sum_{\sigma=\pm1}L_{mn}^{\sigma}(l,\omega)\,Y_{mn}^{\sigma}(\phi,\theta)\qquad(9.17)$$

Thus, the synthetic wave exactly matches the target sound field p if and only if the coefficients of their respective Spherical Harmonics expansions are equal, that is to say:

B = LS (9.18)

where $$\mathbf{L}=\begin{bmatrix}L_{00}^{+1}(\phi_1,\theta_1)&L_{00}^{+1}(\phi_2,\theta_2)&\cdots&L_{00}^{+1}(\phi_{N_l},\theta_{N_l})\\ L_{11}^{+1}(\phi_1,\theta_1)&L_{11}^{+1}(\phi_2,\theta_2)&\cdots&L_{11}^{+1}(\phi_{N_l},\theta_{N_l})\\ \vdots&\vdots&\ddots&\vdots\\ L_{MM}^{-1}(\phi_1,\theta_1)&L_{MM}^{-1}(\phi_2,\theta_2)&\cdots&L_{MM}^{-1}(\phi_{N_l},\theta_{N_l})\end{bmatrix}.$$

This matrix L represents the coefficients of the Spherical Harmonics expansion of the wave emitted by each loudspeaker. In other words, the matrix L contains the information about the spatial coordinates of the loudspeakers. This is needed for the remapping of the spatial information of the sound field over the loudspeaker array.

Equation 9.18 defines a system of (M + 1)² linear equations with Nl unknowns (the loudspeaker input signals). In practice, it is recommended to choose the number of loudspeakers as Nl = (M + 1)², which ensures an optimal reproduction of the sound field (Poletti, 2005). Consequently, if Nl < (M + 1)², it is preferable to discard the Ambisonics components of the highest orders (mmax < m ≤ M), so that Nl = (mmax + 1)². On the contrary, if Nl > (M + 1)², part of the loudspeakers should be muted to keep only (M + 1)² of them. Otherwise the reproduced sound field is likely to be unstable and some auditory artifacts like “phasiness” are observed (Daniel, 2009).

Decoding Matrix

Theoretically any arbitrary geometry of the loudspeaker array can be chosen, which is a remarkable advantage of HOA reproduction in contrast with channel-based formats, such as 5.1 or 22.2. The decoding matrix D is responsible for the adaptation of the HOA components Bmnσ(ω) to the loudspeaker signals (i.e. from the Spherical Harmonics domain to the “loudspeaker domain”), and is able to compensate for any loudspeaker layout. Indeed the decoding matrix takes into account both the location and the acoustic radiation of the loudspeakers through the matrix L (see Equation 9.18). For instance, if the waves emitted by the loudspeakers are assumed to be plane waves (far-field assumption), the matrix L is given by (see Equation 9.40):

$$\mathbf{L}=\mathbf{Y}_l=\begin{bmatrix}Y_{00}^{+1}(\phi_1,\theta_1)&Y_{00}^{+1}(\phi_2,\theta_2)&\cdots&Y_{00}^{+1}(\phi_{N_l},\theta_{N_l})\\ Y_{11}^{+1}(\phi_1,\theta_1)&Y_{11}^{+1}(\phi_2,\theta_2)&\cdots&Y_{11}^{+1}(\phi_{N_l},\theta_{N_l})\\ \vdots&\vdots&\ddots&\vdots\\ Y_{MM}^{-1}(\phi_1,\theta_1)&Y_{MM}^{-1}(\phi_2,\theta_2)&\cdots&Y_{MM}^{-1}(\phi_{N_l},\theta_{N_l})\end{bmatrix}\qquad(9.19)$$

Even if the geometry of the loudspeaker array is free, a regular layout is preferably chosen, since a regular array leads to a decoding matrix that is mathematically simpler and more stable. Therefore, the loudspeakers are generally distributed on the surface of a sphere of radius rl. In that case, if spherical waves are considered instead of plane waves, the matrix L becomes L = Wl Yl (Morse & Feshbach, 1953; Morse & Ingard, 1968; Daniel, 2001), with:

$$\mathbf{W}_l=\begin{bmatrix}\dfrac{k\,h_0^{-}(kr_l)}{i}&0&\cdots&0\\ 0&\dfrac{k\,h_1^{-}(kr_l)}{i^{2}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\dfrac{k\,h_M^{-}(kr_l)}{i^{\,M+1}}\end{bmatrix}\qquad(9.20)$$

where hm− denotes the divergent (outgoing) spherical Hankel function.

The exact positioning of the loudspeakers on the sphere must be regular, in the sense that the orthonormality property of Spherical Harmonics is preserved by the resulting spatial sampling (see Equation 9.27). In other words, the loudspeakers have to be arranged at the vertices of a regular or semi-regular polyhedron, so that ideally Yl^t Yl = I. If the reproduction is restricted to the horizontal plane (i.e. 2D reproduction), this requirement is achieved by a circular array of equally spaced loudspeakers. In the case of a regular setup satisfying Yl^t Yl = I, the decoding matrix turns out to be D = L^t under the assumption of plane waves.
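For a regular 2D (circular) layout under the plane-wave assumption, the decoding matrix can be built as follows (a sketch; 2D reproduction uses the cylindrical harmonics 1, cos mϕ, sin mϕ rather than the full Spherical Harmonics, and for a regular array the pseudoinverse reduces to a scaled transpose):

```python
import numpy as np

def decoding_matrix_2d(order, n_speakers):
    """Basic decoding matrix for a regular circular layout under the
    plane-wave assumption. 2D reproduction uses the cylindrical harmonics
    1, cos(m phi), sin(m phi); for a regular array the pseudoinverse of the
    re-encoding matrix reduces to a scaled transpose."""
    az = 2 * np.pi * np.arange(n_speakers) / n_speakers
    rows = [np.ones(n_speakers)]
    for m in range(1, order + 1):
        rows += [np.cos(m * az), np.sin(m * az)]
    L = np.stack(rows)           # (2*order+1, n_speakers) re-encoding matrix
    return np.linalg.pinv(L)     # (n_speakers, 2*order+1) decoding matrix D

D = decoding_matrix_2d(order=1, n_speakers=4)  # 1st order, square layout
```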

Decoding Rules

The decoding matrix solving Equation 9.18 is one strategy of HOA reproduction, by which a perfect reconstruction of the sound field is intended (i.e. a perfect match between the recorded sound field and the reproduced one), and which is termed basic decoding. There are many alternatives (Daniel, 2001). The velocity and energy vectors, which were presented in the first section (see Equation 9.8), are useful to optimize the sound field reproduction (Gerzon, 1992). Particularly when the frequency increases, perfect reconstruction is no longer achievable, at least over an extensive listening area. Instead, approximate solutions, which are computed by optimizing a given set of constraints, allow one to improve the sound field rendering. Thus, the velocity and energy criteria can be used by the optimization process as specific constraints. For example, the maximum rE constraint favors the loudspeakers which are closest to the direction of the virtual sound source, by maximizing the norm of the energy vector. In the same way, the in-phase constraint mutes the loudspeakers opposite to the direction of the virtual sound source, which improves the sound field rendering for off-centered listeners. These are examples of alternative decoding rules to compute the decoding matrix. In practice, different decoding rules can be applied as a function of frequency, e.g. basic decoding for low frequencies and maximum rE decoding for high frequencies.
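As an illustration, the max rE constraint is commonly implemented by weighting each order m of the components before basic decoding; for 2D systems Daniel (2001) gives the closed form gm = cos(mπ/(2M + 2)), sketched below:

```python
import numpy as np

def max_re_weights_2d(order):
    """Per-order gains implementing the max rE constraint for a 2D system,
    using the closed form g_m = cos(m pi / (2M + 2)) from Daniel (2001).
    The taper attenuates the highest orders, concentrating energy on the
    loudspeakers nearest the virtual source direction."""
    m = np.arange(order + 1)
    return np.cos(m * np.pi / (2 * order + 2))

print(max_re_weights_2d(3))  # approx. [1.0, 0.924, 0.707, 0.383]
```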

Sound Field Synthesis

In order to illustrate HOA reproduction of a sound field, Figure 9.8 depicts the acoustic pressure wave synthesized by a 2D circular loudspeaker array of radius rl = 3 m. The target sound field is a harmonic plane wave coming from azimuth ϕ = 60° at the frequency f = 1 kHz. The loudspeaker signals are computed from the theoretical HOA components (see Equation 9.40) through a basic decoding, assuming that the loudspeakers emit plane waves. The number of loudspeakers is set close to the optimal value Nl = 2M + 1. The wave synthesized by the 4th order HOA system is accurate only in the immediate vicinity of the center of the loudspeaker array. When upgrading the HOA synthesis up to order M = 19, the expected plane wave is correctly reproduced over almost all the area inside the loudspeaker array. The benefit of higher orders in enlarging the listening area is clearly demonstrated.

To assess the performance of the sound field reproduction in terms of the perceived localization of virtual sound sources, the ITD (Interaural Time Difference) and ILD (Interaural Level Difference) values are estimated for a set of locations within the listening area (see Figure 9.9). For this simulation, the wave propagation between each loudspeaker and the listener’s ears includes HRTFs (Head Related Transfer Functions) to account for the acoustic diffraction of the acoustic wave by the listener’s morphology (particularly the pinna). Ideally the ITD and ILD are respectively around −500 µs and −12 dB, assuming that the listener is facing the 0° direction. In Figure 9.9, it is observed that for most of the locations, the lateralization (i.e. perceived azimuth) of the virtual sound source is roughly correct, even though the exact localization is not accurate. However the ITD is affected by many spatial instabilities, and these artifacts are reduced when the HOA order M increases. In addition, a third criterion, the ISSD (Inter-Subject Spectral Difference), is estimated to quantify the spectral distortion. The ISSD is a measure of the dissimilarity between two magnitude spectra (i.e. the target spectrum and the reproduced one at a given location) and is defined as the variance of the difference of their dB-magnitudes (Middlebrooks, 1999). Figure 9.9 shows that the spectral distortion is high for almost all locations in the listening area, which indicates timbre artifacts.

Figure 9.10 and Figure 9.11 illustrate the reproduction of a spherical wave, which would have been emitted by a sound source located outside the loudspeaker array. The 4th order HOA system does not succeed in correctly synthesizing the curvature of the wavefront. Higher orders up to around M = 19 are required for a better reconstruction of the spherical wave within the listening area. As for the localization cues, the ILD is better reproduced than the ITD, which exhibits strong spatial inhomogeneities. The same behavior is observed in the case of the reproduction of a spherical wave emitted by a sound source located this time inside the loudspeaker array (see Figure 9.12 and Figure 9.13). Particularly, only the 19th order HOA system is able to accurately reconstruct the spherical wavefront.

Sound Field Formats

A-, B-, C- and D-Formats From Ambisonics Terminology

By format is meant any representation of the sound field by a set of signals. From recording to reproduction, there are potentially various formats. The output signals of the microphones can be seen as a recording format, which may differ from the reproduction format defined by the input signals of the loudspeakers. This is the case for Ambisonics and HOA technology. In the specific case of the 1st order representation (i.e. Ambisonics), the output signals of the tetrahedral microphone (i.e. the LF, RF, LB, RB signals) are referred to as the A-format, which is a recording format and should be distinguished from the B-format, which consists of the components (W, X, Y, Z). The B-format is fully independent of the recording setup and the reproduction layout. The set of input signals to the loudspeakers defines the D-format, also called G-format. Initially, the G-format corresponded to the specific loudspeaker signals computed for a 5.1 layout, but it found general use for any loudspeaker configuration.

The C-format, most often referred to as the UHJ format, is another Ambisonics format, which was introduced for broadcast and diffusion purposes (CD, DVD, television or radio), “C” meaning “Consumer” (Gerzon, 1985). The main goal of the UHJ format is to provide signals that are directly compatible with conventional reproduction systems, namely monophonic and stereophonic reproduction. It is composed of at least two signals: the left (L) and right (R) stereophonic signals. Two additional signals T and Q can be added to increase the spatial accuracy in the horizontal plane (T) and to convey height information (Q). More precisely, the encoding of the B-format into the UHJ format is based on a set of six signals: on the one hand, the sum (S) and difference (D) signals:

S = 0.9396926 W + 0.1855740 X
D = j(−0.3420201 W + 0.5098604 X) + 0.6554516 Y
(9.21)

and, on the other hand, the (L, R, T, Q) signals:

L = 0.5 (S + D)
R = 0.5 (S − D)
T = j(−0.1432 W + 0.6512 X) − 0.7071 Y
Q = 0.9772 Z
(9.22)

where j denotes a +90° phase shift.

To obtain the L and R stereophonic signals, the matrixing equations are inspired by the M/S technique (Equation 9.1). Monophonic reproduction uses only the signal S. The matrixing equations to go back to the B-format from the UHJ format are:

S = 0.5 (L + R)
D = 0.5 (L − R)
W = 0.982 S + 0.197 j(0.828 D + 0.768 T)
X = 0.419 S − j(0.828 D + 0.768 T)
Y = 0.187 j S + (0.796 D − 0.676 T)
Z = 1.023 Q
(9.23)
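A sketch of the UHJ encoding (Equations 9.21 and 9.22) in Python is given below; the wideband +90° phase shift j is approximated with the Hilbert transform, and the function names are illustrative:

```python
import numpy as np
from scipy.signal import hilbert

def phase_shift_90(x):
    """Wideband +90 degree phase shift (the 'j' operator of Equations
    9.21-9.23), approximated with the Hilbert transform."""
    return -np.imag(hilbert(x))

def bformat_to_uhj(w, x, y, z):
    """Encode B-format (W, X, Y, Z) into 4-channel UHJ (L, R, T, Q),
    following Equations 9.21 and 9.22."""
    s = 0.9396926 * w + 0.1855740 * x
    d = phase_shift_90(-0.3420201 * w + 0.5098604 * x) + 0.6554516 * y
    left = 0.5 * (s + d)
    right = 0.5 * (s - d)
    t = phase_shift_90(-0.1432 * w + 0.6512 * x) - 0.7071 * y
    q = 0.9772 * z
    return left, right, t, q
```

Playing back only L and R gives the stereo-compatible 2-channel UHJ, while T and Q restore horizontal accuracy and height when all four channels are available.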

These concepts also hold for HOA. Recording is performed by a spherical array of microphones, which provides a spatial sampling of the acoustic pressure over a sphere and delivers the recording format. Then, from the microphone signals (see Equation 9.14), the HOA components Bmnσ(ω) are derived (spatial encoding), which defines a kind of kernel format in the Ambisonics chain. This format is independent of both the recording and the reproduction setup, and can therefore be considered as a generic representation of the sound field. In the end, the loudspeaker inputs are computed by matrixing the signals Bmnσ(ω) to recombine the spatial information in an appropriate way (re-encoding or transcoding) for a proper synthesis of the sound field by the loudspeaker array. In the current terminology of audio formats, the signals Bmnσ(ω) are typically referred to as a sound field–based format, as opposed to the channel-based format (e.g. 5.1, 10.2 or 22.2), which is composed of the loudspeaker signals, and the object-based format, in which the sound scene is described as a set of elementary components (i.e. sound sources) in combination with their spatial coordinates and trajectories (Geier, Ahrens & Spors, 2010; Bleidt et al., 2014).

Conclusion

This chapter was dedicated to sound field, which means that our main concern was the reproduction of an acoustic wave over an extended listening area. Three main questions were investigated: how to record, how to represent and how to reproduce a sound field. After an overview of the general concepts, these investigations were illustrated in the specific case of HOA technology (microphones, format, encoding, decoding matrix, loudspeaker setup). Examples of HOA reproduction were finally given for the case of plane and spherical waves.

In practice, the quality of the actual source capture in an HOA recording will strongly affect the reproduced outcome. Microphone parameters like phase accuracy, amplitude and frequency response matching, and signal-to-noise ratio are all important factors to consider. The quality of the HOA playback systems and the listening environments will have a great effect as well. Currently, the high demand for microphones and playback systems to support new immersive audio consumer formats is driving innovation and improvements to HOA technology as engineers, content providers and consumers look to sound field technology. Further assessment of HOA reproduction in terms of perceived quality is also needed. In most of the existing systems, the highest order is M = 4 or 5. One question, which is still unsolved, concerns the limit order beyond which the improvement of the sound field reconstruction is no longer perceptible. But to investigate this issue, methods to assess the multi-dimensional perception of sound fields need to progress.

Notes

1 Ribbon microphones with figure-of-eight pickup patterns are sensitive to particle velocity. A figure-of-eight pattern can also be achieved with two coincident cardioid pressure-gradient microphones facing opposite directions (180° apart), with one of them inverted in phase.

2 However, it should be noted that the first practical implementation of the Soundfield Microphone employed sub-cardioid capsules rather than cardioids.

3 Like time variations, spatial variations may be represented in two alternative domains: either the domain of spatial coordinates or the dual domain of spatial frequencies. The spatial frequency can be interpreted as the inverse of the wavelength in the case of a harmonic plane wave. Thus the spatial spectrum is the representation of the sound field in the domain of spatial frequencies. As an example, the Bmnσ(ω) coefficients of the HOA representation are the spatial spectrum obtained in the dual domain of Spherical Harmonics. The spatial spectrum is opposed to the time spectrum, which is derived from the Fourier Transform of a time signal.

References

Bamford, J. S. (1995). An Analysis of Ambisonics Systems of First and Second Order, Ph.D. Thesis, University of Waterloo, Ontario, Canada.

Bleidt, R., Borsum, A., Fuchs, H., & Weiss, S. M. (2014). Object-based audio: Opportunities for improved listening experience and increased listener involvement. SMPTE Conference Proceedings, October 2014.

Blumlein, A. D. (1931). U.K. Patent 394325.

Craven, P. G., & Gerzon, M. A. (1977). U.S. Patent 4,042,779.

Daniel, J. (2001). Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia, Ph.D. Thesis, University of Paris VI, France.

Daniel, J. (2009). Evolving views on HOA: From technological to pragmatic concerns. Ambisonics Symposium 2009, June 25–27, Graz.

Daniel, J., Nicol, R., & Moreau, S. (2003). Further investigations of higher order Ambisonics and wavefield synthesis for holophonic sound imaging. 114th AES Convention, April 2003. Amsterdam.

Driscoll, J. R., & Healy, D. M. (1994). Computing Fourier transforms and convolutions on the 2-sphere. Advances in Applied Mathematics, 15, 202–250.

Epain, N., & Daniel, J. (2008). Improving spherical microphone arrays. 124th AES Convention, May 2008. Amsterdam, Netherlands.

Farrar, K. (1979a). Soundfield microphone. Wireless World, October 1979, 48–50.

Farrar, K. (1979b). Soundfield microphone—2. Wireless World, November 1979, 99–103.

Gallo, E., Tsingos, N., & Lemaitre, G. (2007). 3D-Audio matting, postediting, and rerendering from field recordings. EURASIP Journal on Advances in Signal Processing, 2007(1), 047970.

Gauthier, P. A., & Berry, A. (2006). Adaptive wave field synthesis with independent radiation mode control for active sound field reproduction: Theory. Journal of the Acoustical Society of America, 119(5), May 2006, 2721–2737.

Geier, M., Ahrens, J., & Spors, S. (2010). Object-based audio reproduction and the audio scene description format. Organised Sound, 15(3), 219–227.

Gerzon, M. A. (1973). Periphony: With-height sound reproduction. Journal of the Audio Engineering Society, 21(1), 2–10.

Gerzon, M. A. (1985). Ambisonics in multichannel broadcasting and video. Journal of the Audio Engineering Society, 33(11), 859–871.

Gerzon, M. A. (1992). General metatheory of auditory localisation. Proceedings of the A.E.S. 92nd Convention, 1992.

Hibbing, M. (1989). XY and MS microphone techniques in comparison. Presented at 86th AES Convention, Hamburg. Preprint 2811 (A‑5).

Makita, Y. (1962). On the directional localisation of sound in the stereophonic sound field. E.B.U. Review, June 1962, 102–108.

Middlebrooks, J. (1999). Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency. Journal of the Acoustical Society of America, 106(3), 1493–1510.

Moreau, S., Daniel, J., & Bertet, S. (2006). 3D Sound Field Recording with higher order ambisonics—objective measurements and validation of a 4th order spherical microphone. 120th AES Convention, May 2006. Paris, France.

Morse, P. M., & Feshbach, H. (1953). Methods of Theoretical Physics. New York: McGraw-Hill.

Morse, P. M., & Ingard, K. U. (1968). Theoretical Acoustics. New York: McGraw-Hill.

Nicol, R., & Emerit, M. (1999). 3D-sound reproduction over an extensive area: A hybrid method derived from Holophony and Ambisonic. AES 16th International Conference on Spatial Sound Reproduction, April 1999, Rovaniemi.

Poletti, M. A. (2005). Three-dimensional surround sound systems based on spherical harmonics. Journal of the Audio Engineering Society, 53(11), 1004–1024.

Ward, D. B., & Abhayapala, T. D. (2001). Reproduction of a plane wave sound field using an array of loudspeakers. IEEE Transactions on Speech and Audio Processing, 9(6), September 2001, 697–707.

Zotkin, D. N., Duraiswami, R., & Gumerov, N. A. (2010). Plane-wave decomposition of acoustical scenes via spherical and cylindrical microphone arrays. IEEE Transactions on Audio, Speech and Language Processing, 18(1), 2–16.

Appendix A

Mathematics and Physics of Sound Field

Equation of Acoustic Waves

Any acoustic wave is the solution of the general problem, defined for a space-time domain $\Omega\times[t_1,t_2]$ as follows:

  • The equation of acoustic waves, given by:

$\left(\Delta-\frac{1}{c^{2}}\frac{\partial^{2}}{\partial t^{2}}\right)\psi(\mathbf{r},t)=s(\mathbf{r},t)\qquad\forall\,\mathbf{r}\in\Omega,\ t\in[t_1,t_2]$ (9.24)

where $\psi(\mathbf{r},t)$ refers to the velocity potential (at location $\mathbf{r}$ and time $t$), which is linked to the acoustic pressure $p(\mathbf{r},t)$ and the particle velocity $\mathbf{v}(\mathbf{r},t)$ by the relations $p(\mathbf{r},t)=-\rho_0\,\frac{\partial\psi(\mathbf{r},t)}{\partial t}$ and $\mathbf{v}(\mathbf{r},t)=\nabla\psi(\mathbf{r},t)$. In these equations, $c$ and $\rho_0$ are respectively the speed of sound and the volumetric mass density of the propagation medium. The term $s(\mathbf{r},t)$ refers to the presence of acoustic sources.

  • In combination with boundary conditions, which specify the values of either $\psi(\mathbf{r},t)$ or its gradient $\nabla\psi(\mathbf{r},t)$ (or even both) on the boundary ∂Ω of the domain Ω, and which can be used, for instance, to introduce the effect of walls in a room, with phenomena of acoustic reflection, scattering and diffusion, or diffraction.
  • And in combination with initial conditions, which express the values of $\psi(\mathbf{r},t)$ and $\frac{\partial\psi(\mathbf{r},t)}{\partial t}$ at the starting time.

In this formulation, all the variables are expressed as a function of time. By taking the Fourier Transform of the equations, it is possible to derive the equivalent problem in the frequency domain. There are many ways to solve this problem.
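As a reminder of what this transformation yields (a standard result, stated here for convenience), each time derivative becomes a multiplication by $j\omega$, so that the wave equation turns into the Helmholtz equation:

$\left(\Delta+k^{2}\right)\Psi(\mathbf{r},\omega)=S(\mathbf{r},\omega),\qquad k=\frac{\omega}{c}$

where $\Psi$ and $S$ denote the Fourier Transforms of $\psi$ and $s$, and $k$ is the wavenumber. The Spherical Harmonics and Green function solutions presented below are formulated in this frequency domain.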

Deriving the Solution of the Equation of the Acoustic Waves With Spherical Harmonics

One way to solve the equation of acoustic waves is to use eigenfunctions, which constitute an orthonormal basis and are able to represent any acoustic wave. If the coordinate system is spherical (i.e. any location $\mathbf{r}$ is described by a radius $r$, an azimuth angle $\phi$ and an elevation angle $\theta$, see Figure 9.14), the eigenfunctions of the equation of acoustic waves are composed of Spherical Bessel functions (namely the Spherical Bessel functions of the first kind $j_m(kr)$ and of the second kind $n_m(kr)$, and/or the Spherical Hankel functions of the first kind $h_m^{+}(kr)$ and of the second kind $h_m^{-}(kr)$, respectively for waves propagating towards decreasing $r$ and increasing $r$) and Spherical Harmonics $Y_{mn}^{\sigma}(\phi,\theta)$ (Morse & Feshbach, 1953; Morse & Ingard, 1968). Spherical Bessel functions account for spatial variations as a function of radius, whereas Spherical Harmonics convey spatial variations as a function of azimuth and elevation angles (see Figure 9.5). The Spherical Harmonics are given by:

$Y_{mn}^{\sigma}(\phi,\theta)=\sqrt{(2m+1)\,\epsilon_n\,\frac{(m-n)!}{(m+n)!}}\;P_{mn}(\sin\theta)\times\begin{cases}\cos(n\phi) & \text{if } \sigma=+1\\ \sin(n\phi) & \text{if } \sigma=-1\end{cases}$ (9.25)

where the coefficient $\epsilon_n$ is equal to 1 if $n = 0$ and to 2 if $n > 0$. The functions $P_{mn}(\sin\theta)$ are Legendre polynomials defined by:

$P_{mn}(\sin\theta)=\frac{d^{\,n}P_m(\sin\theta)}{d(\sin\theta)^{n}}$ (9.26)

where the function $P_m$ is the Legendre polynomial of the first kind of order $m$. Spherical Harmonics form a complete set of orthonormal functions for any square-integrable function on the unit sphere. Therefore they fulfill the orthonormality property in the sense of the product defined by:

$\frac{1}{4\pi}\int_{\phi=0}^{2\pi}\int_{\theta=-\pi/2}^{\pi/2} Y_{mn}^{\sigma}(\phi,\theta)\,Y_{m'n'}^{\sigma'}(\phi,\theta)\cos\theta\,d\theta\,d\phi=\delta_{mm'}\,\delta_{nn'}\,\delta_{\sigma\sigma'}$ (9.27)

where $\delta_{mm'}$ denotes the Kronecker delta, equal to 1 if $m = m'$ and to 0 otherwise.
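As an illustration, the orthonormality property can be checked numerically. The sketch below is an author-side illustration, not part of the original text: it assumes the normalization term of Equation 9.25 sits under a square root (the orthonormal, so-called N3D, convention), it uses the associated Legendre functions of scipy.special.lpmv, which include the factor $(1-\sin^2\theta)^{n/2}$ needed for Equation 9.27 to hold, and it removes the Condon-Shortley phase that lpmv includes.

# Numerical check of Equations 9.25 and 9.27 (illustrative sketch).
import math
import numpy as np
from scipy.special import lpmv

def Y(m, n, sigma, phi, theta):
    """Real Spherical Harmonic of Equation 9.25 (N3D normalization assumed)."""
    eps_n = 1.0 if n == 0 else 2.0
    norm = math.sqrt((2 * m + 1) * eps_n * math.factorial(m - n) / math.factorial(m + n))
    # scipy's lpmv includes the Condon-Shortley phase (-1)^n; remove it here
    legendre = (-1) ** n * lpmv(n, m, np.sin(theta))
    angular = np.cos(n * phi) if sigma == +1 else np.sin(n * phi)
    return norm * legendre * angular

# Quadrature: Gauss-Legendre in x = sin(theta) absorbs the cos(theta) measure,
# and a uniform grid in phi is exact for trigonometric polynomials.
x, wx = np.polynomial.legendre.leggauss(32)
theta, phi = np.arcsin(x), np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
PHI, THETA = np.meshgrid(phi, theta)
WEIGHT = wx[:, None] * (2 * np.pi / phi.size) / (4 * np.pi)

M = 3  # enumerate the (M+1)^2 components up to order M
idx = [(m, n, s) for m in range(M + 1) for n in range(m + 1)
       for s in ([+1] if n == 0 else [+1, -1])]
G = np.array([[np.sum(WEIGHT * Y(*a, PHI, THETA) * Y(*b, PHI, THETA))
               for b in idx] for a in idx])
print(np.allclose(G, np.eye(len(idx)), atol=1e-10))  # expect True

With M = 3 the check runs over 16 components, matching the count $(M+1)^2$ discussed below.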

To express the acoustic pressure at a given location $\mathbf{r}$, a kind of “listening” area is created around this point. This area is delimited by two spheres of radii $R_1$ and $R_2$, such that $R_1 < r < R_2$, and is totally free of any acoustic source. The eigenfunctions previously introduced can then be used to expand the acoustic wave. It can thus be shown that the acoustic pressure $p$ at location $\mathbf{r}$, resulting from any sound wave generated by acoustic sources located outside of this so-called listening area, can be expressed as a linear sum of Spherical Bessel functions in combination with Spherical Harmonics:

$p(\mathbf{r},\omega)=\sum_{m=0}^{+\infty} i^{m}\,h_m^{-}(kr)\sum_{n=0}^{m}\sum_{\sigma=\pm 1}A_{mn}^{\sigma}(\omega)\,Y_{mn}^{\sigma}(\phi,\theta)+\sum_{m=0}^{+\infty} i^{m}\,j_m(kr)\sum_{n=0}^{m}\sum_{\sigma=\pm 1}B_{mn}^{\sigma}(\omega)\,Y_{mn}^{\sigma}(\phi,\theta)$ (9.28)

where $\omega$ denotes the angular frequency (i.e. $\omega = 2\pi f$, where $f$ is the frequency). The coefficients $A_{mn}^{\sigma}$ and $B_{mn}^{\sigma}$ are the weights of the eigenfunctions and thus define the representation of the acoustic wave in the associated basis. In other terms, these coefficients are the equivalent of the coefficients of a Fourier series, but in the present case spatial variations are considered instead of time variations. It should be noted that the coefficients $A_{mn}^{\sigma}$ and $B_{mn}^{\sigma}$ depend on the frequency $f$, and in this way convey spectral/time information. For an exact representation of the sound field, an infinite series of terms is required in Equation 9.28. However, this series may be truncated to a finite number of terms: for instance, if only the Spherical Harmonics components up to order m = M are kept, the series is composed of $(M+1)^2$ terms. This result forms the fundamental idea of the Ambisonics representation of a sound field.

As the signals $A_{mn}^{\sigma}$ and $B_{mn}^{\sigma}$ are the coefficients of the Spherical Harmonics expansion (see Equation 9.28), they are obtained by using the orthonormality property of Spherical Harmonics (see Equation 9.27). The amplitude of each component $A_{mn}^{\sigma}$ and $B_{mn}^{\sigma}$ is derived by computing the result, $U_{mn}^{\sigma}(\omega)$, of the projection of the acoustic pressure $p(r,\phi,\theta,\omega)$ onto the associated Spherical Harmonic $Y_{mn}^{\sigma}(\phi,\theta)$, in accordance with the product defined by Equation 9.27:

$U_{mn}^{\sigma}(\omega)=\frac{1}{4\pi r^{2}}\int_{\phi=0}^{2\pi}\int_{\theta=-\pi/2}^{\pi/2} p(r,\phi,\theta,\omega)\,Y_{mn}^{\sigma}(\phi,\theta)\,r^{2}\cos\theta\,d\theta\,d\phi$ (9.29)

To compute $U_{mn}^{\sigma}(\omega)$, it is only required to know the acoustic pressure $p(r,\phi,\theta,\omega)$ at every point on the surface of the sphere of radius $r$ centered at the origin of the coordinate system. The objective is to derive the signals $A_{mn}^{\sigma}(\omega)$ and $B_{mn}^{\sigma}(\omega)$ from $U_{mn}^{\sigma}(\omega)$. To this end, in Equation 9.29 the acoustic pressure is replaced by its Spherical Harmonic expansion defined by Equation 9.28, leading, through the orthonormality property, to:

$U_{mn}^{\sigma}(\omega)=i^{m}\,h_m^{-}(kr)\,A_{mn}^{\sigma}(\omega)+i^{m}\,j_m(kr)\,B_{mn}^{\sigma}(\omega)$ (9.30)

Unfortunately, this equation is not sufficient to compute the signals $A_{mn}^{\sigma}(\omega)$ and $B_{mn}^{\sigma}(\omega)$: a second equation is needed. In addition to the acoustic pressure, the knowledge of the radial acoustic velocity $v_r(r,\phi,\theta,\omega)$ is used to get the quantity $V_{mn}^{\sigma}(\omega)$ (Daniel et al., 2003):

$V_{mn}^{\sigma}(\omega)=\frac{1}{4\pi r^{2}}\int_{\phi=0}^{2\pi}\int_{\theta=-\pi/2}^{\pi/2} v_r(r,\phi,\theta,\omega)\,Y_{mn}^{\sigma}(\phi,\theta)\,r^{2}\cos\theta\,d\theta\,d\phi$ (9.31)

In the same way as for the acoustic pressure, by using the Spherical Harmonic expansion (see Equation 9.28) and the Euler Equation, which allows one to obtain the acoustic velocity from the acoustic pressure, Equation 9.31 becomes:

$V_{mn}^{\sigma}(\omega)=\frac{i^{m-1}}{c\rho_0}\,\frac{\partial h_m^{-}}{\partial r}(kr)\,A_{mn}^{\sigma}(\omega)+\frac{i^{m-1}}{c\rho_0}\,\frac{\partial j_m}{\partial r}(kr)\,B_{mn}^{\sigma}(\omega)$ (9.32)

From Equations 9.30 and 9.32, it is now possible to extract both the signals $A_{mn}^{\sigma}(\omega)$ and $B_{mn}^{\sigma}(\omega)$:

$A_{mn}^{\sigma}(\omega)=i^{-m}\,\dfrac{\frac{\partial j_m}{\partial r}(kr)\,U_{mn}^{\sigma}(\omega)-i c\rho_0\, j_m(kr)\,V_{mn}^{\sigma}(\omega)}{\frac{\partial j_m}{\partial r}(kr)\,h_m^{-}(kr)-j_m(kr)\,\frac{\partial h_m^{-}}{\partial r}(kr)}$

$B_{mn}^{\sigma}(\omega)=i^{-m}\,\dfrac{\frac{\partial h_m^{-}}{\partial r}(kr)\,U_{mn}^{\sigma}(\omega)-i c\rho_0\, h_m^{-}(kr)\,V_{mn}^{\sigma}(\omega)}{j_m(kr)\,\frac{\partial h_m^{-}}{\partial r}(kr)-\frac{\partial j_m}{\partial r}(kr)\,h_m^{-}(kr)}$ (9.33)

Equation 9.33 shows how to compute the representation of any acoustic wave in the Spherical Harmonics domain from the knowledge of solely the acoustic pressure and the acoustic velocity on the surface of the sphere of radius $r$. Let us go back to Equation 9.28, which can be reinterpreted as the expansion of the acoustic wave as a superposition of wavelets of the types $h_m^{-}(kr)\,Y_{mn}^{\sigma}(\phi,\theta)$ and $j_m(kr)\,Y_{mn}^{\sigma}(\phi,\theta)$. The amplitude of the former is $A_{mn}^{\sigma}(\omega)$, and that of the latter is $B_{mn}^{\sigma}(\omega)$. The wavelets of the first type propagate in the direction of increasing $r$ and are therefore due to acoustic sources located inside the sphere of radius $R_1$, whereas those of the second type result from acoustic sources located outside the sphere of radius $R_2$ (Daniel et al., 2003). Thus, the Spherical Harmonics expansion allows one to separate inside and outside components of the sound field. In most cases, there is no source inside the sphere of radius $R_1$, which leads to the conclusion that all the signals $A_{mn}^{\sigma}(\omega)$ are null. Equation 9.28 then becomes:

$p(\mathbf{r},\omega)=\sum_{m=0}^{+\infty} i^{m}\,j_m(kr)\sum_{n=0}^{m}\sum_{\sigma=\pm 1}B_{mn}^{\sigma}(\omega)\,Y_{mn}^{\sigma}(\phi,\theta)$ (9.34)

By default, the Spherical Harmonics expansion refers to this expression. Consequently, the sound field is fully described by the signals $B_{mn}^{\sigma}(\omega)$ alone.
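Translated into code, the extraction of Equation 9.33 is a few lines per order m. The sketch below is an illustration under stated assumptions, not a reference implementation: $h_m^{-}$ is taken as $j_m - i\,n_m$ (the sign of the imaginary part depends on the adopted time convention), the radial derivatives use $\partial_r f(kr) = k\,f'(kr)$, and the helper name is hypothetical.

# Illustrative sketch of Equation 9.33: recover A and B at order m from the
# projections U and V measured on a sphere of radius r.
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def extract_AB(m, U, V, k, r, c=343.0, rho0=1.2):
    jm = spherical_jn(m, k * r)
    hm = spherical_jn(m, k * r) - 1j * spherical_yn(m, k * r)   # assumed h_m^-
    djm = k * spherical_jn(m, k * r, derivative=True)           # d/dr of j_m(kr)
    dhm = k * (spherical_jn(m, k * r, derivative=True)
               - 1j * spherical_yn(m, k * r, derivative=True))  # d/dr of h_m^-(kr)
    icrho = 1j * c * rho0
    A = 1j ** (-m) * (djm * U - icrho * jm * V) / (djm * hm - jm * dhm)
    B = 1j ** (-m) * (dhm * U - icrho * hm * V) / (jm * dhm - djm * hm)
    return A, B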

Deriving the Solution of the Equation of the Acoustic Waves With Green Functions

To solve the equation of acoustic waves (Equation 9.24), an alternative to the Spherical Harmonics expansion is to use the Green function related to the problem (Morse & Feshbach, 1953). In a similar way as previously, the considered domain Ω is divided into two subareas, Ω1 and Ω2. All the acoustic sources lie exclusively in Ω1, so that Ω2 defines a “listening” area containing no source. By applying the Green theorem, it can be shown that the acoustic pressure $p$ at any location $\mathbf{r}\in\Omega_2$ may be expressed as a 2D integral called the “Kirchhoff-Helmholtz” integral:

$p(\mathbf{r},\omega)=\oint_{\partial\Omega_0}\left[g(\mathbf{r}|\mathbf{r}_0,\omega)\,\nabla p(\mathbf{r}_0,\omega)-p(\mathbf{r}_0,\omega)\,\nabla g(\mathbf{r}|\mathbf{r}_0,\omega)\right]\cdot\mathbf{n}\,dS_0$ (9.35)

where $g$ is the associated Green function, ∂Ω0 is the boundary separating the two subareas Ω1 and Ω2, and $\mathbf{n}$ is the unit vector normal to the surface ∂Ω0. Similarly to Equation 9.28, in Equation 9.35 the acoustic wave is a sum of elementary components (another type of wavelets), which recalls Huygens’ Principle. A second observation in common with the Spherical Harmonics expansion is that both the acoustic pressure $p(\mathbf{r}_0,\omega)$ and the pressure gradient $\nabla p(\mathbf{r}_0,\omega)$ (in other words the acoustic velocity, through Euler’s formula) are needed to fully describe the sound wave (Daniel et al., 2003). It should be kept in mind that Equations 9.28 and 9.35 are both equivalent and exact representations of the acoustic pressure. Moreover, it was shown in Nicol and Emerit (1999) that Equation 9.28 can be deduced from Equation 9.35 under some assumptions (namely that the acoustic wave is a plane wave and that the boundary ∂Ω0 is a circle which extends to infinity).
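For completeness (a standard result, not restated in the chapter), in three-dimensional free space the Green function involved in Equation 9.35 is the field radiated by a point source,

$g(\mathbf{r}|\mathbf{r}_0,\omega)=\frac{e^{-jk\left|\mathbf{r}-\mathbf{r}_0\right|}}{4\pi\left|\mathbf{r}-\mathbf{r}_0\right|}$

with the $e^{+j\omega t}$ time convention. The Kirchhoff-Helmholtz integral then reads as a continuous layer of monopoles, driven by the pressure gradient, and of dipoles, driven by the pressure, distributed on the boundary ∂Ω0, which is exactly the Huygens picture mentioned above.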

Appendix B

Mathematical Derivation of W, X, Y, Z

There is another interpretation of the components (W, X, Y, Z). The component W is recorded by an omnidirectional microphone, which in practice corresponds to a pressure microphone. In other words, the component W is assimilated to the acoustic pressure. In the same way, the components (X, Y, Z) are recorded by figure-of-eight microphones, which are obtained by combining pressure gradient microphones. Because of Euler’s relation (see Equation 9.37) between the pressure gradient and the particle velocity, the components (X, Y, Z) can therefore be assimilated to the x-, y- and z-components of the particle velocity. Now, from the point of view of sound field capture, what is the meaning of these signals (W, X, Y, Z)? Let us take the case of an acoustic plane wave. The acoustic pressure is given by:

$p(\mathbf{r},\omega)=p_0\,e^{-j\,\mathbf{k}\cdot\mathbf{r}}$ (9.36)

where $p_0$ is the wave amplitude and $\mathbf{k}$ the wave vector. Through Euler’s formula, the particle velocity is derived from the pressure gradient:

$\mathbf{v}(\mathbf{r},\omega)=j\,\mathbf{k}\,p_0\,e^{-j\,\mathbf{k}\cdot\mathbf{r}}$ (9.37)

Let us consider the four signals (W, X, Y, Z) as the acoustic pressure and the particle velocity measured at the origin $\mathbf{r}=\mathbf{0}$:

$W=p(\mathbf{0},\omega)=p_0$, $X=v_x(\mathbf{0},\omega)=j\,p_0\,k_x$, $Y=v_y(\mathbf{0},\omega)=j\,p_0\,k_y$, $Z=v_z(\mathbf{0},\omega)=j\,p_0\,k_z$ (9.38)

The direct interpretation of this result is that the signal W is in fact the wave amplitude, whereas the signals (X, Y, Z) are proportional to the x-, y- and z-components of the wave vector, which means that these signals convey the information of the propagation direction of the wave. In other words, they contain the spatial information. Thus, the signals (W, X, Y, Z) form a full representation of the acoustic wave. If the propagation direction of the wave is defined by the azimuth angle $\phi_0$ and the elevation angle $\theta_0$, the signals (W, X, Y, Z) become:

$W=p_0$, $X=j\,p_0\,k\cos\theta_0\cos\phi_0$, $Y=j\,p_0\,k\cos\theta_0\sin\phi_0$, $Z=j\,p_0\,k\sin\theta_0$ (9.39)
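Keeping only the directional part of Equation 9.39 (i.e. omitting the common factor $jk$ carried by X, Y and Z), first-order encoding of a plane wave reduces to a few lines. This is an illustrative sketch with a hypothetical function name.

# Directional gains of Equation 9.39 (common factor jk omitted); angles in radians.
import numpy as np

def encode_first_order(p0, azimuth, elevation):
    W = p0
    X = p0 * np.cos(elevation) * np.cos(azimuth)
    Y = p0 * np.cos(elevation) * np.sin(azimuth)
    Z = p0 * np.sin(elevation)
    return W, X, Y, Z

# Example: unit-amplitude wave arriving from 45 degrees to the left, horizontal plane
print(encode_first_order(1.0, np.pi / 4, 0.0))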

We can now show, in the specific case of a plane wave, that the signals (W, X, Y, Z) are the 0th and 1st order components of the Spherical Harmonics expansion. Like any acoustic wave, the plane wave defined by Equation 9.36 can be expressed as a linear sum of Spherical Harmonics $Y_{mn}^{\sigma}(\phi,\theta)$ (see Equation 9.34). In the case of a plane wave, the coefficients of this Spherical Harmonics expansion, i.e. the signals $B_{mn}^{\sigma}(\omega)$, are given by Morse and Feshbach (1953) and Morse and Ingard (1968):

$B_{mn}^{\sigma}(\omega)=p_0\,Y_{mn}^{\sigma}(\phi_0,\theta_0)$ (9.40)

If the Spherical Harmonics expansion is truncated at order M = 1, only four terms are kept, corresponding to the 0th order component and the three 1st order components:

$B_{00}^{1}(\omega)=p_0\,Y_{00}^{1}(\phi_0,\theta_0)=p_0$
$B_{11}^{1}(\omega)=p_0\,Y_{11}^{1}(\phi_0,\theta_0)=p_0\cos\theta_0\cos\phi_0$
$B_{11}^{-1}(\omega)=p_0\,Y_{11}^{-1}(\phi_0,\theta_0)=p_0\cos\theta_0\sin\phi_0$
$B_{10}^{1}(\omega)=p_0\,Y_{10}^{1}(\phi_0,\theta_0)=p_0\sin\theta_0$ (9.41)

This result is obtained by evaluating the Spherical Harmonics $Y_{mn}^{\sigma}(\phi_0,\theta_0)$ in accordance with Equations 9.25 and 9.26. A comparison of Equations 9.39 and 9.41 shows that the signals (W, X, Y, Z) are indeed, up to a multiplicative factor, the 0th and 1st order $B_{mn}^{\sigma}(\omega)$ coefficients of the Spherical Harmonics expansion of the plane wave, which gives further insight into the physical meaning of the signals (W, X, Y, Z) and confirms their relevance as a full (though rough, because of the truncation at order M = 1) representation of the sound field.
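The effect of the truncation can be checked numerically. By the addition theorem of Spherical Harmonics (in the orthonormal N3D convention of Equation 9.25), $\sum_{n,\sigma} Y_{mn}^{\sigma}(\phi_0,\theta_0)\,Y_{mn}^{\sigma}(\phi,\theta)=(2m+1)\,P_m(\cos\gamma)$, where $\gamma$ is the angle between the wave direction and the observation point, so the truncated series of Equations 9.34 and 9.40 reduces to a one-dimensional sum. The sketch below is an author-side illustration, written with the $e^{+j\mathbf{k}\cdot\mathbf{r}}$ sign convention.

# Convergence of the truncated expansion (Equations 9.34 and 9.40) to a plane wave.
import numpy as np
from scipy.special import spherical_jn, eval_legendre

p0 = 1.0
k = 2 * np.pi * 1000 / 343.0        # wavenumber at 1 kHz
r = 0.1                             # observation radius: 10 cm (k*r is about 1.8)
cos_gamma = 0.3                     # cosine of the wave/observation angle

exact = p0 * np.exp(1j * k * r * cos_gamma)
for M in (1, 2, 4, 8, 16):
    series = sum((2 * m + 1) * 1j ** m * spherical_jn(m, k * r)
                 * eval_legendre(m, cos_gamma) for m in range(M + 1))
    print(M, abs(p0 * series - exact))   # error drops sharply once M > k*r

The error falls off rapidly once M exceeds kr, consistent with the usual rule of thumb linking the required order to frequency and to the radius of the listening area.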

Appendix C

The Optimal Number of Loudspeakers

Equation 9.18 defines a system of $(M+1)^2$ linear equations with $N_l$ unknowns (the loudspeaker input signals). Three cases must be distinguished (Poletti, 2005). First, if $N_l < (M+1)^2$, the problem is overdetermined and is solved by quadratic minimization. Second, if $N_l = (M+1)^2$, the matrix $\mathbf{L}$ is square; provided that its inverse $\mathbf{L}^{-1}$ exists, the decoding matrix is given by $\mathbf{D} = \mathbf{L}^{-1}$. Third, if $N_l > (M+1)^2$, the problem is underdetermined and has an infinity of solutions. The solution that minimizes the signal energy is obtained through the pseudoinverse of $\mathbf{L}$, and the decoding matrix is then $\mathbf{D} = \mathbf{L}^{T}(\mathbf{L}\mathbf{L}^{T})^{-1}$. In practice, it is recommended to choose the number of loudspeakers as $N_l = (M+1)^2$, which ensures an optimal reproduction of the sound field (Poletti, 2005).
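As a sketch of these three cases (an illustration with a hypothetical loudspeaker layout, not a reference decoder), the matrix $\mathbf{L}$ can be filled with the Spherical Harmonic values of the loudspeaker directions and inverted with a pseudoinverse, which covers all three situations at once.

# Decoding-matrix sketch for the three cases above. np.linalg.pinv yields the
# least-squares solution when Nl < (M+1)^2, the exact inverse when L is square
# and invertible, and the minimum-energy solution L^T (L L^T)^-1 when Nl > (M+1)^2.
import numpy as np

# Hypothetical cube layout: 8 loudspeakers (azimuths, elevations in degrees)
azim = np.radians([45, 135, 225, 315, 45, 135, 225, 315])
elev = np.radians([35.26, 35.26, 35.26, 35.26, -35.26, -35.26, -35.26, -35.26])

# First order (M = 1): each column holds the 4 harmonic values of one loudspeaker,
# using the directional terms of Equations 9.39/9.41
L = np.vstack([np.ones_like(azim),            # W
               np.cos(elev) * np.cos(azim),   # X
               np.cos(elev) * np.sin(azim),   # Y
               np.sin(elev)])                 # Z
D = np.linalg.pinv(L)                         # decoding matrix, shape (Nl, (M+1)^2)
print(D.shape)                                # (8, 4): here Nl > (M+1)^2 = 4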
