Chapter 10

Wave Field Synthesis

Thomas Sporer, Karlheinz Brandenburg, Sandra Brix, and Christoph Sladeczek

Motivation and History

Perhaps the earliest example of an immersive sound system is the acoustic curtain developed by Steinberg and Snow at AT&T Bell Laboratories and published in 1934 (Steinberg & Snow, 1934). Several microphones placed in a row in the recording room were wired 1:1 to loudspeakers in the reproduction room. This is depicted in Figure 10.1.

A few years later, researchers at Bell Labs suggested reducing the number of loudspeakers of the acoustic curtain down to three channels. Later, in 1953/1955, Snow reported that “the number of channels will depend upon the size of the stage and listening rooms, and the precision in localization required” (Snow, 1955, p. 48), and concluded that the effect of adding further channels was not great enough to justify the additional technical and economic effort (Snow, 1955). Snow’s suggestion indeed came to pass, as 2-channel stereo systems did eventually become the standard in many homes.

As discussed in Chapter 6, stereo systems can lack precise virtual sound source positioning especially when working with phantom images. Further, the existence of a stereo sweet-spot limits the area where accurate spatial impressions and immersion can be experienced within a listening room. There have been a number of proposals for systems that use more than two loudspeakers to overcome these limitations, and Wave Field Synthesis (WFS) is one such system.

As proposed by A. J. Berkhout in 1988 (Berkhout, 1988), the idea for WFS originally comes from seismic research and oil exploration. Sound waves created by explosions travel through different layers of soil and are reflected and refracted at the layer boundaries. These wavefronts are recorded with an array of microphones to analyze certain properties of the layers. Based on this theory, Berkhout had the idea to reverse the process and replace the microphones with loudspeakers, making it possible to regenerate a sound field.

WFS can be considered a holophonic method of sound reproduction, capable of generating sound fields that maintain the temporal and spatial properties that represent virtual sound sources within an area bounded by loudspeakers. Using WFS, virtual sources can be placed not only along the speaker array, but behind and in front of the array as well. This is a remarkable feature that sets WFS apart from conventional stereo and surround systems. Due to its ability to enable the free positioning of virtual sound sources, WFS can be classified as an object-based audio method and therefore has several advantages when compared to discrete channel-based audio reproduction methods. With the recent progress made in microelectronics and the decreasing costs of computing power, loudspeakers, and power amplifiers, WFS systems have become more accessible and have appeared in the commercial marketplace.

Mathematical Background—From the Wave Equation to Wave Field Synthesis

Since an array of loudspeakers is used to synthesize a sound field representing virtual sources in WFS, a driving function is needed to calculate the individual speaker signals. In this section, an introduction to the mathematical concept behind WFS is given. The reader who would like to get a deeper mathematical background in WFS is referred to the referenced literature.

The concept of wave field synthesis is based on Huygens’ Principle, published in 1690 (Huygens, 1690), which is illustrated in Figure 10.2.

Huygens’ Principle states that a propagating wave front of a primary source Ψ can be synthesized by an infinite number of so-called secondary sources placed on the primary source’s wave front. To reconstruct the wave front, all secondary sources are fed with the signal emitted by the primary source. The superposition of all secondary source signals results in an accurate copy of the primary source’s wave front. Looking at Figure 10.3, we can assume that the primary sound source Ψ causes a sound pressure P(rR,ω) at the listening position R inside a volume V. If the sound pressure and velocity caused by the primary source are known on the surface S, then the sound pressure field at R can be reconstructed by using an infinite number of monopole and dipole sources on the three-dimensional surface S. The general mathematical description of this principle in the frequency domain was given by Kirchhoff in 1883 (Kirchhoff, 1883) and is known as the Kirchhoff-Helmholtz integral:

$$P(\mathbf{r}_R,\omega)=\frac{1}{4\pi}\oint_S\left[\underbrace{\frac{\partial P(\mathbf{r}_0,\omega)}{\partial n}\,\frac{e^{-jk\Delta r}}{\Delta r}}_{\text{monopole sources}}-\underbrace{P(\mathbf{r}_0,\omega)\,\frac{\partial}{\partial n}\!\left(\frac{e^{-jk\Delta r}}{\Delta r}\right)}_{\text{dipole sources}}\right]dS_0 \qquad (10.1)$$

In Equation (10.1), ω is the angular frequency, k = ω/c is the wave number, c is the speed of sound, n is the normal vector to S, r0 is a point on the surface S, and Δr = |rR − r0| is the distance between the surface point and the listening position.

The realization of this integral is hard to achieve in practice because it means covering the surface enclosing the listening space with an infinite number of monopole and dipole loudspeakers. By eliminating one secondary source type and choosing a special geometry of V, the Kirchhoff-Helmholtz integral can be reduced to the Rayleigh integrals, which incorporate some further limitations (Start, 1997). Assuming only secondary sources with monopole characteristics, the Rayleigh integral is given as:

$$P(\mathbf{r}_R,\omega)=\frac{1}{2\pi}\iint_S\underbrace{\big(\mathbf{n}\cdot\nabla P(\mathbf{r}_0,\omega)\big)}_{Q(\mathbf{r}_0,\omega)}\,\frac{e^{-jk\Delta r}}{\Delta r}\,dx\,dz \qquad (10.2)$$

with the geometry depicted in Figure 10.4.

In this case the sound field of the primary source Ψ will no longer be synthesized by a surrounding volume of secondary sources but by a planar array S of monopoles. Furthermore, the synthesis is limited to the region y < ys (see Figure 10.4). With the exception of some very special applications, the use of a full planar array of loudspeakers is not practical (Reussner et al., 2013).

In order to reduce the planar array of secondary sources to a line array of transducers, a mathematical approximation called the stationary-phase approximation is applied to Equation (10.2). Based on this approximation technique, we can identify, within each vertical column of the plane, the loudspeakers that have the greatest impact on the sound pressure at the listening position R. As the energy radiated by a sound source decreases with increasing distance from the source, speakers that are far enough away can be neglected. In order to derive a simple wave field synthesis driving function, the primary source is assumed to be omnidirectional. Mathematically this is described by the following formula:

$$P_\Psi(\mathbf{r},\omega)=S(\omega)\,\frac{e^{-jkr}}{r} \qquad (10.3)$$

Inserting Equation (10.3) into Equation (10.2) and applying the stationary-phase approximation yields the secondary source driving function for linear loudspeaker arrays (Verheijen, 1998):

$$Q(\mathbf{r}_0,\omega)=S(\omega)\,\sqrt{\frac{jk}{2\pi}}\,\sqrt{\frac{\Delta r}{r+\Delta r}}\,\cos\varphi\,\frac{e^{-jkr}}{\sqrt{r}} \qquad (10.4)$$

The synthesis integral is as follows:

$$P(\mathbf{r}_R,\omega)=\int_{-\infty}^{\infty}Q(\mathbf{r}_0,\omega)\,\frac{e^{-jk\Delta r}}{\Delta r}\,dx \qquad (10.5)$$

Equation (10.4) is called the 2.5D synthesis operator, where S(ω) is the input signal of the virtual source in the frequency domain. The prefix 2.5D, as opposed to 3D, indicates that the sound field of the virtual source is only synthesized correctly in the horizontal plane z = 0. The underlying geometry is depicted in Figure 10.4. As a result of the approximation, the distance-dependent pressure loss of the virtual source does not match that of the real source. This can be seen by comparing the last term of the synthesis operator with Equation (10.3): the distance-dependent attenuation of the virtual source, $1/\sqrt{r}$, is that of a line source, whereas a behavior of $1/r$ would be desired. To compensate for this, the term $\sqrt{\Delta r/(r+\Delta r)}$ makes both pressure behaviors match on a reference line. Another result of the mathematical approximation is the term $\sqrt{jk/2\pi}$, which represents a static high-pass filter; it is static because it does not depend on the virtual source position. The cosine term describes another gain, depending on the direction of the primary source relative to the secondary source. The last part, $e^{-jkr}$, corresponds in the time domain to a frequency-independent delay given by the distance between the primary source and the secondary source. Using this synthesis technique, a virtual source that, from the perspective of the listener, is located behind the loudspeaker array can be created. The virtual wave front is constructed depending on the position of these sources, called objects. Figure 10.5 shows a simulation of two individual positions: image (a) corresponds to a virtual source positioned 1 m behind the loudspeaker array, whereas image (b) shows a virtual source placed 1,000 m behind the array.
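To make the structure of the 2.5D operator concrete, the following sketch evaluates Equation (10.4) for a linear array on the line y = 0 and a virtual point source behind it. It is a minimal illustration rather than a production renderer; the function name, the choice of a straight reference line, and the convention of measuring Δr from the loudspeaker to the reference line along the source–loudspeaker ray are assumptions made for this example.

```python
import numpy as np

def wfs_point_source_weights(x_spk, x_src, y_src, y_ref, freq, c=343.0):
    """Complex driving weights in the spirit of Eq. (10.4) for a linear array on y = 0.

    x_spk : array of loudspeaker x positions (m), the array lies on the line y = 0
    x_src, y_src : virtual point source position (m), y_src < 0 (behind the array)
    y_ref : y coordinate of the reference line in the listening area (y_ref > 0)
    freq  : frequency in Hz at which the weights are evaluated
    """
    k = 2.0 * np.pi * freq / c                    # wave number
    dx = np.asarray(x_spk, dtype=float) - x_src
    r = np.hypot(dx, -y_src)                      # source-to-loudspeaker distance
    cos_phi = -y_src / r                          # angle between ray and array normal
    # distance from loudspeaker to the reference line, measured along the ray
    delta_r = r * y_ref / abs(y_src)
    prefilter = np.sqrt(1j * k / (2.0 * np.pi))   # sqrt(jk/2pi) high-pass emphasis
    gain = np.sqrt(delta_r / (r + delta_r)) * cos_phi / np.sqrt(r)
    delay = np.exp(-1j * k * r)                   # propagation delay source -> speaker
    return prefilter * gain * delay

# Example: 32 loudspeakers, 17 cm spacing, source 1 m behind the array center
x_spk = (np.arange(32) - 15.5) * 0.17
Q = wfs_point_source_weights(x_spk, x_src=0.0, y_src=-1.0, y_ref=2.0, freq=500.0)
```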

Focused Sound Sources

The principle of time reversal is a common technique in signal processing (Fink, 1992). It is based on the assumption that the path between source and receiver, in this case the virtual sound source and listener, is reversible. Verheijen (1998) applied this technique to the 2.5D WFS operator resulting in the focusing operator:

$$Q_f(\mathbf{r}_0,\omega)=S(\omega)\,\sqrt{\frac{k}{2\pi j}}\,\sqrt{\frac{\Delta r}{\Delta r-r}}\,\cos\varphi\,\frac{e^{+jkr}}{\sqrt{r}} \qquad (10.6)$$

with the synthesis integral modified to:

$$P(\mathbf{r}_R,\omega)=\int_{-\infty}^{\infty}Q_f(\mathbf{r}_0,\omega)\,\frac{e^{-jk\Delta r}}{\Delta r}\,dx \qquad (10.7)$$

Figure 10.6 shows a simulation of a focused sound source that is created in front of the loudspeaker array.

Arbitrarily Shaped Loudspeaker Distributions

The 2.5D synthesis operator derived above is only valid if a linear loudspeaker array is used; the same holds for the focusing operator. In the derivation of the 2.5D synthesis operator, the stationary-phase approximation was used. To reduce the planar distribution of loudspeakers to a linear one, each column of speakers was analyzed, using Rayleigh’s integral, according to its contribution to the overall pressure at the listener position R; the loudspeaker with the shortest distance contributes the most to the 2.5D synthesis operator. If this concept is applied to the linear loudspeaker array, it can be shown that, for a single listener on the reference line, one loudspeaker will have the main influence on the synthesized sound pressure level (Start, 1996; Verheijen, 1998). This is visualized in Figure 10.7: the loudspeaker that contributes the most is always the one that lies directly between the virtual source and the receiver position. The secondary source line and the reference line can therefore be shaped arbitrarily, using a gain, determined individually for each loudspeaker, to compensate for the incorrect pressure level of the virtual source. The only limitation is that exactly one intersection exists between the line Δr, running from a secondary source to the reference line, and the reference line; the same is true for the path between the secondary source and the virtual source position. Because of this, loudspeaker array configurations can be straight or arranged with a soft bend.
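A hedged sketch of the loudspeaker selection implied above: for a bent array, a secondary source only contributes if the ray from the virtual source through that loudspeaker points into the listening area, which can be tested with the loudspeaker’s normal vector. The criterion below follows the commonly used dot-product form (cf. Spors, 2007); the array geometry and all names are illustrative assumptions.

```python
import numpy as np

def active_speakers(spk_pos, spk_normal, src_pos):
    """Select loudspeakers that can contribute to a virtual point source.

    spk_pos    : (N, 2) loudspeaker positions in the horizontal plane
    spk_normal : (N, 2) unit normals pointing into the listening area
    src_pos    : (2,) position of the virtual source (behind the array)

    A speaker is active if the propagation direction from the source through
    the speaker has a positive component along its normal, i.e. the wave
    enters the listening area through that speaker.
    """
    direction = np.asarray(spk_pos, dtype=float) - np.asarray(src_pos, dtype=float)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    return np.einsum('ij,ij->i', direction, spk_normal) > 0.0

# Example: a gently bent array of 8 speakers facing the listening area around the origin
angles = np.linspace(-0.3, 0.3, 8)
spk_pos = np.stack([4.0 * np.sin(angles), 4.0 * np.cos(angles)], axis=1)
spk_normal = -spk_pos / np.linalg.norm(spk_pos, axis=1, keepdims=True)
mask = active_speakers(spk_pos, spk_normal, src_pos=np.array([0.0, 8.0]))
```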

Separation of Sound Objects and Room

Using a WFS system, it is possible to reproduce both point sources and plane waves. The sound field of a natural environment usually contains foreground sound objects that carry information about speech, objects close by, and music, as well as some information about the room. In general, foreground sound objects require a well-defined position in the reproduction space. During the recording process, they are often recorded with dedicated microphones to best represent point sources. The room information typically consists of early reflections and some diffuse reverberation. The pattern for the early reflections is influenced by the position of the sound source with respect to the listener within the room. In most applications, especially in small reproduction rooms, the listeners are not placed close to the walls of the virtual room. Therefore, the precise listener position is not that critical when creating virtual early reflections.1 Similarly, creating diffuse reverberation is a stochastic process. It has been shown in perceptual experiments at TU Delft that for small rooms, eight plane waves are sufficient to encode the whole reflection pattern information (de Bruijn, Piccolo & Boone, 1998). Using eight plane waves to represent the virtual room acoustics, it is possible to separate the virtual foreground objects and room information.

Using the method below, it is possible to modify the position or level of the foreground objects to some degree in post-production.

  • Foreground objects are recorded with close-up spot-microphones and their audio information together with the position of each sound object is stored and used in reproduction.
  • In the production process the reflection pattern of all sound sources is recorded simultaneously or simulated separately and merged to form eight equally distributed plane waves. Any ambient (background) sound would be added to these eight plane waves.

However, larger modifications made to the foreground objects do not cause a proper modification of the room reflection patterns. A way to overcome this is to change the storage and reproduction strategy as follows (Brix, Sporer & Plogsties, 2001):

  • Foreground objects are recorded with close-up spot-microphones and their audio information together with the position of each sound object is stored and used in reproduction.
  • Background (ambient) sound is stored in eight plane waves.
  • The room information is stored as sets of eight impulse responses for a set of expected positions of objects in the reproduction room. These impulse responses have been recorded and processed as eight plane waves.
  • For the reproduction, the sound information of each sound object is convolved with the set of eight impulse responses stored for the position closest to the intended reproduction position of this object, and the result is reproduced as eight plane waves.

Using this approach it is possible to modify the position and level of sound sources. The room impression can be changed completely as well, making it possible to apply the room acoustics of a target space to sound objects recorded at another location. To be as precise as possible, it would be necessary to record the room acoustics at many locations, a procedure that may prove too expensive or time-consuming. Room reproduction in WFS can also be based on room simulations or hybrid schemes. Based on the position of an object in a room, the reflection pattern can be generated by a room simulation module. For example, if the image model (mirror source) simulation technique is used, it is useful to include some low-order mirror sources directly as point sources in the rendering. In the mirror source method, the virtual sound source is mirrored at each room boundary; the image sources are mirrored again, which leads to second, third, and higher-order image sources. As a result, the original room can be represented by an infinite pattern of image sources. Higher-order mirror sources, or measured diffuse reverberation tails, are included as plane waves (Melchior, 2011).
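For a rectangular (shoebox) room, the low-order mirror sources mentioned above can be generated directly. The sketch below computes the four first-order image sources and their delays and gains for a 2D shoebox; the room dimensions, the single wall reflection factor, and the decision to stop at first order are assumptions made for illustration.

```python
import numpy as np

def first_order_images(src, room, reflection=0.8):
    """First-order image sources of a point source in a 2D shoebox room.

    src  : (x, y) source position in metres, one room corner at the origin
    room : (Lx, Ly) room dimensions in metres
    Returns a list of (position, gain) tuples; the gain is the wall reflection factor.
    """
    x, y = src
    lx, ly = room
    return [
        (np.array([-x, y]), reflection),           # mirrored at wall x = 0
        (np.array([2 * lx - x, y]), reflection),   # mirrored at wall x = Lx
        (np.array([x, -y]), reflection),           # mirrored at wall y = 0
        (np.array([x, 2 * ly - y]), reflection),   # mirrored at wall y = Ly
    ]

def delay_and_gain(image_pos, gain, listener, c=343.0):
    """Propagation delay (s) and combined gain (1/r law) of one image source."""
    r = np.linalg.norm(image_pos - np.asarray(listener, dtype=float))
    return r / c, gain / r

# Example: source in a 6 m x 4 m room, listener in the middle of the room
for pos, g in first_order_images((1.5, 2.0), (6.0, 4.0)):
    print(delay_and_gain(pos, g, (3.0, 2.0)))
```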

Separation of Capturing and Reproduction

The acoustic curtain of Steinberg and Snow used a 1:1 relationship between microphones and loudspeakers. This technique becomes impractical as soon as post-production is involved. The mathematical background of WFS described herein only considers the reproduction of audio objects; sound capture is not considered. Several practical methods may be used to generate content that can be reproduced by WFS.

Spot-Microphones

A spot-microphone is located close to each audio source to be recorded. If the audio source is at a fixed location, the position of the microphone is measured once and used as metadata for this object. If the position of the object changes over time, either automatic or manual tracking is necessary to create the metadata associated with this object. In both cases, the position and acoustic properties of each object can be modified individually. Very often this method is used in combination with either some main microphones that capture the room reflections caused by all audio objects, or with some room simulation enabling the reproduction of individual reflections for each audio object. The latter has the benefit that the positions of early reflections are changed correctly if the position of an audio object is changed in post-production. When recording complex scenes, it is often not possible to avoid crosstalk between microphones capturing different objects. This crosstalk can be minimized through microphone type selection or in post-production. For sound sources with non-uniform directivity or with some spatial extension, like a choir, virtual panning spots are used (see below).

Acoustic Scene Analysis

Using the acoustic scene analysis technique, a number of microphones are placed around the acoustic space. The first step in post-production is to separate the signals into sound objects and residual sound. The residual sound usually contains ambient noise, room reflections, and diffuse reverberation. The number and position of microphones generally limit the separation of objects. Also, each microphone adds some noise, so increasing the number of microphones increases the total noise as well. It is therefore recommended to capture each object with as few microphones as possible.

Virtual Panning Spots

Somewhere between spot-microphones and acoustic scene analysis lies the concept of virtual panning spots. Large sound sources, such as choirs, can be recorded with a small number of microphones, resulting, for example, in a 2-channel stereo recording. Instead of reproducing these stereo signals via real loudspeakers, the signals are used as point sources in a WFS reproduction. In that way, virtual loudspeakers create virtual phantom sources (Theile, Wittek & Reisinger, 2002). By adjusting the position of the virtual loudspeakers, the width of large sound objects can be changed.

WFS Reproduction: Challenges and Solutions

A perfect reproduction can be obtained if all assumptions of the WFS driving functions are met; however, in practice, there are a number of restrictions that limit the performance of the WFS system. Some of the limitations are inherent to WFS, while other limitations are due to more practical issues. This section explains what challenges exist using WFS and how special algorithms can improve the performance of the system.

Distance of Loudspeakers and Alias Frequency

The Kirchhoff-Helmholtz integral and the Rayleigh integral are both based on a continuous driving function, assuming that an infinitely large number of infinitely small loudspeakers exist. Obviously, in practice, loudspeakers are not infinitely small and the distance between loudspeakers is limited by the size of the loudspeakers and enclosures. More often, the distance between loudspeakers has to be increased in order to reduce parts and installation costs.

The use of discrete loudspeaker positions is analogous to sampling in analog-to-digital conversion (ADC). In ADC, a continuous audio signal is discretized and the alias terms appear in the frequency domain. In WFS theory, the mapping of audio objects to an infinite number of loudspeakers is continuous; in practical WFS systems, spatial sampling leads to a discrete number of loudspeakers. This reduction introduces errors known as spatial aliasing. It should be noted that in ADC the aliasing occurs when capturing the content, whereas in WFS the aliasing occurs in the reproduction. In contrast to analog-to-digital conversion, where the alias frequency depends only on the sampling rate, in spatial sampling the alias frequency also depends on the position in the reproduction room and on the direction of the wave front. The alias frequency $f_A = \omega_A/2\pi$ depends on the distance between the loudspeakers as seen by the wave front. For plane waves this is shown in Figure 10.8. The following equation determines the aliasing frequency:

$$f_A=\frac{c}{2\,\Delta x\,\sin\alpha} \qquad (10.8)$$

where ∆x is the distance between the loudspeakers and α is the angle between the loudspeaker array and the wave front. The lowest alias frequency, which is the worst case, occurs when the wave front is perpendicular to the array (α = 90°), that is, when the wave propagates along the array. For wave fronts parallel to the array, which propagate perpendicular to it, the alias frequency becomes infinite and no spatial aliasing occurs.
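A quick numerical check of Equation (10.8): with the 17 cm spacing mentioned later in this section, the worst-case alias frequency (α = 90°) is roughly 1 kHz. The helper below is a direct transcription of the formula; the example values are only illustrative.

```python
import math

def alias_frequency(dx, alpha_deg, c=343.0):
    """Spatial alias frequency of Eq. (10.8): f_A = c / (2 * dx * sin(alpha))."""
    s = math.sin(math.radians(alpha_deg))
    return math.inf if s == 0.0 else c / (2.0 * dx * s)

print(alias_frequency(0.17, 90.0))   # ~1009 Hz, worst case (propagation along the array)
print(alias_frequency(0.17, 30.0))   # ~2018 Hz for an oblique wave front
```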

Figure 10.9 shows a snapshot of the amplitude in space for a virtual sound source with a frequency higher than the alias frequency. Distortions are largest close to the loudspeaker array. The sound field becomes smoother with increased distance between the listener and the loudspeaker array (Corteel, 2006).

The distance between the centers of adjacent membranes is used to calculate the distance between loudspeakers. A typical spacing of 17 cm results in a worst-case alias frequency of about 1 kHz. Numerical simulations suggest that above the alias frequency the sound field is severely distorted. However, listening tests show that distortion due to aliasing is often inaudible. There are several reasons for this effect:

  • Spatial aliasing causes dips rather than peaks in the sound pressure. Dips are much less annoying than peaks, but simulations usually only consider the absolute difference between the natural sound field and the generated sound field. Therefore, in general, the perceptual effect of the error is overestimated.
  • Just above the alias frequency the errors are still small. The largest errors occur at higher frequencies. At higher frequencies the position-dependent dips in the spectrum are very narrow band, but at these frequencies, the resolution of the human auditory system is poor. Narrow band dips are therefore only detectable if the input signal is extremely narrow band, and this rarely happens in practice.
  • Just above the alias frequency, the dips are rather broad and not very deep. At higher frequencies, the dips are deeper and narrower. While simulations are based on ideal microphones with infinitely small sensors, the human auditory system averages the sound around the outer ear. Therefore, closely spaced frequency dips are inaudible.
  • In natural indoor environments, there is an abundance of reflections causing narrow band spectral coloration of the sound. Humans usually adapt to environments and therefore do not perceive such constant sound colorations.

There are two situations when colorations due to spatial aliasing become audible; one is if the colorations are distinctly different in the two ears, and the other is if the listener and/or sound sources are moving quickly through the room.

In the past, several attempts have been made to reduce the problem of spatial aliasing. The OPSI2 concept, invented by Helmut Wittek (2007), combines WFS at low frequencies with traditional 2-channel stereo at higher frequencies. Below the alias frequency, the loudspeaker array is controlled via WFS. Above the alias frequency, only the two loudspeakers closest to the sound source position are used, with amplitude panning. Wittek showed that the sound coloration of this approach is less audible than for pure WFS. However, there are some disadvantages to the OPSI method: first, objects with dominant high frequencies always sound as if they are located between the loudspeakers, and second, the distance and size of objects are not reproduced.

Other attempts are based on the assumption that listeners are usually not located throughout the entire listening room, but are restricted to a smaller listening area. Based on this assumption it is possible to control the sound field in a way that reduces coloration for that area (Franck et al., 2007; Spors, 2006; Spors, 2007; Ahrens & Spors, 2008; Melchior et al., 2008).

Limited Length of the Loudspeaker Array—Level

The mathematical derivation for WFS assumes an infinitely long array of loudspeakers. In practice, the size of the reproduction room and the length of the array are limited. If a virtual sound source is in proximity to the loudspeaker array, the loudspeakers closest to the virtual sound source provide the major part of the sound energy. The missing loudspeakers outside the room would only contribute with an insignificant energy to the sound at the listener’s position. If a sound source is far behind the loudspeaker array, all loudspeakers have to emit almost the same energy. The amount of sound energy not reproduced, due to the missing loudspeakers, is significant. In audio-visual installations, where virtual sound sources are moved around, this may cause a mismatch between the visual and the auditory position because the virtual sound source is losing loudness at a rate faster than it would in a natural acoustic environment.

An easy way to overcome this problem is to use an analysis-by-synthesis approach: the renderer first simulates and compares the sound pressure level of the existing loudspeakers with the sound pressure level of an ideal reproduction system at a reference position in the reproduction room. The driving signals of all existing loudspeakers are then amplified to achieve the same reproduction level in both simulations. Because the correction factor depends only on the geometry of the loudspeaker array and the position of the virtual sound source, it can be pre-calculated once and stored. This analysis-by-synthesis approach also solves the problem of gaps in loudspeaker arrays. In many installations, there are places where loudspeakers cannot be installed, such as at the sides of the screen in movie theaters where curtains are located. If a virtual sound source is moved into the proximity of such a gap, it would be reproduced too softly. The analysis-by-synthesis method described here automatically allocates the energy to the loudspeakers closest to the gap.
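A minimal sketch of the analysis-by-synthesis level correction described above, assuming free-field monopole loudspeakers and a single reference position evaluated at a single frequency (a real renderer would average over a frequency range). The ideal reference is a point source at the virtual source position radiating with a 1/r law; the finite, possibly gapped array is simulated by summing its complex contributions at the reference point. All names are assumptions of this example.

```python
import numpy as np

def level_correction(spk_pos, weights, src_pos, ref_pos, freq, c=343.0):
    """Gain that matches the finite array's level at ref_pos to an ideal point source.

    spk_pos : (N, 2) loudspeaker positions; weights : (N,) complex driving weights
    src_pos : virtual source position; ref_pos : reference listening position
    """
    k = 2.0 * np.pi * freq / c
    ref = np.asarray(ref_pos, dtype=float)
    # field of the existing (finite, possibly gapped) array at the reference position
    r_spk = np.linalg.norm(np.asarray(spk_pos, dtype=float) - ref, axis=1)
    p_array = np.sum(np.asarray(weights) * np.exp(-1j * k * r_spk) / r_spk)
    # field of the ideal point source at the same position
    r_src = np.linalg.norm(np.asarray(src_pos, dtype=float) - ref)
    p_ideal = np.exp(-1j * k * r_src) / r_src
    return np.abs(p_ideal) / np.abs(p_array)   # amplify all driving signals by this factor
```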

Limited Length of the Loudspeaker Array—Truncation

In addition to the level problem described above, the limited length of the loudspeaker array can also cause artifacts. A simulation plot of this effect can be seen in Figure 10.10. Shadow waves are emitted from the edges of the array. The effect is similar to a discrete Fourier transform (DFT) with a rectangular window. In a DFT, the solution is to use windowing functions that reduce the level of the samples at the beginning and at the end of each transform block. Examples of such windowing functions are the Hann window or the cosine taper. A similar solution can be applied to WFS. Here, the driving function for the loudspeakers close to the end of the array is modified by a window function. In most practical implementations, WFS is not used as a straight array alone, rather, systems have several arrays around the listener. Here, the loudspeakers on the side arrays can compensate for the missing contributions of the loudspeakers in the front. To avoid problems with moving sound sources around the corner, rectangular arrangements of arrays should be avoided. Very often, a small number of loudspeakers at 45° are inserted at the corners of the room (see Figure 10.11). Together, the analysis-by-synthesis approach mentioned earlier and these corner arrays avoid the spatial distortions caused by truncation.3
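Analogous to windowing before a DFT, the driving gains of the loudspeakers near the array ends can be faded out with a cosine taper. The sketch below builds such a taper with plain NumPy; the taper length (number of faded speakers per side) is a free parameter, and the value chosen in the example is only illustrative.

```python
import numpy as np

def edge_taper(n_speakers, n_fade):
    """Gain window for a linear array: raised-cosine fade over n_fade speakers per edge."""
    w = np.ones(n_speakers)
    if n_fade > 0:
        fade = 0.5 * (1.0 - np.cos(np.pi * (np.arange(n_fade) + 0.5) / n_fade))  # 0 -> 1
        w[:n_fade] = fade
        w[-n_fade:] = fade[::-1]
    return w

# Example: 64 loudspeakers, fade the outer 8 speakers on each side
gains = edge_taper(64, 8)
# driving_signals *= gains[:, None]   # applied per loudspeaker to its drive signal
```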

Position-Dependent Filtering

The driving function for pure WFS systems contains the term $\sqrt{jk/2\pi}$. This term represents a shelving-like filter whose gain rises by 3 dB per octave, compensating for the excess bass amplification inherent in synthesis with a line of secondary sources. The derivation of the driving function assumes that virtual sound sources are in the far field, not close to the loudspeaker array. If a virtual sound source is close to the array, less bass compensation is needed; if the sound source is placed exactly at the position of a loudspeaker, no compensation filter is necessary at all. The necessary filter depends on the source position, the direction of the wave front, and the loudspeaker geometry. An analysis-by-synthesis approach is the most practical way to find the correct filter. Most implementations do not use different filter settings for each loudspeaker; instead, a common filter per sound object is used. Because the calculation of the filter coefficients is computationally complex, the values are often pre-calculated and stored. In general, a set of filter coefficients for each spatial region is sufficient.
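The $\sqrt{jk/2\pi}$ pre-emphasis can be realized as a short FIR filter shared by all loudspeakers of one spatial region, as the text suggests. The sketch below uses simple frequency sampling to approximate the desired response; the filter length, sample rate, and the linear-phase shift are assumptions, and a production system would use a more careful filter design.

```python
import numpy as np

def wfs_prefilter(n_taps=128, fs=48000, c=343.0):
    """FIR approximation of sqrt(j*k/(2*pi)): +3 dB/octave magnitude, +45 deg phase."""
    f = np.fft.rfftfreq(n_taps, 1.0 / fs)          # frequency grid up to fs/2
    k = 2.0 * np.pi * f / c
    target = np.sqrt(1j * k / (2.0 * np.pi))       # desired frequency response
    h = np.fft.irfft(target, n_taps)               # real impulse response (zero delay)
    return np.roll(h, n_taps // 2)                 # shift to make the FIR causal
```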

Directivity of Loudspeakers

As described in the WFS theory section, the Kirchhoff-Helmholtz integral implies that an infinite number of monopoles and dipoles encircling the reproduction space are necessary to achieve perfect results; with that combination, the reproduced sound field does not extend outside the listening space, that is, behind the speakers. Practical implementations of WFS are usually based on monopole loudspeakers. The directivity of real loudspeakers is neither an ideal monopole nor an ideal dipole, but rather close to a cardioid. Depending on the size of the membrane, the loudspeaker behaves like a monopole below some frequency but becomes more directional at higher frequencies. Therefore, as in stereo reproduction, 2-way and 3-way systems are used, with separate drivers for the low, mid, and high frequencies and a crossover network to split the input signal accordingly. In general, problems with phase coherence can occur in the crossover region. WFS is based on the phase-coherent superposition of loudspeaker signals, so unclear phase behavior of the individual loudspeakers can be harmful. However, the perceptual trouble caused by the directivity of the loudspeakers is much more disturbing than problems due to unclear phase coherence, especially with moving sound sources (Klehs & Sporer, 2003).

In the past, attempts have been made to compensate for symptomatic directivity issues associated with loudspeakers (de Vries, 1996; de Vries, 2009; Ahrens & Spors, 2008). However, these approaches either lack flexibility to be used in real systems, are computationally very complex, or both.

Influence of the Reproduction Room

Using only monopoles has another unwanted side effect: the loudspeakers send as much energy towards the back as towards the listener. Behind the loudspeakers there is often a reflective wall that sends part of this energy back to the listener. These unwanted reflections superimpose on the intended wave fronts radiated from the front of the loudspeakers. The reflection pattern depends on the position of each loudspeaker relative to the walls of the reproduction room, and it allows the listener to localize the individual loudspeakers. When this happens, the superposed wave front is distorted and the perceived position of the virtual sound source moves closer to the loudspeaker array than intended. This effect is especially strong for focused sound sources.

The simplest way to overcome this problem is to acoustically treat the reproduction room. If there are no hard reflections from the walls, ceiling, and floor, localization of the individual loudspeakers will not occur. Acoustically treating the reproduction room also solves a more general problem. Often, the recording room (e.g. a concert hall) is larger than the reproduction room (e.g. a living room). If the reflection pattern of such a reproduction room contains hard reflections, they arrive before the reflections contained in the recording. Strong early reflections lead to the perception of a small room, so instead of the intended large room a mixture of both rooms is perceived.

On the other hand, it has been found that some diffuse reverberation caused by the reproduction room can improve the quality of the overall experience. The diffuse components enrich the envelopment and may mask some otherwise audible spatial aliasing artifacts.

Time Domain Effects in Large Loudspeaker Arrays

In large auditoria, such as the Bregenz lakeside open-air stage in Austria with its 7,000 seats, pure WFS causes audible distortions. This effect is in general not audible for steady-state sounds but is audible for transients. To understand it, it is necessary to look at the impulse response and at perceptual factors. A plane wave whose wave front is parallel to the loudspeaker array is emitted by firing all of the loudspeakers at the same time. For long arrays, the signal from loudspeakers near the center is received earlier than the signal from loudspeakers near the ends of the array. If this time difference is above the detection threshold for echoes,4 the signals from the loudspeakers are no longer merged into a single sound event. The detection threshold for this effect is higher than the detection threshold for a pure reflection, because many loudspeaker signals fill the gaps between the closest and the farthest loudspeakers, and because the farthest loudspeakers provide less energy due to their greater distance. A solution to this problem is to subdivide the long array into subsections, thus avoiding plane waves in long arrays.
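The path-length differences that cause this effect are easy to estimate. For a plane wave fired simultaneously from a long straight array, the extra travel time from an end loudspeaker relative to the nearest loudspeaker is computed below for a listener centred in front of the array; the array length and listener distance are example values, not data from any particular installation.

```python
import math

def edge_lag_ms(array_length, listener_dist, c=343.0):
    """Extra arrival time (ms) of the end loudspeakers relative to the nearest one
    for a listener centred in front of a straight array firing simultaneously."""
    r_end = math.hypot(array_length / 2.0, listener_dist)
    return 1000.0 * (r_end - listener_dist) / c

print(edge_lag_ms(40.0, 10.0))   # ~36 ms for a 40 m array, listener 10 m in front
print(edge_lag_ms(80.0, 10.0))   # ~90 ms, above the ~50 ms echo threshold for speech
```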

WFS With Elevation

In traditional channel-based reproduction without height channels, a small number of loudspeakers are located in a single plane. The sweet-spot (the place or area of optimal listening) is limited, so a critical listener needs to be close to or in the sweet-spot. Information about the elevation of sound sources and/or reflections from the ceiling is mixed into the normal loudspeaker channels. Because the position of the listener is well defined, binaural cues of the original recording are preserved and the listener has the impression that there is also sound from above. If the listener moves away from the sweet-spot, the spatial information becomes distorted and wrong elevation information is perceived; because the localization errors in the horizontal plane are already large, the errors concerning elevated reflections go unnoticed. In WFS, the effective listening area encompasses almost the whole reproduction room. When listeners experience WFS, they often notice that information from above is missing: the improved spatial localization in azimuth makes the missing information in the elevation direction more apparent. Several attempts have been made to solve this problem.

The most natural solution would be to extend the WFS system to a 3D WFS system. The additional loudspeakers would be located at several different elevations separated by a small distance. However this approach is very expensive and not adequate for most applications. There are three factors which help reduce the number of loudspeakers. First, the ears of humans are on the side of the head at about the same height; therefore, the precision of sound localization for elevation is much less than it is in the azimuthal direction. Second, in most situations, minor sound components, like ceiling reflections, are presented at the same time as sounds from the horizontal plane. Third, if a dominant sound is coming from above, humans turn their head to look in the direction of the sound. If the sound source is not visible, there is no indication of a wrong position; therefore, the loose perception that a sound source is somewhere above is sufficient. The number of height loudspeakers needed depends on the size of the reproduction space.

Newer WFS installations tend to use a small number of additional height loudspeakers. To make content scalable, the position of audio objects is stored as 3D Cartesian coordinates (x, y, z) or as polar coordinates (azimuth, elevation, distance). The mapping of elevated sound sources to the actual loudspeaker setup is done in the rendering. While sound objects close to the horizontal plane are reproduced via WFS only (i.e. as point and focused sources), elevated sound objects can be reproduced as a combination of WFS and Vector Base Amplitude Panning (VBAP) for the elevated loudspeakers. For small reproduction sites such as home cinemas, where there is no space for ceiling loudspeakers, binaural cues are used to simulate height loudspeakers: the signal that would come from the non-existent ceiling speaker is filtered according to Blauert’s directional bands and reproduced via the existing loudspeakers in the horizontal plane. A detailed description of the algorithm is part of the MPEG-H 3D audio standard (ISO/IEC 23008–3, 2015).
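The combination with amplitude panning mentioned above can be illustrated with the basic VBAP computation (Pulkki, 1997): the gains of a loudspeaker triplet are obtained by inverting the matrix of their direction vectors. The sketch below is the textbook form of that calculation; the loudspeaker directions and the energy normalization are example assumptions.

```python
import numpy as np

def vbap_gains(spk_dirs, src_dir):
    """Gains of a loudspeaker triplet for a source direction (3D VBAP).

    spk_dirs : (3, 3) unit direction vectors of the three loudspeakers (rows)
    src_dir  : (3,) unit direction vector of the virtual source
    """
    g = np.linalg.solve(spk_dirs.T, np.asarray(src_dir, dtype=float))
    if np.any(g < 0):
        raise ValueError("source direction lies outside this loudspeaker triplet")
    return g / np.linalg.norm(g)          # normalize to constant energy

def unit(az_deg, el_deg):
    """Unit vector from azimuth/elevation in degrees (front = 0 deg azimuth)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

# Example: two horizontal speakers at +/-30 deg azimuth and one height speaker
spk = np.stack([unit(30, 0), unit(-30, 0), unit(0, 60)])
print(vbap_gains(spk, unit(10, 25)))
```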

Audio Metadata and WFS

There are three types of objects in WFS: point sources, focused sources, and plane waves. The main difference between a point source and a focused source is the algorithm used to calculate the signals that drive the loudspeakers. Depending on the size of the reproduction system, a given object might be inside or outside the encircling loudspeaker array; therefore, the metadata does not distinguish between these two source types. For both a point source and a focused source, the metadata contains the position of the object; for plane waves, just the direction is stored. In the past, Cartesian coordinates have been used (Brix et al., 2001), but today most systems are based on polar coordinates, where the azimuth angle starts in the front and runs counterclockwise with a range from 0° to 360° or from −180° to 180°. The elevation angle starts at the horizontal plane with a range from −90° to 90°. The distance is stored as well. A reasonable resolution for the azimuth and elevation angles is 8 and 6 bits, respectively. Some metadata sets use the same format for point sources and plane waves; the maximum distance value is then reserved as a flag indicating that the source is a plane wave. This format is possible because listeners cannot detect the curvature of nearly flat wave fronts; the acoustic horizon is at about 15 m, so it is unnecessary to encode larger distances precisely. In many applications, a few additional values/flags are useful (Ruiz, Sladeczek & Sporer, 2015), as listed below; a sketch of such a metadata record follows the list:

  • Distance-Dependent Delay: When rendering moving objects, WFS generates a natural Doppler shift. Such an effect is not welcomed by most mix engineers. Therefore, a flag is set to change the behavior of the renderer to avoid the Doppler effect; the arrival time of an object’s signal at the center of the reproduction space is then changed. By switching the distance-dependent delay off, the time of arrival becomes independent of the distance. It should be noted that even with distance-dependent delays switched off, a sound source close to the listener has a wave front with greater curvature than a sound source far away.
  • Distance-Dependent Level: When the distance between an object and the listener position is increased, WFS creates a natural decrease in level. A flag tells the renderer to compensate for this decrease in level.
  • Sound Object Visible: In cinema applications, sound objects might be also visible. For such applications, it is advisable to adjust the position of the audio object to the screen size. This flag indicates whether an object should be adjusted or not.
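To make the quantization and flags above concrete, here is a sketch of a compact per-object metadata record using the bit widths mentioned in the text (8 bits azimuth, 6 bits elevation). The field names, the 7-bit distance field, the 16 m cap whose maximum code serves as the plane-wave flag, and the packing itself are illustrative assumptions, not a standardized format.

```python
from dataclasses import dataclass

@dataclass
class WfsObjectMetadata:
    """Compact per-object metadata record (illustrative, not a standard format)."""
    azimuth_deg: float          # -180 .. 180, front = 0, counterclockwise
    elevation_deg: float        # -90 .. 90
    distance_m: float           # values >= DIST_MAX are treated as "plane wave"
    dist_delay_enabled: bool    # distance-dependent delay (natural Doppler) on/off
    dist_level_enabled: bool    # distance-dependent level on/off
    visible: bool               # adjust the object position to the screen size

    AZ_BITS, EL_BITS, DIST_BITS = 8, 6, 7
    DIST_MAX = 16.0             # roughly the acoustic horizon; larger values = plane wave

    def quantize(self):
        """Quantize angles and distance to the bit widths given in the text."""
        az = round((self.azimuth_deg + 180.0) / 360.0 * (2 ** self.AZ_BITS - 1))
        el = round((self.elevation_deg + 90.0) / 180.0 * (2 ** self.EL_BITS - 1))
        d_codes = 2 ** self.DIST_BITS - 1
        dist = min(d_codes, round(self.distance_m / self.DIST_MAX * d_codes))
        return az, el, dist     # dist == d_codes signals a plane wave

# Example: point source front-left, slightly elevated, 3 m away
meta = WfsObjectMetadata(30.0, 10.0, 3.0, True, True, False)
print(meta.quantize())
```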

An additional element related to metadata is the calibration of the reproduction system. WFS rendering creates proper driving signals only when the position of each loudspeaker is known. In addition, the loudspeakers need to be calibrated as closely as possible to one another. As a result of the standardization of MPEG-H 3D audio, a calibration procedure beneficial for WFS was defined as follows:

  • In the center of the reproduction system is an omnidirectional microphone.
  • Each loudspeaker is equalized so that the frequency response at the microphone is flat.
  • The delay of each loudspeaker is adjusted in a way that an impulse emitted by any loudspeaker arrives at the same time at the microphone.
  • The level of each loudspeaker is adjusted so that the level of the signal emitted by any loudspeaker arrives with the same sound pressure level at the microphone.

In general, equalization, time, and level calibration are implemented as FIR filters for each loudspeaker.
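The delay and level steps of this calibration can be derived directly from impulse responses measured at the central microphone, as sketched below. The onset detection by peak picking and the choice of the latest arrival as the delay reference are simplifying assumptions; the equalization step (flattening the frequency response) would additionally require an inverse-filter design and is omitted here.

```python
import numpy as np

def delay_level_calibration(impulse_responses, fs):
    """Per-loudspeaker delay (s) and gain from IRs measured at the central microphone.

    impulse_responses : (N, L) array, one measured impulse response per loudspeaker.
    Returns (delays, gains) such that, after applying them, an impulse emitted by any
    loudspeaker arrives at the microphone at the same time and with the same level.
    """
    irs = np.asarray(impulse_responses, dtype=float)
    onsets = np.argmax(np.abs(irs), axis=1)          # crude onset estimate: main peak
    levels = np.sqrt(np.sum(irs ** 2, axis=1))       # broadband energy per loudspeaker
    delays = (onsets.max() - onsets) / fs            # delay everyone to the latest arrival
    gains = levels.min() / levels                    # attenuate down to the quietest one
    return delays, gains

# Example with synthetic impulse responses: three loudspeakers at different
# distances (peak positions) and with different sensitivities (peak amplitudes)
fs = 48000
irs = np.zeros((3, 4800))
for i, (n0, a) in enumerate([(100, 1.0), (140, 0.7), (180, 0.9)]):
    irs[i, n0] = a
delays, gains = delay_level_calibration(irs, fs)
```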

Applications Based on WFS and Hybrid Schemes

Early developers of WFS thought of many applications and related systems. The European Union co-funded project CARROUSO (Brix et al., 2001) demonstrated a complete system chain from the distributed microphones through the encoding of the object-based audio using MPEG-4 to WFS-based rendering. A similar system was demonstrated in 2003 at a movie theater in Ilmenau: the 5.1-channel audio that was then available on any film was mapped to virtual loudspeakers and rendered in real time via a 192-speaker WFS system in the cinema. In addition, native object-based demo content was reproduced.

Applications of WFS include cinema, concerts, planetariums, exhibitions, theme parks, automotive, medical rehabilitation, and VR caves. While the application of WFS to home cinema has been a goal from early on, there have only been a small number of prototype implementations of WFS in home cinema. On one hand, this is due to missing content for a wider audience; on the other hand, home acoustics are far from ideal for immersive audio rendering. One big obstacle to the implementation of a large number of WFS venues is the number of required loudspeakers. As discussed above, a spacing of 17 cm or less between speakers is needed to keep spatial aliasing below audibility. This translates to dozens of loudspeakers for living rooms and hundreds of loudspeakers for large auditoria. While in the latter case this has been done, it is prohibitive in terms of cost and, not to be neglected, the visual impact on the room. To overcome this limitation, a number of hybrid systems using both WFS theory and ideas derived from Vector Base Amplitude Panning (VBAP) (Pulkki, 1997) have been developed. Nearly all of the immersive sound applications in the past few years have fallen into this category.

WFS and Object-Based Sound Production

Multichannel audio systems have the potential to reproduce spatial sound. The production task includes creating impressions of sound coming from certain directions. One way to achieve this is by panning the audio signal. Mixing desks and digital audio workstations provide dedicated tools for this task. However, these tools assume that the audio signal is being played back over a standardized loudspeaker setup like stereo, 5.1, 7.1, and so forth. When varying existing loudspeaker setups or introducing new reproduction formats, this assumption will not be met. As a consequence of this, every new reproduction format will need a new panning tool.

In the field of audio post-production for motion pictures, the audio production process is currently highly parallel and segregated. The introduction of new spatial audio systems will require even more production steps, including spatial authoring, to accomplish a rich auditory experience for the audience.

Currently, most mixing and sound design processes are based on the channel paradigm, where the coding format defines the reproduction setup. For these systems, any change to the reproduction setup requires redoing the complete mix. As described below, if the WFS processor used for reproduction knows the target setup, it can render the loudspeaker signals in a way that fits the targeted setup perfectly.

The mixing process for wave field synthesis is based on a sound object paradigm, which overcomes the limitations of the channel-based approach. The position of each sound object in an audio scene is needed and is determined during the mixing process. Tracks and channels, which are only indirectly considered in this process, are combined into sound objects that can be moved in an audio scene using a WFS authoring tool. Besides the audio information, the sound objects can have directivity information and interact with the virtual spatial environment. Based on the virtual object position, information for the direct sound, early reflections, and diffuse reverberation can be calculated, processed, and rendered to any loudspeaker configuration. Room acoustic information can be based on real or artificial spatial data. Therefore, the final WFS mix does not contain loudspeaker-related material. All audio signals that represent sound objects in the final mix are sent to the WFS rendering processor, which calculates the signals for all loudspeakers for the reproduction.

Notes

1 As known from psychoacoustics, listeners merge the first room reflections with the direct sound when localizing sound objects. These early reflections depend on the source and listener positions. In small reproduction rooms, the listener position is of minor influence.
2 Optimised Phantom Source Imaging of the high-frequency content of virtual sources in Wave Field Synthesis.
3 In research, circular arrays of loudspeakers have also been used. While circular arrays do not have any problems with truncation, such a geometry is not adequate for most practical applications.
4 The echo threshold for transients is between 50 ms (for speech) and 100 ms (for music) (Blauert, 1996).

References

Ahrens, J., & Spors, S. (2008). Notes on rendering focused directional virtual sources in wave field synthesis. 34. Jahrestagung für Akustik (DAGA). Dresden, Germany: Deutsche Gesellschaft für Akustik (DEGA).

Berkhout, A. J. (1988). A holographic approach to acoustic control. Journal of the Audio Engineering Society (JAES), 36(12), 977–995.

Blauert, J. (1996). Spatial Hearing. Cambridge: The MIT Press.

Brix, S., Sporer, T., & Plogsties, J. (2001). Carrouso—an European approach to 3d-audio. Proceedings of the 110th Audio Engineering Society (AES) Convention. Amsterdam.

de Bruijn, W., Piccolo, T., & Boone, M. M. (1998). Sound recording techniques for wavefield synthesis and other multichannel sound systems. Proceedings of the 104th Audio Engineering Society Convention. Amsterdam.

Corteel, E. (2006). On the use of irregularly spaced loudspeaker arrays for Wave Field Synthesis, potential impact on spatial aliasing frequency. Proceedings of the International Conference on Digital Audio Effects (DAFx-06). Montreal, Quebec, Canada.

de Vries, D. (1996). Sound reinforcement by wave field synthesis: Adaptation of the synthesis operator to the loudspeaker directivity characteristics. Journal of the Audio Engineering Society (JAES), 44(12), 1120–1131.

de Vries, D. (2009). Wave Field Synthesis. AES Monograph. New York: Audio Engineering Society (AES).

Fink, M. (1992). Time reversal of ultrasonic fields. Part I: Basic principles. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 39(5), 555–566.

Franck, A., Gräfe, A., Korn, T., & Strauß, M. (2007). Reproduction of moving sound sources by wave field synthesis: An analysis of artifacts. Proceedings of the 32nd International Conference of the Audio Engineering Society (AES). Hillerød, Denmark.

Huygens, C. (1690). Traité de la Lumière. Leiden: Pieter van der Aa.

ISO/IEC 23008–3. (2015). Information technology: High efficiency coding and media delivery in heterogeneous environments‑Part 3: 3D audio. Standard, International Organization for Standardization. Geneva, CH, October 2015.

Kirchhoff, G. (1883). Zur Theorie der Lichtstrahlen. Annalen der Physik, 254, 663–695.

Klehs, B., & Sporer, T. (2003). Wavefield synthesis in the real world: Part 1-in the living room. Proceedings of the 114th Audio Engineering Society Convention. Amsterdam.

Melchior, F. (2011). Investigations on Spatial Sound Design Based on Measured Room Impulse Responses, Ph.D. thesis, Technische Universiteit Delft, June 2011.

Melchior, F., Brix, S., Sporer, T., Roder, T., & Klehs, B. (2003). Wave field syntheses in combination with 2d video projection. Proceedings of the 24th International Audio Engineering Society Conference: Multichannel Audio, the New Reality, Banff, Canada.

Melchior, F., Sladeczek, C., de Vries, D., & Fröhlich, B. (2008). User-dependent optimization of wave field synthesis reproduction for directive sound fields. Proceedings of the 124th Convention of the Audio Engineering Society (AES). Amsterdam.

Pulkki, V. (1997). Virtual sound source positioning using vector base amplitude panning. Journal of Audio Engineering Society, 45(6), 456–466.

Reussner, T., Sladeczek, C., Rath, M., Brix, S., Preidl, K., & Scheck, H. (2013). Audio network based massive multichannel loudspeaker system for flexible use in spatial audio research. Journal of the Audio Engineering Society (JAES), 61(4), 235–245.

Ruiz, A., Sladeczek, C., & Sporer, T. (2015). A description of an object-based audio workflow for media productions. Proceedings of the 57th Audio Engineering Society Conference: The Future of Audio Entertainment Technology—Cinema, Television and the Internet.

Snow, W. B. (1955). Basic principles of stereophonic sound. IRE Transactions on Audio, 3(2), 42–53.

Spors, S. (2006). Spatial Aliasing Artifacts produced by Linear Loudspeaker Arrays Used for Wave Field Synthesis. Proceedings of the 2nd International Symposium on Communications, Control and Signal Processing (ISCCSP). IEEE Signal Processing Society.

Spors, S. (2007). Extension of an analytic secondary source selection criterion for wave field synthesis. Proceedings of the 123rd Audio Engineering Society (AES) Convention. New York, USA.

Start, E. (1996). Application of curved arrays in wave field synthesis. Proceedings of the 100th Audio Engineering Society (AES) Convention. Copenhagen, Denmark.

Start, E. (1997). Direct Sound Enhancement by Wave Field Synthesis, Ph.D. thesis, Technische Universiteit Delft, June 1997.

Steinberg, J. C., & Snow, W. B. (1934). Physical factors. Bell System Technical Journal, 13(2), 245–258.

Theile, G., Wittek, H., & Reisinger, M. (2002). Wellenfeldsynthese-Verfahren: Ein Weg für neue Möglichkeiten der räumlichen Tongestaltung. 22. Tonmeistertagung, Hannover, Germany, November 2002. Verband Deutscher Tonmeister e.V.

Verheijen, E. (1998). Sound Reproduction by Wave Field Synthesis, Ph.D. thesis, Technische Universiteit Delft, January 1998.

Weinzierl, S. (2008). Handbuch der Audiotechnik. Springer Science & Business Media, in German.

Wittek, H. (2007). Perceptual Differences Between Wave Field Synthesis and Stereophony, Ph.D. thesis, University of Surrey, 2007.
