7
Multi-Channel Sound Acquisition Using a Multi-Wave Sound Field Model

Oliver Thiergart and Emanuël Habets

International Audio Laboratories Erlangen, a Joint Institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS, Erlangen, Germany

7.1 Introduction

During the last decade, different devices with two or more microphones have emerged that enable multi-channel sound acquisition. Typical examples include mobile phones, digital cameras, and smart television screens, which can be used for a huge variety of audio applications including hands-free communication, spatial sound recording, and speech-based human–machine interaction. Key to the realization of these applications with different devices is flexible and efficient sound acquisition and processing. “Flexible” means that the sound can be captured with different microphone configurations while being able to generate one or more desired output signals at the reproduction side depending on the application. “Efficient” means that only a few audio signals, in comparison to the number of microphones used, need to be transmitted to the reproduction side while maintaining full flexibility.

This flexible and efficient sound acquisition and processing can be achieved using a parametric description of the spatial sound. This approach is used, for example, in directional audio coding (DirAC) as discussed in Chapter 5 (Pulkki, 2007) or high angular resolution plane-wave expansion (HARPEX; Berge and Barrett, 2010a,b), which represent two well-known parametric approaches to the analysis and reproduction of spatial sound. See Chapter 4 for a broader introduction to the techniques. In DirAC, it is assumed that for each time-frequency instance the sound field can be decomposed into a direct sound component and a diffuse sound component. In practice, model violations may occur when multiple sources are active simultaneously (Thiergart and Habets, 2012). To reduce such model violations, a higher-order extension was proposed by Politis et al. (2015) as discussed in the previous chapter, which performs a multi-directional energetic analysis. This approach requires spherical harmonics of higher orders as input signals, which can be obtained using a spherical microphone array. To reduce the model violations with more practical and almost arbitrary microphone setups, we introduced in Thiergart et al. (2014b) a multi-wave sound field model in which multiple direct sound components plus a diffuse sound component are assumed per time–frequency instance.

As illustrated in Figure 7.1, the parametric processing is performed in two successive steps. In the first step (parameter estimation and filtering), the sound field is analyzed in narrow frequency bands using multiple microphones to obtain the parametric representation of the spatial sound in terms of the direct sound and diffuse sound components and parametric side information, namely the direction of arrival (DOA) of the direct sounds. In the second step (application-dependent processing), one or more output signals are synthesized from the parametric representation depending on the application. As in DirAC, the described scheme allows for an efficient transmission of sound scenes to the reproduction side. In fact, instead of transmitting all microphone signals to the reproduction side for the application-dependent processing, only the compact parametric representation of the spatial sound is transmitted.


Figure 7.1 Block scheme of the parametric processing of spatial sound. Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

The direct sounds and diffuse sound in Figure 7.1 are extracted using optimal multi-channel filters. These filters can outperform the single-channel filters that are typically used in state-of-the-art (SOA) approaches such as DirAC. For example, multi-channel filters can extract the direct sounds while at the same time reducing the microphone noise and diffuse sound. Moreover, when extracting the diffuse sound using multi-channel filters, the leakage of undesired direct sounds into the diffuse sound estimate can be greatly reduced, as shown later. The multi-channel filters presented in this chapter are computed using instantaneous information on the underlying parametric sound field model, such as the instantaneous DOA or second-order statistics (SOS) of the direct sounds and diffuse sound. Incorporating this information enables the filters to adapt quickly to changes in the acoustic scene, which is of paramount importance for typical acoustic scenes where multiple sources are active at the same time in a reverberant environment.

The remainder of this chapter is organized as follows:

  • Section 7.2 explains how the parametric description of the spatial sound can be employed to realize the desired flexibility with respect to different audio applications.
  • Section 7.3 explains the parametric sound field model and introduces the corresponding microphone signals.
  • Section 7.4 introduces the multi-channel filters for the extraction of the direct sounds and diffuse sound.
  • Section 7.5 discusses the estimation of the parametric information, which is required to compute the filters.
  • Section 7.6 shows an example application of the approaches presented, namely the application to spatial sound reproduction.
  • Section 7.7 summarizes the chapter.

Each section also provides a brief overview of relevant SOA approaches. Note that throughout this work we focus on omnidirectional microphone configurations. However, in most cases it is straightforward to extend the discussed filters and estimators to directional microphone configurations.

7.2 Parametric Sound Acquisition and Processing

To derive the aforementioned filters and realize the different applications, we assume a rather dynamic scenario where multiple sound sources are active at the same time in a reverberant environment. The number of existing sound sources and their positions are unknown and sources may move, emerge, or disappear. To cope with such a scenario, we model the sound field in the time–frequency domain. We assume that the sound field P(k, n, r) at time instant n, frequency band k, and position r is composed of L direct sound components and a diffuse sound component:

(7.1)   P(k, n, r) = ∑l=1…L Ps,l(k, n, r) + Pd(k, n, r)

where the L direct sounds Ps, l(k, n, r) model the direct sound of the existing sources, and the diffuse sound models the reverberant or ambient sound. As discussed in more detail in Section 7.3, we assume that each direct sound Ps, l(k, n, r) can be represented as a single plane wave with DOA expressed by the unit-norm vector nl(k, n) or azimuth angle φl(k, n) and elevation angle ϑl(k, n). The DOA of the direct sounds represents a crucial parameter in this work. An example plane wave is depicted in Figure 7.2(a), while Figure 7.2(b) shows an example diffuse field. The resulting sound field P(k, n, r) is shown in Figure 7.2(c). The sound field model in Equation (7.1) is similar to the one in DirAC (Pulkki, 2007) and HARPEX (Berge and Barrett, 2010a,b). However, DirAC only assumes a single plane wave (L = 1) plus diffuse sound, while HARPEX assumes L = 2 plane waves and no diffuse sound.


Figure 7.2 Example showing a direct sound field (plane wave), a diffuse sound field, and the sum of both fields, for a frequency of f = 500 Hz. Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

7.2.1 Problem Formulation

The ultimate goal throughout this work is to capture the L direct sounds Ps, 1…L(k, n, r) and the diffuse sound Pd(k, n, r) at a reference position denoted by r1 with a specific, application-dependent response. This means that the target signal Y(k, n) we wish to obtain is a weighted sum of the L direct sounds and the diffuse sound at r1:

(7.2a)   Y(k, n) = ∑l=1…L Ds,l(k, n) Ps,l(k, n, r1) + Dd(k, n) Pd(k, n, r1)
(7.2b)           = Ys(k, n) + Yd(k, n)

The signal Ys(k, n) is the target direct signal, which is the weighted sum of the L direct sounds at r1 contained in xs(k, n) = [Ps,1(k, n, r1), …, Ps,L(k, n, r1)]^T. The potentially complex target responses Ds, 1…L(k, n) for the direct sounds are contained in the vector ds(k, n). The signal Yd(k, n) is the target diffuse signal, with the target diffuse response given by Dd(k, n).

The problem formulation in Equation (7.2) covers a huge variety of applications. The target direct responses Ds, l(k, n) and target diffuse response Dd(k, n) are the values of application-dependent response functions, for example Ds(k, φ) and Dd(k). In our applications, the direct sound response function depends on the time- and frequency-dependent DOA of the direct sound, for example, on the azimuth angle φ(k, n). Both Ds(k, φ) and the diffuse sound response function Dd(k) are discussed below for different applications.

Speech Enhancement for Hands-Free Communication

In this application, we aim to extract direct sounds from all directions unaltered, while attenuating the diffuse sound that would reduce the speech intelligibility (Naylor and Gaubitch, 2010). For this purpose, we set the direct response function to 1 for all DOAs, that is, Ds(k, φ) = 1, while the diffuse response function is set to 0. Alternatively, we can adjust the direct response function depending on the loudness of the direct sounds to achieve a spatial automatic gain control (AGC), as proposed by Braun et al. (2014).

Source Extraction

In source extraction we wish to extract the direct sounds from specific desired directions unaltered, while attenuating the direct sounds from other, interfering, directions. The diffuse sound is undesired. Therefore, we set the diffuse response function to 0, that is, Dd(k) = 0. The direct response function Ds(k, φ) corresponds to a spatial window function that possesses a high gain for the desired directions and a low gain for the undesired interfering directions. For instance, if the desired sources are located around 60°, we can use the window function depicted in Figure 7.3(a), which only extracts direct sounds arriving close to 60°. The source extraction application is explained, for example, in Thiergart et al. (2014b).
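To make the spatial-window idea concrete, the following NumPy sketch implements one possible direct-sound response function of the kind shown in Figure 7.3(a): a raised-cosine window centered on a desired direction. The function name, the raised-cosine shape, and the parameter values are illustrative assumptions and not the specific window used in the chapter.

```python
import numpy as np

def spatial_window(phi_deg, center_deg=60.0, width_deg=30.0):
    """Illustrative direct-sound response D_s(k, phi) for source extraction:
    raised-cosine window around the desired direction (example values only)."""
    d = np.abs((np.asarray(phi_deg) - center_deg + 180.0) % 360.0 - 180.0)  # angular distance
    w = 0.5 * (1.0 + np.cos(np.pi * np.minimum(d, width_deg) / width_deg))  # 1 at center, 0 at edge
    return np.where(d <= width_deg, w, 0.0)

# Example: high gain near 60 degrees, zero gain far away from it.
print(spatial_window([55.0, 60.0, 120.0]))
```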


Figure 7.3 Example target response functions for the direct sound. The left response function can be used, for instance, in sound extraction applications. The right response functions can be used in spatial sound reproduction employing a stereo loudspeaker (LS) setup. Source: Thiergart 2014b. Reproduced with permission of IEEE.

Spatial Sound Reproduction

In spatial sound reproduction we aim to recreate the original spatial impression of the sound field that was present during recording. On the reproduction side we typically use multiple loudspeakers (for example, a stereo or 5.1 setup), and Y(k, n) in Equation (7.2) is one of the loudspeaker signals. Note that we do not aim to recreate the original physical sound field, but to reproduce the sound such that it is perceptually identical to the original field at r1. Similarly to DirAC (Pulkki, 2007), this is achieved by reproducing the L direct sounds Ps, l(k, n, r) from the original directions, indicated by the DOAs φl(k, n), while the diffuse sound is reproduced from all directions. To reproduce the direct sounds from the original directions, the direct sound response function for the particular loudspeaker corresponds to a so-called loudspeaker panning function. Example panning functions for the left and right loudspeaker of a stereo reproduction setup following the vector base amplitude panning (VBAP) scheme (Pulkki, 1997) are depicted in Figure 7.3(b). As an alternative to loudspeaker panning functions, we can also use head-related transfer functions (HRTFs) to achieve spatial sound reproduction with headphones (Laitinen and Pulkki, 2009). The diffuse sound response function is set to a fixed value for all loudspeakers such that the diffuse sound is reproduced with the same power from all directions. This means that in this application the diffuse sound represents a desired component.

Acoustical Zooming

In acoustical zooming, we aim to mimic acoustically the visual zooming effect of a camera, such that the acoustical image and the visual image are aligned. For instance, when zooming in, the direct sound of the visible sources should be reproduced from the directions where the sources are visible in the video, while sources outside the visual image should be attenuated. Moreover, the diffuse sound should be reproduced from all directions but the signal-to-diffuse ratio (SDR) on the reproduction side should be increased to mimic the smaller opening angle of the camera. To achieve the correct reproduction of the direct sounds, we use the same approach as for the spatial sound reproduction explained above: the direct sound response function corresponds to a loudspeaker panning function. However, the loudspeaker panning functions are modified depending on the zoom factor of the camera to increase or decrease the width of the reproduced spatial image according to the opening angle of the camera. Moreover, we include an additional spatial window to attenuate direct sounds of sources that are not visible in the video. To achieve a plausible reproduction of the diffuse sound, we vary the diffuse response function depending on the zoom factor. For small zoom factors, where the opening angle of the camera is large, we set Dd(k) to 1, which means that we reproduce the diffuse sound with the original strength. When zooming in, we lower Dd(k), that is, less diffuse sound is reproduced, leading to a larger SDR at the reproduction side. Acoustical zooming is explained in Thiergart et al. (2014a).

7.2.2 Principal Estimation of the Target Signal

To estimate the target signal Y(k, n) in Equation (7.2), we capture the sound P(k, n, r) in Equation (7.1) with M microphones located at the positions r1…M. One microphone is located at the reference position r1 and is referred to as the reference microphone. In general, the target signal Y(k, n) is estimated by applying linear multi-channel filters to the recorded microphone signals. We have two principal possibilities to estimate the target signal Y(k, n) from the captured microphone signals, namely:

  1. direct estimation of the target signal Y(k, n), and
  2. separate estimation of the target direct signal Ys(k, n) and the target diffuse signal Yd(k, n), and then computing Y(k, n) using Equation (7.2b).

Possibility (i) means that we jointly extract the direct sounds and diffuse sound with the desired target responses from the microphone signals. This means applying a single filter to the microphone signals to obtain the target signal Y(k, n). Possibility (ii) means that we extract the direct and diffuse sounds separately with the desired target responses. This means that we apply two separate filters to the microphone signals to obtain Ys(k, n) and Yd(k, n) independently of each other.

Both possibilities have different advantages and disadvantages. In general, (i) is more accurate and the computational complexity is lower since only a single filter is computed. However, since the filter depends on the application (for example, the spatial window in source extraction, the loudspeaker setup in spatial sound reproduction, or the zoom factor in acoustical zooming), the filter must be computed and applied at the reproduction side. This means that we need to store or transmit all M microphone signals to the reproduction side, which requires high bandwidth. We can overcome this drawback when using (ii). With this approach, the two filters can be decomposed into two application-independent filters, which can be computed and applied at the recording side, and application-dependent target responses that can be applied at the reproduction side. This is illustrated in Figure 7.4. The filters on the recording side extract the direct sounds Ps, 1…L(k, n, r1) and the diffuse sound Pd(k, n, r1), which are transmitted to the reproduction side and then combined depending on the application. Therefore, we need to transmit only L + 1 audio signals instead of M microphone signals,1 and still maintain full flexibility with respect to the sound reproduction. The extracted direct sounds and diffuse sound together with the DOA information represent a compact parametric representation of the spatial sound that enables a fully flexible sound reproduction, independent of how the sound was recorded. Additionally, we have access to the separate target diffuse signal Yd(k, n). This is required in spatial sound reproduction applications, where we usually apply decorrelation filters to the diffuse sound to enable a realistic reproduction of the ambient sound (Pulkki, 2007).


Figure 7.4 Principal estimation of the target signal Y(k, n) using the indirect approach. Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

7.3 Multi-Wave Sound Field and Signal Model

This section explains the sound field model in Equation (7.1) in more detail. As explained in Section 7.2, the L plane waves Ps, l(k, n, r) represent the direct sound of multiple sound sources in a reverberant environment, and the diffuse sound represents the reverberation. The number of plane waves L is usually smaller than the actual number of active sources, assuming that the source signals are sufficiently sparse in the time–frequency domain. This is typically the case for speech signals (Yilmaz and Rickard, 2004). The impact of the assumed number L is discussed in Section 7.4.1. The sound is captured with M > L omnidirectional microphones positioned at r1…M. The microphone signals are written as

(7.3)   x(k, n) = ∑l=1…L xs,l(k, n) + xd(k, n) + xn(k, n)

where x(k, n) contains the M microphone signals, xs, l(k, n) is the lth plane wave measured with the different microphones, xd(k, n) is the measured diffuse field, and xn(k, n) is a noise component.

Assuming that the three terms in Equation (7.3) are mutually uncorrelated, we can express the power spectral density (PSD) matrix of the microphone signals as

(7.4a)   Φ(k, n) = E{x(k, n) x^H(k, n)}
(7.4b)            = Φs(k, n) + Φd(k, n) + Φn(k)

where Φs(k, n) is the direct sound PSD matrix, Φd(k, n) is the diffuse sound PSD matrix, and Φn(k) is the stationary noise PSD matrix. The different signal components (direct sound, diffuse sound, and noise) and corresponding PSD matrices are explained below.

7.3.1 Direct Sound Model

Without loss of generality, we consider the microphone located at r1 as the reference microphone. The direct sound xs, l(k, n) in Equation (7.3) is written as

(7.5)   xs,l(k, n) = al(k, n) Ps,l(k, n, r1)

where al(k, n) = [A1,l(k, n), A2,l(k, n), …, AM,l(k, n)]^T is the propagation vector of the lth plane wave with respect to the first microphone. The mth element of the lth propagation vector, Am, l(k, n), is the relative transfer function (RTF) between the first and mth microphones for the lth plane wave. Using the plane wave model for omnidirectional microphones, the RTF can be written as

(7.6)   Am,l(k, n) = exp{jκ nl^T(k, n) (rm − r1)}

where j is the imaginary unit, rm is the position of the mth microphone, and κ is the wavenumber. The propagation vector depends on the DOA of the plane wave, which is expressed by the unit-norm vector nl(k, n). For plane waves propagating in three-dimensional space, nl(k, n) is defined as

(7.7)   nl(k, n) = [cos φl(k, n) cos ϑl(k, n), sin φl(k, n) cos ϑl(k, n), sin ϑl(k, n)]^T

where φl(k, n) denotes the azimuth and ϑl(k, n) is the elevation angle. Note that, especially in dynamic multi-source scenarios, the DOAs of the plane waves can vary rapidly across time and frequency.
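The propagation vector in Equations (7.5)–(7.7) can be evaluated directly from the microphone geometry and the DOA. The following sketch assumes omnidirectional microphones, the phase convention exp{jκ n^T(rm − r1)} used above, and a speed of sound of 343 m/s; the sign of the exponent depends on the chosen Fourier-transform and DOA conventions, so this is one possible convention rather than the only correct one.

```python
import numpy as np

def steering_vector(mic_pos, azimuth, elevation, freq, c=343.0):
    """Propagation vector a_l relative to the first microphone (sketch of Eqs. 7.5-7.7).

    mic_pos : (M, 3) array of microphone positions in meters.
    azimuth, elevation : DOA of the plane wave in radians.
    freq : frequency in Hz; c : speed of sound in m/s.
    """
    kappa = 2.0 * np.pi * freq / c                      # wavenumber
    n = np.array([np.cos(azimuth) * np.cos(elevation),  # unit-norm DOA vector (Eq. 7.7)
                  np.sin(azimuth) * np.cos(elevation),
                  np.sin(elevation)])
    # Relative transfer functions between microphone 1 and microphone m (Eq. 7.6).
    return np.exp(1j * kappa * (mic_pos - mic_pos[0]) @ n)
```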

Given the direct sound model above, we can express the direct sound PSD matrix Φs(k, n) introduced in Equation (7.4b) as

(7.8)   Φs(k, n) = A(k, n) Ψs(k, n) A^H(k, n)

where A(k, n) = [a1(k, n), a2(k, n), …, aL(k, n)] is the propagation matrix. Moreover, the PSD matrix of the L plane waves is given by

(7.9)   Ψs(k, n) = E{xs(k, n) xs^H(k, n)}

where xs(k, n) contains the L plane wave signals at r1, as explained in Section 7.2. For mutually uncorrelated plane waves, Ψs(k, n) is a diagonal matrix where

(7.10)   Ψs,l(k, n) = E{|Ps,l(k, n, r1)|²},   l ∈ {1, …, L},

are the powers of the L plane waves.

7.3.2 Diffuse Sound Model

As mentioned before, the diffuse sound models the reverberant sound that is present at the recording location. Reverberation is commonly modeled as a homogenous and isotropic, time-varying, diffuse field (Nélisse and Nicolas, 1997; Jacobsen and Roisin, 2000). Such a diffuse field can be modeled as the sum of N ≫ L mutually uncorrelated plane waves arriving with equal power and random phases from uniformly distributed directions (Nélisse and Nicolas, 1997; Jacobsen and Roisin, 2000). In this case, the mth microphone signal corresponding to the diffuse sound can be expressed as

(7.11)   Pd(k, n, rm) = √(Ψd(k, n)/N) ∑i=1…N exp{jθi(k, n)} exp{jκ mi^T(k, n) rm}

where mi(k, n) describes the DOA of the ith plane wave forming the diffuse sound, θi(k, n) is the phase of the wave in the origin of the coordinate system, and Ψd(k, n) is the mean diffuse power. The phase terms θ1…N(k, n) are mutually uncorrelated, uniformly distributed, random variables, that is, θi(k, n) ∼ U[0, 2π). A spherically isotropic diffuse field is obtained when the directions mi are uniformly distributed on a sphere. The mean power of the diffuse field varies rapidly across time and frequency for a typical reverberant field. An example diffuse field computed with Equation (7.11) is depicted in Figure 7.2(b).

Given the model in Equation (7.11), the PSD matrix of the diffuse sound in (7.4b) can be written as

(7.12)   Φd(k, n) = Ψd(k, n) Γd(k)

The (m′, m)th element of Γd(k), denoted by γd,m′m(k), is the spatial coherence between microphones m′ and m for a purely diffuse sound field, which is time invariant and known a priori (Elko, 2001). For instance, for a spherically isotropic diffuse field and omnidirectional microphones, we have (Nélisse and Nicolas, 1997)

(7.13)   γd,m′m(k) = sinc(κ rm′m) = sin(κ rm′m) / (κ rm′m)

where rm′m = ‖rm′ − rm‖ is the distance between the two microphones.

In the following, we define the vector

(7.14)   u(k, n) = xd(k, n) / Pd(k, n, r1)

which relates the diffuse sound captured by the different microphones to the diffuse field at the reference position. Clearly, as the diffuse field for each (k, n) is formed by a new realization of random plane waves in Equation (7.11), the elements of u(k, n) must be assumed non-deterministic and unobservable. As shown by Thiergart and Habets (2014), the expectation of u(k, n) is known, as it is given by

(7.15)   E{u(k, n)} = γd(k)

where γd(k) is the diffuse coherence vector, that is, the first column of Γd(k). The definitions in Equations (7.14) and (7.15) are helpful for the derivation of the filters in Section 7.4.2.
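For reference, the diffuse coherence matrix Γd(k) in Equations (7.12) and (7.13) can be evaluated numerically as follows. The sketch assumes a spherically isotropic diffuse field and omnidirectional microphones, for which the coherence follows the sinc law of Equation (7.13); its first column is the diffuse coherence vector γd(k) of Equation (7.15).

```python
import numpy as np

def diffuse_coherence(mic_pos, freq, c=343.0):
    """Spatial coherence matrix Gamma_d of a spherically isotropic diffuse field
    for omnidirectional microphones (sinc model, cf. Eq. 7.13)."""
    kappa = 2.0 * np.pi * freq / c
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    # np.sinc(x) = sin(pi x)/(pi x), so divide the argument by pi.
    return np.sinc(kappa * dist / np.pi)
```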

7.3.3 Noise Model

With the noise xn(k, n) in Equation (7.3) we can model, for instance, a stationary background noise such as fan noise or the microphone self-noise. In contrast to the direct sound and diffuse sound, the noise PSD matrix Φn(k) given in Equation (7.4) is assumed to be time invariant. This assumption allows us to estimate Φn(k) directly from the microphone signals, for example during speech pauses, as discussed in Section 7.5.4. Unless otherwise stated, we make no further assumptions on the noise besides the stationarity assumption. In practice, this assumption can be relaxed to slowly time-varying noise.

7.4 Direct and Diffuse Signal Estimation

This section explains the estimation of the target direct signal Ys(k, n) and target diffuse signal Yd(k, n) in Equation (7.2b) using multi-channel filters. We consider estimators for which closed-form solutions exist such that these can be computed efficiently for each time and frequency with current information on the DOA of the L direct sounds and the SOS of the sound components.

7.4.1 Estimation of the Direct Signal

State of the Art

Existing spatial filtering approaches, which can be used to estimate , can be divided into signal-independent spatial filters and signal-dependent spatial filters (Van Veen and Buckley, 1988; Van Trees, 2002). Signal-independent filters extract desired sounds without taking time-varying SOS of the sound components into account. Typical examples are the delay-and-sum beamformer in Van Veen and Buckley (1988), the filter-and-sum beamformer in Doclo and Moonen (2003), differential microphone arrays (Elko et al., 1998; Elko, 2000; Teutsch and Elko, 2001; Benesty et al., 2008), or the superdirective beamformer (Cox et al., 1986; Bitzer and Simmer, 2001). Computing these filters only requires the array steering vectors a1…L(k, n) that specify the desired or undesired directions from which to capture or attenuate the sound. Unfortunately, the signal-independent filters cannot adapt to changing acoustic situations (that is, time-varying SOS of the sound components). Therefore, the superdirective beamformer, for example, performs well when the sound field is diffuse and the spatially white noise is low, but it performs poorly in noisy situations since its robustness against noise is low (Doclo and Moonen, 2007).

Signal-dependent spatial filters overcome this drawback by taking the time-varying SOS of the desired or undesired sound components into account. Examples are the minimum variance distortionless response (MVDR) beamformer (Capon, 1969) and the linearly constrained minimum variance (LCMV) filter (Van Trees, 2002), both of which minimize the residual power of the undesired components at the filter output subject to a single linear constraint and to multiple linear constraints, respectively. The multi-channel Wiener filter represents a signal-dependent filter that does not satisfy a linear constraint but minimizes the mean squared error (MSE) between the true and estimated desired signal (Doclo and Moonen, 2002). This filter provides a stronger reduction of the undesired components compared to the linearly constrained filters, but introduces signal distortions. To achieve a trade-off between signal distortions and noise reduction, the parametric Wiener filter was proposed (Spriet et al., 2004; Doclo and Moonen, 2005; Doclo et al., 2005). Later, this filter was derived for the multi-wave case (Markovich-Golan et al., 2012).

The optimal signal-dependent filter weights can be obtained directly or iteratively (Frost, 1972; Affes et al., 1996; Herbordt and Kellermann, 2003; Gannot and Cohen, 2008). The latter filters are referred to as adaptive filters and typically achieve a strong reduction of undesired signal components. Unfortunately, the adaptive filters can suffer from a strong cancelation of the desired signal in practice. Therefore, many approaches have been proposed to improve the robustness of these filters (Cox et al., 1987; Nordholm et al., 1993; Hoshuyama et al., 1999; Gannot et al., 2001; Reuven et al., 2008; Talmon et al., 2009; Markovich et al., 2009; Krueger et al., 2011). Another drawback of these filters is the inability to adapt sufficiently quickly to the optimal solution in dynamic situations, for example, source movements, competing speakers that become active when the desired source is active, or changing power ratios between the noise and reverberant sound. In contrast, the signal-dependent filters, which can be computed directly, can adapt quickly to changes in the acoustic scene. However, this requires that the array steering vectors a1…L(k, n) and SOS of the desired and undesired components, which are required to compute the filters, are estimated with sufficient accuracy and temporal resolution from the microphone signals.

Informed Direct Sound Filtering

In the following we consider signal-dependent filters to estimate Ys(k, n). The filters are computed for each time and frequency with current information on the underlying parametric sound field model. This includes current information on the SOS of the desired and undesired signal components, but also current directional information on the L plane waves. This enables the filters to quickly adapt to changes in the acoustic scene. These filters are referred to as informed spatial filters (ISFs) in the following.

In general, an estimate Ŷs(k, n) of the desired signal is obtained by a linear combination of the microphone signals x(k, n):

(7.16)   Ŷs(k, n) = ws^H(k, n) x(k, n)

As shown later, the multi-channel filter ws(k, n) can be decomposed as

(7.17)   ws(k, n) = Hs(k, n) ds(k, n)

where Hs(k, n) is a filter matrix and ds(k, n) are the target direct sound responses introduced in Section 7.2. Inserting the previous equation into Equation (7.16) yields

(7.18)   Ŷs(k, n) = ds^H(k, n) x̂s(k, n)

where

(7.19)   x̂s(k, n) = Hs^H(k, n) x(k, n)

The filter matrix Hs(k, n) is application independent and can be applied to the microphone signals x(k, n) at the recording side. Thus, we need to transmit only the L estimated plane waves in x̂s(k, n) to the reproduction side, instead of the M microphone signals, and still maintain full flexibility with respect to the direct sound reproduction.2 A corresponding block scheme which visualizes the recording and reproduction processing is depicted in Figure 7.4.
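The decomposition in Equations (7.16)–(7.19) separates an application-independent extraction stage from the application-dependent target responses. A minimal sketch of the two stages for one time–frequency bin, assuming a given filter matrix Hs and response vector ds:

```python
import numpy as np

def apply_direct_filters(H, x, d_s):
    """Two-stage direct-sound processing (sketch of Eqs. 7.16-7.19)."""
    x_s_hat = H.conj().T @ x     # recording side: estimated plane waves at r1 (Eq. 7.19)
    return d_s.conj() @ x_s_hat  # reproduction side: target direct signal Y_s (Eq. 7.18)
```

Only the L entries of x_s_hat (plus the DOA side information) need to be transmitted; the second line can be evaluated at the reproduction side with whatever responses the application requires.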

To obtain an accurate estimate of Ys(k, n) using Equation (7.16), a filter is required that can capture multiple source signals, namely the L plane waves, with the desired direct sound responses ds(k, n). In Thiergart and Habets (2013), the filter weights ws(k, n) in Equation (7.16) were derived as an informed LCMV filter, which minimizes the diffuse sound and stationary noise at the filter output. In Thiergart et al. (2013), the weights were derived as an informed minimum mean square error (MMSE) filter, which minimizes the mean square error between the estimated and true target direct signal. As explained later, both filters have specific advantages and disadvantages.

Parametric Multi-Channel Wiener Filter

In the following, we introduce an optimal multi-wave filter that represents a generalization of the informed LCMV filter and the informed MMSE filter. The proposed spatial filter is referred to as the informed parametric multi-wave multi-channel Wiener (PMMW) filter and is found by minimizing the stationary noise and diffuse sound at the filter output subject to a constraint that limits the signal distortion of the extracted direct sound. Expressed mathematically, the filter weights ws(k, n) are computed as

(7.20)   ws(k, n) = arg minw  w^H(k, n) Φu(k, n) w(k, n)

subject to

(7.21)   E{|Ŷs,l(k, n) − Ys,l(k, n)|²} ≤ σl²,   l ∈ {1, …, L}

where Φu(k, n) = Φd(k, n) + Φn(k) is the undesired noise-plus-diffuse sound PSD matrix and Ys,l(k, n) = Ds,l(k, n) Ps,l(k, n, r1) is the desired filter output for the lth plane wave. Moreover, Ŷs,l(k, n) = ws^H(k, n) xs,l(k, n) is the actual filter output for the lth plane wave, which potentially is distorted. The desired maximum distortion of the lth plane wave is specified with σl². A higher maximum distortion means that we can better attenuate the noise and diffuse sound in Equation (7.20). As shown by Thiergart et al. (2014b), a closed-form solution for ws(k, n) is given by Equation (7.17), where Hs(k, n) is computed as

(7.22)   HsPMMW(k, n) = Φu^−1(k, n) A(k, n) [A^H(k, n) Φu^−1(k, n) A(k, n) + Ω(k, n) Ψs^−1(k, n)]^−1

Computing the filter requires the DOA of the L plane waves – to compute the propagation matrix A(k, n) with Equation (7.5) – as well as Φu(k, n) and Ψs(k, n). The estimation of these quantities is discussed in Section 7.5. The real-valued, positive L × L diagonal matrix Ω(k, n) contains time- and frequency-dependent control parameters,

(7.23)   Ω(k, n) = diag{ωs,1(k, n), ωs,2(k, n), …, ωs,L(k, n)},

which allow us to control the trade-off between noise suppression and signal distortion for each plane wave. The trade-off is illustrated in Figure 7.5:

  • For Ω(k, n) = 0, HsPMMW(k, n) reduces to the informed LCMV filter, denoted by HsLCMV(k, n), as proposed by Thiergart and Habets (2013), which extracts the L direct sounds without distortions (σ²1…L = 0) when A(k, n) does not contain estimation errors. The informed LCMV filter provides a trade-off between white noise gain (WNG) and directivity index (DI) depending on the diffuse-to-noise ratio (DNR), that is, depending on which of the two undesired components (diffuse sound or noise) is more prominent.
  • For Ω(k, n) = IL, with IL being the L × L identity matrix, HsPMMW(k, n) reduces to the informed MMSE filter, denoted by HsMW(k, n), proposed by Thiergart et al. (2013). This filter provides a stronger suppression of the noise and diffuse sound compared to the informed LCMV but introduces undesired signal distortions (σ²1…L > 0).
  • For 0 < ωs, l(k, n) < 1 we can achieve for each plane wave a trade-off between the LCMV filter and MMSE filter such that we obtain a strong attenuation of the noise while still ensuring a tolerable amount of undesired signal distortions. This is discussed in more detail shortly. For ωs, l(k, n) > 1, we can achieve an even stronger attenuation of the noise and diffuse sound than with the MMSE filter, at the expense of stronger signal distortions. A sketch of the resulting filter computation is given after this list.
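The following sketch illustrates how such a filter matrix could be computed for one time–frequency bin. It assumes the closed form Hs = Φu⁻¹ A (Aᴴ Φu⁻¹ A + Ω diag(Ψs,1…L)⁻¹)⁻¹, which is one common way of writing a multi-wave parametric multi-channel Wiener filter and reproduces the limiting cases described above (Ω = 0 gives the informed LCMV filter, Ω = IL the informed MMSE filter); for the exact expression of Equation (7.22), see Thiergart et al. (2014b).

```python
import numpy as np

def informed_pmmw_filter(A, Phi_u, psi_s, omega):
    """Sketch of an informed PMMW filter matrix Hs (assumed closed form, see lead-in).

    A      : (M, L) propagation matrix of the L plane waves.
    Phi_u  : (M, M) noise-plus-diffuse PSD matrix.
    psi_s  : (L,)  powers of the L plane waves.
    omega  : (L,)  trade-off parameters (0 -> LCMV, 1 -> MMSE).
    """
    Phi_u_inv_A = np.linalg.solve(Phi_u, A)               # Phi_u^{-1} A
    G = A.conj().T @ Phi_u_inv_A                          # A^H Phi_u^{-1} A
    reg = np.diag(np.asarray(omega) / np.asarray(psi_s))  # Omega diag(psi_s)^{-1}
    return Phi_u_inv_A @ np.linalg.inv(G + reg)           # (M, L) filter matrix

# The l-th column extracts the l-th plane wave; the target direct signal is then
# obtained as in Equations (7.17)-(7.19).
```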

Figure 7.5 Relation of the different multi-channel filters for direct sound estimation in terms of noise reduction, diffuse sound reduction, and signal distortions. Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

As an example of an ISF, Figure 7.6 (solid line) shows the magnitude of an arbitrary (for example, user-defined) desired response function Ds(k, φ) as a function of the azimuth angle φ. This function means that we aim to capture a plane wave arriving from φ1 = 37° (indicated by the circle) with a gain of Ds, 1(k, n) = −19 dB, while a second plane wave arriving from φ2 = 123° (indicated by the square) is captured with a gain of Ds, 2(k, n) = 0 dB. Both gains would then form the desired response vector ds(k, n) in Equation (7.17) – assuming both waves are simultaneously active. The directivity pattern of the resulting spatial filter, which would capture both plane waves with the desired responses contained in ds(k, n), is depicted by the dashed line in Figure 7.6. Here, we are considering the LCMV solution (Ω(k, n) = 0) and a uniform linear array (ULA) with M = 5 omnidirectional microphones and microphone spacing r = 3 cm at f = 3.3 kHz. As we can see in the plot, the directivity pattern of the spatial filter exhibits the desired gains for the DOAs of the two plane waves. Note that the directivity pattern of the spatial filter is different for different L and DOAs of the direct sound. Moreover, the directivity pattern is essentially different from the desired response function Ds(k, φ). In fact, it is not the aim of the spatial filter to resemble the response function for all angles φ, but to provide the desired response for the DOAs of the L plane waves.


Figure 7.6 Solid line: Example desired response function . Dashed line: Resulting directivity pattern of an example spatial filter that assumes L = 2 plane waves with DOAs φ1 and φ2. Source: Thiergart 2014b. Reproduced with permission of IEEE.

Adjusting the Control Parameters ωs, l(k, n) in Practice

In many applications it is desired to capture a plane wave with low distortions if the wave is strong compared to the noise and diffuse sound. In this case, a strong attenuation of the noise and diffuse sound is less important. If, however, the plane wave is weak compared to the noise and diffuse sound, a strong attenuation of these components is desired. In the latter case, distortions of the plane wave signal can be assumed to be less critical.

To obtain such a behavior of the ISF, the parameters ωs, l(k, n) can be made signal dependent. To ensure a computationally feasible algorithm, we compute all parameters ωs, 1…L(k, n) independently, even though they jointly determine the signal distortion σl² for the lth plane wave. For the following example, let us first introduce the logarithmic input signal-to-diffuse-plus-noise ratio (SDNR) for the lth plane wave as

(7.24)   ξl(k, n) = 10 log10 [ Ψs,l(k, n) / (Ψd(k, n) + Ψn(k)) ]

where Ψn(k) is the mean noise power across the microphones. On the one hand, the parameter ωs, l(k, n) should approach 0 (leading to the LCMV filter) if ξl(k, n) is large. On the other hand, ωs, l(k, n) should become 1 (leading to the MMSE filter) or larger than 1 (leading to a filter that is even more aggressive than the MMSE filter) if ξl(k, n) is small. This behavior for ωs, l(k, n) can be achieved, for instance, if we compute ωs, l(k, n) via a sigmoid function that is monotonically decreasing with increasing input SDNR, for example

(7.25a)   ωs,l(k, n) = sig{ξl(k, n)}

where sig{·} denotes the sigmoid function and α1…3 are the sigmoid parameters that control the behavior of ωs, l(k, n). In practice, the sigmoid parameters may need to be adjusted specifically for the given application, and also depending on the accuracy of the parameter estimators in Section 7.5, to obtain the best performance. Clearly, different functions to control ωs, l(k, n) can be designed depending on the specific application and desired behavior of the filter. However, the sigmoid function in Equation (7.25b), with the associated parameters, provides high flexibility in adjusting the behavior of the spatial filters.

Figure 7.7 shows three examples of sig{ξ} in Equation (7.25b). Note that the sigmoid functions are plotted in decibels. The “sigmoid 1” function may represent a suitable function for many applications. In general, with α1 we control the maximum of the sigmoid function for low input SDNRs ξl(k, n). Higher values for α1 lead to more aggressive noise and diffuse sound suppression when the input SDNR is low. With α2 and α3 we control the slope and shift along the ξl axis, respectively. For instance, shifting the sigmoid function towards low input SDNRs and using a steep slope means that the plane waves are extracted with low undesired distortions unless the diffuse sound and noise become very strong. Accordingly, the “sigmoid 2” function in Figure 7.7 would yield a less aggressive filter, while the “sigmoid 3” function would yield a more aggressive filter. Note that all the parameters can be frequency dependent.
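As an illustration, a sigmoid-based mapping from the input SDNR to the control parameter could look as follows. The logistic form and the parameter names are assumptions made for this sketch; they are not the exact function and values of Equation (7.25b).

```python
import numpy as np

def control_parameter(sdnr_db, a1=1.0, a2=0.5, a3=-9.0):
    """Map the input SDNR (in dB) to a trade-off parameter omega_{s,l} (illustrative).

    Monotonically decreasing in the SDNR: low SDNR -> omega close to a1 (aggressive,
    MMSE-like), high SDNR -> omega close to 0 (LCMV-like, low distortion).
    """
    return a1 / (1.0 + np.exp(a2 * (np.asarray(sdnr_db) - a3)))

# Example: strong suppression at -30 dB SDNR, almost distortionless at +20 dB.
print(control_parameter([-30.0, 0.0, 20.0]))
```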


Figure 7.7 Example sigmoid functions depending on the SDNR. The parameters of the function “sigmoid 1” are: α1 = 1, α2 = 0.5, α3 = −9, α4 = −18 dB. Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

Influence of Early Reflections

The signal model in Equation (7.3) assumes that the direct sounds, the L plane waves, are mutually uncorrelated. This assumption greatly simplifies the derivation of the informed PMMW filter leading to the closed-form solution in Equation (7.22). Assuming mutually uncorrelated plane waves is reasonable for the direct sounds of different sources, but typically does not hold for the direct sound of a source and its early reflections.3 Hence, we assume that the plane waves for a given time–frequency instant correspond to different sound sources (rather than one or more reflections of the same source). Analyzing the impact of mutually correlated direct sounds and reflections on the filter performance remains a topic for further research. In fact, it should be noted that not only the underlying assumption of the filter is violated, but also that of the parameter estimators in Section 7.5.

Influence of the Assumed Number of Plane Waves L

The number of plane waves L assumed in Equations (7.1) and (7.3) strongly influences the performance of the filter. If L is too small (smaller than the actual number of prominent waves), then the signal model in Equation (7.1) is violated. In this case, the filter in Equation (7.16) extracts fewer plane waves than desired, which leads to undesired distortions of the direct sound. The effects of such model violations are discussed, for example, in Thiergart and Habets (2012). On the other hand, if L is too high, the spatial filter has fewer degrees of freedom to minimize the noise and diffuse sound power in Equation (7.20). Therefore, the assumed value for L is a trade-off that depends on the number of microphones and desired filter performance. For spatial sound reproduction applications, using L = 2 or L = 3 usually represents a reasonable choice. Alternatively, the number of sources can be estimated, for example as discussed in Section 7.5.1.

7.4.2 Estimation of the Diffuse Signal

State of the Art

In most applications, the diffuse sound is extracted from a single microphone using single-channel filters. For example in Uhle et al. (2007), the diffuse sound is extracted based on non-negative matrix factorization (NMF). In DirAC (Pulkki, 2007), the diffuse sound is extracted by designing a single-channel filter based on the diffuseness of the sound. In Del Galdo et al. (2012), it was shown that diffuseness-based sound extraction is a close approximation to a single-channel square-root Wiener filter.

Only a few approaches are available for the extraction of diffuse sounds from multiple microphones. Compared to the single-channel filters, these multi-channel filters have the advantage that they can extract the diffuse sound while at the same time attenuating the direct sound, which avoids direct sounds leaking into the diffuse sound estimate. To extract the diffuse sound from a microphone pair, most available approaches assume that the diffuse sound is uncorrelated between the microphones whereas the direct sound is correlated. The diffuse sound is extracted by removing the correlated sound components from the recorded microphone signals (Avendano and Jot, 2002; Irwan and Aarts, 2002; Faller, 2006; Usher and Benesty, 2007; Merimaa et al., 2007). The drawback of these approaches is that only two microphones can be employed directly. Moreover, the diffuse sound at lower frequencies or for smaller microphone spacings is typically correlated between the microphones such that removing correlated signal components cancels the diffuse sound as well. The approaches in Irwan and Aarts (2002), Faller (2006), Usher and Benesty (2007), and Merimaa et al. (2007) remove the coherent sound by computing the difference between the two microphone signals after properly delaying or equalizing the signals. This is essentially spatial filtering, where a spatial null is steered towards the DOA from which the direct sound arrives. In Thiergart and Habets (2013), this idea was applied to a linear array of omnidirectional microphones. Unfortunately, the resulting spatial filter was highly directional for higher frequencies, which is suboptimal when we aim to capture an isotropic diffuse field.

Informed Diffuse Sound Filtering

In the following, we use an ISF for estimating Yd(k, n) in Equation (7.2b), similarly to Thiergart and Habets (2013). As in the previous subsection, an estimate Ŷd(k, n) of the desired signal is obtained by a linear combination of the microphone signals x(k, n),

(7.26)   Ŷd(k, n) = wd^H(k, n) x(k, n)

where wd(k, n) is a complex weight vector of length M. As shown later, all the filters discussed can be decomposed as

(7.27)   wd(k, n) = hd(k, n) Dd*(k, n)

where hd(k, n) is an application-independent filter while Dd(k, n) is application dependent. Inserting the previous equation into Equation (7.26) yields

(7.28)   Ŷd(k, n) = Dd(k, n) P̂d(k, n, r1)

where

(7.29)   P̂d(k, n, r1) = hd^H(k, n) x(k, n)

is an estimate of the diffuse sound at the reference position. The filter hd(k, n) does not depend on the target response Dd(k, n). Hence, it is application independent and can be computed and applied at the recording side. Since we recompute the filters Hs(k, n) and hd(k, n), respectively, for each time and frequency, we focus on filters with a closed-form expression.

Distortionless Response Filters

We first discuss spatial filters that estimate the desired diffuse sound with a distortionless response. Distortionless response means that the filter extracts the diffuse sound at the reference microphone with the target response Dd(k, n). Therefore, a distortionless filter must satisfy the constraint

(7.30)   wd^H(k, n) xd(k, n) − Yd(k, n) = 0

An alternative expression is found by dividing Equation (7.30) by the target diffuse signal Yd(k, n) and using Equation (7.14):

(7.31)   hd^H(k, n) u(k, n) = 1

The left-hand side in Equation (7.30) is the distortion of the target diffuse signal at the filter output, which is forced to zero by the filter. All filters that satisfy Equation (7.31) require u(k, n) and thus are denoted as hd(k, n, u). Unfortunately, the vector u(k, n) is unavailable in practice and must be considered as an unobservable random variable. As a consequence, we cannot compute the distortionless response filters in practice. However, the filters serve as a basis for deriving the distortionless response average filters at the end of this section.

Example Distortionless Response Filter

An optimal distortionless response filter for estimating the target diffuse sound cancels out or reduces the direct sounds and noise while satisfying Equation (7.31). Such a filter is found, for example, by minimizing the residual noise power at the filter output while using L linear constraints for nulling out the L plane waves. In this case, the optimal filter is referred to as an LCMV filter and is computed as

(7.32)   hdLCMV(k, n, u) = arg minh  h^H(k, n) Φn(k) h(k, n)

subject to Equation (7.31) and subject to

(7.33)   h^H(k, n) A(k, n) = 0

where A(k, n) is the array steering matrix. This filter satisfies L + 1 constraints and hence requires at least M = L + 1 microphones.

The optimization problem of this section has a well-known solution (Van Trees, 2002). The solution for the LCMV filter in Equation (7.32) is given by Equation (7.27), where hd(k, n) is

(7.34)   hdLCMV(k, n, u) = Φn^−1(k) C(k, n) [C^H(k, n) Φn^−1(k) C(k, n)]^−1 g

Here, C(k, n) = [A(k, n), u(k, n)] is the constraint matrix and g = [0, …, 0, 1]^T the corresponding response vector. We can see that computing the filters requires knowledge about u(k, n), which is unavailable in practice as explained before.

Distortionless Response Average Filters

Since we cannot compute distortionless response filters in practice, we investigate a second class of filters. To compute them, we consider the expectation of the distortionless response filters computed across the different realizations of u(k, n). This leads to the distortionless response average filters hdA(k, n), which are computed as

(7.35a)   hdA(k, n) = E{hd(k, n, u)}
(7.35b)             = ∫ hd(k, n, u) f(u) du

where hd(k, n, u) is the distortionless response filter and f(u) is the probability density function (PDF) of u(k, n). Unfortunately, no closed-form solution exists to compute the integral, and obtaining a numerical solution is computationally very expensive. As shown by Thiergart and Habets (2014), a close approximation of Equation (7.35a) is given by

(7.36)   E{hd(k, n, u)} ≈ hd(k, n, E{u})

This equation means that the expected weights of the linearly constrained filters can be found approximately by computing the filter with the expectation of the linear constraint. The expectation of u(k, n) is given in Equation (7.15). The approximate distortionless response average filters, denoted by h̃dA(k, n), can now be computed as

(7.37)   h̃dA(k, n) = hd(k, n, γd)

which is an approximation of hdA(k, n) in Equation (7.35a). An example filter that can be computed in practice is shown next.

Example Distortionless Response Average Filter

The distortionless response LCMV filter hdLCMV(k, n, u) introduced earlier minimizes the power of the noise at the filter output while satisfying a linear diffuse sound constraint and L additional linear constraints to attenuate the direct sounds. The corresponding approximate average LCMV filter is defined using Equation (7.37):

(7.38)   hdALCMV(k, n) = hdLCMV(k, n, γd)

A closed-form solution for hdALCMV(k, n) is found when solving the optimization problem in Equation (7.32) subject to Equation (7.33) and subject to Equation (7.31) with γd(k) instead of u(k, n). The result is given by Equation (7.27), where hd(k, n) is given in Equation (7.34) when substituting u(k, n) with γd(k):

(7.39)   hdALCMV(k, n) = Φn^−1(k) C̃(k, n) [C̃^H(k, n) Φn^−1(k) C̃(k, n)]^−1 g

where C̃(k, n) = [A(k, n), γd(k)]. Computing this filter requires the DOA of the L plane waves, to compute the array steering matrix A(k, n), and the noise PSD matrix Φn(k). The filter requires M > L + 1 microphones to satisfy the L + 1 linear constraints and to minimize the residual noise.
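A compact sketch of the computation described above, using the standard closed-form LCMV solution with the constraint matrix [A(k, n), γd(k)] and the response vector [0, …, 0, 1]ᵀ:

```python
import numpy as np

def diffuse_alcmv_filter(A, gamma_d, Phi_n):
    """Approximate average LCMV filter for diffuse sound extraction (sketch).

    Minimizes h^H Phi_n h subject to h^H A = 0 (null the L plane waves) and
    h^H gamma_d = 1 (unit response to the expected diffuse constraint vector).
    A: (M, L) steering matrix, gamma_d: (M,) coherence vector, Phi_n: (M, M).
    """
    M, L = A.shape
    C = np.column_stack([A, gamma_d])              # constraint matrix [A, gamma_d]
    g = np.zeros(L + 1, dtype=complex)
    g[-1] = 1.0                                    # response vector [0, ..., 0, 1]^T
    Phi_n_inv_C = np.linalg.solve(Phi_n, C)        # Phi_n^{-1} C
    return Phi_n_inv_C @ np.linalg.solve(C.conj().T @ Phi_n_inv_C, g)
```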

Figure 7.8 shows the directivity pattern of the approximate average LCMV filter hdALCMV(k, n) and the SOA spatial filter used in Thiergart and Habets (2013). The directivity patterns were computed for different frequencies assuming a ULA of M = 8 omnidirectional microphones with spacing r = 3 cm. A single plane wave was arriving from φ = 72°. We can see that for increasing frequencies, the SOA filter (the plots on the left-hand side) became very directional such that the diffuse sound was captured mainly from one direction. In contrast, the proposed approximate average LCMV filter (right-hand side) provided the desired almost omnidirectional directivity even for high frequencies (besides the spatial null for the direction of the plane wave).


Figure 7.8 Directivity pattern of the SOA filter (left) and approximate average LCMV filter (right) for extracting diffuse sound in the presence of a single plane wave. Uniform linear array with M = 8 omnidirectional microphones (r = 3 cm). Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

7.5 Parameter Estimation

This section deals with the estimation of the parameters that are required to compute the informed filters in Section 7.4.

7.5.1 Estimation of the Number of Sources

A good overview of approaches for estimating the number of sources forming a mixture can be found in Jiang and Ingram (2004). The most popular approaches consider the eigenvalues of the microphone input PSD matrix, in our case Φ(k, n) defined in Equation (7.4). In general, these approaches are suited to our framework since the computationally expensive eigenvalue decomposition (EVD) of Φ(k, n) is also required later, namely when estimating the DOAs of the direct sound (see Section 7.5.2).

The well-known approach proposed by Wax and Kailath (1985) models the microphone signals as a sum of L mutually uncorrelated source signals plus independent and identically distributed (iid) noise. Diffuse sound was not considered. Under these assumptions, Φ(k, n) in Equation (7.4a) possesses M − L eigenvalues which are equal to the noise power. Thus, the number of sources can be determined from the multiplicity of the smallest eigenvalue, that is, from the number of eigenvalues that are equal. In practice, determining this number is difficult since Φ(k, n) needs to be estimated from the microphone signals and contains estimation errors. Hence, the M − L smallest eigenvalues will be different. Therefore, the approach in Wax and Kailath (1985) determines the multiplicity of the smallest eigenvalue based on information theoretic criteria (ITC) such as the minimum description length (MDL) or Akaike information criterion (AIC).

A conceptually different approach, which also considers the eigenvalues of Φ(k, n), was used by Markovich et al. (2009). Here, L was determined by considering the difference of the eigenvalues compared to the maximum eigenvalue and a fixed lower eigenvalue threshold. The main advantage of this approach is the almost negligible computational costs compared to the earlier method. Therefore, we consider this approach in the example application in Section 7.6.
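A minimal sketch in the spirit of this eigenvalue-threshold idea: count the eigenvalues of the estimated input PSD matrix that lie within a chosen dynamic range of the largest eigenvalue and above an absolute floor. The threshold values used here are illustrative and would need tuning for a given array and noise level.

```python
import numpy as np

def estimate_num_sources(Phi, dyn_range_db=25.0, abs_floor=1e-6):
    """Estimate the number of prominent waves L from the eigenvalues of Phi (sketch)."""
    eigvals = np.sort(np.linalg.eigvalsh(Phi))[::-1]          # real, descending
    rel_thresh = eigvals[0] * 10.0 ** (-dyn_range_db / 10.0)  # dynamic-range threshold
    return int(np.sum(eigvals > max(rel_thresh, abs_floor)))
```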

7.5.2 Direction of Arrival Estimation

The DOA of the L narrowband plane waves represents a crucial parameter for computing the informed spatial filters. Once the DOAs are estimated, the elements of the array steering matrix A(k, n), which is required to compute the informed spatial filters, can be determined using Equation (7.6). The required DOA estimators are available for almost any microphone configuration, such as (non-)uniform linear arrays, planar arrays, or the popular B-format microphone.

The most popular multi-wave DOA estimators (which can estimate multiple DOAs per time and frequency) for arrays of omnidirectional microphones are ESPRIT (Roy and Kailath, 1989) and Root MUSIC (Rao and Hari, 1988). Both approaches can estimate the DOA of L < M plane waves. ESPRIT requires an array that can be separated into two identical, rotationally invariant, subarrays. It is also extensible to spherical microphone arrays (Goossens and Rogier, 2009). Root MUSIC was initially derived for ULAs, but later extended to non-uniform linear arrays (NLAs) with microphones located on an equidistant grid (Mhamdi and Samet, 2011). Root MUSIC is also available for circular microphone arrays (Zoltowski and Mathews, 1992; Zoltowski et al., 1993).

Both ESPRIT and Root MUSIC exploit the phase differences between the microphones to estimate the DOAs. Root MUSIC is generally more accurate than ESPRIT, but ESPRIT is computationally more efficient, especially its real-valued formulation Unitary ESPRIT (Haardt and Nossek, 1995). Due to the required EVD, both ESPRIT and Root MUSIC are computationally expensive. Moreover, none of these approaches considers diffuse sound in the signal model, which yields biased DOA estimates in reverberant environments.

7.5.3 Microphone Input PSD Matrix

All estimators required throughout this work operate (directly or indirectly) on the microphone input PSD matrix Φ(k, n), or on some of its elements. To estimate Φ(k, n), the expectation operator in Equation (7.4a) is usually replaced by temporal averaging. The averaging is often carried out as block averaging – see, for example, Van Trees (2002) and the references therein. Alternatively, Φ(k, n) can be estimated using a recursive averaging filter,

(7.40)   Φ̂(k, n) = ατ Φ̂(k, n − 1) + (1 − ατ) x(k, n) x^H(k, n)

This approach is used in many applications of spatial sound processing, such as DirAC (Pulkki, 2007), as it requires less memory than block averaging. Here, ατ ∈ (0, 1] is the filter coefficient corresponding to a specific time constant τ. The filter coefficient depends on the parameters of the time–frequency transform used. For instance, for a short-time Fourier transform (STFT) with hop size R at a sampling frequency fs, we have

(7.41)   ατ = exp(−R / (τ fs))
Replacing the expectation operator in Equation (7.4a) by the temporal averaging in Equation (7.40) assumes that the underlying random processes are ergodic. In practice, there is always a trade-off between a low estimation variance (longer time averaging) and a sufficiently high temporal resolution to track changes in the acoustic scene (shorter time averaging). By using small values for τ (typically in the range 30 ms ≤ τ ≤ 60 ms), we obtain almost instantaneous estimates of Φ(k, n), and, thus, of all the parameters in this section. This enables the informed filters in this work to adapt sufficiently fast to quick changes in the acoustic scene.
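The recursive averaging can be implemented per time–frequency bin as follows; the exponential mapping from the time constant to the filter coefficient is a common choice and is stated here as an assumption rather than the exact expression of Equation (7.41).

```python
import numpy as np

def update_psd_matrix(Phi_prev, x, alpha):
    """One recursive-averaging update of the input PSD matrix (cf. Eq. 7.40)."""
    return alpha * Phi_prev + (1.0 - alpha) * np.outer(x, x.conj())

def averaging_coefficient(tau, hop_size, fs):
    """Filter coefficient for time constant tau (s), STFT hop size R (samples),
    and sampling frequency fs (Hz), assuming alpha = exp(-R / (tau * fs))."""
    return np.exp(-hop_size / (tau * fs))
```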

7.5.4 Noise PSD Estimation

Since the SOS of the noise are assumed to be time invariant or slowly time varying, we can estimate the noise PSD matrix Φn(k) and the noise PSDs during time periods where the sound sources are inactive and where no diffuse sound is present. Several corresponding approaches for estimating Φn(k) that can be used in our framework are discussed, for instance, in Habets (2010), Souden et al. (2011), and Gerkmann and Hendriks (2012).

7.5.5 Diffuse Sound PSD Estimation

This subsection discusses the estimation of the diffuse sound PSD Ψd(k, n) when L ≥ 1 sound sources are active per time and frequency. Once Ψd(k, n) is estimated, we can compute the diffuse sound PSD matrix Φd(k, n) with Equation (7.12). The diffuse PSD is estimated using spatial filters that suppress the direct sounds and capture the diffuse sound. The filters require an array with at least M = L + 1 microphones to suppress the L direct sounds and capture the diffuse sound.

State of the Art

Irwan and Aarts (2002), Faller (2006), Usher and Benesty (2007), and Merimaa et al. (2007) used M = 2 microphones to extract the diffuse sound while suppressing a single direct sound. From the extracted diffuse sound it is straightforward to estimate the diffuse PSD, as shown later. The same principle was applied by Thiergart and Habets (2013) to multiple plane waves. Here, a linearly constrained spatial filter was employed which attenuates L plane waves while directing the main lobe of the filter towards a direction from which no direct sound arrives. We summarize this approach here.

The weights of the linearly constrained spatial filter can be found by minimizing the noise power at the filter output,

(7.42)   h(k, n) = arg minh  h^H(k, n) Φn(k) h(k, n)

subject to h^H(k, n) A(k, n) = 0 and h^H(k, n) a0(k, n) = 1. The first constraint cancels out the L plane waves, while the second constraint ensures non-zero filter weights h(k, n). The propagation vector a0(k, n) corresponds to a specific direction n0(k, n) from which no direct sound arrives. The optimal direction n0(k, n), towards which we direct the main lobe of the spatial filter, is the direction which maximizes the output diffuse-to-noise ratio (DNR). Unfortunately, no closed-form solution to compute this direction is available. Therefore, we choose for n0(k, n) the direction which has the largest distance to all nl(k, n), that is, in the two-dimensional case,

(7.43)   φ0(k, n) = arg maxφ minl |φ − φl(k, n)|

To estimate the diffuse PSD Ψd(k, n), we apply the weights h(k, n) to the signal PSD matrix Φ(k, n) in Equation (7.4a). For a filter h(k, n) that cancels the L plane waves, this leads to

(7.44)   h^H(k, n) Φ(k, n) h(k, n) = Ψd(k, n) h^H(k, n) Γd(k) h(k, n) + h^H(k, n) Φn(k) h(k, n)

Rearranging this equation yields

(7.45)   Ψd(k, n) = [h^H(k, n) Φ(k, n) h(k, n)] / [h^H(k, n) Γd(k) h(k, n)] − ε(k, n)

where

(7.46)   ε(k, n) = [h^H(k, n) Φn(k) h(k, n)] / [h^H(k, n) Γd(k) h(k, n)]

Assuming that the error term ε(k, n) is small compared to Ψd(k, n), a reasonable estimate of the diffuse sound PSD is given by

(7.47)   Ψ̂d(k, n) = [h^H(k, n) Φ(k, n) h(k, n)] / [h^H(k, n) Γd(k) h(k, n)]

It is clear from Equation (7.46) that the estimator is biased in the presence of noise, that is, we overestimate Ψd(k, n). Note that the filter proposed above minimizes the numerator in Equation (7.46), but not the whole error term. Also, the direction φ0 computed in Equation (7.43) does not necessarily minimize ε(k, n). Therefore, the filter is not optimal in the sense of providing the most accurate diffuse PSD estimate. To reduce the bias, one could subtract the error term ε(k, n), which can be computed with Equation (7.46), from Ψ̂d(k, n). In practice, this may lead to negative PSD estimates, namely when the involved quantities Φ(k, n) and Φn(k) contain estimation errors. Therefore, we use Equation (7.47) to estimate Ψd(k, n).
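The estimator in Equation (7.47) only requires the filter weights, the estimated input PSD matrix, and the diffuse coherence model. A minimal sketch:

```python
import numpy as np

def diffuse_psd_estimate(h, Phi, Gamma_d):
    """Diffuse power estimate (cf. Eq. 7.47): output power of a direct-sound-nulling
    filter, normalized by its expected diffuse pick-up. Biased upwards by the
    residual noise at the filter output."""
    num = np.real(h.conj() @ Phi @ h)       # total output power of the filter
    den = np.real(h.conj() @ Gamma_d @ h)   # expected diffuse pick-up
    return num / den
```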

Diffuse PSD Estimation Using a Quadratically Constrained Spatial Filter

This subsection discusses a second approach to estimating the diffuse PSD Ψd(k, n), from which we can compute the diffuse sound PSD matrix with Equation (7.12). As in the previous subsection, we use a spatial filter that suppresses the L plane waves and captures the diffuse sound. In contrast to the previous subsection, we consider a quadratically constrained spatial filter which maximizes the output DNR. This is equivalent to minimizing the error term in Equation (7.46). The proposed spatial filter was published by Thiergart et al. (2014b); a summary is provided in the following.

To compute the quadratically constrained filter, we minimize the noise at the output of the filter,

(7.48)   h(k, n) = argmin_h  h^H Φn(k, n) h,

subject to the constraints

(7.49a)   h^H(k, n) A(k, n) A^H(k, n) h(k, n) = 0,
(7.49b)   h^H(k, n) Γd(k) h(k, n) = a.

The constraint in Equation (7.49a) ensures that the power of the L plane waves is zero at the filter output. Note that a weight vector h, which satisfies Equation (7.49a), also satisfies al^H(k, n) h = 0 for all l, which means that each individual plane wave is canceled out. With the constraint in Equation (7.49b) we capture the diffuse sound power with a specific factor a. The factor a is necessarily real and positive, and ensures non-zero weights h(k, n). For any a > 0, the error in Equation (7.46) is minimized, which is equivalent to maximizing the output DNR, subject to Equation (7.49a).

To compute the filter weights h(k, n), we first consider the M × M matrix in Equation (7.49a), which is Hermitian with rank L and thus has L non-zero real positive eigenvalues, as well as N = M − L zero eigenvalues. We consider the N eigenvectors corresponding to the N zero eigenvalues. Any linear combination of these vectors can be used as weight vector h which would satisfy Equation (7.49a), that is,

(7.50)   h(k, n) = U(k, n) c(k, n),

where U(k, n) is a matrix containing the N eigenvectors and c(k, n) is a vector of length N containing the (complex) weights for the linear combination. The optimal c(k, n), which yields the weights in Equation (7.50) that minimize the stationary noise, is denoted by copt(k, n) and can be found by inserting Equation (7.50) into Equation (7.48):

(7.51)   copt(k, n) = argmin_c  c^H U^H(k, n) Φn(k, n) U(k, n) c,

subject to Equation (7.49b). The cost function to be minimized is now

(7.52)   J(c) = c^H U^H(k, n) Φn(k, n) U(k, n) c + η [a − c^H U^H(k, n) Γd(k) U(k, n) c],

where η is the Lagrange multiplier. Setting the complex partial derivative of J(c) with respect to c* to zero, we obtain

(7.53)   D(k, n) copt(k, n) = λ E(k, n) copt(k, n),

where D(k, n) = U^H(k, n) Γd(k) U(k, n), E(k, n) = U^H(k, n) Φn(k, n) U(k, n), and λ = 1/η. This is a generalized eigenvalue problem, and copt(k, n) is the generalized eigenvector of D and E corresponding to the largest eigenvalue. From this copt(k, n), the weights h(k, n) are found with Equation (7.50).

To finally estimate the diffuse PSD Ψd(k, n), we use the filter h(k, n) in Equation (7.47) similarly to the previous subsection. If the output DNR is high, then the error term Ed(k, n) in Equation (7.45) becomes small compared to Ψd(k, n), and Ψ̂d(k, n) in Equation (7.47) represents an accurate estimate of Ψd(k, n). Since the output DNR is maximized by the filter h(k, n), the error term is minimized and, thus, the filter is optimal for estimating the diffuse sound power with Equation (7.47).
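The following sketch illustrates one numerical realization of these steps: the null space is taken from the eigendecomposition of the rank-L matrix A A^H (assumed here as the constraint matrix; it has the same null space as any formulation that cancels the L waves), and the weights within this null space are obtained from a generalized eigenvalue problem that maximizes the output DNR. All names and toy values are assumptions:

    import numpy as np
    from scipy.linalg import eigh

    def dnr_optimal_null_filter(A, gamma_d, phi_n):
        """Cancel all plane waves (columns of A) and maximize the output diffuse-to-noise ratio."""
        M, L = A.shape
        _, eigvec = np.linalg.eigh(A @ A.conj().T)    # Hermitian, rank L; eigenvalues in ascending order
        U = eigvec[:, : M - L]                        # null-space basis (the N = M - L zero eigenvalues)
        D = U.conj().T @ gamma_d @ U                  # diffuse coherence seen through the null space
        E = U.conj().T @ phi_n @ U                    # noise PSD seen through the null space
        _, C = eigh(D, E)                             # generalized eigenvalue problem D c = lambda E c
        h = U @ C[:, -1]                              # eigenvector of the largest eigenvalue
        return h / np.sqrt(np.real(h.conj() @ gamma_d @ h))   # scale so that h^H Gamma_d h = 1

    # Toy quantities as in the previous sketch: M = 4 microphones, L = 2 plane waves
    M, k_wave = 4, 2 * np.pi * 1000.0 / 343.0
    pos = np.arange(M) * 0.04
    A = np.exp(-1j * k_wave * np.outer(pos, np.cos(np.deg2rad([40.0, 120.0]))))
    gamma_d = np.sinc(k_wave * np.abs(pos[:, None] - pos[None, :]) / np.pi)
    phi_n = 0.01 * np.eye(M)
    phi_x = A @ np.diag([1.0, 0.5]) @ A.conj().T + 0.2 * gamma_d + phi_n
    h = dnr_optimal_null_filter(A, gamma_d, phi_n)
    print(np.abs(h.conj() @ A))                       # close to zero: both plane waves are canceled
    print(np.real(h.conj() @ phi_x @ h) / np.real(h.conj() @ gamma_d @ h))   # diffuse PSD estimate (true value 0.2); the noise bias is no larger than with the previous filter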

7.5.6 Signal PSD Estimation in Multi-Wave Scenarios

This section discusses the estimation of the signal PSDs Ψs, 1…L(k, n) of the L mutually uncorrelated plane waves in a multi-wave scenario.

Minimum Variance Estimate

The minimum variance approach (Capon, 1969) for estimating the signal PSDs is well known. This approach uses a spatial filter which is applied to the microphone signals x(k, n) and extracts the signal of the lth plane wave. The power of this signal then represents an estimate of Ψs, l(k, n). Usually, the spatial filter extracts the lth plane wave with unit gain while minimizing the power of the L − 1 remaining plane waves plus the diffuse sound and noise. Alternatively, we can use a filter that extracts the lth plane wave and places spatial nulls towards the L − 1 remaining plane waves while minimizing, for example, the noise plus diffuse sound. The corresponding filter is given by the lth column of the filter matrix HsLCMV(k, n) discussed in Section 7.4.1. The drawback of using such filters for estimating the PSDs is the overestimation in the presence of noise and diffuse sound, since the filters can minimize only the power of the noise and diffuse sound, and this minimization is strongly limited when only a few microphones are used.
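A brief sketch of this idea (with assumed names and the same kind of toy model as in the previous sketches): the lth column of a filter matrix with unit gain on wave l and nulls on the other waves yields an output power that estimates Ψs, l(k, n), biased upwards by the residual diffuse sound and noise:

    import numpy as np

    def wave_psd_estimates(A, phi_x, phi_u):
        """Per-wave output powers of filters with unit gain on wave l, nulls on the others,
        and minimum residual power with respect to phi_u (diffuse plus noise)."""
        phi_u_inv = np.linalg.inv(phi_u)
        H = phi_u_inv @ A @ np.linalg.inv(A.conj().T @ phi_u_inv @ A)   # one filter per column
        return np.real(np.einsum('ml,mk,kl->l', H.conj(), phi_x, H))    # h_l^H Phi_x h_l for each l

    M, k_wave = 4, 2 * np.pi * 1000.0 / 343.0
    pos = np.arange(M) * 0.04
    A = np.exp(-1j * k_wave * np.outer(pos, np.cos(np.deg2rad([40.0, 120.0]))))
    gamma_d = np.sinc(k_wave * np.abs(pos[:, None] - pos[None, :]) / np.pi)
    phi_u = 0.2 * gamma_d + 0.01 * np.eye(M)          # diffuse plus noise PSD matrix
    phi_x = A @ np.diag([1.0, 0.5]) @ A.conj().T + phi_u
    print(wave_psd_estimates(A, phi_x, phi_u))        # overestimates of the true PSDs [1.0, 0.5]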

Minimum Mean Square Error Estimate

For the signal model in Section 7.3 it is straightforward to derive an estimator for the signal PSDs Ψs, 1…L(k, n) that is optimal in the least squares (LS) sense. This approach was published by Thiergart et al. (2014b). The approach requires the estimation of the microphone input PSD matrix, diffuse PSD matrix, and noise PSD matrix in a pre-processing step. To determine the signal PSDs, we compute

(7.54)   Φ̂s(k, n) = Φ̂x(k, n) − Φ̂d(k, n) − Φ̂n(k, n),

where Φ̂x(k, n) is an estimate of the microphone input PSD matrix (see Section 7.5.3) and Φ̂d(k, n) + Φ̂n(k, n) is the estimated input diffuse-plus-noise PSD matrix (see Sections 7.5.4 and 7.5.5). It follows from Equation (7.8) that Φ̂s(k, n) is an estimate of the direct sound PSD matrix, that is,

(7.55)   Φ̂s(k, n) = A(k, n) Ψs(k, n) A^H(k, n) + ε(k, n),

where ε(k, n) is the estimation error. Since the L plane wave signals are mutually uncorrelated, Ψs(k, n) = diag{Ψs, 1(k, n), …, Ψs, L(k, n)} is a diagonal matrix. Therefore, Equation (7.4b) can be written as

(7.56)   A(k, n) Ψs(k, n) A^H(k, n) = Σl Ψs, l(k, n) al(k, n) al^H(k, n).

We estimate the signal PSDs via an LS approach which minimizes the error ε(k, n):

(7.57)   ψ̂(k, n) = argmin_ψ ‖ vec{Φ̂s(k, n)} − B(k, n) ψ ‖²,

where ψ(k, n) = [Ψs, 1(k, n), …, Ψs, L(k, n)]^T and B(k, n) = [vec{a1(k, n) a1^H(k, n)}, …, vec{aL(k, n) aL^H(k, n)}]. The vector operator vec{·} yields the columns of a matrix stacked into one column vector. The solution to the minimization problem of Equation (7.57) is given by

(7.58)   ψ̂(k, n) = [B^H(k, n) B(k, n)]^(−1) B^H(k, n) vec{Φ̂s(k, n)}.
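Under the notation reconstructed above (an assumption about the original symbols), the LS estimate can be sketched as follows; with exact PSD matrices the true wave powers are recovered:

    import numpy as np

    def ls_wave_psds(A, phi_x_est, phi_d_est, phi_n_est):
        """Least-squares estimate of the wave PSDs from the difference matrix
        Phi_x - Phi_d - Phi_n, following the structure of Equations (7.54)-(7.58)."""
        phi_s = phi_x_est - phi_d_est - phi_n_est                          # cf. Equation (7.54)
        B = np.column_stack([np.outer(a, a.conj()).ravel(order='F') for a in A.T])
        psd, *_ = np.linalg.lstsq(B, phi_s.ravel(order='F'), rcond=None)   # cf. Equations (7.57)-(7.58)
        return np.real(psd)

    M, k_wave = 4, 2 * np.pi * 1000.0 / 343.0
    pos = np.arange(M) * 0.04
    A = np.exp(-1j * k_wave * np.outer(pos, np.cos(np.deg2rad([40.0, 120.0]))))
    gamma_d = np.sinc(k_wave * np.abs(pos[:, None] - pos[None, :]) / np.pi)
    phi_d, phi_n = 0.2 * gamma_d, 0.01 * np.eye(M)
    phi_x = A @ np.diag([1.0, 0.5]) @ A.conj().T + phi_d + phi_n
    print(ls_wave_psds(A, phi_x, phi_d, phi_n))        # recovers the true PSDs [1.0, 0.5]

In contrast to the minimum variance estimate, the diffuse and noise contributions are subtracted before the fit, so the estimate is not systematically biased upwards when the PSD matrix estimates are accurate.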

7.6 Application to Spatial Sound Reproduction

As discussed in Section 7.2.2, the compact description of the sound field in terms of direct sounds, diffuse sound, and sound field parameters, as shown in Figure 7.1, enables a huge variety of applications. In the following, we use the concepts presented in this work for the application of spatial sound recording and reproduction.

In spatial sound reproduction, we aim to capture the spatial sound on the recording side and reproduce it on the application side such that the listener perceives the sound with the original spatial impression. An example is depicted in Figure 7.9. Here, the spatial sound of two sources is recorded at the recording side, transmitted over a network, and then reproduced at the reproduction side using a loudspeaker setup that is unknown at the recording side. Clearly, this scenario requires an efficient processing scheme where only a few signals need to be transmitted while still being able to reproduce the sound on arbitrary reproduction setups.


Figure 7.9 Block scheme of the parametric processing of spatial sound.

7.6.1 State of the Art

The most popular approaches to efficient and flexible sound recording and reproduction are represented by DirAC (Pulkki, 2007) and HARPEX (Berge and Barrett, 2010a,b), which were derived for the so-called B-format microphone. As mentioned before, both approaches are based on a parametric sound field model. While DirAC assumes that the sound field for each time and frequency is composed of a single plane wave (direct sound) plus diffuse sound, HARPEX assumes two plane waves (and no diffuse sound). Both approaches aim to recreate the relevant features for the human perception of spatial sound. For instance, DirAC assumes that the interaural level difference (ILD) is correctly perceived when the direct sound is reproduced from the correct DOA, while a realistic rendering of the diffuse sound leads to a correct perception of the interaural coherence (IC). Both DirAC and HARPEX provide high flexibility, that is, the signals for any common loudspeaker setup – as well as the signals for binaural reproduction (Laitinen and Pulkki, 2009) – can be derived from the parametric description. The compact representation of the spatial sound enables efficient transmission and storage of the recorded audio scene, as shown, for example, by Herre et al. (2011).

Unfortunately, DirAC and HARPEX suffer from specific drawbacks: In DirAC, even though multiple microphones are used to estimate the required parameters, only single-channel filters are applied to extract the direct sound and diffuse sound, which limits the accuracy of the extracted sound components. Moreover, the single-wave assumption can easily be violated in practice, which impairs the direct sound reproduction, as shown by Thiergart and Habets (2012). Some of these drawbacks can be reduced with recently proposed algorithmic improvements, such as transient detection (Laitinen et al., 2011) or virtual microphone processing (Vilkamo et al., 2009). However, these improvements limit the flexibility and efficiency; for example, virtual microphone processing requires knowing the sound reproduction setup in advance or transmitting all the microphone signals. The higher-order extension to DirAC in Politis et al. (2015) reduces the model violations, but requires higher-order spherical harmonics as input signals, which can be obtained from signals measured with a spherical microphone array. HARPEX suffers from the drawback that no diffuse sound is considered in the model, which results in model violations when capturing reverberant or ambient sounds.

7.6.2 Spatial Sound Reproduction Based on Informed Spatial Filtering

We can use the parametric description of the sound field in Section 7.3 and the informed multi-channel filters in Section 7.4 to achieve efficient and flexible sound acquisition and reproduction. By assuming multiple plane waves per time and frequency in the signal model (that is, L > 1), and by extracting the signal components with the informed multi-channel filters that make use of all available microphone signals, some of the drawbacks of DirAC and HARPEX can be overcome.

The target signal Y(k, n) we wish to obtain is given in Equation (7.2). This signal is computed at the application side in Figure 7.9 individually for each loudspeaker. In spatial sound reproduction, the direct sounds Ps, l(k, n, r1) and the diffuse sound Pd(k, n, r1) represent the desired signals. Similarly to DirAC, the direct responses Ds, l(k, n) are selected from panning functions that depend on the DOA of the direct sound and on the loudspeaker position. Figure 7.10 (right) shows the panning functions for a 5.1 loudspeaker setup, which were defined using the VBAP scheme (Pulkki, 1997). The target diffuse response Dd(k, n) is set to 1/√N for all N loudspeakers to reproduce the diffuse sound with the original power. Note that the target diffuse signals Yd(k, n) in Equation (7.2) for the different loudspeakers are decorrelated to ensure correct IC during sound reproduction (Pulkki, 2007).


Figure 7.10 Left: Loudspeaker setup for a 5.1 configuration (L: left, R: right, C: center, LS: left surround, RS: right surround). Right: Panning functions for a 5.1 surround sound loudspeaker setup using the VBAP panning scheme (Pulkki, 1997). Source: Thiergart 2015. Reproduced with permission of Oliver Thiergart.

Computing Y(k, n) in Equation (7.2) requires the direct sounds Ps, l(k, n, r1) and the diffuse sound Pd(k, n, r1). These signals can be estimated at the recording side in Figure 7.9 using the spatial filters in Section 7.4. The parametric information required to compute these filters can be estimated with the approaches presented in Section 7.5. Once these signals are obtained at the recording side, they can be transmitted to the application side (together with the DOAs of the direct sounds), and Y(k, n) can be computed for the desired loudspeaker setup.
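To illustrate the synthesis step on the application side, the following sketch computes one time–frequency tile of the loudspeaker signals from already extracted direct and diffuse sounds. The crude nearest-pair panning function (a stand-in for the VBAP functions of Figure 7.10), the random-phase decorrelation, and all names are illustrative assumptions, not the chapter's exact processing:

    import numpy as np

    def panning_gains(doa_deg, ls_deg):
        """Crude direction-dependent panning: each DOA is distributed over the two
        nearest loudspeakers with power-normalized gains."""
        ls_deg = np.asarray(ls_deg, dtype=float)
        diff = np.abs(((ls_deg - doa_deg + 180.0) % 360.0) - 180.0)   # wrapped angular distances
        gains = np.zeros(len(ls_deg))
        nearest = np.argsort(diff)[:2]
        w = 1.0 / (diff[nearest] + 1e-9)
        gains[nearest] = w / np.linalg.norm(w)
        return gains

    def synthesize_tile(p_s, doas_deg, p_d, ls_deg, rng):
        """One tile of the N loudspeaker signals: panned direct sounds plus decorrelated diffuse sound."""
        N = len(ls_deg)
        y = np.zeros(N, dtype=complex)
        for p, doa in zip(p_s, doas_deg):                       # direct responses D_s,l from panning
            y += panning_gains(doa, ls_deg) * p
        phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N))  # toy decorrelation of the diffuse sound
        return y + (1.0 / np.sqrt(N)) * phases * p_d            # diffuse response 1/sqrt(N)

    ls_deg = [30, -30, 0, 110, -110]    # 5.1 setup without LFE: L, R, C, LS, RS
    rng = np.random.default_rng(2)
    y = synthesize_tile(p_s=[1.0 + 0.2j, 0.3], doas_deg=[10.0, -75.0], p_d=0.4, ls_deg=ls_deg, rng=rng)
    print(np.round(np.abs(y), 3))

In an actual system the diffuse decorrelation is performed with dedicated decorrelators rather than random phases, and the panning gains follow the VBAP functions shown in Figure 7.10.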

Example: Direct Sound Extraction

In the following, we discuss an example of direct sound extraction using ISF. We carried out measurements in a reverberant room. An NLA with M = 6 omnidirectional microphones (microphone spacings 12.8 cm–6.4 cm–3.2 cm–6.4 cm–12.8 cm) was placed in the room center. The spatial aliasing frequency was fmax = 5.3 kHz. Five loudspeakers were located at the positions A–E at a distance of 1.7 m from the array center. The corresponding angles are depicted in Figure 7.11(a). Male speech was emitted from the center loudspeaker (denoted by the black square) and female speech was emitted from the other loudspeakers (one loudspeaker at a time, randomly changing). The dashed lines in Figure 7.11(b) indicate which loudspeaker was active when. This represents a dynamic and rather challenging scenario in which one source jumps between different positions while another source is simultaneously active from a fixed position. The sound was sampled at fs = 16 kHz and transformed into the time–frequency domain using a 512-point STFT with 50% overlap.


Figure 7.11 Left: Example target response functions to extract direct sounds. Right: Measured input long-term spatial power density.

In this example, we aim to extract the direct sounds with the direct sound response function depicted in Figure 7.11(a), that is, we aim to extract the male speaker from direction A (desired source) and attenuate the female speaker from the other directions (undesired interfering source). In source separation applications, the depicted response function could represent an arbitrary spatial window that extracts sounds from the center direction and attenuates sounds from other directions. In spatial sound reproduction applications, the depicted response function could be a loudspeaker panning window, for example for the center loudspeaker.

Parameter Estimation

The parameters required to compute the ISF were estimated as follows: The microphone input PSD matrix was estimated via the recursive averaging in Section 7.5.3 (τ = 40 ms). The noise PSD matrix was computed from the signal beginning where no source was active. The number of sources L was estimated based on the eigenvalue ratios using the approach in Markovich et al. (2009) – see Section 7.5.1. The estimate of L was limited to a maximum of Lmax = 3. The DOAs of the L direct sounds were estimated using Root MUSIC for non-linear arrays (Mhamdi and Samet, 2011). The array steering matrix A(k, n) was determined from the estimated DOAs using Equation (7.6). The diffuse PSD matrix and diffuse power were estimated with the approach in Section 7.5.5 assuming a spherically isotropic diffuse field. The signal PSDs Ψs, 1…L(k, n) were obtained with the MMSE approach in Section 7.5.6.

Filter Computation

The L direct sounds Ps, l(k, n, r1) were extracted from the microphone signals x(k, n) with Equation (7.19). We study the following filters:

  • LCMV: The LCMV filter HsLCMV(k, n) discussed in Section 7.4.1, computed with Equation (7.22) using the corresponding parameter choice.
  • MCW: The multi-channel Wiener filter discussed in Section 7.4.1, likewise computed with Equation (7.22).
  • PMCW: The parametric multi-channel Wiener filter discussed in Section 7.4.1 and computed with Equation (7.22). The elements of the control parameters were obtained as discussed in Section 7.4.1. The logarithmic SDNR ξl(k, n) was computed with Equation (7.24). The sigmoid function used is shown in Figure 7.7 (sigmoid 1).

For comparison, we computed the following fixed spatial filters, which did not exploit the instantaneous DOA information and multi-wave model:

  • WNG: Maximum WNG filter. The filter is identical to the delay-and-sum filter (Doclo and Moonen, 2003) where the look direction corresponds to the direction of the (fixed) desired source (center loudspeaker in Figure 7.11(a)).
  • SD: Robust superdirective beamformer (Cox et al., 1987). This filter possesses a lower bound on the WNG, which was set to −9 dB.

Finally, the target direct signal Ys(k, n) was computed with Equation (7.18) from the estimated direct sounds. The target direct responses Ds, l(k, n) were selected from the direct sound response function in Figure 7.11(a) using the estimated DOAs (for WNG and SD we used the corresponding look directions).
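The equations defining the informed filters are not reproduced in this section; as a rough sketch of the LCMV variant only (using the standard LCMV form as a stand-in for Equation (7.22), with assumed names and a noiseless toy snapshot), the direct sounds can be extracted and combined with the target responses as follows:

    import numpy as np

    def lcmv_direct_extraction(x, A, phi_u, d_responses):
        """Extract L direct sounds with unit-gain/null constraints (standard LCMV form)
        and combine them with target direct responses into one output tile Ys."""
        phi_u_inv = np.linalg.inv(phi_u)
        H = phi_u_inv @ A @ np.linalg.inv(A.conj().T @ phi_u_inv @ A)   # one filter per wave
        p_hat = H.conj().T @ x                                          # estimated direct sounds
        return np.asarray(d_responses) @ p_hat                          # Ys = sum_l D_s,l * P_s,l

    M, k_wave = 4, 2 * np.pi * 1000.0 / 343.0
    pos = np.arange(M) * 0.04
    A = np.exp(-1j * k_wave * np.outer(pos, np.cos(np.deg2rad([40.0, 120.0]))))
    gamma_d = np.sinc(k_wave * np.abs(pos[:, None] - pos[None, :]) / np.pi)
    phi_u = 0.2 * gamma_d + 0.01 * np.eye(M)          # diffuse-plus-noise PSD matrix
    x = A @ np.array([1.0 + 0.5j, -0.3])              # noiseless two-wave snapshot for illustration
    print(lcmv_direct_extraction(x, A, phi_u, d_responses=[1.0, 0.1]))   # approx. 1*(1+0.5j) + 0.1*(-0.3)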

Results

To study the performance of the parameter estimation, we consider the so-called input long-term spatial power density (LTSPD). This measure allows us to jointly evaluate the performance of the DOA and signal PSD estimation, that is, the estimation of Ψs, 1…L(k, n). The input LTSPD characterizes how much direct power was localized in which direction (Thiergart et al., 2014b). The input LTSPD is depicted in Figure 7.11(b). As mentioned before, the dashed lines indicate the loudspeaker positions and when each loudspeaker was active. The input LTSPD shows that most estimated direct power was localized near the direction of an active loudspeaker. This is true even during the double-talk periods. Nevertheless, some power was also localized in spatial regions where no source was active. This localized power resulted from the estimation uncertainties of the DOAs and direct PSDs, mostly when the sound field was more diffuse.

A multiple stimuli with hidden reference and anchor (MUSHRA) listening test with the filters mentioned before was carried out to verify the perceptual quality of the presented approaches. The listening test results were published in Thiergart et al. (2014b). Note that the abrupt jumps of the undesired speaker during double talk with the desired speaker represent a challenging scenario for spatial filters. The participants listened to the output signals of the filters (Ys(k, n), after transforming back to the time domain), which were reproduced over headphones. The MUSHRA test was repeated five times, and in each test the participants evaluated a different effect of the filters. The reference signal was the direct sound of the desired speaker plus the direct sound of the undesired speaker attenuated by 21 dB, both signals obtained from the windowed impulse responses. The lower anchor was an unprocessed and low-pass filtered microphone signal.

The MUSHRA scores are depicted in Figure 7.12. In the first MUSHRA test, denoted “interferer,” the listeners were asked to grade the strength of the interferer attenuation, that is, the attenuation of the undesired speaker. Note that m = 10 in Figure 7.12 means that ten listeners passed the recommended post-screening procedure. We can see that the informed filters (LCMV, MCW, and PMCW) performed significantly better than the fixed SOA filters (WNG and SD). The rather aggressive MCW filter yielded the strongest interferer attenuation. In the second test (noise), the listeners were asked to grade the microphone noise reduction performance. The informed filters clearly outperformed the SOA filters, even the WNG filter which maximizes the WNG. The PMCW filter was as good as the MCW filter, and both were significantly better than the LCMV filter. The SD filter strongly amplified the noise, and hence was graded lowest. In the third test (dereverberation), the dereverberation performance was graded. Here, the PMCW filter was significantly better than the LCMV filter, but worse than the MCW filter. In the fourth test (distortion), the listeners were asked to grade the distortion of the direct sound of the desired speaker (high grades had to be assigned if the speech distortion was low). We can see that the MCW filter yielded strongly noticeable speech distortion, while the LCMV and PMCW filters resulted in a low distortion. The SOA filters (for which the correct look direction towards the desired source was provided) yielded the lowest speech distortion. Finally, the overall quality (listener’s preference) was evaluated in the fifth test (quality). The best overall quality was provided by the PMCW and LCMV filters, which were significantly better than the MCW filter and the SOA filters. In general, the results show that the PMCW filter can provide a good trade-off between noise and diffuse sound reduction and speech distortion.


Figure 7.12 MUSHRA listening test results for direct sound extraction. The plot shows the average and 95% confidence intervals. Source: Thiergart 2014b. Reproduced with permission of IEEE.

Example: Diffuse Sound Extraction

In the following, we show an example of diffuse sound extraction using ISF. We simulated a reverberant shoebox room and an NLA with M = 6 omnidirectional microphones (microphone spacings 4 cm–2 cm–2 cm–2 cm–4 cm). Two sound sources, A and B, were located at a distance of 1.6 m from the array center at different angles (the array broadside was 0°). Source A was emitting speech while B was emitting transient castanet sound. Both sources were active simultaneously. The microphone signals were simulated with the image-source method (Allen and Berkley, 1979; Habets, 2008). Spatially white microphone noise was added to the microphone signals, resulting in a segmental signal-to-noise ratio (SegSNR) of 39 dB. The sound was sampled at fs = 16 kHz and transformed into the time–frequency domain using a 256-point STFT with 50% overlap.

Parameter Estimation

All the parameters required to compute the spatial filters were provided as accurate prior information. We assumed L = 2. The DOAs of the L direct sounds corresponded to the two loudspeaker directions. The microphone input PSD Φx, 11(k, n) was computed from the reference microphone signal X1(k, n) using the recursive temporal averaging approach in Section 7.5.3 (with τ = 60 ms, which yielded a typical temporal resolution). The diffuse sound PSD Ψd(k, n) was computed directly from the diffuse sound at the reference position. Note that this signal is not directly observable in practice. Here, the diffuse sound was made available by applying an appropriate temporal window to the simulated room impulse responses (RIRs) to separate the reverberant part (including early reflections). The same averaging was applied when computing Ψd(k, n) and Φx, 11(k, n).

Filter Computation

The diffuse sound was estimated using a single-channel filter Hd(k, n), similarly to DirAC (which serves as reference), and with the approximate average LCMV filter hdALCMV(k, n) in Section 7.4.2. In the case of the single-channel filter, the diffuse sound estimate was found with

(7.59)   P̂d(k, n, r1) = Hd(k, n) X1(k, n),

whereas in the case of the multi-channel filter, the diffuse sound estimate was obtained with Equation (7.29). The two filters were computed as follows:

  • Hd(k, n): This filter represents the square-root Wiener filter for estimating the diffuse sound. The filter was computed as the square root of Ψ̂d(k, n)/Φ̂x, 11(k, n) (a minimal sketch of this gain is given after this list).
  • hdALCMV(k, n): This filter is easy to compute in practice since only the DOAs of the direct sounds need to be known. The filter was computed with Equation (7.39). The noise PSD matrix was not required in Equation (7.34), since for the i.i.d. noise in this simulation it can be replaced by the identity matrix. The filter was normalized such that the diffuse sound power is preserved at the filter output, that is, such that hdALCMV^H(k, n) Γd(k) hdALCMV(k, n) = 1.
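For reference, the single-channel square-root Wiener gain mentioned in the first item can be sketched as a per-bin gain applied to the reference microphone spectrum; the variable names and the toy numbers are assumptions:

    import numpy as np

    def sqrt_wiener_diffuse_gain(psi_d, phi_x11, floor=1e-12):
        """Square-root Wiener gain for diffuse extraction: sqrt of the diffuse-to-total power ratio."""
        return np.sqrt(np.clip(psi_d / np.maximum(phi_x11, floor), 0.0, 1.0))

    # Toy values for one time-frequency bin: estimated diffuse PSD and input PSD at microphone 1
    psi_d, phi_x11 = 0.2, 1.5
    X1 = 0.8 - 0.4j                                    # reference microphone spectrum
    P_d_hat = sqrt_wiener_diffuse_gain(psi_d, phi_x11) * X1
    print(abs(P_d_hat) ** 2, psi_d / phi_x11 * abs(X1) ** 2)   # filtered power equals the diffuse share

Because this is a single real-valued gain per time and frequency, transient direct sounds that overlap with diffuse energy in the same bin cannot be removed, which is exactly the limitation visible in Figure 7.13(c).
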
Results

Figure 7.13 presents the effects of the single-channel filter Hd(k, n) and the multi-channel filter hdALCMV(k, n). The plot in Figure 7.13(a) shows the power of the unprocessed microphone signal X1(k, n), in which we can see the onsets of the castanets and the reverberant tails. Figure 7.13(b) shows the level difference between the true diffuse sound and X1(k, n). We can see that the onsets of the castanets were largely absent from the diffuse sound, as indicated by the high negative level differences. Figure 7.13(c) shows the level difference between the estimated diffuse sound and X1(k, n) when using the filter Hd(k, n). It can be observed that the onsets of the castanets were hardly attenuated, and hence were still present in the filtered output. Figure 7.13(d) depicts the level difference between the estimated diffuse sound and X1(k, n) when using the filter hdALCMV(k, n). We can see that this filter strongly attenuated the onsets of the castanets, as desired, and we obtained almost the same result as in Figure 7.13(b).


Figure 7.13 Effect of the different filters for extracting diffuse sound. Plots (b)–(d) show the level difference between X1(k, n) and the extracted diffuse sound. Source: Thiergart 2014. Reproduced with permission of Oliver Thiergart.

In general, Figure 7.13 demonstrates the main advantage when using multi-channel filters for diffuse sound extraction compared to single-channel filters, namely the effective attenuation of the (in this case undesired) direct sounds. This is especially true for transient direct sounds or signal onsets. For these components, single-channel filters typically cannot provide the desired attenuation. This is problematic for spatial sound reproduction for two reasons: (i) The estimated diffuse sound (and hence also the contained direct sound) is reproduced from all directions. Therefore, the direct sound contained in the diffuse estimate interferes with the desired direct sound, which is reproduced from the original direction. This impairs the localization of the desired direct sound. (ii) The diffuse sound is typically decorrelated before the sound is reproduced. Applying decorrelators to transient direct sounds yields unpleasant artifacts. This represents a major problem of SOA approaches for spatial sound reproduction, such as DirAC, which use single-channel filters for extracting the diffuse sound.

A MUSHRA listening test with eight participants was conducted to grade the perceptual quality achieved with the single-channel filter Hd(k, n) and the multi-channel filter hdALCMV(k, n). For this purpose, the diffuse sound extracted in the simulation was presented to the participants with a 5.1 loudspeaker setup. Note that, depending on the filter performance, the extracted diffuse sound contained true diffuse sound but also direct sound components that were not accurately suppressed by the filter. The true diffuse part was decorrelated and reproduced from all loudspeakers, while the direct sound components were reproduced from the center loudspeaker to make them more audible. The participants were asked to grade the overall quality of the extracted diffuse sound compared to the true diffuse sound, which represented the reference. As lower anchor, we used a low-pass filtered unprocessed microphone signal. In the listening test, we compared the output signal of the single-channel filter Hd(k, n), the output signal of the multi-channel filter hdALCMV(k, n), and an unprocessed microphone signal.

The results are depicted in Figure 7.14. In Scenario I, the two loudspeakers were emitting speech and castanet sounds, as explained before. In Scenario II, the same two loudspeakers were emitting two different speech signals at the same time. The multi-channel filter hdALCMV(k, n) is denoted by LCMV and the single-channel filter is denoted by SC. As can be seen, for both scenarios the multi-channel filter provided very good results, while the single-channel filter yielded fair results. The participants reported that for the single-channel filter, direct sound components were clearly audible. It was difficult to perceive a difference from the unprocessed signal in this dynamic scenario. For the multi-channel filter, the direct sound was very well attenuated and the filter output was very similar to the reference (true diffuse sound).


Figure 7.14 MUSHRA listening test results for the diffuse sound extraction. The plot shows the average and 95% confidence intervals. Source: Thiergart 2014. Reproduced with permission of Oliver Thiergart.

7.7 Summary

Parametric representations of spatial sound provide a complete yet compact description of the acoustic scene at the recording location independent of the microphone configuration. The parametric representation can be obtained with almost any multi-microphone configuration and enables a huge variety of applications, which can be defined and controlled freely by the user at the reproduction side. The flexible acquisition and processing of spatial sound is realized by modeling the sound field for each time and frequency as a sum of multiple plane waves and diffuse sound. In contrast to existing parametric approaches, we assume multiple plane waves per time and frequency, which enables us to accurately model complex acoustic scenes where multiple sources are active at the same time.

The direct sounds (plane waves) and the diffuse sound are extracted from the microphone signals by applying multi-channel filters. These filters are recomputed for each time and frequency using current information on the underlying parametric sound field model. The multi-channel filter for the direct sound extraction was derived using well-known spatial filtering approaches, which were applied to the proposed multi-wave signal model. This resulted in the parametric multi-channel Wiener filter that adapts to different recording conditions, for example to provide low signal distortions when the direct sounds are strong, a good dereverberation performance when the diffuse sound is prominent, or a high robustness against noise in noisy situations. To extract the diffuse sound, we derived the approximate average LCMV filter that can attenuate (direct) sounds arriving from arbitrary directions, while capturing the sound from all other directions with an almost omnidirectional directivity.

To compute the filters, it is of paramount importance to obtain accurate estimates of the required parameters. This includes the DOA of the direct sounds, but also SOS such as the powers of the direct sounds and diffuse sound. Throughout this work, we provided an overview of appropriate estimators for omnidirectional microphone configurations.

Application of the filters presented was discussed for spatial sound reproduction, where we aimed to reproduce the recorded spatial sound with the original spatial impression. We showed that classical approaches for extracting the diffuse sound, which use single-channel filters, suffer from a leakage of transient direct sounds into the extracted diffuse sound. In contrast, when using the proposed multi-channel filters for the diffuse sound extraction, we can better attenuate the direct sounds.


References

  1. Affes, S., Gazor, S., and Grenier, Y. (1996) An algorithm for multisource beamforming and multitarget tracking. IEEE Transactions on Signal Processing, 44(6), 1512–1522.
  2. Allen, J.B. and Berkley, D.A. (1979) Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, 65(4), 943–950.
  3. Avendano, C. and Jot, J.M. (2002) Ambience extraction and synthesis from stereo signals for multi-channel audio up-mix. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. II-1957–II-1960.
  4. Benesty, J., Chen, J., and Huang, Y. (2008) Microphone Array Signal Processing, Springer-Verlag, Berlin.
  5. Berge, S. and Barrett, N. (2010a) A new method for B-format to binaural transcoding. 40th International Audio Engineering Society Conference: Spatial Audio, Tokyo, Japan.
  6. Berge, S. and Barrett, N. (2010b) High angular resolution planewave expansion. 2nd International Symposium on Ambisonics and Spherical Acoustics.
  7. Bitzer, J. and Simmer, K.U. (2001) Superdirective microphone arrays, in Microphone Arrays: Signal Processing Techniques and Applications (ed. Brandstein, M. and Ward, D.), Springer, Berlin, chapter 2, pp. 19–38.
  8. Braun, S., Thiergart, O., and Habets, E.A.P. (2014) Automatic spatial gain control for an informed spatial filter. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
  9. Capon, J. (1969) High-resolution frequency–wavenumber spectrum analysis. Proceedings of the IEEE, 57(8), 1408–1418.
  10. Cox, H., Zeskind, R., and Kooij, T. (1986) Practical supergain. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(3), 393–398.
  11. Cox, H., Zeskind, R., and Owen, M. (1987) Robust adaptive beamforming. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(10), 1365–1376.
  12. Del Galdo, G., Taseska, M., Thiergart, O., Ahonen, J., and Pulkki, V. (2012) The diffuse sound field in energetic analysis. Journal of the Acoustical Society of America, 131(3), 2141–2151.
  13. Doclo, S. and Moonen, M. (2002) GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Transactions on Signal Processing, 50(9), 2230–2244.
  14. Doclo, S. and Moonen, M. (2003) Design of far-field and near-field broadband beamformers using eigenfilters. Signal Processing, 83(12), 2641–2673.
  15. Doclo, S. and Moonen, M. (2005) On the output SNR of the speech-distortion weighted multichannel Wiener filter. IEEE Signal Processing Letters, 12(12), 809–811.
  16. Doclo, S. and Moonen, M. (2007) Superdirective beamforming robust against microphone mismatch. IEEE Transactions on Audio, Speech, and Language Processing, 15(2), 617–631.
  17. Doclo, S., Spriet, A., Wouters, J., and Moonen, M. (2005) Speech distortion weighted multichannel Wiener filtering techniques for noise reduction, in Speech Enhancement (ed. Benesty, J., Makino, S., and Chen, J.), Springer, Berlin, chapter 9, pp. 199–228.
  18. Elko, G., West, J.E., and Kubli, R. (1998) An adaptive close-talking microphone array. Conference Record of the Thirty-Second Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 404–408.
  19. Elko, G.W. (2000) Superdirectional microphone arrays, In Acoustic Signal Processing for Telecommunication (ed. Gay, S.L. and Benesty, J.), Kluwer Academic Publishers, Dordrecht, chapter 10, pp. 181–237.
  20. Elko, G.W. (2001) Spatial coherence functions for differential microphones in isotropic noise fields, In Microphone Arrays: Signal Processing Techniques and Applications (ed. Brandstein, M. and Ward, D.), Springer, Berlin, chapter 4, pp. 61–85.
  21. Faller, C. (2006) Multiple-loudspeaker playback of stereo signals. Journal of the Audio Engineering Society, 54(11), 1051–1064.
  22. Frost, III, O.L. (1972) An algorithm for linearly constrained adaptive array processing. Proceedings of the IEEE, 60(8), 926–935.
  23. Gannot, S. and Cohen, I. (2008) Adaptive beamforming and postfiltering, in Springer Handbook of Speech Processing (ed. Benesty, J., Sondhi, M.M., and Huang, Y.), Springer-Verlag, Berlin, chapter 47.
  24. Gannot, S., Burshtein, D., and Weinstein, E. (2001) Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing, 49(8), 1614–1626.
  25. Gerkmann, T. and Hendriks, R. (2012) Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Transactions on Audio, Speech, and Language Processing, 20(4), 1383–1393.
  26. Goossens, R. and Rogier, H. (2009) Unitary spherical ESPRIT: 2-D angle estimation with spherical arrays for scalar fields. IET Signal Processing, 3(3), 221–231.
  27. Haardt, M. and Nossek, J. (1995) Unitary ESPRIT: How to obtain increased estimation accuracy with a reduced computational burden. IEEE Transactions on Signal Processing, 43(5), 1232–1242.
  28. Habets, E.A.P. (2008) Room impulse response generator. https://www.audiolabs-erlangen.de/fau/professor/habets/software/rir-generator/.
  29. Habets, E.A.P. (2010) A distortionless subband beamformer for noise reduction in reverberant environments. Proceedings of the International Workshop on Acoustic Echo Control (IWAENC), Tel Aviv, Israel.
  30. Herbordt, W. and Kellermann, W. (2003) Adaptive beamforming for audio signal acquisition, in Adaptive Signal Processing: Applications to Real-World Problems (ed. Benesty, J. and Huang, Y.), Springer-Verlag, Berlin, chapter 6, pp. 155–194.
  31. Herre, J., Falch, C., Mahne, D., Del Galdo, G., Kallinger, M., and Thiergart, O. (2011) Interactive teleconferencing combining spatial audio object coding and DirAC technology. Journal of the Audio Engineering Society, 59(12), 924–935.
  32. Hoshuyama, O., Sugiyama, A., and Hirano, A. (1999) A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Transactions on Signal Processing, 47(10), 2677–2684.
  33. Irwan, R. and Aarts, R.M. (2002) Two-to-five channel sound processing. Journal of the Audio Engineering Society, 50(11), 914–926.
  34. Jacobsen, F. and Roisin, T. (2000) The coherence of reverberant sound fields. Journal of the Acoustical Society of America, 108(1), 204–210.
  35. Jiang, J.S. and Ingram, M.A. (2004) Robust detection of number of sources using the transformed rotational matrix. IEEE Wireless Communications and Networking Conference, vol. 1, pp. 501–506. IEEE.
  36. Krueger, A., Warsitz, E., and Haeb-Umbach, R. (2011) Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation. IEEE Transactions on Audio, Speech, and Language Processing, 19(1), 206–219.
  37. Laitinen, M.V. and Pulkki, V. (2009) Binaural reproduction for directional audio coding. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 337–340.
  38. Laitinen, M.V., Kuech, F., Disch, S., and Pulkki, V. (2011) Reproducing applause-type signals with directional audio coding. Journal of the Audio Engineering Society, 59(1/2), 29–43.
  39. Markovich-Golan, S., Gannot, S., and Cohen, I. (2012) A weighted multichannel Wiener filter for multiple sources scenarios. IEEE 27th Convention of Electrical Electronics Engineers in Israel (IEEEI), pp. 1–5.
  40. Markovich, S., Gannot, S., and Cohen, I. (2009) Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(6), 1071–1086.
  41. Merimaa, J., Goodwin, M.M., and Jot, J.M. (2007) Correlation-based ambience extraction from stereo recordings. Audio Engineering Society Convention 123. Audio Engineering Society.
  42. Mhamdi, A. and Samet, A. (2011) Direction of arrival estimation for nonuniform linear antenna. International Conference on Communications, Computing and Control Applications (CCCA), pp. 1–5.
  43. Naylor, P.A. and Gaubitch, N.D. (2010) Speech Dereverberation. Springer, New York.
  44. Nélisse, H. and Nicolas, J. (1997) Characterization of a diffuse field in a reverberant room. Journal of the Acoustical Society of America, 101(6), 3517–3524.
  45. Nordholm, S., Claesson, I., and Bengtsson, B. (1993) Adaptive array noise suppression of handsfree speaker input in cars. IEEE Transactions on Vehicular Technology, 42(4), 514–518.
  46. Politis, A., Vilkamo, J., and Pulkki, V. (2015) Sector-based parametric sound field reproduction in the spherical harmonic domain. IEEE Journal of Selected Topics in Signal Processing, 9(5), 852–866.
  47. Pulkki, V. (1997) Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 45(6), 456–466.
  48. Pulkki, V. (2007) Spatial sound reproduction with directional audio coding. Journal of the Audio Engineering Society, 55(6), 503–516.
  49. Rao, B. and Hari, K. (1988) Performance analysis of Root-MUSIC. Twenty-Second Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 578–582.
  50. Reuven, G., Gannot, S., and Cohen, I. (2008) Dual-source transfer-function generalized sidelobe canceller. IEEE Transactions on Audio, Speech, and Language Processing, 16(4), 711–727.
  51. Roy, R. and Kailath, T. (1989) ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7), 984–995.
  52. Souden, M., Chen, J., Benesty, J., and Affes, S. (2011) An integrated solution for online multichannel noise tracking and reduction. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2159–2169.
  53. Spriet, A., Moonen, M., and Wouters, J. (2004) Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction. Signal Processing, 84(12), 2367–2387.
  54. Talmon, R., Cohen, I., and Gannot, S. (2009) Convolutive transfer function generalized sidelobe canceler. IEEE Transactions on Audio, Speech, and Language Processing, 17(7), 1420–1434.
  55. Teutsch, H. and Elko, G. (2001) First- and second-order adaptive differential microphone arrays. Proceedings of the 7th International Workshop on Acoustic Echo and Noise Control (IWAENC 2001).
  56. Thiergart, O. (2015) Flexible Multi-Microphone Acquisition and Processing of Spatial Sound Using Parametric Sound Field Representations. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.
  57. Thiergart, O. and Habets, E.A.P. (2012) Sound field model violations in parametric spatial sound processing. Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), Aachen, Germany.
  58. Thiergart, O. and Habets, E.A.P. (2013) An informed LCMV filter based on multiple instantaneous direction-of-arrival estimates. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 659–663.
  59. Thiergart, O. and Habets, E.A.P. (2014) Extracting reverberant sound using a linearly constrained minimum variance spatial filter. IEEE Signal Processing Letters, 21(5), 630–634.
  60. Thiergart, O., Kowalczyk, K., and Habets, E.A.P. (2014a) An acoustical zoom based on informed spatial filtering. Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), Antibes, France.
  61. Thiergart, O., Taseska, M., and Habets, E.A.P. (2013) An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates. 21st European Signal Processing Conference (EUSIPCO 2013).
  62. Thiergart, O., Taseska, M., and Habets, E.A.P. (2014b) An informed parametric spatial filter based on instantaneous direction-of-arrival estimates. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 2182–2196.
  63. Uhle, C., Walther, A., Hellmuth, O., and Herre, J. (2007) Ambience separation from mono recordings using non-negative matrix factorization. 30th International Audio Engineering Society Conference: Intelligent Audio Environments.
  64. Usher, J. and Benesty, J. (2007) Enhancement of spatial sound quality: A new reverberation-extraction audio upmixer. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2141–2150.
  65. Van Trees, H.L. (2002) Detection, Estimation, and Modulation Theory: Part IV: Optimum Array Processing, vol. 1, John Wiley & Sons, Chichester.
  66. Van Veen, B.D. and Buckley, K.M. (1988) Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine, 5(2), 4–24.
  67. Vilkamo, J., Lokki, T., and Pulkki, V. (2009) Directional audio coding: Virtual microphone-based synthesis and subjective evaluation. Journal of the Audio Engineering Society, 57(9), 709–724.
  68. Wax, M. and Kailath, T. (1985) Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2), 387–392.
  69. Yilmaz, O. and Rickard, S. (2004) Blind separation of speech mixtures via time–frequency masking. IEEE Transactions on Signal Processing, 52(7), 1830–1847.
  70. Zoltowski, M. and Mathews, C.P. (1992) Direction finding with uniform circular arrays via phase mode excitation and beamspace Root-MUSIC. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 245–248.
  71. Zoltowski, M., Kautz, G., and Silverstein, S. (1993) Beamspace Root-MUSIC. IEEE Transactions on Signal Processing, 41(1), 344–364.