Chapter 8

Object-Based Audio

Nicolas Tsingos

Introduction

Audio content production or interactive rendering is traditionally based on the manipulation of sound objects. We define sound objects as audio waveforms (audio elements) and associated parameters (metadata) that embody the artistic intent by specifying the translation from the audio elements to loudspeaker signals. Sound objects generally use monophonic audio tracks that have been recorded or synthesized through a process of sound design. These sound elements can be further manipulated, e.g., in a digital audio workstation (DAW), so as to be positioned in a horizontal plane around the listener, or in more recent systems in full three-dimensional (3D) space (see Chapter 6), using positional metadata. An audio object can therefore be thought of as a “track” in a DAW. Similarly, interactive audio engines found in video games or simulators also manipulate sound objects—generally point source emitters—as the building blocks for complex dynamic soundscapes. In this case, the objects can incorporate very rich sets of metadata determining their behavior.

This process of positioning sound elements in space has been in use since the early 1940s with the introduction of the FANTASOUND system (Garity & Hawkins, 1941) and later evolved into the now-common 5.1 and 7.1 surround sound systems (see Chapter 2). Until recently, due to technical limitations of the various delivery media, these objects were pre-mixed into a small number of speaker feeds or channels that can be directly played back on matching loudspeaker layouts, without requiring further processing.

Recently, with the transition to digital cinema, an overall increase in available bandwidth and advances in parametric audio coding, a number of approaches have been proposed (Robinson, Tsingos & Mehta, 2012) to transport a number of the original objects used during production so that they can be rendered inside the playback environment (movie theaters, home theaters, mobile devices, etc.). As opposed to a pre-mixed speaker feed output targeting a single playback configuration, object-based audio delivery preserves a higher spatial resolution and artistic intent all the way to the playback endpoint. This provides more adaptability and the opportunity to deliver richer and more immersive audio experiences within each environment. More generally, object-based audio production and delivery enables:

  • Enhanced immersion, adding height and flexible rendering across speaker layouts and environments;
  • Enhanced personalization, allowing consumers to tailor the content to their preferences;
  • Enhanced adaptability, ensuring that content is optimized across a wider range of playback devices;
  • Enhanced accessibility, with improved multiple language support, improved video description and dialogue enhancement;
  • Efficient production workflows and future-proofing of content, by deriving current or future deliverables from a single object-based master mix.

In this chapter, we offer an in-depth look at object-audio production, delivery and rendering across cinema, broadcasting and interactive applications (e.g., gaming). We first review how audio objects can be represented spatially and how the associated spatial metadata is used to render the objects to loudspeakers. In particular, we will cover in detail some common object panning algorithms and their tradeoffs. We further describe several application-specific sets of metadata, beyond spatial representation, enabling interactivity and fine-grain control of artistic intent.

One of the challenges of object-based representations is the added complexity of manipulating, encoding and transmitting a potentially large number of audio elements, compared to legacy stereophonic or multichannel techniques. We review advances in object-domain spatial coding, where large sets of objects can be converted into smaller, more convenient, sets while preserving the original perceptual intent. We also cover specific audio object parametric coding strategies for low bit-rate delivery, e.g. to the home, and provide some insights on where audio object delivery brings the most improvement compared to channel-based delivery.

Making the best out of an object-based workflow requires improved capture techniques in particular in live production environments. Building on previous chapters, we discuss how some new tools and conversion techniques can be used to complement traditional microphone techniques to capture sets of objects for both immersive and interactive applications.

Finally, we focus on two industry-wide topics: extending loudness metering and control to object-based presentations as well as standardization for interchange and delivery of object-based content.

Spatial Representation and Rendering of Audio Objects

Coordinate Systems and Frame of Reference

In order to specify locations in a space, a frame of reference is required (Klatzky, 1998). There are many ways to classify reference frames, but one fundamental consideration is the distinction between allocentric (or environmental) and egocentric (observer) reference (Figure 8.1). An egocentric frame of reference encodes object location relative to the position (location and orientation) of the observer or “self.” An allocentric frame of reference encodes object location using reference locations and directions relative to other objects in the environment. An egocentric reference is commonly used for the study and description of perception; the underlying physiological and neurological processes of acquisition and coding most directly relate to the egocentric reference. An allocentric reference is better suited for scene description that is independent of a single observer position, and when the relationship between elements in the environment is of interest. For interactive rendering, video game applications programming interfaces (APIs) generally express the positions of audio objects as allocentric world-space Cartesian coordinates. The coordinates of the objects may be converted to a listener/egocentric frame of reference at rendering time depending on the player’s position, in particular if a single perspective has to be rendered (e.g., on headphones).

When choosing the frame of reference for audio mixing in post-production, the following issues should be taken into consideration: (1) How to best capture artistic intent, (2) How to best preserve and reproduce artistic intent in a variety of listening environments (known as translation), (3) The tools and man-machine interface used to capture artistic intent and (4) Consistent behavior across a wide audience area.

To understand how to best capture and translate artistic intent, one must consider what spatial relationships the mixing engineer is intending to create and preserve. In general, mixing engineers tend to think and mix in allocentric terms, and panning tools are laid out with an allocentric frame—the screen, the room walls—and they expect sounds to be rendered that way: this sound should be on screen, this sound should be off screen, 1/4th of the way from the left to the right wall, and so forth. Movements are also defined in relation to the playback environment e.g., a fly-over from the center of the screen, up across the ceiling and ending at the center of the back wall. Using an egocentric frame of reference can result in an object on the side wall of an elongated mixing room ending up on the back wall of a more square exhibition space. If the egocentric audio framework is 3D, i.e., includes distance, azimuth and elevation, then an object on the rear wall of a small mix stage would end up well within the audience area of a large exhibition auditorium (Figure 8.1(b)). Using an allocentric reference, for every listening position, and for any screen size, the sound can be described at the same relative position on the screen, e.g., 1/3rd left of the middle of the screen. This allows the relationships to be captured and optimally reproduced in the wide range of room sizes and shapes that exist in exhibition.

Modern (surround-sound era) cinema sound uses an allocentric frame of reference (Figure 8.1 (a)). The reference points are the nominal location of loudspeakers (e.g., L, C, R) or loudspeaker zones (e.g., left-side surround array, right-top surround array). These locations have a known and consistent mapping to the important features of the cinema environment: the screen, the audience, the room. These locations also have a known and consistent mapping to authoring tools: left/right fader, joystick position, and the GUI. In this way, when a mix control is full-left and full-front it is understood that sound will be reproduced by a loudspeaker that is nominally located at the left edge of the screen. All location metadata are generated and decoded using this reference. This applies to both objects and channels. Only by using the same frame of reference for both objects and channels can we ensure that the spatial relationship between objects and channels is preserved.

Another particular example is illustrated in Figure 8.2, where characters on screen follow an off-screen sound event with their eyes. The corresponding sound object is perceived as coming from a different direction at each seat, but this is consistent with the position of this sound in the room as perceived by both the audience and the characters on screen.

Rendering Approaches

A fundamental operation in spatial sound content creation tools is audio rendering. Audio rendering algorithms map a monophonic audio signal to a set of loudspeakers to generate the perception of an auditory event at an intended source location in space. Such algorithms have long been a key component of channel-based surround sound program creation (Rumsey, 2001; Begault & Rumsey, 2004) and are required to play back object-based content.

Some rendering algorithms, such as wave field synthesis (de Vries, 2009) or higher-order ambisonics (Furness, 1990), attempt to recreate a physically based sound field in the listening area. Such techniques will be covered in Chapters 9 and 10 of this book. Other physically based rendering techniques, such as binaural rendering, can also be used specifically for headphone playback of audio objects (see Chapter 4).

Alternatively, object renderers approximate higher-level perceptual cues, such as interaural time/level differences for the desired source position. Most algorithms currently used in professional audio production attempt to recreate suitable cues by amplitude panning (Lossius, 2009; Dickins, 1999; Pulkki, 1997). A normalized gain vector $[G_i]$ ($1 \le i \le n$), where $\sum_i G_i^2 = 1$, is computed and assigned to the source signal for each of the $n$ loudspeakers in use. The object signal $s(t)$ is therefore reproduced by each loudspeaker as $G_i(x, y, z)\cdot s(t)$, creating suitable localization cues for a phantom sound source indicated by the object $(x, y, z)$ coordinates. For instance, Figure 8.3 illustrates how different speakers may be used among various 2D rendering (panning) algorithms to simulate an object’s perceived position in the playback environment.

For moving objects, the gains are traditionally evaluated over small time-frames and interpolated, either directly or using an overlap-add reconstruction of the output audio signal. Depending on the rendering system, object position updates, gain computations and audio processing can be performed synchronously or asynchronously. Audio renderers traditionally re-sample the incoming object coordinate updates (e.g., originally at 30 Hz) to a fixed audio frame rate (e.g., 100 Hz) at which the evaluation of the panning gains is performed. The gains are then further interpolated on a per sample basis, matching the audio processing rate (e.g., 48 kHz) (Tsingos & Gascuel, 1997; Tsingos, 2001).
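As a simple illustration, the Python sketch below applies frame-rate gain updates to an object signal with per-sample linear interpolation between successive gain vectors. The `compute_gains` callback stands in for any of the panning laws discussed below and is an assumption of this sketch, as is the requirement that the signal length equal the number of frames times the frame length.

```python
import numpy as np

def render_moving_object(signal, positions, frame_len, compute_gains):
    """Render a mono object with time-varying panning gains.

    signal        : 1-D array of object samples (len == len(positions) * frame_len)
    positions     : list of (x, y, z) positions, one per audio frame
    frame_len     : samples per audio frame (e.g., 480 at 48 kHz for 100 Hz updates)
    compute_gains : callable mapping a position to a gain vector of length n_speakers
    """
    n_frames = len(positions)
    gains = np.array([compute_gains(p) for p in positions])   # (n_frames, n_speakers)
    n_speakers = gains.shape[1]
    out = np.zeros((n_frames * frame_len, n_speakers))

    prev = gains[0]
    for f in range(n_frames):
        cur = gains[f]
        # Linear per-sample ramp between the previous and current gain vectors
        ramp = np.linspace(0.0, 1.0, frame_len, endpoint=False)[:, None]
        g = (1.0 - ramp) * prev + ramp * cur
        start = f * frame_len
        out[start:start + frame_len] = signal[start:start + frame_len, None] * g
        prev = cur
    return out
```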

While most renderers compute wide-band gain values for efficiency, Pulkki et al. (1999) and Laitinen et al. (2014) conducted several studies of the frequency-dependent localization and loudness bias of directional panning algorithms and demonstrated the additional benefits of frequency-dependent panning gains.

Directional, Vector-Based Panning

Directional pairwise panning (Figure 8.3(a)) is a commonly used strategy that solely relies on the directional vector from a reference position (generally the sweet spot or center of the room) to the desired object position. The pair of speakers bracketing the relevant directional vector is used to place (render) that object’s position in space during playback. A well-documented extension of directional panning to 3D loudspeaker layouts is Vector-Based Amplitude Panning (VBAP) (Pulkki, 1997), which uses triplets of speakers (see Figure 8.4) to render a sound with a desired 3D direction of incidence to the listener.

A set of speaker triplets can be obtained by triangulating the convex hull of the loudspeaker array, e.g., using a Delaunay triangulation algorithm that provides triangle meshes adapted to the specific geometry of the reproduction loudspeaker setup (Barber, Dobkin & Huhdanpaa, 1996). For a given object position $p$ and sweet spot $O$, a single triplet of speakers is selected by intersecting the corresponding direction vector $d = (p - O)/\lVert p - O \rVert$ with the triangulated convex hull (Figure 8.4). The direction $d$ can be expressed as a function of the unit directions $l_1, l_2, l_3$ of the 3 corresponding speakers as $d = g_1 l_1 + g_2 l_2 + g_3 l_3$. The vector of gains for each loudspeaker $G = [g_1\ g_2\ g_3]$ can therefore be obtained as:

\[
G = d^{T} L^{-1}
\]

where L is the 3 × 3 matrix of the loudspeaker direction vectors.
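A minimal Python sketch of this gain computation is shown below. The triangulation (the `triplets` list) is assumed to be precomputed, e.g., from a convex-hull library, and the triplet containing the object direction is identified as the one that yields all non-negative gains.

```python
import numpy as np

def vbap_gains(obj_dir, speaker_dirs, triplets):
    """Compute VBAP gains for a desired unit direction of incidence.

    obj_dir      : 3-vector pointing from the sweet spot toward the object
    speaker_dirs : (n_speakers, 3) array of unit loudspeaker direction vectors
    triplets     : list of 3-tuples of speaker indices from a triangulation of
                   the loudspeaker convex hull (assumed to be precomputed)
    """
    speaker_dirs = np.asarray(speaker_dirs, dtype=float)
    d = np.asarray(obj_dir, dtype=float)
    d = d / np.linalg.norm(d)
    gains = np.zeros(len(speaker_dirs))

    for tri in triplets:
        L = speaker_dirs[list(tri)]        # rows are the triplet directions l1, l2, l3
        try:
            g = np.linalg.solve(L.T, d)    # solves d = g1*l1 + g2*l2 + g3*l3, i.e., G = d^T L^-1
        except np.linalg.LinAlgError:
            continue                       # degenerate triplet, skip
        if np.all(g >= -1e-9):             # all non-negative: d falls inside this triplet
            g = np.clip(g, 0.0, None)
            gains[list(tri)] = g / np.linalg.norm(g)   # power normalization
            break
    return gains
```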

Several extensions over the generic triangulation and VBAP algorithm have been developed, e.g., in the MPEG-H standard, to improve its performance, in particular for arbitrary loudspeaker setups (Herre et al., 2015).

First, triangulation algorithms have been designed such that they yield a left-right and front-back symmetric division of the loudspeaker convex hull into triangles, which prevents asymmetric rendering of symmetrically placed sound objects. Alternatively, the VBAP approach can be extended by using a hybrid mesh composed of quadrilaterals and triangles (or, in general, n-gons). Once the intersection of the object direction vector with the mesh has been determined, the panning gains can then be computed as a function of the barycentric coordinates of the intersection point in the triangle or generalized barycentric coordinates in the polygon (Warren et al., 2007).

VBAP also requires that the convex hull of the speakers cover the entire sphere (or upper hemisphere) of directions. In order to prevent uneven source movements and to avoid the need for clipping object coordinates, some systems include virtual loudspeakers in the target setup in regions not covered by physical loudspeakers. During rendering, VBAP is applied to a loudspeaker setup extended by the virtual loudspeakers. The obtained signals for the virtual loudspeakers are then downmixed to the actual physical loudspeakers. The downmixing gains for mapping virtual to physical loudspeakers are derived by distributing the virtual loudspeakers' energy equally to the neighboring loudspeakers (i.e., as defined by the edges in the triangulation). One prominent use case for the added virtual loudspeakers is reproduction layouts that only consist of loudspeakers in the horizontal plane: in this case, a virtual loudspeaker is added at the zenith position above the center of the listening area, resulting in smooth perceived movements, e.g., for fly-over sound objects (Herre et al., 2015).

Because vector-based panning only uses the direction of the source relative to a reference position, it cannot differentiate among object sources at different positions along the same direction vector. Moreover, some 3D implementations may constrain the rendered objects to the surface of a unit sphere and thus would not necessarily allow an object to cross inside the room without going “up and over.” Directional panning solutions can also create sharp speaker transitions as objects approach the center of the room, where a small movement of an object’s position would not always translate into a small variation in loudspeaker gains [Gi]. Solving these issues generally requires some form of distance-based blending, where the panning algorithm transitions from the original direction-based behavior to e.g., firing all speakers equally, as the object approaches the origin of the frame of reference (i.e., the sweet spot or center of the room).

Position-Based Panning

Alternative panning approaches relying on position of the objects, rather than direction, provide solutions to the above issues.

The “dual-balance” panning algorithm is the most common approach used in 5.1- or 7.1-channel surround productions today (Figure 8.3(b)). This approach uses the left or right and front or back pan-pot controls widely used for surround panning. As a result, dual-balance panning generally operates on a set of 4 speakers bracketing the desired 2D object position.

Extending to 3D (e.g., when using a vertical layer of speakers above the listener) yields a layered “triple-balance” panner. It generates 3 sets of 1-dimensional (1D) gains corresponding to left or right, front or back, and top or bottom balance values. These values can then be multiplied to obtain the final loudspeaker gains: $G_i(x, y, z) = G_{x,i}(x) \times G_{y,i}(y) \times G_{z,i}(z)$. This approach is fully continuous for objects panned across the room in either 2D or 3D and makes it easier to precisely control how and when speakers on the base or elevation layers are to be used.

The following background and supporting equations provide a representative example of how a basic panning algorithm would be implemented using a simple sine or cosine law. Other panning laws are also possible.

An indicative 1D (stereo) rendering could be derived as follows, using an audio object's x coordinate in [−1, 1]:

\[
G_{left} = \cos\left(\frac{x+1}{2}\cdot\frac{\pi}{2}\right), \qquad
G_{right} = \sin\left(\frac{x+1}{2}\cdot\frac{\pi}{2}\right)
\]

An indicative 3/2/0 example (3 front channels: left, center, right; 2 surround channels: left surround and right surround; 0 LFE channels) would be, with 2D rendering using (x, y) coordinates in [−1, 1] × [−1, 1], as follows:

  • Set all gains to 0.0.
  • Using x, compute left and right pan-pot values for front and back speakers as in the previous stereo example (where ls is left-surround, rs is right-surround, l is left, c is center, and r is right):
\[
G_{ls} = \cos\left(\frac{x+1}{2}\cdot\frac{\pi}{2}\right), \qquad
G_{rs} = \sin\left(\frac{x+1}{2}\cdot\frac{\pi}{2}\right)
\]

if ($x \le 0.0$)

\[
G_l = \sin\left(-x\cdot\frac{\pi}{2}\right), \qquad G_c = \cos\left(-x\cdot\frac{\pi}{2}\right)
\]

else

\[
G_c = \cos\left(x\cdot\frac{\pi}{2}\right), \qquad G_r = \sin\left(x\cdot\frac{\pi}{2}\right)
\]

  • Using y, compute front/back pan-pot values and combine with the previous left/right gains:
\[
f = \cos\left(\frac{y+1}{2}\cdot\frac{\pi}{2}\right), \qquad
b = \sin\left(\frac{y+1}{2}\cdot\frac{\pi}{2}\right)
\]
\[
G_l \mathrel{*}= f;\quad G_r \mathrel{*}= f;\quad G_c \mathrel{*}= f;\qquad
G_{ls} \mathrel{*}= b;\quad G_{rs} \mathrel{*}= b
\]
  • Normalize power to 1.0 by dividing all gains $G_i$ by $\sqrt{\sum_i G_i^2}$.

Following the same principle, these examples can be easily extended using a third dimension for elevation (height).
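A minimal Python sketch of the 3/2/0 dual-balance example above could look as follows; the speaker order, the [−1, 1] coordinate convention and the sine/cosine law are carried over from the equations, and this is illustrative only, not a production panner.

```python
import numpy as np

def pan_320(x, y):
    """Dual-balance pan of a point at (x, y), both in [-1, 1], to a 3/2/0 layout.

    Speaker order is [L, C, R, Ls, Rs]; x = -1 is full left, y = -1 is full
    front (screen) and y = +1 is full back (surrounds).
    """
    g = np.zeros(5)                               # [L, C, R, Ls, Rs]

    # Left/right balance for the surround pair
    g[3] = np.cos((x + 1) / 2 * np.pi / 2)        # Ls
    g[4] = np.sin((x + 1) / 2 * np.pi / 2)        # Rs

    # Left/right balance for the three screen channels
    if x <= 0.0:
        g[0] = np.sin(-x * np.pi / 2)             # L
        g[1] = np.cos(-x * np.pi / 2)             # C
    else:
        g[1] = np.cos(x * np.pi / 2)              # C
        g[2] = np.sin(x * np.pi / 2)              # R

    # Front/back balance applied to screen and surround groups
    f = np.cos((y + 1) / 2 * np.pi / 2)
    b = np.sin((y + 1) / 2 * np.pi / 2)
    g[:3] *= f
    g[3:] *= b

    return g / np.sqrt(np.sum(g ** 2))            # normalize power to 1.0
```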

In contrast to the directional and balance-based approaches, distance-based panning (Lossius et al., 2009; Kostadinov, 2010) (Figure 8.3(c)) uses the relative distance from the desired 2D or 3D object location p to each speaker Li in use to determine the panning gains:

\[
G_i(p) = \frac{1}{\varepsilon + \lVert L_i - p \rVert^{a}},
\]

where a is a distance exponent (typical values being a = 1 or, preferably, a = 2) and ε is a “spatial blur” coefficient that controls how much an object can be rendered by a single speaker only.

As a result, this approach generally uses all available speakers rather than a limited subset, which leads to smoother object pans but has the tradeoff of being prone to timbral artifacts. In addition, as the number of objects increases, even a small leakage to all speakers can lead to an overall mix sounding less discrete.

However, one advantage of this approach is that it does not require an underlying topological structure (e.g., a loudspeaker mesh) and therefore provides ultimate flexibility in terms of supported speaker layouts with a very straightforward implementation.
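A minimal sketch of this distance-based gain computation, directly following the formula above (the choice of the exponent and the spatial blur is application-dependent):

```python
import numpy as np

def dbap_gains(obj_pos, speaker_pos, a=2.0, eps=1e-3):
    """Distance-based amplitude panning gains for a 2D or 3D object position.

    obj_pos     : object position
    speaker_pos : (n_speakers, dims) array of loudspeaker positions
    a           : distance exponent (1 or, preferably, 2)
    eps         : spatial blur, limits how much a single speaker can dominate
    """
    d = np.linalg.norm(np.asarray(speaker_pos, dtype=float) - np.asarray(obj_pos, dtype=float), axis=1)
    g = 1.0 / (eps + d ** a)
    return g / np.sqrt(np.sum(g ** 2))    # normalize power to 1.0
```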

A generalized solution to object panning over arbitrary speaker layouts is to determine an optimal set of non-negative gains $[G_i]$ ($G_i \ge 0$) so that the weighted sum of loudspeaker positions (or directions) yields the desired object position (resp. direction) (Dickins et al., 1999). This can be solved through a least-squares approach. As multiple solutions are possible, additional regularization terms are generally required to ensure smoothly varying gains for moving objects, for instance, by enforcing the gains to be as small as possible. An optimization approach is likely to be more computationally intensive than solutions that explicitly pre-select subsets of speakers within a given topological structure, e.g., a triangle mesh. However, it is more generic in terms of supported speaker layouts and control of image focus and sweet spot robustness.
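As one possible illustration of such an optimization, the sketch below solves a regularized non-negative least-squares problem with SciPy. The soft sum-to-one constraint and the Tikhonov weight `reg` are illustrative design choices for this sketch, not taken from a specific published renderer; the final power normalization matches the convention used above.

```python
import numpy as np
from scipy.optimize import nnls

def lsq_pan_gains(obj_pos, speaker_pos, reg=0.01, sum_weight=10.0):
    """Panning over an arbitrary layout as a constrained least-squares problem.

    Finds g >= 0 such that the gain-weighted sum of speaker positions matches
    the object position, with a soft constraint keeping sum(g) close to 1
    (so the weighted sum is a proper centroid) and a Tikhonov term (`reg`)
    keeping gains small, which favors smoothly varying gains for moving objects.
    """
    P = np.asarray(speaker_pos, dtype=float).T            # (dims, n_speakers)
    p = np.asarray(obj_pos, dtype=float)
    n = P.shape[1]

    A = np.vstack([P,
                   sum_weight * np.ones((1, n)),          # soft sum(g) = 1
                   np.sqrt(reg) * np.eye(n)])             # keep gains small
    b = np.concatenate([p, [sum_weight], np.zeros(n)])

    g, _ = nnls(A, b)                                     # non-negative solve
    return g / np.sqrt(np.sum(g ** 2))                    # power normalization
```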

Tradeoffs of Different Amplitude Panning Strategies

The design of object panning/rendering algorithms ultimately must balance tradeoffs among timbral fidelity, spatial accuracy, smoothness and sensitivity to listener placement in the listening environment, all of which can affect how an object at a given position in space is perceived by listeners.

For instance, Kostadinov, Reiss and Mladenov (2010) compared source localization with DBAP and VBAP and found the two approaches to perform comparably. However, no evaluation of the source timbral fidelity was conducted.

Different rendering approaches may have a significant impact on how object trajectories are perceived in the playback environment. Figure 8.5 illustrates results from an experiment comparing the three panning algorithms of Figure 8.3 to produce 2D pans across a 250-seat movie theater outfitted with a 25-loudspeaker system (Tsingos et al., 2014). Taken together, these results suggest that rendering strategies critically depend on listener distance from the center of the room (i.e., the origin of the egocentric frame of reference for directional panning), with dual-balance panning performing well near the center of the room and direction-based panning achieving good results at far distances, especially near or outside of the room boundary.

Point Objects and Wide Objects

Most object audio renderers include a control of perceptual object size that helps mixers create the impression of spatially extended sound sources. Perceptual object size can be implemented using a combination of spreading the object across multiple neighboring speakers and decorrelating the resulting speaker feeds to prevent the creation of a phantom panned image (Figure 8.6). The set of gains for the spread object is computed by spatial integration, summing the gains for a number of elementary point sources covering a given 2D area or 3D volume (Figure 8.7). For a review of decorrelation and size algorithms, we refer the reader to Potard and Burnett (2004).

Different solutions are used to model the spreading of an object. Following an egocentric frame of reference, some interfaces and renderers model spreading as an angular spread in azimuth, elevation and possibly distance (Figure 8.7(a)). For instance, the spread algorithm in MPEG-H 3D Audio is based on Multiple Direction Amplitude Panning (MDAP) (Herre et al., 2015; Pulkki, 1997). Other rendering approaches model spread as a 3D box-shaped volume but can also limit the spreading to the walls if the original object is positioned on the wall boundary (Figure 8.7(b) and (c)) (Robinson et al., 2012).
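The sketch below illustrates size rendering by spatial integration: point-source gains are summed over a cube of elementary positions around the object and then power-normalized. The decorrelation of the resulting speaker feeds mentioned above is omitted, and the `compute_gains` point panner and the [−1, 1] room coordinates are assumptions carried over from the earlier examples.

```python
import numpy as np

def spread_gains(center, size, compute_gains, n_samples=5):
    """Approximate gains for a spatially extended (wide) object.

    Sums the point-source gains of elementary sources sampled over a cube of
    edge length `size` centered on the object, then re-normalizes power.
    `compute_gains` maps a single (x, y, z) position to a per-speaker gain
    vector; decorrelation of the resulting speaker feeds is not shown.
    """
    center = np.asarray(center, dtype=float)
    offsets = np.linspace(-0.5, 0.5, n_samples) * size
    total = None
    for dx in offsets:
        for dy in offsets:
            for dz in offsets:
                p = np.clip(center + np.array([dx, dy, dz]), -1.0, 1.0)  # keep samples inside the room
                g = compute_gains(p)
                total = g if total is None else total + g
    return total / np.sqrt(np.sum(total ** 2))
```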

Advanced Metadata and Applications of Object-Based Representations

Artistic Controls for Object Rendering in Cinema Audio

While the use of a consistent core audio rendering technique is desirable, it cannot be assumed that a given rendering technique will always deliver consistent, aesthetically pleasing, results across different playback environments. For instance, cinema mixing engineers commonly remix the same soundtrack for different channel-based formats, such as 7.1/5.1 or stereo, to achieve their desired artistic goals in each configuration. With several hundred audio tracks competing for audibility, maintaining the discreteness of the mix and finding a place for all the key elements is a challenge that all cinema mixing engineers face and that requires mixing rules that are deliberately inconsistent with a physical model or a direct re-rendering across different speaker configurations.

Recent cinema mixing formats (Robinson et al., 2012; Robinson & Tsingos, 2001) introduce additional object-level controls such as loudspeaker-zone metadata, which are used to dynamically reconfigure the object renderer to “mask out” certain loudspeakers (see Figure 8.6 and Figure 8.8(a)). This guarantees that no loudspeaker belonging to the masked zones will be used for rendering the object. Typical zone masks used in production include no sides, no back, screen only, room only and elevation on/off. In this section, we review how they are used by mixing engineers to control and improve object panning over a large audience, or to render overhead objects for speaker layouts without ceiling speakers, and how they affect rendering of perceived object size. In addition, we illustrate the use of an additional snap-to-speaker rendering metadata to control the discreteness of the rendering as well as downmixing of proscenium objects (i.e., objects in the room that are close to the screen).

Usage of Speaker Zone Masks

A main application of speaker zone masks is to help the mixing engineer achieve a tight control of which speakers are used to render each object in order to maximize the desired perceptual effect. For instance, the no sides mask guarantees that no loudspeaker on the side wall of the room (see Figure 8.8(a)) will be used. This creates more stable screen-to-back fly-throughs across a wider audience. If the side speakers are used to render such trajectories, they will become audible for the seats nearest to the side walls and these seats will perceive a distorted trajectory “sliding” along the walls rather than crossing the room.

Another key application of zone masks is to fine tune how overhead objects must be rendered in a situation where no ceiling speakers are available. Depending on the object and whether it is directly tied to an on-screen element, a mixing engineer can choose, e.g., to use the screen only or room only mask to render this object, in which case it will be rendered only using screen speakers or using surround speakers, respectively, when no overhead speakers are present. Overhead music objects, for instance, are often authored with a screen only mask. This is directly achieved as a function of the object elevation coordinate (z). If z = 1 the object is rendered fully in the overhead speaker zone using all available overhead speakers. When no ceiling speakers are available, the object is projected on the base plane for rendering (i.e., z = 0) and the zone masks apply.

Speaker zone masks provide an effective means to further control which speakers can be used as part of the process in order to optimize the discreteness of the mix. For instance, a wide object can be rendered only in the 2D plane by using the elevation off mask. To avoid adding more energy to screen channels, which could compromise dialogue intelligibility, the room only mask is also used.
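Conceptually, a zone mask removes the excluded loudspeakers from the layout before the panning gains are computed, as in the sketch below; the zone labels and the `pan` callback are illustrative.

```python
import numpy as np

def render_with_zone_mask(obj_pos, speaker_pos, speaker_zones, masked_zones, pan):
    """Render an object with loudspeaker-zone masks applied.

    speaker_zones : one zone label per speaker (e.g., 'screen', 'side', 'back', 'ceiling')
    masked_zones  : set of zone labels excluded by the object metadata
                    (e.g., {'side'} for a "no sides" mask)
    pan           : any point panner mapping (position, layout) to gains,
                    e.g., the dbap_gains sketch above

    Speakers in masked zones are removed from the layout *before* panning,
    which guarantees that no energy is rendered to them.
    """
    speaker_pos = np.asarray(speaker_pos, dtype=float)
    active = [i for i, zone in enumerate(speaker_zones) if zone not in masked_zones]
    gains = np.zeros(len(speaker_pos))
    gains[active] = pan(obj_pos, speaker_pos[active])
    return gains
```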

Panned vs. Discrete Sources

Another useful aesthetic control parameter is the snap-to-speaker mode. The mixing engineer can select this mode to indicate that consistent reproduction of timbre is more important than consistent reproduction of position. When this mode is enabled, the object renderer does not perform phantom panning to locate the desired sound image. Rather, it renders the object entirely from the single loudspeaker nearest to the intended object location. Reproduction from a single loudspeaker creates a pinpoint, timbrally neutral source and can be used to contrast key effects in the mix, in particular with respect to more diffuse elements such as those rendered directly using the channel-based representation and cinema arrays.

The snap-to-speaker mode can also be combined with zone exclusion masks and size control although size is generally forced to zero (i.e., no spreading) as the goal is mainly to create a sharp point source. To achieve better consistency when the snap mode is turned on and off, the “nearest” speaker is generally chosen as the one that would receive the largest energy if the source was phantom panned (i.e., snap off). To avoid snapping to a speaker too far from the original intended object location (which could happen in sparse speaker configurations), a snap-release threshold is provided. If the snapped position is farther from the intended position by more than this distance threshold, the renderer reverts to panning.
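A sketch of this snap logic (speaker selection by largest panned energy, with a release threshold that reverts to phantom panning) could look as follows:

```python
import numpy as np

def snap_to_speaker(obj_pos, speaker_pos, panned_gains, release_threshold):
    """Snap-to-speaker rendering with a release threshold.

    The "nearest" speaker is the one that would receive the most energy if the
    object were phantom panned (`panned_gains`). If that speaker is farther
    from the intended position than `release_threshold`, revert to panning.
    """
    speaker_pos = np.asarray(speaker_pos, dtype=float)
    nearest = int(np.argmax(np.abs(panned_gains)))
    if np.linalg.norm(speaker_pos[nearest] - np.asarray(obj_pos)) > release_threshold:
        return panned_gains                 # snapped speaker too far: keep the phantom pan
    gains = np.zeros(len(speaker_pos))
    gains[nearest] = 1.0                    # discrete, timbrally neutral point source
    return gains
```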

For film soundtracks, a key use of the snapped parameter is to create near-screen/wide pairs of objects along the side walls of the cinema by using objects which snap to the proscenium speakers. This is particularly useful for music elements, e.g., to extend the orchestra beyond the screen. When re-rendered to sparser speaker configurations (e.g., legacy 5.1 or 7.1), these elements will be automatically snapped to left/right screen channels. Another use of the snap metadata is to create “virtual channels,” for instance to re-position the outputs of legacy multichannel reverberation plug-ins in 3D without risk of “double panning.”

Figure 8.8(b) shows overall aesthetic metadata usage in six recent movie clips. As can be seen, the default rendering behavior, based only on object position, is used most of the time while mixing object-based content. However, in a significant number of cases, the baseline rendering does not give the best result and additional artistic input, overriding the default behavior, is beneficial.

Cinematic Virtual Reality and Headphone Playback

An emerging type of cinematic content targets virtual reality (VR) endpoints. VR endpoints leverage head-mounted displays (HMDs) and headphone rendering to deliver an immersive full 360-degree stereoscopic experience to the user. Rendering of spatialized sound is paramount for this use case and object-based descriptions are well suited to create auditory scenes with the required high spatial resolution. However, additional rendering metadata can also be authored and delivered on a per-object basis to help the mixing engineers fine tune the headphone experience. For instance, it can be beneficial to control the rendering algorithm depending on the type of content. A stereo music ambiance could then be rendered as is, without binaural spatial processing and without head tracking (i.e., head-relative), while sound effects can be spatially rendered and head-tracked (i.e., world-relative). Mixing engineers also commonly choose to render percussive sounds (e.g., drums) in stereo to avoid transient smearing due to binaural filtering. Endpoint-specific metadata enables the resulting VR headphone mix to be delivered and adapted for loudspeaker playback, for instance combining an HMD for video playback with a home-theater system for audio.

Interactivity and Personalization in Broadcasting

Traditionally, live-broadcast mixing engineers take all microphone feeds from a live event and mix them down to a high-quality 5.1-channel or stereo program. Production mixing engineers also create different submixes (e.g., ambience or effects, music and dialogue) before mixing down to a final program. These submixes make up the audio beds (i.e., traditional multichannel ambiences) and objects that are sent discretely to the playback device, where they can be personalized and rendered. For example, an ambience or effects submix could be represented by a 5.1-channel audio bed, and each of the multiple dialogue submixes (e.g., for multiple commentators or languages) could be represented by an audio object. As a result, existing microphones and microphone plans can be used to create customizable object-based mixes for live broadcast events. Additional microphones (e.g., for capturing height or close-proximity crowd sounds) can be added to provide a more immersive audio experience.

A channel-based live audio production today is composed of a number of individual elements, which may include the following:

  • Crowd sound (diffuse) constructed from a number of sources;
  • Spot sounds (e.g., ball kick or basketball bounce);
  • Off-screen dialogue (e.g., commentary or announcer);
  • On-screen dialogue (e.g., studio links and rangefinder camera interviews);
  • Audio effects associated with on-screen graphic transitions;
  • Pre-recorded audio and video (A/V) playback material (e.g., a highlights package or replay elements from the event);
  • Synthesized fill sounds (e.g., a helicopter, garage sounds in a pit lane, or crowd fill).

An object-based audio production is made up of the same elements. The difference, however, is that some of these elements are not blended into the mix during production but instead are sent to the receiver as a number of different audio presentations (made from one or more sub-streams). Then, the selected presentation is rendered in the receiver to the final speaker configuration (Mann et al., 2013).

An additional layer of metadata defines the personalization aspects of the audio program. These personalization metadata serve two purposes: to define a set of unique audio presentations that a consumer could select and to define dependencies (i.e., constraints) between the audio elements (objects) that make up the unique presentation to ensure that personalization always sounds optimal (Riedmiller et al., 2015).

Presentation Metadata

Producers and sound mixing engineers can define multiple audio presentations for a program to allow users to switch easily among several optimally predefined audio configurations. For example, a sound mixing engineer for a sports event could define a default sound mix for general audiences, biased sound mixes for supporters of each team that emphasize their crowd and favorite commentators, and a commentator-free mix. The defined presentations depend on the content genre (e.g., sport or drama) and differ among subgenres (from sport to sport or form to form). Presentation metadata define the details that create these different sound experiences.

An audio presentation specifies which object elements or groups should be active, along with their position and their volume level. Defining a default audio presentation ensures that audio is always output for a given program. Presentation metadata can also provide conditional rendering instructions that specify different audio object placements or volumes for different speaker configurations. For example, a dialogue object's playback gain may be specified at a higher level when reproduced on a mobile device as opposed to an A/V receiver. Each object or audio bed may be assigned a category, such as dialogue or music. This category information can be used later either by the production chain to perform further processing or by the playback device to enable specific behavior. For example, categorizing an object as dialogue would allow the playback device to manipulate the level of the dialogue object with respect to the ambience. This categorization can also be used to help in prioritizing objects during playback to conditionally enable ducking of non-prioritized elements and thus enhance intelligibility.

Presentation metadata can also identify the program, along with other aspects (e.g., which sports genre or which teams are playing), that could be used to automatically recall personalization details when similar programs are played. For example, if a consumer personalizes a viewing experience to always pick a radio commentary for a baseball game, the playback device can remember this genre-based personalization and select the radio commentary for subsequent baseball games. The presentation metadata also contain unique identifiers for the program, each presentation and each sound element. This allows user interfaces on the playback device to associate user interface elements to each aspect of the personalized program. Presentation metadata typically do not vary temporally on a frame-by-frame basis. However, they may occasionally change throughout the course of a program; for example, the number of presentations available during live game play may change during a halftime show.

Interactive Metadata

Users may want complete control over personalizing an audio program. To ensure that every customization results in an optimal sounding mix, the content creator or broadcaster can provide interactive metadata defining a set of rendering rules to be used for personalization only. Interactive metadata can specify object parameter minimum or maximum values; inter-object mutual exclusion; inter-object position, volume or ducking rules; and overall mix rules. The interactive metadata are typically leveraged by a consumer user interface to prevent the creation of a non-ideal rendering (mix). For example, if both English and Spanish dialogue objects are present in the audio stream, the interactive metadata would prevent a user from enabling both objects simultaneously. The interactive metadata can be represented using a mixgraph structure (Figure 8.9).
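To make this concrete, the sketch below shows one possible, purely illustrative way to represent a presentation and a set of interactive rules, and to validate a user customization against them. The field names and structure are hypothetical and do not follow any particular standardized bitstream or metadata syntax.

```python
# Hypothetical structures for illustration only
presentation = {
    "id": "home-team-mix",
    "active_objects": ["crowd_bed", "commentary_en", "home_crowd_close"],
    "object_gains_db": {"commentary_en": 0.0, "crowd_bed": -3.0},
}

interactive_rules = {
    # Mutually exclusive groups: at most one member may be active at a time
    "exclusive_groups": [["commentary_en", "commentary_es", "commentary_radio"]],
    # Per-object user gain limits in dB
    "gain_limits_db": {"commentary_en": (-12.0, 6.0), "crowd_bed": (-20.0, 3.0)},
}

def validate_personalization(active_objects, user_gains_db, rules):
    """Return a list of rule violations for a user-customized mix."""
    errors = []
    active = set(active_objects)
    for group in rules["exclusive_groups"]:
        clash = active.intersection(group)
        if len(clash) > 1:
            errors.append(f"objects {sorted(clash)} are mutually exclusive")
    for obj, gain in user_gains_db.items():
        lo, hi = rules["gain_limits_db"].get(obj, (-float("inf"), float("inf")))
        if not lo <= gain <= hi:
            errors.append(f"gain for {obj} outside allowed range [{lo}, {hi}] dB")
    return errors
```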

Video Games and Simulation

The definition of audio object in a game engine generally extends far beyond what can be found for theatrical or broadcasting applications. It includes specific sound source modeling (e.g., directivity) as well as propagation metadata (e.g., distance attenuation models and reverberation parameters). For a review of audio rendering for games and simulation, we refer the reader to (Savioja et al., 1999; Funkhouser, Jot & Tsingos, 2002). This is to be contrasted to post-production applications where such effects are generally pre-baked in the objects’ audio essence and therefore are not considered part of the object metadata further transmitted downstream e.g., for theatrical or home playback. In addition, audio objects in game engines can comprise multiple elements each tied to the game logic through specific control parameters, therefore defining very complex mixing graphs including scripted behavior e.g., enabling live concatenative synthesis (Roads, 1996).

Several standards and APIs have historically been developed to render object-based content for games. For instance, DirectSound (Bargen & Donelly, 1998) and OpenAL (2000) were introduced to perform rendering of point-source objects in 3D space and also included both standardized (IASIG level 2.0, 2016) and proprietary (EAX, 2004) extensions enabling the rendering of environmental occlusion or reverberation effects. Some environmental rendering extensions were also part of the MPEG-4 audio BIFS description format (Jot, Ray & Dahl, 1998). Today, video game developers either rely on middleware solutions (FMOD, WWISE) or proprietary audio engines for which specific plug-ins can be created and the flexibility of which extends far beyond what could realistically be captured with a limited set of metadata. Such engines combine real-time object rendering with dynamic busing and audio effect plug-in architectures similar to post-production DAWs or consoles.

However, similar to production environments, game engines traditionally render a channel-based audio output. Object-based audio output, in a way that abstracts the rendering configuration, is highly desirable, in particular as new loudspeaker systems are introduced that support rendering of height for instance. In that sense, the notion of object-based audio output matches the one used in post-production applications. In recent gaming consoles such as Sony’s PS4 or Microsoft’s Xbox One, this functionality is enabled at the system level and supports plug-ins for encoding of object-based information into proprietary formats supported by external A/V receivers.

Managing Complexity of Object-Based Content

Whether in post-production or for interactive gaming, an audio mix can comprise hundreds of simultaneous elements. It is often desirable to compress the information of the resulting auditory scene in a way independent of the chosen reproduction technique to enable more efficient delivery, rendering or post-processing. Several perceptual auditory properties may be exploited in order to simplify the rendering of a complex object-based scene with limited impact on the overall perceived audio quality. The general approach is to re-organize the sound scene by (1) sorting its components by their relative importance and (2) reducing its spatial complexity.

Prioritizing and Culling of Objects

A first approach to manage the complexity of an object-based audio scene is to prioritize objects and discard the least important ones. This solution builds upon prior work from the field of perceptual audio coding that exploits auditory masking. When a large number of sources are present in the environment, it is very unlikely that all will be audible due to masking occurring in the human auditory system (Moore, 1997). This masking mechanism has been successfully exploited in perceptual audio coding (PAC), such as the well-known MPEG I Layer 3 (mp3) standard (Painter & Spanias, 2000) and several efficient computational models have been developed in this field. This approach is also linked to the illusion of continuity phenomenon (Kelly & Tew, 2002), although current works do not generally include explicit models for this effect. This phenomenon is implicitly used together with masking to discard entire frames of original audio content without perceived artifacts or holes in the resulting mixtures (Gallo, Lemaitre & Tsingos, 2005).

Evaluating all possible solutions to the optimization problem required for optimal rendering of a sound scene would be computationally intractable. An alternative is to use greedy approaches, which first require estimating the relative importance of each source in order to get a good starting point. A key aspect is also to be able to dynamically adapt to the content. Several metrics can be used for this purpose, such as energy, loudness or saliency (Kayser et al., 2005). Recent studies have compared some of these metrics, showing that they might achieve different results depending on the nature of the signal (speech, music, ambient sounds). Loudness has been found to generally lead to better results, while energy is a good compromise between complexity and quality (Gallo et al., 2005).

Following these principles, recent game audio engines implement “High-Dynamic Range” audio by dynamically adapting the dynamic range window to the current loudness of the mix. This offers the opportunity to simultaneously control the dynamic range of the overall mix while culling the objects of weaker relative level, freeing up computational resources (Frostbite, 2009).
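A minimal sketch of such loudness-window culling, using per-frame energy as the importance metric and a configurable window below the loudest object:

```python
import numpy as np

def cull_objects(frames, window_db=60.0):
    """Keep only objects whose frame energy lies within `window_db` of the loudest.

    frames : dict mapping object id to its current audio frame (1-D numpy array).
    Energy is used here as the importance metric (a good compromise between
    complexity and quality); objects falling below the window are culled for
    this frame, freeing up rendering resources.
    """
    energy_db = {oid: 10.0 * np.log10(np.mean(x ** 2) + 1e-12)
                 for oid, x in frames.items()}
    loudest = max(energy_db.values())
    return [oid for oid, e in energy_db.items() if e >= loudest - window_db]
```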

Spatial Coding

Limitations of human spatial hearing, e.g., as measured through perceivable distance and angular thresholds (Begault, 1994), can also be exploited for faster rendering independently of the subsequent signal processing operations. Studies have also shown that our auditory localization is strongly affected in multi-source environments. Localization performance decreases with an increasing number of competing sources (Brungart, Simpson & Kordik, 2005), showing various effects such as pushing effects (the source localization is repelled from the masker) or pulling effects (the source localization is attracted by the masker), which depend on the time and frequency overlap between the concurrent sources (Best et al., 2005). Thus, spatial simplification can be performed more aggressively as the complexity of the scene, in particular the number of sound sources, grows. However, if interaction is possible at rendering time, for instance if the listener can navigate the scene or some objects can be emphasized, this approach may lead to artifacts unless the relevant elements are prioritized accordingly. Another challenge is spatial coding in the presence of object-specific rendering metadata (see the section “Advanced Metadata and Applications of Object-Based Representations”). In this case, spatial coding approaches must be extended so that objects with different metadata are preserved in different groups.

If the reproduction format is known in advance, one straightforward approach to reducing spatial complexity is simply to render some objects to an intermediate smaller set of channels, at the expense of downstream flexibility. This approach is often used in interactive game engines to feed “effect-send” buses (e.g., to multichannel reverberators) with a premix of the objects in the scene.

For applications where the reproduction format or target device is not set in advance, the coding/simplification of the spatial information must be performed in scene-space, ideally directly on the original objects themselves. One solution is to convert objects to a fixed set of spatial basis functions to obtain a new alternative representation that does not depend on the original number of objects. For instance, Ambisonics (Malham & Myatt, 1995; Daniel & Moreau, 2004) uses a spherical harmonics decomposition of the incoming sound pressure at the listening point (we refer the reader to Chapter 9 for more information on Ambisonics and high-order Ambisonics solutions).

However, it is generally desirable to preserve the object-based nature of the scene by converting the original set of objects into a reduced set of perceptually equivalent objects (Figure 8.10). This solution, which we refer to as spatial coding, can be divided into three categories:

  • Fixed clustering approaches operate in object-space and explicitly group neighboring sound sources belonging to the same cone of directions (Herder, 1999), or using fixed or hierarchical grid structures (Wand & Straßer, 2004).
  • Dynamic per-object clustering: The clustering proposed by Sibbald (2001) is an object-based method aiming at progressively rendering objects formed of complex aggregates of elementary sources. Sound sources related to an object, or an area, are grouped according to their distance to the listener. In the near field, secondary sound sources are created and dynamically uncorrelated in order to improve the spatial sensation. In the far field, sources are clustered together, accelerating the spatial rendering. The drawback of the method is that the clustering is evaluated on a per-object basis and does not consider interactions between all the elements of the scene.
  • Dynamic global clustering: Dynamic source clustering methods (Tsingos, Gallo & Drettakis, 2004) based on both the geometry of the scene and the signals emitted by each source have also been proposed. This is especially useful for scenes where sound objects are frequently changing in time, varying their shape, energy as well as location. These algorithms flexibly allocate the required number of clusters; thus clusters are not wasted where they are not needed. Dynamic clustering can be greedily derived from e.g., the Hochbaum-Shmoys heuristic. The cost-function used for clustering combines instantaneous loudness (Moore, Glasberg & Baer, 1997) and inter-object distance (allocentric grouping) or distance and incidence direction to the listener (egocentric grouping). An equivalent signal for each object cluster is then computed as a mixture of the signals of the clustered sound sources. In Tsingos et al. (2004), each original object is assigned to a single cluster but it is also possible to re-distribute objects to multiple clusters based on their relative position. A representative loudness-weighted centroid is used to spatialize the cluster according to the desired reproduction setup.
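The sketch below illustrates a dynamic, loudness-weighted clustering pass of this kind: greedy, Hochbaum-Shmoys-style selection of cluster centers, assignment of each object to its nearest center, and computation of a loudness-weighted centroid and an equivalent downmix signal per cluster. It is a simplified, allocentric variant for illustration only, not the exact algorithm of Tsingos et al. (2004).

```python
import numpy as np

def cluster_objects(positions, loudness, signals, n_clusters):
    """Greedy, loudness-weighted clustering of audio objects (allocentric grouping).

    positions : (n_objects, 3) array of object positions
    loudness  : (n_objects,) array of per-object loudness estimates for this frame
    signals   : (n_objects, n_samples) array of object audio frames
    Returns loudness-weighted cluster centroids and equivalent cluster signals.
    """
    positions = np.asarray(positions, dtype=float)
    loudness = np.asarray(loudness, dtype=float)
    signals = np.asarray(signals, dtype=float)
    n = len(positions)

    # Greedy seeding: start from the loudest object, then repeatedly pick the
    # object whose loudness-weighted distance to the existing centers is largest.
    centers = [int(np.argmax(loudness))]
    while len(centers) < min(n_clusters, n):
        d = np.min(np.linalg.norm(positions[:, None] - positions[centers], axis=2), axis=1)
        centers.append(int(np.argmax(d * loudness)))

    # Assign each object to its nearest cluster center
    assign = np.argmin(np.linalg.norm(positions[:, None] - positions[centers], axis=2), axis=1)

    centroids, mixes = [], []
    for c in range(len(centers)):
        idx = np.where(assign == c)[0]
        w = loudness[idx] / (np.sum(loudness[idx]) + 1e-12)
        centroids.append(np.sum(w[:, None] * positions[idx], axis=0))  # loudness-weighted centroid
        mixes.append(np.sum(signals[idx], axis=0))                     # equivalent cluster signal
    return np.array(centroids), np.array(mixes)
```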

In the context of interactive simulations and gaming, adaptive clustering techniques have been shown to preserve very good rendering quality and minimal impact on localization-task performance, even with a small number of clusters (Tsingos et al., 2004). More recently, similar approaches have been successfully used to translate complex object-based theatrical audio mixes comprising dozens of concurrent objects into more compact versions for interchange and home delivery (Riedmiller et al., 2015).

Audio Object Coding

Complexity-reduction approaches, such as spatial coding, can dramatically reduce the bandwidth required for storing or transmitting object-based programs. However, the resulting program might still require several megabits per second (Mbps) for transmission which is incompatible with home delivery via streaming where the target bitrates are typically a few hundred kilobits per second (kbps) (e.g., 192 to 640 kbps for current 5.1/7.1 Dolby Digital Plus). Delivering object-based content at these bitrates requires additional coding tools.

Independent Coding of Objects

For applications requiring full interactivity and manipulation of the audio objects (such as freely adjusting their level), independent coding of the individual objects is the preferred solution. This is typically used in video games, where the objects’ audio essence is generally separately encoded as monophonic files that can be efficiently decoded by the hardware (e.g., using AAC, mp3 or WMA codecs). For closed ecosystems, any codec whether lossless or lossy could in theory be used. However, mono coding of objects introduces several challenges to deliver linear production content. First, it is not directly backwards compatible with common legacy playback formats such as stereo or 5.1 surround and requires decoding and rendering the full set of objects to produce legacy channel-based output. Second, for movie playback or applications that do not require full interactivity, maintaining perfect separation of the individual objects is not required and could be exploited for more efficient compression.

To enable backwards compatibility with a typical channel-based surround format, Figure 8.11(a) illustrates a possible workflow where the objects’ signals are individually coded either with a lossy or lossless approach and can be canceled out from a core backwards compatible rendering (e.g., 5.1 or 7.1 surround channels) at playback time to be individually re-rendered. Both core and individual objects are simultaneously transmitted. If a lossy algorithm is used, the objects must ideally first be encoded and decoded before rendering to the core set of channels so that coding artifacts can be entirely canceled out during playback. The cost of decoding the full set of objects can be high, as all objects must be decoded and rendered twice (once to be canceled out from the main core channels and another for the actual playback). However, the approach lends itself well to scalable decoding where objects can be extracted in priority order, the least important ones remaining in the channel core if the decoder becomes computationally constrained. It is also well suited to carrying a residual channel-based ambience or “bed” in addition to supplemental objects. One challenge, however, is to handle dynamic object priorities as time-varying extraction of objects from the core channels could lead to audio artifacts.

Parametric, Joint Coding of Audio Objects

Parametric joint audio object coding e.g., MPEG-SAOC (Herre & Disch, 2007), overcomes these drawbacks and can achieve further coding efficiency. Such techniques are designed to transmit a number N of audio objects in an audio signal that comprises K downmix channels, where K < N and K is typically two, five or seven channels following standard channel-based layouts. For instance, the downmix core signal can be obtained by rendering the original objects to a canonical stereo or 5.1 speaker configuration. This core signal can then be coded using traditional perceptual coding techniques (Figure 8.11(b)). Together with this backwards compatible downmix signal core, object reconstruction metadata (typically in time-frequency tiles) is transmitted through a dedicated bitstream to the decoder side.

Although the object reconstruction data grows linearly with the number of objects, it only adds a small overhead to the bitrate required to code the channel-based downmix itself.

A benefit of the approach is that the core downmix can be directly decoded and played back for legacy channel-based systems at no additional cost over legacy coding techniques. Some control over the core downmix is also possible, giving the opportunity to mixing engineers to fine tune the presentation for these important legacy use cases.

For applications where direct backwards compatibility is not a strong requirement, it should be noted that the coded core does not have to match typical channel-based playback configurations. In fact, the core could itself be object-based, thus creating a hierarchical object-based presentation (e.g., where 15 objects could be reconstructed from 4). Applying spatial coding techniques recursively is a solution to build such hierarchical object-based presentations. This approach typically leads to better reconstruction of the original objects. The smaller core set of objects is more efficient to decode and can still be flexibly rendered to different speaker layouts. It can typically be used for efficient virtualization on a mobile device while maintaining battery life.

Capturing Audio Objects

A Good Object Is a Clean Object

In traditional production workflows or interactive rendering systems, sound objects are generally attached to monophonic audio elements. As a single monophonic element cannot reproduce the full spatial characteristics of natural soundscapes, an object-based program is generally assembled from multiple mono, stereophonic and possibly 3D spatial captures. Preferably, the monaural audio essence should be individually recorded, ensuring the cleanest possible capture for maximum flexibility. This approach gives the user the freedom to control and position each object and freely navigate throughout the resulting auditory scene. A clean capture also avoids noise buildup as many objects get mixed to form the final auditory scene. In some cases, it is valuable to apply more aggressive noise-removal approaches on the individual objects and introduce a unique global room tone or background ambience to mask possible remaining artifacts in the final mix. In broadcast applications, crosstalk between non-coincident microphones used as audio essence for objects can lead to combing artifacts during re-rendering to smaller speaker layouts (e.g., stereo). A solution to this problem is to introduce decorrelation processing on the acquired signals or modify the microphone layout to capture more diverse signals. Traditional directional microphone techniques (e.g., using close miking or shotgun microphones) (Rayburn, 2012; Viers, 2012) can be used to achieve the desired separation. Steerable microphone arrays, either linear or spherical (Meyer & Elko, 2004), also offer a similar type of control with the added benefit of extracting specific components at post-production time that can be further turned into objects. Finally, the combination of directional microphones and orientation trackers, e.g., in mobile devices such as phones, can enable simultaneous recording of audio essence and orientation data that can directly serve as positional information for object rendering (Tsingos et al., 2016).

Cengarle et al. (2010) propose to automate the production of audio for sports broadcasting by dynamically blending the contributions of spot microphones around a soccer pitch based on a desired point of interest (Figure 8.12). The point of interest can thus become an audio object with an associated signal (the weighted sum of the microphones closest to the point of interest) and a position. This point-of-interest object can be manually tracked by the user or could be automatically tracked in a live video feed. The gains associated with each microphone can be sent as level automation to the mixing console to provide additional manual control options to the mixer.
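A simplified sketch of this idea, using inverse-distance weights rather than the exact blending of Cengarle et al. (2010):

```python
import numpy as np

def point_of_interest_object(mic_pos, mic_signals, poi, a=2.0, eps=1e-3):
    """Blend spot microphones into a single point-of-interest audio object.

    mic_pos     : (n_mics, 2 or 3) array of microphone positions around the pitch
    mic_signals : (n_mics, n_samples) array of time-aligned microphone signals
    poi         : desired point of interest (same coordinate system as mic_pos)
    Weights fall off with distance from `poi`, so the nearest microphones dominate.
    Returns the object's audio essence and its position metadata.
    """
    mic_pos = np.asarray(mic_pos, dtype=float)
    mic_signals = np.asarray(mic_signals, dtype=float)
    d = np.linalg.norm(mic_pos - np.asarray(poi, dtype=float), axis=1)
    w = 1.0 / (eps + d ** a)
    w /= np.sum(w)
    essence = np.sum(w[:, None] * mic_signals, axis=0)
    return essence, np.asarray(poi, dtype=float)
```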

From Spatial Capture to Objects

Spatial sound recording techniques which encode directional components of the sound field (Merimaa, 2002; Meyer & Elko, 2004; Soundfield, 2016) can also be directly used to acquire and play back real-world auditory environments as a whole. Converting such recordings to more structured object-based representations is an emerging problem.

A number of solutions have been proposed to extract parametric, object-like, representations from real-world coincident or non-coincident recordings. For instance, Directional Audio Coding (DirAC) (Pulkki, 2006; Vilkamo, Lokki & Pulkki, 2009) reconstructs direction-of-arrival information for different frequency sub-bands, typically from B-format recordings or using MEMS acoustic velocity probes or small arrays of omnidirectional microphones (Ahonen, 2013). This process of converting spatial recordings into more parametric, object-like formats also enables improved, more discrete, rendering. For instance, a B-format recording converted into an object-like format with DirAC outperforms a traditional 1st-order decoding (Vilkamo, Lokki & Pulkki, 2009). Another advantage of such an approach is that it lends itself to more efficient coding and distribution: a downmix or intermediate rendering (e.g., mono or stereo) of the original recordings can be transmitted more efficiently and still be parametrically decoded to a richer spatial representation using the estimated spatial metadata. Recently, it has been shown that such approaches can generate compelling 3D soundscapes including elevation information from widely available XY-stereo microphones (Tsingos et al., 2016).
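The sketch below illustrates the analysis step of such an approach: an active-intensity-like vector is estimated per time-frequency tile from a B-format recording and converted into direction-of-arrival metadata. Diffuseness estimation and synthesis, which a complete DirAC implementation also requires, are omitted.

```python
import numpy as np
from scipy.signal import stft

def estimate_directions_bformat(w, x, y, z, fs, nperseg=1024):
    """Estimate per-tile directions of arrival from a B-format recording.

    Computes an active-intensity-like vector per time-frequency tile from the
    omnidirectional (W) and figure-of-eight (X, Y, Z) components and converts
    it to azimuth/elevation metadata; the direction of arrival points opposite
    to the intensity vector (back toward the source).
    """
    _, _, W = stft(w, fs, nperseg=nperseg)
    _, _, X = stft(x, fs, nperseg=nperseg)
    _, _, Y = stft(y, fs, nperseg=nperseg)
    _, _, Z = stft(z, fs, nperseg=nperseg)

    # Active intensity is proportional to Re{ conj(pressure) * particle velocity }
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    iz = np.real(np.conj(W) * Z)

    azimuth = np.arctan2(-iy, -ix)                                 # radians, per tile
    elevation = np.arctan2(-iz, np.sqrt(ix ** 2 + iy ** 2))
    return azimuth, elevation
```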

Similar approaches extend this reconstruction to 3D positions using non-coincident recordings (Gallo et al., 2007; Gallo & Tsingos, 2007). Time-differences of arrival between the non-coincident microphones can be used to reconstruct the 3D position of the different time-frequency tiles within the convex hull of the microphone array. For a listener moving within the microphone hull, a spatial rendering can be reproduced from the estimated positions. A combination of the signals of the microphones closest to the listening point provides a good representative signal for each time-frequency tile/object. The different sub-band signals and their estimated directions or positions can be treated as audio objects and imported into object-audio production workflows to be mixed with other elements or otherwise manipulated. For instance, Gallo et al. (2007) demonstrate the ability to move some of the original elements or dynamically occlude them by introducing virtual obstacles in a virtual reality or gaming context.
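As one building block of such position estimation, the sketch below estimates the time-difference of arrival between two non-coincident microphone signals using the classic GCC-PHAT method. This is only one ingredient: the full approach of Gallo et al. (2007) operates per time-frequency tile and triangulates positions inside the convex hull of the array, which is not shown here.

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=None):
    """Estimate the time-difference of arrival (in seconds) between two
    microphone signals using GCC-PHAT cross-correlation."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.maximum(np.abs(cross), 1e-12)     # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / fs
```

Applied band-by-band, such delay estimates can be intersected across microphone pairs to place each tile in space before exporting it as an object.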

Tracking the acoustic intensity, and spatial metadata in general, can also enable virtual beamforming or blind source separation (BSS) (Yu, Hu & Xu, 2014). Such approaches, however, cannot fully achieve fine-grain separation of semantic objects (e.g., perfectly separating a dog bark from a car engine) and as a result cannot provide full control over the captured auditory scene without introducing audible artifacts. Other categories of BSS algorithms that attempt to segregate sources more finely do not rely directly on spatial information but rather on the assumed statistical independence of the signals themselves, e.g., by performing independent component analysis (Yu et al., 2014). They generally require the number of sources to be known in advance or obtained through other algorithms as a pre-processing step.
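A toy illustration of the independent-component-analysis flavor of BSS is sketched below, using scikit-learn's FastICA on two synthetic sources. It assumes, as noted above, that the number of sources is known and that there are at least as many mixture channels as sources; it is not a production separation system.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two statistically independent toy sources, mixed into two observed channels.
fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
sources = np.c_[np.sin(2 * np.pi * 220 * t),            # tonal source
                np.sign(np.sin(2 * np.pi * 3 * t))]      # square-wave source
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])                          # unknown in practice
observed = sources @ mixing.T                            # (num_samples, 2) mixtures

# Recover the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
estimated_sources = ica.fit_transform(observed)
```

The separated signals, once labeled, could in principle be re-injected as objects, but as the text notes, real-world mixtures rarely separate this cleanly.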

Tradeoffs of Object-Based Representations

The previous sections have explored the core components of object-based content creation, encoding and rendering. In this section, we review different tradeoffs that can guide the decision to create and deliver object-based content as opposed to alternative channel-based programs (e.g., loudspeaker feeds or B-format).

Rendering Flexibility and Scalability

Object-based delivery is an obvious choice for high-quality content distribution to high-end playback environments such as movie theaters, which use large speaker configurations (typically around 40, and up to 64, loudspeakers in some Dolby Atmos theatrical installations). Enabling spatial audio reproduction over a large number of discrete speakers has a number of benefits over traditional cinema playback, where speakers are grouped into arrays: better timbral characteristics, sharper point sources, and better audio/visual and spatial coherence over a large listening area. Objects also offer exhibitors the ability to differentiate or adapt the quality offering to different markets. Delivering a multitude of different channel-based renderings to match the different room configurations would create a distribution nightmare and is not desirable. Object-based audio production lets content creators streamline the creation of high-end spatial audio presentations, including height information, while re-rendering, in a single pass, legacy stereo, 5.1 and 7.1 deliverables or stems for subsequent tuning or internationalization.

For home applications, where the number of loudspeakers is in general much smaller or playback occurs on headphones, one may wonder whether the same benefits of objects hold in terms of spatial representation. Figure 8.13 shows the result of a preference test conducted on headphones, with listeners comparing channel-based virtualization (i.e., using a few discrete Head-Related Transfer Functions [HRTFs], one per channel) and native object-based virtualization (using a high-resolution, continuous HRTF model). As can be seen, an object-based presentation can be preferred to a channel-based virtualization, provided a high-resolution or continuous HRTF model is used for rendering.

Coding Efficiency and Transmission

In the theatrical context, object-based content requires, on average, far fewer audio tracks to reproduce the soundtrack at any single time instant than delivering a large number of pre-rendered channels. This is because object-based content is, in general, temporally sparse. It can therefore be efficiently and losslessly coded, leading to relatively small file sizes compared to channel-based linear PCM (LPCM). Peak bitrates can still be high, with up to 128 simultaneous objects in some theatrical formats (Robinson et al., 2012), but this is rare in practice and does not strongly impact average file sizes. On average, a losslessly coded Dolby Atmos digital cinema package varies in size from 3 gigabytes (GB) to 10 GB, i.e., 0.6 to 1.5 times the size of the uncompressed 24-bit LPCM 7.1 deliverable.

In the context of low-bitrate lossy delivery to the home, the quality of object-based playback tends to degrade faster than that of channel-based delivery as the bitrate is reduced, due to the extra cost of sending object metadata and the slightly reduced coding efficiency. This tradeoff has to be balanced against the improved quality brought by object-based delivery on other endpoints, as illustrated in Figure 8.13.

Finally, for applications requiring some amount of interactivity or conditional rendering, such as the broadcast applications discussed above, object-based coding offers a more bandwidth-efficient way to deliver separate elements that would otherwise require multiple, fully separated multichannel stems. It also allows more efficient and higher-quality playback-time enhancement of dialogue tracks by representing spatialized dialogue as one or a few objects. For emerging applications such as cinematic virtual reality, artistic control of the headphone rendering (e.g., stereo vs. binaural, or head-tracked vs. non-head-tracked sounds) is also made more bit-efficient through the use of objects, as opposed to, e.g., delivering multiple B-format stems.

Object-Based Loudness Estimation and Control

As object-based audio representations move from production or interactive gaming into dual-ended transmission workflows such as broadcast delivery, ensuring consistent loudness estimation and control becomes paramount, as it is tied to legal regulations in many regions.

International recommendations, such as ITU-R BS.1770 (2015), provide methods to estimate loudness for channel-based content and are widely used throughout the audio industry.

Two issues become of interest in the context of object-based audio production: 1) extending current standards to meter and correct loudness of object-based content, and 2) ensuring that playback of such content has consistent loudness regardless of the rendering configuration.

ITU-R BS.1770 has recently been revised to allow measurement of any arbitrary channel-based configuration (see Figure 8.14). This extension to an arbitrary number of channels could form the basis for object loudness measurement, where the loudness is derived by summing the frequency-weighted energies of all the objects' monaural audio essence. As in current channel-based workflows, this extension would be used to measure and correct object-based programs prior to transmission and would ensure that multiple programs (whether channel-based or object-based) are mixed and delivered at consistent levels. On the playback side, it is also desirable that levels do not vary drastically from one rendering configuration to another. As seen above, a widely accepted requirement of object rendering is energy preservation, which is not the same as loudness preservation.
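A minimal sketch of such an object-based measurement is shown below: it applies a K-weighting pre-filter to each object's monaural essence, sums the resulting mean-square energies and converts the total to LKFS. The gating stages of BS.1770 are omitted, the filter coefficients are the commonly published 48 kHz values from the recommendation (verify them against the current text before use), and the function names are hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

# Two-stage K-weighting pre-filter (48 kHz coefficients as published in ITU-R BS.1770).
_STAGE1_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
_STAGE1_A = [1.0, -1.69065929318241, 0.73248077421585]
_STAGE2_B = [1.0, -2.0, 1.0]
_STAGE2_A = [1.0, -1.99004745483398, 0.99007225036621]

def k_weight(signal):
    """Apply the two-stage K-weighting filter (assumes 48 kHz audio)."""
    return lfilter(_STAGE2_B, _STAGE2_A, lfilter(_STAGE1_B, _STAGE1_A, signal))

def object_program_loudness(object_signals):
    """Ungated loudness estimate of an object-based program in LKFS.

    object_signals : iterable of 1-D arrays, one per object (48 kHz).
    Sums the K-weighted mean-square energies of all objects' monaural
    audio essence; gating and per-channel positional weighting from
    BS.1770 are omitted in this sketch.
    """
    z = [np.mean(k_weight(np.asarray(s, dtype=float)) ** 2) for s in object_signals]
    return -0.691 + 10.0 * np.log10(np.maximum(np.sum(z), 1e-12))
```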

Figure 8.15(a) shows a comparison of loudness measurements for several object-based audio clips rendered to different output formats, from stereo to an immersive 7.1.4 channel layout. The measurements were made using the recently revised version of ITU-R BS.1770, as shown in Figure 8.14. As can be observed, an energy-preserving renderer provides a good baseline for loudness preservation across rendering configurations (within 2.5 dB of the object-based measurement). As expected, the measured loudness generally increases as the number of playback channels decreases and is maximal for stereo. This is caused by more objects summing electrically into fewer output channels, departing from the situation in which the outputs of many different loudspeakers sum acoustically, which is closer to an energy-preserving model.
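As a simple worked illustration of this effect (an added example, not taken from the measurements in Figure 8.15), consider two objects carrying the same unit-power signal x(t) under a BS.1770-style energy summation of the output channels:

```latex
% Two fully correlated objects collapsed into one output channel sum electrically:
\mathbb{E}\!\left[(x+x)^2\right] = 4\,\mathbb{E}\!\left[x^2\right] \quad (+6~\mathrm{dB})
% The same two objects sent to two separate output channels add in power:
\mathbb{E}\!\left[x^2\right] + \mathbb{E}\!\left[x^2\right] = 2\,\mathbb{E}\!\left[x^2\right] \quad (+3~\mathrm{dB})
```

The collapsed presentation therefore measures about 3 dB louder for this fully correlated pair; in practice objects are only partially correlated, which is why the observed discrepancies remain within a few decibels.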

A solution to achieve better timbral and loudness preservation would be to perform frequency-dependent rendering, where the renderer transitions from energy-preserving panning to amplitude-preserving panning as the frequency decreases (Laitinen et al., 2014). However, this would require a significant increase in rendering cost. Alternatively, the renderer can incorporate level trims, possibly controlled via metadata, that help achieve both better aesthetic results and loudness preservation, in a way similar to the surround downmix coefficients of current channel-based codecs. Figure 8.15(b) illustrates such a mechanism, where object-based level trims (a per-object gain that depends on each object's (x, y, z) coordinates as well as the number of playback speakers) are used to alter the core energy-preserving rendering. The trims are applied to the object's audio essence prior to rendering. Such an approach can reduce the loudness discrepancy without increasing the rendering complexity.
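The sketch below illustrates one possible form of such a metadata-controlled trim: a per-object gain computed from the object's (x, y, z) coordinates and the number of playback speakers, applied to the audio essence before the energy-preserving renderer. The particular curve, the coordinate convention (y increasing towards the rear, z for height, all in [0, 1]) and the limits are hypothetical placeholders, not the trims used in any actual renderer.

```python
import numpy as np

def object_trim_gain(x, y, z, num_speakers, max_trim_db=3.0):
    """Illustrative per-object level trim (hypothetical curve).

    Objects placed overhead (large z) or towards the rear (large y) collapse
    onto fewer, more correlated outputs on small layouts, so they are trimmed
    progressively as the number of playback speakers decreases.
    """
    # 0 for large layouts (>= 12 speakers in this toy model), 1 for stereo.
    layout_factor = np.clip((12 - num_speakers) / 10.0, 0.0, 1.0)
    # Objects higher up or further back receive more trim in this toy model.
    position_factor = np.clip(0.5 * z + 0.5 * max(y - 0.5, 0.0), 0.0, 1.0)
    trim_db = -max_trim_db * layout_factor * position_factor
    return 10.0 ** (trim_db / 20.0)                 # linear gain <= 1

def apply_trim(audio, gain):
    """Apply the trim to the object's audio essence prior to rendering."""
    return gain * np.asarray(audio, dtype=float)
```

In a real system the trim tables would be tuned subjectively during mastering and carried as metadata alongside the object positions.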

Object-Based Program Interchange and Delivery

The spatial coding process described in the previous section enables the interchange of countless combinations of spatial object-groups, simplifying the enablement of immersive and personalized experiences. Figure 8.16 illustrates example spatial object-groups, along with channel-based group program elements, envisioned for immersive and/or personalized program interchange (including consumer delivery), represented in blocks A–G. All of the blocks A–G can be generated from a single object-based audio master. In each of the blocks (A–C) in Figure 8.16, N indicates the number of spatial object-groups, each carrying a monophonic audio signal and metadata. This number is flexible, to address a wide range of workflow capabilities and compatibility needs. This approach allows users to trade off spatial resolution to meet their business and/or operational needs.

The audio essence and associated metadata for each of the program building blocks can be carried, e.g., in the newly defined ITU recommendations for the broadcast WAVE file format and audio definition model (BWAV/ADM) (ITU-R BS.2076-0, 2015; ITU-R BS.2088-0, 2015). The audio essence can be encoded either as linear PCM or in a mezzanine compressed format such as Dolby E with extensible metadata delivery format (EMDF) extensions (ETSI TS 102 366 Annex H, 2014). For live applications, the EMDF metadata can also be encoded over AES digital audio interfaces using the SMPTE ST 337 standard (2015). In addition to formats suitable for traditional audio production, efforts have also been ongoing to standardize richer file formats, such as Interactive XMF (2008), that could be used for interchange and playback in the context of video games and interactive applications in the future.

In addition to the standardization of new professional formats for interchanging audio objects, new commercial formats have recently been introduced for the home delivery of object-based content. These formats leverage current-generation codecs (e.g., Dolby Digital Plus, Dolby TrueHD or DTS Master Audio) using a backwards-compatible core and extension layers. Object-based audio is also an essential component of next-generation codecs such as Dolby AC-4 and MPEG-H, which are now part of the next-generation ATSC 3.0, ETSI (2015) and DVB broadcasting standards. These new codecs dramatically broaden the reach of object-based representations and their applications in consumer devices.

Conclusion

Historically limited to interactive gaming or production, object-based audio is now extending throughout the audio industry and all the way to consumers, thanks to new tools, codecs and standards. Object-based audio mixing and delivery concepts strongly resonate with the content creation community. Objects are fast becoming the preferred approach to mixing and delivering soundtracks in cinemas (9 out of 10 movies nominated for the 2016 sound mixing and editing Oscars were mixed in an object-based format) and are also making fast progress in the broadcast industry. New game engines and consoles include object-based audio output to address increasingly flexible playback configurations in the home, which may include 3D speaker layouts or virtualized ceiling speakers. Audio objects are also making their first foray into music production, delivery and live DJ-ing, and will be critical to emerging applications such as cinematic virtual reality or augmented reality as they become more accessible to consumers.

More importantly, the growing adoption of object-based production, interchange and delivery throughout the audio industry brings unprecedented future-proofing and evolution capabilities. As rendering or capture techniques improve, the new and upcoming standardized audio and metadata infrastructures will enable these improvements to be readily heard by wide audiences at much reduced cost for the content creators.

Acknowledgments

The author thanks Jeff Riedmiller, Scott Norcross, Charles Robinson, Sripal Mehta, Dan Darcy and Poppy Crum for their input and contributions to this work, as well as the greater sound technology team at Dolby Laboratories.

References

Ahonen, J. (2013). Microphone Front-ends for Spatial Sound Analysis and Synthesis with Directional Audio Coding, Doctoral Thesis, Department of Signal Processing and Acoustics, Aalto University.

Barber, C. B., Dobkin, D. P., & Huhdanpaa, H. (1996). The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4), 469–483.

Bargen, B., & Donnelly, P. (1998). Inside DirectX. Microsoft Press.

Begault, D. R. (1994). 3D Sound for Virtual Reality and Multimedia. Academic Press Professional.

Begault, D. R., & Rumsey, F. (2004). An Anthology of Articles on Spatial Sound Techniques: Part 2—Multichannel Audio Technologies. New York: Audio Engineering Society.

Best, V., van Schaik, A., Jin, C., & Carlile, S. (2005). Auditory spatial perception with sources overlapping in frequency and time. Acta Acustica united with Acustica, 91(8), 421–428.

Brungart, D. S., Simpson, B. D., & Kordik, A. J. (2005). Localization in the presence of multiple simultaneous sounds. Acta Acustica united with Acustica, 91(9), 471–479.

Cengarle, G., Mateos, T., Olaiz, N., & Arumí, P. (2010). A new technology for the assisted mixing of sport events: Application to live football broadcasting. Proceedings of the 128th Audio Engineering Society Convention, Barcelona, Spain.

Daniel, J., & Moreau, S. (2004). Further study of sound field coding with higher order Ambisonics. Proceedings of the 116th Audio Engineering Society Convention, Berlin, Germany.

de Vries, D. (2009). Wave Field Synthesis. AES monograph.

Dickins, G., Flax, M., McKeag, A., & McGrath, D. (1999). Optimal 3D-speaker panning. Proceedings of the 16th AES International Conference, Spatial Sound Reproduction. Rovaniemi, Finland.

EAX. (2004). Environmental Audio Extensions 4.0, Creative. Retrieved from www.soundblaster.com/eaudio.

ETSI. (2014). Digital Audio Compression (ac-3, enhanced ac-3) Standard, ETSI TS 102 366 Annex H.

ETSI. (2015). Digital Audio Compression (ac-4) Standard part 2: Immersive and Personalized Audio, ETSI TS 103 190-2.

Faller, C., Favrot, A., Langen, C., Tournery, C., & Wittek, H. (2010). Digitally enhanced shotgun microphone with increased directivity. Proceedings of the 129th Audio Engineering Society Convention, San Francisco, USA.

FMOD Music and Sound Effects System. Retrieved from www.fmod.org.

Funkhouser, T., Jot, J. M., & Tsingos, N. (2002). Sounds Good to Me! Computational Sound for Graphics, VR, and Interactive Systems. SIGGRAPH 2002 course #45.

Furness, R. K. (1990). Ambisonics—An overview. Proceedings of the 8th Audio Engineering Society Conference. Washington, DC.

Gallo, E., Lemaitre, G., & Tsingos, N. (2005). Prioritizing signals for selective real-time audio processing. Proceedings of International Conference on Auditory Display (ICAD). Limerick, Ireland.

Gallo, E., & Tsingos, N. (2007). Extracting and re-rendering structured auditory scenes from field recordings. Proceedings of the 30th Audio Engineering Society International Conference on Intelligent Audio Environments, Saariselka, Finland.

Gallo, E., Tsingos, N., & Lemaitre, G. (2007). 3D-Audio matting, post-editing and re-rendering from field recordings. EURASIP Journal on Applied Signal Processing, Special Issue on Spatial Sound and Virtual Acoustics.

Garity, W. E., & Hawkins, J. N. A. (1941). Fantasound. Journal of the Society of Motion Picture Engineers, 37.

Herder, J. (1999). Optimization of sound spatialization resource management through clustering. The Journal of Three Dimensional Images, 3D-Forum Society, 13(3), 59–65.

Herre, J., & Disch, S. (2007). New concepts in parametric coding of spatial audio: From SAC to SAOC. Proceedings of the IEEE International Conference on Multimedia and Expo, Beijing, China.

Herre, J., Hilpert, J., Kuntz, A., & Plogsties, J. (2015). MPEG-H 3D Audio: The new standard for coding of immersive spatial audio. IEEE Journal of Selected Topics in Signal Processing, 9(5).

High Dynamic Range Audio in the Frostbite Game Engine. (2009). Retrieved from www.frostbite.com/2009/04/how-hdr-audio-makes-battlefield-bad-company-goboom/

Interactive Audio Special Interest Group (IASIG), 3D Audio Working Group. (2016). Retrieved from www.iasig.org/.

Interactive XMF File Format. (2008). Retrieved from www.iasig.org/wg/ixwg.

SMPTE. (2015). Format for non-PCM audio and data in an AES3 serial digital audio interface. SMPTE ST 337:2015, 1–17.

ITU-R. Method for the subjective assessment of intermediate quality level of coding systems. Recommendation ITU-R BS.1534-1, 2001–2003.

ITU-R. Advanced sound system for programme production. Recommendation ITU-R BS.2051-0, 2014.

ITU-R. Algorithms to measure audio programme loudness and true-peak audio level. Recommendation ITU-R BS.1770-4, 2015.

ITU-R. Audio definition model. Recommendation ITU-R BS.2076-0, 2015.

ITU-R. Long-form file format for the international exchange of audio programme materials with metadata. Recommendation ITU-R BS.2088-0, 2015.

Jot, J. M., Ray, L., & Dahl, L. (1998). Extension of Audio BIFS: Interfaces and Models Integrating Geometrical and Perceptual Paradigms for the Environmental Spatialization of Audio. ISO Standard ISO/IEC JTC1/SC29/WG11 M.

Kayser, C., Petkov, C., Lippert, M., & Logothetis, N. K. (2005). Mechanisms for allocating auditory attention: An auditory saliency map. Current Biology, 15, 1943–1947.

Kelly, M. C., & Tew, A. I. (2002). The continuity illusion in virtual auditory space. Proceedings of the 112th Audio Engineering Society Convention. Munich, Germany.

Klatzky, R. L. (1998). Allocentric and Egocentric Spatial Representations: Definitions, Distinctions, and Interconnections: Spatial Cognition. Berlin Heidelberg: Springer-Verlag.

Kostadinov, D., Reiss, J., & Mladenov, V. (2010). Evaluation of distance based amplitude panning for spatial audio. Proceedings of ICASSP2010, 285–288.

Laitinen, M.-V., Vilkamo, J., Jussila, K., Politis, A., & Pulkki, V. (2014). Gain normalization in amplitude panning as a function of frequency and room reverberance. Proceedings of the 55th International Audio Engineering Society Conference on Spatial Audio.

Lossius, T., Baltazar, P., & de la Hogue, T. (2009). DBAP—distance-based amplitude panning. Proceedings of the International Conference on Computer Music (ICMC). Montreal, Canada.

Malham, D. G., & Myatt, A. (1995). 3D sound spatialization using ambisonic techniques. Computer Music Journal, 19(4), 58–70.

Mann, M., Churnside, A., Bonney, A., & Melchior, F. (2013). Object-based audio applied to football broadcasts. Proceedings of the 2013 ACM International Workshop on Immersive Media Experiences, ImmersiveMe ’13, 13–16. New York, NY.

Merimaa, J. (2002). Applications of a 3D microphone array. Proceedings of the 112th Audio Engineering Society Convention, Munich, Germany.

Meyer, J., & Elko, G. (2004). Spherical microphone arrays for 3D sound recording. In Yiteng (Arden) Huang & Jacob Benesty (Eds.), Audio Signal Processing for Next-Generation Multimedia Communication Systems (Chap. 2). Boston: Kluwer Academic Publisher.

Moore, B. C. J. (1997). An Introduction to the Psychology of Hearing (4th ed.). San Diego, CA: Academic Press.

Moore, B. C. J., Glasberg, B., & Baer, T. (1997). A model for the prediction of thresholds, loudness and partial loudness. Journal of the Audio Engineering Society, 45(4), 224–240. Retrieved from http://hearing.psychol.cam.ac.uk/Demos/demos.html

OpenAL: An Open Source 3D Sound Library. (2000). Retrieved from www.openal.org.

Painter, E. M., & Spanias, A. S. (2000). Perceptual coding of digital audio. Proceedings of the IEEE, 88(4), 451–515.

Potard, G., & Burnett, I. (2004). Decorrelation techniques for the rendering of apparent source width in 3D audio displays. Proceedings of the 7th International Conference on Digital Audio Effects (DAFX’04). Naples, Italy.

Pulkki, V. (1997). Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 45(6), 456–466.

Pulkki, V. (1999). Uniform spreading of amplitude panned virtual sources. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY.

Pulkki, V. (2006). Directional audio coding in spatial sound reproduction and stereo upmixing. Proceedings of the 28th AES International Conference. Pitea, Sweden.

Pulkki, V., Karjalainen, M., & Valimaki, V. (1999). Localization, coloration, and enhancement of amplitude-panned virtual sources. Proceedings of the 16th AES International Conference on Spatial Sound Reproduction. Rovaniemi, Finland.

Rayburn, R. A. (2012). Eargle’s Microphone Book. Amsterdam: Focal Press.

Riedmiller, J., Mehta, S., Tsingos, N., & Boon, P. (2015). Immersive and personalized audio: A practical system for enabling interchange, distribution, and delivery of next-generation audio experiences. Motion Imaging Journal, SMPTE, 124(5), 1–23.

Roads, C. (1996). The Computer Music Tutorial. Cambridge: MIT Press.

Robinson, C., & Tsingos, N. (2014). Cinematic sound scene description and rendering control. SMPTE 2014 Annual Technical Conference & Exhibition, 1–14.

Robinson, C., Tsingos, N., & Mehta, S. (2012). Scalable format and tools to extend the possibilities of cinema audio. SMPTE Motion Imaging Journal, 121(8).

Rumsey, F. (2001). Spatial Audio. US: Taylor & Francis.

Savioja, L., Huopaniemi, J., Lokki, T., & Vaananen, R. (1999). Creating interactive virtual acoustic environments. Journal of the Audio Engineering Society, 47(9), 675–705.

Sibbald, A. (2001). MacroFX Algorithm: White Paper. Retrieved from www.sensaura.co.uk/whitepapers

Soundfield (2016). Soundfield microphones. Retrieved from www.soundfield.com.

Tsingos, N. (2001). Artifact-free Asynchronous Geometry-based Audio Rendering. ICASSP’2001, Salt Lake City, USA.

Tsingos, N., Gallo, E., & Drettakis, G. (2004). Perceptual audio rendering of complex virtual environments. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004).

Tsingos, N., & Gascuel, J.-D. (1997). Soundtracks for computer animation: Sound rendering in dynamic environments with occlusions. Proceedings of Graphics Interface’97, pp. 9–16.

Tsingos, N., Govindaraju, P., Zhou, C., & Nadkarni, A. (2016). XY-stereo capture and upconversion for virtual reality. AES International Conference on Augmented and Virtual Reality. Los Angeles.

Tsingos, N., Robinson, C., Darcy, D., & Crum, P. (2014). Evaluation of panning algorithms for theatrical applications. Proceedings of the 2nd International Conference on Spatial Audio (ICSA). Erlangen, Germany.

Viers, R. (2012). The Location Sound Bible. Studio City, CA: Michael Wiese Productions.

Vilkamo, J., Lokki, T., & Pulkki, V. (2009). Directional audio coding: Virtual microphone based synthesis and subjective evaluation. Journal of the Audio Engineering Society, 57(9), 709–724.

Wand, M., & Straßer, W. (2004). Multi-resolution sound rendering. Symposium on Point-Based Graphics, Zurich, Switzerland.

Warren, J., Schaefer, S., Hirani, A., & Desbrun, M. (2007). Barycentric coordinates for convex sets. Advances in Computational Mathematics, 27(3), 319–338.

WWISE by Audiokinetic. Retrieved from www.audiokinetic.com/products/wwise.

Yu, X., Hu, D., & Xu, J. (eds.) (2014). Blind Source Separation: Theory and Applications. Singapore: Wiley.
