CHAPTER 10

Digital Audio Formats and Interchange

CHAPTER CONTENTS

Audio File Formats for Digital Workstations

File formats in general

AIFF and AIFF-C formats

RIFF WAVE format

DSD-IFF file format

Apple core audio format

MPEG audio file formats

Edit decision list (EDL) files and project interchange

AES-31 format

MXF — the material exchange format

AAF — the advanced authoring format

Disk pre-mastering formats

Interconnecting Digital Audio Devices

Introduction

Digital interface basics

Dedicated audio interface formats

The AES/EBU interface (AES-3)

Standard consumer interface (IEC 60958-3)

MADI

Proprietary digital interfaces

Data networks and computer interconnects

Audio network requirements

Protocols for the Internet

Audio-specific network standards

Storage area networks

Wireless networks

Audio over Firewire (IEEE 1394)

Audio over Universal Serial Bus (USB)

AES-47: audio over ATM

 

This chapter provides further details about the main formats in which digital audio data is stored and moved between systems. This includes coverage of audio file formats, digital interfaces and networked audio interchange, concentrating mainly on those issues of importance for professional applications.

AUDIO FILE FORMATS FOR DIGITAL WORKSTATIONS

There used to be almost as many file formats for audio as there are days in the year. In the computer games field, for example, this is still true to some extent. For a long time the specific file storage strategy used for disk-based digital audio was the key to success in digital workstation design, because disk drives were relatively slow and needed clever strategies to ensure that they were capable of handling a sufficiently large number of audio channels. Manufacturers also worked in isolation and the size of the market was relatively small, leading to virtually every workstation or piece of software using a different file format for audio and edit list information. Although there may still be theoretical performance advantages in the use of filing structures specially designed for real-time applications such as audio and video editing, interchange of material between systems and applications is now at least as important as ultimate transfer speed. Also the majority of hard disk drives available today are capable of replaying many channels of audio in real time without needing to use a dedicated storage strategy. This has led to the increased use of a few common cross-platform file formats such as WAVE and AIFF, both of which can be used in the IEEE 32-bit floating point data format that is becoming popular on audio workstations, as well as in conventional 16–24 bit PCM mode.

The growth in the importance of metadata (data about data), and the representation of audio, video and metadata as ‘objects’, has led to the development of interchange methods that are based on object-oriented concepts and project ‘packages’ as opposed to using simple text files and separate media files. There is increasing integration between audio and other media in multimedia authoring and some of the file formats mentioned below are closely related to international efforts in multimedia file exchange.

It is not proposed to attempt to describe all of the file formats in existence, because that would be a relatively pointless exercise and would not make for interesting reading. It is nonetheless useful to have a look at some examples taken from the most commonly encountered file formats, particularly those used for high-quality audio by desktop and multimedia systems, since these are amongst the most widely used in the world and are often handled by audio workstations even if not their native format. It is not proposed to investigate the large number of specialized file formats developed principally for computer music on various platforms, nor the files used for internal sounds and games on many computers.

File formats in general

A data file is simply a series of data bytes formed into blocks and stored either contiguously or in fragmented form. Files themselves are largely independent of the operating system and filing structure of the host computer, because a file can be transferred to another platform and still exist as an identical series of data blocks. It is the filing system that is often the platform- or operating-system-dependent entity (e.g. FAT32 or HFS).

There are sometimes features of data files that relate directly to the operating system and filing system that created them, but they do not normally prevent such files being translated by other platforms. For example, there are two approaches to byte ordering: the so-called little-endian order in which the least significant byte comes first, at the lowest memory address, and the big-endian format in which the most significant byte comes first, at the lowest memory address. These originally related to the byte ordering used in data processing by the two most common microprocessor families and thereby to the two most common operating systems used in desktop audio workstations. Motorola processors, as originally used in the Apple Mac, used big-endian byte ordering, and Intel processors, as used in MS-DOS machines (and now in Macs too), use little-endian byte ordering.
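
To make the endianness distinction concrete, the short Python sketch below packs the same 16 bit sample value in both byte orders (the sample value and variable names are arbitrary illustrations, not part of any file format):

    import struct

    sample = 0x1234                        # an arbitrary 16 bit sample value
    little = struct.pack('<h', sample)     # little-endian: least significant byte first
    big = struct.pack('>h', sample)        # big-endian: most significant byte first

    print(little.hex())                    # '3412' - LSB (0x34) stored at the lowest address
    print(big.hex())                       # '1234' - MSB (0x12) stored at the lowest address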

A second example is that some older Mac files may have two parts — a resource fork and a data fork — whereas Windows files only have one part. High-level ‘resources’ were stored in the resource fork (used in some legacy audio files for storing information about the file, such as signal processing to be applied, display information and so forth) whilst the raw data content of the file was stored in the data fork (used in audio applications for audio sample data). The resource fork is not always there, but may be. The resource fork can get lost when transferring such files between machines or to servers, unless Mac-specific protocols are used (e.g. MacBinary or BinHex). Mac OS X files are not supposed to use resource forks, in order that they can be easily transferred between systems.

Some data files include a ‘header’, that is a number of bytes at the start of the file containing information about the data that follows. In audio systems this may include the sampling rate and resolution of the file. Audio replay would normally be started immediately after the header. On the other hand, some files are simply raw data, usually in cases where the format is fixed. ASCII text files are a well-known example of raw data files — they simply begin with the first character of the text. More recently file structures have been developed that are really ‘containers’ for lots of smaller files, or data objects, each with its own descriptors and data. The RIFF structure, described below, is an early example of the concept of a ‘chunk-based’ file structure. Apple’s Bento container structure, used in OMFI, and the container structure of AAF are more advanced examples of such an approach.

The audio data in most common high-quality audio formats are stored in two’s complement form (see Chapter 8) and the majority of files are used for 16 or 24 bit data, thus employing either 2 or 3 bytes per audio sample. 8 bit files use 1 byte per sample. 32 bit floating point files use 4 bytes per sample, which in the IEEE format hold a sign bit, an 8 bit exponent and a 23 bit mantissa. This makes the storage required to use this number format one-third greater than for 24 bit fixed point operation.

AIFF and AIFF-C formats

The AIFF format is widely used as an audio interchange standard, because it conforms to the EA IFF 85 standard for interchange format files used for various other types of information such as graphical images. AIFF is an Apple standard format for audio data and is encountered widely on Mac-based audio workstations and some Silicon Graphics systems. Audio information can be stored at a number of resolutions and for any number of channels if required, and the related AIFF-C (file type ‘AIFC’) format allows also for compressed audio data. It consists only of a data fork, with no resource fork, making it easy to transport to other platforms.

IFF-type files are made up of ‘chunks’ of data as shown in Figure 10.1. A chunk consists of a header and a number of data bytes to follow. The simplest AIFF files contain a ‘common chunk’, which is equivalent to the header data in other audio files, and a ‘sound data’ chunk containing the audio sample data. These are contained overall by a ‘form’ chunk as shown in Figure 10.2. AIFC files must also contain a ‘version chunk’ before the common chunk to allow for future changes to AIFC.
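
The chunk structure can be traversed with only a few lines of code. The following Python sketch walks the top-level chunks of an AIFF or AIFC file and prints their IDs and sizes; it is a minimal sketch that assumes a well-formed file, and the function and variable names are illustrative only:

    import struct

    def list_aiff_chunks(path):
        with open(path, 'rb') as f:
            form_id, form_size, form_type = struct.unpack('>4sI4s', f.read(12))
            assert form_id == b'FORM'            # outer container ('form') chunk
            print('Form type:', form_type)       # b'AIFF' or b'AIFC'
            remaining = form_size - 4            # the form type has already been read
            while remaining > 0:
                ck_id, ck_size = struct.unpack('>4sI', f.read(8))
                print(ck_id, ck_size)
                padded = ck_size + (ck_size & 1) # chunk data is padded to an even length
                f.seek(padded, 1)
                remaining -= 8 + padded

All the numeric fields are big-endian, in contrast to the RIFF WAVE format described next.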

RIFF WAVE format

The RIFF WAVE (often called WAV) format is the Microsoft equivalent of Apple’s AIFF. It has a similar structure, again conforming to the IFF pattern, but with numbers stored in little-endian rather than big-endian form. It is used widely for sound file storage and interchange on PC workstations, and for multimedia applications involving sound. Within WAVE files it is possible to include information about a number of cue points, and a playlist to indicate the order in which the cues are to be replayed. WAVE files use the file extension ‘wav’.

image

FIGURE 10.1 General format of an IFF file chunk

image

FIGURE 10.2 General format of an AIFF file

image

FIGURE 10.3 Diagrammatic representation of a simple RIFF WAVE file, showing the three principal chunks. Additional chunks may be contained within the overall structure, for example a ‘bext’ chunk for the Broadcast WAVE file

A basic WAV file consists of three principal chunks, as shown in Figure 10.3: the RIFF chunk, the FORMAT chunk and the DATA chunk. The RIFF chunk contains 12 bytes, the first four of which are the ASCII characters ‘RIFF’, the next four indicating the number of bytes in the remainder of the file (after the first eight) and the last four of which are the ASCII characters ‘WAVE’. The format chunk contains information about the format of the sound file, including the number of audio channels, sampling rate and bits per sample, as shown in Table 10.1.

The audio data chunk contains a sequence of bytes of audio sample data, divided as shown in the FORMAT chunk. Unusually, if there are only 8 bits per sample or fewer each value is unsigned and ranges between 0 and 255 (decimal), whereas if the resolution is higher than this the data is signed and ranges both positively and negatively around zero. Audio samples are interleaved by channel in time order, so that if the file contains two channels a sample for the left channel is followed immediately by the associated sample for the right channel. The same is true of multiple channels (one sample for time-coincident sample periods on each channel is inserted at a time, starting with the lowest numbered channel), although basic WAV files were nearly always just mono or two channel.

Table 10.1 Contents of FORMAT Chunk in a Basic WAVE PCM File

Byte ID Contents
0–3 ckID ‘fmt_’ (ASCII characters)
4–7 nChunkSize Length of FORMAT chunk (binary, hex value: &00000010)
8–9 wFormatTag Audio data format (e.g. &0001 = WAVE format PCM) Other formats are allowed, for example IEEE floating point and MPEG format (&0050 = MPEG 1)
10–11 nChannels Number of channels (e.g. &0001 = mono, &0002 = stereo)
12–15 nSamplesPerSec Sample rate (binary, in Hz)
16–19 nAvgBytesPerSec Bytes per second
20–21 nBlockAlign Block alignment: bytes per sample frame across all channels, e.g. &0001 = 8 bit mono; &0002 = 8 bit stereo or 16 bit mono; &0004 = 16 bit stereo
22–23 nBitsPerSample Bits per sample
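
As an illustration of how the fields in Table 10.1 map onto real data, the following Python sketch locates the FORMAT chunk in a basic WAVE PCM file and unpacks its fields, little-endian throughout. It is a minimal sketch that assumes a well-formed file and does not attempt to handle WAVE-format extensible or compressed variants; the function name is illustrative only:

    import struct

    def read_wave_format(path):
        with open(path, 'rb') as f:
            riff, riff_size, wave = struct.unpack('<4sI4s', f.read(12))
            assert riff == b'RIFF' and wave == b'WAVE'
            while True:
                ck_id, ck_size = struct.unpack('<4sI', f.read(8))
                if ck_id == b'fmt ':             # the underscore in Table 10.1 denotes a space
                    fields = struct.unpack('<HHIIHH', f.read(16))
                    keys = ('wFormatTag', 'nChannels', 'nSamplesPerSec',
                            'nAvgBytesPerSec', 'nBlockAlign', 'nBitsPerSample')
                    return dict(zip(keys, fields))
                f.seek(ck_size + (ck_size & 1), 1)   # skip other chunks (even-padded)

For a 16 bit stereo file at 48 kHz this would typically return wFormatTag = 1 (PCM), nChannels = 2, nSamplesPerSec = 48000, nBlockAlign = 4 and nBitsPerSample = 16.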

The RIFF WAVE format is extensible and can have additional chunks to define enhanced functionality such as surround sound and other forms of coding. This is known as ‘WAVE-format extensible’. Chunks can include data relating to cue points, labels and associated data, for example. The Broadcast WAVE format is one example of an enhanced WAVE file (see Fact File 10.1), which is used widely in professional applications for interchange purposes.

DSD-IFF file format

The DSD-IFF file format is based on a similar structure to other IFF-type files, described above, except that it is modified slightly to allow for the large file sizes that may be encountered with the high-resolution Direct Stream Digital format used for SuperAudio CD. Specifically the container FORM chunk is labeled ‘FRM8’ and this identifies all local chunks that follow as having ‘length’ indications that are 8 bytes long rather than the normal 4. In other words, rather than a 4 byte chunk ID followed by a 4 byte length indication, these files have a 4 byte ID followed by an 8 byte length indication. This allows for the definition of chunks with a length greater than 2 Gbytes, which may be needed for mastering SuperAudio CDs. There are also various optional chunks that can be used for exchanging more detailed information and comments such as might be used in project interchange. Further details of this file format, and an excellent guide to the use of DSD-IFF in project applications, can be found in the DSD-IFF specification, as described in the ‘Recommended further reading’ at the end of this chapter.
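
The difference from a normal IFF file can be seen in a few lines of code. The Python sketch below reads the container header of a DSD-IFF file on the basis of the description above (a 4 byte ID followed by an 8 byte big-endian length); the function name is illustrative and the form type shown in the comment is only an example:

    import struct

    def read_frm8_header(path):
        with open(path, 'rb') as f:
            ck_id = f.read(4)                          # should be b'FRM8'
            ck_len, = struct.unpack('>Q', f.read(8))   # 8 byte length allows chunks beyond the 32 bit limit
            form_type = f.read(4)                      # e.g. b'DSD ' for SuperAudio CD material
            return ck_id, ck_len, form_type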

FACT FILE 10.1 BROADCAST WAVE FORMAT

The Broadcast WAVE format, described in EBU Tech. 3285, was standardized by the European Broadcasting Union (EBU) because of a need to ensure compatibility of sound files and accompanying information when transferred between workstations. It is based on the RIFF WAVE format described above, but contains an additional chunk that is specific to the format (the ‘broadcast_audio_extension’ chunk, ckID = ‘bext’) and also limits some aspects of the WAVE format. Version 0 was published in 1997 and Version 1 in 2001, the only difference being the addition of an SMPTE UMID (Unique Material Identifier) in version 1 (this is a form of metadata). Such files currently only contain either PCM or MPEG-format audio data. An optional Extended-BWF file (BWF-E) enables the size to exceed the limits of the basic RIFF WAVE format by extending the address space to 64 bits.

Broadcast WAVE files contain at least three chunks: the broadcast_audio_extension chunk, the format chunk and the audio data chunk. The broadcast extension chunk contains the data shown in the table below. Optionally files may also contain further chunks for specialized purposes and may contain chunks relating to MPEG audio data (the ‘fact’ and ‘mpeg_audio_extension’ chunks). MPEG applications of the format are described in EBU Tech. 3285, Supplement 1 and the audio data chunk containing the MPEG data normally conforms to the MP3 frame format.

A multichannel extension chunk defines the channel ordering, surround format, downmix coefficients for creating a two-channel mix, and some descriptive information. There are also chunks defined for metadata describing the audio contained within the file, such as the ‘quality chunk’ (ckID = ‘qlty’), which together with the coding history contained in the ‘bext’ chunk make up the so-called ‘capturing report’. These are described in Supplement 2 to EBU Tech. 3285. Finally there is a chunk describing the peak audio level within a file, which can aid automatic program level setting and program interchange. Recent revisions include the option to add loudness metadata into the file.

BWF files can be either mono, two-channel or multichannel (sometimes called polyfiles, or BWF-P), and utilities exist for separating polyfiles into individual mono files which some applications require.

Broadcast audio extension chunk format
Data Size (bytes) Description
ckID 4 Chunk ID = ‘bext’
ckSize 4 Size of chunk
Description 256 Description of the sound clip
Originator 32 Name of the originator
OriginatorReference 32 Unique identifier of the originator (issued by the EBU)
OriginationDate 10 ‘yyyy-mm-dd’
OriginationTime 8 ‘hh-mm-ss’
TimeReferenceLow 4 Low byte of the first sample count since midnight
TimeReferenceHigh 4 High byte of the first sample count since midnight
Version 2 BWF version number, e.g. &0001 is Version 1
UMID 64 UMID according to SMPTE 330M; if only a 32 byte UMID then the second half should be padded with zeros
Reserved 190 Reserved for extensions; set to zero in Version 1.
CodingHistory Unrestricted A series of ASCII strings, each terminated by CR/LF (carriage return, line feed) describing each stage of the audio coding history, according to EBU R-98
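
Given the fixed field sizes in the table above, the ‘bext’ chunk can be unpacked quite directly. The following Python sketch is one possible reading of it, with offsets taken from the table; the function name and dictionary keys are illustrative, and only the fixed-length fields before CodingHistory are decoded:

    import struct

    def parse_bext(chunk_data):
        text = lambda b: b.rstrip(b'\x00').decode('ascii', 'replace')
        description = text(chunk_data[0:256])
        originator = text(chunk_data[256:288])
        originator_ref = text(chunk_data[288:320])
        origination_date = text(chunk_data[320:330])     # 'yyyy-mm-dd'
        origination_time = text(chunk_data[330:338])     # 'hh-mm-ss'
        low, high, version = struct.unpack('<IIH', chunk_data[338:348])
        time_reference = (high << 32) | low              # first sample count since midnight
        return {'Description': description, 'Originator': originator,
                'OriginatorReference': originator_ref,
                'OriginationDate': origination_date,
                'OriginationTime': origination_time,
                'TimeReference': time_reference, 'Version': version}

The 64 bit TimeReference value divided by the sampling frequency gives the start time of the file in seconds since midnight, which is how BWF files carry their timestamp.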

Apple core audio format

Apple’s Core Audio Format (CAF) is a chunk-based container structure for storing audio in a way that is highly compatible with the Core Audio architecture of Mac OS X (see Chapter 9). It has a number of advantages over the other standard file types mentioned above in that it can have unlimited size, it can contain audio in a number of data formats and for any number of audio channels, and it can contain a range of metadata types including markers and channel layouts. Recording is also said to be safer because the file header does not have to be rewritten or updated at the end of or during recording, and new data can be appended to the end of existing files in a way that allows applications to determine the length of a file even if the header has not been properly finalized. The Channel Layout chunk describes the way in which channels are allocated to particular surround sound loudspeaker locations in multichannel files, in a similar way to the extended WAVE multichannel formats.

MPEG audio file formats

It is possible to store MPEG-compressed audio in AIFF-C or WAVE files, with the compression type noted in the appropriate header field. There are also older MS-DOS file extensions used to denote MPEG audio files, notably .MPA (MPEG Audio) or .ABS (Audio Bit Stream). However, owing to the ubiquity of the so-called ‘MP3’ format (MPEG 1, Layer 3) for audio distribution on the Internet, MPEG audio files are increasingly denoted with the extension ‘.MP3’. Such files are relatively simple, being really no more than MPEG audio frame data in sequence, each frame being preceded by a frame header. MPEG-4 files are containers that can carry multiple streams of audio, video and subtitle information. Two file extensions are commonly used, namely .mp4 and .m4a, having essentially the same structure. The latter was popularized by Apple with its iTunes releases and contains only audio information, including lossy-coded AAC data and Apple Lossless Audio Coding (ALAC) data.
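
Because an MP3 file is essentially a sequence of self-contained frames, each preceded by a header that begins with a sync pattern of eleven set bits, frame boundaries can be located by scanning for that pattern. The following Python sketch performs only this rough scan; it does not validate the remainder of the header, decode anything, or allow for ID3 tags, and the function name is illustrative:

    def find_frame_syncs(data, limit=10):
        """Return byte offsets of candidate MPEG audio frame headers."""
        offsets = []
        for i in range(len(data) - 1):
            # Frame sync: 11 set bits at the start of the 4 byte frame header
            if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
                offsets.append(i)
                if len(offsets) >= limit:
                    break
        return offsets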

Edit decision list (EDL) files and project interchange

EDL formats were historically proprietary but the need for open interchange of project data has increased the use of standardized EDL structures and ‘packaged’ project formats to make projects transportable between systems from different manufacturers.

Project interchange can involve the transfer of edit list, mixing, effects and audio data. Many of these are proprietary, such as the Digidesign ProTools session format. Software, such as SSL Pro-Convert, can be obtained for audio and video workstations that translates EDLs or projects between a number of different systems to make interchange easier. The OMFI (Open Media Framework Interchange) structure, originally developed by Avid, was one early attempt at an open project interchange format and contained a format for interchanging edit list data. Other options include XML-tagged formats that identify different items in the edit list in a text-based form. AES-31 is gaining popularity among workstation software manufacturers as a simple means of exchanging audio editing projects between systems, and is described in more detail below.

AES-31 format

AES-31 is an international standard designed to enable straightforward interchange of audio files and projects between systems. Audio editing packages are increasingly offering AES-31 as a simple interchange format for edit lists. In Part 1 the standard specifies a disk format that is compatible with the FAT32 file system, a widely used structure for the formatting of computer hard disks. Part 2 describes the use of the Broadcast WAVE audio file format. Part 3 describes simple project interchange, including a format for the communication of edit lists using ASCII text that can be parsed by a computer as well as read by a human. The basis of this is the edit decision markup language (EDML). It is not necessary to use all the parts of AES-31 to make a satisfactory interchange of elements. For example, one could exchange an edit list according to Part 3 without using a disk based on Part 1. Adherence to all the parts would mean that one could take a removable disk from one system, containing sound files and a project file, and the project would be readable directly by the receiving device.

EDML documents are limited to a 7 bit ASCII character set in which white space delimits fields within records. Standard carriage return (CR) and line-feed (LF) characters can be included to aid the readability of lists but they are ignored by software that might parse the list. An event location is described by a combination of time code value and sample count information. The time code value is represented in ASCII using conventional hours, minutes, seconds and frames (e.g. HH:MM:SS:FF) and the optional sample count is a four-figure number denoting the number of samples after the start of the frame concerned at which the event actually occurs. This enables sample-accurate edit points to be specified. It is slightly more complicated than this because the ASCII delimiters between the time code fields are changed to indicate various parameters:

HH:MM delimiter = Frame count and timebase indicator (see Table 10.2)

MM:SS delimiter = Film frame indicator (if not applicable, use the previous delimiter)

SS:FF delimiter = Video field and timecode type (see Table 10.3)

Table 10.2 Frame Count and Timebase Indicator Coding in AES-31

Timebase
Frame count Unknown 1.000 1.001
30 ? | :
25 ! · /
24 # =

Table 10.3 Video Field and Timecode Type Indicator in AES-31

Video Field
Counting mode Field 1 Field 2
PAL · :
NTSC non-drop-frame · :
NTSC drop-frame · :

The delimiter before the sample count value is used to indicate the audio sampling frequency, including all the pull-up and pull-down options (e.g. fs times 1/1.001). There are too many of these possibilities to list here and the interested reader is referred to the standard for further information. This is an example of a time code and (after the slash denoting 48 kHz sampling frequency) optional sample count value:

image

The Audio Decision List (ADL) is contained between two ASCII keyword tags <ADL> and </ADL>. It in turn contains a number of sections, each contained within other keyword tags such as <VERSION>, <PROJECT>, <SYSTEM> and <SEQUENCE>. The edit points themselves are contained in the <EVENT_LIST> section. Each event begins with the ASCII keyword “(Entry)”, which serves to delimit events in the list, followed by an entry number (32 bit integer, incrementing through the list) and an entry type keyword to describe the nature of the event (e.g. “(Cut)”). Each different event type then has a number of bytes following that define the event more specifically. The following is an example of a simple cut edit, as suggested by the standard:

image

This sequence essentially describes a cut edit, entry number 0010, the source of which is the file (F) with the path shown, using channel 1 of the source file (or just a mono file), placed on track 1 of the destination timeline, starting at timecode three hours in the source file, placed to begin at one hour in the destination timeline (the ‘in point’) and to end ten seconds later (the ‘out point’). Some workstation software packages store a timecode value along with each sound file to indicate the nominal start time of the original recording (e.g. BWF files contain a timestamp in the ‘bext’ chunk), otherwise each sound file is assumed to start at time zero.
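
The relationship between the timecode fields and the optional sample count can be shown with a small calculation. The Python sketch below converts an event time into an absolute sample position, assuming a 25 fps timebase and 48 kHz sampling frequency; parsing of the delimiter characters themselves is omitted, since the full delimiter set is defined in the standard, and the function name is illustrative:

    def event_to_samples(hh, mm, ss, ff, sample_count=0,
                         frame_rate=25, sampling_rate=48000):
        """Convert an ADL event time to an absolute sample position."""
        samples_per_frame = sampling_rate // frame_rate     # 1920 at 25 fps and 48 kHz
        frames = ((hh * 60 + mm) * 60 + ss) * frame_rate + ff
        return frames * samples_per_frame + sample_count

    # One hour at 25 fps and 48 kHz:
    # event_to_samples(1, 0, 0, 0) == 172800000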

It is assumed that default crossfades will be handled by the workstation software itself. Most workstations introduce a basic short crossfade at each edit point to avoid clicks, but this can be modified by ‘event modifier’ information in the ADL. Such modifiers can be used to adjust the shape and duration of a fade in or fade out at an edit point. There is also the option to point at a rendered crossfade file for the edit point, as described in Chapter 9.

MXF — the material exchange format

MXF was developed by the Pro-MPEG forum as a means of exchanging audio, video and metadata between devices, primarily in television operations. It is based on the modern concept of media objects that are split into ‘essence’ and ‘metadata’. Essence files are the raw material (i.e. audio and video) and the metadata describes things about the essence (such as where to put it, where it came from and how to process it).

MXF files attempt to present the material in a ‘streaming’ format, that is, one that can be played out in real time, but they can also be exchanged in conventional file transfer operations. As such they are normally considered to be finished program material, rather than material that is to be processed somewhere downstream, designed for playout in broadcasting environments. The bit stream is also said to be compatible with recording on digital videotape devices.

AAF — the advanced authoring format

AAF is an authoring format for multimedia data that is supported by numerous vendors, including Avid, which adopted it as a migration path from OMFI. Parts of OMFI 2.0 form the basis for parts of AAF and there are also close similarities between AAF and MXF (described in the previous section). Like the formats to which it has similarities, AAF is an object-oriented format that combines essence and metadata within a container structure. Unlike MXF it is designed for project interchange such that elements within the project can be modified, post-processed and resynchronized. It is not, therefore, directly suitable as a streaming format but can easily be converted to MXF for streaming if necessary.

Rather like OMFI it is designed to enable complex relationships to be described between content elements, to map these elements onto a timeline, to describe the processing of effects, synchronize streams of essence, retain historical metadata and refer to external essence (essence not contained within the AAF package itself). It has three essential parts: the AAF Object Specification (which defines a container for essence and metadata, the logical contents of objects and rules for relationships between them); the AAF Low-Level Container Specification (which defines a disk filing structure for the data, based on Microsoft’s Structured Storage); and the AAF SDK Reference Implementation (which is a software development kit that enables applications to deal with AAF files). The Object Specification is extensible in that it allows new object classes to be defined for future development purposes.

The basic object hierarchy is illustrated in Figure 10.4, using an example of a typical audio post-production scenario. ‘Packages’ of metadata are defined that describe either compositions, essence or physical media. Some package types are very ‘close’ to the source material (they are at a lower level in the object hierarchy, so to speak) — for example, a ‘file source package’ might describe a particular sound file stored on disk. The metadata package, however, would not be the file itself, but it would describe its name and where to find it. Higher-level packages would refer to these lower-level packages in order to put together a complex program. A composition package is one that effectively describes how to assemble source clips to make up a finished program. Some composition packages describe effects that require a number of elements of essence to be combined or processed in some way.

image

FIGURE 10.4
Graphical conceptualization of some metadata package relationships in AAF: a simple audio post-production example

Packages can have a number of ‘slots’. These are a bit like tracks in more conventional terminology, each slot describing only one kind of essence (e.g. audio, video, graphics). Slots can be static (not time-dependent), timeline (running against a timing reference) or event-based (one-shot, triggered events). Slots have segments that can be source clips, sequences, effects or fillers. A source clip segment can refer to a particular part of a slot in a separate essence package (so it could refer to a short portion of a sound file that is described in an essence package, for example).
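
The hierarchy of packages, slots and segments can be pictured with a simple data model. The Python sketch below is purely conceptual, not the AAF SDK, and its class and field names are invented for illustration; it merely mirrors the relationships described above, in which a source clip in a composition package refers to a slot in a file source package:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SourceClip:
        package_name: str      # which package the clip refers to
        slot_id: int           # which slot within that package
        start: int             # offset into the referenced slot, in edit units
        length: int

    @dataclass
    class Slot:
        slot_id: int
        kind: str                               # 'timeline', 'static' or 'event'
        segments: List[SourceClip] = field(default_factory=list)

    @dataclass
    class Package:
        name: str
        package_type: str                       # 'composition', 'essence' or 'physical'
        slots: List[Slot] = field(default_factory=list)

    # A file source package describing a sound file, and a composition referring to part of it
    source = Package('dialogue.wav', 'essence', [Slot(1, 'timeline')])
    comp = Package('Scene 12 mix', 'composition',
                   [Slot(1, 'timeline',
                         [SourceClip('dialogue.wav', 1, start=48000, length=96000)])])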

Disk pre-mastering formats

The Disk Description Protocol (DDP) developed and licensed by Doug Carson and Associates has been widely adopted for describing consumer optical disk masters, and some optical disk pressing plants require masters to be submitted in this form. Version 1 of the DDP for CD laid down the basic data structure but said little about higher-level issues involved in interchange, making it more than a little complicated for manufacturers to ensure that DDP masters from one system would be readable on another. Version 2 addressed some of these issues. There are versions of DDP for CD, DVD and HD DVD-ROM. The so-called Cutting Master Format (CMF) sanctioned by the DVD Forum is derived from DDP, but the Blu-Ray CMF is not related to DDP.

DDP is a protocol for describing the contents of a disk, which is not medium specific, so a range of tape or disk storage media can be used to transfer files to pressing plants. DDP files can be supplied separately to the audio data if necessary. The protocol consists of a number of ‘streams’ of data, each of which carries different information to describe the contents of the disk. These streams may be either a series of packets of data transferred over a network, files on a disk or tape, or raw blocks of data independent of any filing system. The DDP protocol simply maps its data into whatever block or packet size is used by the medium concerned, provided that the block or packet size is at least 128 bytes. Either a standard computer filing structure can be used, in which case each stream is contained within a named file, or the storage medium is used ‘raw’ with each stream starting at a designated sector or block address.

The ANSI tape labeling specification is used to label the media used for DDP transfers. This allows the names and locations of the various streams to be identified. The principal streams included in a DDP transfer for CD mastering are as follows:

    1. DDP ID stream or ‘DDPID’ file. 128 bytes long, describing the type and level of DDP information, various ‘vital statistics’ about the other DDP files and their location on the medium (in the case of physically addressed media), and a user text field (not transferred to the CD).

    2. DDP Map stream or ‘DDPMS’ file. This is a stream of 128 byte data packets which together give a map of the CD contents, showing what types of CD data are to be recorded in each part of the CD, how long the streams are, what types of subcode are included, and so forth. Pointers are included to the relevant text, subcode and main streams (or files) for each part of the CD.

    3. Text stream. An optional stream containing text to describe the titling information for volumes, tracks or index points (not currently stored in CD formats), or for other text comments. If stored as a file, its name is indicated in the appropriate map packet.

    4. Subcode stream. Optionally contains information about the subcode data to be included within a part of the disk, particularly for CD-DA. If stored as a file, its name is indicated in the appropriate map packet.

    5. Main stream. Contains the main data to be stored on a part of the CD, treated simply as a stream of bytes, irrespective of the block or packet size used. More than one of these files can be used in cases of mixed-mode disks, but there is normally only one in the case of a conventional audio CD. If stored as a file, its name is indicated in the appropriate map packet.

INTERCONNECTING DIGITAL AUDIO DEVICES

Introduction

In the case of analog interconnection between devices, replayed digital audio is converted to the analog domain by the replay machine’s D/A convertors, routed to the recording machine via a conventional audio cable and then reconverted to the digital domain by the recording machine’s A/D convertors. The audio is subject to any gain changes that might be introduced by level differences between output and input, or by the record gain control of the recorder and the replay gain control of the player. Analog domain copying is necessary if any analog processing of the signal is to happen in between one device and another, such as gain correction, equalization, or the addition of effects such as reverberation. Most of these operations, though, are now possible in the digital domain.

An analog domain copy is not a perfect copy or a clone of the original master, because the data values will not be exactly the same (owing to slight differences in recording level, differences between convertors, the addition of noise, and so on). For a clone it is necessary to make a true digital copy. This can either involve a file copying process, perhaps over a network using a workstation, or a digital interface or network may be used for the streamed interconnection of recording systems and other audio devices such as mixers and effects units. The essential differences between digital audio interfaces and networked data exchange are explained in Fact File 10.2.

Digital interface basics

Digital audio interfaces conforming to one of the standard protocols allow for a number of channels of digital audio data to be transferred between devices with no loss of sound quality. Any number of generations of digital copies can be made without affecting the sound quality of the latest generation, provided that errors have been fully corrected. (This assumes that the audio is in a linear PCM format and has not been subject to low bit rate decoding and re-encoding.) This process takes place in real time, requiring the operator to put the receiving device into record mode such that it simply stores the incoming stream of audio data. Any accompanying metadata may or may not be recorded (often most of it is not). Both machines must be operating at the same sampling frequency (unless a sampling frequency convertor is used) and it may require the recorder to be switched to ‘external sync’ mode, so that it can lock its sampling frequency to that of the player. Alternatively (and preferably) a common reference (e.g. word clock) signal can be used to synchronize all devices that are to be interconnected digitally. If one of these methods of ensuring a common sampling frequency is not used then either audio will not be decoded at all by the receiver, or regular clicks will be audible at a rate corresponding to the difference between the two sampling frequencies (at which point samples are either skipped or repeated owing to the ‘sample slippage’ that is occurring between the two machines). A receiver should be capable of at least the same quantizing resolution (number of bits per sample) as the source device, otherwise audio resolution will be lost. If there is a difference in resolution between the systems it is advisable to use a processor in between the machines that optimally dithers the signal for the new resolution, or alternatively, to use redithering options on the source machine to prepare the signal for its new resolution (see Chapter 8).
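
The rate of the clicks mentioned above is easy to estimate, because it is simply the difference between the two sampling frequencies, which is the rate at which samples are skipped or repeated. A minimal sketch, in which the 50 ppm clock offset is an arbitrary example:

    nominal_fs = 48000.0
    offset_ppm = 50                          # an arbitrary example of clock error
    actual_fs = nominal_fs * (1 + offset_ppm * 1e-6)

    slips_per_second = abs(actual_fs - nominal_fs)
    print(round(slips_per_second, 2))        # 2.4 samples skipped or repeated per second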

FACT FILE 10.2 COMPUTER NETWORKS VS DIGITAL AUDIO INTERFACES

Dedicated digital audio interfaces are the digital equivalent of analog signal cables, down which signals for one or more channels are carried in real time from one point to another, possibly with some auxiliary information (metadata) attached. An example is the AES-3 interface, described in the main text. Such an audio interface uses a data format dedicated to audio purposes, whereas a computer data network can carry numerous types of information.

Dedicated interfaces are normally unidirectional, point-to-point connections, whereas computer data interconnects and networks are often bidirectional and carry data in a packet format for numerous sources and destinations. With dedicated interfaces sources may be connected to destinations using a routing matrix or by patching individual connections. Audio data are transmitted in an unbroken stream, there is no handshaking process involved in the data transfer, and erroneous data are not retransmitted because there is no mechanism for requesting its retransmission. The data rate of a dedicated audio interface is usually directly related to the audio sampling frequency, word length and number of channels of the audio data to be transmitted, ensuring that the interface is always capable of serving the specified number of channels. If a channel is unused for some reason its capacity is not normally available for assigning to other purposes (such as higher-speed transfer of another channel).

As an alternative to dedicated digital audio interfaces, standard computer interconnects and networks are now used widely to transfer audio information. Computer networks are typically general purpose data carriers that may have asynchronous features and may not always have the inherent quality-of-service (QoS) features that are required for ‘streaming’ applications. They also normally use an addressing structure that enables packets of data to be carried from one of a number of sources to one of a number of destinations and such packets will share the connection in a more or less controlled way. Data transport protocols such as TCP/IP are often used as a universal means of managing the transfer of data from place to place, adding overheads in terms of data rate, delay and error handling that may work against the efficient transfer of audio. Such networks may be designed primarily for file transfer applications where the time taken to transfer the file is not a crucial factor — ‘as fast as possible’ will do. This has required some special techniques to be developed for carrying real-time data such as audio information.

USB (Universal Serial Bus) and Firewire (IEEE 1394) are examples of personal area network (PAN) technology, allowing a number of devices to be interconnected within a limited range around the user. These have a high enough data rate to carry a number of channels of audio data over relatively short distances, either over copper or optical fiber. Audio protocols also exist for these as discussed in the main text.

Dedicated audio interface formats

There are a number of types of digital interface, some of which are international standards and others of which are manufacturer specific. They all carry digital audio for one or more channels with at least 16 bit resolution and will operate at the standard sampling rates of 44.1 and 48kHz, as well as at 32kHz if necessary, some having a degree of latitude for varispeed. Some interface standards have been adapted to handle higher sampling frequencies such as 88.2 and 96kHz. The interfaces vary as to how many physical interconnections are required. Some require one link per channel plus a synchronization signal, whilst others carry all the audio information plus synchronization information over one cable.

The most common interfaces are described below in outline. It is common for subtle incompatibilities to arise between devices, even when interconnected with a standard interface, owing to the different ways in which non-audio information is implemented. This can result in anything from minor operational problems to total non-communication and the causes and remedies are unfortunately far too detailed to go into here. The reader is referred to The Digital Interface Handbook by Rumsey and Watkinson, as well as to the standards themselves, if a greater understanding of the intricacies of digital audio interfaces is required.

The AES/EBU interface (AES-3)

The AES-3 interface, described almost identically in AES-3-1992, IEC 60958 and EBU Tech. 3250E among others, allows for two channels of digital audio (A and B) to be transferred serially over one balanced interface, using drivers and receivers similar to those used in the RS422 data transmission standard, with an output voltage of between 2 and 7 volts as shown in Figure 10.5. The interface allows two channels of audio to be transferred over distances up to 100 m, but longer distances may be covered using combinations of appropriate cabling, equalization and termination. Standard XLR-3 connectors are used, often labeled DI (for digital in) and DO (for digital out).

Each audio sample is contained within a ‘subframe’ (see Figure 10.6), and each subframe begins with one of three synchronizing patterns to identify the sample as either the A or B channel, or to mark the start of a new channel status block (see Figure 10.7). These synchronizing patterns violate the rules of bi-phase mark coding (see below) and are easily identified by a decoder. One frame (containing two audio samples) is normally transmitted in the time period of one audio sample, so the data rate varies with the sampling frequency. (The later ‘single-channel-double-sampling-frequency’ mode of the interface allows two samples for one channel to be transmitted within a single frame in order to allow the transport of audio at 88.2 or 96kHz sampling frequency.)

image

FIGURE 10.5
Recommended electrical circuit for use with the standard two-channel interface

image

FIGURE 10.6
Format of the standard two-channel interface frame

image

FIGURE 10.7
Three different preambles (X, Y and Z) are used to synchronize a receiver at the starts of subframes

image

FIGURE 10.8 Overview of the professional channel status block

image

FIGURE 10.9 An example of the bi-phase mark channel code

Additional data is carried within the subframe in the form of 4 bits of auxiliary data (which may either be used for additional audio resolution or for other purposes such as low-quality speech), a validity bit (V), a user bit (U), a channel status bit (C) and a parity bit (P), making 32 bits per subframe and 64 bits per frame. Channel status bits are aggregated at the receiver to form a 24 byte word every 192 frames, and each bit of this word has a specific function relating to interface operation, an overview of which is shown in Figure 10.8. Examples of bit usage in this word are the signaling of sampling frequency and pre-emphasis, as well as the carrying of a sample address ‘time-code’ and labeling of source and destination. Bit 1 of the first byte signifies whether the interface is operating according to the professional (set to 1) or consumer (set to 0) specification.
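
The parity bit is chosen so that time slots 4 to 31 of the subframe (everything except the preamble) contain an even number of ones. A minimal Python sketch of assembling those 28 bits is shown below; the preamble is omitted because it is a channel-code pattern rather than ordinary data, and the function name and bit ordering are for illustration only:

    def build_subframe_bits(audio_24bit, validity=0, user=0, channel_status=0):
        """Assemble time slots 4 to 31 of an AES-3 subframe (preamble omitted).
        The 4 LSBs of a 24 bit sample occupy the auxiliary slots."""
        bits = [(audio_24bit >> i) & 1 for i in range(24)]   # LSB transmitted first
        bits += [validity, user, channel_status]
        parity = sum(bits) & 1        # set the parity bit so the slot 4-31 total is even
        bits.append(parity)
        return bits                   # 28 bits, ready for bi-phase mark coding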

Bi-phase mark coding, the same channel code as used for SMPTE/EBU timecode, is used in order to ensure that the data is self-clocking, of limited bandwidth, DC free, and polarity independent, as shown in Figure 10.9. The interface has to accommodate a wide range of cable types and a nominal 110 ohm characteristic impedance is recommended. Originally (AES-3-1985) up to four receivers with a nominal input impedance of 250 ohms could be connected across a single professional interface cable, but a later modification to the standard recommended the use of a single receiver per transmitter, having a nominal input impedance of 110 ohms.
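
Bi-phase mark coding itself is simple enough to express in a few lines: every data bit begins with a transition, and a one carries an extra transition in the middle of the bit cell. The following Python sketch generates the two half-cell line levels for each data bit, in the manner of Figure 10.9; the function name and starting level are arbitrary:

    def biphase_mark_encode(bits, start_level=0):
        """Return two line levels (half cells) per data bit."""
        level = start_level
        out = []
        for bit in bits:
            level ^= 1              # transition at the start of every bit cell
            out.append(level)
            if bit:
                level ^= 1          # an extra mid-cell transition encodes a one
            out.append(level)
        return out

    # biphase_mark_encode([1, 0, 1, 1]) -> [1, 0, 1, 1, 0, 1, 0, 1]

Because the signal is defined by its transitions rather than its absolute levels, inverting the whole waveform makes no difference, which is why the code is polarity independent.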

Standard consumer interface (IEC 60958-3)

The most common consumer interface (historically related to SPDIF — the Sony/Philips digital interface) is very similar to the AES-3 interface, but uses unbalanced electrical interconnection over a coaxial cable having a characteristic impedance of 75 ohms, as shown in Figure 10.10. It can be found on many items of semi-professional or consumer digital audio equipment, such as CD players, DVD players and DAT machines, and is also widely used on computer sound cards because of the small physical size of the connectors. It usually terminates in an RCA phono connector, although some equipment makes use of optical fiber interconnects (TOS-link) carrying the same data. Format convertors are available for converting consumer format signals to the professional format, and vice versa, and for converting between electrical and optical formats. Both the professional (AES-3 equivalent) and consumer interfaces are capable of carrying data-reduced stereo and surround audio signals such as MPEG and Dolby Digital as described in Fact File 10.3.

The data format of subframes is the same as that used in the professional interface, but the channel status implementation is almost completely different, as shown in Figure 10.11. The second byte of channel status in the consumer interface has been set aside for the indication of ‘category codes’, these being set to define the type of consumer usage. Examples of defined categories are (00000000) for the General category, (10000000) for Compact Disc and (11000000) for a DAT machine. Once the category has been defined, the receiver is expected to interpret certain bits of the channel status word in a particular way, depending on the category. For example, in CD usage, the four control bits from the CD’s ‘Q’ channel subcode are inserted into the first four control bits of the channel status block (bits 1-4). Copy protection can be implemented in consumer-interfaced equipment, according to the Serial Copy Management System (SCMS).

image

FIGURE 10.10 The consumer electrical interface (transformer and capacitor are optional but may improve the electrical characteristics of the interface)

image

FIGURE 10.11 Overview of the consumer channel status block

FACT FILE 10.3 CARRYING DATA-REDUCED AUDIO

The increased use of data-reduced multichannel audio has resulted in methods by which such data can be carried over standard two-channel interfaces, for either professional or consumer purposes. This makes use of the ‘non-audio’ or ‘other uses’ mode of the interface, indicated in the second bit of channel status, which tells conventional PCM audio decoders that the information is some other form of data that should not be converted directly to analog audio. Because data-reduced audio has a much lower rate than the PCM audio from which it was derived, a number of audio channels can be carried in a data stream that occupies no more space than two channels of conventional PCM. These applications of the interface are described in SMPTE 337M (concerned with professional applications) and IEC 61937, although the two are not identical. SMPTE 338M and 339M specify data types to be used with this standard. The SMPTE standard packs the compressed audio data into 16, 20 or 24 bits of the audio part of the AES-3 subframe and can use the two subframes independently (e.g. one for PCM audio and the other for data-reduced audio), whereas the IEC standard only uses 16 bits and treats both subframes the same way.

Consumer use of this mode is evident on DVD players, for example, for connecting them to home cinema decoders. Here the Dolby Digital or DTS-encoded surround sound is not decoded in the player but in the attached receiver/decoder. IEC 61937 has parts dealing with a range of different codecs including ATRAC, Dolby AC-3, DTS and MPEG (various flavors). An ordinary PCM convertor trying to decode such a signal would simply reproduce it as a loud, rather unpleasant noise, which is not advised and does not normally happen if the second bit of channel status is correctly observed. Professional applications of the mode vary, but are likely to be increasingly encountered in conjunction with Dolby E data reduction — a relatively recent development involving mild data reduction for professional multichannel applications in which users wish to continue making use of existing AES-3-compatible equipment (e.g. VTRs, switchers and routers). Dolby E enables 5.1-channel surround audio to be carried over conventional two-channel interfaces and through AES-3-transparent equipment at a typical rate of about 1.92Mbit/s (depending on how many bits of the audio subframe are employed). It is designed so that it can be switched or edited at video frame boundaries without disturbing the audio.

AES55-2012 details ways of transporting MPEG Surround data in an AES-3 bitstream, including the use of Spatial Audio Object Coding (SAOC). The standard specifies how a mono or stereo downmix can be transported in the linear PCM domain, while the MPEG Surround or SAOC data is included in the least significant bits of the PCM audio data.

The user bits of the consumer interface are often used to carry information derived from the subcode of recordings, such as track identification and cue point data. This can be used when copying CDs and DAT tapes, for example, to ensure that track start ID markers are copied along with the audio data. This information is not normally carried over AES/EBU interfaces.

MADI

Originally proposed in the UK in 1988 by four manufacturers of professional audio equipment (Sony, Neve, Mitsubishi and Solid State Logic), the so-called ‘MADI’ interface is an AES and ANSI standard. It was designed to simplify cabling in large installations, especially between multitrack recorders and mixers, and has a lot in common with the format of the AES/EBU interface. The standard concerned is AES10–1991 (ANSI S4.43–1991). This interface was intentionally designed to be transparent to standard two-channel data, making the incorporation of two-channel signals into a MADI multiplex a relatively straightforward matter. The original channel status, user and auxiliary data remain intact within the multichannel format.

MADI stands for Multichannel Audio Digital Interface. 32, 56 or 64 channels of audio are transferred serially in asynchronous form and consequently the data rate is much higher than that of the two-channel interface. For this reason the data is transmitted either over a coaxial transmission line with 75 ohm termination (not more than 50 m) or over a fiber optic link. A twisted pair version is also in development. MADI PCIe cards are available for digital audio workstations, which make a convenient way of connecting large numbers of audio channels to and from external systems such as digital mixers.

Proprietary digital interfaces

Tascam’s interfaces became popular owing to the widespread use of the company’s DA-88 multitrack recorder and derivatives. The primary TDIF-1 interface uses a 25-pin D-sub connector to carry eight channels of audio information in two directions (in and out of the device), sampling frequency and pre-emphasis information (on separate wires, two for fs and one for emphasis) and a synchronizing signal. The interface is unbalanced and uses CMOS voltage levels. Each data connection carries two channels of audio data, odd channel and MSB first, as shown in Figure 10.12. As can be seen, the audio data can be up to 24 bits long, followed by 2 bits to signal the word length, 1 bit to signal emphasis and 1 bit for parity. There are also 4 user bits per channel that are not usually used.

The Alesis ADAT multichannel optical digital interface, commonly referred to as the ‘light pipe’ interface or simply ‘ADAT Optical’, is a serial, self-clocking, optical interface that carries eight channels of audio information. It is described in US Patent 5,297,181: ‘Method and apparatus for providing a digital audio interface protocol’. The interface is capable of carrying up to 24 bits of digital audio data for each channel and the eight channels of data are combined into one serial frame that is transmitted at the sampling frequency. The data is encoded in NRZI format for transmission, with forced ones inserted every 5 bits (except during the sync pattern) to provide clock content. This can be used to synchronize the sampling clock of a receiving device if required, although some devices require the use of a separate 9-pin ADAT sync cable for synchronization. The sampling frequency is normally limited to 48 kHz with varispeed up to 50.4kHz and TOSLINK optical connectors are typically employed (Toshiba TOCP172 or equivalent). In order to operate at 96kHz sampling frequency some implementations use a ‘double-speed’ mode in which two channels are used to transmit one channel’s audio data (naturally halving the number of channels handled by one serial interface). Although 5 m lengths of optical fiber are the maximum recommended, longer distances may be covered if all the components of the interface are of good quality and clean. Experimentation is required.

As shown in Figure 10.13 the frame consists of an 11 bit sync pattern consisting of 10 zeros followed by a forced one. This is followed by 4 user bits (not normally used and set to zero), the first forced one, then the first audio channel sample (with forced ones every 5 bits), the second audio channel sample, and so on.
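
On the basis of the frame structure just described, the overall bit budget can be estimated. The short calculation below assumes one forced one before every 4 audio data bits (six per 24 bit channel sample); treat the breakdown as an approximation drawn from the description above rather than a formal specification:

    sync_bits = 11                      # 10 zeros followed by a forced one
    user_bits = 4
    leading_forced_one = 1
    channels = 8
    bits_per_channel = 24 + 24 // 4     # 24 audio bits plus a forced one per 4 bits

    frame_bits = sync_bits + user_bits + leading_forced_one + channels * bits_per_channel
    print(frame_bits)                            # 256 bits per frame
    print(frame_bits * 48000 / 1e6, 'Mbit/s')    # 12.288 Mbit/s at 48 kHz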

SDIF is the original Sony interface for digital audio, most commonly encountered in SDIF-2 format on BNC connectors, along with a word clock signal. However, this is not often used these days. SDIF-3 is Sony’s interface for high-resolution DSD data (see Chapter 8), although some early DSD equipment used a data format known as ‘DSD-raw’, which was simply a stream of DSD samples in non-return-to-zero (NRZ) form, as shown in Figure 10.14. (The latter is essentially the same as SDIF-2.) In SDIF-3 data is carried over 75 ohm unbalanced coaxial cables, terminating in BNC connectors. The bit rate is twice the DSD sampling frequency (5.6448Mbit/s at the standard DSD sampling frequency of 2.8224 MHz) because phase modulation is used for data transmission as shown in Figure 10.14(b). A separate word clock at 44.1kHz is used for synchronization purposes. It is also possible to encounter a DSD clock signal connection at 64 times 44.1kHz (2.8224 MHz).

image

FIGURE 10.12
Format of TDIF data and LRsync signal

image

FIGURE 10.13
Basic format of ADAT data

image

FIGURE 10.14
Direct Stream Digital interface data is either transmitted ‘raw’, as shown at (a) or phase modulated as in the SDIF-3 format shown at (b)

Sony also developed a multichannel interface for DSD signals, capable of carrying 24 channels over a single physical link. The transmission method is based on the same technology as used for the Ethernet 100BASE-TX (100Mbit/s) twisted-pair physical layer (PHY), but it is used in this application to create a point-to-point audio interface. Category 5 cabling is used, as for Ethernet, consisting of eight conductors. Two pairs are used for bi-directional audio data and the other two pairs for clock signals, one in each direction.

Twenty-four channels of DSD audio require a total bit rate of 67.7Mbit/s, leaving an appreciable spare capacity for additional data. In the MAC-DSD interface this is used for error correction (parity) data, frame header and auxiliary information. Data is formed into frames that can contain Ethernet MAC headers and optional network addresses for compatibility with network systems. Audio data within the frame is formed into 352 32-bit blocks, each block consisting of 24 bits of individual channel samples, six parity bits and two auxiliary bits.
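
The figures quoted above can be checked with some simple arithmetic; this is a rough sketch in which the per-channel DSD rate is taken as 64 times 44.1 kHz:

    dsd_rate = 64 * 44100               # 2.8224 Mbit/s per DSD channel
    channels = 24

    audio_rate = channels * dsd_rate
    print(audio_rate / 1e6)             # about 67.7 Mbit/s of audio data

    blocks_per_frame = 352
    payload_bits = blocks_per_frame * 32     # 11264 bits per frame in total,
    sample_bits = blocks_per_frame * 24      # of which 8448 carry channel samples
    print(payload_bits, sample_bits)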

More recently Sony introduced ‘SuperMAC’ which is capable of handling either DSD or PCM audio with very low latency (delay), typically less than 50μs, over Cat-5 Ethernet cables using the 100BASE-TX physical layer. The number of channels carried depends on the sampling frequency. Twenty-four bidirectional DSD channels can be handled, or 48 PCM channels at 44.1/48 kHz, reducing proportionately as the sampling frequency increases. In conventional PCM mode the interface is transparent to AES-3 data including user and channel status information. Up to 5 Mbit/s of Ethernet control information can be carried in addition. A means of interchange based on this was standardized by the AES as AES-50. ‘HyperMAC’ runs even faster, carrying up to 384 audio channels on gigabit Ethernet Cat-6 cable or optical fiber, together with 100Mbit/s Ethernet control data. Sony sold this networking technology to Klark Teknik.

The advantage of these interfaces is that audio data thus formatted can be carried over the physical drivers and cables common to Ethernet networks, carrying a lot of audio at high speed. These interfaces bridge the conceptual gap between dedicated audio interfaces and generic computer networks, as they use some of the hardware and the physical layer of a computer network to transfer audio in a convenient form. They do not, however, employ all the higher layers of computer network protocols as mentioned in the next section. This means that the networking protocol overhead is relatively low, minimal buffering is required and latency can be kept to a minimum. Dedicated routing equipment is, however, required. One of the main applications so far has been in a Midas router for live performance mixing.

Data networks and computer interconnects

A network carries data either on wire or optical fiber, and is normally shared between a number of devices and users. The sharing is achieved by containing the data in packets of a limited number of bytes (usually between 64 and 1518), each with an address attached. The packets usually share a common physical link, normally a high-speed serial bus of some kind, being multiplexed in time either using a regular slot structure synchronized to a system clock (isochronous transfer) or in an asynchronous fashion whereby the time interval between packets may be varied or transmission may not be regular, as shown in Figure 10.15. The length of packets may not be constant, depending on the requirements of different protocols sharing the same network. Packets for a particular file transfer between two devices may not be contiguous and may be transferred erratically, depending on what other traffic is sharing the same physical link.

image

FIGURE 10.15 Packets for different destinations (A, B and C) multiplexed onto a common serial bus. (a) Time division multiplexed into a regular time slot structure. (b) Asynchronous transfer showing variable time gaps and packet lengths between transfers for different destinations

Figure 10.16 shows some common physical layouts for local area networks (LANs). LANs are networks that operate within a limited area, such as an office building or studio center, within which it is common for every device to ‘see’ the same data, each picking off that which is addressed to it and ignoring the rest. Routers and bridges can be used to break up complex LANs into subnets. WANs (wide area networks) and MANs (metropolitan area networks) are larger entities that link LANs within communities or regions. PANs (personal area networks) are typically limited to a range of a few tens of meters around the user (e.g. Firewire, USB, Bluetooth). Wireless versions of these network types are increasingly common. Different parts of a network can be interconnected or extended as explained in Fact File 10.4.

image

FIGURE 10.16 Two examples of computer network topologies. (a) Devices connected by spurs to a common hub, and (b) devices connected to a common ‘backbone’. The former is now by far the most common, typically using CAT 5 cabling

FACT FILE 10.4 EXTENDING A NETWORK

It is common to need to extend a network to a wider area or to more machines. As the number of devices increases so does the traffic, and there comes a point when it is necessary to divide a network into zones, separated by ‘repeaters’, ‘bridges’ or ‘routers’. Some of these devices allow network traffic to be contained within zones, only communicating between the zones when necessary. This is vital in large interconnected networks because otherwise data placed anywhere on the network would be present at every other point on the network, and overload could quickly occur.

A repeater is a device that links two separate segments of a network so that they can talk to each other, whereas a bridge isolates the two segments in normal use, only transferring data across the bridge when it has a destination address on the other side. A router is very selective in that it examines data packets and decides whether or not to pass them depending on a number of factors. A router can be programmed only to pass certain protocols and only certain source and destination addresses. It therefore acts as something of a network policeman and can be used as a first level of ensuring security of a network from unwanted external access. Routers can also operate between different standards of network, such as between FDDI and Ethernet, and ensure that packets of data are transferred over the most time-/cost-effective route.
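
As a rough illustration of the difference, the following sketch models a bridge that forwards frames only when the destination lies on the far segment, and a router that additionally filters by protocol and destination address. The tables and names are invented for the example:

```python
# Hypothetical illustration of bridge vs. router forwarding decisions.

# A bridge learns which addresses live on which segment and only forwards
# frames whose destination is on the other side.
segment_of = {"A": 1, "B": 1, "C": 2}          # invented address-to-segment table

def bridge_forwards(src: str, dst: str) -> bool:
    return segment_of.get(dst) != segment_of.get(src)

# A router is more selective: it can be programmed to pass only certain
# protocols and only certain source and destination addresses.
ALLOWED_PROTOCOLS = {"tcp", "udp"}
BLOCKED_DESTINATIONS = {"C"}

def router_forwards(protocol: str, src: str, dst: str) -> bool:
    return protocol in ALLOWED_PROTOCOLS and dst not in BLOCKED_DESTINATIONS

print(bridge_forwards("A", "B"))          # False: same segment, frame stays local
print(bridge_forwards("A", "C"))          # True: destination on the far segment
print(router_forwards("udp", "A", "C"))   # False: destination blocked by policy
```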

One could also use some form of router to link a local network to another that was quite some distance away, forming a wide area network (WAN). Data can be routed either over dialed data links such as ISDN, in which the time is charged according to usage just like a telephone call, or over leased circuits. The choice would depend on the degree of usage and the relative costs. The Internet provides a means by which LANs are easily interconnected, although the data rate available will depend on the route, the service provider and the current traffic.

image

FIGURE 10.17
The ISO model for Open Systems Interconnection is arranged in seven layers, as shown here

Network communication is divided into a number of conceptual ‘layers’, each relating to an aspect of the communication protocol and interfacing correctly with the layers either side. The ISO seven-layer model for open systems interconnection (OSI) shows the number of levels at which compatibility between systems needs to exist before seamless interchange of data can be achieved (Figure 10.17). Communication begins at the application layer, and the data is passed down through the various stages to the layer most people understand: the physical layer, or the piece of wire over which the information is carried. Layers 3, 4 and 5 can be grouped under the broad heading of ‘protocol’, determining the way in which data packets are formatted and transferred. There is a strong similarity here with the exchange of data on physical media, as discussed earlier, where a range of compatibility layers from the physical to the application determine whether or not one device can read another’s disks.
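
The passing of data down through the layers can be pictured as successive encapsulation, each layer adding its own header before handing the result to the layer below. The sketch below is purely conceptual; real protocol stacks are considerably more involved:

```python
OSI_LAYERS = [
    "application", "presentation", "session",
    "transport", "network", "data link", "physical",
]

def encapsulate(payload: str) -> str:
    """Conceptual illustration: each layer wraps the data handed down from the layer above."""
    for layer in OSI_LAYERS:
        payload = f"[{layer} header]{payload}"
    return payload

print(encapsulate("audio file data"))
```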

Audio network requirements

The principal application of computer networks in audio systems is in the transfer of audio data files between workstations, or between workstations and a central ‘server’ which stores shared files. The device requesting the transfer is known as the ‘client’ and the device providing the data is known as the ‘server’. When a file is transferred in this way a byte-for-byte copy is reconstructed on the client machine, with the file name and any other header data intact. There are considerable advantages in being able to perform this operation at speeds in excess of real time for operations in which real-time feeds of audio are not the aim. For example, in a news editing environment a user might wish to retrieve a news story file from a remote disk drive in order to incorporate it into a report, this being needed as fast as the system is capable of transferring it. Alternatively, the editor might need access to remotely stored files, such as sound files on another person’s system, in order to work on them separately. In audio post-production for films or video there might be a central store of sound effects, accessible by everyone on the network, or it might be desired to pass on a completed portion of a project to the next stage in the post-production process.

Wired Ethernet is fast enough to transfer audio data files faster than real time, depending on network loading and speed. Switched Ethernet architectures allow the bandwidth to be more effectively utilized, by creating switched connections between specific source and destination devices. Approaches using FDDI or ATM are appropriate for handling large numbers of sound file transfers simultaneously at high speed. Unlike a real-time audio interface, the speed of transfer of a sound file over a packet-switched network (when using conventional file transfer protocols) depends on how much traffic is currently using it. If there is a lot of traffic then the file may be transferred more slowly than if the network is quiet (very much like motor traffic on roads). The file might be transferred erratically as traffic volume varies, with the file arriving at its destination in ‘spurts’. There therefore arises the need for network communication protocols designed specifically for the transfer of real-time data, which serve the function of reserving a proportion of the network bandwidth for a given period of time. This is known as engineering a certain ‘quality of service’.

Without real-time protocols the computer network cannot be relied upon for transferring audio where an unbroken audio output is to be reconstructed at the destination from the data concerned. The faster the network the more likely it is that one would be able to transfer a file fast enough to feed an unbroken audio output, but this should not be taken for granted. Even the highest speed networks can be filled up with traffic! This may seem unnecessarily careful until one considers an application in which a disk drive elsewhere on the network is being used as the source for replay by a local workstation, as illustrated in Figure 10.18. Here it must be possible to ensure guaranteed access to the remote disk at a rate adequate for real-time transfer, otherwise gaps will be heard in the replayed audio.

image

FIGURE 10.18
In this example of a networked system a remote disk is accessed over the network to provide data for real-time audio playout from a workstation used for on-air broadcasting. Continuity of data flow to the on-air workstation is of paramount importance here
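
To put a number on the continuity requirement, the receiving workstation must buffer at least as much audio as could be lost during the worst interruption in network delivery. A rough sketch, assuming linear PCM and an invented worst-case stall of 200 ms:

```python
def min_buffer_bytes(channels: int, fs_hz: int, bits: int, worst_gap_s: float) -> int:
    """Smallest buffer that keeps audio flowing through a delivery gap of worst_gap_s seconds."""
    bytes_per_second = channels * fs_hz * bits // 8
    return int(bytes_per_second * worst_gap_s)

# Example: stereo 24-bit/48 kHz playout that must survive a 200 ms network stall (assumed figure).
print(min_buffer_bytes(channels=2, fs_hz=48_000, bits=24, worst_gap_s=0.2), "bytes")  # 57600 bytes
```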

Protocols for the Internet

The common protocol for communication on the Internet is called TCP/IP (Transmission Control Protocol/Internet Protocol). This provides a connection-oriented approach to data transfer, allowing for verification of packet integrity, packet order and retransmission in the case of packet loss. At a more detailed level, as part of the TCP/IP structure, there are high-level protocols for transferring data in different ways. There is a file transfer protocol (FTP) used for downloading files from remote sites, a simple mail transfer protocol (SMTP) and a post office protocol (POP) for transferring email, and a hypertext transfer protocol (HTTP) used for interlinking sites on the world wide web (WWW). The WWW is a collection of file servers connected to the Internet, each with its own unique IP address (the method by which devices connected to the Internet are identified), upon which may be stored text, graphics, sounds and other data.

UDP (user datagram protocol) is a relatively low-level connectionless protocol that is useful for streaming audio over the Internet. Being connectionless, it does not require any handshaking between transmitter and receiver, so the overheads are very low and packets can simply be streamed from a transmitter without worrying about whether or not the receiver gets them. If packets are missed by the receiver, or received in the wrong order, there is little to be done about it except mute or replay distorted audio, but UDP can be efficient when bandwidth is low and quality of service is not the primary issue.
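
A minimal sketch using Python's standard socket module shows how little is involved in streaming over UDP: there is no connection setup and no acknowledgement. The address, port and packet size below are arbitrary examples:

```python
import socket

DEST = ("192.0.2.10", 5004)      # example address and port (arbitrary)
PACKET_BYTES = 960               # e.g. 5 ms of stereo 16-bit/48 kHz audio

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_chunk(audio_bytes: bytes) -> None:
    # Fire and forget: no handshake, no retransmission, no delivery guarantee.
    sock.sendto(audio_bytes, DEST)

send_chunk(b"\x00" * PACKET_BYTES)   # silence, purely for illustration
```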

Various real-time protocols have also been developed for use on the Internet, such as RTP (real-time transport protocol). Here packets are time-stamped and may be reassembled in the correct order and synchronized with a receiver clock. RTP does not guarantee quality of service or reserve bandwidth but this can be handled by a protocol known as RSVP (resource reservation protocol). RTSP is the real-time streaming protocol that manages more sophisticated functionality for streaming media servers and players, such as stream control (play, stop, fast-forward, etc.) and multicast (streaming to numerous receivers).
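
The fixed RTP header is only 12 bytes long, carrying a version number, payload type, sequence number, timestamp and source identifier (SSRC). A minimal sketch of building one follows; the payload type and SSRC values are arbitrary:

```python
import struct

def rtp_header(seq: int, timestamp: int, ssrc: int, payload_type: int = 96) -> bytes:
    """Build a 12-byte RTP fixed header (version 2, no padding, no extension, no CSRCs)."""
    byte0 = 2 << 6                      # version = 2
    byte1 = payload_type & 0x7F         # marker bit clear
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

header = rtp_header(seq=1, timestamp=48_000, ssrc=0x12345678)
print(len(header), header.hex())        # 12 bytes
```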

Audio-specific network standards

A number of proprietary systems have been developed for audio networking in recent years, including Audinate’s Dante, which can be licensed to other manufacturers, and Axia’s Livewire. Alternatively there are open technology solutions such as RAVENNA, to which a number of parties have signed up. Until recently there have been considerable difficulties in establishing interoperability between the various audio networking systems, even though many of them are based broadly on IP network protocols. Data is almost always transferred using the Real-time Transport Protocol (RTP), for example, although some proprietary solutions use UDP directly. Synchronization is very often achieved using IEEE 1588 Precision Time Protocol (PTP). Despite this, small differences in implementation often work against interoperability.

The AES X-192 project published a standard for audio network interoperability in 2013. The new standard, known as “AES67-2013, AES standard for audio applications of networks — High-performance streaming audio-over-IP interoperability”, concentrates on IP (Internet Protocol) networks for professional audio applications, which imply low latency and high bandwidth. It uses existing IP network standards and extends interoperability across medium-scale networks, such as entire campuses, by including elements of Layer 3 of the Internet protocol suite. It does not offer the very highest performance but occupies the middle ground between high-performance local networks and the public Internet: latency in AES67-based systems is intended to be 10 ms or below, for example. Synchronization is achieved using the Precision Time Protocol, to which an audio sample clock can be referred.
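
As an illustration of the sort of figures involved, the sketch below estimates the payload per packet and the audio-only bit rate of a multichannel stream, assuming a 1 ms packet time (a common choice in AES67 systems, taken here as an assumption) and ignoring IP/UDP/RTP header overhead:

```python
def aes67_stream_estimate(channels: int, fs_hz: int = 48_000, bits: int = 24,
                          packet_time_s: float = 0.001) -> tuple[int, int]:
    """Rough payload size per packet and audio-only bit rate for an RTP audio stream.
    A 1 ms packet time is assumed; IP/UDP/RTP header overhead is ignored."""
    samples_per_packet = int(fs_hz * packet_time_s)
    payload_bytes = samples_per_packet * channels * bits // 8
    audio_bitrate = channels * fs_hz * bits
    return payload_bytes, audio_bitrate

payload, rate = aes67_stream_estimate(channels=8)
print(payload, "bytes per packet,", rate / 1e6, "Mbit/s")   # 1152 bytes per packet, 9.216 Mbit/s
```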

The European Broadcasting Union (EBU) ACIP recommendations have been developed to standardize broadcast audio contributions over IP networks, as a replacement for ageing and increasingly redundant ISDN connections. The original ACIP interoperability standard is described in EBU Tech. Doc. 3326. It recognizes that the quality of service of the public Internet cannot be guaranteed, so robust coding is needed in order to deliver the material reliably. The ACIP group adopted a very simple solution based on Voice over Internet Protocol (VoIP), using the standard session initiation protocol (SIP) employed for setting up voice calls over the Internet, but with specified codecs. A list of recommended codecs is provided, and the basic G.711 and G.722 codecs are mandatory because they are specifically designed for VoIP and so are robust.

The recently developed IEEE 802.1 AVB (Audio Video Bridging) standards can be considered as a set of extensions to the original IEEE 802 network standards that defined Ethernet, designed to facilitate the transfer of real-time media data over Ethernet networks. AVB currently operates at Layer 2 of the network stack, relying on the physical MAC addresses of devices for packet forwarding. A number of different components are involved, including the Precision Time Protocol to ensure accurate synchronization of devices, and the ability to reserve bandwidth for specific time-sensitive data streams. Once a reservation has been made for such a stream, the same transport slots cannot be competed for by other types of data. Traffic shaping is a way of attempting to distribute the network load so that requested real-time streams do not overload the resources available.
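
The effect of stream reservation can be pictured as a simple admission check: a new stream is admitted only if the total reserved bandwidth stays within the share of the link set aside for time-sensitive traffic. The 75% ceiling below is an assumed figure for illustration, not a quotation from the standard:

```python
LINK_CAPACITY_BPS = 100_000_000        # 100 Mbit/s link
RESERVABLE_FRACTION = 0.75             # assumed ceiling for reserved streams

reserved: list[int] = []               # bandwidths of admitted streams, in bit/s

def admit_stream(requested_bps: int) -> bool:
    """Admit the stream only if total reservations stay within the reservable share."""
    if sum(reserved) + requested_bps <= LINK_CAPACITY_BPS * RESERVABLE_FRACTION:
        reserved.append(requested_bps)
        return True
    return False

print(admit_stream(30_000_000))   # True
print(admit_stream(30_000_000))   # True
print(admit_stream(30_000_000))   # False: would exceed the reserved-traffic ceiling
```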

The AVnu Alliance was formed to advance the Audio Video Bridging (AVB) networking standards in practice. There are various options in the AVB standards, and devices may not implement all of them, or may implement them in different ways, which means that interoperability cannot be guaranteed simply by a device ‘conforming’ to the standard. AVnu deals with this problem by introducing a certified base level of interoperability that gives end users the assurance that devices will talk to each other.

Storage area networks

An alternative setup involving the sharing of common storage by a number of workstations is the Storage Area Network (SAN). This employs a networking technology known as Fibre Channel, which can run at speeds of 4 Gbit/s and above, and can also employ fiber-optic links to allow long connections between shared storage and remote workstations. RAID arrays (see Chapter 9) are typically employed with SANs, and special software such as Apple’s Xsan is needed to enable multiple users to access the files on such common storage. iSCSI is another option for networked storage, carrying SCSI commands (see Chapter 9) over IP networks.

Wireless networks

Increasing use is made of wireless networks these days, the primary advantage being the lack of need for a physical connection between devices. There are various IEEE 802 standards for wireless networking, including 802.11, which covers wireless Ethernet or ‘Wi-Fi’. These typically operate in either the 2.4 GHz or 5 GHz radio frequency bands, at relatively low power, and use various interference reduction and avoidance mechanisms to enable networks to coexist with other services. It should, however, be recognized that wireless networks will never be as reliable as wired networks, owing to the conditions under which they operate. Critical applications requiring real-time streaming would do well to stick to wired networks, where the chances of experiencing drop-outs owing to interference or RF fading are almost non-existent. Wireless networks are, however, extremely convenient for mobile applications and for people moving around with computing devices, enabling reasonably high data rates to be achieved with the latest technology.

Bluetooth is one example of a wireless personal area network (WPAN) designed to operate over limited range at data rates of up to 1 Mbit/s. Within this there is the capacity for a number of channels of voice quality audio at data rates of 64kbit/s and asynchronous channels up to 723kbit/s. Taking into account the overhead for communication and error protection, the actual data rate achievable for audio communication is usually only sufficient to transfer data-reduced audio for a few channels at a time.

Audio over Firewire (IEEE 1394)

Firewire is an international standard serial data interface specified in IEEE 1394–1995. One of its key applications has been as a replacement for SCSI (Small Computer Systems Interface) for connecting disk drives and other peripherals to computers. It is extremely fast, running at rates of 100, 200 and 400 Mbit/s in its original form, with later versions extending this to as much as 3.2 Gbit/s. It is intended for optical fiber or copper interconnection. The S100 version has a maximum realistic data capacity of 65 Mbit/s, a maximum of 16 hops between nodes and no more than 63 nodes on up to 1024 separate buses. On the copper version there are three twisted pairs — data, strobe and power — and the interface operates in half-duplex mode, which means that communication in two directions is possible, but only in one direction at a time. Connections are ‘hot pluggable’ with auto-reconfiguration — in other words, one can connect and disconnect devices without turning off the power and the remaining system will reconfigure itself accordingly. It is also relatively cheap to implement. A more recent implementation, 1394c, allows the use of gigabit Ethernet connectors, which may improve the reliability and usefulness of the interface in professional applications.

Firewire combines features of network and point-to-point interfaces, offering both asynchronous and isochronous communication modes, so guaranteed latency and bandwidth are available if needed for time-critical applications. Communications are established between logical addresses, and the end point of an isochronous stream is called a ‘plug’. Logical connections between devices can be specified as either ‘broadcast’ or ‘point-to-point’. In the broadcast case either the transmitting or receiving plug is defined, but not both, and broadcast connections are unprotected in that any device can start and stop them. A primary advantage for audio applications is that point-to-point connections are protected — only the device that initiated a transfer can interfere with that connection, so once established the data rate is guaranteed for as long as the link remains intact. The interface can be used for real-time multichannel audio interconnections, file transfer, MIDI and machine control, carrying digital video, carrying any other computer data and connecting peripherals (e.g. disk drives).
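
Taking the realistic S100 payload figure quoted above, a rough upper bound on the number of 24-bit/48 kHz channels can be estimated; this sketch ignores isochronous packet headers and protocol overhead, so the practical figure is lower:

```python
S100_PAYLOAD_BPS = 65_000_000          # realistic S100 capacity quoted in the text

def max_pcm_channels(fs_hz: int = 48_000, bits: int = 24,
                     capacity_bps: int = S100_PAYLOAD_BPS) -> int:
    """Upper bound on channel count, ignoring packet and protocol overhead."""
    return capacity_bps // (fs_hz * bits)

print(max_pcm_channels())              # 56 channels (theoretical ceiling only)
```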

Originating partly in Yamaha’s ‘m-LAN’ protocol, the 1394 Audio and Music Data Transmission Protocol is now also available as an IEC PAS component of the IEC 61883 standard (a PAS is a publicly available specification that is not strictly defined as a standard but is made available for information purposes by organizations operating under given procedures). It offers a versatile means of transporting digital audio and MIDI control data.

Audio over Universal Serial Bus (USB)

The Universal Serial Bus is not the same as IEEE 1394, but it has some similar implications for desktop multimedia systems, including audio peripherals. USB has been jointly supported by a number of manufacturers including Microsoft, Digital, IBM, NEC, Intel and Compaq. Version 1.0 of the copper interface runs at a lower speed than 1394 (typically either 1.5 or 12 Mbit/s) and is designed to act as a low-cost connection for multiple input devices to computers such as joysticks, keyboards, scanners and so on. USB 2.0 runs at a higher rate of up to 480 Mbit/s and is supposed to be backwards-compatible with 1.0. USB 3.0, introduced in 2008, runs at up to 5 Gbit/s.

USB 1.0 supports up to 127 devices for both isochronous and asynchronous communication and can carry data over distances of up to 5 m per hop (similar to 1394). A hub structure is required for multiple connections to the host connector. Like 1394 it is hot pluggable and reconfigures the addressing structure automatically, so when new devices are connected to a USB setup the host device assigns a unique address. Limited power is available over the interface and some devices are capable of being powered solely using this source — known as ‘bus-powered’ devices — which can be useful for field operation of, say, a simple A/D convertor with a laptop computer.

The way in which audio is handled on USB is well defined and somewhat more clearly explained than the 1394 audio/music protocol. It defines three types of communication: audio control, audio streaming and MIDI streaming. Audio data transmissions fall into one of three types. Type 1 transmissions consist of channel-ordered PCM samples in consecutive subframes, whilst Type 2 transmissions typically contain non-PCM audio data that does not preserve a particular channel order in the bitstream, such as certain types of multichannel data-reduced audio stream. Type 3 transmissions are a hybrid of the two such that non-PCM data is packed into pseudo-stereo data words in order that clock recovery can be made easier.

Audio samples are transferred in subframes, each of which can be 1–4 bytes long (up to 24 bits resolution). An audio frame consists of one or more subframes, each of which carries a sample for one of the channels in the cluster (see below). As with 1394, a USB packet can contain a number of frames in succession, each containing a cluster of subframes. Frames are described by a format descriptor header that contains a number of bytes describing the audio data type, number of channels, subframe size, as well as information about the sampling frequency and the way it is controlled (for Type 1 data). An example of a simple audio frame would be one containing only two subframes of 24-bit resolution for stereo audio.
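
As a rough sketch of the packing described above, the following builds a single stereo Type 1 audio frame from two 24-bit samples; descriptor handling is omitted and little-endian byte order is assumed, as is usual on USB:

```python
def pack_type1_frame(left: int, right: int, bytes_per_sample: int = 3) -> bytes:
    """Pack one stereo audio frame as two consecutive subframes.
    Samples are signed integers; little-endian byte order is assumed."""
    frame = b""
    for sample in (left, right):
        frame += sample.to_bytes(bytes_per_sample, "little", signed=True)
    return frame

frame = pack_type1_frame(0x123456, -0x123456)
print(len(frame), frame.hex())     # 6 bytes: two 24-bit subframes
```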

Audio of a number of different types can be transferred in Type 1 transmissions, including PCM audio (two’s complement, fixed point), PCM-8 format (compatible with original 8 bit WAV, unsigned, fixed point), IEEE floating point, A-law and μ-law (companded audio corresponding to relatively old telephony standards). Type 2 transmissions typically contain data-reduced audio signals such as MPEG or AC-3 streams. Here the data stream contains an encoded representation of a number of channels of audio, formed into encoded audio frames that relate to a large number of original audio samples. An MPEG encoded frame, for example, will typically be longer than a USB packet (a typical MPEG frame might be 8 or 24ms long), so it is broken up into smaller packets for transmission over USB rather like the way it is streamed over the IEC 60958 interface described in Fact File 10.3.

Audio data for closely related synchronous channels can be clustered for USB transmission in Type 1 format. Up to 254 streams can be clustered and there are 12 defined spatial positions for reproduction, to simplify the relationship between channels and the loudspeaker locations to which they relate. The first six defined streams follow the internationally standardized order of surround sound channels for 5.1 surround, that is left, right, center, LFE (low frequency effects), left surround, right surround (see Chapter 17). Subsequent streams are allocated to other loudspeaker locations around a notional listener. Not all the spatial location streams have to be present but they are supposed to be presented in the defined order. Clusters are defined in a descriptor field that includes ‘bNrChannels’ (specifying how many logical audio channels are present in the cluster) and ‘wChannelConfig’ (a bit field that indicates which spatial locations are present in the cluster). If a bit is set then the corresponding location is present in the cluster. The bit allocations are shown in Table 10.4.
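
Constructing the ‘wChannelConfig’ bit field for a standard 5.1 cluster simply means setting the first six bits listed in Table 10.4. A minimal sketch using those bit positions:

```python
# Bit positions from Table 10.4 (D0-D5 cover the 5.1 loudspeaker layout).
CHANNEL_BITS = {
    "L": 0, "R": 1, "C": 2, "LFE": 3, "LS": 4, "RS": 5,
    "LC": 6, "RC": 7, "S": 8, "SL": 9, "SR": 10, "T": 11,
}

def channel_config(*channels: str) -> int:
    """Build the wChannelConfig bit field for the named spatial locations."""
    config = 0
    for name in channels:
        config |= 1 << CHANNEL_BITS[name]
    return config

surround_5_1 = channel_config("L", "R", "C", "LFE", "LS", "RS")
print(hex(surround_5_1))                                # 0x3f
print("bNrChannels =", bin(surround_5_1).count("1"))    # 6
```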

AES-47: audio over ATM

AES-47 defines a method by which linear PCM data, either conforming to AES-3 format or not, can be transferred over ATM (Asynchronous Transfer Mode) networks. There are various arguments for doing this, not least being the increasing use of ATM-based networks for data communications within the broadcasting industry and the need to route audio signals over longer distances than possible using standard digital interfaces. There is also a need for low latency, guaranteed bandwidth and switched circuits, all of which are features of ATM. Essentially an ATM connection is established in a similar way to making a telephone call. A SETUP message is sent at the start of a new ‘call’ that describes the nature of the data to be transmitted and defines its vital statistics. The AES-47 standard describes a specific professional audio implementation of this procedure that includes information about the audio signal and the structure of audio frames in the SETUP at the beginning of the call.

Table 10.4 Channel Identification in USB Audio Cluster Descriptor

Data bit Spatial location
D0 Left Front (L)
D1 Right Front (R)
D2 Center Front (C)
D3 Low Frequency Enhancement (LFE)
D4 Left Surround (LS)
D5 Right Surround (RS)
D6 Left of Center (LC)
D7 Right of Center (RC)
D8 Surround (S)
D9 Side Left (SL)
D10 Side Right (SR)
D11 Top (T)
D12…15 Reserved

RECOMMENDED FURTHER READING

1394 Trade Association, 2001. TA Document 2001003: Audio and Music Data Transmission Protocol 2.0.

AES, 2002. AES47-2002: Transmission of Digital Audio over Asynchronous Transfer mode Networks.

AES, 2011. AES50-2011: High-resolution multi-channel audio interconnection.

AES, 2013. AES67-2013: High performance streaming audio-over-IP interoperability.

IEC, 1998. IEC/PAS 61883-6. Consumer Audio/Video Equipment — Digital Interface — Part 6: Audio and Music Data Transmission Protocol.

IEEE, 1995. IEEE 1394: Standard for a High Performance Serial Bus.

Page, M., et al., 2002. Multichannel audio connection for direct stream digital. Presented at AES 113th Convention, Los Angeles, 5–8 October.

Rumsey, F., Watkinson, J., 2003. The Digital Interface Handbook, third edition. Focal Press.

USEFUL WEBSITES

Audio Engineering Society standards, www.aes.org

EBU ACIP, http://www.ebu-acip.org

IEEE 1394, www.1394ta.org

IEEE AVB, http://www.ieee802.org/1/pages/avbridges.html

Universal Serial Bus, www.usb.org
