6  Audio formats and data interchange

This chapter is concerned with common formats for the storage and interchange of digital audio. It includes coverage of the most widely encountered audio and edit list file formats, digital interfaces and networking protocols.

6.1 Audio file formats

6.1.1 Introduction

There used to be almost as many file formats for audio as there are days in the year. In the computer games field, for example, this is still true to some extent. For a long time the specific file storage strategy used for disk-based digital audio was the key to success in digital workstation design, because disk drives were relatively slow and needed clever strategies to ensure that they were capable of handling a sufficiently large number of audio channels. Manufacturers also worked very much in isolation and the size of the market was relatively small, leading to virtually every workstation or piece of software using a different file format for audio and edit list information.

There are still advantages in the use of filing structures specially designed for real-time applications such as audio and video editing, because one may obtain better performance from a disk drive in this way, but the need is not as great as it used to be. Interchange is becoming at least as important as, if not more important than, ultimate transfer speed, and the majority of hard disk drives available today are capable of replaying many channels of audio in real time without a particularly fancy storage strategy. Indeed a number of desktop systems simply use the native filing structure of the host computer (see Chapter 5). As the use of networked workstations grows, the need for files to be transferred between systems also grows, and, either by international standardisation or by sheer force of market dominance, certain file formats are becoming the accepted means by which data are exchanged. This is not to say that we will only be left with one or two formats, but that systems will have to be able to read and write files in the common formats if users are to be able to share work with others.

The recent growth in the importance of metadata (data about data), and the representation of audio, video and metadata as ‘objects’, has led to the development of interchange methods that are based on object-oriented concepts and project ‘packages’ as opposed to using simple text files and separate media files. There is increasing integration between audio and other media in multimedia authoring and some of the file formats mentioned below are closely related to international efforts in multimedia file exchange.

It is not proposed to attempt to describe all of the file formats in existence, because that would be a relatively pointless exercise and would not make for interesting reading. It is nonetheless useful to have a look at some examples taken from the most commonly encountered file formats, particularly those used for high quality audio by desktop and multimedia systems, since these are amongst the most widely used in the world and are often handled by audio workstations even if not their native format. It is not proposed to investigate the large number of specialised file formats developed principally for computer music on various platforms, nor the files used for internal sounds and games on many computers.

6.1.2 File formats in general

A data file is simply a series of data bytes formed into blocks and stored either contiguously or in fragmented form. In a sense files themselves are independent of the operating system and filing structure of the host computer, because a file can be transferred to another platform and still exist as an identical series of data blocks. It is the filing system that is often the platform- or operating-system-dependent entity. Some features of data files do relate directly to the operating system and filing system that created them (byte ordering and fork structure are two fairly fundamental examples), but these do not normally prevent such files being translated by other platforms.

For example, there are two approaches to byte ordering: the so-called little-endian order, in which the least significant byte comes first (at the lowest memory address), and the big-endian order, in which the most significant byte comes first (at the lowest memory address). These relate to the byte ordering used in data processing by the two most common microprocessor families and thereby to the two most common operating systems used in desktop audio workstations. Motorola processors, as used in the Apple Macintosh, deal in big-endian byte ordering, and Intel processors, as used in MS-DOS machines, deal in little-endian byte ordering. It is relatively easy to interpret files either way around, but it is necessary to know that there is a need to do so if one is writing software.
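
As a minimal illustration, the same two bytes read from a file yield different values depending on which byte order is assumed (shown here using Python's struct module; software reading 'foreign' files simply has to choose the appropriate interpretation):

    import struct

    raw = bytes([0x12, 0x34])              # two bytes as they appear in a file

    big = struct.unpack('>H', raw)[0]      # big-endian reading: 0x1234 = 4660
    little = struct.unpack('<H', raw)[0]   # little-endian reading: 0x3412 = 13330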

Secondly, Macintosh files may have two parts – a resource fork and a data fork, as shown in Figure 6.1, whereas Windows files only have one part. High level ‘resources’ are stored in the resource fork (used in some audio files for storing information about the file, such as signal processing to be applied, display information and so forth) whilst the raw data content of the file is stored in the data fork (used in audio applications for audio sample data). The resource fork is not always there, but may be. The resource fork can get lost when transferring such files between machines or to servers, unless Mac-specific protocols are used (e.g. MacBinary or BinHex).

Some data files include a ‘header’, that is a number of bytes at the start of the file containing information about the data that follows (see Figure 6.2). In audio systems this may include the sampling rate and resolution of the file. Audio replay would normally be started immediately after the header. On the other hand, some files are simply raw data, usually in cases where the format is fixed. ASCII text files are a well known example of raw data files – they simply begin with the first character of the text. More recently file structures have been developed that are really ‘containers’ for lots of smaller files, or data objects, each with its own descriptors and data. The RIFF structure, described in Section 6.1.6, is an early example of the concept of a ‘chunk-based’ file structure. Apple’s Bento container structure, used in OMFI, and the container structure of AAF are more advanced examples of such an approach.

The audio data in most common high-quality audio formats is stored in twos complement form (see Chapter 2) and the majority of files are used for 16- or 24-bit data, thus employing either two or three bytes per audio sample. Eight-bit files use one byte per sample.
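
As an illustration of how such samples are interpreted, the following sketch decodes two- or three-byte twos complement values in either byte order (Python's int.from_bytes performs the signed interpretation directly; the function name is illustrative only):

    def decode_sample(sample_bytes, big_endian=True):
        """Interpret 2 or 3 bytes as a signed (twos complement) audio sample."""
        return int.from_bytes(sample_bytes, 'big' if big_endian else 'little', signed=True)

    print(decode_sample(bytes([0xFF, 0xFF, 0xFE])))               # -2 as a 24-bit sample
    print(decode_sample(bytes([0x00, 0x80]), big_endian=False))   # -32768 as a 16-bit little-endian sample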

image

Figure 6.1 Macintosh files may have a resource and a data fork

image

Figure 6.2 Three different kinds of sound file. (a) A simple file containing only raw audio data (showing optional Mac resource fork). (b) A file that begins with a number of bytes of header, describing the audio data that follows. (c) A chunk-format file containing self-describing chunks, each fulfilling a different function

6.1.3 Sound Designer I format

Sound Designer files originate with the Californian company Digidesign, manufacturer of probably the world’s most widely used digital audio hardware for desktop computers. Many systems handle Sound Designer files because they were used widely for such purposes as the distribution of sound effects on CD-ROM and for other short music sample files. Detailed information about Digidesign file formats can be obtained if one wishes to become a third-party developer and the company exercises no particular secrecy in the matter.

The Sound Designer I format (SD I) is for mono sounds and it is recommended principally for use in storing short sounds. It originated on the Macintosh, so numerical data are stored in big-endian byte order but it has no resource fork. The data fork contains a header of 1336 bytes which is followed by the audio data bytes. The header contains information about how the sample should be displayed in Sound Designer editing software, including data describing vertical and horizontal scaling. It also contains details of ‘loop points’ for the file (these are principally for use with audio/MIDI sampling packages where portions of the sound are repeatedly cycled through while a key is held down, in order to sustain a note). The header contains information on the sample rate, sample period, number of bits per sample, quantisation method (e.g. ‘linear’, expressed as an ASCII string describing the method) and size of RAM buffer to be used.

The audio data are normally either 8- or 16-bit, and always MS byte followed by LS byte of each sample.

6.1.4 Sound Designer II format

Sound Designer II has been one of the most commonly used formats for audio workstations and has greater flexibility than SD I. Again it originated as a Mac file and unlike SD I it has a separate resource fork. The data fork contains only the audio data bytes in twos complement form, either 8 or 16 bits per sample. SD II files can contain audio samples for more than one channel, in which case the samples are interleaved, as shown in Figure 6.3, on a sample by sample basis (i.e. all the bytes for one channel sample followed by all the bytes for the next, etc.). It is unusual to find more than stereo data contained in SD II files and it is recommended that multichannel recordings are made using separate files for each channel. Some multichannel applications, when opening stereo SD II files, have first to split them into two mono files by deinterleaving the sample data before they can be used. This requires that there is sufficient free disk space for the purpose.
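
The deinterleaving operation itself is straightforward. The following is a minimal sketch, assuming 16-bit samples (two bytes per sample); the function and parameter names are illustrative only:

    def deinterleave_stereo(data, bytes_per_sample=2):
        """Split interleaved L/R sample data (L0 R0 L1 R1 ...) into two mono byte strings."""
        frame = bytes_per_sample * 2
        left = b''.join(data[i:i + bytes_per_sample] for i in range(0, len(data), frame))
        right = b''.join(data[i + bytes_per_sample:i + frame] for i in range(0, len(data), frame))
        return left, right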

image

Figure 6.3 Sound Designer II files allow samples for multiple audio channels to be interleaved. Four-channel, 16-bit example shown

Since Mac resource forks can be written separately from their associated data forks, it is possible to update the descriptive information about the file separately from the audio data. This can save a lot of time (compared with single-fork files such as SD I) if the file is long and the audio data has not changed, because it saves rewriting all the audio data at the same time. Only three resources are mandatory and others can be added by developers to customise the files for their own purposes. The mandatory ones are ‘sample size’ (number of bytes per sample), ‘sample rate’ and ‘channels’ (describing the number of audio channels in the file). Digidesign optionally adds other resources describing things like the timecode start point and frame rates originally associated with the file, for use in post-production applications.

6.1.5 AIFF and AIFF-C formats

The AIFF format is widely used as an audio interchange standard, because it conforms to the EA IFF 85 standard for interchange format files used for various other types of information such as graphical images. AIFF is an Apple standard format for audio data and is encountered widely on Macintosh-based audio workstations and some Silicon Graphics systems. It is claimed that AIFF is suitable as an everyday audio file format as well as an interchange format and some systems do indeed use it in this way. Audio information can be stored at a number of resolutions and for any number of channels if required, and the related AIFF-C (file type ‘AIFC’) format allows also for compressed audio data. It consists only of a data fork, with no resource fork, making it easier to transport to other platforms.

All IFF-type files are made up of ‘chunks’ of data which are typically made up as shown in Figure 6.4. A chunk consists of a header and a number of data bytes to follow. The simplest AIFF files contain a ‘common chunk’, which is equivalent to the header data in other audio files, and a ‘sound data chunk’ containing the audio sample data. These are contained overall by a ‘form chunk’ as shown in Figure 6.5. AIFC files must also contain a ‘version chunk’ before the common chunk to allow for future changes to AIFC.
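
A chunk can therefore be located without knowing anything about its contents, simply by reading its ID and length and skipping forward. The following hedged sketch walks the chunks of an AIFF-style file in this way (big-endian fields; the even-byte padding rule comes from the IFF specification rather than from the description above):

    import struct

    def walk_iff_chunks(path):
        """Print the ID and size of each chunk inside the FORM container of an AIFF file."""
        with open(path, 'rb') as f:
            form_id, form_size, form_type = struct.unpack('>4sI4s', f.read(12))
            print(form_id, form_size, form_type)       # e.g. b'FORM', length, b'AIFF'
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break
                ck_id, ck_size = struct.unpack('>4sI', header)
                print(ck_id, ck_size)
                f.seek(ck_size + (ck_size & 1), 1)     # chunks are padded to an even length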

The common chunk header information describes the number of audio channels, the number of audio samples per channel in the following sound chunk, bits per sample (anything from 1 to 32), sample rate, compression type ID (AIFC only, a register is kept by Apple), and a string describing the compression type (again AIFC only). The sound data chunk consists of twos complement audio data preceded by the chunk header, the audio samples being stored as either 1, 2, 3 or 4 bytes, depending on the resolution, interleaved for multiple channels in the same way as for SD II files (see Section 6.1.4). Samples whose resolution does not fill the whole number of bytes used should be left-justified (shifted towards the MSB), with the unused LSBs set to zero.

image

Figure 6.4 General format of an IFF file chunk

image

Figure 6.5 General format of an AIFF file

Optional chunks may also be included within the overall container chunk, such as marker information, comments, looping points and other information for MIDI samplers, MIDI data (see Chapter 4), AES channel status data (see Section 6.5.2), text and application-specific data.

6.1.6 RIFF WAVE format

The RIFF WAVE (often called WAV) format is the Microsoft equivalent of Apple’s AIFF. It has a similar structure, again conforming to the IFF pattern, but with numbers stored in little-endian rather than big-endian form. It is used widely for sound file storage and interchange on PC workstations, and for multimedia applications involving sound. Within WAVE files it is possible to include information about a number of cue points, and a playlist to indicate the order in which the cues are to be replayed. WAVE files use the file extension ‘.wav’.

A basic WAV file consists of three principal chunks, as shown in Figure 6.6, the RIFF chunk, the FORMAT chunk and the DATA chunk. The RIFF chunk contains 12 bytes, the first four of which are the ASCII characters ‘RIFF’, the next four indicating the number of bytes in the remainder of the file (after the first eight) and the last four of which are the ASCII characters ‘WAVE’. The format chunk contains information about the format of the sound file, including the number of audio channels, sampling rate and bits per sample, as shown in Table 6.1.

The audio data chunk contains a sequence of bytes of audio sample data, divided as shown in the FORMAT chunk. Unusually, if there are 8 bits per sample or fewer, each value is unsigned and ranges between 0 and 255 (decimal), whereas at higher resolutions the data are signed and range both positively and negatively around zero. Audio samples are interleaved by channel in time order, so that if the file contains two channels a sample for the left channel is followed immediately by the associated sample for the right channel. The same is true of multiple channels (one sample for each channel is inserted at a time for each time-coincident sample period, starting with the lowest numbered channel), although basic WAV files were nearly always just mono or 2-channel.

image

Figure 6.6 Diagrammatic representation of a simple RIFF WAVE file, showing the three principal chunks. Additional chunks may be contained within the overall structure, for example a ‘bext’ chunk for the Broadcast WAVE file

Table 6.1 Contents of FORMAT chunk in a basic WAVE PCM file

Byte    ID               Contents
0–3     ckID             ‘fmt_’ (ASCII characters)
4–7     nChunkSize       Length of FORMAT chunk (binary, hex value: &00000010)
8–9     wFormatTag       Audio data format (e.g. &0001 = WAVE format PCM). Other formats are allowed, for example IEEE floating point and MPEG format (&0050 = MPEG 1)
10–11   nChannels        Number of channels (e.g. &0001 = mono, &0002 = stereo)
12–15   nSamplesPerSec   Sample rate (binary, in Hz)
16–19   nAvgBytesPerSec  Average bytes per second
20–21   nBlockAlign      Bytes per sample frame across all channels (block alignment): e.g. &0001 = 8-bit mono; &0002 = 8-bit stereo or 16-bit mono; &0004 = 16-bit stereo
22–23   nBitsPerSample   Bits per sample
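
As a hedged sketch, assuming the FORMAT (‘fmt_’) chunk has already been located by walking the RIFF chunks and that the file is basic PCM, the 16-byte body of the chunk can be unpacked little-endian as follows; the field names follow Table 6.1:

    import struct

    def parse_wave_format_chunk(body):
        """Decode the 16-byte body of a basic PCM FORMAT chunk (fields as in Table 6.1)."""
        (format_tag, channels, samples_per_sec,
         avg_bytes_per_sec, block_align, bits_per_sample) = struct.unpack('<HHIIHH', body[:16])
        return {
            'format_tag': format_tag,              # &0001 = WAVE format PCM
            'channels': channels,
            'sample_rate': samples_per_sec,
            'avg_bytes_per_sec': avg_bytes_per_sec,
            'block_align': block_align,            # bytes per sample frame across all channels
            'bits_per_sample': bits_per_sample,
        }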

The RIFF WAVE format is extensible and can have additional chunks to define enhanced functionality such as surround sound and other forms of coding. This is known as ‘WAVE-format extensible’. Chunks can include data relating to cue points, labels and associated data, for example.

6.1.7 WAVE-format extensible

In order to enable the extension of the WAVE format to contain new audio formats such as certain types of surround sound and data-reduced audio (e.g. MPEG, Dolby Digital), WAVE-format extensible has a means of referring to globally unique identifiers (GUIDs) and sub-format chunks that can be vendor-specific. The ‘format’ chunk is extended to describe the additional content of the file, with a ‘cbSize’ descriptor at the end of the standard format chunk followed by the additional bytes describing the extended format. The details of this are too complex to describe here, and interested readers will find more information on the Microsoft website at www.microsoft.com/hwdev/tech/audio/multichaud.asp. The various sub-format possibilities include the option to define alternative coding formats for surround sound data that are not tied to the loudspeaker locations described below, such as B-format Ambisonic signals.

One of the ambiguities that needed to be resolved was the mapping of the multiple channels contained within a file to loudspeaker locations, for speaker-feed-oriented multichannel formats. This has been achieved by defining a standard ordering of the loudspeaker locations concerned and including a channel mask word in the format chunk that indicates which channels are present. Although it is not necessary for every loudspeaker location’s channel to be present in the file, the samples should be presented in this order, leaving out missing channels. The first 12 of these locations correspond to the ordering of loudspeaker channels in the USB 1.0 specification (see Section 6.7.2), as shown in Table 6.2.

Secondly, the extensible format defines more clearly the alignment between audio sample information and the byte structure of the WAVE file, so that audio sample resolutions that do not fit exactly within a whole number of bytes can be handled unambiguously. Here, wBitsPerSample must be a multiple of eight and a new field defines how many of those bits are actually used. Samples are then MSB-justified.
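
The following hedged sketch shows how such a channel mask word might be built, assuming that bit n of the mask corresponds to position n in the ordering of Table 6.2 (which is how the Microsoft definition works); the position abbreviations are those used in the table:

    # Loudspeaker position abbreviations in the standard order of Table 6.2.
    SPEAKER_ORDER = ['L', 'R', 'C', 'LFE', 'LS', 'RS', 'LC', 'RC', 'BC',
                     'SL', 'SR', 'T', 'TFL', 'TFC', 'TFR', 'TBL', 'TBC', 'TBR']

    def channel_mask(positions_present):
        """Build a channel mask word in which bit n is set if the loudspeaker
        location at position n in the standard ordering is present in the file."""
        mask = 0
        for name in positions_present:
            mask |= 1 << SPEAKER_ORDER.index(name)
        return mask

    print(hex(channel_mask(['L', 'R', 'C', 'LFE', 'LS', 'RS'])))   # 0x3f for a 5.1 layout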

6.1.8 Broadcast WAVE format

The Broadcast WAVE format, described in EBU Tech. 3285, was standardised by the European Broadcasting Union (EBU) because of a need to ensure compatibility of sound files and accompanying information when transferred between workstations. It is based on the RIFF WAVE format described above, but contains an additional chunk that is specific to the format (the ‘broadcast_audio_extension’ chunk, ID = ‘bext’) and also limits some aspects of the WAVE format. Version 0 was published in 1997 and Version 1 in 2001, the only difference being the addition of a SMPTE UMID (unique material identifier) in Version 1 (this is a form of metadata). Such files currently only contain either PCM or MPEG-format audio data.

Broadcast WAVE files contain at least three chunks: the broadcast_audio_extension chunk, the format chunk and the audio data chunk. The broadcast extension chunk contains the data shown in Table 6.3. Optionally files may also contain further chunks for specialised purposes and may contain chunks relating to MPEG audio data (the ‘fact’ and ‘mpeg_audio_extension’ chunks). MPEG applications of the format are described in EBU Tech. 3285, Supplement 1 and the audio data chunk containing the MPEG data normally conforms to the MP3 frame format described in Section 6.1.9.

Table 6.2 Channel ordering of WAVE format extensible audio data

Channel Spatial location
0 Left Front (L)
1 Right Front (R)
2 Center Front (C)
3 Low Frequency Enhancement (LFE)
4 Left Surround (LS)
5 Right Surround (RS)
6 Left of Center (LC)
7 Right of Center (RC)
8 Back center (BC)
9 Side Left (SL)
10 Side Right (SR)
11 Top (T)
12 Top Front Left (TFL)
13 Top Front Center (TFC)
14 Top Front Right (TFR)
15 Top Back Left (TBL)
16 Top Back Center (TBC)
17 Top Back Right (TBR)

Table 6.3 Broadcast_audio_extension chunk format

Data                 Size (bytes)  Description
ckID                 4             Chunk ID = ‘bext’
ckSize               4             Size of chunk
Description          256           Description of the sound clip
Originator           32            Name of the originator
OriginatorReference  32            Unique identifier of the originator (issued by the EBU)
OriginationDate      10            ‘yyyy-mm-dd’
OriginationTime      8             ‘hh-mm-ss’
TimeReferenceLow     4             Low-order 32 bits of the sample count since midnight for the first sample in the file
TimeReferenceHigh    4             High-order 32 bits of the sample count since midnight for the first sample in the file
Version              2             BWF version number, e.g. &0001 is Version 1
UMID                 64            UMID according to SMPTE 330M. If only a 32-byte UMID is used then the second half should be padded with zeros
Reserved             190           Reserved for extensions. Set to zero in Version 1
CodingHistory        Unrestricted  A series of ASCII strings, each terminated by CR/LF (carriage return, line feed), describing each stage of the audio coding history, according to EBU R-98

A multichannel extension chunk has recently been proposed for Broadcast WAVE files that defines the channel ordering, surround format, downmix coefficients for creating a two-channel mix, and some descriptive information. There are also chunks defined for metadata describing the audio contained within the file, such as the ‘quality chunk’ (ckID = ‘qlty’), which together with the coding history contained in the ‘bext’ chunk make up the so-called ‘capturing report’. These are described in Supplement 2 to EBU Tech. 3285. Finally there is a chunk describing the peak audio level within a file, which can aid automatic programme level setting and programme interchange.
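
As an illustration of the time reference fields in Table 6.3, the two 32-bit values combine into a single 64-bit sample count since midnight, which can be expressed as a time of day once the sampling frequency is known. A hedged sketch follows; the function name is illustrative only:

    def bwf_time_reference(low, high, sample_rate):
        """Combine TimeReferenceLow/High into a 64-bit sample count since midnight
        and express it as a time of day."""
        samples = (high << 32) | low
        seconds = samples / sample_rate
        h, m, s = int(seconds // 3600), int(seconds % 3600 // 60), seconds % 60
        return samples, '%02d:%02d:%06.3f' % (h, m, s)

    print(bwf_time_reference(1_728_000_000, 0, 48000))   # a file timestamped at 10:00:00.000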

6.1.9 MPEG audio file formats

It is possible to store MPEG-compressed audio in AIFF-C or WAVE files, with the compression type noted in the appropriate header field. There are also older MS-DOS file extensions used to denote MPEG audio files, notably .MPA (MPEG Audio) or .ABS (Audio Bit Stream). However, owing to the ubiquity of the so-called ‘MP3’ format (MPEG 1, Layer 3) for audio distribution on the Internet, MPEG audio files are increasingly denoted with the extension ‘.MP3’. Such files are relatively simple, being really no more than MPEG audio frame data in sequence, each frame being preceded by a frame header. Although the frame header at the beginning of the file might be considered to relate to all the remaining audio information, there is the possibility that settings may change during the course of replay. For example, the bit rate can change in variable bit rate modes, or joint stereo coding might be switched on, so each frame header should ideally be correctly decoded. The following describes the basic format of .MP3 files and Figure 6.7 shows the structure of a typical MPEG frame.

Layer 1 frames correspond to 384 original PCM samples (8 ms at a sampling rate of 48 kHz), and Layer 2 and 3 frames correspond to 1152 PCM samples (24 ms @ 48 kHz). The frame consists of a 32-bit header, a 16-bit CRC check word, the audio data (consisting of subband samples, appropriate scale factors and information concerning the bit allocation to different parts of the spectrum) and an ancillary data field whose length is currently unspecified. The 32-bit header of each frame consists of the information shown in Table 6.4.

image

Figure 6.7 MPEG-Audio frame format

MPEG files can usually be played from pretty well anywhere in the file by looking for the next frame header. Layer 1 and 2 frames are self-contained and can be decoded immediately, but Layer 3 (the only one that should strictly be called MP3) is slightly different and can take up to nine frames before the decoding can be completed correctly. This is because of the bit reservoir technique that is used to share bits optimally between a series of frames.

A so-called ‘ID3 tag’ has been included in many MP3 files as a means of describing the contents of the file, such as the title, artist and so forth. This is usually found in the last 128 bytes of the whole file and should not be decoded as audio. It does not begin with a frame header sync pattern, so most software will not attempt to decode it as audio. A simple tag has the typical format shown in Table 6.5.

Table 6.4 MPEG audio frame header

Function No. of bits Description
Sync word 11 All set to binary ‘1’ to act as a synchronisation pattern. (Technically the first 12 bits were intended to be the sync word and the following ID a single bit, but see MPEG 2.5 below)
ID bits 2 Indicates the ID of the algorithm in use: ‘11’ = MPEG 1 (ISO 11172-3); ‘10’ = MPEG 2 (ISO 13818-3); ‘01’ = reserved; ‘00’ = MPEG 2.5 (an unofficial version that allowed even lower bit rates than specified in the original MPEG 2 standard, by using lower sampling frequencies (8, 11 and 12 kHz))
Layer 2 Indicates the MPEG layer in use: ‘11’ = Layer 1; ‘10’ = Layer 2; ‘01’ = Layer 3; ‘00’ = reserved
Protection bit 1 Indicates whether error correction data has been added to the audio bitstream (‘0’ if yes)
Bitrate index 4 Indicates the total bit rate of the channel according to a table which relates the state of these 4 bits to rates in each layer
Sampling frequency 2 Indicates the original PCM sampling frequency: ‘00’ = 44.1 kHz; ‘01’ = 48 kHz; ‘10’ = 32 kHz
Padding bit 1 Indicates in state ‘1’ that a slot has been added to the frame to make the average bit rate of the data-reduced channel relate exactly to the original sampling rate
Private bit 1 Available for private use
Mode 2 ‘00’ = stereo; ‘01’ = joint stereo; ‘10’ = dual channel; ‘11’ = single channel
Mode extension 2 Used for further definition of joint stereo coding mode to indicate either which bands are coded in joint stereo, or which type of joint coding is to be used
Copyright 1 ‘1’ = copyright protected
Original/copy 1 ‘1’ = original; ‘0’ = copy
Emphasis  2 Indicates audio pre-emphasis type: ‘00’ = none; ‘01’ = 50/15 μs; ‘11’ = CCITT J17
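
A hedged sketch of how the fields of Table 6.4 might be extracted from a 32-bit frame header word is shown below; the bit positions follow the table, with the sync word occupying the 11 most significant bits:

    def parse_mpeg_header(word):
        """Split a 32-bit MPEG audio frame header into the fields of Table 6.4."""
        if word >> 21 != 0x7FF:                      # 11-bit sync word, all ones
            raise ValueError('no sync word')
        return {
            'id':             (word >> 19) & 0x3,    # '11' = MPEG 1, '10' = MPEG 2, '00' = MPEG 2.5
            'layer':          (word >> 17) & 0x3,    # '11' = Layer 1 ... '01' = Layer 3
            'protection':     (word >> 16) & 0x1,    # '0' = error correction data added
            'bitrate_index':  (word >> 12) & 0xF,
            'sampling_freq':  (word >> 10) & 0x3,    # '00' = 44.1 kHz, '01' = 48 kHz, '10' = 32 kHz
            'padding':        (word >>  9) & 0x1,
            'private':        (word >>  8) & 0x1,
            'mode':           (word >>  6) & 0x3,
            'mode_extension': (word >>  4) & 0x3,
            'copyright':      (word >>  3) & 0x1,
            'original':       (word >>  2) & 0x1,
            'emphasis':        word        & 0x3,
        }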

Table 6.5 A simple ID3 tag structure

Sign Length (bytes) Description
A 3 Tag identification. Normally ASCII ‘TAG’.
B 30 Title
C 30 Artist
D 30 Album
E 4 Year
F 30 Comment string (only 28 bytes followed by a zero byte in some versions)
G 1 This may represent the track number, or may be part of the comment string
H  1 Genre

Table 6.6 MPEG file ID3v2 tag header

Function No. of bytes Description
File identifier 3 ASCII ‘ID3’
Version 2 & 03 00 (major version then revision number)
Flags 1 Binary ‘abc00000’
a = unsynchronisation used when set
b = extended header present when set
c = flag for indicating experimental version of tag
Size 4 Four lots of 0xxxxxxx, concatenated ignoring the MSB of each byte to make a 28-bit word that indicates the length of the tag after the header and including any unsynchronisation bytes

ID3v2 is a much more developed and extended ID3 tagging structure that carries a range of information contained within frames of data, each frame having its own header. The ID3 version 2 header format is shown in Table 6.6. Patterns of data within the ID3 information that might look like an audio sync pattern are dealt with using a method known as unsynchronisation, which modifies the tag to prevent the sync pattern occurring. It does this by inserting an extra zero-valued byte after the first byte of the false sync pattern.
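
As a small illustration of the size field in Table 6.6, the four bytes are combined by discarding the MSB of each and concatenating the remaining 7-bit groups into a 28-bit length (a minimal sketch):

    def id3v2_tag_size(size_bytes):
        """Decode the four header size bytes (MSB of each ignored) into the 28-bit
        tag length described in Table 6.6."""
        size = 0
        for b in size_bytes:
            size = (size << 7) | (b & 0x7F)
        return size

    print(id3v2_tag_size(bytes([0x00, 0x00, 0x02, 0x01])))   # (2 << 7) | 1 = 257 bytes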

6.1.10 DSD-IFF file format

Sony and Philips have introduced Direct Stream Digital (DSD), as already discussed in Section 2.7, as an alternative high-resolution format for audio representation. The DSD-IFF file format is based on a similar structure to other IFF-type files, described above, except that it is modified slightly to allow for the large file sizes that may be encountered. Specifically, the container FORM chunk is labelled ‘FRM8’ and this identifies all local chunks that follow as having ‘length’ indications that are 8 bytes long rather than the normal 4. In other words, rather than a 4-byte chunk ID followed by a 4-byte length indication, these files have a 4-byte ID followed by an 8-byte length indication. This allows for the definition of chunks with a length greater than 2 Gbytes, which may be needed for mastering Super Audio CDs.

In such a file, local mandatory chunks are the format version chunk (‘FVER’), the property chunk (‘PROP’), containing information such as sampling frequency, number of channels and loudspeaker configuration, and at least one DSD or DST (direct stream transfer – the losslessly compressed version of DSD) sound data chunk (‘DSD’ or ‘DST’). There are also various optional chunks that can be used for exchanging more detailed information and comments such as might be used in project interchange. Further details of this file format, and an excellent guide to the use of DSD-IFF in project applications, can be found in the DSD-IFF specification, as described in the Further reading at the end of this chapter.

6.1.11 Edit decision list (EDL) files

EDL formats have usually been unique to the workstation on which they are used but the need for open interchange is increasing the pressure to make EDLs transportable between packages. There is an old and widely used format for EDLs in the video world that is known as the CMX-compatible form. CMX is a well-known manufacturer of video editing equipment and most editing systems will read CMX EDLs for the sake of compatibility. These can be used for basic audio purposes, and indeed a number of workstations can read CMX EDL files for the purpose of auto-conforming audio edits to video edits performed on a separate system. The CMX list defines the cut points between source material and the various transition effects at joins, and it can be translated reasonably well for the purpose of defining audio cut points and their timecode locations, using SMPTE/EBU form, provided video frame accuracy is adequate.

Software can be obtained for audio and video workstations that translates EDLs between a number of different standards to make interchange easier, although it is clear that this process is not always problem-free and good planning of in-house processes is vital. The OMFI structure also contains a format for interchanging edit list data, as described below. AES 31 (see below) is now gaining considerable popularity among workstation software manufacturers as a simple means of exchanging audio editing projects between systems. The Advanced Authoring Format (AAF) is becoming increasingly relevant to the exchange of media project data between systems, and is likely to take over from OMFI as time progresses.

6.1.12 AES 31 format

AES 31 is an international standard designed to enable straightforward interchange of audio files and projects between systems. Audio editing packages are increasingly offering AES 31 as a simple interchange format for edit lists. In Part 1 the standard specifies a disk format that is compatible with the FAT32 file system, a widely used structure for the formatting of computer hard disks. Part 2 is not finalised at the time of writing but is likely to describe an audio file format closely based on the Broadcast WAVE format. Part 3 describes simple project interchange, including a format for the communication of edit lists using ASCII text that can be parsed by a computer as well as read by a human. The basis of this is the edit decision markup language (EDML). It is not necessary to use all the parts of AES 31 to make a satisfactory interchange of elements. For example, one could exchange an edit list according to part 3 without using a disk based on part 1. Adherence to all the parts would mean that one could take a removable disk from one system, containing sound files and a project file, and the project would be readable directly by the receiving device.

EDML documents are limited to a 7-bit ASCII character set in which white space delimits fields within records. Standard carriage return (CR) and line-feed (LF) characters can be included to aid the readability of lists but they are ignored by software that might parse the list. An event location is described by a combination of timecode value and sample count information. The timecode value is represented in ASCII using conventional hours, minutes, seconds and frames (e.g. HH:MM:SS:FF) and the optional sample count is a four-figure number denoting the number of samples after the start of the frame concerned at which the event actually occurs. This enables sample-accurate edit points to be specified. It is slightly more complicated than this because the ASCII delimiters between the timecode fields are changed to indicate various parameters:

HH:MM delimiter = Frame count and timebase indicator (see Table 6.7)

MM:SS delimiter = Film frame indicator (if not applicable, use the previous delimiter)

SS:FF delimiter = Video field and timecode type (see Table 6.8)

The delimiter before the sample count value is used to indicate the audio sampling frequency, including all the pull-up and pull-down options (e.g. fs times 1/1.001). There are too many of these possibilities to list here and the interested reader is referred to the standard for further information. This is an example of a timecode and (after the slash denoting 48 kHz sampling frequency) optional sample count value:

Table 6.7 Frame count and timebase indicator coding in AES 31

image

Table 6.8 Video field and timecode type indicator in AES 31

image

14:57:24.03/0175

The Audio Decision List (ADL) is contained between two ASCII keyword tags <ADL> and </ADL>. It in turn contains a number of sections, each contained within other keyword tags such as <VERSION>, <PROJECT>, <SYSTEM> and <SEQUENCE>. The edit points themselves are contained in the <EVENT_LIST> section. Each event begins with the ASCII keyword ‘(Entry)’, which serves to delimit events in the list, followed by an entry number (32-bit integer, incrementing through the list) and an entry type keyword to describe the nature of the event (e.g. ‘(Cut)’). Each different event type then has a number of bytes following that define the event more specifically. The following is an example of a simple cut edit, as suggested by the standard:

(Entry) 0010 (Cut) F “FILE://VOL/DIR/FILE” 1 1 03:00:00;00/0000 01:00:00:00/0000 01:00:10:00/0000_

This sequence essentially describes a cut edit, entry number 0010, the source of which is the file (F) with the path shown, using channel 1 of the source file (or just a mono file), placed on track 1 of the destination timeline, starting at timecode three hours in the source file, placed to begin at one hour in the destination timeline (the ‘in point’) and to end ten seconds later (the ‘out point’). Some workstation software packages store a timecode value along with each sound file to indicate the nominal start time of the original recording (e.g. BWF files contain a timestamp in the ‘bext’ chunk), otherwise each sound file is assumed to start at time zero.
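
To illustrate how an event location resolves to a sample-accurate position, the following hedged sketch converts a field such as ‘14:57:24.03/0175’ into an absolute sample count. The frame rate and sampling frequency are assumed here; a real parser would derive them from the delimiter characters (Tables 6.7 and 6.8) and from the delimiter before the sample count:

    import re

    def event_location_to_samples(field, frame_rate=25, sample_rate=48000):
        """Convert an AES 31 event location (timecode plus optional sample count)
        into an absolute sample count from midnight. Assumes a whole number of
        samples per frame, as is the case for 25 fps at 48 kHz."""
        m = re.match(r'(\d\d).(\d\d).(\d\d).(\d\d)(?:/(\d{4}))?$', field)   # '.' matches any delimiter
        hh, mm, ss, ff = (int(m.group(i)) for i in range(1, 5))
        frames = ((hh * 60 + mm) * 60 + ss) * frame_rate + ff
        return frames * (sample_rate // frame_rate) + int(m.group(5) or 0)

    print(event_location_to_samples('14:57:24.03/0175'))   # 2584517935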

It is assumed that default crossfades will be handled by the workstation software itself. Most workstations introduce a basic short crossfade at each edit point to avoid clicks, but this can be modified by ‘event modifier’ information in the ADL. Such modifiers can be used to adjust the shape and duration of a fade in or fade out at an edit point. There is also the option to point at a rendered crossfade file for the edit point, as described in Chapter 3.

6.1.13 The Open Media Framework Interchange (OMFI)

OMFI was introduced in 1994 by Avid Technology, an American company specialising in desktop audio and video post-production systems (now merged with Digidesign). It was an attempt to define a common standard for the interchange of audio, video, edit list and other multimedia information between workstations running on different platforms. It was in effect a publicly available format and Avid did not charge licensing fees of any kind, OMFI being a means of trying to encourage greater growth in this field of the industry as a whole. A number of other manufacturers signed up to support OMF and worked jointly on its development. Avid makes available an OMF Interchange Toolkit at moderate cost for developers who want to build OMF compatibility into their products. The company is gradually migrating from OMFI to a new format called AAF (Advanced Authoring Format) that is supported by a wide range of multimedia vendors. Parts of OMFI 2.0 have apparently been incorporated within AAF.

The OMFI 1.0 specification was lengthy and dealt with descriptions of the various types of information that could be contained and the methods of containment. It also contained details of compositions and the ways in which edit timing data should be managed. Version 1.0 of OMFI was very video oriented and specified no more for audio than the two common formats for the audio data files and a means of specifying edit points and basic crossfade durations (but not the shape). Compared with AES 31, OMFI is much more difficult for programmers to understand because it is much more expandable and versatile, being based on object-oriented concepts rather than being a simple text-based description of the project. OMFI 2.0 is yet more involved and links information in a different way to 1.0, the two being incompatible. As far as the audio user is concerned, the 1.0 version specifies that the common audio formats to be used are the uncompressed versions of either the AIFF format or the WAVE format (see above), depending on the intended hardware platform. It also allows for the possibility that manufacturers might want to specify ‘private’ interchange formats of their own. The format apparently limits audio resolution to 16 bits for interchange, but there is no particular reason why this should be so and some programmers have modified the Toolkit code to accommodate 24-bit audio files. Most of the version 1.0 document referred to video operations, so cuts and effects were all described in video terms.

OMFI 1.0 projects contain two types of information: ‘compositions’ and ‘sources’. Compositions specify how the various sources are to be assembled in order to play the finished product. Source data (audio, video, or other multimedia files) may be stored either in separate files, referenced by the OMF file, or within the OMF container structure. The container structure is similar to the IFF model described above (indeed Avid originally started to use IFF), in that it contains a number of self-describing parts, and is called Bento (an Apple development). Each part of the OMFI file is complete in itself and can be handled independently of the other parts – indeed applications do not need to be able to deal with every component of an OMFI file – allowing different byte ordering for different parts if required. Systems may claim OMFI compatibility yet still not be able to deal with some of the data objects contained within the file, requiring care in implementation and some discipline in the use of OMFI within organisations. The fact that OMFI used Apple’s Bento container was one of the problems encountered by the AES when attempting to standardise editing project interchange for the AES 31 standard. Since Bento is not an open standard and is Apple’s proprietary technology, AES apparently could not adopt OMFI directly.

6.1.14 MXF – the Material Exchange Format

MXF was developed by the Pro-MPEG forum as a means of exchanging audio, video and metadata between devices, primarily in television operations. It is based on the modern concept of media objects that are split into ‘essence’ and ‘metadata’. Essence files are the raw material (i.e. audio and video) and the metadata describes things about the essence (such as where to put it, where it came from and how to process it).

MXF files attempt to present the material in a ‘streaming’ format, that is one that can be played out in real time, but they can also be exchanged in conventional file transfer operations. As such they are normally considered to be finished program material, rather than material that is to be processed somewhere downstream, designed for playout in broadcasting environments. The bit stream is also said to be compatible with recording on digital videotape devices.

6.1.15 AAF – the Advanced Authoring Format

AAF is an authoring format for multimedia data that is supported by numerous vendors, including Avid which has adopted it as a migration path from OMFI (see above). Parts of OMFI 2.0 form the basis for parts of AAF and there are also close similarities between AAF and MXF (described in the previous section). Like the formats to which it has similarities, AAF is an object-oriented format that combines essence and metadata within a container structure. Unlike MXF it is designed for project interchange such that elements within the project can be modified, post-processed and resynchronised. It is not, therefore, directly suitable as a streaming format but can easily be converted to MXF for streaming if necessary.

Rather like OMFI it is designed to enable complex relationships to be described between content elements, to map these elements onto a timeline, to describe the processing of effects, synchronise streams of essence, retain historical metadata and refer to external essence (essence not contained within the AAF package itself). It has three essential parts: the AAF Object Specification (which defines a container for essence and metadata, the logical contents of objects and rules for relationships between them); the AAF Low-Level Container Specification (which defines a disk filing structure for the data, based on Microsoft’s Structured Storage); and the AAF SDK Reference Implementation (which is a software development kit that enables applications to deal with AAF files). The Object Specification is extensible in that it allows new object classes to be defined for future development purposes.

image

Figure 6.8 Graphical conceptualisation of some metadata package relationships in AAF: a simple audio post-production example

The basic object hierarchy is illustrated in Figure 6.8, using an example of a typical audio post-production scenario. ‘Packages’ of metadata are defined that describe either compositions, essence or physical media. Some package types are very ‘close’ to the source material (they are at a lower level in the object hierarchy, so to speak) – for example a ‘file source package’ might describe a particular sound file stored on disk. The metadata package, however, would not be the file itself, but it would describe its name and where to find it. Higher level packages would refer to these lower level packages in order to put together a complex program. A composition package is one that effectively describes how to assemble source clips to make up a finished program. Some composition packages describe effects that require a number of elements of essence to be combined or processed in some way.

Packages can have a number of ‘slots’. These are a bit like tracks in more conventional terminology, each slot describing only one kind of essence (e.g. audio, video, graphics). Slots can be static (not time-dependent), timeline (running against a timing reference) or event-based (one-shot, triggered events). Slots have segments that can be source clips, sequences, effects or fillers. A source clip segment can refer to a particular part of a slot in a separate essence package (so it could refer to a short portion of a sound file that is described in an essence package, for example).

6.2 Disk pre-mastering formats

The original tape format for submitting CD masters to pressing plants was Sony’s audio-dedicated PCM 1610/1630 format on U-matic video tape. This is now ‘old technology’ and has been replaced by alternatives based on more recent data storage media and file storage protocols. These include the PMCD (pre-master CD), CD-R, Exabyte and DLT tape formats. DVD mastering also requires high-capacity media for transferring the many gigabytes of information to mastering houses in order that glass masters can be created.

The Disk Description Protocol (DDP) developed by Doug Carson and Associates is now widely used for describing disk masters. Version 1 of the DDP laid down the basic data structure but said little about higher level issues involved in interchange, making it more than a little complicated for manufacturers to ensure that DDP masters from one system would be readable on another. Version 2 addressed some of these issues.

DDP is a protocol for describing the contents of a disk, and it is not medium specific. That said, it is common to interchange CD masters with DDP data on 8 mm Exabyte data cartridges, and DVD masters are typically transferred on DLT Type III or IV compact tapes or on DVD-R(A) format disks with CMF (cutting master format) DDP headers. DDP files can be supplied separately from the audio data if necessary. DDP can be used for interchanging the data for a number of different disk formats, such as CD-ROM, CD-DA, CD-I, CD-ROM-XA, DVD-Video and DVD-Audio, and the protocol itself is extremely simple. It consists of a number of ‘streams’ of data, each of which carries different information to describe the contents of the disk. These streams may be either a series of packets of data transferred over a network, files on a disk or tape, or raw blocks of data independent of any filing system. The DDP protocol simply maps its data into whatever block or packet size is used by the medium concerned, provided that the block or packet size is at least 128 bytes. Either a standard computer filing structure can be used, in which case each stream is contained within a named file, or the storage medium is used ‘raw’ with each stream starting at a designated sector or block address.

The ANSI tape labelling specification is used to label the tapes used for DDP transfers. This allows the names and locations of the various streams to be identified. The principal streams included in a DDP transfer for CD mastering are as follows:

1  DDP ID stream or ‘DDPID’ file. 128 bytes long, describing the type and level of DDP information, various ‘vital statistics’ about the other DDP files and their location on the medium (in the case of physically addressed media), and a user text field (not transferred to the CD).

2  DDP Map stream or ‘DDPMS’ file. This is a stream of 128-byte data packets which together give a map of the CD contents, showing what types of CD data are to be recorded in each part of the CD, how long the streams are, what types of subcode are included, and so forth. Pointers are included to the relevant text, subcode and main streams (or files) for each part of the CD.

3  Text stream. An optional stream containing text to describe the titling information for volumes, tracks or index points (not currently stored in CD formats), or for other text comments. If stored as a file, its name is indicated in the appropriate map packet.

4  Subcode stream. Optionally contains information about the subcode data to be included within a part of the disk, particularly for CD-DA. If stored as a file, its name is indicated in the appropriate map packet.

5  Main stream. Contains the main data to be stored on a part of the CD, treated simply as a stream of bytes, irrespective of the block or packet size used. More than one of these files can be used in cases of mixed-mode disks, but there is normally only one in the case of a conventional audio CD. If stored as a file, its name is indicated in the appropriate map packet.

6.3 Interconnecting audio devices

In the case of analog interconnection between devices, replayed digital audio is converted to the analog domain by the replay machine’s D/A convertors, routed to the recording machine via a conventional audio cable and then reconverted to the digital domain by the recording machine’s A/D convertors. The audio is subject to any gain changes that might be introduced by level differences between output and input, or by the record gain control of the recorder and the replay gain control of the player. Analog domain copying is necessary if any analog processing of the signal is to happen in between one device and another, such as gain correction, equalisation, or the addition of effects such as reverberation. Most of these operations, though, are now possible in the digital domain.

An analog domain copy cannot be said to be a perfect copy or a clone of the original master, since the data values will not be exactly the same (owing to slight differences in recording level, differences between convertors, the addition of noise, and so on). For a clone it is necessary to make a true digital copy.

Professional digital audio systems, and some consumer systems, have digital interfaces conforming to one of the standard protocols and allow for a number of channels of digital audio data to be transferred between devices with no loss of sound quality. Any number of generations of digital copies may be made without affecting the sound quality of the latest generation, provided that errors have been fully corrected. The digital outputs of a recording device are taken from a point in the signal chain after error correction, which results in the copy being error corrected. Thus the copy does not suffer from any errors that existed in the master, provided that those errors were correctable. This process takes place in real time, requiring the operator to put the receiving device into record mode such that it simply stores the incoming stream of audio data. Any accompanying metadata may or may not be recorded (often most of it is not).

Digital interfaces may be used for the interconnection of recording systems and other audio devices such as mixers and effects units. It is now common only to use analog interfaces at the very beginning and end of the signal chain, with all other interconnections being made digitally.

Making a copy of a recording using any of the digital interface standards involves the connection of appropriate cables between player and recorder, and the switching of the recorder’s input to ‘digital’ as opposed to ‘analog’, since this sets it to accept a signal from the digital input as opposed to the A/D convertor. It is necessary for both machines to be operating at the same sampling frequency (unless a sampling frequency convertor is used) and may require the recorder to be switched to ‘external sync’ mode, so that it can lock its sampling frequency to that of the player. Alternatively (and preferably) a common reference signal may be used to synchronise all devices that are to be interconnected digitally. A recorder should be capable of at least the same quantising resolution (number of bits per sample) as a player, otherwise audio resolution will be lost. If there is a difference in resolution between the systems it is advisable to use a processor in between the machines that optimally dithers the signal for the new resolution, or alternatively to use redithering options on the source machine to prepare the signal for its new resolution.
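
As a minimal illustration of what such redithering involves, one common approach adds triangular-PDF (TPDF) dither at the level of the new LSB before rounding; this is a sketch only, and the ‘optimal’ processors referred to above may also apply noise shaping, which is not shown here:

    import random

    def redither_24_to_16(sample_24bit):
        """Reduce a 24-bit sample to 16 bits by adding TPDF dither of +/- 1 LSB
        (at the new resolution) before rounding, then clamping to the 16-bit range."""
        lsb = 1 << 8                                   # one 16-bit LSB expressed in 24-bit units
        dither = (random.random() - random.random()) * lsb
        quantised = round((sample_24bit + dither) / lsb)
        return max(-32768, min(32767, quantised))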

6.4 Computer networks and digital audio interfaces compared

Dedicated ‘streaming’ interfaces, as employed in broadcasting, production and post-production environments, are the digital audio equivalent of analog signal cables, down which signals for one or more channels are carried in real time from one point to another, possibly with some auxiliary information (metadata) attached. An example is the AES-3 interface, described below. Such an audio interface uses a data format dedicated to audio purposes, whereas a computer data network may carry numerous types of information.

Dedicated interfaces are normally unidirectional, point-to-point connections, and should be distinguished from buses and computer networks that are often bidirectional and carry data in a packet format. Sources may be connected to destinations using a routeing matrix or by patching individual connections, very much as with analog signals. Audio data are transmitted in an unbroken stream, there is no handshaking process involved in the data transfer, and erroneous data are not retransmitted because there is no mechanism for requesting retransmission. The data rate of a dedicated audio interface is directly related to the audio sampling frequency, wordlength and number of channels of the audio data to be transmitted, ensuring that the interface is always capable of serving the specified number of channels. If a channel is unused for some reason its capacity is not normally available for assigning to other purposes (such as higher-speed transfer of another channel, for example).

Dedicated audio interfaces, therefore, may be thought of as best suited to operational situations in which analog signal cabling needs to be replaced by a digital equivalent, and where digital audio signals are to be routed from place to place within a studio environment so as to ensure dedicated signal feeds. There are, however, a number of developments in real-time computer networking that begin to blur the distinction between such approaches and conventional asynchronous file transfers, owing to the increased use of ‘streaming media’, as discussed below.

There is an increasing trend towards employing standard computer interfaces and networks to transfer audio information, as opposed to using dedicated audio interfaces. Such computer networks are typically used for a variety of purposes in general data communications and they may need to be adapted for audio applications that require sample-accurate realtime transfer. The increasing ubiquity of computer systems in audio environments makes it inevitable that generic data communication technology will gradually take the place of dedicated interfaces. It also makes sense economically to take advantage of the ‘mass market’ features of the computer industry.

Computer networks are typically general-purpose data carriers that may have asynchronous features and may not always have the inherent quality-of-service (QoS) features that are required for ‘streaming’ applications. They also normally use an addressing structure that enables packets of data to be carried from one of a number of sources to one of a number of destinations and such packets will share the connection in a more or less controlled way. Data transport protocols such as TCP/IP are often used as a universal means of managing the transfer of data from place to place, adding overheads in terms of data rate, delay and error handling that may work against the efficient transfer of audio. Such networks may be intended primarily for file transfer applications where the time taken to transfer the file is not a crucial factor – as fast as possible will do.

Conventional office Ethernet is a good example of a computer network interface that has limitations in respect of audio streaming. The original 10 Mbit s–1 data rate was quite slow, although theoretically capable of handling a number of channels of real-time audio data. If employed between only two devices and used with a low-level protocol such as UDP (user datagram protocol) audio can be streamed quite successfully, but problems can arise when multiple devices contend for use of the bus and where the network is used for general purpose data communications in addition to audio streaming. There is no guarantee of a certain quality of service, because the bus is a sort of ‘free for all’, ‘first-come-first-served’ arrangement that is not designed for real-time applications. To take a simple example, if one’s colleague attempts to download a huge file from the Internet just when one is trying to stream a broadcast live-to-air in a local radio station, using the same data network, the chances are that one’s broadcast will drop out occasionally.

One can partially address such limitations in a crude way by throwing data-handling capacity at the problem, hoping that increasing the network speed to 100 Mbit s–1 or even 1 Gbit s–1 will avoid it ever becoming overloaded. Circuit-switched networks can also be employed to ease these problems (that is networks where individual circuits are specifically established between sources and destinations). Unless capacity can be reserved and service quality guaranteed a data network will never be a suitable replacement for dedicated audio interfaces in critical environments such as broadcasting stations. This has led to the development of realtime protocols and/or circuit-switched networks for handling audio information on data interfaces, in which latency (delay) and bandwidth are defined and guaranteed. The audio industry can benefit from the increased data rates, flexibility and versatility of general purpose interfaces provided that these issues are taken seriously.

Desktop computers and consumer equipment are also increasingly equipped with general purpose serial data interfaces such as USB (universal serial bus) and FireWire (IEEE 1394). These are examples of personal area network (PAN) technology, allowing a number of devices to be interconnected within a limited range around the user. These have a high enough data rate to carry a number of channels of audio data over relatively short distances, either over copper or optical fibre. Audio protocols also exist for these, as described below.

6.5 Dedicated audio interface formats

6.5.1 Digital interface types

There are a number of types of digital interface, some of which are international standards and others of which are manufacturer-specific. They all carry digital audio for one or more channels with at least 16-bit resolution and will operate at the standard sampling rates of 44.1 and 48 kHz, as well as at 32 kHz if necessary, some having a degree of latitude for varispeed. Some interface standards have been adapted to handle higher sampling frequencies such as 88.2 and 96 kHz. The interfaces vary as to how many physical interconnections are required. Some require one link per channel plus a synchronisation signal, whilst others carry all the audio information plus synchronisation information over one cable.

The interfaces are described below in outline. It is common for subtle incompatibilities to arise between devices, even when interconnected with a standard interface, owing to the different ways in which non-audio information is implemented. This can result in anything from minor operational problems to total non-communication and the causes and remedies are unfortunately far too detailed to go into here. The reader is referred to The Digital Interface Handbook by Rumsey and Watkinson, as well as to the standards themselves, if a greater understanding of the intricacies of digital audio interfaces is required.

6.5.2 The AES 3 interface (AES 3)

The AES 3 interface, described almost identically in AES3-1992, IEC 60958 and EBU Tech. 3250E among others, allows for two channels of digital audio (A and B) to be transferred serially over one balanced interface, using drivers and receivers similar to those used in the RS422 data transmission standard, with an output voltage of between 2 and 7 volts as shown in Figure 6.9. The interface allows two channels of audio to be transferred over distances up to 100 m, but longer distances may be covered using combinations of appropriate cabling, equalisation and termination. Standard XLR-3 connectors are used, often labelled DI (for digital in) and DO (for digital out).

image

Figure 6.9 Recommended electrical circuit for use with the standard two-channel interface

image

Figure 6.10 Format of the standard two-channel interface frame

Each audio sample is contained within a ‘subframe’ (see Figure 6.10), and each subframe begins with one of three synchronising patterns to identify the sample as either A or B channel, or to mark the start of a new channel status block (see Figure 6.11). These synchronising patterns violate the rules of bi-phase mark coding (see below) and are easily identified by a decoder. One frame (containing two audio samples) is normally transmitted in the time period of one audio sample, so the data rate varies with the sampling frequency. (Note, though, that the recently introduced ‘single-channel-double-sampling-frequency’ mode of the interface allows two samples for one channel to be transmitted within a single frame in order to allow the transport of audio at 88.2 or 96 kHz sampling frequency.)

Additional data is carried within the subframe in the form of 4 bits of auxiliary data (which may either be used for additional audio resolution or for other purposes such as low-quality speech), a validity bit (V), a user bit (U), a channel status bit (C) and a parity bit (P), making 32 bits per subframe and 64 bits per frame. Channel status bits are aggregated at the receiver to form a 24-byte word every 192 frames, and each bit of this word has a specific function relating to interface operation, an overview of which is shown in Figure 6.12. Examples of bit usage in this word are the signalling of sampling frequency and pre-emphasis, as well as the carrying of a sample address ‘timecode’ and labelling of source and destination. Bit 1 of the first byte signifies whether the interface is operating according to the professional (set to 1) or consumer (set to 0) specification.
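By way of illustration, the following short Python sketch assembles the 32 time slots of one such subframe from the fields just described. It is purely illustrative: the preamble is shown only as a placeholder (real preambles violate the channel code and cannot be expressed as ordinary data bits), and details such as bit ordering should be taken from the standard itself.

# Minimal sketch of assembling one AES3-style subframe as a list of bits
# (preamble shown only as a 4-slot placeholder; real preambles violate
# bi-phase mark rules and cannot be represented as ordinary data bits).

def aes3_subframe(sample_20bit, aux=0, v=0, u=0, c=0):
    bits = [None] * 4                                     # time slots 0-3: preamble placeholder
    bits += [(aux >> i) & 1 for i in range(4)]            # 4 auxiliary bits (LSB first, assumed)
    bits += [(sample_20bit >> i) & 1 for i in range(20)]  # 20 audio bits (LSB first, assumed)
    bits += [v, u, c]                                     # validity, user, channel status
    parity = sum(b for b in bits[4:]) & 1                 # even parity over slots 4-31
    bits.append(parity)
    return bits                                           # 32 time slots in total

frame = aes3_subframe(sample_20bit=0x5A5A5, c=1)
assert len(frame) == 32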

Bi-phase mark coding, the same channel code as used for SMPTE/EBU timecode, is used in order to ensure that the data is self-clocking, of limited bandwidth, DC free, and polarity independent, as shown in Figure 6.13. The interface has to accommodate a wide range of cable types and a nominal 110 ohms characteristic impedance is recommended. Originally (AES3-1985) up to four receivers with a nominal input impedance of 250 ohms could be connected across a single professional interface cable, but the later modification to the standard recommended the use of a single receiver per transmitter, having a nominal input impedance of 110 ohms.
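The principle of bi-phase mark coding can be illustrated with a small sketch: there is a transition at the start of every bit cell and an additional mid-cell transition for a ‘1’, so the coded signal carries its own clock and can be inverted without loss of meaning. The function below is a toy encoder of the principle, not a description of any particular implementation.

# Toy bi-phase mark encoder: every bit cell starts with a transition,
# and a '1' causes an extra transition at mid-cell, so the code is
# DC-free and polarity independent (only transitions matter).

def biphase_mark(bits, level=0):
    out = []
    for b in bits:
        level ^= 1            # transition at the start of every cell
        out.append(level)     # first half-cell
        if b:
            level ^= 1        # mid-cell transition for a '1'
        out.append(level)     # second half-cell
    return out

print(biphase_mark([1, 0, 1, 1]))  # -> [1, 0, 1, 1, 0, 1, 0, 1]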

image

Figure 6.11 Three different preambles (X, Y and Z) are used to synchronise a receiver at the starts of subframes

image

Figure 6.12 Overview of the professional channel status block

image

Figure 6.13 An example of the bi-phase mark channel code

6.5.3 Standard consumer interface (IEC 60958-3)

The most common consumer interface (historically related to SPDIF – the Sony/Philips digital interface) is very similar to the AES 3 interface, but uses unbalanced electrical interconnection over a coaxial cable having a characteristic impedance of 75 ohms, as shown in Figure 6.14. It can be found on many items of semi-professional or consumer digital audio equipment, such as CD players and DAT machines, and is also widely used on computer sound cards because of the small physical size of the connectors. It usually terminates in an RCA phono connector, although some equipment makes use of optical fibre interconnects (TOSlink) carrying the same data. Format convertors are available for converting consumer format signals to the professional format, and vice versa, and for converting between electrical and optical formats.

When the IEC standardised the two-channel digital audio interface, two requirements existed: one for ‘consumer use’, and one for ‘broadcasting or similar purposes’. A single IEC standard (IEC 958) resulted with only subtle differences between consumer and professional implementation. Occasionally this caused problems in the interconnection of machines, such as when consumer format data was transmitted over professional electrical interfaces. IEC 958 has now been rewritten as IEC 60958 and many of these uncertainties have been addressed.

The data format of subframes is the same as that used in the professional interface, but the channel status implementation is almost completely different, as shown in Figure 6.15. The second byte of channel status in the consumer interface has been set aside for the indication of ‘category codes’, these being set to define the type of consumer usage. Current examples of defined categories are (00000000) for the General category, (10000000) for Compact Disc and (11000000) for a DAT machine. Once the category has been defined, the receiver is expected to interpret certain bits of the channel status word in a particular way, depending on the category. For example, in CD usage, the four control bits from the CD’s ‘Q’ channel subcode are inserted into the first four control bits of the channel status block (bits 1–4). Copy protection can be implemented in consumer-interfaced equipment, according to the Serial Copy Management System (SCMS).
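As a simple illustration of the category code idea, the fragment below maps the three example bit patterns quoted above to their meanings. The patterns are written here in transmission order, as in the text; byte-oriented representations of channel status may order the bits differently.

# Illustrative lookup of the consumer category code patterns quoted in the
# text (written as the 8 channel status bits in transmission order).

CATEGORY_PATTERNS = {
    "00000000": "General",
    "10000000": "Compact Disc",
    "11000000": "DAT",
}

def category_name(bits8: str) -> str:
    return CATEGORY_PATTERNS.get(bits8, "Unknown / not listed here")

print(category_name("10000000"))   # -> Compact Disc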

image

Figure 6.14 The consumer electrical interface (transformer and capacitor are optional but may improve the electrical characteristics of the interface)

image

Figure 6.15 Overview of the consumer channel status block

The user bits of the consumer interface are often used to carry information derived from the subcode of recordings, such as track identification and cue point data. This can be used when copying CDs and DAT tapes, for example, to ensure that track start ID markers are copied along with the audio data. This information is not normally carried over AES/EBU interfaces.

6.5.4 Carrying data-reduced audio over standard digital interfaces

The increased use of data-reduced multichannel audio has resulted in methods by which such data can be carried over standard two-channel interfaces, either for professional or consumer purposes. This makes use of the ‘non-audio’ or ‘other uses’ mode of the interface, indicated in the second bit of channel status, which tells conventional PCM audio decoders that the information is some other form of data that should not be converted directly to analog audio. Because data-reduced audio has a much lower rate than the PCM audio from which it was derived, a number of audio channels can be carried in a data stream that occupies no more space than two channels of conventional PCM. These applications of the interface are described in SMPTE 337M (concerned with professional applications) and IEC 61937, although the two are not identical. SMPTE 338M and 339M specify data types to be used with this standard. The SMPTE standard packs the compressed audio data into 16, 20 or 24 bits of the audio part of the AES 3 sub-frame and can use the two sub-frames independently (e.g. one for PCM audio and the other for data-reduced audio), whereas the IEC standard only uses 16 bits and treats both sub-frames the same way.

Consumer use of this mode is evident on DVD players, for example, for connecting them to home cinema decoders. Here the Dolby Digital or DTS-encoded surround sound is not decoded in the player but in the attached receiver/decoder. IEC 61937 has parts, either pending or published, dealing with a range of different codecs including ATRAC, Dolby AC-3, DTS and MPEG (various flavours). An ordinary PCM convertor attempting to decode such a signal would simply reproduce it as loud, rather unpleasant noise, which is why receivers must observe the ‘non-audio’ flag; provided the second bit of channel status is correctly interpreted this does not normally happen. Professional applications of the mode vary, but are likely to be increasingly encountered in conjunction with Dolby E data reduction – a relatively recent development involving mild data reduction for professional multichannel applications in which users wish to continue making use of existing AES 3-compatible equipment (e.g. VTRs, switchers and routers). Dolby E enables 5.1-channel surround audio to be carried over conventional two-channel interfaces and through AES 3-transparent equipment at a typical rate of about 1.92 Mbit s–1 (depending on how many bits of the audio sub-frame are employed). It is designed so that it can be switched or edited at video frame boundaries without disturbing the audio.

image

Figure 6.16 Format of TDIF data and LRsync signal

6.5.5 Tascam digital interface (TDIF)

Tascam’s interfaces have become popular owing to the widespread use of the company’s DA-88 multitrack recorder and more recent derivatives. The primary TDIF-1 interface uses a 25-pin D-sub connector to carry eight channels of audio information in two directions (in and out of the device), sampling frequency and pre-emphasis information (on separate wires, two for fs and one for emphasis) and a synchronising signal. The interface is unbalanced and uses CMOS voltage levels. Each data connection carries two channels of audio data, odd channel and MSB first, as shown in Figure 6.16. As can be seen, the audio data can be up to 24 bits long, followed by 2 bits to signal the word length, 1 bit to signal emphasis and 1 bit for parity. There are also four user bits per channel that are not usually used.

This resembles a modified form of the AES3 interface frame format. An accompanying left/right clock signal is high for the odd samples and low for the even samples of the audio data. It is difficult to find information about this interface but the output channel pairs appear to be on pins 1–4 with the left/right clock on pin 5, while the inputs are on pins 13–10 with the left/right clock on pin 9. Pins 7, 14–17 (these seem to be related to output signals) and 22–25 (related to the input signals) are grounded. The unbalanced, multi-conductor, non-coaxial nature of this interface makes it suitable only for covering short distances of up to 5 metres.
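Based purely on the description above, a single TDIF channel word might be sketched as follows. The word-length codes and the sense of the parity bit are assumptions for the purpose of illustration; the definitive values are specified by Tascam.

# Rough sketch of one TDIF channel word as described above: up to 24 audio
# bits (MSB first), two word-length bits, one emphasis bit and one parity
# bit, plus four user bits. The word-length code and parity sense are
# assumptions here, not values taken from the Tascam specification.

def tdif_channel_word(sample_24bit, wordlen_code=0, emphasis=0, user=0):
    bits = [(sample_24bit >> (23 - i)) & 1 for i in range(24)]   # audio, MSB first
    bits += [(wordlen_code >> 1) & 1, wordlen_code & 1]          # 2 word-length bits
    bits.append(emphasis)                                        # emphasis flag
    bits.append(sum(bits) & 1)                                   # simple even parity (assumed)
    bits += [(user >> i) & 1 for i in range(4)]                  # 4 user bits
    return bits

assert len(tdif_channel_word(0x123456)) == 32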

6.5.6 Alesis digital interface

The ADAT multichannel optical digital interface, commonly referred to as the ‘light pipe’ interface or simply ‘ADAT Optical’, is a serial, self-clocking, optical interface that carries eight channels of audio information. It is described in US Patent 5,297,181: ‘Method and apparatus for providing a digital audio interface protocol’. The interface is capable of carrying up to 24 bits of digital audio data for each channel and the eight channels of data are combined into one serial frame that is transmitted at the sampling frequency. The data is encoded in NRZI format for transmission, with forced ones inserted every five bits (except during the sync pattern) to provide clock content. This can be used to synchronise the sampling clock of a receiving device if required, although some devices require the use of a separate 9-pin ADAT sync cable for synchronisation. The sampling frequency is normally limited to 48 kHz with varispeed up to 50.4 kHz and TOSLINK optical connectors are typically employed (Toshiba TOCP172 or equivalent). In order to operate at 96 kHz sampling frequency some implementations use a ‘double-speed’ mode in which two channels are used to transmit one channel’s audio data (naturally halving the number of channels handled by one serial interface). Although 5 m lengths of optical fibre are the maximum recommended, longer distances may be covered if all the components of the interface are of good quality and clean. Experimentation is required.

image

Figure 6.17 Basic format of ADAT data

As shown in Figure 6.17 the frame consists of an 11-bit sync pattern consisting of 10 zeros followed by a forced one. This is followed by four user bits (not normally used and set to zero), the first forced one, then the first audio channel sample (with forced ones every five bits), the second audio channel sample, and so on.
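A rough sketch of this frame structure, following the description above, is given below. It simply assembles the sync pattern, user bits and nibble-interleaved forced ones into a list of bits; the patent remains the definitive reference for the exact layout.

# Sketch of the ADAT frame layout described above: an 11-bit sync pattern,
# four user bits, then each 24-bit channel sample split into 4-bit nibbles,
# each preceded by a forced '1' to guarantee clock content in the NRZI stream.

def adat_frame(samples_24bit, user=0):
    assert len(samples_24bit) == 8                     # eight audio channels
    bits = [0] * 10 + [1]                              # sync: 10 zeros then a forced one
    bits += [(user >> (3 - i)) & 1 for i in range(4)]  # 4 user bits
    for sample in samples_24bit:
        for nibble in range(6):                        # 6 nibbles of 4 bits, MSB first
            bits.append(1)                             # forced one every 5 bits
            shift = 20 - 4 * nibble
            bits += [(sample >> (shift + 3 - i)) & 1 for i in range(4)]
    return bits

frame = adat_frame([0] * 8)
print(len(frame))   # 255 under this reading; the patent gives the definitive total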

6.5.7 Roland R-bus

Roland has recently introduced its own proprietary multichannel audio interface that, like TDIF (but not directly compatible with it), uses a 25-way D-type connector to carry eight channels of audio in two directions. Called R-bus it is increasingly used on Roland’s digital audio products and convertor boxes are available to mediate between R-bus and other interface formats. Little technical information about R-bus is available publicly at the time of writing.

6.5.8 Sony digital interface for DSD (SDIF-3)

Sony has recently introduced a high-resolution digital audio format known as ‘Direct Stream Digital’ or DSD (see Chapter 2). This encodes audio using one-bit sigma-delta conversion at a very high sampling frequency of typically 2.8224 MHz (64 times 44.1 kHz). There are no internationally agreed interfaces for this format of data, but Sony has released some preliminary details of an interface that can be used for the purpose, known as SDIF-3. Some early DSD equipment used a data format known as ‘DSD-raw’ which was simply a stream of DSD samples in non-return-to-zero (NRZ) form, as shown in Figure 6.18(a).

In SDIF-3 data is carried over 75 ohm unbalanced coaxial cables, terminating in BNC connectors. The bit rate is twice the DSD sampling frequency (or 5.6448 Mbit s–1 at the sampling frequency given above) because phase modulation is used for data transmission, as shown in Figure 6.18(b). A separate word clock at 44.1 kHz is used for synchronisation purposes. It is also possible to encounter a DSD clock signal connection at 64 times 44.1 kHz (2.8224 MHz).

6.5.9 Sony multichannel DSD interface (MAC-DSD)

Sony has also developed a multichannel interface for DSD signals, capable of carrying 24 channels over a single physical link. The transmission method is based on the same technology as used for the Ethernet 100BASE-TX (100 Mbit s–1) twisted-pair physical layer (PHY), but it is used in this application to create a point-to-point audio interface. Category 5 cabling is used, as for Ethernet, consisting of eight conductors. Two pairs are used for bi-directional audio data and the other two pairs for clock signals, one in each direction.

image

Figure 6.18 Direct Stream Digital interface data is either transmitted ‘raw’, as shown at (a) or phase modulated as in the SDIF-3 format shown at (b)

Twenty-four channels of DSD audio require a total bit rate of 67.7 Mbit s–1, leaving an appreciable spare capacity for additional data. In the MAC-DSD interface this is used for error correction (parity) data, frame header and auxiliary information. Data is formed into frames that can contain Ethernet MAC headers and optional network addresses for compatibility with network systems. Audio data within the frame is formed into 352 32-bit blocks; in each block, 24 bits are individual channel samples (one per channel), six are parity bits and two are auxiliary bits.
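The payload arithmetic behind these figures can be checked with a few lines of Python; the numbers are approximate and ignore the exact framing overheads, which are defined by Sony.

# Rough capacity arithmetic for the MAC-DSD payload described above.

dsd_rate = 64 * 44_100          # 2.8224 MHz, one bit per channel per DSD sample
channels = 24
audio_payload = channels * dsd_rate
print(audio_payload / 1e6)      # ≈ 67.7 Mbit/s of raw DSD audio

bits_per_frame = 352 * 32       # 352 blocks of 32 bits per frame
print(bits_per_frame)           # 11 264 bits, of which 24 per block are audio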

In a recent enhancement of this interface, Sony has introduced ‘SuperMAC’ which is capable of handling either DSD or PCM audio with very low latency (delay), typically less than 50 μs. The number of channels carried depends on the sampling frequency. Twenty-four DSD channels can be handled, or 48 PCM channels at 44.1/48 kHz, reducing proportionately as the sampling frequency increases. In conventional PCM mode the interface is transparent to AES3 data including user and channel status information.

6.6 Networking

6.6.1 Basic principles of networking

A network carries data either on wire or optical fibre, and is normally shared between a number of devices and users. The sharing is achieved by containing the data in packets of a limited number of bytes (usually between 64 and 1518), each with an address attached. The packets may share a common physical link, normally a high speed serial bus of some kind, being multiplexed in time either using a regular slot structure synchronised to a system clock (isochronous transfer) or in an asynchronous fashion whereby the time interval between packets may be varied or transmission may not be regular, as shown in Figure 6.19. The length of packets may not be constant, depending on the requirements of different protocols sharing the same network. Packets for a particular file transfer between two devices may not be contiguous and may be transferred erratically, depending on what other traffic is sharing the same physical link.

Figure 6.20 shows some common physical layouts for local area networks (LANs). LANs are networks that operate within a limited area, such as an office building or studio centre, within which it is common for every device to ‘see’ the same data, each picking off that which is addressed to it and ignoring the rest. Routers and bridges can be used to break up complex LANs into subnets. WANs (wide area networks) and MANs (metropolitan area networks) are larger entities that link LANs within communities or regions. PANs (personal area networks) are typically limited to a range of a few tens of metres around the user (e.g. Firewire, USB, Bluetooth). Wireless versions of these network types are increasingly common, as discussed in Section 6.6.8.

image

Figure 6.19 Packets for different destinations (A, B and C) multiplexed onto a common serial bus. (a) Time division multiplexed into a regular time slot structure. (b) Asynchronous transfer showing variable time gaps and packet lengths between transfers for different destinations

image

Figure 6.20 Two examples of computer network topologies. (a) Devices connected by spurs to a common hub, and (b) devices connected to a common ‘backbone’. The former is now by far the most common, typically using CAT 5 cabling

In order to place a packet of data on the network, devices must have a means for determining whether the network is busy and there are various protocols in existence for arbitrating network access. Taking Ethernet as an example: in the ‘backbone’ configuration devices are connected to spurs off a common serial bus that requires the bus to be ‘chained’ between each successive device. Here, a break in the chain can mean disconnection for more than one device. The star configuration involves a central hub or switch that distributes the data to each device separately. This is more reliable because a break in any one link does not affect the others. Bus arbitration in both these cases is normally performed by collision detection, which is a relatively crude approach, working rather like the rules of spoken conversation between people. Devices attempt to place packets on the bus whenever it appears to be quiet, but a collision may take place if another device attempts to transmit before the first one has finished. The network interface of the transmitting device detects the collision by reading back the data it has just transmitted; if that data has been corrupted by the collision it sends a brief ‘blocking signal’ and then retransmits the packet.

A token ring configuration places each device within a ‘ring’, each device having both an ‘in’ and an ‘out’ port, with bus arbitration performed using a process of token passing from one device to the next. This works rather like trains running on a single track line, in that a single token is carried by the train using the line and trains can only use the line if carrying the token. The token is passed to the next train upon leaving the single-track sector to show that the line is clear.

Network communication is divided into a number of ‘layers’, each relating to an aspect of the communication protocol and interfacing correctly with the layers either side. The ISO seven-layer model for open systems interconnection (OSI) shows the number of levels at which compatibility between systems needs to exist before seamless interchange of data can be achieved (Figure 6.21). It shows that communication begins with the application and is passed down through various stages to the layer most people understand – the physical layer, or the piece of wire over which the information is carried. Layers 3, 4 and 5 can be grouped under the broad heading of ‘protocol’, determining the way in which data packets are formatted and transferred. There is a strong similarity here with the exchange of data on physical media, as discussed earlier, where a range of compatibility layers from the physical to the application determine whether or not one device can read another’s disks.

6.6.2 Extending a network

It is common to need to extend a network to a wider area or to more machines. As the number of devices increases so does the traffic, and there comes a point when it is necessary to divide a network into zones, separated by ‘repeaters’, ‘bridges’ or ‘routers’. Some of these devices allow network traffic to be contained within zones, only communicating between the zones when necessary. This is vital in large interconnected networks because otherwise data placed anywhere on the network would be present at every other point on the network, and overload could quickly occur.

image

Figure 6.21 The ISO model for Open Systems Interconnection is arranged in seven layers, as shown here

A repeater is a device that links two separate segments of a network so that they can talk to each other, whereas a bridge isolates the two segments in normal use, only transferring data across the bridge when it has a destination address on the other side. A router is very selective in that it examines data packets and decides whether or not to pass them depending on a number of factors. A router can be programmed only to pass certain protocols and only certain source and destination addresses. It therefore acts as something of a network policeman and can be used as a first level of ensuring security of a network from unwanted external access. Routers can also operate between different standards of network, such as between FDDI and Ethernet, and ensure that packets of data are transferred over the most time/cost-effective route.

One could also use some form of router to link a local network to another that was quite some distance away, forming a wide area network (WAN), as shown in Figure 6.22. Data can be routed either over dialled data links such as ISDN (see below), in which the time is charged according to usage just like a telephone call, or over leased circuits. The choice would depend on the degree of usage and the relative costs. The Internet provides a means by which LANs are easily interconnected, although the data rate available will depend on the route, the service provider and the current traffic.

6.6.3 Network standards

Ethernet, FDDI (Fibre Distributed Data Interface), ATM (Asynchronous Transfer Mode) and Fibre Channel are examples of network standards, each of which specifies a number of layers within the OSI model. FDDI, for example, specifies only the first three layers of the OSI model (the physical, data link and network layers).

image

Figure 6.22 A WAN is formed by linking two or more LANs using a router

Ethernet allows a number of different methods of interconnection and runs at various rates from 10 Mbit s–1 to 1 Gbit s–1, using collision detection for network access control. Twisted-pair (Base-T) connection using CAT 5 cabling and RJ 45 connectors is probably the most widely encountered physical interconnect these days, usually configured in the star topology using a central hub or switch. Devices can then be plugged and unplugged without affecting others on the network. Interconnection can also be via either thick (10Base-5) or thin (10Base-2) coaxial cable, normally working in the backbone-type configuration shown in the previous section, using 50-ohm BNC connectors and T-pieces to chain devices on the network (see Figure 6.23). Such a configuration requires resistive terminators at the ends of the bus to avoid reflections, as with SCSI connections.

FDDI is a high speed optical fibre network running at 100 Mbit s–1, operating on the token passing principle described above, allowing up to 2 km between stations. It is often used as a high-speed backbone for large networks. There is also a copper version of FDDI called CDDI which runs at the same rate but restricts interconnection distance.

image

Figure 6.23 Typical thin Ethernet interconnection arrangement (this is becoming less common now)

ATM is a protocol for data communication and does not specify the physical medium for interconnection. It is connection-oriented, in other words it sets up connections between source and destination and can guarantee a certain quality of service, which makes it quite suitable for audio and video data. ATM allows for either guaranteed bandwidth communications between a source and a destination (needed for AV applications), or for more conventional variable bandwidth communication. It operates in a switched fashion and can extend over wide or metropolitan areas. Switched networks involve the setting up of specific circuits between a transmitter and one or more receivers, rather like a dialled telephone network (indeed this is the infrastructure of the digital telephone network). The physical network is made up of a series of interconnected switches that are reconfigured to pass the information from sources to destinations according to the header information attached to each data packet. A network management system handles the negotiation between different devices that are contending for bandwidth, according to current demand. ATM typically operates over SONET (synchronous optical network) or SDH (synchronous digital hierarchy) networks, depending on the region of the world. Data packets (cells) on ATM networks consist of a fixed 48-byte payload, preceded by a 5-byte header that identifies the virtual channel of the cell.

Fibre Channel is used increasingly for the interconnection of workstations and disk storage arrays, using so-called ‘storage area network’ structures. Each link has a separate fibre (or copper connection) for the transmit and receive circuits, so it can operate in full-duplex mode. The standard allows for data rates up to 4 Gbit s–1 depending on the capabilities of the implementation.

6.6.4 Network protocols

A protocol specifies the rules of communication on a network. In other words it determines things like the format of data packets, their header information and addressing structure, and any handshaking and error retrieval schemes, amongst other things. One physical network can handle a wide variety of protocols, and packets conforming to different protocols can coexist on the same bus.

Some common examples of general purpose network protocols are TCP/IP (Transmission Control Protocol/Internet Protocol), used for communications over the Internet (see below) and also over LANs, and UDP (User Datagram Protocol), often used for basic streaming applications. These general-purpose protocols are not particularly efficient or reliable for real-time audio transfer, but they can be used for non-real-time transfer of audio files between workstations or for streaming. Specially designed protocols may be needed for audio purposes, as described below.

6.6.5 Audio network requirements

The principal application of computer networks in audio systems is in the transfer of audio data files between workstations, or between workstations and a central ‘server’ which stores shared files. The device requesting the transfer is known as the ‘client’ and the device providing the data is known as the ‘server’. When a file is transferred in this way a byte-for-byte copy is reconstructed on the client machine, with the file name and any other header data intact. There are considerable advantages in being able to perform this operation at speeds in excess of real time for operations in which real-time feeds of audio are not the aim. For example, in a news editing environment a user might wish to load up a news story file from a remote disk drive in order to incorporate it into a report, this being needed as fast as the system is capable of transferring it. Alternatively, the editor might need access to remotely stored files, such as sound files on another person’s system, in order to work on them separately. In audio post-production for films or video there might be a central store of sound effects, accessible by everyone on the network, or it might be desired to pass on a completed portion of a project to the next stage in the post-production process.

Wired Ethernet is fast enough to transfer audio data files faster than real time, depending on network loading and speed. For satisfactory operation it is advisable to use 100 Mbit s–1 or even 1 Gbit s–1 Ethernet as opposed to the basic 10 Mbit s–1 version. Switched Ethernet architectures allow the bandwidth to be more effectively utilised, by creating switched connections between specific source and destination devices. Approaches using FDDI or ATM are appropriate for handling large numbers of sound file transfers simultaneously at high speed. Unlike a real-time audio interface, the speed of transfer of a sound file over a packet-switched network (when using conventional file transfer protocols) depends on how much traffic is currently using it. If there is a lot of traffic then the file may be transferred more slowly than if the network is quiet (very much like motor traffic on roads). The file might be transferred erratically as traffic volume varies, with the file arriving at its destination in ‘spurts’. There therefore arises the need for network communication protocols designed specifically for the transfer of real-time data, which serve the function of reserving a proportion of the network bandwidth for a given period of time, as described below. This is known as engineering a certain ‘quality of service’.
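A back-of-envelope calculation illustrates why faster-than-real-time transfer is normally achievable. The sustained throughput figure used below is an assumption for the purpose of the example; real throughput depends on traffic and protocol overheads.

# Comparing a sound file's real-time data rate with a hypothetical
# sustained network throughput (the 50 Mbit/s figure is an assumption).

audio_rate = 2 * 16 * 44_100                 # stereo, 16-bit, 44.1 kHz ≈ 1.41 Mbit/s
file_bits = audio_rate * 5 * 60              # a five-minute stereo recording
throughput = 50e6                            # assumed sustained throughput, bit/s

print(file_bits / 8 / 1e6)                   # ≈ 52.9 MB file
print(file_bits / throughput)                # ≈ 8.5 s to transfer, far less than
print(file_bits / audio_rate)                # the 300 s the file takes to play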

Without real-time protocols designed as indicated above, the computer network may not be relied upon for transferring audio where an unbroken audio output is to be reconstructed at the destination from the data concerned. The faster the network the more likely it is that one would be able to transfer a file fast enough to feed an unbroken audio output, but this should not be taken for granted. Even the highest speed networks can be filled up with traffic! This may seem unnecessarily careful until one considers an application in which a disk drive elsewhere on the network is being used as the source for replay by a local workstation, as illustrated in Figure 6.24. Here it must be possible to ensure guaranteed access to the remote disk at a rate adequate for real-time transfer, otherwise gaps will be heard in the replayed audio.

image

Figure 6.24 In this example of a networked system a remote disk is accessed over the network to provide data for real time audio playout from a workstation used for on-air broadcasting. Continuity of data flow to the on-air workstation is of paramount importance here

6.6.6 ISDN

ISDN is an extension of the digital telephone network to the consumer, providing two 64 kbit s–1 digital channels (‘B’ channels) which can be connected to ISDN terminals anywhere in the world simply by dialling. Data of virtually any kind may be transferred over the dialled-up link, and potential applications for ISDN include audio transfer. ISDN is really a subset of ATM, and ATM has sometimes been called broadband ISDN for this reason.

The total usable capacity of a single ISDN-2 connection is only 128 kbit s–1 and so it is not possible to carry linear PCM data at normal audio resolutions over such a link, but it is possible to carry moderately high-quality stereo digital audio at this rate using a data reduction system such as MPEG (see Chapter 2). Higher rates can be achieved by combining more than one ISDN link to obtain data rates of, say, 256 or 384 kbit s–1. Multiple ISDN lines must be synchronised together using devices known as inverse multiplexers if the different delays that may arise over different connections are to be compensated. There are also ISDN-30 lines, providing 30 simultaneous ‘B’ channels of 64 kbit s–1 (giving roughly 2 Mbit s–1).

It is possible to use ISDN links for non-real-time audio file transfer, and this can be economically viable depending on the importance of the project and the size of files. The cost of an ISDN call is the same as that of a normal telephone call of equivalent duration, and therefore it can be quite a cheap way of getting information from one place to another.

In the USA, there still remain a lot of circuits which are very similar to ISDN but not identical. These are called ‘Switched 56’ and carry data at 56 kbit s–1 rather than 64 kbit s–1 (the remaining 8 kbit s–1 that makes up the total of 64 kbit s–1 is used for housekeeping data in Switched 56, whereas with ISDN the housekeeping data is transferred in a ‘D’ channel on top of the two 64 kbit s–1 data channels). This can create some problems when trying to link ISDN terminals, if there is a Switched 56 bridge somewhere in the way, requiring file transfer to take place at the lower rate.

For many applications, ISDN services are being superseded by ADSL (Asymmetric Digital Subscriber Line) technology, allowing higher data rates than the normal pair of ISDN B channels to be delivered to consumers and businesses over conventional telephone lines. The two technologies are somewhat different though, and ISDN may be considered superior for real-time applications requiring switched circuits and quality of service guarantees.

6.6.7 Protocols for the Internet

The Internet is now established as a universal means for worldwide communication. Although real-time protocols and quality of service do not sit easily with the idea of a free-for-all networking structure, there is growing evidence of applications that allow real-time audio and video information to be streamed with reasonable quality. The RealAudio format, for example, developed by Real Networks, is designed for coding audio in streaming media applications, currently at rates between 12 and 352 kbit s–1 for stereo audio, achieving respectable quality at the higher rates. People are also increasingly using the Internet for transferring multimedia projects between sites using FTP (file transfer protocol).

The Internet is a collection of interlinked networks with bridges and routers in various locations, which originally developed amongst the academic and research community. The bandwidth (data rate) available on the Internet varies from place to place, and depends on the route over which data is transferred. In this sense there is no easy way to guarantee a certain bandwidth, nor a certain ‘time slot’, and when there is a lot of traffic it simply takes a long time for data transfers to take place. Users access the Internet through a service provider (ISP), using either a telephone line and a modem, ISDN or an ADSL connection. The most intensive users will probably opt for high-speed leased lines giving direct access to the Internet.

The common protocol for communication on the Internet is called TCP/IP (Transmission Control Protocol/Internet Protocol). This provides a connection-oriented approach to data transfer, allowing for verification of packet integrity, packet order and retransmission in the case of packet loss. At a more detailed level, as part of the TCP/IP structure, there are high level protocols for transferring data in different ways. There is a file transfer protocol (FTP) used for downloading files from remote sites, a simple mail transfer protocol (SMTP) and a post office protocol (POP) for transferring email, and a hypertext transfer protocol (HTTP) used for interlinking sites on the world-wide web (WWW). The WWW is a collection of file servers connected to the Internet, each with its own unique IP address (the method by which devices connected to the Internet are identified), upon which may be stored text, graphics, sounds and other data.

UDP (user datagram protocol) is a relatively low-level connectionless protocol that is useful for streaming over the Internet. Being connectionless, it does not require any handshaking between transmitter and receiver, so the overheads are very low and packets can simply be streamed from a transmitter without worrying about whether or not the receiver gets them. If packets are missed by the receiver, or received in the wrong order, there is little to be done about it except mute or replay distorted audio, but UDP can be efficient when bandwidth is low and quality of service is not the primary issue.
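A minimal sketch of this kind of connectionless streaming, using Python's standard socket module, is shown below. The address, port and packet size are arbitrary example values, and the four-byte sequence number is added here only so that a receiver could detect loss or reordering; it is not part of UDP itself.

# Minimal UDP streaming sketch: the sender simply fires packets at an
# address and never learns whether they arrived.

import socket

DEST = ("192.168.0.20", 50000)      # example address and port
PACKET_SAMPLES = 256                # samples per packet, chosen arbitrarily

def send_blocks(pcm_blocks):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq, block in enumerate(pcm_blocks):
        # prepend a sequence number so a receiver can spot loss/reordering
        sock.sendto(seq.to_bytes(4, "big") + block, DEST)
    sock.close()

# e.g. send_blocks(blocks) where each block is a bytes object holding
# PACKET_SAMPLES samples of PCM audio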

Various real-time protocols have also been developed for use on the Internet, such as RTP (real-time transport protocol). Here packets are time-stamped and may be reassembled in the correct order and synchronised with a receiver clock. RTP does not guarantee quality of service or reserve bandwidth but this can be handled by a protocol known as RSVP (resource reservation protocol). RTSP is the real-time streaming protocol that manages more sophisticated functionality for streaming media servers and players, such as stream control (play, stop, fast-forward, etc.) and multicast (streaming to numerous receivers).
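The fixed part of an RTP header is only twelve bytes long and can be sketched as follows. Payload type 10 is the value registered for 16-bit linear PCM stereo at 44.1 kHz; the other field values below are arbitrary examples.

# Minimal fixed RTP header (12 bytes) as defined in RFC 3550:
# version, payload type, sequence number, timestamp and SSRC.

import struct

def rtp_header(seq, timestamp, ssrc, payload_type=10, marker=0):
    byte0 = (2 << 6)                              # version 2, no padding/extension, CC=0
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

print(len(rtp_header(seq=1, timestamp=0, ssrc=0x1234)))   # 12 bytes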

6.6.8 Wireless networks

Increasing use is made of wireless networks these days, the primary advantage being the lack of need for a physical connection between devices. There are various IEEE 802 standards for wireless networking, including 802.11, which covers wireless Ethernet or ‘Wi-Fi’. These typically operate on either the 2.4 GHz or 5 GHz radio frequency bands, at relatively low power, and use various interference reduction and avoidance mechanisms to enable networks to coexist with other services. It should, however, be recognised that wireless networks will never be as reliable as wired networks owing to the differing conditions under which they operate, and that any critical applications in which real-time streaming is required would do well to stick to wired networks, where the chances of experiencing drop-outs owing to interference or RF fading are almost non-existent. They are, however, extremely convenient for mobile applications and when people move around with computing devices, enabling reasonably high data rates to be achieved with the latest technology.

Bluetooth is one example of a wireless personal area network (WPAN) designed to operate over limited range at data rates of up to 1 Mbit s–1. Within this there is the capacity for a number of channels of voice quality audio at data rates of 64 kbit s–1 and asynchronous channels up to 723 kbit s–1. Taking into account the overhead for communication and error protection, the actual data rate achievable for audio communication is usually only sufficient to transfer data-reduced audio for a few channels at a time.

6.7 Streaming audio over computer interfaces

Desktop computers and consumer equipment are increasingly equipped with general purpose serial data interfaces such as USB (universal serial bus) and FireWire (IEEE 1394). These have a high enough data rate to carry a number of channels of audio data over relatively short distances, either over copper or optical fibre. Audio protocols exist for these, as described below. There are also a number of protocols designed to enable audio to be streamed in real time over general-purpose data networks such as Ethernet and ATM.

6.7.1 Audio over Firewire (IEEE 1394)

Firewire is an international standard serial data interface specified in IEEE 1394-1995. One of its key applications has been as a replacement for SCSI (Small Computer Systems Interface) for connecting disk drives and other peripherals to computers. It is extremely fast, running at rates of 100, 200 and 400 Mbit s–1 in its original form, with higher rates appearing all the time up to 3.2 Gbit s–1. It is intended for optical fibre or copper interconnection, the copper 100 Mbit s–1 (S100) version being limited to 4.5 m between hops (a hop is the distance between two adjacent devices). The S100 version has a maximum realistic data capacity of 65 Mbit s–1, a maximum of 16 hops between nodes and no more than 63 nodes on up to 1024 separate buses. On the copper version there are three twisted pairs – data, strobe and power – and the interface operates in half duplex mode, which means that communications in two directions are possible, but only one direction at a time. The ‘direction’ is determined by the current transmitter which will have arbitrated for access to the bus. Connections are ‘hot pluggable’ with auto-reconfiguration – in other words one can connect and disconnect devices without turning off the power and the remaining system will reconfigure itself accordingly. It is also relatively cheap to implement.

Unlike, for example, the AES3 audio interface, data and clock (strobe) signals are separated. A clock signal can be derived by exclusive-OR-ing the data and strobe signals, as shown in Figure 6.25. Firewire combines features of network and point-to-point interfaces, offering both asynchronous and isochronous communication modes, so guaranteed latency and bandwidth are available if needed for time-critical applications. Communications are established between logical addresses, and the end point of an isochronous stream is called a ‘plug’. Logical connections between devices can be specified as either ‘broadcast’ or ‘point-to-point’. In the broadcast case either the transmitting or receiving plug is defined, but not both, and broadcast connections are unprotected in that any device can start and stop them. A primary advantage for audio applications is that point-to-point connections are protected – only the device that initiated a transfer can interfere with that connection, so once established the data rate is guaranteed for as long as the link remains intact. The interface can be used for real-time multichannel audio interconnections, file transfer, MIDI and machine control, carrying digital video, carrying any other computer data and connecting peripherals (e.g. disk drives).
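The data/strobe scheme can be illustrated with a toy encoder: the strobe line changes state only when the data line does not, so the exclusive-OR of the two toggles once per bit period and can serve as the recovered clock. This is a sketch of the principle only, not of the electrical interface.

# Toy illustration of 1394 data/strobe (DS) signalling.

def ds_encode(bits):
    strobe, prev = 0, 0
    out = []
    for b in bits:
        if b == prev:
            strobe ^= 1          # toggle strobe when data does not change
        out.append((b, strobe))
        prev = b
    return out

pairs = ds_encode([1, 1, 0, 1, 0, 0])
clock = [d ^ s for d, s in pairs]
print(clock)                     # alternates 1, 0, 1, 0, ... once per bit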

Data is transferred in packets within a cycle of defined time (125 μs) as shown in Figure 6.26. The data is divided into 32 bit ‘quadlets’ and isochronous packets (which can be time stamped for synchronisation purposes) consist of between 1 and 256 quadlets (1024 bytes). Packet headers contain data from a cycle time register that allows for sample accurate timing to be indicated. Resolutions down to about 40 nanoseconds can be indicated. One device on the bus acts as a bus master, initiating each cycle with a cycle start packet. Subsequently devices having isochronous packets to transmit do so, with short gaps between the packets, followed by a longer subaction gap after which any asynchronous information is transmitted.

image

Figure 6.25 Data and strobe signals on the 1394 interface can be exclusive-or’ed to create a clock signal

image

Figure 6.26 Typical arrangement of isochronous and asynchronous packets within a 1394 cycle

Originating partly in Yamaha’s ‘m-LAN’ protocol, the 1394 Audio and Music Data Transmission Protocol is now also available as an IEC PAS component of the IEC 61883 standard (a PAS is a publicly available specification that is not strictly defined as a standard but is made available for information purposes by organisations operating under given procedures). It offers a versatile means of transporting digital audio and MIDI control data. It specifies that devices operating this protocol should be capable of the ‘arbitrated short bus reset’ function which ensures that audio transfers are not interrupted during bus resets. Those wishing to implement this protocol should, of course, refer directly to the standard, but a short summary of some of the salient points is given here.

The complete model for packetising audio data so that it can be transported over the 1394 interface is complex and very hard to understand, but some applications make the overall structure seem more transparent, particularly if the audio samples are carried in a simple ‘AM824’ format, each quadlet of which has an 8-bit label and 24 bits of data. The model is layered as shown in Figure 6.27 in such a way that audio applications generate data that is formed (adapted) into blocks or clusters with appropriate labels and control information such as information about the nominal sampling frequency, channel configuration and so forth. Each block contains the information that arrives for transmission within one audio sample period, so in a surround sound application it could be a sample of data for each of six channels of audio plus related control information. The blocks, each representing ‘events’, are then ‘packetised’ for transmission over the interface. The so-called ‘CIP layer’ is the common isochronous packet layer that is the transport stream of 1394. Each isochronous packet has a header that is two quadlets long, defining it as an isochronous packet and indicating its length, and a two-quadlet CIP header that describes the following data as audio/music data and indicates (among other things) the presentation time of the event for synchronisation purposes. A packet can contain more than one audio event and this becomes obvious when one notices that the cycle time of 1394 (the time between consecutive periods in which a packet can be transmitted) is normally 125 μs and an audio sample period at 48 kHz is only about 21 μs.

1394 can carry audio data in IEC 60958 format (see Section 6.5.3). This is based on the AM824 data structure in which the 8-bit label serves as a substitute for the preamble and VUCP data of the IEC subframe, as shown in Figure 6.28. The following 24 bits of data are then simply the audio data component of the IEC subframe. The two subframes forming an IEC frame are transmitted within the same event and each has to have the 8-bit label at the start of the relevant quadlet (indicating left or right channel).
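A single AM824 quadlet is easy to visualise: an 8-bit label in the most significant byte followed by 24 bits of data. The sketch below packs one such quadlet; the label value used is a placeholder, not the value defined in the standard.

# Sketch of packing an AM824 quadlet: an 8-bit label plus 24 bits of data,
# here carrying the audio portion of an IEC 60958 subframe.

def am824_quadlet(label_8bit, audio_24bit):
    return ((label_8bit & 0xFF) << 24) | (audio_24bit & 0xFFFFFF)

q = am824_quadlet(label_8bit=0x40, audio_24bit=0x123456)   # 0x40 is illustrative only
print(hex(q))   # 0x40123456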

image

Figure 6.27 Example of layered model of 1394 audio/music protocol transfer

image

Figure 6.28 AM824 data structure for IEC 60958 audio data on 1394 interface. Other AM824 data types use a similar structure but the label values are different to that shown here

The same AM824 structure can be used for carrying other forms of audio data including multibit linear audio (a raw audio data format used in some DVD applications, termed MBLA), high resolution MBLA, 1-bit audio (e.g. DSD), MIDI, SMPTE timecode and sample count or ancillary data. These are indicated by different 8-bit labels. One-bit audio can be either raw or DST (Direct Stream Transfer) encoded. DST is a lossless data reduction system employed in Direct Stream Digital equipment and Super Audio CD.

Audio data quadlets in these different modes can be clustered into compound data blocks. As a rule a compound data block contains samples from a number of related streams of audio and ancillary information that are based on the same sampling frequency table (see Section 6.7.2). The parts of these blocks can be application specific or unspecific. In general, compound blocks begin with an unspecified region (although this is not mandatory) followed by one or more application-specific regions (see Figure 6.29). The unspecified region can contain audio/music content data and it is recommended that this always starts with basic two-channel stereo data in either IEC or raw audio format, followed by any other unspecified content data in a recommended order. An example of an application-specific part is the transfer of multiple synchronous channels from a DVD player. Here ancillary data quadlets indicate the starts of blocks and control factors such as downmix values, multichannel type (e.g. different surround modes), dynamic range control and channel assignment. An example of such a multichannel cluster is shown in Figure 6.30.

image

Figure 6.29 General structure of a compound data block

6.7.2 Audio over universal serial bus (USB)

The Universal Serial Bus is not the same as IEEE 1394, but it has some similar implications for desktop multimedia systems, including audio peripherals. USB has been jointly supported by a number of manufacturers including Microsoft, Digital, IBM, NEC, Intel and Compaq. Version 1.0 of the copper interface runs at a lower speed than 1394 (typically either 1.5 or 12 Mbit s–1) and is designed to act as a low-cost connection for multiple input devices to computers such as joysticks, keyboards, scanners and so on. USB 2.0 runs at a higher rate up to 480 Mbit s–1 and is supposed to be backwards-compatible with 1.0.

USB 1.0 supports up to 127 devices for both isochronous and asynchronous communication and can carry data over distances of up to 5 m per hop (similar to 1394). A hub structure is required for multiple connections to the host connector. Like 1394 it is hot pluggable and reconfigures the addressing structure automatically, so when new devices are connected to a USB setup the host device assigns a unique address. Limited power is available over the interface and some devices are capable of being powered solely using this source – known as ‘bus-powered’ devices – which can be useful for field operation of, say, a simple A/D convertor with a laptop computer.

image

Figure 6.30 Specific example of an application-specific data block for multichannel audio transfer from a DVD player

Data transmissions are grouped into frames of 1 ms duration in USB 1.0 but a ‘micro-frame’ of 1/8 of 1 ms was also defined in USB 2.0. A start-of-frame packet indicates the beginning of a cycle and the bus clock is normally at 1 kHz if such packets are transmitted every millisecond. So the USB frame rate is substantially slower than the typical audio sampling rate. The transport structure and different layers of the network protocol will not be described in detail as they are long and complex and can be found in the USB 2.0 specification. However, it is important to be aware that transactions are set up between sources and destinations over so-called ‘pipes’ and that numerous ‘interfaces’ can be defined and run over a single USB cable, limited only by the available bandwidth.

The way in which audio is handled on USB is well defined and somewhat more clearly explained than the 1394 audio/music protocol. It defines three types of communication: audio control, audio streaming and MIDI streaming. We are concerned primarily with audio streaming applications. Audio data transmissions fall into one of three types. Type 1 transmissions consist of channel-ordered PCM samples in consecutive sub-frames, while Type 2 transmissions typically contain non-PCM audio data that does not preserve a particular channel order in the bitstream, such as certain types of multichannel data-reduced audio stream. Type 3 transmissions are a hybrid of the two such that non-PCM data is packed into pseudo-stereo data words in order that clock recovery can be made easier. This method is in fact very much the same as the way data-reduced audio is packed into audio subframes within the IEC 61937 format described earlier in this chapter, and follows much the same rules.

Audio samples are transferred in subframes, each of which can be 1–4 bytes long (up to 24 bits resolution). An audio frame consists of one or more subframes, each of which represents a sample of a different channel in the cluster (see below). As with 1394, a USB packet can contain a number of frames in succession, each containing a cluster of subframes. Frames are described by a format descriptor header that contains a number of bytes describing the audio data type, number of channels, subframe size, as well as information about the sampling frequency and the way it is controlled (for Type 1 data). An example of a simple audio frame would be one containing only two subframes of 24-bit resolution for stereo audio.

Audio of a number of different types can be transferred in Type 1 transmissions, including PCM audio (twos complement, fixed point), PCM-8 format (compatible with original 8-bit WAV, unsigned, fixed point), IEEE floating point, A-law and μ-law (companded audio corresponding to relatively old telephony standards). Type 2 transmissions typically contain data-reduced audio signals such as MPEG or AC-3 streams. Here the data stream contains an encoded representation of a number of channels of audio, formed into encoded audio frames that relate to a large number of original audio samples. An MPEG encoded frame, for example, will typically be longer than a USB packet (a typical MPEG frame might be 8 or 24 ms long), so it is broken up into smaller packets for transmission over USB rather like the way it is streamed over the IEC 60958 interface described in Section 6.5.4. The primary rule is that no USB packet should contain data for more than one encoded audio frame, so a new encoded frame should always be started in a new packet. The format descriptor for Type 2 is similar to Type 1 except that it replaces subframe size and number of channels indication with maximum bit rate and number of audio samples per encoded frame. Currently only MPEG and AC-3 audio are defined for Type 2.

Rather like the compound data blocks possible in 1394 (see above), audio data for closely related synchronous channels can be clustered for USB transmission in Type 1 format. Up to 254 streams can be clustered and there are 12 defined spatial positions for reproduction, to simplify the relationship between channels and the loudspeaker locations to which they relate. (This is something of a simplification of the potentially complicated formatting of spatial audio signals and assumes that channels are tied to loudspeaker locations, but it is potentially useful. It is related to the channel ordering of samples within a WAVE format extensible file, described earlier.) The first six defined streams follow the internationally standardised order of surround sound channels for 5.1 surround, that is left, right, centre, LFE (low frequency enhancement), left surround, right surround. Subsequent streams are allocated to other loudspeaker locations around a notional listener. Not all the spatial location streams have to be present but they are supposed to be presented in the defined order. Clusters are defined in a descriptor field that includes ‘bNrChannels’ (specifying how many logical audio channels are present in the cluster) and ‘wChannelConfig’ (a bit field that indicates which spatial locations are present in the cluster). If the relevant bit is set then the relevant location is present in the cluster. The bit allocations are shown in Table 6.9.
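Decoding the wChannelConfig bit field is straightforward, as the sketch below shows using the spatial locations of Table 6.9; a 5.1 cluster, for example, sets bits D0–D5.

# Sketch of decoding the wChannelConfig bit field of a USB audio cluster
# descriptor using the spatial locations listed in Table 6.9.

SPATIAL_LOCATIONS = [
    "Left Front (L)", "Right Front (R)", "Center Front (C)",
    "Low Frequency Enhancement (LFE)", "Left Surround (LS)",
    "Right Surround (RS)", "Left of Center (LC)", "Right of Center (RC)",
    "Surround (S)", "Side Left (SL)", "Side Right (SR)", "Top (T)",
]

def decode_channel_config(wChannelConfig):
    return [name for bit, name in enumerate(SPATIAL_LOCATIONS)
            if wChannelConfig & (1 << bit)]

# A 5.1 cluster sets D0-D5:
print(decode_channel_config(0x003F))
# -> L, R, C, LFE, LS and RS locations present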

6.7.3 AES 47: Audio over ATM

AES 47 defines a method by which linear PCM data, either conforming to AES 3 format or not, can be transferred over ATM. There are various arguments for doing this, not the least being the increasing use of ATM-based networks for data communications within the broadcasting industry and the need to route audio signals over longer distances than possible using standard digital interfaces. There is also a need for low latency, guaranteed bandwidth and switched circuits, all of which are features of ATM. Essentially an ATM connection is established in a similar way to making a telephone call. A SETUP message is sent at the start of a new ‘call’ that describes the nature of the data to be transmitted and defines its vital statistics. The AES 47 standard describes a specific professional audio implementation of this procedure that includes information about the audio signal and the structure of audio frames in the SETUP at the beginning of the call.

Table 6.9 Channel identification in USB audio cluster descriptor

Data bit   Spatial location
D0         Left Front (L)
D1         Right Front (R)
D2         Center Front (C)
D3         Low Frequency Enhancement (LFE)
D4         Left Surround (LS)
D5         Right Surround (RS)
D6         Left of Center (LC)
D7         Right of Center (RC)
D8         Surround (S)
D9         Side Left (SL)
D10        Side Right (SR)
D11        Top (T)
D12–D15    Reserved


Figure 6.31 General audio subframe format of AES 47

Bytes are termed octets in ATM terminology (the term avoids any ambiguity about word length), so this section will follow that convention. Audio data is divided into subframes, and each subframe contains one sample of audio as well as optional ancillary data and protocol overhead, as shown in Figure 6.31. The SETUP message at the start of the call determines the audio mode and whether or not this additional data is present. The subframe should occupy a whole number of octets and the length of the audio sample should be such that the subframe is 8, 16, 24, 32 or 48 bits long. The ancillary data field, if present, is normally used for carrying the V, U and C bits from the AES 3 subframe, along with a B bit that replaces the P (parity) bit of the AES 3 subframe (parity having little relevance in this application). The B bit in the ‘1’ state indicates the start of an AES 3 channel status block, taking the place of the Z preamble that is no longer present. This data is transmitted in the order B, C, U, V.
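The normative bit layout of the subframe is that shown in Figure 6.31 and defined in AES 47. Purely to illustrate the idea, the following Python sketch packs a 20-bit sample and a four-bit BCUV ancillary field into a three-octet (24-bit) subframe; the bit positions chosen here are an assumption for illustration only.

    def make_subframe(sample_20bit, b, c, u, v):
        """Illustrative packing of a 20-bit audio sample plus B, C, U and V
        ancillary bits into a 24-bit subframe (three octets). Bit positions
        are assumed here; the normative layout is defined in AES 47."""
        ancillary = (b << 3) | (c << 2) | (u << 1) | v    # transmitted in the order B, C, U, V
        word = ((sample_20bit & 0xFFFFF) << 4) | ancillary
        return word.to_bytes(3, 'big')

    # Example: B = 1 marks the start of an AES 3 channel status block
    print(make_subframe(0x12345, b=1, c=0, u=0, v=0).hex())   # '123458'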


Figure 6.32 Packing of audio subframes into ATM cells. (a) Example of temporal ordering with two channels, left and right. ‘a’, ‘b’, ‘c’, etc., are successive samples in time for each channel. Co-temporal samples are grouped together. (b) Example of multichannel packing whereby concurrent samples from a number of channels are arranged sequentially. (c) Example of ordering by channel, with a number of samples from the same channel being grouped together. (If the number of channels is the same as the number of samples per cell, all three methods turn out to be identical.)

Table 6.10 Audio packing within ATM cells – options in AES 47


* This should be signalled within the second and third octets of the user-defined AAL part of the SETUP message that is an optional part of the ATM protocol for setting up calls between sources and destinations.

Samples are packed into the ATM cell either ordered in time, in multichannel groups or by channel, as shown in Figure 6.32. Only certain combinations of channels and data formats are allowed and all the channels within the stream have to have the same resolution and sampling frequency, as shown in Table 6.10.
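The difference between the packing orders of Figure 6.32 is easier to see with concrete sample labels. In the Python sketch below the ‘samples’ are just strings; a real ATM cell carries a fixed 48-octet payload.

    channels = {'L': ['La', 'Lb', 'Lc', 'Ld'],
                'R': ['Ra', 'Rb', 'Rc', 'Rd']}

    # (a)/(b) Temporal or multichannel ordering: co-temporal samples grouped together
    temporal = [s for group in zip(*channels.values()) for s in group]
    # ['La', 'Ra', 'Lb', 'Rb', 'Lc', 'Rc', 'Ld', 'Rd']

    # (c) Ordering by channel: a run of samples from one channel, then the next
    by_channel = [s for samples in channels.values() for s in samples]
    # ['La', 'Lb', 'Lc', 'Ld', 'Ra', 'Rb', 'Rc', 'Rd']

With two channels and four samples per cell the orderings clearly differ; as the figure caption notes, the methods coincide when the number of channels equals the number of samples per cell.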

Four octets in the user-defined AAL part of the SETUP message that begins a new ATM call define aspects of the audio communication that will take place. The first octet contains so-called ‘qualifying information’, of which only bit 4 is currently specified, indicating that the sampling frequency is locked to some global reference. The second octet indicates the subframe format and sample length, while the third specifies the packing format. The fourth octet contains information about the audio sampling frequency (32, 44.1 or 48 kHz), its scaling factor (from 0.25 up to 8 times) and a multiplication factor (e.g. 1/1.001 or 1.001/1 for ‘pull-down’ or ‘pull-up’ modes). It also carries limited information for varispeed rates.
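In effect the fourth octet describes the sampling frequency as a base rate modified by the scaling and multiplication factors. The calculation it implies is simply the following (illustrative Python only; the actual bit encoding of the octet is defined in AES 47, not here):

    def effective_sampling_rate(base_hz, scaling, multiplier):
        """Base rate (32 000, 44 100 or 48 000 Hz) modified by a scaling factor
        (0.25 up to 8 times) and a multiplication factor such as 1/1.001
        ('pull-down') or 1.001 ('pull-up')."""
        return base_hz * scaling * multiplier

    print(effective_sampling_rate(48000, 2, 1))          # 96000: double-rate audio
    print(effective_sampling_rate(44100, 1, 1/1.001))    # about 44056: 'pull-down' 44.1 kHz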

6.7.4 CobraNet

CobraNet is a proprietary audio networking technology developed by Peak Audio, a division of Cirrus Logic. It is designed for carrying audio over conventional Fast Ethernet networks (typically 100 Mbit s–1), preferably using a dedicated Ethernet for audio purposes or using a switched Ethernet network. Switched Ethernet acts more like a telephone or ATM network where connections are established between specific sources and destinations, with no other data sharing that ‘pipe’. For the reasons stated earlier in this chapter, Ethernet is not ideal for audio communications without some provisos being observed. CobraNet, however, implements a method of arbitration, bandwidth reservation and an isochronous transport protocol that enables it to be used successfully.

The CobraNet protocol has been allocated its own protocol identifier at the data link layer of the ISO seven-layer network model, so it does not use Internet Protocol (IP) for audio transport (IP is relatively inefficient for audio streaming purposes and involves too much overhead). Because it does not use IP it is not particularly suitable for wide area network (WAN) operation and would typically be operated over a local area network (LAN). It does, however, enable devices to be allocated IP addresses using the BOOTP (boot protocol) process, and it supports the use of IP and UDP (user datagram protocol) for purposes other than the carrying of audio. It is capable of transmitting packets in isochronous cycles, each packet transferring data for a ‘bundle’ of audio channels. Each bundle contains between 0 and 8 audio channels, and bundles can be either unicast or multicast. Unicast bundles are intended for a single destination, whereas multicast bundles are ‘broadcast’ transmissions in which the sending device transmits packets regardless of whether any receiving device is set up to receive them.

6.7.5 MAGIC

MAGIC (Media-accelerated Global Information Carrier) was developed by the Gibson Guitar Corporation, originally under the name GMICS. It is a relatively recent audio interface that typically uses the Ethernet physical layer for transporting audio between devices, although it is not compatible with the higher layers and does not appear to interoperate with conventional Ethernet data networks. It uses its own application and data link layers; the latter is based on the Ethernet 802.3 data link layer and uses a frame header that would be recognised by 802.3-compatible devices.

Although it is not limited to doing so, the described implementation uses 100 Mbit s–1 Fast Ethernet over standard CAT 5 cables, using four of the wires in a conventional Ethernet crossover implementation and the other four for power to devices capable of operating on limited power (9 volt, 500 mA). Data is formed into frames of 55 bytes, including relevant headers, and transmitted at a synchronous rate between devices. The frame rate is related to the audio sampling rate and a sampling clock can be recovered from the interface. Very low latency of 10–40 μs is claimed. MAGIC data can be daisy-chained between devices in a form more akin to point-to-point audio interfacing than computer networking, although routing and switching configurations are also possible using routing or switching hubs.

6.7.6 MOST

MOST (Media Oriented Systems Transport) is an alternative network protocol designed to carry synchronous, asynchronous and control data over a low-cost optical fibre network. It is claimed that the technology sits between USB and IEEE 1394 in terms of performance, and that MOST has certain advantages in the transfer of synchronous data produced by multimedia devices that are not well catered for in other protocols. It is also claimed that interfaces based on copper connections are prone to electromagnetic interference and that the optical fibre interface of this system provides immunity to such interference, in addition to allowing distances of up to 250 m between nodes.

MOST specifies the physical, data link and network layers of the OSI reference model for data networks, and dedicated silicon has been developed for the physical layer. Data is transferred in 64-byte frames and the frame rate follows the audio sampling rate in use by the connected devices, giving a total bit rate of about 22.5 Mbit s–1 at a 44.1 kHz audio sampling rate. The bandwidth can be divided between synchronous and asynchronous data. Potential applications described include professional audio, for transferring up to 15 stereo 16-bit audio channels or 10 stereo channels of 24-bit audio; consumer electronics, as an alternative to SPDIF at similar cost; and automotive and home multimedia networking.
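The quoted bit rate follows directly from the frame size if one 64-byte frame is sent per audio sample period, which is what tying the frame rate to the sampling rate implies. A quick check in Python:

    frame_bytes = 64
    frames_per_second = 44100                      # one frame per sample period at 44.1 kHz
    bit_rate = frame_bytes * 8 * frames_per_second
    print(bit_rate)                                # 22579200 bit/s, i.e. roughly the 22.5 Mbit/s quoted above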

There is now a detailed specification framework for MOST (see Further reading) and it is the subject of a cooperation agreement between a number of manufacturers. It seems to have been most widely adopted in the automotive industry where it is close to being endorsed by a consortium of car makers.

6.7.7 BSS SoundWeb

BSS developed its own audio network interface for use with its SoundWeb products that are typically used in large venue installations and for live sound. It uses CAT 5 cabling over distances up to 300 m, but is not based on Ethernet and behaves more like a token ring network. Data is carried at a rate of about 12 Mbit s–1 and transports eight audio channels along with control information.

6.8 Digital content protection

Copy protection of digital content is increasingly required by the owners of intellectual property, and data encryption is now regarded as the most appropriate way of securing such content against unwanted copying. The SCMS method used for copy protection on older interfaces such as IEC 60958 involved the use of two bits plus category codes to indicate the copy permission status of content, but no further attempt was made to make the audio content unreadable or to scramble it in the case of non-permitted transfers. A group of manufacturers known as 5C (the five companies Hitachi, Intel, Matsushita, Sony and Toshiba) has now defined a method of digital transmission content protection that is initially specified for IEEE 1394 transfers (see Section 6.7.1) but is likely to be extended to other means of interconnection between equipment. It is written in a relatively generic sense, but the packet header descriptions currently refer directly to 1394 implementations. The 1394 interface is increasingly used on high-end consumer digital products for content transfer, although it has not been seen much on DVD and SACD players yet because the encryption model has only recently been agreed. There has also been the issue of content watermarking to resolve.

Table 6.11 Copy state indication in EMI bits of 1394 header

EMI bit states   Copy state                     Authentication required
11               Copy never (Mode A)            Full
10               Copy one generation (Mode B)   Restricted or full
01               No more copies (Mode C)        Restricted or full
00               Copy freely (Mode D)           None (not encrypted)

Content protection is managed in this model by means of both embedded copy control information (CCI) and by using two bits in the header of isochronous data packets (the so-called EMI or encryption mode indicator bits). Embedded CCI is that contained within the application-specific data stream itself. In other words it could be the SCMS bits in the channel status of IEC 60958 data or it could be the copy control information in an MPEG transport stream. This can only be accessed once a receiving device has decrypted the data that has been transmitted to it. In order that devices can inspect the copy status of a stream without decrypting the data, the packet header containing the EMI bits is not encrypted. Two EMI bits allow four copy states to be indicated as shown in Table 6.11.
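Because the packet header is left unencrypted, a device can inspect the EMI field without holding any keys. The mapping of Table 6.11 might be expressed as follows (an illustrative Python sketch; extracting the two bits from a real 1394 packet header is not shown):

    # Copy states from Table 6.11, indexed by the two EMI bits
    EMI_STATES = {
        0b11: ('Copy never (Mode A)', 'full authentication required'),
        0b10: ('Copy one generation (Mode B)', 'restricted or full authentication'),
        0b01: ('No more copies (Mode C)', 'restricted or full authentication'),
        0b00: ('Copy freely (Mode D)', 'no authentication, not encrypted'),
    }

    def copy_state(emi_bits):
        """Map the two (unencrypted) EMI header bits to a copy state."""
        return EMI_STATES[emi_bits & 0b11]

    print(copy_state(0b10))    # ('Copy one generation (Mode B)', ...)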

The authentication requirement indicated by the copy state initiates a negotiation between the source and receiver that sets up an encrypted transfer using an exchanged key. The full details of this are beyond the scope of this book and require an advanced understanding of cryptography, but it is sufficient to say that full authentication involves more advanced cryptographic techniques than restricted authentication (which is intended for implementation on equipment with limited computational resources, or where copy protection is not a major concern). If the negotiation succeeds, content can be transferred in encrypted form and decrypted by the authenticated receiving device. Embedded CCI can then be accessed from within the content stream.

Where there is a conflict between embedded CCI and the EMI indication, as there might be during a stream (for example, when different songs on a CD have different CCI but the EMI setting remains constant throughout the stream), it is recommended that the EMI setting is the strictest of those that will be encountered in the transfer concerned. The embedded CCI, however, appears to have the final say in deciding whether the receiving device may record the stream. For example, even if EMI indicates ‘copy never’, the receiving device can still record an item if its embedded CCI indicates that it is recordable. This ensures that a stream is as secure as it needs to be, and the transfer properly authenticated, before any decisions can be made by the receiving device about specific items within the stream.

Certain AM824 audio applications (a specific form of 1394 Audio/Music Protocol interchange) have defined relationships between copy states and SCMS states, for easy translation when carrying data such as IEC 60958 data over 1394. In this particular case the EMI ‘copy never’ state is not used and the SCMS states are mapped onto the three remaining EMI states. For DVD applications the application-specific CCI is indicated in ancillary data and a mapping table specifies various relationships between this data and the indicated copy states. The mapping depends to some extent on the quality of the transmitted data and whether or not it matches that indicated in the audio_quality field of the ancillary data. (Typically DVD players have allowed single-generation home copies of audio material over IEC 60958 interfaces at basic sampling rates, e.g. 48 kHz, but not at very high quality rates such as 96 kHz or 192 kHz.) SuperAudio CD applications currently have only one copy state defined, ‘no more copies’, presumably to prevent anyone duplicating the 1-bit stream, which would have the same quality as the master recording.

Further reading

1394 Trade Association (2001) TA Document 2001003: Audio and Music Data Transmission Protocol 2.0.

5C (2001) Digital transmission content protection specification, Volume 1 (informational version). Revision 1.2. Available from: www.dtcp.com/.

AES (2002) AES 47-2002: Transmission of digital audio over asynchronous transfer mode networks.

Bailey, A. (2001) Network Technology for Digital Audio. Focal Press.

Gibson Guitar Corporation (2002) Media-accelerated Global Information Carrier. Engineering Specification Version 2.4. Available from: www.gibsonmagic.com.

IEC (1998) IEC/PAS 61883-6. Consumer audio/video equipment – Digital interface – Part 6: Audio and music data transmission protocol.

IEEE (1995) IEEE 1394: Standard for a high performance serial bus.

Oasis Technology (1999) MOST Specification Framework v1.1. Available from: www.oasis.com/technology/index.htm.

Page, M., Bentall, N., Cook, G. et al. (2002) Multichannel audio connection for Direct Stream Digital. Presented at AES 113th Convention, Los Angeles, Oct 5–8.

Philips (2002) Direct Stream Digital Interchange File Format: DSD-IFF version 1.4, revision 2. Available from: www.superaudiocd.philips.com.

Philips (2002) Recommended usage of DSD-IFF, version 1.4. Available from: www.superaudiocd.philips.com.

Rumsey, F. and Watkinson, J. (2004) The Digital Interface Handbook, 3rd edition. Focal Press.

USB (1998) Universal serial bus: device class definition for audio devices, v1.0.

Useful websites

Audio Engineering Society: www.aes.org

IEEE 1394: www.1394ta.org

Universal Serial Bus: www.usb.org
