5

Carrying Real-Time Audio Over Computer Interfaces

There is an increasing trend towards employing standard computer interfaces and networks to transfer audio information, as opposed to dedicated audio interfaces. Such computer interfaces are typically used for a variety of purposes in general data communications and they may need to be adapted for audio applications that require sample-accurate real-time transfer. The increasing ubiquity of computer systems in audio environments makes it inevitable that generic data communication technology will gradually take the place of dedicated interfaces. It also makes sense economically to take advantage of the ‘mass market’ features of the computer industry.

The applications and protocols described in this chapter are primarily concerned with real-time audio communications or ‘streaming’, rather than file transfer, and the coverage is limited to studio contexts. The reason for this is that these applications are similar to those for which the dedicated audio interfaces described elsewhere in this book would be used. Some examples are given of proprietary technology that addresses the problems of streaming audio over computer networks but not every proprietary solution is covered in detail. The wider issue of Internet audio streaming for the consumer distribution of music or broadcasts is not covered in any detail. Those interested in more detailed aspects of network applications in digital audio are referred to Andy Bailey’s book Network Technology for Digital Audio [1].

5.1  Introduction to Carrying Audio Over Computer Interfaces

Dedicated audio interfaces mostly carry audio data in a sample-clock-synchronized fashion, often with an embedded sample clock signal, carrying little or no other data than that required to move audio samples for one or more channels between devices. They behave like ‘digital wires’, acting as the digital equivalent of an analog signal cable, connecting one point in a system directly to another. Computer interfaces, on the other hand, are typically general-purpose data carriers that may have asynchronous features and may not always have the inherent quality-of-service features that are required for ‘streaming’ applications. They also normally use an addressing structure that enables packets of data to be carried from one of a number of sources to one of a number of destinations and such packets will share the connection in a more or less controlled way. Data transport protocols such as TCP/IP are often used as a universal means of managing the transfer of data from place to place, adding overheads in terms of data rate, delay and error handling that may work against the efficient transfer of audio. Data interfaces may be intended primarily for file transfer applications where the time taken to transfer the file is not a crucial factor – as fast as possible will do.

Conventional office Ethernet is a good example of a computer network interface that has limitations in respect of audio streaming. The original 10 Mbit/s data rate was quite slow, although theoretically capable of handling a number of channels of real-time audio data. If employed between only two devices and used with a low-level protocol such as UDP (user datagram protocol) audio can be streamed quite successfully, but problems can arise when multiple devices contend for use of the bus and where the network is used for general-purpose data communications in addition to audio streaming. There is no guarantee of a certain quality of service, because the bus is a sort of ‘free for all’, ‘first-come-first-served’ arrangement that is not designed for real-time applications. To take a simple example, if one’s colleague attempts to download a huge file from the Internet just when one is trying to stream a broadcast live to air in a local radio station, using the same data network, the chances are that one’s broadcast will drop out occasionally.

One can partially address such limitations in a crude way by throwing data-handling capacity at the problem, hoping that increasing the network speed to 100 Mbit/s or even 1 Gbit/s will avoid it ever becoming overloaded. Circuit-switched networks can also be employed to ease these problems (that is networks where individual circuits are specifically established between sources and destinations). Unless capacity can be reserved and service quality guaranteed a data network will never be a suitable replacement for dedicated audio interfaces in critical environments such as broadcasting stations. This has led to the development of real-time protocols and/or circuit-switched networks for handling audio information on data interfaces, in which latency (delay) and bandwidth are defined and guaranteed. The audio industry can benefit from the increased data rates, flexibility and versatility of general-purpose interfaces provided that these issues are taken seriously.

Desktop computers and consumer equipment are also increasingly equipped with general-purpose serial data interfaces such as USB (universal serial bus) and FireWire (IEEE 1394). These have a high enough data rate to carry a number of channels of audio data over relatively short distances, either over copper or optical fibre. Audio protocols also exist for these, as described below.

5.2  Audio Over Firewire (IEEE 1394)

5.2.1 Basic FireWire Principles

FireWire is an international standard serial data interface specified in IEEE 1394-1995 [2]. One of its key applications has been as a replacement for SCSI (Small Computer Systems Interface) for connecting disk drives and other peripherals to computers. It is extremely fast, running at rates of 100, 200 and 400 Mbit/s in its original form, with higher rates, up to 3.2 Gbit/s, appearing all the time. It is intended for optical fibre or copper interconnection, the copper 100 Mbit/s (S100) version being limited to 4.5 m between hops (a hop is the distance between two adjacent devices). The S100 version has a maximum realistic data capacity of 65 Mbit/s, a maximum of 16 hops between nodes and no more than 63 nodes on up to 1024 separate buses. On the copper version there are three twisted pairs – data, strobe and power – and the interface operates in half duplex mode, which means that communication in two directions is possible, but only in one direction at a time. The ‘direction’ is determined by the current transmitter, which will have arbitrated for access to the bus. Connections are ‘hot pluggable’ with auto-reconfiguration – in other words one can connect and disconnect devices without turning off the power and the remaining system will reconfigure itself accordingly. It is also relatively cheap to implement.

Unlike, for example, the AES3 audio interface, data and clock (strobe) signals are separated. A clock signal can be derived by exclusive-OR’ing the data and strobe signals, as shown in Figure 5.1. FireWire combines features of network and point-to-point interfaces, offering both asynchronous and isochronous communication modes, so guaranteed latency and bandwidth are available if needed for time-critical applications. Communications are established between logical addresses, and the end point of an isochronous stream is called a ‘plug’. Logical connections between devices can be specified as either ‘broadcast’ or ‘point-to-point’. In the broadcast case either the transmitting or receiving plug is defined, but not both, and broadcast connections are unprotected in that any device can start and stop it. A primary advantage for audio applications is that point-to-point connections are protected – only the device that initiated a transfer can interfere with that connection, so once established the data rate is guaranteed for as long as the link remains intact. The interface can be used for real-time multichannel audio interconnections, file transfer, MIDI and machine control, carrying digital video, carrying any other computer data and connecting peripherals (e.g. disk drives).

Figure 5.1 Data and strobe signals on the 1394 interface can be exclusive-OR’ed to create a clock signal.
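The data/strobe scheme of Figure 5.1 can be illustrated with a short sketch. The helper names are hypothetical; the sketch assumes the usual DS-encoding rule that the strobe line changes state only when the data line does not, so exactly one of the two lines toggles in every bit period:

```python
def ds_encode_strobe(data_bits):
    """Derive the strobe sequence for a data sequence under data/strobe
    (DS) encoding: the strobe toggles whenever the data stays the same
    between consecutive bit periods."""
    strobe, s = [0], 0
    for prev, cur in zip(data_bits, data_bits[1:]):
        if prev == cur:      # data did not change, so the strobe must
            s ^= 1
        strobe.append(s)
    return strobe

def recover_clock(data_bits, strobe_bits):
    # Exclusive-OR of data and strobe toggles every bit period,
    # yielding a recovered clock as shown in Figure 5.1.
    return [d ^ s for d, s in zip(data_bits, strobe_bits)]
```

Because one line or the other changes each period, the XOR output alternates on every bit, which is exactly a clock at the bit rate.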

Data is transferred in packets within a cycle of defined time (125 μs) as shown in Figure 5.2. The data is divided into 32-bit ‘quadlets’ and isochronous packets (which can be time stamped for synchronization purposes) consist of between 1 and 256 quadlets (1024 bytes). Packet headers contain data from a cycle time register that allows for sample accurate timing to be indicated. Resolutions down to about 40 nanoseconds can be indicated. One device on the bus acts as a bus master, initiating each cycle with a cycle start packet. Subsequently devices having isochronous packets to transmit do so, with short gaps between the packets, followed by a longer subaction gap after which any asynchronous information is transmitted.

Figure 5.2 Typical arrangement of isochronous and asynchronous packets within a 1394 cycle.
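The 125 μs cycles are timed by a 32-bit cycle time register. The following sketch uses the field widths commonly published for IEEE 1394 (7-bit seconds, 13-bit cycle count at 8 kHz, 12-bit offset counting ticks of a 24.576 MHz clock); the tick period of roughly 40 ns matches the resolution quoted above:

```python
def decode_cycle_time(reg):
    """Split a 32-bit 1394 cycle time register into its fields."""
    seconds = (reg >> 25) & 0x7F    # 7-bit seconds count
    cycles = (reg >> 12) & 0x1FFF   # 13-bit cycle count at 8 kHz (125 us cycles)
    offset = reg & 0xFFF            # 12-bit offset in 24.576 MHz ticks
    return seconds, cycles, offset

# One offset tick is about the 40 ns resolution mentioned above:
TICK_NS = 1e9 / 24.576e6   # ~40.69 ns
```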

5.2.2 Audio and Music Data Transmission Protocol

Originating partly in Yamaha’s ‘m-LAN’ protocol, the 1394 Audio and Music Data Transmission Protocol [3] is now also available as an IEC PAS component of the IEC 61883 standard [4] (a PAS is a publicly available specification that is not strictly defined as a standard but is made available for information purposes by organizations operating under given procedures). It offers a versatile means of transporting digital audio and MIDI control data. It specifies that devices operating this protocol should be capable of the ‘arbitrated short bus reset’ function, which ensures that audio transfers are not interrupted during bus resets. Those wishing to implement this protocol should, of course, refer directly to the standard, but a short summary of some of the salient points is given here.

The complete model for packetizing audio data so that it can be transported over the 1394 interface is complex and very hard to understand, but some applications make the overall structure seem more transparent, particularly if the audio samples are carried in a simple ‘AM824’ format, each quadlet of which has an eight-bit label and 24 bits of data. The model is layered as shown in Figure 5.3 in such a way that audio applications generate data that is formed (adapted) into blocks or clusters with appropriate labels and control information such as information about the nominal sampling frequency, channel configuration and so forth. Each block contains the information that arrives for transmission within one audio sample period, so in a surround sound application it could be a sample of data for each of six channels of audio plus related control information. The blocks, each representing ‘events’, are then ‘packetized’ for transmission over the interface. The so-called ‘CIP layer’ is the common isochronous packet layer that is the transport stream of 1394. Each isochronous packet has a header that is two quadlets long, defining it as an isochronous packet and indicating its length, and a two quadlet CIP header that describes the following data as audio/music data and indicates (among other things) the presentation time of the event for synchronization purposes. A packet can contain more than one audio event and this becomes obvious when one notices that the cycle time of 1394 (the time between consecutive periods in which a packet can be transmitted) is normally 125 μs and an audio sample period at 48 kHz is only 22 μs.

Figure 5.3 Example of layered model of 1394 audio/music protocol transfer.
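The quadlet structure and the events-per-packet arithmetic above can be made concrete. A minimal sketch with hypothetical helper names (the label value used in the test is a placeholder, not a value taken from the standard):

```python
def am824_quadlet(label, data24):
    """Form an AM824 quadlet: an 8-bit label followed by 24 bits of data."""
    assert 0 <= label <= 0xFF and 0 <= data24 <= 0xFFFFFF
    return (label << 24) | data24

def events_per_cycle(fs_hz, cycle_us=125):
    """Average number of sample-period 'events' per 125 us isochronous cycle."""
    return fs_hz * cycle_us / 1_000_000
```

At 48 kHz the sample period is about 22 μs, so on average six events fall within each 125 μs cycle, which is why a packet normally carries more than one event.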

1394 can carry audio data in IEC 60958 format (see section 4.3). This is based on the AM824 data structure in which the eight-bit label serves as a substitute for the preamble and VUCP data of the IEC subframe, as shown in Figure 5.4. The following 24 bits of data are then simply the audio data component of the IEC subframe. The two subframes forming an IEC frame are transmitted within the same event and each has to have the eight-bit label at the start of the relevant quadlet (indicating left or right channel).

Figure 5.4 AM824 data structure for IEC 60958 audio data on 1394 interface. Other AM824 data types use a similar structure but the label values are different to that shown here.

The same AM824 structure can be used for carrying other forms of audio data including multibit linear audio (a raw audio data format used in some DVD applications, termed MBLA), high resolution MBLA, one-bit audio (see section 4.12.2), MIDI, SMPTE timecode and sample count or ancillary data. These are indicated by different eight-bit labels. One-bit audio can be either raw or DST (Direct Stream Transfer) encoded. DST is a lossless data reduction system employed in Direct Stream Digital equipment and Super Audio CD.

Audio data quadlets in these different modes can be clustered into compound data blocks. As a rule a compound data block contains samples from a number of related streams of audio and ancillary information that are based on the same sampling frequency table (see next section). The parts of these blocks can be application specific or unspecific. In general, compound blocks begin with an unspecified region (although this is not mandatory) followed by one or more application-specific regions (see Figure 5.5). The unspecified region can contain audio/music content data and it is recommended that this always starts with basic two-channel stereo data in either IEC or raw audio format, followed by any other unspecified content data in a recommended order. An example of an application-specific part is the transfer of multiple synchronous channels from a DVD player. Here ancillary data quadlets indicate the starts of blocks and control factors such as downmix values, multichannel type (e.g. different surround modes), dynamic range control and channel assignment. An example of such a multichannel cluster is shown in Figure 5.6.

Figure 5.5 General structure of a compound data block.

Figure 5.6 Specific example of an application-specific data block for multichannel audio transfer from a DVD player.

5.2.3 Clock Synchronization

It is also possible to transfer information relating to the synchronization of a sample clock using the audio/music protocol over 1394. The instantaneous actual sampling frequency (the rate at which the audio system is running) can be worked out from the time stamps contained in the SYT part of packet headers and the SYT interval (the number of data blocks or sample periods between two successive time stamps). Audio clock information can be derived from this and a receiver clock can be controlled. The SYT time stamp is intended to indicate the time at which the event concerned is to be presented to the receiver and is not supposed to be transmitted at a rate lower than 3.5 kHz under normal circumstances. If there is more than one event in a packet then this usually corresponds to the start of the first one. Professional receivers are supposed to make it possible to use the SYT information to control the presentation time of events, but consumer devices where implementation costs are critical do not have to do this and can ‘free run’.
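The relationship described above — the actual sampling frequency derived from successive time stamps and the SYT interval — is simple division. A sketch with a hypothetical helper name, taking time stamps expressed in seconds:

```python
def actual_sampling_frequency(t0, t1, syt_interval):
    """Estimate the instantaneous sampling frequency from two successive
    SYT time stamps t0 and t1 (in seconds) that are separated by
    syt_interval sample periods (data blocks)."""
    return syt_interval / (t1 - t0)
```

A receiver would low-pass filter a series of such estimates to control its clock rather than act on any single pair of stamps.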

The nominal sampling frequency is usually indicated as a part of the CIP header, this being a formal means of indicating to a receiver what the intended sampling frequency should be (rather like the sampling frequency indicated in channel status of AES3). It is contained within the three LSBs of the format dependent field (FDF) of the CIP header and there are a number of defined tables that show the relationship between values of the sample frequency code (SFC) and the corresponding SYT interval for different transmission modes.
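The SFC-to-frequency mapping can be sketched as a simple lookup. The values below follow the commonly published IEC 61883-6 table of nominal rates and corresponding SYT intervals; the standard itself should be consulted for the normative values:

```python
# SFC code -> (nominal sampling frequency in Hz, SYT interval)
SFC_TABLE = {
    0: (32000, 8), 1: (44100, 8), 2: (48000, 8),
    3: (88200, 16), 4: (96000, 16), 5: (176400, 32), 6: (192000, 32),
}

def nominal_rate(fdf):
    """Look up the nominal rate from the three LSBs (the SFC) of the
    format dependent field of the CIP header."""
    return SFC_TABLE[fdf & 0x07]
```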

5.3  Audio Over Universal Serial Bus (USB)

5.3.1 Basic USB Principles

The universal serial bus is not the same as IEEE 1394, but it has some similar implications for desktop multimedia systems, including audio peripherals. USB has been jointly supported by a number of manufacturers including Microsoft, Digital, IBM, NEC, Intel and Compaq. It is a copper interface that, in its basic version, runs at a lower speed than 1394 (typically either 1.5 or 12 Mbit/s) and is designed to act as a low cost connection to computers for multiple input devices such as joysticks, keyboards, scanners and so on. The data rate is, however, high enough for it to be used for transferring limited audio information if required. A recent revision of the USB standard enables newer interfaces to operate at a high rate of up to 480 Mbit/s.

USB supports up to 127 devices for both isochronous and asynchronous communication and can carry data over distances of up to 5 m per hop (similar to 1394). A hub structure is required for multiple connections to the host connector. Like 1394 it is hot pluggable and reconfigures the addressing structure automatically. When new devices are connected to a USB setup the host device assigns a unique address. Limited power is available over the interface and some devices are capable of being powered solely using this source – known as ‘bus-powered’ devices – which can be useful for field operation of, say, a simple A/D convertor with a laptop computer.

Data transmissions are grouped into frames of 1 ms duration in USB 1.0, and a ‘micro-frame’ of one-eighth of 1 ms was also defined in USB 2.0. A start-of-frame packet indicates the beginning of a cycle, and the bus clock is normally at 1 kHz if such packets are transmitted every millisecond, so the USB frame rate is substantially slower than the typical audio sampling rate. The transport structure and different layers of the network protocol will not be described in detail as they are long and complex and can be found in the USB 2.0 specification [5]. However, it is important to be aware that transactions are set up between sources and destinations over so-called ‘pipes’, and that numerous ‘interfaces’ can be defined and run over a single USB cable, limited only by the available bandwidth. Some salient features of the audio specification are described below.
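Because the 1 ms frame rate does not divide typical audio sampling rates evenly, an isochronous audio source must vary its packet size from frame to frame. A sketch of the usual accumulator approach (hypothetical helper name):

```python
def samples_per_frame(fs_hz, n_frames):
    """Number of samples to place in each successive 1 ms USB frame so
    that exactly fs_hz samples are sent per second on average."""
    counts, acc = [], 0
    for _ in range(n_frames):
        acc += fs_hz
        n, acc = divmod(acc, 1000)   # 1000 frames per second
        counts.append(n)
    return counts
```

At 44.1 kHz, for example, nine frames of 44 samples are followed by one of 45, averaging 44.1 samples per frame; at 48 kHz every frame carries exactly 48.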

5.3.2 Audio over USB

The way in which audio is handled on USB is well defined and somewhat more clearly explained than the 1394 audio/music protocol [6]. It defines three types of communication: audio control, audio streaming and MIDI streaming. We are concerned primarily with audio streaming applications.

Audio data transmissions fall into one of three types. Type 1 transmissions consist of channel-ordered PCM samples in consecutive subframes, whilst Type 2 transmissions typically contain non-PCM audio data that does not preserve a particular channel order in the bitstream, such as certain types of multichannel data-reduced audio stream. Type 3 transmissions are a hybrid of the two such that non-PCM data is packed into pseudo-stereo data words in order that clock recovery can be made easier. This method is in fact very much the same as the way data-reduced audio is packed into audio subframes within the IEC 61937 format described in Chapter 4, and follows much the same rules.

Audio samples are transferred in subframes, each of which can be one to four bytes long (up to 24 bits resolution). An audio frame consists of one or more subframes, each of which represents a sample of a different channel in the cluster (see below). As with 1394, a USB packet can contain a number of frames in succession, each containing a cluster of subframes. Frames are described by a format descriptor header that contains a number of bytes describing the audio data type, number of channels, subframe size, as well as information about the sampling frequency and the way it is controlled (for Type 1 data). An example of a simple audio frame would be one containing only two subframes of 24-bit resolution for stereo audio.

Audio of a number of different types can be transferred in Type 1 transmissions, including PCM audio (two’s complement, fixed point), PCM-8 format (compatible with original eight-bit WAV, unsigned, fixed point), IEEE floating point, A-law and µ-law (companded audio corresponding to relatively old telephony standards). Type 2 transmissions typically contain data-reduced audio signals such as MPEG or AC-3 streams. Here the data stream contains an encoded representation of a number of channels of audio, formed into encoded audio frames that relate to a large number of original audio samples. An MPEG encoded frame, for example, will typically be longer than a USB packet (a typical MPEG frame might be 8 or 24 ms long), so it is broken up into smaller packets for transmission over USB rather like the way it is streamed over the IEC 60958 interface described in Chapter 4. The primary rule is that no USB packet should contain data for more than one encoded audio frame, so a new encoded frame should always be started in a new packet. The format descriptor for Type 2 is similar to Type 1 except that it replaces subframe size and number of channels indication with maximum bit rate and number of audio samples per encoded frame. Currently only MPEG and AC-3 audio are defined for Type 2.
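The primary rule for Type 2 transfers — no USB packet carries data for more than one encoded frame — can be sketched as follows. The helper name is hypothetical, and max_payload stands for whatever packet size the endpoint supports:

```python
def packetize_type2(encoded_frames, max_payload):
    """Split encoded audio frames (byte strings) into USB packets such
    that a new encoded frame always starts in a new packet."""
    packets = []
    for frame in encoded_frames:
        # Each frame is segmented independently, so no packet can span
        # the boundary between two encoded frames.
        for i in range(0, len(frame), max_payload):
            packets.append(frame[i:i + max_payload])
    return packets
```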

Rather like the compound data blocks possible in 1394 (see above), audio data for closely related synchronous channels can be clustered for USB transmission in Type 1 format. Up to 254 streams can be clustered and there are 12 defined spatial positions for reproduction, to simplify the relationship between channels and the loudspeaker locations to which they relate. (This is something of a simplification of the potentially complicated formatting of spatial audio signals and assumes that channels are tied to loudspeaker locations, but it is potentially useful.) The first six defined streams follow the internationally standardized order of surround sound channels for 5.1 surround, that is left, right, centre, LFE (low frequency effects), left surround, right surround. Subsequent streams are allocated to other loudspeaker locations around a notional listener. Not all the spatial location streams have to be present but they are supposed to be presented in the defined order. Clusters are defined in a descriptor field that includes ‘bNrChannels’ (specifying how many logical audio channels are present in the cluster) and ‘wChannelConfig’ (a bit field that indicates which spatial locations are present in the cluster). If the relevant bit is set then the relevant location is present in the cluster. The bit allocations are shown in Table 5.1.

Table 5.1 Channel identification in USB audio cluster descriptor

Data bit    Spatial location
D0          Left Front (L)
D1          Right Front (R)
D2          Center Front (C)
D3          Low Frequency Enhancement (LFE)
D4          Left Surround (LS)
D5          Right Surround (RS)
D6          Left of Center (LC)
D7          Right of Center (RC)
D8          Surround (S)
D9          Side Left (SL)
D10         Side Right (SR)
D11         Top (T)
D15..12     Reserved
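Table 5.1 can be applied directly: decoding ‘wChannelConfig’ is a matter of testing each bit in turn. A sketch:

```python
SPATIAL_LOCATIONS = [            # bit D0 upwards, per Table 5.1
    "Left Front (L)", "Right Front (R)", "Center Front (C)",
    "Low Frequency Enhancement (LFE)", "Left Surround (LS)",
    "Right Surround (RS)", "Left of Center (LC)", "Right of Center (RC)",
    "Surround (S)", "Side Left (SL)", "Side Right (SR)", "Top (T)",
]

def decode_channel_config(wChannelConfig):
    """List the spatial locations present in a cluster descriptor,
    in the defined (bit) order."""
    return [name for bit, name in enumerate(SPATIAL_LOCATIONS)
            if wChannelConfig & (1 << bit)]
```

A value of 0x3F, for instance, yields the six 5.1 surround channels in the internationally standardized order described above.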

5.3.3 Clock Synchronization

Audio devices transferring signals over USB can have sample clocks that are either asynchronous with the USB data transfer, or that are locked in some way to the USB start-of-frame (SOF) identifier (that occurs every 1 ms). Asynchronous devices would typically use free-running or externally synchronized audio clocks, whereas synchronous devices would either have a means of locking their sample clocks to the 1 ms SOF point or (perhaps unusually) have a means of controlling the USB clock rate so that it became locked to the audio sampling frequency. It is up to host applications to ensure that groups of audio channels that belong together and are supposed to be sample-aligned are kept so through any buffering that is employed. Buffering of at least one frame is normally required at the receiver and the management and reporting of delays is an inherent feature of the recommendations.

5.4  AES47: Audio over ATM

Asynchronous transfer mode (ATM) is a protocol for data transmission over high speed data networks that operates in a switched fashion and can extend over wide or metropolitan areas. It typically operates over SONET (synchronous optical network) or SDH (synchronous digital hierarchy) networks, depending on the region of the world. Switched networks involve the setting up of specific connections between a transmitter and one or more receivers, rather like a dialled telephone network (indeed this is the infrastructure of the digital telephone network). Data packets on ATM networks, known as cells, carry a fixed 48-octet payload preceded by a five-byte header that identifies the virtual channel of the cell.

AES47 [7] defines a method by which linear PCM data, either conforming to AES3 format or not, can be transferred over ATM. There are various arguments for doing this, not least being the increasing use of such networks for data communications within the broadcasting industry and the need to route audio signals over longer distances than possible using standard digital interfaces. There is also a need for low latency, guaranteed bandwidth and switched circuits, all of which are features of ATM. Essentially an ATM connection is established in a similar way to making a telephone call. A SETUP message is sent at the start of a new ‘call’ that describes the nature of the data to be transmitted and defines its vital statistics. The AES47 standard describes a specific professional audio implementation of this procedure that includes information about the audio signal and the structure of audio frames in the SETUP at the beginning of the call.

For some reason bytes are termed octets in ATM terminology, so this section will follow that convention. Audio data is divided into subframes and each subframe contains a sample of audio as well as optional ancillary data and protocol overhead data, as shown in Figure 5.7. The setup message at the start of the call determines the audio mode and whether or not this additional data is present. The subframe should occupy a whole number of octets and the length of the audio sample should be such that the subframe is 8, 16, 24, 32 or 48 bits long. The ancillary data field, if it is present, is normally used for carrying the VUC bits from the AES3 subframe, along with a B bit to replace the P (parity) bit of the AES3 subframe (which has little relevance in this new application). The B bit in the ‘1’ state indicates the start of an AES3 channel status block, taking the place of the Z preamble that is no longer present. This data is transmitted in the order BCUV.

Figure 5.7 General audio subframe format of AES47.
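The ancillary field handling described above can be sketched as follows. The helper name is hypothetical, and the bit-packing order shown is an assumption based on the stated BCUV transmission order:

```python
def aes47_ancillary(block_start, c, u, v):
    """Assemble the four ancillary bits for an AES47 subframe.
    B (block_start) replaces the AES3 parity bit and, when 1, marks the
    start of a channel status block in place of the absent Z preamble.
    C, U and V are carried over from the AES3 subframe."""
    b = 1 if block_start else 0
    return (b << 3) | (c << 2) | (u << 1) | v   # order B, C, U, V
```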

The protocol overhead bits, if present, consist of a sequencing bit followed by three data protection bits (used for error checking). These sequencing bits are assembled from all the subframes in an ATM cell, rather as channel status bits are assembled from successive AES3 subframes, to form a sequencing word. The first four bits of this form the sequencing number, the point of which is to act as an incremented count of ATM cells since the start of the call. Bits 5–7 act as protection bits for the sequencing word, bit 8 is even parity for the first eight bits, and bits 9–12 (if present) can form a second sequencing number that can be used to align samples from multiple virtual circuits carrying nominally time-aligned signals (see Figure 5.8).

Figure 5.8 Components of the sequencing word in AES47.
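The parity relationship described for the sequencing word can be sketched as below (hypothetical helper name; the sketch assumes bit 8 is chosen so that the first eight bits together contain an even number of ones):

```python
def sequencing_parity(first_seven_bits):
    """Choose bit 8 of the sequencing word (even parity) so that the
    first eight bits contain an even number of ones."""
    return sum(first_seven_bits) % 2
```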

Samples are packed into the ATM cell either ordered in time, in multichannel groups or by channel, as shown in Figure 5.9. Only certain combinations of channels and data formats are allowed and all the channels within the stream have to have the same resolution and sampling frequency, as shown in Table 5.2.

Figure 5.9 Packing of audio subframes into ATM cells. (a) Example of temporal ordering with two channels, left and right. ‘a’, ‘b’, ‘c’, etc., are successive samples in time for each channel. Co-temporal samples are grouped together. (b) Example of multichannel packing whereby concurrent samples from a number of channels are arranged sequentially. (c) Example of ordering by channel, with a number of samples from the same channel being grouped together. (If the number of channels is the same as the number of samples per cell, all three methods turn out to be identical.)

Table 5.2 Audio packing within ATM cells – options in AES47

* This should be signalled within the second and third octets of the user-defined AAL part of the SETUP message that is an optional part of the ATM protocol for setting up calls between sources and destinations.

Four octets in the user-defined AAL part of the SETUP message that begins a new ATM call define aspects of the audio communication that will take place. The first byte contains so-called ‘qualifying information’, only bit 4 of which is currently specified, indicating that the sampling frequency is locked to some global reference. The second byte indicates the subframe format and sample length, whilst the third byte specifies the packing format. The fourth byte contains information about the audio sampling frequency (32, 44.1 or 48 kHz), its scaling factor (from 0.25 up to 8 times) and multiplication factor (e.g. 1/1.001 or 1.001/1 for ‘pull-down’ or ‘pull-up’ modes). It also has limited information for varispeed rates.

There is provision within the standard for the sender to include a local clock that ticks once per second. It is expected that cells will be blocked such that a block consists of either eight cells or eight sets of samples. The User Indication (UI) bit in the cell header should be set to 1 in the first and last cells of the first block following a clock tick. This can be used to derive a pulse train related to the sampling frequency.

5.5  ISDN

ISDN, the Integrated Services Digital Network, is an extension of the digital telephone network to the consumer, providing two 64 kbit/s digital channels that can be connected to ISDN terminals anywhere in the world by dialling. Since the total usable capacity of a basic rate ISDN connection is only 128 kbit/s it is not possible to carry linear PCM data at normal audio resolutions over such a link, but it is possible to carry moderately high quality stereo audio at this rate using a data reduction system such as MPEG Layer 3 [8], or to achieve higher rates by combining more than one ISDN link to obtain data rates of, say, 256 or 384 kbit/s [9]. ‘Broadband ISDN’, on the other hand, is capable of much higher data rates than a simple ‘B’ channel connection and is in fact another name for ATM networking, described in the previous section.
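The arithmetic here is straightforward: each basic rate connection offers two 64 kbit/s B channels, so reaching a target bit rate is a matter of ceiling division (hypothetical helper name):

```python
def isdn_links_needed(target_kbps, channels_per_link=2, b_rate_kbps=64):
    """Number of basic rate ISDN links needed to carry a target bit rate."""
    link_kbps = channels_per_link * b_rate_kbps   # 128 kbit/s per link
    return -(-target_kbps // link_kbps)           # ceiling division
```

A 384 kbit/s stereo link of the kind mentioned above would therefore tie up three dialled connections.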

5.6  CobraNet

CobraNet is a proprietary audio networking technology developed by Peak Audio, a division of Cirrus Logic. It is designed for carrying audio over conventional Fast Ethernet networks (typically 100 Mbit/s), preferably using a dedicated Ethernet for audio purposes or using a switched Ethernet network. Switched Ethernet acts more like a telephone or ATM network where connections are established between specific sources and destinations, with no other data sharing that ‘pipe’. For the reasons stated earlier in this chapter, Ethernet is not ideal for audio communications without some provisos being observed. CobraNet, however, implements a method of arbitration, bandwidth reservation and an isochronous transport protocol that enables it to be used successfully.

It is claimed that conventional data communications and CobraNet applications can coexist on the same physical network but the system implements new arbitration rules, in the form of a so-called ‘O-persistent’ layer within the data link layer, to ensure that collisions do not take place on the network. All devices have to be able to abide by these rules if they are to be used. The company provides software that can be used to verify that an existing network design is capable of handling the audio information intended (in respect of bandwidth, delay and other critical parameters).

CobraNet can also be used for sample clock distribution and for equipment control purposes (interfacing with RS-232 and RS-485 equipment) and it is becoming popular in venue or live sound installations for transferring audio between multiple locations. It requires a dedicated CobraNet interface to convert audio and control data streams to and from the relevant Ethernet protocol and claims a low audio latency of 5.3 ms. A number of items of audio equipment are already equipped with CobraNet technology, such as microphone preamplifiers, power amplifiers and routers. Users claim considerable benefits in being able to integrate audio transport and equipment control/monitoring using a single network interface.

The CobraNet protocol has been allocated its own protocol identifier at the Data Link Layer of the OSI seven-layer network model, so it does not use Internet Protocol (IP) for data transport (IP is typically inefficient for audio streaming purposes and involves too much overhead). Because it does not use IP it is not particularly suitable for wide area network (WAN) operation and would typically be operated over a local area network (LAN). It does, however, enable devices to be allocated IP addresses using the BOOTP (boot protocol) process, and it supports the use of IP and UDP (user datagram protocol) for purposes other than the carrying of audio. It transmits packets in isochronous cycles, each packet transferring data for a ‘bundle’ of audio channels. Each bundle contains between zero and eight audio channels and can be either unicast or multicast. Unicast bundles are intended for a single destination, whereas multicast bundles are ‘broadcast’ transmissions in which a sending device transmits packets regardless of whether any receiving device has subscribed to receive them.
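The bundle concept can be illustrated with a short sketch. This is a hypothetical model only, assuming a simple container with a channel limit and a unicast/multicast flag; the names and structure are illustrative and are not taken from the CobraNet specification.

```python
# Hypothetical sketch of a CobraNet-style audio 'bundle' (illustrative
# names; not the actual CobraNet packet format).
from dataclasses import dataclass, field
from typing import List

MAX_CHANNELS_PER_BUNDLE = 8  # a bundle carries between zero and eight channels


@dataclass
class Bundle:
    bundle_id: int
    multicast: bool  # multicast bundles are sent regardless of subscribers
    channels: List[bytes] = field(default_factory=list)  # one entry per channel

    def add_channel(self, samples: bytes) -> None:
        """Append one channel's worth of sample data to the bundle."""
        if len(self.channels) >= MAX_CHANNELS_PER_BUNDLE:
            raise ValueError("a bundle carries at most eight audio channels")
        self.channels.append(samples)


# A unicast bundle is addressed to a single destination.
b = Bundle(bundle_id=256, multicast=False)
b.add_channel(b"\x00" * 64)
print(len(b.channels))  # 1
```

The eight-channel ceiling is the only hard constraint modelled here; a real implementation would also carry sequencing and format information in each packet.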

Rather like 1394 (see above), isochronous cycles are initiated by one bus-controlling device (the ‘conductor’), which sends a multicast packet to indicate the start of a cycle to all other devices (‘performers’). In CobraNet terminology this is called the ‘beat packet’. The beat packet is a form of clock for the network and also carries information about overall network operation, so it is sensitive to delays and must be maintained within a very narrow time window (250 μs) if it is to be used for sample clock locking. Beat packets are typically small (100 bytes), whereas audio data packets are typically much larger (e.g. 1000 bytes). The CobraNet interface derives a sample clock from the network clock by means of a VCXO (voltage-controlled crystal oscillator) circuit.
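The timing constraint can be made concrete with a simple check: if beat packets are to serve as a clock reference, the interval between successive arrivals must not deviate from the nominal cycle period by more than the stated window. The nominal period used below is purely illustrative and is not taken from the CobraNet specification.

```python
# Illustrative check that beat-packet arrivals stay within the stated
# 250 microsecond delay window (nominal period here is an assumption).
BEAT_WINDOW_US = 250.0


def beat_jitter_ok(arrival_times_us, nominal_period_us):
    """Return True if every inter-beat interval deviates from the
    nominal period by less than the permitted window."""
    for earlier, later in zip(arrival_times_us, arrival_times_us[1:]):
        if abs((later - earlier) - nominal_period_us) >= BEAT_WINDOW_US:
            return False
    return True


# Small jitter around an assumed 1333 us cycle: acceptable.
print(beat_jitter_ok([0.0, 1340.0, 2660.0, 4000.0], 1333.0))  # True
# A beat arriving 400 us late breaks the window.
print(beat_jitter_ok([0.0, 1733.0], 1333.0))  # False
```

A receiving interface's VCXO would, in effect, low-pass filter such arrival times to produce a stable sample clock rather than rejecting packets outright.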

5.7  MAGIC

MAGIC10 (Media-accelerated Global Information Carrier) was developed by the Gibson Guitar Corporation and originally went under the name GMICS. It is a relatively recent audio interface that typically uses the Ethernet physical layer for transporting audio between devices, although it is not compatible with the higher layers and does not appear to interoperate with conventional Ethernet data networks. It defines its own application and data link layers; the data link layer is based on the Ethernet 802.3 data link layer and uses a frame header that would be recognized by 802.3-compatible devices.

Although the technology is not limited to it, the described implementation uses 100 Mbit/s Fast Ethernet over standard CAT 5 cables, using four of the wires in a conventional Ethernet crossover configuration and the other four to supply power (9 V, 500 mA) to devices capable of operating on limited power. Data is formed into frames of 55 bytes, including the relevant headers, and transmitted at a synchronous rate between devices. The frame rate is related to the audio sampling rate, and a sampling clock can be recovered from the interface. A very low latency of 10–40 μs is claimed. MAGIC data can be daisy-chained between devices in a form more akin to point-to-point audio interfacing than computer networking, although routing and switching configurations are also possible using routing or switching hubs.
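A rough calculation shows why such a synchronous frame stream fits comfortably within Fast Ethernet. This sketch assumes, hypothetically, one 55-byte frame per sample period; the MAGIC specification may define a different frames-per-sample relationship.

```python
# Back-of-envelope wire rate for a synchronous stream of 55-byte frames,
# assuming (hypothetically) one frame per audio sample period.
FRAME_BYTES = 55


def wire_rate_mbit(sample_rate_hz: int, frames_per_sample: int = 1) -> float:
    """Raw bit rate of the frame stream in Mbit/s."""
    return FRAME_BYTES * 8 * sample_rate_hz * frames_per_sample / 1e6


print(round(wire_rate_mbit(48000), 2))  # 21.12
print(round(wire_rate_mbit(96000), 2))  # 42.24
```

Even at a 96 kHz sampling rate the stream would occupy well under half of the 100 Mbit/s physical capacity under this assumption.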

5.8  MOST

MOST (media-oriented synchronous transfer) is described by Heck et al.11 as an alternative network protocol designed for carrying synchronous, asynchronous and control data over a low-cost optical fibre network. It is claimed that the technology sits between USB and IEEE 1394 in terms of performance and that MOST has certain advantages in the transfer of synchronous data produced by multimedia devices that are not well catered for in other protocols. It is stated that interfaces based on copper connections are prone to electromagnetic interference and that the optical fibre interface of this system provides immunity to such interference, in addition to allowing distances of up to 250 m between nodes.

MOST specifies the physical, data link and network layers of the OSI reference model for data networks, and dedicated silicon has been developed for the physical layer. Data is transferred in 64-byte frames and the frame rate depends on the sampling rate in use by the connected devices, giving a data rate of 22.5 Mbit/s at a 44.1 kHz audio sampling rate. The bandwidth can be divided between synchronous and asynchronous data. Potential applications described include professional audio (transferring up to 15 stereo 16-bit audio channels or ten stereo channels of 24-bit audio), consumer electronics (as an alternative to SPDIF at similar cost), and automotive and home multimedia networking.
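The quoted figures can be verified directly: a 64-byte frame sent once per sample period at 44.1 kHz gives the stated gross rate, and the two quoted channel counts turn out to consume identical payload bandwidth.

```python
# Checking the MOST figures quoted in the text.
FRAME_BYTES = 64
FS = 44_100  # audio sampling rate, Hz

# One 64-byte frame per sample period:
gross_rate = FRAME_BYTES * 8 * FS / 1e6
print(round(gross_rate, 2))  # 22.58 Mbit/s, quoted as 22.5 Mbit/s

# 15 stereo 16-bit channels vs. ten stereo 24-bit channels:
rate_16 = 15 * 2 * 16 * FS / 1e6
rate_24 = 10 * 2 * 24 * FS / 1e6
print(round(rate_16, 3), round(rate_24, 3))  # 21.168 21.168
```

Both channel configurations amount to 480 bits per frame of audio payload, which is consistent with the remainder of each 512-bit frame being reserved for control and asynchronous data.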

There is now a detailed specification framework for MOST12 and it is the subject of a co-operation agreement between a number of manufacturers. It seems to have been most widely adopted in the automotive industry where it is close to being endorsed by a consortium of car makers.

5.9  BSS SoundWeb

BSS developed its own audio network interface for use with its SoundWeb products, which are typically used in large venue installations and for live sound. It uses CAT 5 cabling over distances of up to 300 m but is not based on Ethernet, behaving more like a Token Ring network. Data is carried at a rate of about 12 Mbit/s, transporting eight audio channels along with control information.

5.10  Digital Content Protection

Digital content protection is rather like a more sophisticated version of SCMS, the serial copy management system described in Chapter 4. Copy protection of digital content is increasingly required by the owners of intellectual property, and data encryption is now regarded as the most appropriate way of securing such content against unwanted copying. The SCMS method used for copy protection on older interfaces such as IEC 60958 employed two bits plus category codes to indicate the copy permission status of content, but no further attempt was made to make the audio content unreadable or to scramble it in the case of non-permitted transfers. A group of manufacturers known as 5C (the five companies Hitachi, Intel, Matsushita, Sony and Toshiba) has now defined a method of digital content protection that is initially specified for IEEE 1394 transfers13 (see section 5.2) but is likely to be extended to other means of interconnection between equipment. The specification is written in a relatively generic way, although the packet header descriptions currently refer directly to 1394 implementations. The 1394 interface is increasingly used on high-end consumer digital products for content transfer, although it has not yet been seen much on DVD and SACD players because the encryption model has only recently been agreed. There has also been the issue of content watermarking to resolve.

Content protection is managed in this model both by means of embedded copy control information (CCI) and by using two bits in the header of isochronous data packets (the so-called EMI, or encryption mode indicator, bits). Embedded CCI is that contained within the application-specific data stream itself; in other words, it could be the SCMS bits in the channel status of IEC 60958 data or the copy control information in an MPEG transport stream. It can only be accessed once a receiving device has decrypted the data transmitted to it. So that devices can inspect the copy status of a stream without decrypting the data, the packet header containing the EMI bits is not encrypted. Two EMI bits allow four copy states to be indicated, as shown in Table 5.3.

Table 5.3 Copy state indication in EMI bits of 1394 header

EMI bit states   Copy state                      Authentication required
11               Copy never (Mode A)             Full
10               Copy one generation (Mode B)    Restricted or full
01               No more copies (Mode C)         Restricted or full
00               Copy free(ly) (Mode D)          None (not encrypted)
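Because the EMI bits sit in an unencrypted part of the packet header, a device can map them to a copy state without decrypting the stream. The sketch below simply encodes Table 5.3 as a lookup; the exact bit positions within the 1394 isochronous packet header are defined in the 5C specification and are not reproduced here.

```python
# Decoding the two EMI bits according to Table 5.3.
EMI_STATES = {
    0b11: ("Copy never (Mode A)", "Full"),
    0b10: ("Copy one generation (Mode B)", "Restricted or full"),
    0b01: ("No more copies (Mode C)", "Restricted or full"),
    0b00: ("Copy free(ly) (Mode D)", "None (not encrypted)"),
}


def decode_emi(emi_bits: int):
    """Return (copy state, authentication required) for a 2-bit EMI value."""
    if not 0 <= emi_bits <= 3:
        raise ValueError("EMI field is two bits")
    return EMI_STATES[emi_bits]


print(decode_emi(0b10)[0])  # Copy one generation (Mode B)
```

Note that only the ‘copy free’ state implies an unencrypted stream; the other three states oblige the receiver to authenticate before it can decrypt the content and read the embedded CCI.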

The authentication requirement indicated by the copy state initiates a negotiation between source and receiver that sets up an encrypted transfer using an exchanged key. The full details are beyond the scope of this book and require an advanced understanding of cryptography, but it is sufficient to say that full authentication involves more advanced cryptographic techniques than restricted authentication (which is intended for implementation on equipment with limited computational resources, or where copy protection is not a major concern). The process is explained in detail in the specification document13. If successful, the negotiation enables an encrypted transfer between the two devices, after which embedded CCI can be accessed from within the content stream.

When there is a conflict between embedded CCI and EMI indications during a stream (for example, when different songs on a CD carry different CCI but the EMI setting remains constant throughout), it is recommended that the EMI setting be the most strict of those that will be encountered in the transfer concerned. The embedded CCI, however, has the final say in deciding whether the receiving device may record the stream: even if EMI indicates ‘copy never’, the receiving device may still record material whose embedded CCI indicates that it is recordable. This ensures that the stream is as secure as it needs to be, and the transfer properly authenticated, before the receiving device makes any decisions about specific items within the stream.

Certain AM824 audio applications (a specific form of 1394 Audio/Music Protocol interchange) have defined relationships between copy states and SCMS states, for easy translation when carrying data such as IEC 60958 data over 1394. In this particular case the EMI ‘copy never’ state is not used and SCMS states are mapped onto the three remaining EMI states. For DVD applications the application-specific CCI is indicated in ancillary data, and a mapping table specifies the various relationships between this data and the indicated copy states. The mapping depends to some extent on the quality of the transmitted data and whether or not it matches that indicated in the audio_quality field of ancillary data. (Typically, DVD players have allowed single-generation home copies of audio material over IEC 60958 interfaces at basic sampling rates, e.g. 48 kHz, but not at very high quality rates such as 96 kHz or 192 kHz.) Super Audio CD applications currently have only one copy state defined, ‘no more copies’, presumably to prevent anyone duplicating the one-bit stream, which would have the same quality as the master recording.

References

1.  Bailey, A., Network Technology for Digital Audio. Focal Press (2001)

2.  IEEE, IEEE 1394: Standard for a high performance serial bus (1995)

3.  1394 Trade Association, TA Document 2001003: Audio and Music Data Transmission Protocol 2.0 (2001)

4.  IEC, IEC/PAS 61883–6. Consumer audio/video equipment – Digital interface – Part 6: Audio and music data transmission protocol (1998)

5.  USB, Universal serial bus, Revision 2.0 specification. Available from http://www.usb.org/developers/docs.html (2000)

6.  USB, Universal serial bus: device class definition for audio devices, v1.0 (1998)

7.  AES, AES47-2002: Transmission of digital audio over asynchronous transfer mode networks (2002)

8.  Brandenburg, K. and Stoll, G., The ISO/MPEG audio codec: a generic standard for coding of high quality digital audio. Presented at the 92nd AES Convention, Vienna, Austria, 24–27 March, preprint 3336 (1992)

9.  Burkhardtsmaier, B. et al., The ISDN MusicTAXI. Presented at the 92nd AES Convention, Vienna, Austria, 24–27 March, preprint 3344 (1992)

10.  Gibson Guitar Corporation, Media-accelerated Global Information Carrier. Engineering Specification Version 2.4. Available from http://www.gibsonmagic.com (2002)

11.  Heck, P. et al., Media oriented synchronous transfer: a network protocol for high quality, low cost transfer of synchronous, asynchronous and control data on fiber optic. Presented at the 103rd AES Convention, New York, 26–29 September, preprint 4551 (1997)

12.  Oasis Technology, MOST Specification Framework v1.1. Available from http://www.oasis.com/technology/index.htm (1999)

13.  5C, Digital transmission content protection specification, Volume 1 (informational version). Revision 1.2 (2001)
