CHAPTER 9

Digital Recording, Editing and
Mastering Systems

CHAPTER CONTENTS

Digital Tape Recording

Background to digital tape recording

Channel coding for dedicated tape formats

Error correction

Digital tape formats

Editing digital tape recordings

Mass Storage-based Systems

Magnetic hard disks

Optical discs

Memory cards

Recording audio on to mass storage media

Media formatting

Audio Processing for Computer Workstations

Introduction

DSP cards

Host-based audio processing

Integrated sound cards

Mass Storage-based Editing System Principles

Introduction

Sound files and sound segments

Edit point handling

Crossfading

Editing modes

Simulation of ‘reel-rocking’

Editing Software

Mastering and Restoration

Software

Level control in mastering

Preparing for and Understanding Release Media

This chapter describes digital audio recording systems and the principles of digital audio editing and mastering.

DIGITAL TAPE RECORDING

Although it is still possible to find examples of dedicated digital tape recording formats in use, they have largely been superseded by recording systems that use computer mass storage media. The economies of scale of the computer industry have made data storage relatively cheap and there is no longer a strong justification for systems dedicated to audio purposes. Tape has a relatively slow access time, because it is a linear storage medium. However, a dedicated tape format can easily be interchanged between recorders, provided that another machine operating to the same standard can be found. Computer mass storage media, on the other hand, come in a very wide variety of sizes and formats, and there are numerous levels at which compatibility must exist between systems before interchange can take place. This matter is discussed in the next chapter.

Background to digital tape recording

When commercial digital audio recording systems were first introduced in the 1970s and early 1980s it was necessary to employ recorders with sufficient bandwidth for the high data rates involved (a machine capable of handling bandwidths of a few megahertz was required). Analog audio tape recorders were out of the question because their bandwidths extended only up to around 35 kHz at best, so video tape recorders (VTRs) were often utilized because of their wide recording bandwidth. PCM adaptors converted digital audio data into a waveform which resembled a television waveform, suitable for recording on to a VTR. The Denon company of Japan developed such a system in partnership with the NHK broadcasting organization and they released the world’s first PCM recording onto LP in 1971. In the early 1980s, devices such as Sony’s PCM-F1 became available at modest prices, allowing 16 bit, 44.1 kHz digital audio to be recorded on to a consumer VTR, resulting in widespread proliferation of stereo digital recording. Dedicated open-reel digital recorders using stationary heads were also developed (see Fact File 9.1). High-density tape formulations were then manufactured for digital use, and this, combined with new channel codes (see below), improvements in error correction and better head design, led to the use of a relatively low number of tracks per channel, or even single-track recording of a given digital signal, combined with playing speeds of 15 or 30 inches per second.

FACT FILE 9.1 ROTARY AND STATIONARY HEADS

There are two fundamental mechanisms for the recording of digital audio on tape, one which uses a relatively low linear tape speed and a quickly rotating head, and one which uses a fast linear tape speed and a stationary head. In the rotary-head system the head either describes tracks almost perpendicular to the direction of tape travel, or it describes tracks which are almost in the same plane as the tape travel. The former is known as transverse scanning, and the latter is known as helical scanning, as shown in (a). Transverse scanning uses more tape when compared with helical scanning. It is not common for digital tape recording to use the transverse scanning method. The reason for using a rotary head is to achieve a high head-to-tape speed, since it is this which governs the available bandwidth. Rotary-head recordings cannot easily be splice-edited because of the track pattern, but they can be electronically edited using at least two machines.

Stationary heads allow the design of tape machines that are very similar in many respects to analog transports. With stationary-head recording it is possible to record a number of narrow tracks in parallel across the width of the tape, as shown in (b). Tape speed can be traded off against the number of parallel tracks used for each audio channel, since the required data rate can be made up by a combination of recordings made on separate tracks. This approach was used in the DASH format, where the tape speed could be 30 ips (76 cm/s) using one track per channel, 15 ips using two tracks per channel, or 7.5 ips using four tracks per channel.

image

Digital recording tape is thinner (27.5 microns) than that used for analog recordings, so long playing times can be accommodated on a reel; the thin tape also contacts the machine’s heads more intimately than standard 50 micron tape, which tends to be stiffer. Intimate contact is essential for reliable recording and replay of such a densely packed, high bandwidth signal.

Channel coding for dedicated tape formats

Since ‘raw’ binary data is normally unsuitable for recording directly by dedicated digital recording systems, a ‘channel code’ is used which matches the data to the characteristics of the recording system, uses storage space efficiently, and makes the data easy to recover on replay. A wide range of channel codes exists, each with characteristics designed for a specific purpose.

The channel code converts a pattern of binary data into a different pattern of transitions in the recording or transmission medium. It is another stage of modulation, in effect. Thus the pattern of bumps in the optical surface of a CD bears little resemblance to the original audio data, and the pattern of magnetic flux transitions on a DAT cassette would be similarly different. Given the correct code book, one could work out what audio data was represented by a given pattern from either of these systems.

Many channel codes are designed for a low DC content (in other words, the data is coded so as to spend, on average, half of the time in one state and half in the other) in cases where signals must be coupled by transformers (see ‘Transformers’, Chapter 12), and others may be designed for narrow bandwidth or a limited high-frequency content. Certain codes are designed specifically for very high-density recording, and may have a low clock content with the possibility for long runs in one binary state or the other without a transition. Channel coding involves the incorporation of the data to be recorded with a clock signal, such that there is a sufficient clock content to allow the data and clock to be recovered on replay (see Fact File 9.2). Channel codes vary as to their robustness in the face of distortion, noise and timing errors in the recording channel.

Some examples of channel codes used in audio systems are shown in Figure 9.1. FM is the simplest, being an example of binary frequency modulation. It is otherwise known as ‘bi-phase mark’, one of the Manchester codes, and is the channel code used by SMPTE/EBU timecode (see Chapter 15). MFM and Miller-squared are more efficient in terms of recording density. MFM is more efficient than FM because it eliminates the transitions between successive ones, only leaving them between successive zeros. Miller-squared eliminates the DC content present in MFM by removing the transition for the last one in an even number of successive ones.
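The difference in transition density between these codes can be illustrated with a short sketch. The following Python fragment is a simplified model (one level value per half bit cell, initial level chosen arbitrarily) that encodes the same data in FM (bi-phase mark) and MFM and counts the resulting transitions:

```python
def transitions(levels):
    # count level changes in a channel waveform
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)

def biphase_mark(bits):
    # FM (bi-phase mark): a transition at every bit-cell boundary,
    # plus an extra mid-cell transition for each 1
    level, out = 1, []
    for b in bits:
        level = -level              # boundary transition, every cell
        out.append(level)
        if b:
            level = -level          # mid-cell transition encodes a 1
        out.append(level)
    return out

def mfm(bits):
    # MFM: a mid-cell transition for each 1; a boundary transition
    # only between two successive 0s
    level, out, prev = 1, [], 1
    for b in bits:
        if b == 0 and prev == 0:
            level = -level          # boundary transition between 0s
        out.append(level)
        if b:
            level = -level          # mid-cell transition for a 1
        out.append(level)
        prev = b
    return out

data = [1, 0, 1, 1, 0, 0, 1, 0]
print(transitions(biphase_mark(data)))   # 11
print(transitions(mfm(data)))            # 5 - fewer transitions, denser recording
```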

Group codes, such as that used in the Compact Disc and R-DAT, involve the coding of patterns of bits from the original audio data into new codes with more suitable characteristics, using a look-up table or ‘code book’ to keep track of the relationship between recorded and original codes. This has clear parallels with coding as used in intelligence operations, in which the recipient of a message requires the code book to be able to understand the message. CD uses a method known as eight-to-fourteen modulation (EFM), in which 16 bit audio sample words are each split into two 8 bit words, after which a code book is used to generate a new 14 bit word for each of the 256 possible combinations of 8 bits. Since there are many more words possible with 14 bits than with 8, it is possible to choose those which have appropriate characteristics for the CD recording channel. In this case, it is those words which have no more than 11 consecutive bits in the same state, and no fewer than three. This limits the bandwidth of the recorded data, and makes it suitable for the optical pick-up process, whilst retaining the necessary clock content.
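The run-length limits on the 14 bit words are easy to verify in code. The following sketch, which ignores the extra ‘merging’ bits placed between words in the real format, counts the 14 bit patterns that satisfy the constraint (at least two and no more than ten 0s between 1s, each 1 marking a transition in the recorded signal), confirming that more than 256 such words exist:

```python
def valid_channel_word(w):
    s = format(w, '014b')
    if '11' in s or '101' in s:   # 1s closer than three channel bits apart
        return False
    if '0' * 11 in s:             # a run longer than eleven channel bits
        return False
    return True

print(sum(valid_channel_word(w) for w in range(2 ** 14)))
# 267 - comfortably more than the 256 patterns needed for the code book
```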

FACT FILE 9.2 DATA RECOVERY

Channel-coded data must be decoded on replay, but first the audio data must be separated from the clock information which was combined with it before recording. This process is known as data and sync separation, as shown in (a).

It is normal to use a phase-locked loop for the purpose of regenerating the clock signal from the replayed data, as shown in (b), this being based around a voltage-controlled oscillator (VCO) which runs at some multiple of the off-tape clock frequency. A phase comparator compares the relative phases of the divided VCO output and the clock data off tape, producing a voltage proportional to the error which controls the frequency of the VCO. With suitable damping, the phase-locked oscillator will ‘flywheel’ over short losses or irregularities of the off-tape clock.

Recorded data is usually interspersed with synchronizing patterns in order to give the PLL in the data separator a regular reference in the absence of regular clock data from the encoded audio signal, since many channel codes have long runs without a transition. Even if the off-tape data and clock have timing irregularities, such as might manifest themselves as ‘wow’ and ‘flutter’ in analog reproducers (see Chapter 18), these can be removed in digital systems. The erratic data (from tape or disk, for example) is written into a short-term solid state memory (RAM) and read out again a fraction of a second later under control of a crystal clock (which has an exceptionally stable frequency), as shown in (c). Provided that the average rate of input to the buffer is the same as the average rate of output, and the buffer is of sufficient size to soak up short-term irregularities in timing, the buffer will not overflow or become empty.

image
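The buffering principle shown in (c) can be modeled in a few lines. In the following simplified Python sketch, samples arrive in irregular bursts but are read out one per tick of a stable clock; because the average input and output rates match and the buffer is pre-filled, the output stream is continuous and correctly ordered:

```python
from collections import deque
import itertools

source = itertools.count()                   # stands in for the erratic off-tape stream
buf = deque(next(source) for _ in range(8))  # pre-fill so that reads never starve

out = []
for tick in range(200):            # one tick = one period of the stable crystal clock
    if tick % 4 == 0:              # data arrives in bursts, not continuously...
        for _ in range(4):         # ...but at the same average rate as it is read
            buf.append(next(source))
    out.append(buf.popleft())      # steady, timing-corrected output

assert out == list(range(200))     # order preserved, timing irregularities removed
```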

Error correction

There are two stages to the error correction process used in digital tape recording systems. First, the error must be detected, and then it must be corrected. If it cannot be corrected then it must be concealed. In order for the error to be detected it is necessary to build in certain protection mechanisms.

image

FIGURE 9.1 Examples of three channel codes used in digital recording. Miller-squared is the most efficient of those shown since it involves the smallest number of transitions for the given data sequence.

image

FIGURE 9.2 Interleaving is used in digital recording and broadcasting systems to rearrange the original order of samples for storage or transmission. This can have the effect of converting burst errors into random errors when the samples are deinterleaved.

Two principal types of error exist: the burst error and the random error. Burst errors result in the loss of many successive samples and may be due to major momentary signal loss, such as might occur at a tape drop-out, at an instant of impulsive interference such as an electrical spike induced in a cable, or at a piece of dirt on the surface of a CD. Burst error correction capability is usually quoted as the number of consecutive samples which may be corrected perfectly. Random errors result in the loss of single samples in randomly located positions, and are more likely to be the result of noise or poor signal quality. Random error rates are normally quoted as an average rate, for example 1 in 10⁶. Error correction systems must be able to cope with the occurrence of both burst and random errors in close proximity.

Audio data is normally interleaved before recording, which means that the order of samples is shuffled (as shown conceptually in Figure 9.2). Samples that had been adjacent in real time are now separated from each other on the tape. The benefit of this is that a burst error, which destroys consecutive samples on tape, will result in a collection of single-sample errors in between good samples when the data is deinterleaved, allowing for the error to be concealed. A common process, associated with interleaving, is the separation of odd and even samples by a delay. The greater the interleave delay, the longer the burst error that can be handled. A common example of this is found in the DASH tape format (an open-reel digital recording format), and involves delaying odd samples so that they are separated from even samples by 2448 samples, as well as reordering groups of odd and even samples within themselves.
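The principle can be demonstrated with a simple block interleaver. This is only a sketch – real formats such as DASH use the more elaborate odd/even delay and reordering schemes described above – but it shows how a burst becomes a set of isolated errors:

```python
def interleave(samples, depth):
    # block interleaver: neighbours in time end up far apart in storage
    return [s for i in range(depth) for s in samples[i::depth]]

def deinterleave(stored, depth):
    block = len(stored) // depth
    rows = [stored[r * block:(r + 1) * block] for r in range(depth)]
    return [rows[i % depth][i // depth] for i in range(len(stored))]

samples = list(range(32))
stored = interleave(samples, depth=4)

for i in range(8, 12):            # a burst error destroys 4 consecutive stored samples
    stored[i] = None

recovered = deinterleave(stored, depth=4)
print([i for i, s in enumerate(recovered) if s is None])
# -> [1, 5, 9, 13]: isolated single-sample errors, each flanked by good samples
```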

Redundant data is also added before recording. Redundancy, in simple terms, involves the recording of data in more than one form or place. A simple example of the use of redundancy is found in the twin-DASH format, in which all audio data is recorded twice. On a second pair of tracks (handling the duplicated data), the odd–even sequence of data is reversed to become even–odd. First, this results in double protection against errors, and second, it allows for perfect correction at a splice, since two burst errors will be produced by the splice, one in each set of tracks. Because of the reversed odd–even order in the second set of tracks, uncorrupted odd data can be used from one set of tracks, and uncorrupted even data from the other set, obviating the need for interpolation (see Fact File 9.3).

FACT FILE 9.3 ERROR HANDLING

True Correction

Up to a certain random error rate or burst error duration an error correction system will be able to reconstitute erroneous samples perfectly. Such corrected samples are indistinguishable from the originals, and sound quality will not be affected. Such errors are often signaled by green lights showing ‘CRC’ failure or ‘Parity’ failure.

Interpolation

When the error rate exceeds the limits for perfect correction, an error correction system may move to a process involving interpolation between good samples to arrive at a value for a missing sample (as shown in the diagram). The interpolated value is the mathematical average of the foregoing and succeeding samples, which may or may not be correct. This process is also known as concealment or averaging, and the audible effect is not unpleasant, although it will result in a temporary reduction in audio bandwidth. Interpolation is usually signaled by an orange indicator to show that the error condition is fairly serious. In most cases the duration of such concealment is very short, but prolonged bouts of concealment should be viewed warily, since sound quality will be affected. This will usually point to a problem such as dirty heads or a misaligned transport, and action should be taken.

Hold

In extreme cases, where even interpolation is impossible (when there are not two good samples either side of the bad one), a system may ‘hold’. In other words, it will repeat the last correct sample value. The audible effect of this will not be marked in isolated cases, but is still a severe condition. Most systems will not hold for more than a few samples before muting. Hold is normally indicated by a red light.

Mute

When an error correction system is completely overwhelmed it will usually effect a mute on the audio output of the system. The duration of this mute may be varied by the user in some systems. The alternative to muting is to hear the output, regardless of the error. Depending on the severity of the error, it may sound like a small ‘spit’, click, or even a more severe breakup of the sound. In some cases this may be preferable to muting.

image
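The interpolate/hold/mute hierarchy can be sketched as follows. This is a simplified single-channel model in which None marks a sample flagged as bad by the error detection stage:

```python
def conceal(samples):
    out = list(samples)
    for i, s in enumerate(out):
        if s is None:
            prev = out[i - 1] if i > 0 else None
            nxt = samples[i + 1] if i + 1 < len(samples) else None
            if prev is not None and nxt is not None:
                out[i] = (prev + nxt) / 2    # interpolate between good neighbours
            elif prev is not None:
                out[i] = prev                # hold the last correct value
            else:
                out[i] = 0.0                 # mute
    return out

print(conceal([0.0, 0.2, None, 0.6, None, None, 0.4]))
# [0.0, 0.2, 0.4, 0.6, 0.6, 0.5, 0.4]
```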

Cyclic redundancy check (CRC) codes, calculated from the original data and recorded along with that data, are used in many systems to detect the presence and position of errors on replay. Complex mathematical procedures are also used to form codewords from audio data which allow for both burst and random errors to be corrected perfectly up to a given limit. Reed–Solomon encoding is another powerful system which is used to protect digital recordings against errors, but it is beyond the scope of this book to cover these codes in detail.
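The principle of a CRC is easily shown in code. The following sketch uses the common CRC-16-CCITT polynomial as an example – the codes used in real recording formats differ – and demonstrates that a single-bit replay error changes the check value:

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    # bitwise CRC-16-CCITT (polynomial 0x1021)
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

block = bytes(range(32))                            # a block of 'audio' data
stored_crc = crc16_ccitt(block)                     # recorded alongside the block

corrupted = bytes([block[0] ^ 0x01]) + block[1:]    # single-bit replay error
print(crc16_ccitt(block) == stored_crc)             # True : block is clean
print(crc16_ccitt(corrupted) == stored_crc)         # False: error detected
```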

Digital tape formats

There have been a number of commercial recording formats over the last 20 years, and only a brief summary will be given here of the most common.

Sony’s PCM-1610 and PCM-1630 adaptors dominated the CD-mastering market for a number of years, although by today’s standards they used a fairly basic recording format and relied on 60 Hz/525 line U-matic cassette VTRs (Figure 9.3). The system operated at a sampling rate of 44.1 kHz and used 16 bit quantization, being designed specifically for the making of tapes to be turned into CDs. Recordings made in this format could be electronically edited using the Sony DAE3000 editing system, and the playing time of tapes ran up to 75 minutes using a tape specially developed for digital audio use.

The R-DAT or DAT format was a small stereo, rotary-head, cassette-based format offering a range of sampling rates and recording times, including the professional rates of 44.1 and 48 kHz. Originally, consumer machines operated at 48 kHz to avoid the possibility of digital copying of CDs, but professional versions became available which would record at either 44.1 or 48 kHz. Consumer machines would record at 44.1 kHz, but usually only via the digital inputs. DAT was a 16 bit format, but had a non-linearly encoded long-play mode as well, sampled at 32 kHz. Truly professional designs offering editing facilities, external sync and IEC-standard timecode were also developed. The format became exceptionally popular with professionals owing to its low cost, high performance, portability and convenience. Various non-standard modifications were introduced, including a 96 kHz sampling rate machine and adaptors enabling the storage of 20 bit audio on such a high sampling rate machine (sacrificing the high sampling rate for more bits). The IEC timecode standard for R-DAT was devised in 1990. It allowed for SMPTE/EBU timecode of any frame rate to be converted into the internal DAT ‘running-time’ code, and then converted back into any SMPTE/EBU frame rate on replay. A typical machine is pictured in Figure 9.4.

image

FIGURE 9.3 Sony DMR-4000 digital master recorder. (Courtesy of Sony Broadcast and Professional Europe.)

image

FIGURE 9.4 Sony PCM-7030 professional DAT machine. (Courtesy of Sony Broadcast and Professional Europe.)

The Nagra-D recorder (Figure 9.5) was designed as a digital replacement for the world-famous Nagra analog recorders, and as such was intended for professional use in field recording and studios. The format was designed to have considerable commonality with the audio format used in D1- and D2-format digital VTRs, having rotary heads, although it used open reels for operational convenience. Allowing for 20–24 bits of audio resolution, the Nagra-D format was appropriate for use with high-resolution convertors. The error correction and recording density used in this format were designed to make recordings exceptionally robust, and recording time could be up to 6 hours on a 7 inch (18 cm) reel, in two-track mode. The format was also designed for operation in a four-track mode at twice the stereo tape speed, such that in stereo the tape travels at 4.75 cm/s, and in four-track at 9.525 cm/s.

The DASH (Digital Audio Stationary Head) format consisted of a whole family of open-reel stationary-head recording formats from two tracks up to 48 tracks. DASH-format machines operated at 44.1 kHz or 48 kHz rates (and sometimes optionally at 44.056 kHz), and they allowed varispeed ±12.5%. They were designed to allow gapless punch-in and punch-out, splice editing, electronic editing and easy synchronization. Multitrack DASH machines (an example is shown in Figure 9.6) gained wide acceptance in studios, but the stereo machines did not. Later developments resulted in DASH multitracks capable of storing 24 bit audio instead of the original 16 bits.

Subsequently budget modular multitrack formats were introduced. Most of these were based on eight-track cassettes using rotary head transports borrowed from consumer video technology. The most widely used were the DA-88 format (based on Hi-8 cassettes) and the ADAT format (based on VHS cassettes). These offered most of the features of open-reel machines and a number of them could be synchronized to expand the channel capacity. An example is shown in Figure 9.7.

Editing digital tape recordings

Razor blade cut-and-splice editing was possible on open-reel digital formats, and the analog cue tracks were monitored during these operations.

A 90° butt joint was used for the splice editing of digital tape. The discontinuity in the data stream caused by the splice would cause complete momentary drop-out of the digital signal if no further action were taken, so circuits were incorporated that sensed the splice and performed an electronic crossfade from one side of the splice to the other, with error concealment to minimize the audibility of the splice. It was normally advised that a 0.5 mm gap should be left at the splice so that its presence would easily be detected by the crossfade circuitry. The thin tape could easily be damaged during the cut-and-splice edit procedure and this method failed to gain an enthusiastic following, despite its having been the norm in the analog world. Electronic editing was far more desirable, and was the usual method.

image

FIGURE 9.5 Nagra-D open-reel digital tape recorder. (Courtesy of Sound PR.)

image

FIGURE 9.6 An open-reel digital multitrack recorder: the Sony PCM-3348. (Courtesy of Sony Broadcast and Professional Europe.)

image

FIGURE 9.7 A modular digital multitrack machine, Sony PCM-800. (Courtesy of Sony Broadcast and Professional Europe.)

Electronic editing normally required the use of two machines plus a control unit, as shown in the example in Figure 9.8. A technique was employed whereby a finished master tape was assembled from source takes on player machines. This was a relatively slow process, as it involved real-time copying of audio from one machine to another, and modifications to the finished master were difficult. The digital editor could often store several seconds of program in its memory and this could be replayed at normal speed or under the control of a search knob which enabled very slow to-and-fro searches to be performed in the manner of rock and roll editing on an analog machine. Edits could be rehearsed prior to execution. When satisfactory edit points had been determined the two machines were synchronized using timecode, and the record machine switched to drop in the new section of the recording from the replay machine at the chosen moment. Here a crossfade was introduced between old and new material to smooth the join. The original source tape was left unaltered.

image

FIGURE 9.8 In electronic tape copy editing selected takes are copied in sequence from player to recorder with appropriate crossfades at joins.

MASS STORAGE-BASED SYSTEMS

Once audio is in a digital form it can be handled by a computer, like any other data. The only real difference is that audio requires a high sustained data rate, substantial processing power and large amounts of storage compared with more basic data such as text. The following is an introduction to some of the technology associated with computer-based audio workstations and audio recording using computer mass storage media such as hard disks. More detail will be found in Desktop Audio Technology, as detailed in the ‘Recommended further reading’ list. The MIDI-based aspects of such systems are covered in Chapter 14.

Magnetic hard disks

Magnetic hard disk drives are probably the most common form of mass storage. They have the advantage of being random-access systems – in other words any data can be accessed at random and with only a short delay. There exist both removable and fixed media disk drives, but in almost all cases the fixed media drives have a higher performance than removable media drives. This is because the design tolerances can be made much finer when the drive does not have to cope with removable media, allowing higher data storage densities to be achieved. Some disk drives have completely removable drive cartridges containing the surfaces and mechanism, enabling hard disk drives to be swapped between systems for easy project management (an example is shown in Figure 9.9).

The general structure of a hard disk drive is shown in Figure 9.10. It consists of a motor connected to a drive mechanism that causes one or more disk surfaces to rotate at anything from a few hundred to many thousands of revolutions per minute. Depending on the drive, this rotation may be continuous or may stop and start, and it may run at either a constant or a variable rate. One or more heads are mounted on a positioning mechanism which can move the head across the surface of the disk to access particular points, under the control of hardware and software called a disk controller. The heads read data from and write data to the disk surface by whatever means the drive employs.

image

FIGURE 9.9 A typical removable disk drive system allowing multiple drives to be inserted or removed from the chassis at will. Frame housing multiple removable drives. (Courtesy of Glyph Technologies Inc.)

image

FIGURE 9.10 The general mechanical structure of a disk drive.

The disk surface is normally divided up into tracks and sectors, not physically but by means of ‘soft’ formatting (see Figure 9.11). Low-level formatting places logical markers that indicate block boundaries, amongst other things. On most hard disks the tracks are arranged as a series of concentric rings, but with some optical discs there is a continuous spiral track.

Disk drives look after their own channel coding, error detection and correction so there is no need for system designers to devise dedicated audio processes for disk-based recording systems. The formatted capacity of a disk drive is all available for the storage of ‘raw’ audio data, with no additional overhead required for redundancy and error checking codes. ‘Bad blocks’ are mapped out during the formatting of a disk, and not used for data storage. If a disk drive detects an error when reading a block of data it will attempt to read it again. If this fails then an error is normally generated and the file cannot be accessed, requiring the user to resort to one of the many file recovery packages on the market. Disk-based audio systems do not resort to error interpolation or sample hold operations, unlike tape recorders. Replay is normally either correct or not possible.

RAID arrays enable disk drives to be combined in various ways, as described in Fact File 9.4.

image

FIGURE 9.11 Disk formatting divides the storage area into tracks and sectors.

FACT FILE 9.4 RAID ARRAYS

Hard disk drives can be combined in various ways to improve either data integrity or data throughput. RAID stands for Redundant Array of Inexpensive Disks, and is a means of linking ordinary disk drives under one controller so that they form an array of data storage space. A RAID array can be treated as a single volume by a host computer. There are a number of levels of RAID array, each of which is designed for a slightly different purpose, as summarized below (a minimal sketch of levels 0 and 1 follows the summary).

RAID level 0: Data blocks are split alternately between a pair of disks, with no redundancy, so the array is actually less reliable than a single disk. The transfer rate is higher than that of a single disk, and an intelligent controller can improve access times by positioning the heads so that the next block is ready more quickly.

RAID level 1: Disk mirroring. Data from one disk is automatically duplicated on another, providing a form of real-time backup.

RAID level 2: Bit interleaving spreads the bits of each data word across the disks, so that, say, eight disks each hold one bit of each word, with additional disks carrying error protection data. Head positioning is non-synchronous. Slow to read data, and designed for mainframe computers.

RAID level 3: Similar to level 2, but the heads on all drives are synchronized and only one drive is used for error protection data. Allows high-speed data transfer, because multiple disks operate in parallel, but cannot perform simultaneous read and write operations.

RAID level 4: Whole blocks are written sequentially to each drive in turn, using one dedicated error protection drive. Allows multiple read operations but only single write operations.

RAID level 5: As level 4, but error protection data is split between the drives, avoiding the need for a dedicated check drive. Allows multiple simultaneous reads and writes.

RAID level 6: As level 5, but incorporates RAM caches for higher performance.
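Levels 0 and 1 are simple enough to illustrate directly. The following Python sketch is illustrative only – real RAID operates in the controller, on raw disk blocks – with plain lists standing in for disks:

```python
def raid0_write(blocks, disks):
    # RAID 0: stripe successive blocks alternately across the disks;
    # no redundancy, but transfers can proceed in parallel
    for i, block in enumerate(blocks):
        disks[i % len(disks)].append(block)

def raid1_write(blocks, disks):
    # RAID 1: mirror every block to every disk - real-time backup
    for block in blocks:
        for d in disks:
            d.append(block)

blocks = [f'blk{i}' for i in range(6)]

striped = [[], []]
raid0_write(blocks, striped)
print(striped)     # [['blk0', 'blk2', 'blk4'], ['blk1', 'blk3', 'blk5']]

mirrored = [[], []]
raid1_write(blocks, mirrored)
print(mirrored[0] == mirrored[1])   # True: either disk alone can rebuild the data
```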

Optical discs

There are a number of families of optical disc drive that have differing operational and technical characteristics, although they share the universal benefit of removable media. They are all written and read using a laser, which produces a highly focused beam of coherent light, although the method by which the data is actually stored varies from type to type. Optical discs are sometimes enclosed in a plastic cartridge that protects the disc from damage, dust and fingerprints, and they have the advantage that the pickup never touches the disc surface, making them immune to the ‘head crashes’ that can affect magnetic hard disks.

Compatibility between different optical discs and drives is something of a minefield because the method of formatting and the read/write mechanism may differ. The most obvious differences lie in the erasable or non-erasable nature of the discs and the method by which data is written to and read from the disc, but there are also physical sizes and the presence or lack of a cartridge to consider. Drives tend to split into two distinct families from a compatibility point of view: those that handle CD/DVD formats and those that handle magneto-optical (M-O) and other cartridge-type ISO standard disc formats. The latter may be considered more suitable for ‘professional purposes’ whereas the former are often encountered in consumer equipment.

WORM discs (for example, the cartridges that were used quite widely for archiving in the late 1980s and 1990s) may only be written once by the user, after which the recording is permanent (a CD-R is therefore a type of WORM disc). Other types of optical discs can be written numerous times, either requiring pre-erasure or using direct overwrite methods (where new data is simply written on top of old, erasing it in the process). The read/write process of most current rewritable discs is typically ‘phase change’ or ‘magneto-optical’. The CD-RW is an example of a rewritable disc that now uses direct overwrite principles.

The speed of some optical drives approaches that of a slow hard disk, which makes it possible to use them as an alternative form of primary storage, capable of servicing a number of audio channels.

Memory cards

Increasing use is also made in audio systems of small flash memory cards, particularly in portable recorders. These cards are capable of storing many gigabytes of data on a solid state chip with fast access time, and they have no moving parts which makes them relatively robust. Additionally they have the benefit of being removable, which makes them suitable for transfer of some projects between systems, although the capacity and speed limitations still make disks the medium of choice for large professional projects. Such memory cards come in a variety of formats such as Compact Flash (CF), Secure Digital (SD) and Memory Stick, and card readers can be purchased that will read multiple types. There is a limit to the number of times such devices can be rewritten, which is likely to be lower than that for a typical magnetic disk drive.

Recording audio on to mass storage media

Mass storage media need to offer at least a minimum level of performance capable of handling the data rates and capacities associated with digital audio, as described in Fact File 9.5. The discontinuous ‘bursty’ nature of recording onto such media usually requires the use of a buffer RAM (Random Access Memory) during replay, which accepts this interrupted data stream and stores it for a short time before releasing it as a continuous stream. It performs the opposite function during recording, as shown in Figure 9.12. Several things cause a delay in the retrieval of information from disks: the time it takes for the head positioner to move across a disk, the time it takes for the required data in a particular track to come around to the pickup head, and the transfer of the data from the disk via the buffer RAM to the outside world, as shown in Figure 9.13. Total delay, or data access time, is in practice several milliseconds. The instantaneous rate at which the system can accept or give out data is called the transfer rate and varies with the storage device.

FACT FILE 9.5 STORAGE REQUIREMENTS OF DIGITAL AUDIO

The table shows the data rates required to support a single channel of digital audio at various resolutions. Media to be used as primary storage would need to be able to sustain data transfer at a number of times these rates to be useful for multimedia workstations. The table also shows the number of megabytes of storage required per minute of audio, showing that the capacity needed for audio purposes is considerably greater than that required for text or simple graphics applications. Storage requirements increase pro rata with the number of audio channels to be handled.

Storage systems may use removable media but many have fixed media. It is advantageous to have removable media for audio purposes because it allows different jobs to be kept on different media and exchanged at will, but unfortunately the highest performance is still obtainable from storage systems with fixed media. Although the performance of removable media drives is improving all the time, fixed media drives have so far retained their advantage.

image
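Figures of the kind shown in the table follow directly from the sampling parameters. The following sketch computes per-channel data rates and storage per minute for some common linear PCM resolutions (decimal kilobytes and megabytes assumed):

```python
def audio_storage(sample_rate_hz, bits):
    # per-channel data rate (kbyte/s) and storage per minute (Mbyte)
    bytes_per_sec = sample_rate_hz * bits / 8
    return bytes_per_sec / 1e3, bytes_per_sec * 60 / 1e6

for rate, bits in [(44100, 16), (48000, 16), (48000, 24), (96000, 24)]:
    kb_s, mb_min = audio_storage(rate, bits)
    print(f'{rate / 1000:4.1f} kHz, {bits} bit: {kb_s:6.1f} kbyte/s, {mb_min:5.2f} Mbyte/min')
# 44.1 kHz, 16 bit:   88.2 kbyte/s,  5.29 Mbyte/min
# 48.0 kHz, 16 bit:   96.0 kbyte/s,  5.76 Mbyte/min
# 48.0 kHz, 24 bit:  144.0 kbyte/s,  8.64 Mbyte/min
# 96.0 kHz, 24 bit:  288.0 kbyte/s, 17.28 Mbyte/min
```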

Sound is stored in named data files on the disk, the files consisting of a number of blocks of data stored either separately or together. A directory stored on the disk keeps track of where the blocks of each file are stored so that they can be retrieved in correct sequence. Each file normally corresponds to a single recording of a single channel of audio, although some stereo file formats exist.

image

FIGURE 9.12 RAM buffering is used to convert burst data flow to continuous data flow, and vice versa.

image

FIGURE 9.13 The delays involved in accessing a block of data stored on a disk.

Multiple channels are handled by accessing multiple files from the disk in a time-shared manner, with synchronization between the tracks being performed subsequently in RAM. The storage capacity of a disk can be divided between channels in whatever proportion is appropriate, and it is not necessary to pre-allocate storage space to particular audio channels. For example, a 360 Mbyte disk will store about 60 minutes of mono audio at professional rates. This could be subdivided to give 30 minutes of stereo, 15 minutes of four track, etc., or the proportions could be shared unequally. A feature of the disk system is that unused storage capacity is not necessarily ‘wasted’ as can be the case with a tape system. During recording of a multitrack tape there will often be sections on each track with no information recorded, but that space cannot be allocated elsewhere. On a disk these gaps do not occupy storage space and can be used for additional space on other channels at other times.

The number of audio channels that can be recorded or replayed simultaneously depends on the performance of the storage device, interface, drivers and host computer. Slow systems may only be capable of handling a few channels whereas faster systems with multiple disk drives may be capable of expansion up to a virtually unlimited number of channels. Some systems are modular, allowing for expansion of storage and other audio processing facilities as means allow, with all modules communicating over a high-speed data bus, as shown in Figure 9.14. Increasingly external disks are connected using high-speed serial interfaces such as Firewire (IEEE 1394) (see Fact File 9.6), and as desktop computers get faster and more capable there is no longer a strong need to have dedicated cards for connecting audio-only disk drives. These days one or more of the host computer’s internal or external disks is usually employed, although it is often recommended that this is not the same disk as used for system software in order to avoid conflicts of demand between system housekeeping and audio needs.

image

FIGURE 9.14 Arrangement of multiple disks in a typical modular system, showing how a number of disks can be attached to a single SCSI chain to increase storage capacity. Additional IO cards can be added to increase data throughput for additional audio channels.

FACT FILE 9.6 PERIPHERAL INTERFACES

A variety of different physical interfaces can be used for interconnecting storage devices and host workstations. Some are internal buses designed to operate only over limited lengths of cable and some are external interfaces that can be connected over several meters. The interfaces can be broadly divided into serial and parallel types, the serial types tending to be used for external connections owing to their size and ease of use. The disk interface can be slower than the drive attached to it in some cases, making it into a bottleneck in some applications. There is no point having a super-fast disk drive if the interface cannot handle data at that rate.

SCSI

For many years the most commonly used interface for connecting mass storage media to host computers was SCSI (the Small Computer Systems Interface), pronounced ‘scuzzy’. It is still used quite widely for very high performance applications but EIDE interfaces and drives are now capable of very good performance that can be adequate for many purposes. SCSI is a high-speed parallel interface found on many computer systems, originally allowing up to seven peripheral devices to be connected to a host on a single bus. SCSI has grown through a number of improvements and revisions, the latest being Ultra160 SCSI, capable of addressing 16 devices at a maximum data rate of 160 Mbyte/sec. A new generation of Serial Attached SCSI (SAS) interfaces is also beginning to become available, which retains many of the features of SCSI but uses a serial format.

ATA/IDE

The ATA and IDE family of interfaces has evolved through the years as the primary internal interface for connecting disk drives to PC system buses. It is cheap and ubiquitous. Although drives with such interfaces were not considered adequate for audio purposes in the past, many people are now using them with the on-board audio processing of modern computers as they are cheap and the performance is adequate for many needs. Recent flavors of this interface family include Ultra ATA/66 and Ultra ATA/100, which use a 40-pin connector with an 80-conductor cable and deliver data rates up to either 66 or 100 Mbyte/sec. ATAPI (ATA Packet Interface) is a variant used for storage media such as CD drives.

Serial ATA is a relatively recent development designed to enable disk drives to be interfaced serially, thereby reducing the physical complexity of the interface. High data transfer rates are planned, eventually up to 600 Mbyte/sec. It is intended primarily for internal connection of disks within host workstations, rather than as an external interface like USB or Firewire.

PCMCIA

PCMCIA is a standard expansion port for notebook computers and other small-size computer products. A number of storage media and other peripherals are available in PCMCIA format, and these include flash memory cards, modem interfaces and super-small hard disk drives. The standard is of greatest use in portable and mobile applications where limited space is available for peripheral storage.

Firewire and USB

Firewire and USB are both serial interfaces for connecting external peripherals. They both enable disk drives to be connected in a very simple manner, with high transfer rates (many hundreds of megabits per second), although USB 1.0 devices are limited to 12 Mbit/s. A key feature of these interfaces is that they can be ‘hot plugged’ (in other words devices can be connected and disconnected with the power on). The interfaces also supply basic power that enables some simple devices to be powered from the host device. Interconnection cables can usually be run up to between 5 and 10 meters, depending on the cable and the data rate.

Media formatting

The process of formatting a storage device erases all of the information in the volume. (It may not actually do this, but it rewrites the directory and volume map information to make it seem as if the disk is empty again.) Effectively the volume then becomes virgin territory again and data can be written anywhere.

When a disk is formatted at a low level the sector headers are written and the bad blocks mapped out. A map is kept of the locations of bad blocks so that they may be avoided in subsequent storage operations. Low-level formatting can take quite a long time as every block has to be addressed. During a high-level format the disk may be subdivided into a number of ‘partitions’. Each of these partitions can behave as an entirely independent ‘volume’ of information, as if it were a separate disk drive (see Figure 9.15). It may even be possible to format each partition in a different way, such that a different filing system may be used for each partition. Each volume then has a directory created, which is an area of storage set aside to contain information about the contents of the disk. The directory indicates the locations of the files, their sizes, and various other vital statistics.

image

FIGURE 9.15 A disk may be divided up into a number of different partitions, each acting as an independent volume of information.

The most common general purpose filing systems in audio workstations are HFS (Hierarchical Filing System) or HFS+ (for Mac OS), FAT32 (for Windows PCs) and NTFS (for Windows NT and 2000). The Unix operating system is used on some multi-user systems and high-powered workstations and also has its own filing system. These were not designed principally with real-time requirements such as audio and video replay in mind but they have the advantage that disks formatted for a widely used filing system will be more easily interchangeable than those using proprietary systems. Further information about audio file formats and interchange is provided in the next chapter.

When an erasable volume like a hard disk has been used for some time there will be a lot of files on the disk, and probably a lot of small spaces where old files have been erased. New files must be stored in the available space and this may involve splitting them up over the remaining smaller areas. This is known as disk fragmentation, and it seriously affects the overall performance of the drive. The reason is clear to see from Figure 9.16. More head seeks are required to access the blocks of a file than if they had been stored contiguously, and this slows down the average transfer rate considerably. It may come to a point where the drive is unable to supply data fast enough for the purpose.

There are only two solutions to this problem: one is to reformat the disk completely (which may be difficult, if one is in the middle of a project), the other is to optimize or consolidate the storage space. Various software utilities exist for this purpose, whose job is to consolidate all the little areas of free space into fewer larger areas. They do this by juggling the blocks of files between disk areas and temporary RAM – a process that often takes a number of hours. Power failure during such an optimization process can result in total corruption of the drive, because the job is not completed and files may be only half moved, so it is advisable to back up the drive before doing this. It has been known for some such utilities to make the files unusable by some audio editing packages, because the software may have relied on certain files being in certain physical places, so it is wise to check first with the manufacturer.

AUDIO PROCESSING FOR COMPUTER WORKSTATIONS

Introduction

A lot of audio processing now takes place within the workstation, usually relying either on the host computer’s processing power (using the CPU to perform signal processing operations) or on one or more DSP (digital signal processing) cards attached to the workstation’s expansion bus. Professional systems usually use external A/D and D/A convertors, connected to a ‘core’ card attached to the computer’s expansion bus. This is because it is often difficult to obtain the highest technical performance from convertors mounted on internal sound cards, owing to the relatively ‘noisy’ electrical environment inside most computers. Furthermore, the number of channels required may not fit onto an internal card. As more and more audio work takes place entirely in the digital domain, though, the need for analog convertors decreases. Digital interfaces are also often provided on external ‘breakout boxes’, partly for convenience and partly because of the physical size of the connectors. Compact connectors such as the optical connector used for the ADAT eight-channel interface or the two-channel SPDIF phono connector are accommodated on some cards, but multiple AES/EBU connectors cannot be.

It is also becoming increasingly common for substantial audio processing power to exist on integrated sound cards that contain digital interfaces and possibly A/D and D/A convertors. These cards are typically used for consumer or semi-professional applications on desktop computers, although many now have very impressive features and can be used for advanced operations. Such cards are now available in ‘full duplex’ configurations that enable audio to be received by the card from the outside world, processed and/or stored, then routed back to an external device. Full duplex operation usually allows recording and replay simultaneously.

Sound cards and DSP cards are commonly connected to the workstation using the PCI (Peripheral Component Interconnect) expansion bus. Older ISA (PC) buses or NuBus (Mac) slots did not have the same data throughput capabilities and performance was therefore somewhat limited. PCI or the more recent PCI Express bus can be extended to an external expansion chassis that enables a larger number of cards to be connected than allowed for within the host computer. Sufficient processing power can now be installed for the workstation to become the audio processing ‘heart’ of a larger studio system, as opposed to using an external mixing console and effects units. The higher the sampling frequency, the more DSP operations will be required per second, so it is worth bearing in mind that going up to, say, 96 kHz sampling frequency for a project will require double the processing power and twice the storage space of 48 kHz. The same is true of increasing the number of channels to which processing is applied.

image

FIGURE 9.16 At (a) a file is stored in three contiguous blocks and these can be read sequentially without moving the head. At (b) the file is fragmented and is distributed over three remote blocks, involving movement of the head to read it. The latter read operation will take more time.

The issue of latency is important in the choice of digital audio hardware and software, as discussed in Fact File 9.7.

DSP cards

DSP cards can be added to widely used workstation packages such as Digidesign’s ProTools. These so-called ‘DSP Farms’ or ‘Mix Farms’ are expansion cards that connect to the PCI bus of the workstation and take on much of the ‘number crunching’ work involved in effects processing and mixing. ‘Plug-in’ processing software is becoming an extremely popular and cost-effective way of implementing effects processing within the workstation, and this is discussed further in Chapter 13. ProTools plug-ins usually rely either on DSP Farms or on host-based processing (see the next section) to handle this load.

Digidesign’s TDM (Time Division Multiplex) architecture is a useful example of the way in which audio processing can be handled within the workstation. Here the processing tasks are shared between DSP cards, each card being able to handle a certain number of operations per second. If the system runs out of ‘horse power’ it is possible to add further DSP cards to share the load. Audio is routed and mixed at 24 bit resolution, and a common audio bus links the cards, which are connected by a separate multiway ribbon cable.

FACT FILE 9.7 AUDIO PROCESSING LATENCY

Latency is the delay incurred in executing audio operations between input and output of a system. The lower the better is the rule, particularly when operating a system in full ‘duplex’ mode, because processed sound may be routed back to musicians (for foldback purposes) or may be combined with undelayed sound at some point. The management of latency is a software issue and some systems have sophisticated approaches to ensuring that all supposedly synchronous audio reaches the output at the same time no matter what processing it has encountered on the way.

The minimum achievable latency is both a hardware and a software issue. The poorest systems can give rise to tens or even hundreds of milliseconds between input and output whereas the best reduce this to a few milliseconds. Audio I/O that connects directly to an audio processing card can help to reduce latency; otherwise the communication required between host and various cards can add to the delay. Some real-time audio processing software also implements special routines to minimize and manage critical delays and this is often what distinguishes professional systems from cheaper ones. The audio driver software or ‘middleware’ that communicates between applications and sound cards influences latency considerably. One example of such middleware intended for low latency audio signal routing in computers is Steinberg’s ASIO (Audio Stream Input Output).
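The contribution of driver buffering to latency is easily estimated. The following sketch assumes a simplified model – one buffer each on input and output, ignoring converter and processing delays – and shows why buffer size matters:

```python
def buffer_latency_ms(buffer_samples, sample_rate_hz, buffers=2):
    # delay contributed by driver buffering alone
    return 1000.0 * buffers * buffer_samples / sample_rate_hz

for n in (64, 256, 1024):
    print(f'{n:4d}-sample buffer at 48 kHz: {buffer_latency_ms(n, 48000):5.1f} ms')
#   64-sample buffer at 48 kHz:   2.7 ms
#  256-sample buffer at 48 kHz:  10.7 ms
# 1024-sample buffer at 48 kHz:  42.7 ms
```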

Host-based audio processing

An alternative to using dedicated DSP cards is to use the now substantial processing capacity of a typical desktop workstation. The success of such ‘host-based processing’ obviously depends on the number of tasks that the workstation is required to undertake and this capacity may vary with time and context. It is, however, quite possible to use the host’s own CPU to run DSP ‘plug-ins’ for implementing equalization, mixing and limited effects, provided it is fast enough. The ‘multi-core’ (e.g. quad-core) processor architectures of some modern computers enable the division of processing power between applications, and in some cases one can allocate a specific number of processor cores to an audio application, leaving, say, one or two for system tasks and other applications. This ensures the greatest degree of hardware independence between processing tasks, and avoids conflicts of demand at times of peak processor load.

The software architecture required to run plug-in operations on the host CPU is naturally slightly different to that used on dedicated DSP cards, so it is usually necessary to specify whether the plug-in is to run on the host or on a dedicated resource such as Digidesign’s TDM cards. A number of applications are now appearing, however, that enable the integration of host-based (or ‘native’) plug-ins and dedicated DSP such as TDM-bus cards. Audio processing that runs on the host may be subject to greater latency (input to output delay) than when using dedicated signal processing, and it obviously takes up processing power that could be used for running the user interface or other software. It is nonetheless a cost-effective option for many users that do not have high expectations of a system and it may be possible to expand the system to include dedicated DSP in the future.

Integrated sound cards

Integrated sound cards typically contain all the components necessary to handle audio for basic purposes within a desktop computer and may be able to operate in full duplex mode (in and out at the same time). They typically incorporate convertors, DSP, a digital interface, FM and/or wavetable synthesis engines. Optionally, they may also include some sort of I/O daughter board that can be connected to a break-out audio interface, increasing the number of possible connectors and the options for external analog conversion. Such cards also tend to sport MIDI/joystick interfaces. A typical example of this type of card is the ‘SoundBlaster’ series from Creative Labs.

Any analog audio connections are normally unbalanced and the convertors may be of only limited quality compared with the best external devices. For professional purposes it is advisable to use high-quality external convertors and balanced analog audio connections.

MASS STORAGE-BASED EDITING SYSTEM PRINCIPLES

Introduction

The random access nature of mass storage media led to the coining of the term ‘non-linear editing’ for audio editing based on such media. With non-linear editing the editor may preview a number of possible masters in their entirety before deciding which should be the final one. Even after this, it is a simple matter to modify the edit list to update the master. Edits may also be previewed and experimented with in order to determine the most appropriate location and processing. Crossfades may be modified and adjustments made to equalization and levels, all in the digital domain. Non-linear editing has also come to feature very widely in post-production for video and film.

Non-linear editing is truly non-destructive in that the edited master only exists as a series of instructions to replay certain parts of sound files at certain times, with specified signal processing overlaid, as shown in Figure 9.17. The original sound files remain intact at all times, and a single sound file can be used as many times as desired in different locations and on different tracks without the need for copying the audio data. Editing may involve the simple joining of sections, or it may involve more complex operations such as long crossfades between one album track and the next, or gain offsets between one section and another. All these things are possible without affecting the original source material.

image

FIGURE 9.17 Instructions from an edit decision list (EDL) are used to control the replay of sound file segments from disk, which may be subjected to further processing (also under EDL control) before arriving at the audio outputs.

Sound files and sound segments

In the case of music editing, sound files might be session takes, anything from a few bars to a whole movement, while in picture dubbing they might contain a phrase of dialog or a sound effect. Specific segments of these sound files can be defined while editing, in order to get rid of unwanted material or to select useful extracts. The terminology varies but such identified parts of sound files are usually termed either ‘clips’ or ‘segments’. Rather than creating a copy of the segment or clip and storing it as a separate sound file, it is normal simply to store it as a ‘soft’ entity – in other words as commands in an edit list or project file that identify the start and end addresses of the segment concerned and the sound file to which it relates. It may be given a name by the operator and subsequently used as if it were a sound file in its own right. An almost unlimited number of these segments can be created from original sound files, without the need for any additional audio storage space.
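The ‘soft’ nature of a segment can be captured in a very small data structure. The following sketch is an illustrative model rather than any particular system’s format, and the file name is hypothetical; note that both segments reference the same audio without copying it:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    # a 'soft' clip: no audio is copied, only a reference into a file
    name: str
    sound_file: str      # the recording in which the audio actually lives
    in_point: int        # start address within that file, in samples
    out_point: int       # end address within that file, in samples

take = 'violin_take_03.wav'       # hypothetical sound file
verse = Segment('verse', take, 0, 441_000)
reprise = Segment('reprise (same audio reused)', take, 0, 441_000)
```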

Edit point handling

Edit points can be simple butt joins or crossfades. A butt join is very simple because it involves straightforward switching from the replay of one sound segment to another. Since replay involves temporary storage of the sound file blocks in RAM (see above) it is a relatively simple matter to ensure that both outgoing and incoming files in the region of the edit are available in RAM simultaneously (in different address areas). Up until the edit, blocks of the outgoing file are read from the disk into RAM and thence to the audio outputs. As the edit point is reached a switch occurs between outgoing and incoming material by instituting a jump in the memory read address corresponding to the start of the incoming material. Replay then continues by reading subsequent blocks from the incoming sound file. It is normally possible to position edits to single sample accuracy, making the timing resolution as fine as a few tens of microseconds if required.

The problem with butt joins is that they are quite unsubtle. Audible clicks and bumps may arise because of the discontinuity in the waveform at the join, as shown in Figure 9.18. It is normal, therefore, to use at least a short crossfade at edit points to hide the effect of the join. This is what happens when analog tape is spliced, because the traditional angled cut has the same effect as a short crossfade (of between 5 and 20 ms depending on the tape speed and angle of cut). Most workstations have considerable flexibility with crossfades and are not limited to short durations. It is now common to use crossfades of many shapes and durations (e.g. linear, root cosine, equal power) for different creative purposes. This, coupled with the ability to preview edits and fine-tune their locations, has made it possible to put edits in places previously considered impossible.

image

FIGURE 9.18 (a) A bad butt edit results in a waveform discontinuity. (b) Butt edits can be made to work if there is minimal discontinuity.

The locations of edit points are kept in an edit decision list (EDL) which contains information about the segments and files to be replayed at each time, the in and out points of each section and details of the crossfade time and shape at each edit point. It may also contain additional information such as signal processing operations to be performed (gain changes, EQ, etc.).

Crossfading

Crossfading is similar to butt joining, except that it requires access to data from both incoming and outgoing files for the duration of the crossfade. The crossfade calculation involves simple signal processing, during which the values of outgoing samples are multiplied by gradually decreasing coefficients whilst the values of incoming samples are multiplied by gradually increasing coefficients. Time-coincident samples of the two files are then added together to produce output samples, as described in the previous chapter. The duration and shape of the crossfade can be adjusted by altering the coefficients involved and the rate at which the process is executed.
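This coefficient-multiply-and-sum operation is compact enough to sketch directly. The following function is a minimal illustration (not any particular product’s implementation), fading between two equal-length sample blocks:

```python
import numpy as np

def crossfade(outgoing: np.ndarray, incoming: np.ndarray,
              shape: str = "linear") -> np.ndarray:
    """Multiply outgoing samples by decreasing coefficients and incoming
    samples by increasing ones, then sum time-coincident samples."""
    n = len(outgoing)
    t = np.linspace(0.0, 1.0, n)
    if shape == "equal_power":
        gain_in, gain_out = np.sin(0.5 * np.pi * t), np.cos(0.5 * np.pi * t)
    else:  # linear
        gain_in, gain_out = t, 1.0 - t
    return outgoing * gain_out + incoming * gain_in
```

Changing the length of the blocks alters the fade duration; changing the coefficient law alters its shape, as discussed below.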

Crossfades are either performed in real time, as the edit point passes, or pre-calculated and written to disk as a file. Real-time crossfades can be varied at any time and are simply stored as commands in the EDL, indicating the nature of the fade to be executed. The process is similar to that for the butt edit, except that, as the edit point approaches, samples from both incoming and outgoing segments are loaded into RAM in order that there is an overlap in time. During the crossfade it is necessary to continue to load samples from both incoming and outgoing segments into their respective areas of RAM, and for these to be routed to the crossfade processor, as shown in Figure 9.19. The resulting samples are then available for routing to the output. Alternatively the crossfade can be calculated in non-real time. This incurs a short delay while the system works out the sums, after which a new sound file is stored which contains only the crossfade. Replay of the edit then involves playing the outgoing segment up to the beginning of the crossfade, then the crossfade file, then the incoming segment from after the crossfade, as shown in Figure 9.20 (a sketch of this approach follows the figures below). Load on the disk drive is no higher than normal in this case.

image

FIGURE 9.19 Conceptual diagram of the sequence of operations which occur during a crossfade. X and Y are the incoming and outgoing sound segments.

image

FIGURE 9.20 Replay of a precalculated crossfade file at an edit point between files X and Y.
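The pre-calculated approach can be sketched as follows. This is an illustrative fragment, not a real system’s code: it renders only the fade region to a small file-like array and returns a replay plan of three consecutive spans.

```python
import numpy as np

def render_crossfade_file(out_seg: np.ndarray, in_seg: np.ndarray,
                          edit: int, fade_len: int):
    """Pre-calculate the crossfade region only; in a real system the
    result would be written to disk as a small sound file of its own."""
    t = np.linspace(0.0, 1.0, fade_len)
    fade = out_seg[edit:edit + fade_len] * (1.0 - t) + in_seg[:fade_len] * t
    # Replay plan: outgoing up to the edit, then the fade file,
    # then the incoming segment from after the fade
    plan = [("outgoing", 0, edit),
            ("fade_file", 0, fade_len),
            ("incoming", fade_len, len(in_seg))]
    return fade, plan
```

Because only one stream at a time is then read from disk during replay, the drive load stays the same as for unedited material.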

The shape of the crossfade can usually be changed to suit different operational purposes. Standard linear fades (those where the gain changes uniformly with time) are not always the most suitable for music editing, especially when the crossfade is longer than about ten milliseconds. There may be a momentary drop in level at the center of the crossfade, owing to the way in which the sound levels from the two files add together. If there is a random phase difference between the signals, as there will often be in music, the rise in level resulting from adding the two signals will normally be around 3 dB, but the linear crossfade is 6 dB down in its center, resulting in an overall level drop of around 3 dB (see Figure 9.21). Exponential crossfades and other such shapes may be more suitable for these purposes, because they have a smaller level drop in the center. It may even be possible to design customized crossfade laws. It is often possible to alter the offset of the start and end of the fade from the actual edit point and to have a faster fade-up than fade-down.
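These figures are easy to verify. At the midpoint of a linear fade each signal is at half amplitude (−6 dB); for uncorrelated signals the powers, not the amplitudes, add. The short calculation below checks the quoted levels for linear and equal-power laws:

```python
import math

def db(x: float) -> float:
    return 20 * math.log10(x)

linear_mid = 0.5                         # linear law: -6 dB per signal at midpoint
equal_power_mid = math.sin(math.pi / 4)  # ~0.707: -3 dB per signal at midpoint

# Uncorrelated signals: summed power = sum of the individual powers
print(db(math.sqrt(2 * linear_mid ** 2)))       # ~ -3.0 dB net dip (linear)
print(db(math.sqrt(2 * equal_power_mid ** 2)))  # ~  0.0 dB (equal power)
```

For fully coherent material, on the other hand, amplitudes add and the linear fade sums exactly to unity, which is why the most suitable law depends on the program material.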

Many systems also allow automated gain changes to be introduced as well as fades, so that level differences across edit points may be corrected. Figure 9.22 shows a crossfade profile which has a higher level after the edit point than before it, and different slopes for the in- and out-fades. A lot of the difficulties that editors encounter in making edits work can be solved using a combination of these facilities.

image

FIGURE 9.21 Summation of levels at a crossfade. (a) A linear crossfade can result in a level drop if the incoming and outgoing material are non-coherent. (b) An exponential fade, or another similar law, can help to make the level more constant across the edit.

Editing modes

During the editing process the operator will load appropriate sound files and audition them, both on their own and in a sequence with other files. The exact method of assembling the edited sequence depends very much on the user interface, but it is common to present the user with a visual analogy of moving tape, allowing files to be ‘cut and spliced’ or ‘copied and pasted’ into appropriate locations along the virtual tape. These files, or edited clips of them, are then played out at the timecode locations corresponding to their positions on this ‘virtual tape’. It is also quite common to display a representation of the audio waveform that allows the editor to see as well as hear the signal around the edit point (see Figure 9.23).

In non-linear systems the tape-based approach is often simulated, allowing the user to roughly locate an edit point while playing the virtual tape, followed by a fine trim using simulated reel-rocking or a detailed view of the waveform. Some software presents source and destination streams as well, in further simulation of the tape approach. It is also possible to insert or change sections in the middle of a finished master, provided that the EDL and source files are still available. To take an example, assume that an edited opera has been completed and that the producer now wishes to change a take somewhere in the middle (see Figure 9.24). The replacement take is unlikely to be exactly the same length, but it is possible simply to shuffle all of the following material along or back slightly to accommodate it, this being only a matter of changing the EDL rather than modifying the stored music in any way. The files are then simply played out at slightly different times than in the first version of the edit.

It is also normal to allow edited segments to be fixed in time if desired, so that they are not shuffled forwards or backwards when other segments are inserted. This ‘anchoring’ of segments is often used in picture dubbing when certain sound effects and dialog have to remain locked to the picture.
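Such a ‘ripple’ of following material, together with the anchoring of picture-locked segments, amounts to nothing more than arithmetic on EDL start times. A hypothetical sketch (the dictionary-based EDL structure here is an assumption for illustration):

```python
def replace_take(edl: list, index: int, new_length: int) -> list:
    """Swap in a take of a different length and ripple later entries.
    Only EDL timing data changes; no stored audio is moved or copied."""
    delta = new_length - edl[index]["length"]
    edl[index]["length"] = new_length
    for entry in edl[index + 1:]:
        if not entry.get("anchored", False):  # anchored segments stay locked
            entry["start"] += delta
    return edl

# Example: entry 1 is replaced by a take 2400 samples longer;
# entry 2 ripples later, while the anchored effect at entry 3 does not move.
edl = [{"start": 0, "length": 100_000},
       {"start": 100_000, "length": 50_000},
       {"start": 150_000, "length": 80_000},
       {"start": 200_000, "length": 10_000, "anchored": True}]
replace_take(edl, 1, 52_400)
```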

image

FIGURE 9.22 The system may allow the user to program a gain profile around an edit point, defining the starting gain (A), the fade-down time (B), the fade-up time (D), the point below unity at which the two files cross over (C) and the final gain (E).

image

FIGURE 9.23 Example from SADiE editing system showing the ‘trim editor’ in which is displayed a detailed view of the audio waveform around the edit point, together with information about the crossfade.

image

FIGURE 9.24 Replacing a take in the middle of an edited program. (a) Tape-based copy editing results in a gap of fixed size, which may not match the new take length. (b) Non-linear editing allows the gap size to be adjusted to match the new take.

Simulation of ‘reel-rocking’

It is common to simulate the effect of analog tape ‘reel-rocking’ in non-linear editors, providing the user with the sonic impression that reels of analog tape are being ‘rocked’ back and forth, as they are in analog tape editing when fine-searching edit points. Editors are used to the sound of tape moving in this way, and are skilled at locating edit points when listening to such a sound.

The simulation of variable speed replay in both directions (forwards and backwards) is usually controlled by a wheel or sideways movement of a mouse which moves the ‘tape’ in either direction around the current play location. The magnitude and direction of this movement is used to control the rate at which samples are read from the disk file, via the buffer, and this replaces the fixed sampling rate clock as the controller of the replay rate. Systems differ greatly in the sound quality achieved in this mode, because it is in fact quite a difficult task to provide a convincing simulation. So poor have been some attempts that many editors do not use the feature, preferring to judge edit points accurately ‘on the fly’, followed by trimming or nudging them either way if they are not successful the first time. Good simulation requires very fast, responsive action and an ergonomically suitable control; a mouse is very unsuitable for the purpose. It also requires a certain amount of DSP to filter the signal correctly, in order to avoid the aliasing that can be caused by varying the sampling rate.
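At its core, scrub replay means reading samples at a fractional, user-controlled rate instead of at the fixed sampling rate. The sketch below uses simple linear interpolation driven by a jog ‘velocity’; this is only a conceptual illustration – a convincing implementation would use proper band-limited resampling to avoid the aliasing mentioned above:

```python
import numpy as np

def scrub_block(sound: np.ndarray, pos: float, velocity: float,
                block: int = 256):
    """Produce `block` output samples starting at fractional position
    `pos`, at `velocity` times normal speed (negative = backwards).
    Linear interpolation stands in for band-limited resampling."""
    idx = pos + velocity * np.arange(block)
    idx = np.clip(idx, 0, len(sound) - 2)
    i = idx.astype(int)
    frac = idx - i
    out = sound[i] * (1.0 - frac) + sound[i + 1] * frac
    return out, pos + velocity * block  # output samples and new position
```

The jog wheel (or mouse movement) would update `velocity` continuously, so the read position accelerates, slows and reverses just as rocked tape reels do.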

EDITING SOFTWARE

It is increasingly common for MIDI (see Chapter 14) and digital audio editing to be integrated within one software package, particularly for pop music recording and other multitrack productions where control of electronic sound sources is integrated with recorded natural sounds. Such applications used to be called sequencers, but this is less common now that MIDI sequencing is only one of many tasks they perform. Although most sequencers contain some form of audio editing these days, there are some software applications more specifically targeted at high-quality audio editing and production. These have tended to come from a professional audio background rather than a MIDI sequencing background, although admittedly the two fields have now met in the middle and it is increasingly hard to distinguish a MIDI sequencer with added audio features from an audio editor with added MIDI features.

Audio applications such as those described here are used in contexts where MIDI is not particularly important and where fine control over editing crossfades, dithering, mixing, mastering and post-production functions is required. Here the editor needs tools for such things as: previewing and trimming edits, such as might be necessary in classical music post-production; PQ editing of CD masters; preparing surround sound DVD material for encoding; MLP or AC-3 encoding of audio material; editing of DSD material for Super Audio CD. The following example, based on the SADiE audio editing system, demonstrates some of the practical concepts.

SADiE workstations run on the PC platform and most utilize an external audio interface. Recent Series 5 systems, however, can be constructed as an integrated rack-mounted unit containing audio interfaces and a Pentium PC. Both PCM and DSD signal processing options are available and the system makes provision for lossless encoding for DVD-Audio and Blu-ray, as well as SACD mastering and encoding. A typical user interface for SADiE is shown in Figure 9.25. It is possible to see transport controls, the mixer interface and the playlist display. The main part of the screen is occupied by a horizontal display of recording tracks or ‘streams’, and these are analogous to the tracks of a multitrack tape recorder. A record icon associated with each stream is used to arm it ready for recording. As recording proceeds, the empty streams are filled from left to right across the screen in real time, led by a vertical moving cursor. These streams can be displayed either as solid continuous blocks or as waveforms, the latter being the usual mode when editing is undertaken. After recording, extra streams can be recorded if required simply by disarming the record icons of the streams already used and arming the record icons of empty streams below them, making it possible to build up a large number of ‘virtual’ tracks as required. The maximum number that can be replayed simultaneously depends upon the memory and DSP capacity of the system used. A basic two-input/four-output system might allow up to eight streams to be replayed (depending on the amount of DSP being used for other tasks), and a fully equipped system can allow at least 32 simultaneous streams of program material to be recorded and replayed, i.e. it is a complete multitrack recording machine.

image

FIGURE 9.25 SADiE editor displays, showing mixer, playlist, transport controls and project elements.

Replay involves either using the transport control display or clicking the mouse at a desired position on a time-bar towards the top of the screen, thereby positioning the moving cursor (which is analogous to a tape head) where one wishes replay to begin. Editing is performed by means of a razor-blade icon, which will make the cut where the moving cursor is positioned. Alternatively, an edit icon can be loaded to the mouse’s cursor for positioning anywhere on any individual stream to make a cut.

Audio can be arranged in the playlist by the normal processes of placing, dragging, copying and pasting, and there is a range of options for slipping material left or right in the list to accommodate new material (this ensures that all previous edits remain attached in the right way when the list is slipped backwards or forwards in time). Audio to be edited in detail can be viewed in the trim window (shown earlier in Figure 9.23) which shows a detailed waveform display, allowing edits to be previewed either to or from the edit point, or across the edit, using the play controls in the top right-hand corner (this is particularly useful for music editing). The crossfade region is clearly visible, with different colors and shadings used to indicate the ‘live’ audio streams before and after the edit. There are many stages of undo and redo so that nothing need be permanent at this stage. When a satisfactory edit is achieved, it can be written back to the main display where it will be incorporated. Scrub and jog actions for locating edit points are also possible. A useful ‘lock to time’ icon is provided which can be activated to prevent horizontal movement of the streams so that they cannot be accidentally moved out of sync with each other during editing.

The mixer section can be thought of in conventional terms, and indeed some systems offer physical plug-in interfaces with moving fader automation for those who prefer them. As well as mouse control of such things as fader, pan, solo and mute, processing such as EQ, filters, aux send and compression can be selected from an effects ‘rack’, and each can be dragged across and dropped in above a fader where it will become incorporated into that channel. Third-party ‘plug-in’ software is also available for many systems to enhance the signal processing features, including CEDAR audio restoration software, as described below. The latest software allows for the use of DirectX plug-ins for audio processing. Automation of faders and other processing is also possible.

image

FIGURE 9.26 CEDAR Retouch display for SADiE, showing frequency (vertical) against time (horizontal) and amplitude (color/density). Problem areas of the spectrographic display can be highlighted and a new signal synthesized using information from the surrounding region. (a) Harmonics of an interfering signal can be clearly seen. (b) A short-term spike crosses most of the frequency range.

MASTERING AND RESTORATION

Software

Some software applications are designed specifically for the mastering and restoration markets. These products are designed either to enable ‘fine tuning’ of master recordings prior to commercial release, involving subtle compression, equalization and gain adjustment (mastering), or to enable the ‘cleaning up’ of old recordings that have hiss, crackle and clicks (restoration).

CEDAR applications or plug-ins are good examples of the restoration group. Sophisticated controls are provided for the adjustment of dehissing and decrackling parameters, which often require considerable skill to master. Recently the company has introduced advanced visualization tools that enable restoration engineers to ‘touch up’ audio material using an interface not dissimilar to that used for photo editing on computers. Audio anomalies (unwanted content) can be seen in the time and frequency domains, highlighted, and interpolated using information from either side of the anomaly. A typical display from its Retouch product for the SADiE platform is shown in Figure 9.26.

CEDAR’s restoration algorithms are typically divided into ‘decrackle’, ‘declick’, ‘dethump’ and ‘denoise’, each depending on the nature of the anomaly to be corrected. Some typical user interfaces for controlling these processes are shown in Figure 9.27.

Mastering software usually incorporates advanced dynamics control; a good example is the TC Works Master X series, based on the company’s Finalizer products, a user interface of which is pictured in Figure 9.28. Here compressor curves and frequency dependency of dynamics can be adjusted and metered. The display also allows the user to view the number of samples at peak level, to watch for digital overloads that might be problematic.

image

FIGURE 9.27 CEDAR restoration plug-ins for SADiE, showing (a) declick and (b) denoise processes.

Level control in mastering

Level control, it might be argued, is less crucial than it used to be in the days when a recording engineer struggled to optimize a recording’s dynamic range between the noise floor and the distortion ceiling (see Figure 9.29). However, there are still artistic and technical considerations.

image

FIGURE 9.28 TC Works Master X mastering dynamics plug-in interface.

image

FIGURE 9.29 Comparison of analog and digital dynamic range. (a) Analog tape has increasing distortion as the recording level increases, with an effective maximum output level at 3% third harmonic distortion. (b) Modern high-resolution digital systems have wider dynamic range with a noise floor fixed by dither noise and a maximum recording level at which clipping occurs. The linearity of digital systems does not normally become poorer as signal level increases, until 0 dBFS is reached. This makes level control a somewhat less important issue at the initial recording stage, provided sufficient headroom is allowed for peaks.

The dynamic range of a typical digital audio system can now be well over 100 dB and there is room for the operator to allow a reasonable degree of ‘headroom’ between the peak audio signal level and the maximum allowable level. Meters are provided to enable the signal level to be observed, and they are usually calibrated in dB, with zero at the top and negative dBs below this. The full dynamic range is not always shown, and there may be a peak bar that can hold the maximum level permanently or temporarily. As explained in Chapter 8, 0 dBFS (full scale) is the point at which all of the bits available to represent the signal have been used. Above this level the signal clips and the effect of this is quite objectionable, except on very short transients where it may not be noticed. It follows that signals should never be allowed to clip.

There is a tendency in modern audio production to want to master everything so that it sounds as loud as possible, and to ensure that the signal peaks as close to 0 dBFS as possible. This level maximizing or normalizing process can be done automatically in most packages, the software searching the audio track for its highest level sample and then adjusting the overall gain so that this just reaches 0 dBFS. In this way the recording can be made to use all the bits available, which can be useful if it is to be released on a relatively low-resolution consumer medium where noise might be more of a problem. (It is important to make sure that correct redithering is used when altering the level and requantizing, as explained in Chapter 8.) This does not, of course, take into account any production decisions that might be involved in adjusting the overall levels of individual tracks on an album or other compilation, where relative levels should be adjusted according to the nature of the individual items, their loudness and the producer’s intent.
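In essence the maximizing process is a peak search followed by a gain scaling. A minimal sketch, assuming floating-point samples in the range ±1.0 and leaving out the redithering that must accompany any such gain change before requantization:

```python
import numpy as np

def normalize(samples: np.ndarray, target_dbfs: float = -1.0) -> np.ndarray:
    """Scale so the highest-magnitude sample just reaches `target_dbfs`.
    Redithering should follow any gain change before requantizing."""
    peak = np.max(np.abs(samples))
    if peak == 0.0:
        return samples  # silence: nothing to scale
    target = 10.0 ** (target_dbfs / 20.0)
    return samples * (target / peak)
```

Note the default target a little below full scale; the reason for leaving this margin is explained next.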

A little-known but important fact is that even if the signal is maximized in the automatic fashion, so that the highest sample value just does not clip, subsequent analog electronics in the signal chain may still do so. Some equipment is designed in such a way that the maximum digital signal level is aligned to coincide with the clipping voltage of the analog electronics in a D/A convertor. In fact, owing to the response of the reconstruction filter in the D/A convertor (which reconstructs an analog waveform from the PAM pulse train) intersample signal peaks can be created that slightly exceed the analog level corresponding to 0 dBFS, thereby clipping the analog side of the convertor. For this reason it is recommended that digital-side signals are maximized so that they peak a few dB below 0 dBFS, in order to avoid the distortion that might otherwise result on the analog side. Some mastering software provides detailed analysis of the signal showing exactly how many samples occur in sequence at peak level, which can be a useful warning of potential or previous clipping.
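The intersample overshoot described above can be estimated by reconstructing the waveform at a higher sampling rate and measuring the peak again. The following sketch (assuming SciPy is available) illustrates the idea with the classic case of a quarter-sampling-frequency sine whose stored samples all sit exactly at full scale:

```python
import numpy as np
from scipy.signal import resample_poly

def true_peak_dbfs(samples: np.ndarray, oversample: int = 4) -> float:
    """Estimate the reconstructed (intersample) peak by oversampling.
    The result can exceed 0 dBFS even when no stored sample clips."""
    up = resample_poly(samples, oversample, 1)
    return 20.0 * np.log10(np.max(np.abs(up)))

fs = 44_100
n = np.arange(1024)
x = np.sin(2 * np.pi * (fs / 4) * n / fs + np.pi / 4)
x /= np.max(np.abs(x))      # every stored sample now peaks at 0 dBFS
print(true_peak_dbfs(x))    # around +3 dB: the analog side would clip
```

Mastering to a peak level somewhat below 0 dBFS, as recommended above, keeps such reconstructed peaks clear of the analog clipping point.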

PREPARING FOR AND UNDERSTANDING RELEASE MEDIA

Consumer release formats such as CD, DVD (see Chapter 10), and MP3 (see Chapter 8) usually require some form of mastering and pre-release preparation. This can range from subtle tweaks to the sound quality and relative levels on tracks to PQ encoding, DVD authoring, data encoding and the addition of graphics, video and text. Some of these have already been mentioned in other places in this book.

PQ encoding for CD mastering can often be done in some of the application packages designed for audio editing, such as SADiE and Pyramix. In this case it may involve little more than marking the starts and ends of the tracks in the playlist and allowing the software to work out the relevant frame advances and Red Book requirements for the assembly of the PQ code that will either be written to a CD-R or included in the DDP file for sending to the pressing plant. The CD only comes at one resolution and sampling frequency (16 bit, 44.1 kHz), making release preparation a relatively straightforward matter.

DVD mastering is considerably more complicated than CD and requires advanced authoring software that can deal with all the different options possible on this multi-faceted release format. A number of different combinations of players and discs are possible, as explained in Fact File 9.8, although the DVD-Audio format has not been particularly successful commercially. DVD-Video allows for 48 or 96 kHz sampling frequency and 16, 20 or 24 bit PCM encoding. A two-channel downmix must be available on the disc in linear PCM form (for basic compatibility), but most discs also include Dolby Digital or possibly DTS surround audio. Dolby Digital encoding usually involves the preparation of a file or files containing the compressed data, and a range of settings have to be made during this process, such as the bit rate, dialog normalization level, rear channel phase shift and so on. A typical control screen is shown in Figure 9.30. Then of course there are the pictures, but they are not the topic of this book.

Playing time depends on the way in which producers decide to use the space available on the disc, and this requires the juggling of the available bit budget. DVD-Audio can store at least 74 minutes of stereo audio even at the highest sample rate and resolution (192/24). Other modes are possible, with up to six channels of audio playing for at least 74 minutes, using combinations of sample frequency and resolution, together with MLP. Six-channel audio can only operate at the two lower sample rates of either class (44.1/88.2 or 48/96).
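A rough bit-budget check shows why MLP matters here. Assuming a single-layer 4.7 GB disc, uncompressed 192 kHz/24 bit stereo PCM would not quite fit 74 minutes, so lossless packing is needed to achieve the quoted playing time:

```python
# Raw PCM data rate for 192 kHz, 24 bit, 2 channels
rate_bits = 192_000 * 24 * 2   # 9,216,000 bit/s (~9.2 Mbit/s)
seconds = 74 * 60
raw_bytes = rate_bits * seconds / 8
print(raw_bytes / 1e9)         # ~5.1 GB, more than a 4.7 GB DVD-5,
                               # hence the need for MLP lossless packing
```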

DVD masters are usually transferred to the pressing plant on DLT tapes, using the Disc Description Protocol (DDP), or on DVD-R(A) discs as a disc image with a special CMF (cutting master format) header in the disc lead-in area containing the DDP data.

FACT FILE 9.8 DVD DISCS AND PLAYERS

There are at least three DVD player types in existence (audio, universal and video), although only the DVD-Video format has gained a substantial market presence. There are also two types of DVD-Audio disc, one containing only audio objects and the other (the DVD-AudioV) capable of holding video objects as well. The video objects on a DVD-AudioV are just the same as DVD-Video objects and therefore can contain video clips, Dolby AC-3 compressed audio and other information. In addition, there is the standard DVD-Video disc.

DVD-AudioV discs should play back in audio players and universal players. Any video objects on an AudioV disc should play back on video-only players. The requirement for video objects on DVD-AudioV discs to contain PCM audio was dropped at the last moment, so that such objects could contain only AC-3 audio if desired. This means that an audio disc could contain a multichannel AC-3 audio stream in a video object, enabling it to be played in a video player. This is a good way of ensuring that a multichannel audio disc plays back in as many different types of player as possible, but it requires the content producer to include the AC-3 video object in addition to MLP or PCM audio objects. The video object can also contain a DTS audio bitstream if desired. Figure courtesy of Bike Suzuki (DVD-Audio Forum).

image

image

FIGURE 9.30 Screen display of Dolby Digital encoding software options.

image

FIGURE 9.31 Example of SACD text authoring screen from SADiE.

SACD authoring software enables text information to be added, as shown in Figure 9.31. SACD masters are normally submitted to the pressing plant on AIT format data tapes.

Sony and Philips have paid considerable attention to copy protection and anti-piracy measures on the disc itself. Comprehensive visible and invisible watermarking are standard features of the SACD. Using a process known as PSP (Pit Signal Processing) the width of the pits cut into the disc surface is modulated in such a fashion as to create a visible image on the surface of the CD layer, if desired by the originator. This provides a visible means of authentication. The invisible watermark is a mandatory feature of the SACD layer and is used to authenticate the disc before it will play on an SACD player. The watermark is needed to decode the data on the disc. Discs without this watermark will simply be rejected by the player. It is apparently not possible to copy this watermark by any known means. Encryption of digital music content is also optional, at the request of software providers.

MP3, as already explained elsewhere, is actually MPEG-1, Layer 3 encoded audio, stored in a data file, usually for distribution to consumers either on the Internet or on other release media. Consumer disc players are increasingly capable of replaying MP3 files from CDs, for example. MP3 mastering requires that the two-channel audio signal is MPEG-encoded, using one of the many MP3 encoders available. Some mastering software now includes MP3 encoding as an option, as well as other data-reduced formats such as AAC.

Some of the choices to be made in this process concern the data rate and audio bandwidth to be encoded, as this affects the sound quality. The lowest bit rates (e.g. below 64 kbit/s) will tend to sound noticeably poorer than the higher ones, particularly if full audio bandwidth is retained. For this reason some encoders limit the bandwidth or halve the sampling frequency for very low bit rate encoding, because this tends to minimize the unpleasant side-effects of MPEG encoding. It is also possible to select joint stereo coding mode, as this will improve the technical quality somewhat at low bit rates, possibly at the expense of stereo imaging accuracy. As mentioned above, at very low bit rates some audio processing may be required to make sound quality acceptable when squeezed down such a small pipe.

Commercial tools for interactive authoring and MPEG-4 encoding are only just beginning to appear at the time of writing. Such tools enable audio scenes to be described and data encoded in a scalable fashion so that they can be rendered at the consumer replay end of the chain, according to the processing power available.

RECOMMENDED FURTHER READING

Katz, B., 2007. Mastering Audio: The Art and Science. Focal Press.

Leider, C., 2004. Digital Audio Workstation: Mixing, Recording and Mastering Your MAC or PC. McGraw-Hill Professional.

Rumsey, F., 2004. Desktop Audio Technology. Focal Press.

Watkinson, J., 2001. The Art of Digital Audio. Focal Press.
