image

INTRODUCTION

As the power of digital processing devices continues to rise without a corresponding increase in price, this subject has seen, and will continue to see, dramatic changes. Increasingly, processes that once required dedicated hardware are carried out in general-purpose processors. Gate arrays grow ever more capable. Although the hardware changes, the principles explained here remain the same.

A SIMPLE DIGITAL VISION MIXER

The luminance path of a simple SD component digital mixer is shown in Figure 5.1. The CCIR-601 digital input is offset binary in that it has a nominal black level of 16₁₀ in an 8-bit system (64 in a 10-bit system), and a subtraction has to be made in order that fading will take place with respect to black. On a perfect signal, subtracting 16 (or 64) would achieve this, but on a slightly out-of-range signal, it would not. Because the digital active line is slightly longer than the analog active line, the first sample should be blanking level, and this will be the value to subtract to obtain pure binary luminance with respect to black. This is the digital equivalent of black-level clamping. The two inputs are then multiplied by their respective coefficients and added together to achieve the mix. Peak limiting will be required as in Chapter 3, under Binary Addition, and then, if the output is to be to CCIR-601, 16₁₀ (or 64) must be added to each sample value to establish the correct offset. In some video applications, a cross-fade will be needed, and a rearrangement of the cross-fading equation allows one multiplier to be used instead of two, as shown in Figure 5.2.
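The arithmetic of Figures 5.1 and 5.2 can be sketched in software. The fragment below (Python, with illustrative sample values; the function name and the nominal peak of 235 are assumptions, not taken from the figures) removes the offset, performs the single-multiplier cross-fade B + K(A − B), limits the peak, and restores the offset.

```python
def crossfade_luma(a, b, k, offset=16, peak=235):
    """Cross-fade two offset-binary 8-bit luminance samples with one multiply.

    a, b   : input sample values (offset binary, black = offset)
    k      : fade coefficient, 0.0 (all b) to 1.0 (all a)
    offset : value subtracted so that fading takes place with respect to black
    peak   : limiter ceiling (nominal white) before the offset is restored
    """
    a0 = a - offset                          # remove offset
    b0 = b - offset
    mix = b0 + k * (a0 - b0)                 # one multiplier instead of two
    mix = max(0, min(mix, peak - offset))    # peak limiting
    return int(round(mix)) + offset          # restore the CCIR-601 offset

print(crossfade_luma(180, 60, 0.5))          # half-way cross-fade
```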

image

FIGURE 5.1

A simple digital mixer. Offset binary inputs must have the offset removed. A digital integrator will produce a counter-offset, which is subtracted from every input sample. This will increase or reduce until the output of the subtractor is zero during blanking. The offset must be added back after processing if a CCIR-601 output is required.

The colour difference signals are offset binary with an offset of 128₁₀ in 8-bit systems (512 in 10-bit systems), and again it is necessary to normalize these with respect to blanking level so that proper fading can be carried out. Because colour difference signals can be positive or negative, this process results in two's complement samples. Figure 5.3 shows some examples. In this form, the samples can be added with respect to blanking level.
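In 8-bit working the conversion amounts to inverting the most significant bit, and inverting it again restores offset binary. A minimal sketch (Python; the function names are illustrative):

```python
def offset_to_twos_complement(cd):
    """Convert an 8-bit offset-binary colour difference sample (blanking = 128)
    to a signed value centred on zero by inverting the most significant bit."""
    flipped = cd ^ 0x80
    return flipped - 256 if flipped >= 128 else flipped

def twos_complement_to_offset(signed):
    """Inverse conversion, back to offset binary for a CCIR-601 output."""
    return (signed & 0xFF) ^ 0x80

a = offset_to_twos_complement(0x90)              # 16 above blanking
b = offset_to_twos_complement(0x70)              # 16 below blanking
print(a + b)                                     # samples add around blanking
print(hex(twos_complement_to_offset(a + b)))     # 0x80 = blanking level
```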

Following addition, a limiting stage is used as before, and then, if it is desired to return to CCIR-601 standard, the MSB must be inverted once more to convert from two's complement to offset binary.

In practice the same multiplier can be used to process luminance and colour difference signals. Because these will be arriving time multiplexed at 27 MHz, it is necessary only to ensure that the correct coefficients are provided at the right time. Figure 5.4 shows an example of part of a slow fade. As the co-sited samples CB, Y, and CR enter, all are multiplied by the same coefficient Kn, but the next sample will be luminance only, so this will be multiplied by Kn + 1. The next set of co-sited samples will be multiplied by Kn + 2 and so on. Clearly coefficients that change at 13.5 MHz must be provided. The sampling rate of the two inputs must be exactly the same, and in the same phase, or the circuit will not be able to add on a sample-by-sample basis. If the two inputs have come from different sources, they must be synchronised by the same master clock and/or time base correction must be provided on the inputs.

image

FIGURE 5.2

(a) Cross-fade requires two multipliers. (b) Reconfiguration requires only one multiplier.

Some thought must be given to the word length of the system. If a sample is attenuated, it will develop bits that are below the radix point. For example, if an 8-bit sample is attenuated by 24 dB, the sample value will be shifted four places down. Extra bits must be available within the mixer to accommodate this shift. Digital vision mixers may have an internal word length of 16 bits or more. When several attenuated sources are added together to produce the final mix, the result will be a 16-bit sample stream. As the output will generally need to be of the same format as the input, the word length must be shortened. Shortening the word length of samples effectively makes the quantizing intervals larger and can thus be called requantizing. This must be done very carefully to avoid artifacts and the necessary processes were shown in Chapter 4 under Requantizing and Digital Dither.

image

FIGURE 5.3

Offset binary colour difference values are converted to two's complement by reversing the state of the most significant bit. Two's complement values A and B will then add around blanking level.

image

FIGURE 5.4

When using one multiplier to fade both luminance and colour difference in a 27 MHz multiplex 4:2:2 system, one coefficient will be used three times on the co-sited samples, whereas the next coefficient will be used for only a single luminance sample.

image

FIGURE 5.5

To fade an offset binary signal, a correction term from a table can be added to remove the level shift caused by fading.

BLANKING

It is often necessary to blank the ends of active lines smoothly to prevent out-of-band signals being generated. This is usually the case when an effects machine has cropped the picture to fit inside a coloured border. The border will be generated by supplying constant luminance and colour difference values to the data stream. Blanking consists of sloping-off the active line by multiplying the sample values by successively smaller coefficients until blanking is reached. This is easy when the sample values have been normalized so that zero represents black, but when the usual offset of 16₁₀ is present, multiplication by descending coefficients will cause a black-level shift. The solution is to use a correction table, which can be seen in Figure 5.5. This is addressed by the multiplier coefficient and adds a suitable constant to the multiplier output. If the multiplier were to have a gain of one-half, this would shift the black level by eight quantizing intervals, and so the correction table would add eight to the output. When the multiplier has fully blanked, the output will be zero, and the correction table has to add 16₁₀ to the output.
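The content of such a correction table is simply the part of the offset that the multiplier has removed. A hedged sketch (Python, assuming the 8-bit offset of 16₁₀; the function name is illustrative):

```python
OFFSET = 16   # 8-bit CCIR-601 black level

def blank_sample(sample, k):
    """Fade one offset-binary luminance sample towards blanking.

    k is the blanking coefficient: 1.0 at full level, 0.0 when fully blanked.
    The correction term removes the black-level shift caused by multiplying
    the offset along with the picture content (Figure 5.5).
    """
    correction = round((1.0 - k) * OFFSET)   # value the correction table holds
    return round(sample * k) + correction

print(blank_sample(100, 1.0))   # untouched: 100
print(blank_sample(100, 0.5))   # gain one-half: a correction of 8 is added
print(blank_sample(100, 0.0))   # fully blanked: output is the offset, 16
```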

KEYING

Keying is the process in which one video signal can be cut into another to replace part of the picture with a different image. One application of keying is where a switcher can wipe from one input to another using one of a variety of different patterns. Figure 5.6 shows that an analog switcher performs such an effect by generating a binary switching waveform in a pattern generator. Video switching between inputs actually takes place during the active line. In most analog switchers, the switching waveform is digitally generated and then fed to a DAC, whereas in a digital switcher, the pattern generator outputs become the coefficients supplied to the cross-fader, which is sometimes referred to as a cutter. The switching edge must be positioned to an accuracy of a few nanoseconds, much less than the spacing of the pixels, otherwise slow wipes will not appear to move smoothly, and diagonal wipes will have stepped edges, a phenomenon known as ratcheting.

image

FIGURE 5.6

In a video switcher a pattern generator produces a switching waveform, which changes from line to line and from frame to frame to allow moving pattern wipes between sources.

Positioning the switch point to subpixel accuracy is not particularly difficult, as Figure 5.7 shows. A suitable series of coefficients can position the effective crossover point anywhere. The finite slope of the coefficients results in a brief cross-fade from one video signal to the other. This soft keying gives a much more realistic effect than binary switchers, which often give a “cut out with scissors” appearance. In some machines the slope of the cross-fade can be adjusted to achieve the desired degree of softness.

CHROMA KEYING

Another application of keying is to derive the switching signal by processing video from a camera in some way. By analysing colour difference signals, it is possible to determine where in a picture a particular colour occurs. When a key signal is generated in this way, the process is known as chroma keying, which is the electronic equivalent of matting in film.

image

FIGURE 5.7

Soft keying. See text for details.

In a 4:2:2 component system, it will be necessary to provide coefficients to the luminance cross-fader at 13.5 MHz. Chroma samples occur at only half this frequency, so it is necessary to provide a chroma interpolator artificially to raise the chroma sampling rate. For chroma keying a simple linear interpolator is perfectly adequate. Intermediate chroma samples are simply the average of two adjacent samples. Figure 5.8 shows how a multiplexed CR, CB signal can be averaged using a delay of two clocks.
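The averaging of Figure 5.8 can be sketched as follows (Python, with made-up sample values); each input sample is paired with the next sample of the same colour difference signal, two clocks later, and the output alternates between original and averaged values.

```python
def interpolate_chroma(multiplexed):
    """Raise the rate of a multiplexed CB/CR stream (CB, CR, CB, CR, ...)
    by linear interpolation. The two-sample delay pairs each value with the
    next sample of the same colour difference signal."""
    out = []
    for i in range(len(multiplexed) - 2):
        out.append(multiplexed[i])                              # sample value
        out.append((multiplexed[i] + multiplexed[i + 2]) // 2)  # averaged value
    return out

cb_cr = [100, 200, 110, 220, 120, 240]   # CB0, CR0, CB1, CR1, CB2, CR2
print(interpolate_chroma(cb_cr))
# [100, 105, 200, 210, 110, 115, 220, 230]
```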

image

FIGURE 5.8

Alternate CR and CB samples may be averaged by a two-sample delay and an adder. Output will then alternate between sample values and averaged sample values at 27 MHz. Demultiplexing the output will give two colour difference signals each at 13.5 MHz so that they can be used to produce coefficients at that rate to key luminance.

As with analog switchers, chroma keying is also possible with composite digital inputs, but decoding must take place before it is possible to obtain the key signals. The video signals that are being keyed may, however, remain in the composite digital format.

In switcher/keyers, it is necessary to obtain a switching signal, which ramps between two states from an input signal, which can be any allowable video waveform. Manual controls are provided so that the operator can set thresholds and gains to obtain the desired effect. In the analog domain, these controls distort the transfer function of a video amplifier so that it is no longer linear. A digital keyer will perform the same functions using logic circuits.

Figure 5.9a shows that the effect of a non-linear transfer function is to switch when the input signal passes through a particular level. The transfer function is implemented in a memory in digital systems. The incoming video sample value acts as the memory address, so that the selected memory location is proportional to the video level. At each memory location, the appropriate output level code is stored. If, for example, each memory location stored its own address, the output would equal the input, and the device would be transparent. In practice, switching is obtained by distorting the transfer function to obtain more gain in one particular range of input levels at the expense of less gain at other input levels. With the transfer function shown in Figure 5.9b, an input level change from a to b causes a smaller output change, whereas the same level change between c and d causes a considerable output change. If the memory is RAM, different transfer functions can be loaded in by the control system, and this requires multiplexers in both data and address lines as shown in Figure 5.9c. In practice such a RAM will be installed in Y, CR, and CB channels, and the results will be combined to obtain the final switching coefficients.
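The RAM contents can be sketched as a 256-entry table shaped by threshold and gain controls. The law used below is purely illustrative (Python; the parameter names are assumptions):

```python
def build_key_lut(threshold, gain):
    """Build a 256-entry transfer function of the kind held in the keyer RAM.

    Levels around `threshold` are stretched by `gain` so that a small input
    change there produces a large output change; levels well away from the
    threshold are compressed towards 0 or 255.
    """
    lut = []
    for level in range(256):
        out = (level - threshold) * gain + 128
        lut.append(max(0, min(255, int(out))))
    return lut

lut = build_key_lut(threshold=100, gain=8)
video_samples = [40, 90, 100, 110, 160]
print([lut[s] for s in video_samples])    # video level addresses the RAM
# [0, 48, 128, 208, 255]
```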

image

image

FIGURE 5.9

(a) A non-linear transfer function can be used to produce a keying signal. (b) The non-linear transfer function emphasizes contrast in part of the range but reduces it at other parts. (c) If a RAM is used as a flexible transfer function, it will be necessary to provide multiplexers so that the RAM can be preset with the desired values from the control system.

SIMPLE EFFECTS

If a RAM of the type shown in Figure 5.9 is inserted in a digital luminance path, the result will be solarizing, which is a form of contrast enhancement. Figure 5.10 shows that a family of transfer functions that control the degree of contrast enhancement can be implemented. When the transfer function becomes so distorted that the slope reverses, the result is luminance reversal, in which black and white are effectively interchanged. Solarizing can also be implemented in colour difference channels to obtain chroma solarizing. In effects machines, the degree of solarizing may need to change smoothly so that the effect can be gradually introduced. In this case the various transfer functions will be kept in different pages of a table, so that the degree of solarization can be selected immediately by changing the page address of the table. One page will have a straight transfer function, so the effect can be turned off by selecting that page.

In the digital domain it is easy to introduce various forms of quantizing distortion to obtain special effects. Figure 5.11 shows that eight-bit luminance allows 256 different brightnesses, which to the naked eye appears to be a continuous range. If some of the low-order bits of the samples are disabled, then a smaller number of brightness values describes the range from black to white. For example, if six bits are disabled, only two bits remain, and so only four possible brightness levels can be output. This gives an effect known as contouring because the visual effect somewhat resembles a relief map.
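The effect is no more than masking off the unwanted bits, as this small sketch shows (Python; illustrative only):

```python
def contour(samples, bits_kept=2):
    """Disable low-order bits of 8-bit luminance to force contouring."""
    mask = 0xFF & ~((1 << (8 - bits_kept)) - 1)   # e.g. 0xC0 when 2 bits remain
    return [s & mask for s in samples]

ramp = list(range(0, 256, 32))
print(contour(ramp))   # only four possible output levels: 0, 64, 128, 192
```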

When the same process is performed with colour difference signals, the result is to limit the number of possible colours in the picture, which gives an effect known as posterizing, because the picture appears to have been coloured by paint from pots. Solarizing, contouring, and posterizing cannot be performed in the composite digital domain, due to the presence of the subcarrier in the sample values.

Figure 5.12 shows a latch in the luminance data that is being clocked at the sampling rate. It is transparent to the signal, but if the clock to the latch is divided down by some factor n, the result will be that the same sample value will be held on the output for n clock periods, giving the video waveform a staircase characteristic. This is the horizontal component of the effect known as mosaicing. The vertical component is obtained by feeding the output of the latch into a line memory, which stores one horizontally mosaiced line and then repeats that line m times. As n and m can be independently controlled, the mosaic tiles can be made to be of any size and rectangular or square at will. Clearly the mosaic circuitry must be implemented simultaneously in luminance and colour difference signal paths. It is not possible to perform mosaicing on a composite digital signal, because it will destroy the chroma. It is common to provide a bypass route, which allows mosaiced and unmosaiced video to be simultaneously available. Dynamic switching between the two sources controlled by a separate key signal then allows mosaicing to be restricted to certain parts of the picture.
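A software model of the latch and line memory might look like this (Python; the frame is represented as a list of lines, with illustrative values):

```python
def mosaic(frame, n, m):
    """Hold each sample for n pixels and each mosaiced line for m lines.

    The n-pixel hold models the divided latch clock; the m-line repeat models
    the one-line memory being read back on the remaining (m - 1) lines.
    """
    out = []
    for row in range(len(frame)):
        held_line = frame[(row // m) * m]               # line repeated m times
        out.append([held_line[(col // n) * n]           # sample held n clocks
                    for col in range(len(held_line))])
    return out

frame = [[c + 10 * r for c in range(8)] for r in range(4)]
for line in mosaic(frame, n=4, m=2):
    print(line)
```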

image

image

FIGURE 5.10

Solarization. (a) The non-linear transfer function emphasizes contrast in part of the range but reduces it in other parts. (b) The desired transfer function is implemented in a table. Each input sample value is used as the address to select a corresponding output value stored in the table. (c) A family of transfer functions can be accommodated in a larger table. Page select affects the high-order address bits of the table. (d) Transfer function for luminance reversal.

image

image

FIGURE 5.11

(a) In contouring, the least significant bits of the luminance samples are discarded, which reduces the number of possible output levels. (b) At left, the 8-bit colour difference signals allow 2¹⁶ different colours. At right, eliminating all but 2 bits of each colour difference signal allows only 2⁴ different colours.

image

FIGURE 5.12

(a) Simplified diagram of mosaicing system. At the left-hand side, horizontal mosaicing is done by intercepting sample clocks. On one line in m, the horizontally mosaiced line becomes the output and is simultaneously written into a one-line memory. On the remaining (m − 1) lines the memory is read to produce several identical successive lines to give the vertical dimensions of the tile. (b) In mosaicing, input samples are neglected, and the output is held constant by failing to clock a latch in the data stream for several sample periods. Heavy vertical lines here correspond to the clock signal occurring. Heavy horizontal line is the resultant waveform.

MAPPING

The principle of all video manipulators is the same as the technique used by cartographers for centuries. Cartographers are faced with a continual problem in that the earth is round and paper is flat. To produce flat maps, it is necessary to project the features of the round original onto a flat surface. Figure 5.13 shows an example of this. There are a number of different ways of projecting maps, and all of them must, by definition, produce distortion. The effect of this distortion is that distances measured near the extremities of the map appear greater than they actually are. Another effect is that great circle routes (the shortest path between two places on the surface of a planet) may appear curved on a projected map. The type of projection used is usually printed somewhere on the map, a very common system being that due to Mercator. Clearly the process of mapping involves some three-dimensional geometry to simulate the paths of light rays from the map so that they appear to have come from the curved surface. Video effects machines work in exactly the same way.

The distortion of maps means that things are not where they seem. In time-sharing computers, every user appears to have his own identical address space in which his program resides, despite the fact that many different programs are simultaneously in the memory. To resolve this contradiction, memory management units are constructed that add a constant value to the address the user thinks he has (the virtual address) in order to produce the physical address. As long as the unit gives each user a different constant, they can all program in the same virtual address space without one corrupting another's programs. Because the program is no longer where it seems to be, the term mapping was introduced. The address space of a computer is one dimensional, but a video frame expressed as rows and columns of pixels can be considered to have a two-dimensional address, as in Figure 5.14. Video manipulators work by mapping the pixel addresses in two dimensions.

image

FIGURE 5.13

Map projection is a close relative of video effects units, which manipulate the shape of pictures.

image

FIGURE 5.14

The entire TV picture can be broken down into uniquely addressable pixels.

PLANAR DIGITAL VIDEO EFFECTS

One can scarcely watch television nowadays without becoming aware of picture manipulation. Flips, tumbles, spins, and page-turn effects; perspective rotation; and rolling the picture onto the surface of a solid are all commonly seen. In all but the last mentioned, the picture remains flat, hence the title of this section. Non-planar manipulation requires further complexity, which will be treated in due course.

Effects machines that manipulate video pictures are close relatives of the machines that produce computer-generated images.¹ Such images require an enormous number of computations per frame, yet an effects machine must produce them in real time, unlike computer rendering, which may work offline.

ADDRESS GENERATION AND INTERPOLATION

There are many different manipulations possible, and the approach here will be to begin with the simplest, which require the least processing, and to graduate to the most complex, introducing the necessary processes at each stage.

It has been stated that address mapping is used to perform transforms. Because rows and columns are processed individually, the mapping process is much easier to understand. Figure 5.15 shows a single row of pixels, which are held in a buffer in which each pixel can be addressed individually and transferred to another location. If a constant is added to the read address, the selected pixel will be to the right of the place where it will be put. This has the effect of moving the picture to the left. If the buffer represented a column of pixels, the picture would be moved vertically. As these two transforms can be controlled independently, the picture could be moved diagonally.

image

image

FIGURE 5.15

Address generation is the fundamental process behind transforms.

If the read address is multiplied by a constant, say two, the effect is to bring samples from the input closer together on the output, so that the picture size is reduced. Again independent control of the horizontal and vertical transforms is possible, so that the aspect ratio of the picture can be modified. This is very useful for telecine work when CinemaScope films are to be broadcast. Clearly the secret of these manipulations is in the constants fed to the address generators. The added constant represents displacement, and the multiplied constant represents magnification. A multiplier constant of less than 1 will result in the picture getting larger. Figure 5.15 also shows, however, that there is a problem. If a constant of 0.5 is used, to make the picture twice as big, half of the addresses generated are not integers. A memory does not understand an address of 2.5! If an arbitrary magnification is used, nearly all the addresses generated are noninteger. A similar problem crops up if a constant of less than one is added to the address in an attempt to move the picture less than the pixel spacing. The solution to the problem is interpolation. Because the input image is spatially sampled, those samples contain enough information to represent the brightness and colour all over the screen. When the address generator comes up with an address of 2.5, it actually means that what is wanted is the value of the signal interpolated halfway between pixel 2 and pixel 3. The output of the address generator will thus be split into two parts. The integer part will become the memory address, and the fractional part is the phase of the necessary interpolation. To interpolate pixel values a digital filter is necessary.
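The address generator can be sketched in a few lines (Python; names and values are illustrative). The integer part of each address goes to the memory and the fractional part becomes the interpolation phase:

```python
def source_addresses(num_output_pixels, magnification, displacement=0.0):
    """Generate source addresses for one row of output pixels.

    Multiplying by 1/magnification enlarges or reduces the picture; adding
    `displacement` moves it. The integer part addresses the pixel memory,
    the fractional part selects the interpolation phase.
    """
    step = 1.0 / magnification
    for x in range(num_output_pixels):
        address = x * step + displacement
        yield int(address), address - int(address)   # (memory address, phase)

# magnification of 2 (picture twice as big): half the addresses are non-integer
print(list(source_addresses(6, magnification=2.0)))
# [(0, 0.0), (0, 0.5), (1, 0.0), (1, 0.5), (2, 0.0), (2, 0.5)]
```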

Figure 5.16 shows that the input and output of an effects machine must be at standard sampling rates to allow digital interchange with other equipment. When the size of a picture is changed, this causes the pixels in the picture to fail to register with output pixel spacing. The problem is exactly the same as sampling-rate conversion, which produces a differently spaced set of samples that still represent the original waveform. One pixel value actually represents the peak brightness of a two-dimensional intensity function, which is the effect of the modulation transfer function of the system on an infinitely small point. As each dimension can be treated separately, the equivalent in one axis is that the pixel value represents the peak value of an infinitely short impulse that has been low-pass filtered to the system bandwidth. The waveform is that of a sin x/x curve, which has value everywhere except at the centre of other pixels. To compute an interpolated value, it is necessary to add together the contribution from all relevant samples, at the point of interest. Each contribution can be obtained by looking up the value of a unity sin x/x curve at the distance from the input pixel to the output pixel to obtain a coefficient and multiplying the input pixel value by that coefficient. The process of taking several pixel values, multiplying each by a different coefficient, and summing the products can be performed by the FIR (finite-impulse response) configuration described earlier. The impulse response of the filter necessary depends on the magnification. When the picture is being enlarged, the impulse response can be the same as at normal size, but as the size is reduced, the impulse response has to become broader (corresponding to a reduced spatial frequency response) so that more input samples are averaged together to prevent aliasing. The coefficient store will need a two-dimensional structure, such that the magnification and the interpolation phase must both be supplied to obtain a set of coefficients. The magnification can easily be obtained by comparing successive outputs from the address generator.
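A much-simplified interpolator is sketched below (Python). A real unit would take windowed coefficients from a two-dimensional store addressed by phase and magnification; here they are computed directly from an un-windowed sin x/x curve and edge handling is crude, so the fragment shows the principle rather than a production filter.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def interpolate(pixels, addr, phase, magnification, taps=8):
    """Compute one output pixel by FIR interpolation at position addr + phase.

    When the picture is reduced (magnification < 1) the impulse response is
    broadened so that more input samples are averaged, preventing aliasing.
    """
    cutoff = min(1.0, magnification)          # broaden response when shrinking
    acc = coeff_sum = 0.0
    for k in range(-(taps // 2) + 1, taps // 2 + 1):
        distance = k - phase                  # input pixel to output point
        coeff = cutoff * sinc(cutoff * distance)
        sample = pixels[min(max(addr + k, 0), len(pixels) - 1)]
        acc += coeff * sample
        coeff_sum += coeff
    return acc / coeff_sum                    # normalise the coefficient set

row = [16, 16, 16, 235, 235, 235, 16, 16]
print(round(interpolate(row, addr=2, phase=0.5, magnification=2.0), 1))
```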

image

FIGURE 5.16

It is easy, almost trivial, to reduce the size of a picture by pushing the samples closer together, but this is not often of use, because it changes the sampling rate in proportion to the compression. When a standard sampling-rate output is needed, interpolation must be used.

As was seen in Chapter 3, the number of points in the filter is a compromise between cost and performance, 8 being a typical number for high quality. As there are two transform processes in series, every output pixel will be the result of 16 multiplications, so there will be 216 million multiplications per second taking place in the luminance channel alone for a 13.5 MHz sampling rate unit. The quality of the output video also depends on the number of different interpolation phases available between pixels. The address generator may compute fractional addresses to any accuracy, but these will be rounded off to the nearest available phase in the digital filter. The effect is that the output pixel value provided is actually the value a tiny distance away and has the same result as sampling clock jitter, which is to produce program-modulated noise. The greater the number of phases provided, the larger will be the size of the coefficient store needed. As the coefficient store is two-dimensional, an increase in the number of filter points and phases causes an exponential growth in size and cost. The filter itself can be implemented readily with fast multiplier chips, but one problem is accessing the memory to provide input samples. What the memory must do is take the integer part of the address generator output and provide simultaneously as many adjacent pixels as there are points in the filter. This problem may be solved by making the memory from several smaller memories with an interleaved address structure, so that several pixel values can be provided simultaneously.

SKEW AND ROTATION

It has been seen that adding a constant to the source address produces a displacement. It is not necessary for the displacement constant to be the same throughout the frame. If the horizontal transform is considered, as in Figure 5.17a, the effect of making the displacement a function of line address is to cause a skew. Essentially each line is displaced by a different amount. The necessary function generator is shown in simplified form in Figure 5.17b, although it could equally be realized in a fast CPU with appropriate software.

It will be seen that the address generator is really two accumulators in series, of which the first operates once per line to calculate a new offset, which grows linearly from line to line, and the second operates at pixel rate to calculate source addresses from the required magnification. The initial state of the second accumulator is the offset from the first accumulator.
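The two accumulators can be modelled directly (Python; parameter names are illustrative):

```python
def skew_addresses(lines, pixels_per_line, magnification, skew_per_line):
    """Slow and fast accumulators in series, as in Figure 5.17b.

    The slow accumulator adds `skew_per_line` once per line to form the offset;
    the fast accumulator starts from that offset and adds 1/magnification once
    per pixel to form the source addresses for the line.
    """
    offset = 0.0                                  # slow accumulator
    for _ in range(lines):
        address, step = offset, 1.0 / magnification
        row = []
        for _ in range(pixels_per_line):          # fast accumulator
            row.append(address)
            address += step
        yield row
        offset += skew_per_line                   # grows linearly line by line

for row in skew_addresses(lines=3, pixels_per_line=4,
                          magnification=1.0, skew_per_line=0.5):
    print(row)                                    # each line displaced further
```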

If two skews, one vertical and one horizontal, are performed in turn on the same frame, the result is a rotation as shown in Figure 5.17c. Clearly the skew angle parameters for the two transforms must be in the correct relationship to obtain pure rotation. Additionally the magnification needs to be modified by a cosine function of the rotation angle to counteract the stretching effect of the skews.

In the horizontal process, the offset will change once per line, whereas in the vertical process, the offset will change once per column. For simplicity, the offset generators are referred to as the slow address generators, whereas the accumulators, which operate at pixel rate, are called the fast address generators.

image

FIGURE 5.17

(a) Skew is achieved by subjecting each line of pixels to a different offset. (b) The hardware necessary to perform a skew in which the left-hand accumulator produces the offset, which increases every line, and the right-hand accumulator adds it to the address.

Unfortunately skew rotations cannot approach 90°, because the skew parameter goes to infinity, and so a skew rotate is generally restricted to rotations of 45°. This is not a real restriction, because the apparatus already exists to turn a picture on its side. This can be done readily by failing to transpose from rows to columns at some stage so the picture will be turned through 90°.

Figure 5.17d shows how continuous rotation can be obtained. From −45° to +45°, normal skew rotation is used. At 45° during the vertical interval, the memory transpose is turned off, causing the picture to be flipped 90° and laterally inverted. Reversing the source address sequence cancels the lateral inversion, and at the same time the skew parameters are changed from +45° to −45°. In this way the picture passes smoothly through the 45° barrier, and skew parameters continue to change until 135° (90° transpose +45° skew) is reached. At this point, three things happen, again during the vertical interval. The transpose is switched back on, re-orienting the picture; the source addresses are both reversed, which turns the picture upside down; and a skew rotate of 45° is applied, returning the picture to 135° of rotation, from which point motion can continue. The remainder of the rotation takes place along similar lines, which can be followed in the diagram.

image

FIGURE 5.17

(c) A z-axis rotate is performed using a pair of skews in succession. The magnification of each transform must also change from unity to cosθ because horizontal and vertical components of distances on the frame reduce as the frame turns. (d) The four modes necessary for a complete z-axis rotation using skews. Switching between modes at the vertical interval allows a skew range of 45° (outer ring) to embrace a complete revolution in conjunction with memory transposes, which exchange rows and columns to give 90° changes.

The rotation described is in the z axis, i.e., the axis coming out of the source picture at right angles. Rotation about the other axes is rather more difficult, because to perform the effect properly, perspective is needed. In simple machines, there is no perspective, and the effect of rotation is as if viewed from a long way away. These nonperspective pseudo-rotations are achieved by simply changing the magnification in the appropriate axis as a cosine function of the rotation angle.

PERSPECTIVE ROTATION

To follow the operation of a true perspective machine, some knowledge of perspective is necessary. Stated briefly, the phenomenon of perspective is due to the angle subtended to the eye by objects being a function not only of their size but also of their distance. Figure 5.18 shows that the size of an image on the rear wall of a pinhole camera can be increased by either making the object larger or bringing it closer. In the absence of stereoscopic vision, it is not possible to tell which has happened. The pinhole camera is very useful for the study of perspective and has indeed been used by artists for that purpose. The clinically precise perspective of Canaletto paintings was achieved through the use of the camera obscura (Latin for “darkened chamber”).³

It is sometimes claimed that the focal length of the lens used on a camera changes the perspective of a picture. This is not true; perspective is only a function of the relative positions of the camera and the subject. Fitting a wide-angle lens simply allows the camera to come near enough to keep dramatic perspective within the frame, whereas fitting a long-focus lens allows the camera to be far enough away to display a reasonably sized image with flat perspective.⁴

image

FIGURE 5.18

The image on the rear of the pinhole camera is identical for the two solid objects shown because the size of the object is proportional to distance, and the subtended angle remains the same. The image can be made larger (dotted) by making the object larger or moving it closer.

Because a single eye cannot tell distance unaided, all current effects machines work by simply producing the correct subtended angles, which the brain perceives as a three-dimensional effect. Figure 5.19 shows that to a single eye, there is no difference between a three-dimensional scene and a two-dimensional image formed where rays traced from features to the eye intersect an imaginary plane. This is exactly the reverse of the map projection shown in Figure 5.13 and is the principle of all perspective manipulators.

The case of perspective rotation of a plane source will be discussed first. Figure 5.19 shows that, when a plane input frame is rotated about a horizontal axis, the distance from the top of the picture to the eye is no longer the same as the distance from the bottom of the picture to the eye. The result is that the top and bottom edges of the picture subtend different angles to the eye, and where the rays cross the target plane, the image has become trapezoidal. There is now no such thing as the magnification of the picture. The magnification changes continuously from top to bottom of the picture, and if a uniform grid is input, after a perspective rotation it will appear non-linear as the diagram shows.

Early DVEs (digital video effects generators) performed perspective transforms by separate but co-operative processes in two orthogonal axes, whereas with greater computing power, it is possible to perform the manipulation in one stage using two-dimensional addressing and interpolation. Clearly a significant part of the process must be the calculation of the addresses, magnifications, and interpolation phases.

The address generators for perspective operation are necessarily complex, and a careful approach is necessary to produce the complex calculations at the necessary speed.⁵ Figure 5.20 shows a section through a transform in which the source plane has been rotated about an axis perpendicular to the page. The mapping or ray-tracing process must produce a straight line from every target pixel (corresponding to where an output value is needed) to locate a source address (corresponding to where an input value is available). Moving the source value to the target performs the necessary transform.

image

FIGURE 5.19

In a planar rotation effect the source plane ABCD is the rectangular input picture. If it is rotated through the angle shown, ray tracing to a single eye at left will produce a trapezoidal image A' B' C' D' on the target. Magnification will now vary with position on the picture.

image

FIGURE 5.20

A rotation of the source plane along with a movement away from the observer is shown here. The system has to produce pixel values at the spacing demanded by the output. Thus a ray from the eye to each target pixel is produced to locate a source pixel. Because the chances of exactly hitting a source pixel are small, the need for interpolation is clear. If the source plane is missed, this will result in an out-of-range source address, and a background value will be substituted.

The derivation of the address generator equations involves highly complex mathematics, which can be found in Newman and Sproull² and in the ADO patent.⁵ For the purposes of this chapter, an understanding of the result can be had by considering the example of Figure 5.21. In this figure, successive integer values of x have been used in a simple source address calculation equation, which contains a division stage. The addresses produced form a non-linear sequence and will be seen to lie on a rotated source plane.

All perspective machines must work with dynamic changes in magnification throughout the frame. The situation often arises in which at one end of a pixel row the magnification is greater than unity, and the FIR filter has to interpolate between available pixels, whereas at the other end of the row the magnification will be less than unity and the FIR filter has to adopt a low-pass and decimate mode to eliminate excessive pixels without aliasing. The characteristics of the filter are changed at will by selecting different coefficient sets from pages of memory according to the instantaneous magnification at the centre of the filter window. The magnification can be determined by computing the address slope. This is done by digitally differentiating the output of the address generator, which is to say that the difference between one source address and the next is computed. This produces the address slope, which is inversely proportional to the magnification and can be used to select the appropriate impulse response width in the interpolator. The interpolator output is then a transformed image.
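The following sketch (Python) uses an invented source address equation of the general form of Figure 5.21, containing a division stage, and differences successive addresses to obtain the slope; the constants are hypothetical and chosen only so that the magnification passes through unity along the row.

```python
def perspective_row(targets, scale=1.3, tilt=0.05, focal=1.0):
    """Illustrative source address calculation containing a division stage.

    Differencing successive addresses gives the address slope, whose
    reciprocal is the local magnification used to select the interpolator's
    impulse response.
    """
    sources = [scale * x / (focal + tilt * x) for x in targets]
    slopes = [b - a for a, b in zip(sources, sources[1:])]
    magnifications = [1.0 / s for s in slopes]
    return sources, magnifications

src, mag = perspective_row(range(11))
print([round(s, 2) for s in src])   # non-linear sequence of source addresses
print([round(m, 2) for m in mag])   # magnification below unity at one end,
                                    # above unity at the other
```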

image

FIGURE 5.21

(a) The equation at the top calculates the source address for each evenly spaced target address from 0 to 10. All numbers are kept positive for simplicity, so only one side of the picture is represented here. (b) A ray-tracing diagram corresponding to the calculations of (a). Following a ray from the virtual eye through any target pixel address will locate the source addresses calculated.

NON-PLANAR EFFECTS

The basic approach to perspective rotation of plane pictures has been described, and this can be extended to embrace transforms that make the source picture appear non-planar. Effects in this category include rolling the picture onto the surface of an imaginary solid such as a cylinder or a cone. Figure 5.22 shows that the ray-tracing principle is still used, but that the relationship between source and target addresses has become much more complex. The problem is that when a source picture can be curved, it may be put in such an attitude that one part of the source can be seen through another part. This results in two difficulties. First, the source address function needs to be of higher order, and second, the target needs to be able to accept and accumulate pixel data from two different source addresses, with weighting given to the one nearer the viewer according to the transparency allocated to the picture.

image

FIGURE 5.22

(a) To produce a rolled-up image for a given target pixel address C, there will be two source addresses A and B. Pixel data from A and B must be added with weighting dependent on the transparency of the nearer pixel to produce the pixel value to be put on the target plane at C. (b) Transfer function for a rolling-up transform. There are two source addresses for every target address; hence the need for target accumulation.

CONTROLLING EFFECTS

The basic mechanism of the transform process has been described, but this is only half of the story, because these transforms have to be controlled. There is a lot of complex geometrical calculation necessary to perform even the simplest effect, and the operator cannot be expected to calculate directly the parameters required for the transforms. All effects machines require a computer of some kind, with which the operator communicates using keyboard entry or joystick/trackball movements at high level. These high-level commands will specify such things as the position of the axis of rotation of the picture relative to the viewer, the position of the axis of rotation relative to the source picture, and the angle of rotation in the three axes.

An essential feature of this kind of effects machine is fluid movement of the source picture as the effect proceeds. If the source picture is to be made to move smoothly, then clearly the transform parameters will be different in each field. The operator cannot be expected to input the source position for every field, because this would be an enormous task. Additionally, storing the effect would require a lot of space. The solution is for the operator to specify the picture position at strategic points during the effect, and then digital filters are used to compute the intermediate positions so that every field will have different parameters.

The specified positions are referred to as knots, nodes, or keyframes, the first being the computer graphics term. The operator is free to enter knots anywhere in the effect, and so they will not necessarily be evenly spaced in time, i.e., there may well be different numbers of fields between each knot. In this environment it is not possible to use conventional FIR-type digital filtering, because a fixed impulse response is inappropriate for irregularly spaced samples.

Interpolation of various orders is used, ranging from zero-order hold for special jerky effects through linear interpolation to cubic interpolation for very smooth motion. The algorithms used to perform the interpolation are known as splines, a term that has come down from shipbuilding via computer graphics.⁶ When a ship is designed, the draughtsman produces hull cross sections at intervals along the keel, whereas the shipyard needs to re-create a continuous structure. The solution is a lead-filled bar, known as a spline, which can be formed to join up each cross section in a smooth curve and then used as a template to form the hull plating.
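One possible spline law is a cubic with finite-difference tangents, sketched below for irregularly spaced knots (Python); it is one of many such algorithms, not that of any particular machine. With the knots chosen here the computed track overshoots the value held at the second knot, which is exactly the unintentional overshoot discussed next.

```python
def spline_track(knots, field):
    """Interpolate a parameter between irregularly spaced (field, value) knots
    using a cubic with finite-difference (Catmull-Rom-like) tangents."""
    t = [k[0] for k in knots]
    p = [k[1] for k in knots]
    m = [(p[min(i + 1, len(p) - 1)] - p[max(i - 1, 0)]) /
         (t[min(i + 1, len(t) - 1)] - t[max(i - 1, 0)]) for i in range(len(p))]
    i = max(j for j in range(len(t) - 1) if t[j] <= field)  # containing interval
    h = t[i + 1] - t[i]
    s = (field - t[i]) / h
    return ((2*s**3 - 3*s**2 + 1) * p[i] + (s**3 - 2*s**2 + s) * h * m[i]
            + (-2*s**3 + 3*s**2) * p[i + 1] + (s**3 - s**2) * h * m[i + 1])

knots = [(0, 0.0), (20, 100.0), (50, 100.0)]   # picture position keyframes
print([round(spline_track(knots, f), 1) for f in range(0, 51, 10)])
# [0.0, 57.5, 100.0, 108.9, 104.4, 100.0]  -- note the overshoot past 100
```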

The filter that does not ring cannot be made, and so the use of spline algorithms for smooth motion sometimes results in unintentional overshoots of the picture position. This can be overcome by modifying the filtering algorithm. Spline algorithms usually look ahead beyond the next knot to compute the degree of curvature in the graph of the parameter against time. If a break is put in that parameter at a given knot, the spline algorithm is prevented from looking ahead, and no overshoot will occur. In practice the effect is created and run without breaks, and then breaks are added later where they are subjectively thought necessary.

It will be seen that there are several levels of control in an effects machine. At the highest level, the operator can create, store, and edit knots and specify the times that elapse between them. The next level is for the knots to be interpolated by spline algorithms to produce parameters for every field in the effect. The field frequency parameters are then used as the inputs to the geometrical computation of transform parameters, which the lowest level of the machine will use as microinstructions to act upon the pixel data. Each of these layers will often have a separate processor, not just for speed, but also to allow software to be updated at certain levels without disturbing others.

GRAPHICS

Although there is no easy definition of a video graphics system that distinguishes it from a graphic art system, for the purposes of discussion it can be said that graphics consists of generating alphanumerics on the screen, whereas graphic art is concerned with generating more general images. The simplest form of screen presentation of alphanumerics is the visual display unit (VDU), which was used to control early computer-based systems. The mechanism used for character generation in such devices is very simple and thus makes a good introduction to the subject.

image

FIGURE 5.23

Elementary character generation. (a) White on black waveform for two raster lines passing through letter A. (b) Black on white is simple inversion. (c) Reverse video highlight waveforms.

In VDUs, there is no grey scale, and the characters are formed by changing the video signal between two levels at the appropriate place in the line. Figure 5.23 shows how a character is built up in this way and also illustrates how easy it is to obtain the reversed video used in some word processor displays to simulate dark characters on white paper. Also shown is the method of highlighting single characters or words by using localized reverse video.

Figure 5.24 shows a representative character generator, as might be used in a VDU. The characters to be displayed are stored as ASCII symbols in a RAM, which has one location for each character position on each available text line on the screen. Each character must be used to generate a series of dots on the screen, which will extend over several lines. Typically the characters are formed by an array 5 dots by 9. To convert from the ASCII code to a dot pattern, a table is programmed with a conversion. This will be addressed by the ASCII character and the column and row addresses in the character array, and will output a high or low (bright or dark) level.

As the VDU screen is a raster-scanned device, the display scan will begin at the left-hand end of the top line. The first character in the ASCII RAM will be selected, and this and the first row and column addresses will be sent to the character generator, which outputs the video level for the first pixel. The next column address will then be selected, and the next pixel will be output. As the scan proceeds, it will pass from the top line of the first character to the top line of the second character, so that the ASCII RAM address will need to be incremented. This process continues until the whole video line is completed. The next line on the screen is generated by repeating the selection of characters from the ASCII RAM, but using the second array line as the address to the character generator. This process will repeat until all the video lines needed to form one row of characters are complete. The next row of characters in the ASCII RAM can then be accessed to create the next line of text on the screen and so on.
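The scanning sequence can be modelled with a toy font (Python). A real VDU holds the text as ASCII codes in RAM and the dot patterns in a character generator ROM; the 5 × 7 patterns and the '#'/'.' output below are purely illustrative.

```python
FONT = {                      # dot patterns, one string per character row
    "A": ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
    "B": ["11110", "10001", "11110", "10001", "10001", "10001", "11110"],
}

def render_text_row(text, char_rows=7, char_cols=5):
    """Scan out a row of characters as the raster would, one video line at a
    time, reading the same text repeatedly with successive array row numbers."""
    for array_row in range(char_rows):              # video lines in the row
        line = ""
        for ch in text:                             # characters along the line
            for array_col in range(char_cols):      # dots within the character
                line += "#" if FONT[ch][array_row][array_col] == "1" else "."
            line += "."                             # inter-character space
        print(line)

render_text_row("AB")
```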

image

FIGURE 5.24

Simple character generator produces characters as rows and columns of pixels. See text for details.

image

FIGURE 5.25

Font characters store only the shape of the character. This can be used to key any coloured character into a background.

High-quality broadcast graphics requires further complexity. The characters will be needed in colour and in varying sizes. Different fonts will be necessary, and additional features such as solid lines around characters and drop shadows are desirable.

To generate a character in a broadcast machine, a font and the character within that font are selected. The characters are actually stored as key signals, because the only difference between one character and another in the same font is the shape. A character is generated by specifying a constant background colour and luminance, and a constant character colour and luminance, and by using the key signal to cut a hole in the background and insert the character colour. This is illustrated in Figure 5.25. The problem of stepped diagonal edges is overcome by giving the key signal a grey scale. The grey scale eliminates the quantizing distortion responsible for the stepped edges. The edge of the character now takes the form of a ramp, which has the desirable characteristic of limiting the bandwidth of the character generator output. Early character generators were notorious for producing out-of-band frequencies, which drove equipment further down the line to distraction and in some cases would interfere with the sound channel on being broadcast. Figure 5.26 illustrates how, in a system with grey scale and sloped edges, the edge of a character can be positioned to sub-pixel resolution, which completely removes the stepped effect on diagonals.

In a powerful system, the number of fonts available will be large, and all the necessary characters will be stored on disk drives. Some systems allow users to enter their own fonts using a rostrum camera. A frame grab is performed, but the system can be told to file the image as a font character key signal rather than as a still frame. This approach allows infinite flexibility if it is desired to work in Kanji or Cyrillic and allows European graphics to be done with all necessary umlauts, tildes, and cedillas.

image

FIGURE 5.26

When a character has a ramped edge, the edge position can be moved in subpixel steps by changing the pixel values in the ramp.

To create a character string on the screen, it is necessary to produce a key signal that has been assembled from all the individual character keys. The keys are usually stored in a large format to give highest quality, and it will be necessary to reduce the size of the characters to fit the available screen area. The size reduction of a key signal in the digital domain is exactly the same as the zoom function of an effects machine, requiring FIR filtering and interpolation, but again, it is not necessary for it to be done in real time, and so less hardware can be used. The key source for the generation of the final video output is a RAM that has one location for every screen pixel. Position of the characters on the screen is controlled by changing the addresses in the key RAM into which the size-reduced character keys are written.

The keying system necessary is shown in Figure 5.27. The character colour and the background colour are produced by latches on the control system bus, which output continuous digital parameters. The grey-scale key signal obtained by scanning the key memory is used to provide coefficients for the digital cross-fader, which cuts between background and character colour to assemble the video signal in real time.
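The cross-fader stage amounts to the following (Python; colour values are illustrative): the key value scanned from the key RAM is the coefficient that mixes the two constant colours.

```python
def key_line(key_row, char_colour, background_colour):
    """Assemble one line of output from a grey-scale key signal (0-255)."""
    out = []
    for k in key_row:
        coeff = k / 255.0                        # key value as fade coefficient
        out.append(round(background_colour +
                         coeff * (char_colour - background_colour)))
    return out

# a ramped key edge entering a character gives a soft, sub-pixel positioned edge
print(key_line([0, 0, 64, 192, 255, 255], char_colour=235, background_colour=16))
# [16, 16, 71, 181, 235, 235]
```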

If characters with contrasting edges are required, an extra stage of keying can be used. The steps described above take place, but the background colour is replaced by the desired character edge colour. The size of each character key is then increased slightly, and the new key signal is used to cut the characters and a contrasting border into the final background.

image

FIGURE 5.27

Simple character generator using keying. See text for details.

Early character generators were based on a frame store, which refreshes the dynamic output video. Recent devices abandon the frame-store approach in favour of real-time synthesis. The symbols that make up a word can move on and turn with respect to the plane in which they reside as a function of time in any way, individually or together. Text can also be mapped onto an arbitrarily shaped line. The angle of the characters can follow a tangent to the line or can remain at a fixed angle regardless of the line angle.

By controlling the size of planes, characters or words can appear to zoom into view from a distance and recede again. Rotation of the character planes off the plane of the screen allows the perspective effects to be seen. Rotating a plane back about a horizontal axis by 90° will reduce it to an edge-on line, but lowering the plane to the bottom of the screen allows the top surface to be seen, receding into the distance like a road. Characters or text strings can then roll off into the distance, getting smaller as they go. In fact the planes do not rotate, but a perspective transform is performed on them.

CONVERTING BETWEEN COMPUTER AND VIDEO FORMATS

Computer terminals have evolved quite rapidly from devices that could display only a few lines of text in monochrome into high-resolution colour graphics displays that outperform conventional television. The domains of computer graphics and television have in common only that images are represented. The degree of incompatibility is such that one could be forgiven for thinking that it was the outcome of a perverse competition. Nevertheless with sufficient care good results can be obtained. Figure 5.28 shows that the number of issues involved is quite large. If only one of these is not correctly addressed, the results will be disappointing. The number of processes also suggests that each must be performed with adequate precision, otherwise there will be a tolerance buildup or generation loss problem.

Figure 5.29 shows a typical graphics card. The pixel values to be displayed are written into a frame store by the CPU and the display mechanism reads the frame store line by line to produce a raster scanned image. Pixel array sizes are described as x by y pixels and these have been subject to much greater variation than has been the case in television. Figure 5.30 shows some of the array sizes supported in graphics devices. Note that datacine frame sizes eclipse these examples. As computer screens tend to be used in brighter ambient light than television screens, the displays have to be brighter and this makes flicker more visible. This can be overcome by running at a frame rate of 75 Hz or more.

Colour primaries

Gamma

Digital gamut

Pixel aspect ratio

Interlace/progressive

Picture aspect ratio

Picture rate

FIGURE 5.28

The various issues involved in converting between broadcast video and computer graphics formats. The problem is non-trivial but failure to address any one of these aspects will result in impairment.

image

FIGURE 5.29

A typical computer graphics card. See text for details.

A typical graphics card outputs analog RGB, which can drive a CRT or a more recent type of display. The analog outputs are provided by 8-bit DACs. Figure 5.31 shows the standard IBM graphics connector. To avoid storing 24 bits per pixel, some systems restrict the number of different colours that can be displayed at once. Between the frame store and the DACs is a device called a palette or Colour Look Up Table (CLUT). This can be preloaded with a range of colours that are appropriate for the image to be displayed. Whilst this is adequate for general-purpose computing, it is unsuitable for quality image portrayal.

Computer graphics takes a somewhat more casual view of gamma than does television. This may be due to the fact that early computer displays had no grey scale and simply produced binary video (black or white), in which linearity has no meaning. As computer graphics became more sophisticated, each pixel became a binary number and a grey scale was possible. The gamma of the CRT display was simply compensated by an inverse gamma lookup table (LUT) prior to the video DAC as shown in Figure 5.32a. This approach means that the pixel data within the computer are in the linear light domain. This in itself is not a problem, but when linear light is represented by only 8-bit pixels, then contouring in dark areas is inevitable. Linear light needs to be expressed by around 14 bits for adequate resolution, as was seen in Chapter 2. To improve the situation, certain manufacturers moved away from the linear light domain, but without going as far as conventional television practice. The solution was that the internal data would be subject to a partial inverse gamma, as shown in Figure 5.32b, followed by a further partial inverse gamma stage in the LUT of the graphics card. The combined effect of the two inverse gammas was correctly to oppose the display gamma.

320 × 200

320 × 350

360 × 400

640 × 200

720 × 400

720 × 350

640 × 350

640 × 400

640 × 480

640 × 473

800 × 600

1056 × 350

1056 × 480

1056 × 473

1118 × 350

1118 × 480

1118 × 473

1024 × 768

FIGURE 5.30

The array sizes that may be found in computer graphics.

Pin

1 Red

2 Green

3 Blue

4 N/C

5 Ground

6 Red return

7 Green return

8 Blue return

9 Key pin

10 Sync return

11 Monitor ID (not used)

12 Ground if monochrome monitor

13 H sync

14 V sync

15 N/C

FIGURE 5.31

The standard IBM graphics connector and its associated signals.

Unfortunately Silicon Graphics and Macintosh came up with systems in which the two gamma stages were completely incompatible, even though the overall result in both cases is correct. Data from one format cannot be displayed on the other format (or as video) without gamma conversion. In the absence of gamma conversion the grey scale will be non-linear, crushing either dark areas or light areas depending on the direction of data transfer. Gamma conversion is relatively straightforward, as a simple lookup table can be created with 8-bit data. Whatever the direction of conversion, one of the formats involved is likely to be RGB. It is useful if this is made the internal format of the conversion. Figure 5.33 shows that if the input is colour-difference based, conversion should be done early, whereas if the output is to be colour-difference based, the conversion should be done late. It is also worth considering the use of the linear light domain and suitably long word length within the conversion process. This overcomes any quality loss due to failure of constant luminance and distortion due to interpolating gamma-based signals. Figure 5.34 shows the principle. The gamma of the input format is reversed at the input and the gamma of the output format is re-created after all other processing is complete. Gamma in television signals generally follows a single standard, whereas with a computer format it will be necessary to establish exactly what gamma was assumed.
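Such a table can be built as follows (Python). The pure power law used here is an approximation; real transfer functions such as the Rec. 709 law include a linear segment, and the exponents shown are illustrative only.

```python
def gamma_conversion_lut(source_gamma, target_gamma):
    """Build a 256-entry table converting 8-bit data between two gamma laws."""
    lut = []
    for code in range(256):
        linear = (code / 255.0) ** source_gamma         # back to linear light
        lut.append(round(255.0 * linear ** (1.0 / target_gamma)))
    return lut

# data encoded assuming a display gamma of 1.8, re-encoded for an assumed 2.2
lut = gamma_conversion_lut(source_gamma=1.8, target_gamma=2.2)
print(lut[0], lut[64], lut[128], lut[255])
```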

image

image

FIGURE 5.32

Computers and gamma: a dog's dinner. (a) A simple system uses linear-light coding internally and an inverse gamma LUT prior to the CRT. With only 8-bit data this suffers excessive quantizing error. (b) Improved performance is obtained by applying a partial inverse gamma to the internal data, in tandem with a further partial inverse gamma prior to the CRT. Unfortunately there are two conflicting, incompatible standards.

image

FIGURE 5.33

Possible strategies for video/computer conversion. (a) Video to graphics RGB. (b) Graphics RGB to video.

image

FIGURE 5.34

Gamma is a compression technique and for the finest results it should not be used in any image-manipulation process because the result will be distorted. Accurate work should be done in the linear light domain.

Computer formats tend to use the entire number scale from black to white, such that in 8-bit systems black is 00Hex and white is FF. However, television signals according to ITU-601 have some head room above white and foot room below black. If gamma, head room, and foot room are not properly handled in the conversion, the result will be black crushing, white crushing, lack of contrast, or a distorted grey scale.
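As a minimal sketch of the level-scaling part of this conversion (ignoring gamma, which must be handled separately), the hypothetical functions below map full-range 8-bit luma onto the ITU-601 range of 16–235 and back, clipping any excursions:

    import numpy as np

    def full_to_video_range(pixels):
        # Map full-range luma (0 = black, 255 = white) onto ITU-601 levels
        # (16 = black, 235 = white); head room and foot room remain unused.
        return np.round(16 + pixels.astype(np.float64) * 219.0 / 255.0).astype(np.uint8)

    def video_to_full_range(pixels):
        # Inverse mapping; out-of-range excursions are clipped to the number scale.
        full = (pixels.astype(np.float64) - 16.0) * 255.0 / 219.0
        return np.clip(np.round(full), 0, 255).astype(np.uint8)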

Colorimetry may be a problem in conversion. Television signals generally abide by ITU-709 colorimetry, whereas computer graphic files could use almost any set of primaries. It is not unusual for computer screens to run at relatively high colour temperatures to give brighter pictures. If the primaries are known, then it is possible to convert between colour spaces using matrix arithmetic. Figure 5.35 shows that if two triangles are created on the chromaticity diagram, one for each set of primaries, then wherever the triangles overlap, ideal conversion is possible. In the case of colours in which there is no overlap the best that can be done is to produce the correct hue by calculating the correct vector from the white point, even if the saturation is incorrect. When the colorimetry is not known, accurate conversion is impossible. However, in practice acceptable results can be obtained by adjusting the primary gains to achieve an acceptable colour balance on a recognizable part of the image, such as a white area or a flesh tone.
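Where both sets of primaries are known, the matrix arithmetic amounts to multiplying each linear-light RGB pixel by a 3 × 3 matrix derived from the two sets of chromaticities. The sketch below shows only the structure of that operation; the matrix values given are placeholders, not the coefficients of any real pair of colour spaces.

    import numpy as np

    def convert_primaries(rgb_linear, matrix):
        # Apply a 3 x 3 primary-conversion matrix to linear-light RGB pixels.
        # rgb_linear has shape (..., 3); matrix is derived from the source and
        # destination primaries and white points.
        return np.einsum('ij,...j->...i', matrix, rgb_linear)

    # Placeholder matrix for illustration only; a real matrix must be computed
    # from the chromaticity coordinates of both sets of primaries.
    example_matrix = np.array([[ 1.02, -0.01, -0.01],
                               [-0.02,  1.03, -0.01],
                               [ 0.00, -0.02,  1.02]])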

The image size or pixel count will be different and, with the exception of recent formats, the television signal will be interlaced and will not necessarily use square pixels. Spatial interpolation will be needed to move between pixel array sizes and pixel aspect ratios. The frame rate may also be different. The best results will be obtained using motion compensation. If both formats are progressively scanned, resizing and rate conversion are separable, but if interlace is involved the problem is not separable and resizing and rate conversion should be done simultaneously in a three-dimensional filter.

image

FIGURE 5.35

Conversion between colour spaces works only where the areas enclosed by the primary triangles overlap (shaded). Outside these areas the best that can be done is to keep the hue correct by accepting a saturation error.

GRAPHIC ART/PAINT SYSTEMS

In graphic art systems, there is a requirement for disk storage of the generated images, and some art machines incorporate a still store unit, whereas others can be connected to a separate one by an interface. Disk-based stores are discussed in Chapter 9. The essence of an art system is that an artist can draw images that become a video signal directly with no intermediate paper and paint. Central to the operation of most art systems is a digitizing tablet, which is a flat surface over which the operator draws a stylus. The tablet can establish the position of the stylus in vertical and horizontal axes. One way in which this can be done is to launch ultrasonic pulses down the tablet, which are detected by a transducer in the stylus. The time taken to receive the pulse is proportional to the distance to the stylus. The coordinates of the stylus are converted to addresses in the frame store that correspond to the same physical position on the screen. To make a simple sketch, the operator specifies a background parameter, perhaps white, which would be loaded into every location in the frame store. A different parameter is then written into every location addressed by movement of the stylus, which results in a line drawing on the screen. The art world uses pens and brushes of different shapes and sizes to obtain a variety of effects, one common example being the rectangular pen nib, in which the width of the resulting line depends on the angle at which the pen is moved. This can be simulated on art systems, because the address derived from the tablet is processed to produce a range of addresses within a certain screen distance of the stylus. If all these locations are updated as the stylus moves, a broad stroke results.

If the address range is larger in the horizontal axis than in the vertical axis, for example, the width of the stroke will be a function of the direction of stylus travel. Some systems have a sprung tip on the stylus that connects to a force transducer, so that the system can measure the pressure the operator uses. By making the address range a function of pressure, broader strokes can be obtained simply by pressing harder. To simulate a set of colours available on a palette, the operator can select a mode in which small areas of each colour are displayed in boxes on the monitor screen. The desired colour is selected by moving a screen cursor over the box using the tablet. The parameter to be written into selected locations in the frame RAM now reflects the chosen colour. In more advanced systems, simulation of airbrushing is possible. In this technique, the transparency of the stroke is great at the edge, where the background can be seen showing through, but transparency reduces to the centre of the stroke. A read–modify–write process is necessary in the frame memory, in which background values are read, mixed with paint values with the appropriate transparency, and written back. The position of the stylus effectively determines the centre of a two-dimensional transparency contour, which is convolved with the memory contents as the stylus moves.
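The read–modify–write blend can be sketched as follows (a hypothetical Python routine, not taken from any particular art system), in which the opacity falls from the centre of the dab to its edge so that the background shows through progressively:

    import numpy as np

    def airbrush_dab(frame, centre, radius, colour, strength=1.0):
        # frame has shape (height, width, 3); centre is (row, column).
        cy, cx = centre
        y, x = np.ogrid[:frame.shape[0], :frame.shape[1]]
        dist = np.sqrt((y - cy) ** 2 + (x - cx) ** 2)
        # Opacity contour: greatest at the centre, falling to zero at the radius.
        opacity = strength * np.clip(1.0 - dist / radius, 0.0, 1.0)[..., None]
        # Read the background, mix with the paint value, write back.
        blended = frame * (1.0 - opacity) + np.asarray(colour, dtype=float) * opacity
        return blended.astype(frame.dtype)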

LINEAR AND NON-LINEAR EDITING

The term “editing” covers a multitude of possibilities in video production. Simple video editors work in two basic ways, by assembling or by inserting sections of material or clips comprising a whole number of frames to build the finished work. Assembly begins with a blank master recording. The beginning of the work is copied from the source, and new material is successively appended to the end of the previous material. Figure 5.36 shows how a master recording is made up by assembly from source recordings. Insert editing begins with an existing recording in which a section is replaced by the edit process.

At its most general, editing is subdivided into horizontal editing, which refers to any changes with respect to the time axis, and vertical editing,7 which is the generic term for processes taking place on an imaginary z axis running back into the screen. These include keying, dissolves, wipes, layering, and so on.8 DVEs may also be used for editing, where a page turn or rotate effect reveals a new scene.

In all types of editing the goal is the appropriate sequence of material at the appropriate time. The first type of picture editing was done physically by cutting and splicing film, to assemble the finished work mechanically. This approach was copied on early quadruplex video recorders, a difficult and laborious process. This gave way to electronic editing on VTRs, in which lengths of source tape were copied to the master. Once the speed and capacity of disk drives became sufficient, it was obvious that they would ultimately take over as editing media as soon as they became economical.

image

FIGURE 5.36

Assembly is the most basic form of editing, in which source clips are sequentially added to the end of a recording.

When video tape was the only way of editing, it did not need a qualifying name. Now that video is stored as data, alternative storage media have become available, which allow editors to reach the same goal but using different techniques. Whilst digital VTR formats copy their analog predecessors and support field-accurate editing on the tape itself, in all other digital editing, pixels from various sources are brought from the storage media to various pages of RAM. The edit is previewed by selectively processing two (or more) sample streams retrieved from RAM. Once the edit is satisfactory it may subsequently be written on an output medium. Thus the nature of the storage medium does not affect the form of the edit in any way except for the amount of time needed to execute it.

Tapes allow only serial or linear access to data, whereas disks and RAM allow random access and so can be much faster. Editing using random access storage devices is very powerful as the shuttling of tape reels is avoided. The technique is called non-linear editing. This is not a very helpful name, as all editing is non-linear. In fact it is only the time axis of the storage medium that is non-linear.

ONLINE AND OFFLINE EDITING

In many workstations, compression is employed, and the appropriate coding and decoding logic will be required adjacent to the inputs and outputs. With mild compression, the machine's output quality may be high enough to be used directly for some purposes, such as the creation of news programs; this is known as online editing. Alternatively a high compression factor may be used, and the editor is then used only to create an edit decision list (EDL). This is known as offline editing. The EDL is subsequently used to control automatic editing of the full-bandwidth source material, possibly on tape. The full-bandwidth material is conformed to the edit decisions taken on the compressed material.

One of the advantages of offline editing is that the use of compression in the images seen by the editor reduces the bandwidth/bit rate requirement between the hardware and the control panel. Consequently it becomes possible for editing to be performed remotely. The high-resolution images on a central file server can be conformed locally to an EDL created by viewing compressed images in any location that has Internet access.

DIGITAL FILMMAKING

The power of non-linear editing can be applied to filmmaking as well as to video production. There are various levels at which this can operate. Figure 5.37 shows the simplest level. Here the filming takes place as usual, and after development the uncut film is transferred to video using a telecine or datacine machine. The data are also compressed and stored on a disk-based workstation. The workstation is used to make all the edit decisions and these are stored as an EDL. This will be sent to the film laboratory to control the film cutting.

In Figure 5.38 a more sophisticated process is employed. Here the film camera viewfinder is replaced with a video camera so that a video signal is available during filming. This can be recorded on disk so that an immediate replay is available following each take. In the event that a retake is needed, the film need not be developed, reducing costs. Edit decisions can be taken before the film has been developed.

image

FIGURE 5.37

Films can be edited more quickly by transferring the uncut film to video and then to disk-based storage.

image

FIGURE 5.38

With a modified film camera that can also output video, the editing can begin before the film is developed.

TIMECODE

Timecode is essential to editing, as many different processes occur during an edit, and each one is programmed beforehand to take place at a given timecode value. Provision of a timecode reference effectively synchronises the processes.

SMPTE standard timecode for 525/60 use is shown in Figure 5.39. EBU timecode is basically similar to SMPTE except that the frame count will reach a lower value in each second. These store hours, minutes, seconds, and frames as binary-coded decimal (BCD) numbers, which are serially encoded along with user bits into an FM channel code (see Chapter 8), which is recorded on one of the linear audio tracks of the tape. The user bits are not specified in the standard, but a common use is to record the take or session number. Disks also use timecode for synchronisation, but the timecode forms part of the file structure so that frames of data may be retrieved by specifying the required timecode.
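For non-drop-frame codes the relationship between the displayed time and a frame count is a simple calculation, and each pair of digits is carried as binary-coded decimal. The following sketch (assuming 25 fps EBU working; substitute the appropriate rate for SMPTE) shows both steps:

    def timecode_to_frames(hours, minutes, seconds, frames, fps=25):
        # Convert a non-drop-frame timecode to a total frame count.
        return ((hours * 60 + minutes) * 60 + seconds) * fps + frames

    def to_bcd(value):
        # Encode a two-digit value (0..99) as packed BCD, as carried in the timecode word.
        tens, units = divmod(value, 10)
        return (tens << 4) | units

    # Example: 01:23:45:12 at 25 fps corresponds to 125,637 frames.
    total = timecode_to_frames(1, 23, 45, 12)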

image

FIGURE 5.39

In SMPTE standard timecode, the frame number and time are stored as eight BCD symbols. There is also space for 32 user-defined bits. The code repeats every frame. Note the asymmetrical sync word, which allows the direction of media movement to be determined.

In some cases the disk database can be extended to include the assistant's notes and the film dialog. The editor can search for an edit point by having the system search for a text string. The display would then show a mosaic of all frames in which that dialog was spoken.

The complex special effects seen on modern films can be performed only in the digital domain. In this case the film is converted to data in a datacine and all production is then done by manipulating that data. The extremely high pixel counts used in digital film result in phenomenal amounts of data and it is common to use adapted broadcast DVTR formats to store it economically. After production the digital images are transferred back to release prints.

Ultimately films may be made entirely electronically. When cameras of sufficient resolution and dynamic range become available, the cost of storage will be such that filming will be replaced by a direct camera-to-disk transfer. The production process will consist entirely of digital signal-processing steps, resulting in a movie in the shape of a large data file. One of the present difficulties with electronic film production is the small physical size of CCD sensors in comparison with the traditional film frame, which makes controlled depth of focus difficult to achieve.

Digital films can be distributed to the cinema via copper or fibre-optic link, using encryption to prevent piracy and mild compression for economy. At the cinema the signal would be stored on a file server. Projection would then consist of accessing the data, decrypting, decoding the compression, and delivering the data to a digital projector. This technology will change the nature of the traditional cinema out of recognition.

THE NON-LINEAR WORKSTATION

Figure 5.40 shows the general arrangement of a hard-disk-based workstation. The graphic display in such devices has a screen that is a montage of many different signals, each of which appears in a window. In addition to the video windows there will be a number of alphanumeric and graphic display areas required by the control system. There will also be a cursor, which can be positioned by a trackball or mouse. The screen is refreshed by a frame store, which is read at the screen refresh rate. The frame store can be written by various processes simultaneously to produce a windowed image. In addition to the graphic display, there may be one or more further screens that reproduce full-size images for preview purposes.

A master timing generator provides reference signals to synchronise the internal processes. This also produces an external reference to which source devices such as VTRs can lock. The timing generator may free-run in a standalone system or genlock to station reference to allow playout to air.

Digital inputs and outputs are provided, along with optional convertors to allow working in an analog environment. A compression process will generally be employed to extend the playing time of the disk storage.

image

FIGURE 5.40

A hard-disk-based workstation. Note the screen, which can display numerous clips at the same time.

Disk-based workstations fall into several categories depending on the relative emphasis of the vertical or horizontal aspects of the process. High-end postproduction emphasizes the vertical aspect of the editing, as a large number of layers may be used to create the output image. The length of such productions is generally quite short and so disk capacity is not an issue and compression may not be employed. It is unlikely that such a machine would need to play out to air. In contrast, a general-purpose editor used for program production will emphasize the horizontal aspect of the task. Extended recording ability will be needed, and the use of compression is more likely. News-editing systems would emphasize speed and simplicity, such that the editing could be performed by journalists.

A typical machine will be based around a high-data-rate bus, connecting the I/O, RAM, disk server, and processor. If magnetic disks are used, these will be Winchester types, because they offer the largest capacity. Exchangeable magneto-optic disks may also be supported.

Before any editing can be performed, it is necessary to have source material online. If the source material exists on MO disks with the appropriate file structure, these may be used directly. Otherwise it will be necessary to input the material in real time and record it on magnetic disks via the data-reduction system. In addition to recording the data-reduced source video, reduced-size versions of each frame that are suitable for the screen windows may also be recorded.

LOCATING THE EDIT POINT

Digital editors simulate the “rock and roll” process of edit-point location originally used in VTRs, in which the tape is moved to and fro by the action of a jog wheel or joystick. Whilst DVTRs with track-following systems can work in this way, disks cannot. Disk drives transfer data intermittently and not necessarily in real time. The solution is to transfer the recording in the area of the edit point to RAM in the editor. RAM access can take place at any speed or direction and the precise edit point can then conveniently be found by monitoring signals from the RAM. In a window-based display, a source recording is attributed to a particular window and will be reproduced within that window, with time-code displayed adjacently.

Figure 5.41 shows how the area of the edit point is transferred to the memory. The source device is commanded to play, and the operator watches the replay in the selected window. The same frames are continuously written into a memory within the editor. This memory is addressed by a counter, which repeatedly overflows to give the memory a ring-like structure rather like that of a time base corrector, but somewhat larger. When the operator sees the rough area in which the edit is required, he or she will press a button. This action stops the memory writing, not immediately, but one-half of the memory contents later. The effect is then that the memory contains an equal number of samples before and after the rough edit point. Once the recording is in the memory, it can be accessed at leisure, and the constraints of the source device play no further part in the edit-point location.
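The behaviour of the ring memory can be sketched in a few lines of hypothetical code. Fields are written continuously, overwriting the oldest material; when the operator marks the rough edit point, writing continues for half the capacity and then stops, leaving material on either side of the mark:

    class RingFieldStore:
        def __init__(self, capacity):
            self.buffer = [None] * capacity
            self.write_index = 0
            self.remaining_after_mark = None   # None means writing continues indefinitely

        def write_field(self, field):
            if self.remaining_after_mark == 0:
                return False                   # memory is frozen around the edit point
            self.buffer[self.write_index] = field
            self.write_index = (self.write_index + 1) % len(self.buffer)
            if self.remaining_after_mark is not None:
                self.remaining_after_mark -= 1
            return True

        def mark_rough_edit_point(self):
            # Keep writing for half the capacity, then freeze, so that the memory
            # holds an equal number of fields before and after the mark.
            self.remaining_after_mark = len(self.buffer) // 2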

image

FIGURE 5.41

The use of a ring memory, which overwrites and allows storage of frames before and after the coarse edit point.

There are a number of ways in which the memory can be read. If the field address in memory is supplied by a counter that is clocked at the appropriate rate, the edit area can be replayed at normal speed, or at some fraction of normal speed, repeatedly. To simulate the analog method of finding an edit point, the operator is provided with a scrub wheel or rotor, and the memory field address will change at a rate proportional to the speed with which the rotor is turned and in the same direction. Thus the recording can be seen forward or backward at any speed, and the effect is exactly that of manually jogging an analog tape. The operation of a jog-wheel encoder was shown in Chapter 3 under Digital Faders and Controls.

If the position of the jog address pointer through the memory is compared with the addresses of the ends of the memory, it will be possible to anticipate that the pointer is about to reach the end of the memory. A disk transfer can be performed to fetch new data farther up the time axis, so that it is possible to jog an indefinite distance along the source recording. The user is never aware of the finite amount of memory between the storage device and the display. Data that will be used to make the master recording need never pass through these processes; they are solely to assist in the location of the edit points.

The act of pressing the coarse edit-point button stores the timecode of the source at that point, which is frame-accurate. As the rotor is turned, the memory address is monitored and used to update the timecode.

Before the edit can be performed, two edit points must be determined, the out-point at the end of the previously recorded signal and the in-point at the beginning of the new signal. The second edit point can be determined by moving the cursor to a different screen window in which video from a different source is displayed. The jog wheel will now roll this material to locate the second edit point whilst the first source video remains frozen in the deselected window. The editor's microprocessor stores these in an EDL to control the automatic assemble process.

It is also possible to locate a rough edit point by typing in a previously noted timecode, and the image in the window will automatically jump to that time. In some systems, in addition to recording video and audio, there may also be text files locked to timecode that contain the dialog. Using these systems one can allocate a textual dialog display to a further window and scroll down the dialog or search for a key phrase as in a word processor. Unlike a word processor, the timecode pointer from the text access is used to jog the video window. As a result an edit point can be located in the video if the actor's lines at the desired point are known.

PERFORMING THE EDIT

Using one or other of the above methods, an edit list can be made that contains an in-point, an out-point, and a filename for each of the segments of video that need to be assembled to make the final work, along with a time-code-referenced transition command and period for the vision mixer. This edit list will also be stored on the disk. When a preview of the edited work is required, the edit list is used to determine what files will be necessary and when, and this information drives the disk controller.
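The edit list itself need be no more than a sequence of records of this kind. The field names below are illustrative rather than those of any real EDL format:

    from dataclasses import dataclass

    @dataclass
    class EdlEvent:
        # One segment of the assembled work (illustrative structure only).
        source_file: str             # name of the source clip on disk
        in_point: int                # timecode of the first frame used, as a frame count
        out_point: int               # timecode of the last frame used, as a frame count
        transition: str = "cut"      # e.g., "cut" or "dissolve"
        transition_frames: int = 0   # duration of the transition for the vision mixer

    edit_list = [
        EdlEvent("scene_01", in_point=1200, out_point=1450),
        EdlEvent("scene_02", in_point=300, out_point=620,
                 transition="dissolve", transition_frames=25),
    ]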

Figure 5.42 shows the events during an edit between two files. The edit list causes the relevant blocks from the first file to be transferred from disk to memory, and these will be read by the signal processor to produce the preview output. As the edit point approaches, the disk controller will also place blocks from the incoming file into the memory. In different areas of the memory there will simultaneously be the end of the outgoing recording and the beginning of the incoming recording. Before the edit point, only pixels from the outgoing recording are accessed, but as the transition begins, pixels from the incoming recording are also accessed, and for a time both data streams will be input to the vision mixer according to the transition period required.

The output of the signal processor becomes the edited preview material, which can be checked for the required subjective effect. If necessary the in- or out-point can be trimmed, or the cross-fade period changed, simply by modifying the edit-list file. The preview can be repeated as often as needed, until the desired effect is obtained. At this stage the edited work does not exist as a file, but is re-created each time by a further execution of the EDL. Thus a lengthy editing session need not fill up the disks.

image

FIGURE 5.42

Sequence of events for a hard-disk edit. See text for details.

It is important to realize that at no time during the edit process were the original files modified in any way. The editing was done solely by reading the files. The power of this approach is that if an edit list is created wrongly, the original recording is not damaged, and the problem can be put right simply by correcting the edit list. The advantage of a disk-based system for such work is that location of edit points, previews, and reviews are all performed almost instantaneously, because of the random access of the disk. This can reduce the time taken to edit a program to a fraction of that needed with a tape machine.

During an edit, the disk controller has to provide data from two different files simultaneously, and so it has to work much harder than for a simple playback. If there are many close-spaced edits, the controller and drives may be hard-pressed to keep ahead of real time, especially if there are long transitions, because during a transition a vertical edit is taking place between two video signals and the source data rate is twice as great as during replay. A large buffer memory helps this situation because the drive can fill the memory with files before the edit actually begins, and thus the instantaneous sample rate can be met by allowing the memory to empty during disk-intensive periods.

Disk formats that handle defects dynamically, such as defect skipping, will also be superior to bad-block files when throughput is important. Some drives rotate the sector addressing from one cylinder to the next so that the drive does not lose a revolution when it moves to the next cylinder. Disk-editor performance is usually specified in terms of peak editing activity that can be achieved, but with a recovery period between edits. If an unusually severe editing task is necessary for which the drive just cannot access files fast enough, it will be necessary to rearrange the files on the disk surface so that the files that will be needed at the same time are on nearby cylinders.8 An alternative is to spread the material between two or more drives so that overlapped seeks are possible.

Once the editing is finished, it will generally be necessary to transfer the edited material to form a contiguous recording so that the source files can make way for new work. If the source files already exist on tape the disk files can simply be erased. If the disks hold original recordings they will need to be backed up to tape if they will be required again. In large broadcast systems, the edited work can be broadcast directly from the disk file server. In smaller systems it will be necessary to output to some removable medium, because the Winchester drives in the editor have fixed media.

APPLICATIONS OF MOTION COMPENSATION

In Chapter 2, the section Motion Portrayal and Dynamic Resolution introduced the concept of eye tracking and the optic flow axis. The optic flow axis is the locus of some point on a moving object that will be in a different place in successive pictures. Any device that computes with respect to the optic flow axis is said to be motion compensated. Until recently the computation required for motion compensation was too expensive, but now that this is no longer the case the technology has become very important in moving-image portrayal systems.

Figure 5.43a shows an example of a moving object that is in a different place in each of three pictures. The optic flow axis is shown. The object is not moving with respect to the optic flow axis and if this axis can be found some very useful results are obtained. The process of finding the optic flow axis is called motion estimation. Motion estimation is literally a process that analyses successive pictures and determines how objects move from one to the next. It is an important enabling technology because of the way it parallels the action of the human eye.

image

FIGURE 5.43

Motion compensation is an important technology. (a) The optic flow axis is found for a moving object. (b) The object in pictures (n + 1) and (n + 2) can be re-created by shifting the object of picture n using motion vectors. MPEG uses this process for compression. (c) A standards convertor creates a picture on a new time base by shifting object data along the optic flow axis. (d) With motion compensation a moving object can still correlate from one picture to the next so that noise reduction is possible.

Figure 5.43b shows that if the object does not change its appearance as it moves, it can be portrayed in two of the pictures by using data from one picture only, simply by shifting part of the picture to a new location. This can be done using vectors as shown. Instead of transmitting a lot of pixel data, a few vectors are sent instead. This is the basis of motion-compensated compression, which is used extensively in MPEG as will be seen in Chapter 6.

Figure 5.43c shows that if a high-quality standards conversion is required between two different frame rates, the output frames can be synthesized by moving-image data, not through time but along the optic flow axis. This locates objects where they would have been if frames had been sensed at those times, and the result is a judder-free conversion. This process can be extended to drive image displays at a frame rate higher than the input rate so that flicker and background strobing are reduced. This technology is available in certain high-quality consumer television sets.

Attempts to use this approach to eliminate judder from 24 Hz film have not been successful. It appears that at this very low frame rate there is simply insufficient motion information available.

Figure 5.43d shows that noise reduction relies on averaging two or more images so that the images add but the noise cancels. Conventional noise reducers fail in the presence of motion, but if the averaging process takes place along the optic flow axis, noise reduction can continue to operate.

The way in which eye tracking avoids aliasing is fundamental to the perceived quality of television pictures. Many processes need to manipulate moving images in the same way to avoid the obvious difficulty of processing with respect to a fixed frame of reference. Processes of this kind are referred to as motion compensated and rely on a quite separate process that has measured the motion.

Motion compensation is also important when interlaced video needs to be processed as it allows de-interlacing with the smallest number of artifacts.

MOTION-ESTIMATION TECHNIQUES

There are three main methods of motion estimation, which are to be found in various applications: block matching, gradient matching, and phase correlation. Each has its own characteristics, which are quite different from those of the others.

BLOCK MATCHING

This is the simplest technique to follow. In a given picture, a block of pixels is selected and stored as a reference. If the selected block is part of a moving object, a similar block of pixels will exist in the next picture, but not in the same place. As Figure 5.44 shows, block matching simply moves the reference block around over the second picture looking for matching pixel values. When a match is found, the displacement needed to obtain it is used as a basis for a motion vector.

Whilst simple in concept, block matching requires an enormous amount of computation because every possible motion must be tested over the assumed range. Thus if the object is assumed to have moved over a 16-pixel range, it will be necessary to test 16 different horizontal displacements in each of 16 vertical positions, a total of 256 candidate positions. At each position every pixel in the block must be compared with the corresponding pixel in the second picture, so that even a 16 × 16 block requires in excess of 65,000 pixel comparisons. In a typical video, displacements of twice the figure quoted here may be found, particularly for sporting events, and the computation then required becomes enormous. If the motion is required to subpixel accuracy, then before any matching can be attempted the picture will need to be interpolated, requiring further computation.
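A brute-force version of the search is easily sketched. The code below is illustrative only, using a sum-of-absolute-differences criterion, and it makes the scale of the computation obvious, since the two nested loops over displacement each enclose a full block comparison:

    import numpy as np

    def block_match(ref_block, search_area):
        # Slide the reference block over the search area and return the
        # displacement giving the smallest sum of absolute differences (SAD).
        bh, bw = ref_block.shape
        best_sad, best_vector = None, (0, 0)
        for dy in range(search_area.shape[0] - bh + 1):
            for dx in range(search_area.shape[1] - bw + 1):
                candidate = search_area[dy:dy + bh, dx:dx + bw]
                sad = np.abs(candidate.astype(np.int32) - ref_block.astype(np.int32)).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_vector = sad, (dy, dx)
        return best_vector, best_sad   # vector is relative to the search-area origin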

One way of reducing the amount of computation is to perform the matching in stages of which the first stage is inaccurate but covers a large motion range and the last stage is accurate but covers a small range. The first matching stage is performed on a heavily filtered and subsampled picture, which contains far fewer pixels.

image

FIGURE 5.44

In block matching the search block has to be positioned at all possible relative motions within the search area and a correlation measured at each one.

When a match is found, the displacement is used as a basis for a second stage, which is performed with a less heavily filtered picture. Eventually the last stage takes place to any desired accuracy, including subpixel. This hierarchical approach does reduce the computation required, but it suffers from the problem that the filtering of the first stage may make small objects disappear and they can never be found by subsequent stages if they are moving with respect to their background. This is not a problem for compression, because a prediction error will provide the missing detail, but it is an issue for standards convertors, which require more accurate motion than compressors. Many televised sports events contain small, fast-moving objects. As the matching process depends upon finding similar luminance values, this can be confused by objects moving into shade or fades.

GRADIENT MATCHING

At some point in a picture, the function of brightness with respect to distance across the screen will have a certain slope, known as the spatial luminance gradient. If the associated picture area is moving, the slope will traverse a fixed point on the screen and the result will be that the brightness now changes with respect to time. This is a temporal luminance gradient. Figure 5.45 shows the principle. For a given spatial gradient, the temporal gradient becomes steeper as the speed of movement increases. Thus motion speed can be estimated from the ratio of the spatial and temporal gradients.9
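In its simplest one-dimensional form the estimate is just the ratio of the two measured gradients, as the hypothetical routine below shows; it is only valid where the spatial gradient is significant and the motion is small:

    import numpy as np

    def gradient_motion_estimate(field_a, field_b):
        # Estimate horizontal speed (pixels per field period) from the ratio of
        # the temporal luminance gradient to the spatial luminance gradient.
        a = np.asarray(field_a, dtype=float)
        b = np.asarray(field_b, dtype=float)
        spatial = np.gradient(a, axis=-1)      # brightness change across the screen
        temporal = b - a                       # brightness change between fields
        speed = np.zeros_like(spatial)
        valid = np.abs(spatial) > 1e-3         # avoid division where there is no gradient
        speed[valid] = -temporal[valid] / spatial[valid]
        return speed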

In practice this is difficult because there are numerous processes that can change the luminance gradient. When an object moves so as to obscure or reveal the background, the spatial gradient will change from field to field even if the motion is constant. Variations in illumination, such as when an object moves into shade, also cause difficulty. The process can be assisted by recursion, in which the motion in a current picture is predicted by extrapolating the optic flow axis from earlier pictures, but this will result in problems at cuts.

image

FIGURE 5.45

The principle of gradient matching. The luminance gradient across the screen is compared with that through time.

PHASE CORRELATION

Phase correlation works by performing a discrete Fourier transform on two successive fields and then subtracting all the phases of the spectral components. The phase differences are then subject to a reverse transform, which directly reveals peaks whose positions correspond to motions between the fields.10,11 The nature of the transform domain means that if the distance and direction of the motion is measured accurately, the area of the screen in which it took place is not. Thus in practical systems the phase-correlation stage is followed by a matching stage not dissimilar to the block-matching process. However, the matching process is steered by the motions from the phase correlation, and so there is no need to attempt to match at all possible motions. By attempting matching on measured motion the overall process is made much more efficient.

One way of considering phase correlation is that by using the Fourier transform to break the picture into its constituent spatial frequencies the hierarchical structure of block matching at various resolutions is in fact performed in parallel. In this way small objects are not missed because they will generate high-frequency components in the transform.

Although the matching process is simplified by adopting phase correlation, the Fourier transforms themselves require complex calculations. The high performance of phase correlation would remain academic if it were too complex to put into practice. However, if realistic values are used for the motion speeds that can be handled, the computation required by block matching actually exceeds that required for phase correlation. The elimination of amplitude information from the phase correlation process ensures that motion estimation continues to work in the case of fades, objects moving into shade, or flashguns firing.

The details of the Fourier transform are described in Chapter 3. A one-dimensional example of phase correlation will be given here by way of introduction. A line of luminance, which in the digital domain consists of a series of samples, is a function of brightness with respect to distance across the screen. The Fourier transform converts this function into a spectrum of spatial frequencies (units of cycles per picture width) and phases.

All television signals must be handled in linear-phase systems. A linear-phase system is one in which the delay experienced is the same for all frequencies. If video signals pass through a device that does not exhibit linear phase, the various frequency components of edges become displaced across the screen.

Figure 5.46 shows what phase linearity means. If the left-hand end of the frequency axis (DC) is considered to be firmly anchored, but the right-hand end can be rotated to represent a change of position across the screen, it will be seen that as the axis twists evenly the result is phase shift proportional to frequency. A system having this characteristic is said to display linear phase.

In the spatial domain, a phase shift corresponds to a physical movement. Figure 5.47 shows that if between fields a waveform moves along the line, the lowest frequency in the Fourier transform will suffer a given phase shift, twice that frequency will suffer twice that phase shift, and so on. Thus it is potentially possible to measure movement between two successive fields if the phase differences between the Fourier spectra are analysed. This is the basis of phase correlation.

Figure 5.48 shows how a one-dimensional phase correlator works. The Fourier transforms of two lines from successive fields are computed and expressed in polar (amplitude and phase) notation (see Chapter 3).

image

FIGURE 5.46

The definition of phase linearity is that phase shift is proportional to frequency. In phase-linear systems the waveform is preserved and simply moves in time or space.

image

FIGURE 5.47

In a phase-linear system, shifting the video waveform across the screen causes phase shifts in each component proportional to frequency.

image

FIGURE 5.48

The basic components of a phase correlator.

The phases of one transform are all subtracted from the phases of the same frequencies in the other transform. Any frequency component having significant amplitude is then normalized, or boosted to full amplitude.

The result is a set of frequency components that all have the same amplitude, but have phases corresponding to the difference between two fields. These coefficients form the input to an inverse transform. Figure 5.49a shows what happens. If the two fields are the same, there are no phase differences between the two, and so all the frequency components are added with zero-degree phase to produce a single peak in the centre of the inverse transform. If, however, there was motion between the two fields, such as a pan, all the components will have phase differences, and this results in a peak shown in Figure 5.49b, which is displaced from the centre of the inverse transform by the distance moved. Phase correlation thus actually measures the movement between fields. In the case in which the line of video in question intersects objects moving at different speeds, Figure 5.49c shows that the inverse transform would contain one peak corresponding to the distance moved by each object.
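The whole one-dimensional process can be sketched compactly using a fast Fourier transform, as below. This is only an illustration of the principle; a practical correlator works on two-dimensional windowed blocks, as described next.

    import numpy as np

    def phase_correlate_1d(line_a, line_b):
        # Returns the displacement of line_b relative to line_a, plus the
        # correlation itself, in which each motion appears as a peak.
        a = np.fft.fft(np.asarray(line_a, dtype=float))
        b = np.fft.fft(np.asarray(line_b, dtype=float))
        cross = np.conj(a) * b                        # subtract the phases
        cross /= np.maximum(np.abs(cross), 1e-12)     # normalise: keep phase, discard amplitude
        correlation = np.fft.ifft(cross).real         # inverse transform
        shift = int(np.argmax(correlation))           # peak position = distance moved
        if shift > len(correlation) // 2:             # interpret wrap-around as negative motion
            shift -= len(correlation)
        return shift, correlation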

Whilst this explanation has used one dimension for simplicity, in practice the entire process is two dimensional. A two-dimensional Fourier transform of each field is computed, the phases are subtracted, and an inverse two-dimensional transform is computed, the output of which is a flat plane out of which three-dimensional peaks rise. This is known as a correlation surface.

Figure 5.50 shows some examples of a correlation surface. In Figure 5.50a there has been no motion between fields and so there is a single central peak. In (b) there has been a pan and the peak moves across the surface. In (c) the camera has been depressed and the peak moves upward.

Where more complex motions are involved, perhaps with several objects moving in different directions and/or at different speeds, one peak will appear in the correlation surface for each object.

It is a fundamental strength of phase correlation that it actually measures the direction and speed of moving objects rather than estimating, extrapolating, or searching for them. The motion can be measured to subpixel accuracy. However, it should be understood that according to Heisenberg's uncertainty theorem, accuracy in the transform domain is incompatible with accuracy in the spatial domain. Although phase correlation accurately measures motion speeds and directions, it cannot specify where in the picture these motions are taking place. It is necessary to look for them in a further matching process. The efficiency of this process is dramatically improved by the inputs from the phase-correlation stage.

The input to a motion estimator for most applications consists of interlaced fields. The lines of one field lie between those of the next, making comparisons between them difficult. A further problem is that vertical spatial aliasing may exist in the fields. Preprocessing solves these problems by performing a two-dimensional spatial low-pass filtering operation on input fields. Alternate fields are also interpolated up or down by half a line using the techniques of Chapter 3 under Sampling-Rate Conversion, so that interlace disappears and all fields subsequently have the same sampling grid. The spatial frequency response in 625-line systems is filtered to 72 cycles per picture height. This is half the response possible from the number of lines in a field, but is necessary because subsequent correlation causes a frequency-doubling effect. The spatial filtering also cuts down the amount of computation required.

The computation needed to perform a two-dimensional Fourier transform increases dramatically with the size of the block employed, and so no attempt is made to transform the down-sampled fields directly. Instead the fields are converted into overlapping blocks by the use of window functions as shown in Figure 5.51. The size of the window controls the motion speed that can be handled, and so a window size is chosen that allows motion to be detected up to the limit of human judder visibility.

image

FIGURE 5.49

(a) The peak in the inverse transform is central for no motion. (b) In the case of motion, the peak shifts by the distance moved. (c) If there are several motions, each one results in a peak.

image

FIGURE 5.50

(a) A two-dimensional correlation surface has a central peak when there is no motion. (b) In the case of a pan, the peak moves laterally. (c) A camera tilt moves the peak at right angles to the pan.

image

FIGURE 5.51

The input fields are converted into overlapping windows. Each window is individually transformed.

image

FIGURE 5.52

The block diagram of a phase-correlated motion estimator. See text for details.

Figure 5.52 shows a block diagram of a phase-correlated motion-estimation system. Following the preprocessing, each windowed block is subject to a fast Fourier transform (FFT), and the output spectrum is converted to the amplitude and phase representation. The phases are subtracted from those of the previous field in each window, and the amplitudes are normalized to eliminate any variations in illumination or the effect of fades from the motion sensing. A reverse transform is performed, which results in a correlation surface. The correlation surface contains peaks whose positions actually measure distances and directions moved by some feature in the window.

It is a characteristic of all transforms that the more accurately the spectrum of a signal is known, the less accurately the spatial domain is known. Thus the whereabouts within the window of the moving objects that gave rise to the correlation peaks is not known. Figure 5.53 illustrates the phenomenon. Two windowed blocks are shown in consecutive fields. Both contain the same objects, moving at the same speed, but from different starting points. The correlation surface will be the same in both cases. The phase-correlation process therefore needs to be followed by a further process called image correlation, which identifies the picture areas in which the measured motion took place and establishes a level of confidence in the identification. This stage can also be seen in the block diagram of Figure 5.52.

To employ the terminology of motion estimation, the phase-correlation process produces candidate vectors, and the image-correlation process assigns the vectors to specific areas of the picture. In many ways the vector-assignment process is more difficult than the phase-correlation process, as the latter is a fixed computation, whereas the vector assignment has to respond to infinitely varying picture conditions.

image

FIGURE 5.53

Phase correlation measures motion, not the location of moving objects. These two examples give the same correlation surface.

Figure 5.54 shows the image-correlation process, which is used to link the candidate vectors from the phase correlator to the picture content. In this example, the correlation surface contains three peaks, which define three possible motions between two successive fields. One down-sampled field is successively shifted by each of the candidate vectors and compared with the next field a pixel at a time. Similarities or correlations between pixel values indicate that an area with the measured motion has been found. This happens for two of the candidate vectors, and these vectors are then assigned to those areas. However, shifting by the third vector does not result in a meaningful correlation. This is taken to mean that it was a spurious vector, one which was produced in error because of difficult program material. The ability to eliminate spurious vectors and establish confidence levels in those that remain is essential to artifact-free conversion.
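In outline, the assignment stage can be sketched as below: each candidate vector is used to shift one field, the shifted field is compared with the next, and the vector is kept only where a worthwhile area correlates. The threshold and the integer-pixel shifting are simplifications; a real system interpolates to subpixel accuracy.

    import numpy as np

    def assign_vectors(field_a, field_b, candidate_vectors, threshold=8.0):
        # candidate_vectors is a list of (dy, dx) tuples from the phase correlator.
        assignments = {}
        for dy, dx in candidate_vectors:
            shifted = np.roll(field_a, (dy, dx), axis=(0, 1))
            match = np.abs(shifted.astype(float) - field_b.astype(float)) < threshold
            if match.mean() > 0.01:            # enough of the window correlates
                assignments[(dy, dx)] = match  # boolean map of the area with this motion
            # otherwise the vector is treated as spurious and discarded
        return assignments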

The phase-correlation process produces candidate vectors in each window. The vectors from all windows must be combined to obtain an overall view of the motion in the field before attempting to describe the motion of each pixel individually.

image

FIGURE 5.54

Image correlation uses candidate vectors to locate picture areas with the corresponding motion. If no image correlation is found the vector was spurious and is discounted.

Figure 5.55a shows that if a zoom is in progress, the vectors in the various windows will form a geometric progression, becoming longer in proportion to the distance from the axis of the zoom. However, if there is a pan, it will be seen from Figure 5.55b that there will be similar vectors in all the windows. In practice both motions may occur simultaneously.

An estimate will be made of the speed of a field-wide zoom or of the speed of picture areas that contain receding or advancing motions, which give a zoom-like effect. If the effect of zooming is removed from each window by shifting the peaks by the local zoom magnitude, but in the opposite direction, the positions of the peaks will reveal any component due to panning. This can be found by summing all the windows to create a histogram. Panning results in a dominant peak in the histogram, where all windows contain peaks in a similar place, which reinforce the dominant peak.

Each window is then processed in turn. Where only a small part of an object overlaps into a window, it will result in a small peak in the correlation surface, which might be missed. The windows are deliberately overlapped so that a given pixel may appear in four windows. Thus a moving object will appear in more than one window. If most of an object lies within one window, a large peak will be produced from the motion in that window. The resulting vector will be added to the candidate vector list of all adjacent windows. When the vector assignment is performed, image correlations will result if a small overlap occurred, and the vector will be validated. If there was no overlap, the vector will be rejected.

image

FIGURE 5.55

The results of (a) a zoom and (b) a pan on the vectors in various windows in the field.

The peaks in each window reflect the degree of correlation between the two fields for different offsets in two dimensions. The volume of the peak corresponds to the amount of the area of the window (i.e., the number of pixels) having that motion. Thus peaks should be selected starting with the largest. However, periodic structures in the input field, such as grilles and striped shirts, will result in partial correlations at incorrect distances that differ from the correct distance by the period of the structure. The effect is that a large peak on the correlation surface will be flanked by smaller peaks at uniform spacing in a straight line. The characteristic pattern of subsidiary peaks can be identified and the vectors invalidated.

One way in which this can be done is to compare the positions of the peaks in each window with those estimated by the pan/zoom process. The true peak due to motion will be similar; the false subpeaks due to image periodicity will not be and can be rejected.

Correlations with candidate vectors are then performed. The image in one field is shifted in an interpolator by the amount specified by a candidate vector and the degree of correlation is measured. Note that this interpolation is to subpixel accuracy because phase correlation can accurately measure subpixel motion. High correlation results in vector assignment, low correlation results in the vector being rejected as unreliable.

If all the peaks are evaluated in this way, then most of the time valid assignments, for which there is acceptable confidence from the correlator, will be made. Should it not be possible to obtain any correlation with confidence in a window, then the pan/zoom values will be inserted so that the window moves in a way similar to the overall field motion.

MOTION-COMPENSATED STANDARDS CONVERSION

A conventional standards convertor is not transparent to motion portrayal, and the effect is judder and loss of resolution. Figure 5.56 shows what happens on the time axis in a conversion between 60 and 50 Hz (in either direction). Fields in the two standards appear in different planes cutting through the spatiotemporal volume, and the job of the standards convertor is to interpolate along the time axis between input planes in one standard to estimate what an intermediate plane in the other standard would look like. With still images, this is easy, because planes can be slid up and down the time axis with no ill effect. If an object is moving, it will be in a different place in successive fields. Interpolating between several fields results in multiple images of the object. The position of the dominant image will not move smoothly, an effect that is perceived as judder. Motion compensation is designed to eliminate this undesirable judder.

A conventional standards convertor interpolates only along the time axis, whereas a motion-compensated standards convertor can swivel its interpolation axis off the time axis. Figure 5.57a shows the input fields in which three objects are each moving in a different way. In Figure 5.57b it will be seen that the interpolation axis is aligned with the optic flow axis of each moving object in turn.

Each object is no longer moving with respect to its own optic flow axis, and so on that axis it no longer generates temporal frequencies due to motion, and temporal aliasing due to motion cannot occur.12 Interpolation along the optic flow axes will then result in a sequence of output fields in which motion is properly portrayed. The process requires a standards convertor that contains filters that are modified to allow the interpolation axis to move dynamically within each output field. The signals that move the interpolation axis are known as motion vectors. It is the job of the motion-estimation system to provide these motion vectors. The overall performance of the convertor is determined primarily by the accuracy of the motion vectors. An incorrect vector will result in unrelated pixels from several fields being superimposed, and the result is unsatisfactory.

image

FIGURE 5.56

The different temporal distributions of input and output fields in a 50/60 Hz convertor.

image

FIGURE 5.57

(a) Input fields with moving objects. (b) Moving the interpolation axes to make them parallel to the trajectory of each object.

Figure 5.58 shows the sequence of events in a motion-compensated standards convertor. The motion estimator measures movements between successive fields. These motions must then be attributed to objects by creating boundaries around sets of pixels having the same motion. The result of this process is a set of motion vectors, hence the term “vector assignation.” The motion vectors are then input to a modified four-field standards convertor to deflect the interfield interpolation axis.

The vectors from the motion estimator actually measure the distance moved by an object from one input field to another. What the standards convertor requires is the value of motion vectors at an output field. A vector interpolation stage is required, which computes where between the input fields A and B the current output field lies and uses this to proportion the motion vector into two parts. Figure 5.59a shows that the first part is the motion between field A and the output field; the second is the motion between field B and the output field. Clearly the difference between these two vectors is the motion between input fields. These processed vectors are used to displace parts of the input fields so that the axis of interpolation lies along the optic flow axis. The moving object is stationary with respect to this axis so interpolation between fields along it will not result in any judder.
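The apportioning itself is a simple proportion, sketched below for a single two-dimensional vector; fraction expresses how far the output field lies between input fields A and B:

    def apportion_vector(vector, fraction):
        # Split a motion vector measured from field A to field B into the two
        # parts needed at an output field lying `fraction` of the way from A
        # (0.0 = coincident with A, 1.0 = coincident with B).
        vy, vx = vector
        part_a = (fraction * vy, fraction * vx)                   # motion from A to the output field
        part_b = ((fraction - 1.0) * vy, (fraction - 1.0) * vx)   # motion from B to the output field
        return part_a, part_b   # part_a minus part_b equals the input vector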

image

FIGURE 5.58

The essential stages of a motion-compensated standards convertor.

Whilst a conventional convertor needs to interpolate only vertically and temporally, a motion-compensated convertor also needs to interpolate horizontally to account for lateral movement in images. Figure 5.59b shows that the motion vector from the motion estimator is resolved into two components, vertical and horizontal. The spatial impulse response of the interpolator is shifted in two dimensions by these components. This shift may be different in each of the fields that contribute to the output field.

When an object in the picture moves, it will obscure its background. The vector interpolator in the standards convertor handles this automatically, provided the motion estimation has produced correct vectors. Figure 5.60 shows an example of background handling. The moving object produces a finite vector associated with each pixel, whereas the stationary background produces zero vectors except in the area OX where the background is being obscured. Vectors converge in the area where the background is being obscured and diverge where it is being revealed. Image correlation is poor in these areas so no valid vector is assigned.

image

FIGURE 5.59

(a) The motion vectors on the input field structure must be interpolated onto the output field structure. The field to be interpolated is positioned temporally between source fields and the motion vector between them is apportioned according to the location. Motion vectors are two dimensional and can be transmitted as vertical and horizontal components, shown in (b), which control the spatial shifting of input fields.

An output field is located between input fields, and vectors are projected through it to locate the intermediate position of moving objects. These are interpolated along an axis that is parallel to the optic flow axis. This results in address mapping that locates the moving object in the input field RAMs. However, the background is not moving and so the optic flow axis is parallel to the time axis. The pixel immediately below the leading edge of the moving object does not have a valid vector because it is in the area OX where forward image correlation failed.

The solution is for that pixel to assume the motion vector of the background below point X, but only to interpolate in a backward direction, taking pixel data from previous fields. In a similar way, the pixel immediately behind the trailing edge takes the motion vector for the background above point Y and interpolates only in a forward direction, taking pixel data from future fields. The result is that the moving object is portrayed in the correct place on its trajectory, and the background around it is filled in only from fields that contain useful data.

image

FIGURE 5.60

Background handling. When a vector for an output pixel near a moving object is not known, the vectors from adjacent background areas are assumed. Converging vectors imply that the background is being obscured, which requires that interpolation use only data from previous fields. Diverging vectors imply that the background is being revealed, and interpolation can use data only from later fields.
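A sketch of this rule, assuming an earlier stage has flagged each output pixel as lying in an obscured region (converging vectors), a revealed region (diverging vectors), or neither, could read as follows; the flag names are illustrative.

def interpolate_background_pixel(region, prev_sample, next_sample, t):
    """region: 'obscured', 'revealed' or 'normal'.
    prev_sample / next_sample: motion-compensated samples from the previous
    and following fields; t: output field position between them (0..1)."""
    if region == 'obscured':
        return prev_sample                       # backward only: the background still exists in earlier fields
    if region == 'revealed':
        return next_sample                       # forward only: the background exists only in later fields
    return (1.0 - t) * prev_sample + t * next_sample   # ordinary temporal interpolation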

The technology of the motion-compensated standards convertor can be used in other applications. When video recordings are played back in slow motion, the result is that the same picture is displayed several times, followed by a jump to the next picture. Figure 5.61 shows that a moving object would remain in the same place on the screen during picture repeats, but jump to a new position as a new picture was played. The eye attempts to track the moving object, but, as Figure 5.61 also shows, the location of the moving object wanders with respect to the trajectory of the eye, and this is visible as judder.

Motion-compensated slow-motion systems are capable of synthesizing new images that lie between the original images from a slow-motion source. Figure 5.62 shows that two successive images in the original recording (using DVE terminology, these are source fields) are fed into the unit, which then measures the distance travelled by all moving objects between those images. Using interpolation, intermediate fields (target fields) are computed in which moving objects are positioned so that they lie on the eye trajectory. Using the principles described above, background information is removed as moving objects conceal it and replaced as the rear of an object reveals it. Judder is thus removed and motion with a fluid quality is obtained.

image

FIGURE 5.61

(a) Conventional slow motion using field repeating with stationary eye. (b) With a tracking eye the source of judder can be seen.

image

FIGURE 5.62

In motion-compensated slow motion, output fields are interpolated with moving objects displaying judder-free linear motion between input fields.
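As an illustration, for a slow-motion factor of N the target fields take equally spaced temporal positions between each pair of source fields, and the vector apportioning sketched earlier is repeated at each position; the routine below, purely illustrative, computes only those positions.

def target_field_positions(slow_motion_factor):
    """Return the temporal positions (0 = source field A, 1 = source field B)
    of the target fields synthesized between one pair of source fields."""
    n = slow_motion_factor
    return [k / n for k in range(1, n)]

print(target_field_positions(4))   # [0.25, 0.5, 0.75] for 4x slow motion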

CAMERA-SHAKE COMPENSATION

As video cameras become smaller and lighter, it becomes increasingly difficult to move them smoothly and the result is camera shake. This is irritating to watch, as well as requiring a higher bit-rate in compression systems. There are two solutions to the problem, one that is contained within the camera and one that can be used at some later time on the video data.

image

FIGURE 5.63

Image-stabilizing cameras sense shake using a pair of orthogonal gyros that sense movement of the optical axis.

Figure 5.63 shows that image-stabilizing cameras contain miniature gyroscopes, which produce an electrical output proportional to their rate of turn about a specified axis. A pair of these, mounted orthogonally, can produce vectors describing the camera shake. This can be used to oppose the shake by shifting the image. In one approach, the shifting is done optically. Figure 5.64 shows a pair of glass plates with the intervening space filled with transparent liquid. By tilting the plates, a variable-angle prism can be obtained, and this is fitted in the optical system before the sensor. If the prism plates are suitably driven by servos from the gyroscopic sensors, the optical axis along which the camera is looking can remain constant despite shake.

Alternatively, the camera can contain a DVE by which the vectors from the gyroscopes cause the CCD camera output to be shifted horizontally or vertically so that the image remains stable. This approach is commonly used in consumer camcorders.
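A sketch of that shifting control, assuming two gyro rate outputs in degrees per second, a 50 Hz field rate, and a lens focal length expressed in pixel units (all illustrative values, not taken from the text), could be:

import math

FIELD_RATE = 50.0               # fields per second (assumed 625/50 system)
FOCAL_LENGTH_PIXELS = 1200.0    # focal length in pixel units (assumed)

def shake_to_shift(pan_rate_deg_s, tilt_rate_deg_s, angle_state):
    """Accumulate the gyro rates into pan/tilt angles and return the
    opposing (dx, dy) pixel shift for the current field."""
    angle_state[0] += pan_rate_deg_s / FIELD_RATE
    angle_state[1] += tilt_rate_deg_s / FIELD_RATE
    dx = -FOCAL_LENGTH_PIXELS * math.tan(math.radians(angle_state[0]))
    dy = -FOCAL_LENGTH_PIXELS * math.tan(math.radians(angle_state[1]))
    return dx, dy

angle = [0.0, 0.0]
print(shake_to_shift(0.5, -0.2, angle))   # small opposing shift for one field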

A great number of video recordings and films already exist in which there is camera shake. Film also suffers from weave in the telecine machine. In this case the above solutions are inappropriate and a suitable signal processor is required. Figure 5.65 shows that motion compensation can be used. If a motion estimator is arranged to find the motion between a series of pictures, camera shake will add to the genuine object motions a component that is common to the whole of each picture. This can be used to compute the optic flow axis of the camera, independently of the objects portrayed.

image

FIGURE 5.64

Image-stabilizing cameras. (a) The image is stabilized optically prior to the CCD sensors. (b) The CCD output contains image shake, but this is opposed by the action of a DVE configured to shift the image under control of the gyro inputs.

image

FIGURE 5.65

(a) In digital image stabilizing the optic flow axis of objects in the input video is measured. (b) This motion is smoothed to obtain a close approximation to the original motion. (c) If this is subtracted from (a) the result is the camera-shake motion, which is used to drive the image stabilizer.

Operating over several pictures, the trend in camera movement can be separated from the shake by filtering, to produce a position error for each picture. Each picture is then shifted in a DVE to cancel the position error. The result is that the camera shake is gone and the camera movements appear smooth. To prevent the edges of the frame moving visibly, the DVE also performs a slight magnification so that the edge motion is always outside the output frame.
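A sketch of that separation follows, using a simple moving-average filter over an assumed window of 15 pictures as the smoothing stage; both the window length and the variable names are illustrative.

import numpy as np

def shake_correction(global_motion, window=15):
    """global_motion: per-picture global displacement from the motion
    estimator (pixels, one axis). Returns the per-picture position error,
    i.e. the shake component that the DVE should cancel."""
    path = np.cumsum(global_motion)                  # raw camera path
    kernel = np.ones(window) / window
    smooth = np.convolve(path, kernel, mode='same')  # trend in camera movement
    return path - smooth                             # residual shake

shaky_pan = np.random.randn(100) * 2.0 + 0.5         # a shaky pan to the right
errors = shake_correction(shaky_pan)                 # shift each picture by -errors[i]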

DE-INTERLACING

Interlace is a legacy compression technique that sends only half of the picture lines in each field. Whilst this works reasonably well for transmission, it causes difficulty in any process that requires image manipulation. This includes DVEs, standards convertors, and display convertors. All these devices give better results when working with progressively scanned data, and if the source material is interlaced, a de-interlacing process will be necessary.

Interlace distributes vertical detail information over two fields, and for maximum resolution all that information is necessary. Unfortunately it is not possible to use the information from two different fields directly. Figure 5.66 shows a scene in which an object is moving. When the second field of the scene leaves the camera, the object will have assumed a position different from the one it had in the first field, and the result of combining the two fields to make a de-interlaced frame will be a double image. This effect can easily be demonstrated on any video recorder that offers a choice of still field or still frame. Stationary objects before a stationary camera, however, can be de-interlaced perfectly.

In simple de-interlacers, motion sensing is used so that de-interlacing can be disabled when movement occurs, and interpolation from a single field is used instead. Motion sensing implies comparison of one picture with the next. If interpolation is to be used only in areas where there is movement, it is necessary to test for motion over the entire frame. Motion can be detected simply by comparing the luminance value of a given pixel with the value of the same pixel two fields earlier. As two fields are to be combined, and motion can occur in either, the comparison must be made both between two odd fields and between two even fields. Thus four fields of memory are needed to perform motion sensing correctly. The luminance from four fields requires about a megabyte of storage.
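A sketch of this test is shown below, with an illustrative threshold and the four most recent fields held oldest-first; each field is compared with the field of the same parity two fields earlier, and motion is flagged where either comparison exceeds the threshold.

import numpy as np

def motion_map(fields, threshold=8):
    """fields: the four most recent luminance fields, oldest first, with
    alternating parity. Returns a coarse (field-resolution) boolean map
    that is True wherever motion is detected in either parity."""
    diff_a = np.abs(fields[2].astype(int) - fields[0].astype(int))   # same-parity pair 1
    diff_b = np.abs(fields[3].astype(int) - fields[1].astype(int))   # same-parity pair 2
    return (diff_a > threshold) | (diff_b > threshold)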

image

FIGURE 5.66

A moving object will be in a different place in two successive fields and will produce a double image.

At some point a decision must be made to abandon pixels from the previous field that are in the wrong place due to motion and to interpolate them from adjacent lines in the current field. Switching suddenly in this way is visible, and a more sophisticated mechanism can be used. In Figure 5.67, two fields, separated in time, are shown. Interlace can be seen by following lines from pixels in one field, which pass between pixels in the other field. If there is no movement, the fact that the two fields are separated in time is irrelevant, and the two can be superimposed to make a frame array. When there is motion, the pixels above and below the unknown pixel are added together and divided by two to produce an interpolated value. If both of these mechanisms operate all the time, a better-quality picture results when a cross-fade is made between their outputs according to the amount of motion. At some motion value, or some magnitude of pixel difference, the loss of resolution due to a double image equals the loss of resolution due to interpolation. That amount of motion should result in the cross-fader arriving at a 50/50 setting. Any less motion results in a fade toward the two-field (frame-based) picture; any more motion results in a fade toward the interpolated values.
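A per-pixel sketch of this cross-fade follows; the 50/50 motion value is an assumed parameter, the above/below pair comes from the current field, and the other value comes from the previous field.

def deinterlace_pixel(prev_field_sample, above, below, motion, motion_50=16.0):
    """Return the value of one missing pixel. motion_50 is the motion (or
    pixel-difference) value at which the cross-fader sits at 50/50."""
    interpolated = (above + below) / 2.0            # intra-field interpolation
    k = min(motion / (2.0 * motion_50), 1.0)        # 0 = previous-field data, 1 = pure interpolation
    return (1.0 - k) * prev_field_sample + k * interpolated

At motion equal to motion_50 the two contributions are equal; below that value the previous-field sample dominates, and above it the interpolated value takes over.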

image

FIGURE 5.67

Pixels from the most recent field are interpolated spatially to form low vertical resolution pixels, which will be used if there is excessive motion; pixels from the previous field will be used to give maximum vertical resolution. The best possible de-interlaced frame results.

The most efficient way to de-interlace is to use motion compensation. Figure 5.68 shows that when an object moves in an interlaced system, the interlace breaks down with respect to the optic flow axis as was seen in Chapter 2. If the motion is known, two or more fields can be shifted so that a moving object is in the same place in both. Pixels from both fields can then be used to describe the object with better resolution than would be possible from one field alone. It can be seen in Figure 5.69 that the combination of two fields in this way will result in pixels having a highly irregular spacing, and a special type of filter is needed to convert this back to a progressive frame with regular pixel spacing. At some critical vertical speeds there will be alignment between pixels in adjacent fields and no improvement is possible, but at other speeds the process will always give better results.

image

FIGURE 5.68

In the presence of vertical motion or motion having a vertical component, interlace breaks down and the pixel spacing with respect to the tracking eye becomes irregular.

image

FIGURE 5.69

A de-interlacer needs an interpolator, which can operate with input samples that are positioned arbitrarily rather than regularly.
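As a minimal stand-in for such a filter, linear interpolation from arbitrarily positioned samples already illustrates the idea; a practical convertor would use a higher-order design, and the sample positions and values below are illustrative only.

import numpy as np

# Vertical positions (in output-frame line units) of samples gathered from
# two motion-shifted fields, and their luminance values.
positions = np.array([0.0, 0.7, 2.0, 2.7, 4.0, 4.7])
values    = np.array([60.0, 64.0, 70.0, 73.0, 80.0, 82.0])

# Resample onto the regular line grid of the progressive output frame.
regular_lines = np.arange(0.0, 5.0, 1.0)
column = np.interp(regular_lines, positions, values)
print(column)    # regularly spaced samples for one column of the frame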

NOISE REDUCTION

The basic principle of all video noise reducers is that there is a certain amount of correlation between the video content of successive frames, whereas there is no correlation between the noise content.

A basic recursive device is shown in Figure 5.70. There is a frame store, which acts as a delay, and the output of the delay can be fed back to the input through an attenuator, which in the digital domain will be a multiplier. In the case of a still picture, successive frames will be identical, and a large amount of recursion can be used. This means that the output video will actually be the average of many frames. If there is movement of the image, it will be necessary to reduce the amount of recursion to prevent the generation of trails or smears. Probably the most famous examples of recursion smear are the television pictures of astronauts walking on the Moon that were sent back to Earth. The received pictures were very noisy and needed a lot of averaging to make them viewable. This was fine until the astronaut moved. The technology of the day did not permit motion sensing.

image

FIGURE 5.70

A basic recursive device feeds back the output to the input via a frame store, which acts as a delay. The characteristics of the device are controlled totally by the values of the two coefficients, K1 and K2, which control the multipliers.
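A sketch of the structure of Figure 5.70 in software form follows; the coefficient value is illustrative, and K1 + K2 = 1 is assumed so that the overall gain remains unity.

import numpy as np

class RecursiveNoiseReducer:
    def __init__(self, shape, k1=0.25):
        self.k1 = k1                    # weight given to the incoming frame (K1)
        self.k2 = 1.0 - k1              # weight given to the frame-store output (K2)
        self.store = np.zeros(shape)    # frame store acting as a one-frame delay

    def process(self, frame):
        """Combine the new frame with the delayed output; raising k1 (and so
        lowering k2) when motion is detected reduces recursion and prevents smearing."""
        out = self.k1 * frame + self.k2 * self.store
        self.store = out                # feed the output back for the next frame
        return out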

The noise reduction increases with the number of frames over which the noise is integrated, but image motion prevents simple combining of frames. If motion estimation is available, the image of a moving object in a particular frame can be integrated from the images in several frames that have been superimposed on the same part of the screen by displacements derived from the motion measurement. The result is that greater reduction of noise becomes possible [13]. In fact a motion-compensated standards convertor performs such a noise-reduction process automatically and can be used as a noise reducer, albeit an expensive one, by setting both input and output to the same standard.
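A sketch of such motion-compensated averaging, reusing scipy's shift as the displacement stage, shows why the noise reduction improves: each contributing frame is registered onto the reference before the frames are combined. The vectors are assumed to map each frame onto the reference geometry, and the function name is illustrative.

import numpy as np
from scipy.ndimage import shift

def mc_average(frames, vectors):
    """frames: list of 2-D luminance arrays.
    vectors: per-frame (dx, dy) displacement registering each frame onto the
    reference frame. Returns the noise-reduced average."""
    aligned = [shift(f, (dy, dx), order=1, mode='nearest')
               for f, (dx, dy) in zip(frames, vectors)]
    return np.mean(aligned, axis=0)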

References

1. Aird, B. Three dimensional picture synthesis. Broadcast Syst. Eng., 12 No. 3, 34–40 (1986).

2. Newman, W.M., and Sproull, R.F. Principles of Interactive Computer Graphics. Tokyo: McGraw–Hill (1979).

3. Gernsheim, H. A Concise History of Photography, p. 915, London: Thames & Hudson (1971).

4. Hedgecoe, J. The Photographer's Handbook, pp. 104–105, London: Ebury Press (1977).

5. Bennett, P., et al. Spatial transformation system including key signal generator. U.S. Patent No. 4,463,372 (1984).

6. de Boor, C. A Practical Guide to Splines. Berlin: Springer (1978).

7. Rubin, M. The emergence of the desktop: implications to offline editing. Record of 18th ITS (Montreux), pp. 384–389 (1993).

8. Trottier, L. Digital video compositing on the desktop. Record of 18th ITS (Montreux), pp. 564–570 (1993).

9. Limb, J.O., and Murphy, J.A. Measuring the speed of moving objects from television signals. IEEE Trans. Commun., 474–478 (1975).

10. Thomas, G.A. Television motion measurement for DATV and other applications. BBC Res. Dept. Rept, RD 1987/11 (1987).

11. Pearson, J.J., et al. Video rate image correlation processor. SPIE, Vol. 119, Application of Digital Image Processing, IOCC (1977).

12. Lau, H., and Lyon, D. Motion compensated processing for enhanced slow motion and standards conversion. IEE Conference Publ. No. 358, pp. 62–66 (1992).

13. Weiss, P., and Christensson, J. Real time implementation of sub-pixel motion estimation for broadcast applications. IEE Digest, 128 (1990).
