INTRODUCTION TO COMPRESSION

Compression, bit rate reduction, and data reduction are all terms that mean basically the same thing in this context. An impression of the original information is expressed using a smaller quantity or rate of data. It should be pointed out that in audio, compression traditionally means a process in which the dynamic range of the sound is reduced. In the context of MPEG the same word means that the bit rate is reduced, ideally leaving the dynamics of the signal unchanged. Provided the context is clear, the two meanings can co-exist without a great deal of confusion.

There are several reasons compression techniques are popular:

1. Compression extends the playing time of a given storage device.

2. Compression allows miniaturization. With fewer data to store, the same playing time is obtained with smaller hardware. This is useful in ENG (electronic news gathering) and consumer devices.

3. Tolerances can be relaxed. With fewer data to record, storage density can be reduced, making equipment that is more resistant to adverse environments and that requires less maintenance.

4. In transmission systems, compression allows a reduction in bandwidth, which will generally result in a reduction in cost, or make possible some process that would be impracticable without it.

5. If a given bandwidth is available to an uncompressed signal, compression allows faster-than-real-time transmission in the same bandwidth.

6. If a given bandwidth is available, compression allows a better quality signal in the same bandwidth.

image

FIGURE 6.1

(a) A compression system consists of a compressor or coder, a transmission channel, and a matching expander or decoder. The combination of coder and decoder is known as a codec. (b) MPEG is asymmetrical because the encoder is much more complex than the decoder.

Compression is summarised in Figure 6.1. It will be seen in Figure 6.1a that the data rate is reduced at the source by the compressor. The compressed data are then passed through a communication channel and returned to the original rate by the expander. The ratio between the source data rate and the channel data rate is called the compression factor. The term “coding gain” is also used. Sometimes a compressor and an expander in series are referred to as a compander. The compressor may equally well be referred to as a coder and the expander a decoder, in which case the tandem pair may be called a codec.

In audio and video compression, when the encoder is more complex than the decoder the system is said to be asymmetrical, as in Figure 6.1b. The encoder needs to be algorithmic or adaptive, whereas the decoder is “dumb” and carries out fixed actions. This is advantageous in applications such as broadcasting in which the number of expensive complex encoders is small but the number of simple inexpensive decoders is large. In point-to-point applications the advantage of asymmetrical coding is not so great.

Although there are many different coding techniques, all of them fall into one of two basic categories: lossless and lossy coding. In lossless coding, the data from the expander are identical bit for bit with the original source data. The so-called “stacker” programs, which increase the apparent capacity of disk drives in personal computers, use lossless codecs. Clearly with computer programs the corruption of a single bit can be catastrophic. Lossless coding is generally restricted to compression factors of around 2:1.

It is important to appreciate that a lossless coder cannot guarantee a particular compression factor and the communications link or recorder used with it must be able to function with the variable output data rate. Source data that result in poor compression factors on a given codec are described as difficult. It should be pointed out that the difficulty is often a function of the codec. In other words, data that one codec finds difficult may not be found difficult by another. Lossless codecs can be included in bit-error-rate testing schemes. It is also possible to cascade or concatenate lossless codecs without any special precautions.

In lossy coding, data from the expander are not identical bit for bit with the source data, and as a result comparing the input with the output is bound to reveal differences. Lossy codecs are not suitable for computer data, but are used in MPEG as they allow greater compression factors than lossless codecs. Successful lossy codecs are those in which the errors are arranged so that a human viewer or listener finds them subjectively difficult to detect. Thus lossy codecs must be based on an understanding of psychoacoustic and psychovisual perception and are often called perceptive codes.

In perceptive coding, the greater the compression factor required, the more accurately the human senses must be modelled. Perceptive coders can be forced to operate at a fixed compression factor. This is convenient for practical transmission applications in which a fixed data rate is easier to handle than a variable rate. The result of a fixed compression factor is that the subjective quality can vary with the “difficulty” of the input material. Perceptive codecs should not be concatenated indiscriminately, especially if they use different algorithms. As the reconstructed signal from a perceptive codec is not bit-for-bit accurate, clearly such a codec cannot be included in any bit-error-rate testing system, as the coding differences would be indistinguishable from real errors.

Although the adoption of digital techniques is recent, compression itself is as old as television. Figure 6.2 shows some of the compression techniques used in traditional television systems.

image

FIGURE 6.2

Compression is as old as television. (a) Interlace is a primitive way of halving the bandwidth. (b) Colour difference working invisibly reduces colour resolution. (c) Composite video transmits colour in the same bandwidth as monochrome.

One of the oldest techniques is interlace, which has been used in analog television from the very beginning as a primitive way of reducing bandwidth. As seen in Chapter 2, interlace is not without its problems, particularly in motion rendering. MPEG-2 supports interlace simply because legacy interlaced signals exist and there is a requirement to compress them further. This should not be taken to mean that it is an optimal approach.

The generation of colour difference signals from RGB in video represents an application of perceptive coding. The HVS (human visual system) sees no change in quality although the bandwidth of the colour difference signals is reduced. This is because human perception of detail in colour changes is much less than that in brightness changes. This approach is sensibly retained in MPEG.

Composite video systems such as PAL, NTSC, and SECAM are all analog compression schemes that embed a modulated subcarrier in the luminance signal so that colour pictures are available in the same bandwidth as monochrome. In comparison with a progressive-scan RGB picture, interlaced composite video has a compression factor of 6:1.

In a sense MPEG-2 can be considered to be a modern digital equivalent of analog composite video, as it has most of the same attributes. For example, the eight-field sequence of a PAL subcarrier that makes editing difficult has its equivalent in the GOP (group of pictures) of MPEG.1

In a PCM digital system the bit rate is the product of the sampling rate and the number of bits in each sample and this is generally constant. Nevertheless the information rate of a real signal varies. In all real signals, part of the signal is obvious from what has gone before or what may come later and a suitable receiver can predict that part so that only the true information actually has to be sent. If the characteristics of a predicting receiver are known, the transmitter can omit parts of the message in the knowledge that the receiver has the ability to re-create it. Thus all encoders must contain a model of the decoder.

One definition of information is that it is the unpredictable or surprising element of data. Newspapers are a good example of information because they mention only items that are surprising. Newspapers never carry items about individuals who have not been involved in an accident, as this is the normal case. Consequently the phrase “no news is good news” is remarkably true because if an information channel exists but nothing has been sent, then it is most likely that nothing remarkable has happened.

The difference between the information rate and the overall bit rate is known as the redundancy. Compression systems are designed to eliminate as much of that redundancy as practicable or perhaps affordable. One way in which this can be done is to exploit statistical predictability in signals. The information content or entropy of a sample is a function of how different it is from the predicted value. Most signals have some degree of predictability. A sine wave is highly predictable, because all cycles look the same. According to Shannon's theory, any signal that is totally predictable carries no information. In the case of the sine wave this is clear because it represents a single frequency and so has no bandwidth.
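
These ideas can be made concrete with Shannon's formula for the entropy of a source, H = −Σ p log2 p bits per symbol. The short Python sketch below (with invented probability figures) shows that a predictable source carries little information per symbol, whereas a uniform, noise-like source carries the maximum.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A highly predictable source: one value dominates, so each symbol
# carries very little information.
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))   # about 0.24 bits/symbol

# A noise-like source: all values equally likely, maximum information.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # exactly 2.0 bits/symbol
```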

At the opposite extreme a signal such as noise is completely unpredictable, and as a result all codecs find noise difficult. There are two consequences of this characteristic. First, a codec that is designed using the statistics of real material should not be tested with random noise because it is not a representative test. Second, a codec that performs well with clean source material may perform badly with source material having superimposed noise. Most practical compression units require some form of pre-processing before the compression stage proper and appropriate noise reduction should be incorporated into the pre-processing if noisy signals are anticipated. It will also be necessary to restrict the degree of compression applied to noisy signals.

All real signals fall part way between the extremes of total predictability and total unpredictability or noisiness. If the bandwidth (set by the sampling rate) and the dynamic range (set by the word length) of the transmission system delineate an area, this sets a limit on the information capacity of the system.

Figure 6.3a shows that most real signals occupy only part of that area. The signal may not contain all frequencies, or it may not have full dynamics at certain frequencies.

Entropy can be thought of as a measure of the actual area occupied by the signal. This is the area that must be transmitted if there are to be no subjective differences or artifacts in the received signal. The remaining area is called the redundancy because it adds nothing to the information conveyed. Thus an ideal coder could be imagined that miraculously sorts out the entropy from the redundancy and sends only the former. An ideal decoder would then re-create the original impression of the information quite perfectly.

image

image

FIGURE 6.3

(a) A perfect coder removes only the redundancy from the input signal and results in subjectively lossless coding. If the remaining entropy is beyond the capacity of the channel some of it must be lost and the codec will then be lossy. An imperfect coder will also be lossy as it fails to keep all entropy. (b) As the compression factor rises, the complexity must also rise to maintain quality. (c) High compression factors also tend to increase latency or delay through the system.

As the ideal is approached, the coder complexity and the latency, or delay, both rise. Figure 6.3b shows how complexity increases with compression factor, and Figure 6.3c shows how increasing the codec latency can improve the compression factor. Obviously, to obtain transparent quality, the channel must be able to accept whatever entropy the coder extracts. As a result moderate coding gains that remove only redundancy need not cause artifacts and result in systems that are described as subjectively lossless.

If the channel capacity is not sufficient for that, then the coder will have to discard some of the entropy and with it useful information. Larger coding gains that remove some of the entropy must result in artifacts. It will also be seen from Figure 6.3 that an imperfect coder will fail to separate the redundancy and may discard entropy instead, resulting in artifacts at a sub-optimal compression factor.

A single variable-rate transmission or recording channel is inconvenient and unpopular with channel providers because it is difficult to police. The requirement can be overcome by combining several compressed channels into one constant rate transmission in a way that flexibly allocates the data rate between the channels. Provided the material is unrelated, the probability of all channels reaching peak entropy at once is very small and so those channels that are at one instant passing easy material will free up transmission capacity for those channels that are handling difficult material. This is the principle of statistical multiplexing.

Where the same type of source material is used consistently, e.g., English text, then it is possible to perform a statistical analysis on the frequency with which particular letters are used. Variable-length coding is used, in which frequently used letters are allocated short codes and letters that occur infrequently are allocated long codes. This results in a lossless code. The well-known Morse code used for telegraphy is an example of this approach. The letter “E” is the most frequent letter in English and is sent with a single dot.

An infrequent letter such as “Z” is allocated a long complex pattern. It should be clear that codes of this kind that rely on a prior knowledge of the statistics of the signal are effective only with signals actually having those statistics. If Morse code is used with another language, the transmission becomes significantly less efficient because the statistics are quite different; the letter “Z”, for example, is quite common in Czech.

The Huffman code2 is one that is designed for use with a data source having known statistics and shares the same principle as the Morse code.

The probability of the different code values to be transmitted is studied, and the most frequent codes are arranged to be transmitted with short word-length symbols. As the probability of a code value falls, it will be allocated a longer word length. The Huffman code is used in conjunction with a number of compression techniques and is shown in Figure 6.4.

The input or source codes are assembled in order of descending probability. The two lowest probabilities are distinguished by a single code bit and their probabilities are combined. The process of combining probabilities is continued until unity is reached and at each stage a bit is used to distinguish the path. The bit will be a 0 for the most probable path and 1 for the least. The compressed output is obtained by reading the bits that describe which path to take going from right to left.
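
The procedure just described can be sketched in a few lines of Python. The letter probabilities below are invented for illustration; a real coder would measure them from representative source material.

```python
import heapq

def huffman_code(probabilities):
    """Build a Huffman code from {symbol: probability}; returns {symbol: bits}."""
    # Each heap entry: (combined probability, tie-breaker, {symbol: code so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Combine the two lowest probabilities, as described above:
        # 1 distinguishes the less probable path, 0 the more probable.
        p_low, _, low = heapq.heappop(heap)
        p_high, _, high = heapq.heappop(heap)
        merged = {s: "1" + c for s, c in low.items()}
        merged.update({s: "0" + c for s, c in high.items()})
        heapq.heappush(heap, (p_low + p_high, tie, merged))
        tie += 1
    return heap[0][2]

# Invented letter statistics, for illustration only.
print(huffman_code({"E": 0.5, "T": 0.2, "A": 0.15, "Z": 0.15}))
# The frequent "E" gets a 1-bit code; the rare "A" and "Z" get 3-bit codes.
```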

In the case of computer data, there is no control over the data statistics. Data to be recorded could be instructions, images, tables, text files, and so on, each having its own code value distribution. In this case a coder relying on fixed source statistics will be completely inadequate. Instead a system that can learn the statistics as it goes along is used. The LZW (Lempel–Ziv–Welch) lossless codes are in this category. These codes build up a conversion table between frequent long source data strings and short transmitted data codes at both coder and decoder, so the table itself never needs to be transmitted. Initially the compression factor is below unity because the table has yet to be learned, but once the tables are established, the coding gain more than compensates for the initial loss. In some applications, a continuous analysis of the frequency of code selection is made, and if a data string in the table is no longer being used with sufficient frequency it can be deselected and a more common string substituted.
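
The table-building principle can be sketched as follows. This is a minimal illustration only, not the exact algorithm of any particular product; note that the table is learned from the data itself, identically at both ends of the link.

```python
def lzw_compress(data):
    """Minimal LZW sketch. The string table is learned from the data itself,
    exactly as the decoder will learn it, so it is never transmitted."""
    table = {bytes([i]): i for i in range(256)}   # start with all single bytes
    next_code = 256
    string, out = b"", []
    for byte in data:
        candidate = string + bytes([byte])
        if candidate in table:
            string = candidate                    # keep extending the match
        else:
            out.append(table[string])             # emit longest known string
            table[candidate] = next_code          # learn the new, longer string
            next_code += 1
            string = bytes([byte])
    if string:
        out.append(table[string])
    return out

codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
print(len(codes), "codes for 24 input bytes")     # repetition is exploited
```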

image

FIGURE 6.4

The Huffman code achieves compression by allocating short codes to frequent values. To aid in de-serializing, the short codes are not prefixes of longer codes.

Lossless codes are less common for audio and video coding in which perceptive codes are permissible. The perceptive codes often obtain a coding gain by shortening the word length of the data representing the signal waveform. This must increase the noise level and the trick is to ensure that the resultant noise is placed at frequencies at which human senses are least able to perceive it. As a result, although the received signal is measurably different from the source data, it can appear the same to the human listener or viewer at moderate compression factors. As these codes rely on the characteristics of human sight and hearing, they can be fully tested only subjectively.

The compression factor of such codes can be set at will by choosing the word length of the compressed data. Whilst mild compression will be undetectable, with greater compression factors, artifacts become noticeable. Figure 6.3 shows that this is inevitable from entropy considerations.

WHAT IS MPEG?

MPEG is actually an acronym for the Moving Picture Experts Group, which was formed by the ISO (International Organization for Standardization) to set standards for audio and video compression and transmission. The first compression standard for audio and video was MPEG-1,3,4 but this was of limited application and the subsequent MPEG-2 standard was considerably broader in scope and of wider appeal. For example, MPEG-2 supports interlace, whereas MPEG-1 did not.

The approach of the ISO to standardisation in MPEG is novel because it is not the encoder that is standardised. Instead, the way in which a decoder shall interpret the bitstream is defined. Figure 6.5a shows that a decoder that can successfully interpret the bitstream is said to be compliant. Figure 6.5b shows that the advantage of standardising the decoder is that, over time, encoding algorithms can improve yet compliant decoders will continue to function with them.

Manufacturers can supply encoders using algorithms that are proprietary and their details do not need to be published. A useful result is that there can be competition between different encoder designs, which means that better designs will evolve. The user will have greater choice because different levels of cost and complexity can exist in a range of coders, yet a compliant decoder will operate with them all.

image

FIGURE 6.5

(a) MPEG defines the protocol of the bitstream between encoder and decoder. The decoder is defined by implication; the encoder is left very much to the designer. (b) This approach allows future encoders of better performance to remain compatible with existing decoders. (c) This approach also allows an encoder to produce a standard bitstream while its technical operation remains a commercial secret.

MPEG is, however, much more than a compression scheme, as it also standardises the protocol and syntax under which it is possible to combine or multiplex audio data with video data to produce a digital equivalent of a television program. Many such programs can be combined in a single multiplex and MPEG defines the way in which such multiplexes can be created and transported. The definitions include the metadata that decoders require to demultiplex correctly and that users will need to locate programs of interest.

image

FIGURE 6.6

Compression can be used around a recording medium. The storage capacity may be increased or the access time reduced according to the application.

As with all video systems there is a requirement for synchronising or genlocking and this is particularly complex when a multiplex is assembled from many signals that are not necessarily synchronised with one another.

The applications of audio and video compression are limitless and the ISO has done well to provide standards that are appropriate to the wide range of possible compression products.

MPEG-2 embraces video pictures from the tiny screen of a videophone to the high-definition images needed for electronic cinema. Audio coding stretches from speech-grade mono to multichannel surround sound.

Figure 6.6 shows the use of a codec with a recorder. The playing time of the medium is extended in proportion to the compression factor. In the case of tapes, the access time is improved because the length of tape needed for a given recording is reduced and so it can be rewound more quickly.

In the case of DVD (digital video disc, aka digital versatile disc) the challenge was to store an entire movie on one 12 cm disc. The storage density available is such that recording of SD uncompressed video would be out of the question. By the same argument, compression is also required in HD video discs.

In communications, the cost of data links is often roughly proportional to the data rate and so there is simple economic pressure to use a high compression factor. However, it should be borne in mind that implementing the codec also has a cost, which rises with compression factor, and so a degree of compromise will be inevitable.

In the case of video-on-demand, technology exists to convey full bandwidth video to the home, but to do so for a single individual at the moment would be prohibitively expensive. Without compression, HDTV (high-definition television) requires too much bandwidth. With compression, HDTV can economically be transmitted to the home. Compression does not make video-on-demand or HDTV possible, it makes them economically viable.

MPEG-1, -2, AND -4 AND H.264 CONTRASTED

The first compression standard for audio and video was MPEG-1. Although many applications have been found, MPEG-1 was basically designed to allow moving pictures and sound to be encoded into the bit rate of an audio Compact Disc. The resultant video CD was quite successful but has now been superseded by DVD. To meet the low bit rate requirement, MPEG-1 down sampled the images heavily as well as using picture rates of only 24–30 Hz, and the resulting quality was moderate.

The subsequent MPEG-2 standard was considerably broader in scope and of wider appeal. For example, MPEG-2 supports interlace and HD, whereas MPEG-1 did not. MPEG-2 has become very important because it has been chosen as the compression scheme for both DVB (digital video broadcasting) and DVD (digital video disc). Developments in standardising scalable and multiresolution compression that would have become MPEG-3 were complete by the time MPEG-2 was ready to be standardised, and so this work was incorporated into MPEG-2; as a result there is no MPEG-3 standard.

MPEG-4 uses further coding tools with additional complexity to achieve higher compression factors than MPEG-2. In addition to more efficient coding of video, MPEG-4 moves closer to computer graphics applications. In the more complex profiles, the MPEG-4 decoder effectively becomes a rendering processor and the compressed bitstream describes three-dimensional shapes and surface texture. It is to be expected that MPEG-4 will become as important to the Internet and wireless delivery as MPEG-2 has become in DVD and DVB.

The MPEG-4 standard is extremely wide ranging and it is unlikely that a single decoder will ever be made that can handle every possibility. Many of the graphics applications of MPEG-4 are outside telecommunications requirements. In 2001 the International Telecommunications Union (ITU) VCEG (Video Coding Experts Group) joined with ISO MPEG to form the JVT (Joint Video Team). The resulting standard is variously known as AVC (advanced video coding), H.264, or MPEG-4 Part 10. This standard further refines the video coding aspects of MPEG-4, which were themselves refinements of MPEG-2, to produce a coding scheme having the same applications as MPEG-2 but with the higher performance needed to broadcast HDTV.

To avoid tedium, in cases in which the term MPEG is used in this chapter without qualification, it can be taken to mean MPEG-1, -2, or -4 or H.264. Where a specific standard is being contrasted it will be made clear.

In workstations designed for the editing of audio and/or video, the source material is stored on hard disks for rapid access. Whilst top-grade systems may function without compression, many systems use compression to offset the high cost of disk storage. When a workstation is used for offline editing, a high compression factor can be used and artifacts will be visible in the picture.

This is of no consequence, as the picture is seen only by the editor, who uses it to make an EDL (edit decision list), which is no more than a list of actions and the timecodes at which they occur. The original uncompressed material is then conformed to the EDL to obtain a high-quality edited work. When online editing is being performed, the output of the workstation is the finished product and clearly a lower compression factor will have to be used.

Perhaps it is in broadcasting where the use of compression will have its greatest impact. There is only one electromagnetic spectrum and pressure from other services such as cellular telephones makes efficient use of bandwidth mandatory. Analog television broadcasting is an old technology and makes very inefficient use of bandwidth. Its replacement by a compressed digital transmission will be inevitable for the practical reason that the bandwidth is needed elsewhere.

Fortunately in broadcasting there is a mass market for decoders and these can be implemented as low-cost integrated circuits. Fewer encoders are needed and so it is less important if these are expensive. Whilst the cost of digital storage goes down year by year, the cost of electromagnetic spectrum goes up. Consequently in the future the pressure to use compression in recording will ease, whereas the pressure to use it in radio communications will increase.

SPATIAL AND TEMPORAL REDUNDANCY IN MPEG

Video signals exist in four dimensions: the attributes of the sample, the horizontal and vertical spatial axes, and the time axis. Compression can be applied in any or all of those four dimensions. MPEG-2 assumes 8-bit colour difference signals as the input, requiring rounding if the source is 10-bit. The sampling rate of the colour signals is less than that of the luminance. This is done by down sampling the colour samples horizontally and generally vertically as well. Essentially an MPEG-2 system has three parallel simultaneous channels, one for luminance and two for colour difference, which after coding are multiplexed into a single bitstream.
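
As an illustration, the chroma down sampling might be sketched as follows, assuming for simplicity a 2 × 2 box average (real systems use more carefully designed filters):

```python
import numpy as np

def down_sample_420(chroma):
    """Average each 2 x 2 group of chroma samples, halving the resolution in
    both axes. The box average is a simplifying assumption made here; real
    systems use more carefully designed down sampling filters."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

cb = np.random.rand(16, 16)          # a colour difference plane, full resolution
print(down_sample_420(cb).shape)     # (8, 8): one quarter of the sample count
```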

Figure 6.7a shows that when individual pictures are compressed without reference to any other pictures, the time axis does not enter the process, which is therefore described as intra-coded (intra = within) compression. The term spatial coding will also be found. It is an advantage of intra-coded video that there is no restriction to the editing that can be carried out on the picture sequence. As a result compressed VTRs such as Digital Betacam, DVC, and D-9 use spatial coding. Cut editing may take place on the compressed data directly if necessary. As spatial coding treats each picture independently, it can employ certain techniques developed for the compression of still pictures. The ISO JPEG (Joint Photographic Experts Group) compression standards5,6 are in this category. Where a succession of JPEG-coded images is used for television, the term “motion JPEG” will be found.

image

FIGURE 6.7

(a) Spatial or intra-coding works on individual images. (b) Temporal or inter-coding works on successive images.

Greater compression factors can be obtained by taking account of the redundancy from one picture to the next. This involves the time axis, as Figure 6.7b shows, and the process is known as inter-coded (inter = between) or temporal compression.

Temporal coding allows a higher compression factor, but has the disadvantage that an individual picture may exist only in terms of the differences from a previous picture. Clearly editing must be undertaken with caution and arbitrary cuts simply cannot be performed on the MPEG bitstream. If a previous picture is removed by an edit, the difference data will then be insufficient to re-create the current picture.

Intra-coding works in three dimensions: on the horizontal and vertical spatial axes and on the sample values. Analysis of typical television pictures reveals that, whilst there is a high spatial frequency content due to detailed areas of the picture, there is a relatively small amount of energy at such frequencies. Often pictures contain sizeable areas in which the same or similar pixel values exist. This gives rise to low spatial frequencies. The average brightness of the picture results in a substantial zero frequency component. Simply omitting the high-frequency components is unacceptable as this causes an obvious softening of the picture.

A coding gain can be obtained by taking advantage of the fact that the amplitude of the spatial components falls with frequency. It is also possible to take advantage of the eye's reduced sensitivity to noise in high spatial frequencies. If the spatial frequency spectrum is divided into frequency bands the high-frequency bands can be described by fewer bits, not only because their amplitudes are smaller, but also because more noise can be tolerated. The wavelet transform and the discrete cosine transform used in MPEG allow two-dimensional pictures to be described in the frequency domain and these were discussed in Chapter 3.

Inter-coding takes further advantage of the similarities between successive pictures in real material. Instead of sending information for each picture separately, inter-coders will send the difference between the previous picture and the current picture in a form of differential coding. Figure 6.8 shows the principle. A picture store is required at the coder to allow comparison to be made between successive pictures and a similar store is required at the decoder to make the previous picture available. The difference data may be treated as a picture itself and subjected to some form of transform-based spatial compression.
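
The principle of Figure 6.8a can be expressed in a toy Python sketch, ignoring the spatial compression that would be applied to the difference data in practice:

```python
import numpy as np

def encode_difference(previous, current):
    """Inter-coder: transmit only the change from the previous picture."""
    return current.astype(np.int16) - previous.astype(np.int16)

def decode_difference(previous, difference):
    """Decoder: add the received difference to the stored previous picture."""
    return (previous.astype(np.int16) + difference).astype(np.uint8)

previous = np.random.randint(0, 255, (8, 8), dtype=np.uint8)
current = previous.copy()
current[2:4, 2:4] += 1                      # only a small area changes
diff = encode_difference(previous, current)
assert np.array_equal(decode_difference(previous, diff), current)
print(np.count_nonzero(diff), "changed values out of", diff.size)
```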

The simple system of Figure 6.8a is of limited use, as in the case of a transmission error, every subsequent picture would be affected. Channel switching in a television set would also be impossible. In practical systems a modification is required. One approach is the so-called “leaky predictor” in which the next picture is predicted from a limited number of previous pictures rather than from an indefinite number. As a result errors cannot propagate indefinitely. The approach used in MPEG is that periodically some absolute picture data are transmitted in place of difference data.

Figure 6.8b shows that absolute picture data, known as I or intra pictures, are interleaved with pictures that are created using difference data, known as P or predicted pictures. I pictures require a large amount of data, whereas the P pictures require fewer data. As a result the instantaneous data rate varies dramatically and buffering has to be used to allow a constant transmission rate. The leaky predictor needs less buffering as the compression factor does not change so much from picture to picture.

The I picture and all the P pictures prior to the next I picture are called a group of pictures (GOP). For a high compression factor, a large number of P pictures should be present between I pictures, making a long GOP. However, a long GOP delays recovery from a transmission error. The compressed bitstream can be edited only at I pictures as shown.

In the case of moving objects, although their appearance may not change greatly from picture to picture, the data representing them on a fixed sampling grid will change and so large differences will be generated between successive pictures. It is a great advantage if the effect of motion can be removed from difference data so that they reflect only the changes in appearance of a moving object because a much greater coding gain can then be obtained. This is the objective of motion compensation.

image

image

FIGURE 6.8

(a) An inter-coded system uses a delay to calculate the pixel differences between successive pictures. To prevent error propagation, (b) intra-coded pictures may be used periodically.

In real television program material objects move around before a fixed camera or the camera itself moves. Motion compensation is a process that effectively measures motion of objects from one picture to the next so that it can allow for that motion when looking for redundancy between pictures. Figure 6.9 shows that moving pictures can be expressed in a three-dimensional space, which results from the screen area moving along the time axis. In the case of still objects, the only motion is along the time axis. However, when an object moves, it does so along the optic flow axis, which is not parallel to the time axis. The optic flow axis joins the same point on a moving object as it takes on various screen positions.

image

FIGURE 6.9

Objects travel in a three-dimensional space along the optic flow axis, which is parallel to the time axis only if there is no movement.

It will be clear that the data values representing a moving object change with respect to the time axis. However, looking along the optic flow axis, the appearance of an object changes only if it deforms, moves into shadow, or rotates. For simple translational motions the data representing an object are highly redundant with respect to the optic flow axis. Thus if the optic flow axis can be located, coding gain can be obtained in the presence of motion.

A motion-compensated coder works as follows. An I picture is sent, but is also locally stored so that it can be compared with the next input picture to find motion vectors for various areas of the picture. The I picture is then shifted according to these vectors to cancel interpicture motion. The resultant predicted picture is compared with the actual picture to produce a prediction error also called a residual. The prediction error is transmitted with the motion vectors. At the receiver the original I picture is also held in a memory. It is shifted according to the transmitted motion vectors to create the predicted picture and then the prediction error is added to it to re-create the original. When a picture is encoded in this way MPEG calls it a P picture.
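
Because MPEG does not standardise motion estimation, the following exhaustive block-matching search is purely illustrative of what an encoder might do; the block position, search range, and matching criterion are all arbitrary choices here.

```python
import numpy as np

def find_vector(reference, target, y0, x0, size=8, search=4):
    """Toy exhaustive block matching. Returns the (dy, dx) shift of the
    reference that best predicts the target block, plus the prediction
    error (residual)."""
    want = target[y0:y0 + size, x0:x0 + size].astype(np.int32)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y0 + dy, x0 + dx
            if ry < 0 or rx < 0 or ry + size > reference.shape[0] \
                    or rx + size > reference.shape[1]:
                continue                         # candidate outside the picture
            candidate = reference[ry:ry + size, rx:rx + size].astype(np.int32)
            error = np.abs(want - candidate).sum()
            if best is None or error < best[0]:
                best = (error, (dy, dx), want - candidate)
    return best[1], best[2]

reference = np.random.randint(0, 256, (32, 32))
target = np.roll(reference, (2, 3), axis=(0, 1))   # simulate a simple pan
vector, residual = find_vector(reference, target, 8, 8)
print(vector, int(np.abs(residual).sum()))         # (-2, -3), zero residual
```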

Figure 6.10a shows that spatial redundancy is redundancy within a single image, for example, repeated pixel values in a large area of blue sky. Temporal redundancy (Figure 6.10b) exists between successive images.

Where temporal compression is used, the current picture is not sent in its entirety; instead the difference between the current picture and the previous picture is sent. The decoder already has the previous picture, and so it can add the difference to make the current picture. A difference picture is created by subtracting every pixel in one picture from the corresponding pixel in another picture. This is trivially easy in a progressively scanned system, but MPEG-2 has had to develop greater complexity so that this can also be done with interlaced pictures. The handling of interlace in MPEG will be detailed later.

A difference picture is an image of a kind, although not a viewable one, and so should contain some kind of spatial redundancy. Figure 6.10c shows that MPEG-2 takes advantage of both forms of redundancy. Picture differences are spatially compressed prior to transmission. At the decoder the spatial compression is decoded to re-create the difference picture, and this difference picture is added to the previous picture to complete the decoding process.

image

FIGURE 6.10

(a) Spatial or intra-coding works on individual images. (b) Temporal or inter-coding works on successive images. (c) In MPEG inter-coding is used to create difference images. These are then compressed spatially.

Whenever objects move they will be in a different place in successive pictures. This will result in large amounts of difference data. MPEG-2 overcomes the problem using motion compensation. The encoder contains a motion estimator, which measures the direction and distance of motion between pictures and outputs these as vectors, which are sent to the decoder. When the decoder receives the vectors it uses them to shift data in a previous picture to resemble the current picture more closely. Effectively the vectors are describing the optic flow axis of some moving screen area, along which axis the image is highly redundant. Vectors are bipolar codes that determine the amount of horizontal and vertical shift required.

In real images, moving objects do not necessarily maintain their appearance as they move; for example, objects may turn, move into shade or light, or move behind other objects. Consequently motion compensation can never be ideal and it is still necessary to send a picture difference to make up for any shortcomings in the motion compensation.

Figure 6.11 shows how this works. In addition to the motion-encoding system, the coder also contains a motion decoder. When the encoder outputs motion vectors, it also uses them locally in the same way that a real decoder will and is able to produce a predicted picture based solely on the previous picture shifted by motion vectors. This is then subtracted from the actual current picture to produce a prediction error or residual, which is an image of a kind that can be spatially compressed.

The decoder takes the previous picture, shifts it with the vectors to re-create the predicted picture, and then decodes and adds the prediction error to produce the actual picture. Picture data sent as vectors plus prediction error are said to be P coded.

The concept of sending a prediction error is a useful approach because it allows both the motion estimation and the compensation to be imperfect.

A good motion-compensation system will send just the right amount of vector data. With insufficient vector data, the prediction error will be large, but transmission of excess vector data will also cause the bit rate to rise. There will be an optimum balance, which minimizes the sum of the prediction error data and the vector data.

image

FIGURE 6.11

A motion-compensated compression system. The coder calculates motion vectors, which are transmitted as well as being used locally to create a predicted picture. The difference between the predicted picture and the actual picture is transmitted as a prediction error.

In MPEG-2 the balance is obtained by dividing the screen into areas called macroblocks, which are 16 luminance pixels square. Each macroblock is steered by a vector. The boundary of a macroblock is fixed and so the vector does not move the macroblock. Instead the vector tells the decoder where to look in another frame to find pixel data to fetch to the macroblock. Figure 6.12a shows this concept. The shifting process is generally done by modifying the read address of a RAM using the vector. This can shift by one-pixel steps. MPEG-2 vectors have half-pixel resolution, so it is necessary to interpolate between pixels from RAM to obtain half-pixel-shifted values.
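
The following sketch shows the idea of fetching a value at half-pixel resolution by interpolation. MPEG-2 actually specifies integer arithmetic with defined rounding, so this floating-point version illustrates the principle only.

```python
import numpy as np

def half_pel_fetch(picture, y2, x2):
    """Fetch a value at half-pixel resolution: y2 and x2 are in units of half
    a pixel, as MPEG-2 vectors are. Where the coordinate falls between pixels,
    the neighbouring pixels are averaged."""
    y, fy = divmod(y2, 2)
    x, fx = divmod(x2, 2)
    p = picture.astype(np.float64)
    return (p[y, x] * (2 - fy) * (2 - fx)
            + p[y, x + fx] * (2 - fy) * fx
            + p[y + fy, x] * fy * (2 - fx)
            + p[y + fy, x + fx] * fy * fx) / 4.0

block = np.arange(16.0).reshape(4, 4)
print(half_pel_fetch(block, 2, 2))   # integer position: pixel (1, 1) itself
print(half_pel_fetch(block, 3, 3))   # halfway point: average of four pixels
```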

Real moving objects will not coincide with macroblocks and so the motion compensation will not be ideal but the prediction error makes up for any shortcomings. Figure 6.12b shows the case in which the boundary of a moving object bisects a macroblock. If the system measures the moving part of the macroblock and sends a vector, the decoder will shift the entire block, making the stationary part wrong. If no vector is sent, the moving part will be wrong. Both approaches are legal in MPEG-2, because the prediction error sorts out the incorrect values.

image

image

FIGURE 6.12

(a) In motion compensation, pixel data are brought to a fixed macroblock in the target picture from a variety of places in another picture. (b) Where only part of a macroblock is moving, motion compensation is nonideal. The motion can be coded (c), causing a prediction error in the background, or the background can be coded (d), causing a prediction error in the moving object.

An intelligent coder might try both approaches to see which requires the least prediction error data.

The prediction error concept also allows the use of simple but inaccurate motion estimators in low-cost systems. The greater prediction error data are handled using a higher bit rate. On the other hand, if a precision motion estimator is available, a very high compression factor may be achieved because the prediction error data are minimized. MPEG-2 does not specify how motion is to be measured; it simply defines how a decoder will interpret the vectors. Encoder designers are free to use any motion-estimation system provided that the right vector protocol is created. Chapter 5 contrasted a number of motion estimation techniques.

Figure 6.13a shows that a macroblock contains both luminance and colour difference data at different resolutions. Most of the MPEG-2 profiles use a 4:2:0 structure, which means that the colour is down sampled by a factor of 2 in both axes. Thus in a 16 × 16 pixel block, there are only 8 × 8 colour difference sampling sites. MPEG-2 is based upon the 8 × 8 DCT (see The Discrete Cosine Transform in Chapter 3) and so the 16 × 16 block is the screen area that contains an 8 × 8 colour difference sampling block. Thus in 4:2:0 in each macroblock there are four luminance DCT blocks, one R–Y DCT block, and one B–Y DCT block, all steered by the same vector.

In the 4:2:2 profile of MPEG-2, shown in Figure 6.13b, the chroma is not down sampled vertically, and so there are twice as many chroma data in each macroblock, which is otherwise substantially the same.

I AND P CODING

Predictive (P) coding cannot be used indefinitely, as it is prone to error propagation. A further problem is that it becomes impossible to decode the transmission if reception begins part way through. In real video signals, cuts or edits can be present, across which there is little redundancy and which make motion estimators throw up their hands.

image

FIGURE 6.13

The structure of a macroblock. (A macroblock is the screen area steered by one vector.) (a) In 4:2:0, there are two chroma DCT blocks per macroblock, whereas (b) in 4:2:2 there are four; 4:2:2 needs 33% more data than 4:2:0.

In the absence of redundancy over a cut, there is nothing to be done but to send the new picture information in absolute form. This is called I coding, in which I is an abbreviation of intra-coding. As I coding needs no previous picture for decoding, then decoding can begin at I-coded information.

MPEG-2 is effectively a toolkit and there is no compulsion to use all the tools available. Thus an encoder may choose whether to use I or P coding, either once and for all or dynamically on a macroblock-by-macroblock basis.

For practical reasons, an entire frame may be encoded as I macroblocks periodically. This creates a place where the bitstream might be edited or where decoding could begin.

Figure 6.14 shows a typical application of the Simple Profile of MPEG-2. Periodically an I picture is created. Between I pictures are P pictures, which are based on the previous picture. These P pictures predominantly contain macroblocks having vectors and prediction errors. However, it is perfectly legal for P pictures to contain I macroblocks. This might be useful where, for example, a camera pan introduces new material at the edge of the screen that cannot be created from an earlier picture.

Note that although what is sent is called a P picture, it is not a picture at all. It is a set of instructions to convert the previous picture into the current picture. If the previous picture is lost, decoding is impossible. An I picture together with all the pictures before the next I picture form a GOP.

image

FIGURE 6.14

A Simple Profile MPEG-2 signal may contain periodic I pictures with a number of P pictures between.

BI-DIRECTIONAL CODING

Motion-compensated predictive coding is a useful compression technique, but it does have the drawback that it can take data only from a previous picture. When moving objects reveal a background this is completely unknown in previous pictures and forward prediction fails. However, more of the background is visible in later pictures. Figure 6.15 shows the concept. In the centre of the diagram, a moving object has revealed some background. The previous picture can contribute nothing, whereas the next picture contains all that is required.

Bi-directional coding is shown in Figure 6.16. A bi-directional or B macroblock can be created using a combination of motion compensation and the addition of a prediction error. This can be done by forward prediction from a previous picture or backward prediction from a subsequent picture. It is also possible to use an average of both forward and backward prediction. On noisy material this may result in some reduction in bit rate. The technique is also a useful way of portraying a dissolve.

The averaging process in MPEG-2 is a simple linear interpolation, which works well when only one B picture exists between the reference pictures before and after. A larger number of B pictures would require weighted interpolation but MPEG-2 does not support this.

Typically two B pictures are inserted between P pictures or between I and P pictures. As can be seen, B pictures are never predicted from one another, only from I or P pictures. A typical GOP for broadcasting purposes might have the structure IBBPBBPBBPBB. Note that the last B pictures in the GOP require the I picture in the next GOP for decoding and so the GOPs are not truly independent. Independence can be obtained by creating a closed GOP that may contain B pictures but that ends with a P picture. It is also legal to have a B picture in which every macroblock is forward predicted, needing no future picture for decoding.

image

FIGURE 6.15

In bi-directional coding the revealed background can be efficiently coded by bringing data back from a future picture.

Bi-directional coding is very powerful. Figure 6.17 is a constant quality curve showing how the bit rate changes with the type of coding. On the left, only I or spatial coding is used, whereas on the right an IBBP structure is used. This means that there are two bi-directionally coded pictures in between a spatially coded picture (I) and a forward-predicted picture (P). Note how, for the same quality, the system that uses only spatial coding needs two and a half times the bit rate that the bi-directionally coded system needs.

Clearly information in the future has yet to be transmitted and so is not normally available to the decoder. MPEG-2 gets around the problem by sending pictures in the wrong order. Picture reordering requires a delay in the encoder and a delay in the decoder to put the order right again. Thus the overall codec delay must rise when bi-directional coding is used. This is quite consistent with Figure 6.3, which showed that as the compression factor rises the latency must also rise.

Figure 6.18 shows that although the original picture sequence is IBBPBBPBBIBB…, this is transmitted as IPBBPBBIBB… so that the future picture is already in the decoder before bi-directional decoding begins. Note that the I picture of the next GOP is actually sent before the last B pictures of the current GOP.
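
The reordering rule can be sketched as follows: each B picture is simply held back until the later reference picture on which it depends has been sent.

```python
def transmission_order(display_order):
    """Reorder a GOP from display order to transmission order. Each B picture
    is held back until the later reference (I or P) picture it needs has been
    sent; B pictures are never references themselves."""
    sent, held_b = [], []
    for picture in display_order:
        if picture.startswith("B"):
            held_b.append(picture)       # cannot be sent yet
        else:
            sent.append(picture)         # the I or P reference goes first
            sent.extend(held_b)          # then the B pictures that needed it
            held_b = []
    return sent + held_b

display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9", "I10"]
print(transmission_order(display))
# ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'I10', 'B8', 'B9']
```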

Figure 6.18 also shows that the amount of data required by each picture is dramatically different. I pictures have only spatial redundancy and so need a lot of data to describe them. P pictures need fewer data because they are created by shifting the I picture with vectors and then adding a prediction error picture. B pictures need the fewest data of all because they can be created from I or P.

With pictures requiring a variable length of time to transmit, arriving in the wrong order, the decoder needs some help. This takes the form of picture-type flags and time stamps.

image

FIGURE 6.16

In bi-directional coding, a number of B pictures can be inserted between periodic forward-predicted pictures. See text.

image

FIGURE 6.17

Bi-directional coding is very powerful as it allows the same quality with only 40 percent of the bit rate of intra-coding. However, the encoding and decoding delays must increase. Coding over a longer time span is more efficient but editing is more difficult.

image

FIGURE 6.18

Comparison of pictures before and after compression showing sequence change and varying amount of data needed by each picture type. I, P, and B pictures use unequal amounts of data.

CODING APPLICATIONS

Figure 6.19 shows a variety of GOP structures. The simplest is the III… sequence in which every picture is intra-coded. Pictures can be fully decoded without reference to any other pictures and so editing is straightforward. However, this approach requires about two and a half times the bit rate of a full bi-directional system. Bi-directional coding is most useful for final delivery of postproduced material either by broadcast or on pre-recorded media, as there is then no editing requirement. As a compromise the IBIB… structure can be used, which has some of the bit rate advantage of bi-directional coding but without too much latency. It is possible to edit an IBIB stream by performing some processing. If it is required to remove the video following a B picture, that B picture could not be decoded because it needs I pictures on either side of it for bi-directional decoding. The solution is to decode the B picture first and then re-encode it with forward prediction only from the previous I picture. The subsequent I picture can then be replaced by an edit process. Some quality loss is inevitable in this process but this is acceptable in applications such as ENG and industrial video.

image

FIGURE 6.19

Various possible GOP structures used with MPEG. See text for details.

SPATIAL COMPRESSION

Spatial compression in MPEG-2 is used in I pictures on actual picture data and in P and B pictures on prediction error data. MPEG-2 uses the discrete cosine transform described in Chapter 3. The DCT works on blocks and in MPEG-2 these are 8 × 8 pixels. The macroblocks of the motion-compensation structure are designed so they can be broken down into 8 × 8 DCT blocks. In a 4:2:0 macroblock there will be six DCT blocks, whereas in a 4:2:2 macroblock there will be eight.

Figure 6.20 shows the table of basis functions or wave table for an 8 × 8 DCT. Adding these two-dimensional waveforms together in different proportions will give any original 8 × 8-pixel block. The coefficients of the DCT simply control the proportion of each wave that is added in the inverse transform. The top-left wave has no modulation at all because it conveys the DC component of the block. This coefficient will be a unipolar (positive only) value in the case of luminance and will typically be the largest value in the block as the spectrum of typical video signals is dominated by the DC component.

image

FIGURE 6.20

The discrete cosine transform breaks up an image area into discrete frequencies in two dimensions. The lowest frequency can be seen here at the top-left corner. Horizontal frequency increases to the right and vertical frequency increases downward.
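
Ignoring the normalisation constants, the wave table of Figure 6.20 can be generated directly from the definition of the DCT, as the following sketch shows.

```python
import numpy as np

def dct_pattern(u, v, n=8):
    """One entry of the wave table: the two-dimensional cosine pattern
    selected by horizontal frequency u and vertical frequency v
    (normalisation constants omitted)."""
    grid = (2 * np.arange(n) + 1) * np.pi / (2 * n)
    return np.outer(np.cos(v * grid), np.cos(u * grid))

print(dct_pattern(0, 0))             # DC: constant over the whole block
print(dct_pattern(1, 0).round(2))    # half a cycle per block, horizontally:
                                     # positive on the left, negative on the right
```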

Increasing the DC coefficient adds a constant amount to every pixel. Moving to the right, the coefficients represent increasing horizontal spatial frequencies, and moving downward, the coefficients represent increasing vertical spatial frequencies. The bottom-right coefficient represents the highest diagonal frequencies in the block. All these coefficients are bipolar, where the polarity indicates whether the original spatial waveform at that frequency was inverted.

Figure 6.21 shows a one-dimensional example of an inverse transform. The DC coefficient produces a constant level throughout the pixel block. The remaining waves in the table are AC coefficients. A zero coefficient would result in no modulation, leaving the DC level unchanged. The wave next to the DC component represents the lowest frequency in the transform, which is half a cycle per block. A positive coefficient would make the left side of the block brighter and the right side darker, whereas a negative coefficient would do the opposite. The magnitude of the coefficient determines the amplitude of the wave that is added. Figure 6.21 also shows that the next wave has a frequency of one cycle per block, i.e., the block is made brighter at both sides and darker in the middle.

image

FIGURE 6.21

A one-dimensional inverse transform. See text for details.

Consequently an inverse DCT is no more than a process of mixing various pixel patterns from the wave table in which the relative amplitudes and polarity of these patterns are controlled by the coefficients. The original transform is simply a mechanism that finds the coefficient amplitudes from the original pixel block.

The DCT itself achieves no compression at all. Sixty-four pixels are converted to 64 coefficients. However, in typical pictures, not all coefficients will have significant values; there will often be a few dominant coefficients. The coefficients representing the higher two-dimensional spatial frequencies will often be zero or of small value in large areas, due to blurring or simply plain undetailed areas before the camera.

Statistically, the farther from the top-left corner of the wave table the coefficient is, the smaller will be its magnitude. Coding gain (the technical term for reduction in the number of bits needed) is achieved by transmitting the low-valued coefficients with shorter word lengths. The zero-valued coefficients need not be transmitted at all. Thus it is not the DCT that compresses the data, it is the subsequent processing. The DCT simply expresses the data in a form that makes the subsequent processing easier.

Higher compression factors require the coefficient word length to be further reduced using requantizing. Coefficients are divided by some factor that increases the size of the quantizing step. The smaller number of steps that result permits coding with fewer bits, but, of course, with an increased quantizing error. The coefficients will be multiplied by a reciprocal factor in the decoder to return to the correct magnitude.
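
A small sketch with invented coefficient values shows the requantize/dequantize round trip and how the reconstruction error grows with step size.

```python
import numpy as np

def requantize(coefficients, step):
    """Coder: divide by the step size so fewer bits describe each coefficient."""
    return np.round(coefficients / step).astype(int)

def dequantize(levels, step):
    """Decoder: multiply by the reciprocal factor to restore the magnitude."""
    return levels * step

coefficients = np.array([312, -47, 18, -6, 3, 0, -1, 0])   # invented values
for step in (4, 16):
    restored = dequantize(requantize(coefficients, step), step)
    print(step, np.abs(coefficients - restored).max())   # error grows with step
```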

Inverse transforming a requantized coefficient means that the frequency it represents is reproduced in the output with the wrong amplitude. The difference between original and reconstructed amplitude is regarded as a noise added to the wanted data. Figure 6.22 shows that the visibility of such noise is far from uniform. The maximum sensitivity is found at DC and falls thereafter. As a result the top-left coefficient is often treated as a special case and left unchanged. It may warrant more error protection than other coefficients.

MPEG-2 takes advantage of the falling sensitivity to noise. Prior to requantizing, each coefficient is divided by a different weighting constant as a function of its frequency. Figure 6.23 shows a typical weighting process. Naturally the decoder must have a corresponding inverse weighting. This weighting process has the effect of reducing the magnitude of high-frequency coefficients disproportionately. Clearly, different weighting will be needed for colour difference data as colour is perceived differently.
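
The sketch below uses an invented weighting matrix (the actual MPEG-2 default matrices are tabulated in the standard) to show how weighting makes the effective quantizing step rise with frequency.

```python
import numpy as np

# Invented weighting matrix, for illustration only: weights rise away from
# the top-left DC position so that high-frequency coefficients are
# requantized more coarsely.
u, v = np.meshgrid(np.arange(8), np.arange(8))
weight = 16 + 6 * (u + v)
weight[0, 0] = 8                      # DC treated gently, as noted above

def weighted_requantize(block, step):
    return np.round(8 * block / (weight * step)).astype(int)    # coder

def weighted_dequantize(levels, step):
    return levels * weight * step / 8                           # decoder

block = np.zeros((8, 8))
block[0, 0], block[7, 7] = 500, 40    # a large DC value, a small HF value
out = weighted_dequantize(weighted_requantize(block, 2), 2)
print(out[0, 0], out[7, 7])           # DC survives exactly; HF is coarsened
```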

image

FIGURE 6.22

The sensitivity of the eye to noise is greatest at low frequencies and drops rapidly with increasing frequency. This can be used to mask quantizing noise caused by the compression process.

P and B pictures are decoded by adding a prediction error image to a reference image. That reference image will contain weighted noise. One purpose of the prediction error is to cancel that noise to prevent tolerance buildup. If the prediction error were also to contain weighted noise, this result would not be obtained. Consequently prediction error coefficients are flat weighted.

When forward prediction fails, such as in the case of new material introduced in a P picture by a pan, P coding would set the vectors to zero and encode the new data entirely as an unweighted prediction error. In this case it is better to encode that material as an I macroblock because then weighting can be used and this will require fewer bits.

Requantizing increases the step size of the coefficients, whereas the inverse weighting in the decoder results in step sizes that increase with frequency. The larger step size increases the quantizing noise at high frequencies where it is less visible. Effectively the noise floor is shaped to match the sensitivity of the eye. The quantizing table in use at the encoder can be transmitted to the decoder periodically in the bitstream.

image

FIGURE 6.23

Weighting is used to make the noise caused by requantizing different at each frequency.

SCANNING AND RUN-LENGTH/VARIABLE-LENGTH CODING

Study of the signal statistics gained from extensive analysis of real material is used to measure the probability of a given coefficient having a given value. This probability turns out to be highly nonuniform, suggesting the possibility of a variable-length encoding for the coefficient values. On average, the higher the spatial frequency, the lower the value of a coefficient will be. This means that the value of a coefficient falls as a function of its radius from the DC coefficient.

Typical material often has many coefficients that are zero valued, especially after requantizing. The distribution of these also follows a pattern. The nonzero values tend to be found in the top-left corner of the DCT block, but as the radius increases, not only do the coefficient values fall, but it becomes increasingly likely that these small coefficients will be interspersed with zero-valued coefficients. As the radius increases further it is probable that a region where all coefficients are zero will be entered.

MPEG-2 uses all these attributes of DCT coefficients when encoding a coefficient block. By sending the coefficients in an optimum order, by describing their values with Huffman coding, and by using run-length encoding for the zero-valued coefficients it is possible to achieve a significant reduction in coefficient data that remains entirely lossless. Despite the complexity of this process, it does contribute to improved picture quality because for a given bit rate lossless coding of the coefficients must be better than requantizing, which is lossy. Of course, for lower bit rates both will be required.

It is an advantage to scan in a sequence in which the largest coefficient values tend to come first; each successive coefficient is then more likely to be zero than the one before. With progressively scanned material, a regular zig-zag scan begins in the top-left corner and ends in the bottom-right corner as shown in Figure 6.24. Zig-zag scanning means that significant values are more likely to be transmitted first, followed by the zero values. Instead of coding the final run of zeros, a unique "end of block" (EOB) symbol is transmitted.

As the zig-zag scan approaches the last finite coefficient it is increasingly likely that some zero-value coefficients will be scanned. Instead of transmitting the coefficients as zeros, the zero-run-length, i.e., the number of zero-valued coefficients in the scan sequence, is encoded into the next nonzero coefficient, which is itself variable-length coded. This combination of run-length and variable-length coding is known as RLC/VLC in MPEG-2.
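The following Python sketch shows the principle of zig-zag scanning followed by zero-run-length coding. The (run, value) pairing mirrors the run/size idea described below, though the real MPEG-2 code tables are considerably more elaborate:

```python
def zigzag_order(n=8):
    # Enumerate (row, col) positions of an n x n block in zig-zag order:
    # diagonals of constant row+col, traversed in alternating directions.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_length_code(block):
    # Emit (zero-run, value) pairs for the AC coefficients, then 'EOB'.
    scan = [block[r][c] for r, c in zigzag_order()][1:]   # skip the DC term
    out, run = [], 0
    for v in scan:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    out.append('EOB')    # the trailing run of zeros is never sent
    return out
```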

image

FIGURE 6.24

The zig-zag scan for a progressively scanned image.

The DC coefficient is handled separately because it is differentially coded and this discussion relates to the AC coefficients. Three items need to be handled for each coefficient: the zero-run-length prior to this coefficient, the word length, and the coefficient value itself. The word length needs to be known by the decoder so that it can correctly parse the bitstream. The word length of the coefficient is expressed directly as an integer called the size.

Figure 6.25a shows that a two-dimensional run/size table is created. One dimension expresses the zero-run-length, the other the size. A run length of zero is obtained when adjacent coefficients are nonzero, but a code of 0/0 has no meaningful run/size interpretation and so this bit pattern is used for the EOB symbol.

In the case in which the zero-run-length exceeds 14, a code of 15/0 is used, signifying that there are 15 zero-valued coefficients. This is then followed by another run/size parameter whose run-length value is added to the previous 15.

The run/size parameters contain redundancy because some combinations are more common than others. Figure 6.25b shows that each run/size value is converted to a variable-length Huffman code word for transmission. The Huffman codes are designed so that short codes are never a prefix of long codes, so that the decoder can deduce the parsing by testing an increasing number of bits until a match with the lookup table is found. Having parsed and decoded the Huffman run/size code, the decoder then knows what the coefficient word length will be and can correctly parse that.
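A toy illustration of prefix-code parsing in Python. The code words here are invented; the point is only that because no code is a prefix of another, the decoder can extend its window one bit at a time until a match is found:

```python
# Invented prefix-free table mapping code words to (run, size) pairs.
TABLE = {'10': (0, 2), '110': (1, 1), '1110': (2, 1), '0': 'EOB'}

def parse(bits):
    word = ''
    for b in bits:
        word += b
        if word in TABLE:        # extend until a code word matches
            yield TABLE[word]
            word = ''

print(list(parse('1101011100')))   # [(1, 1), (0, 2), (2, 1), 'EOB']
```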

image

FIGURE 6.25

Run-length and variable-length coding simultaneously compresses runs of zero-valued coefficients and describes the word length of a non-zero coefficient.

The variable-length coefficient code has to describe a bipolar coefficient, i.e., one that can be positive or negative. Figure 6.25c shows that for a particular size, the coding scale has a certain gap in it. For example, all values from −7 to +7 can be sent with codes of size 3 or less, so a size 4 code has to send only the values −15 to −8 and +8 to +15. The coefficient code is sent as a pure binary number whose value ranges from all zeros to all ones, where the maximum value is a function of the size. The number range is divided into two, the lower half of the codes specifying negative values and the upper half specifying positive values.

In the case of positive numbers, the transmitted binary value is the actual coefficient value, whereas in the case of negative numbers a constant must be subtracted that is a function of the size. In the case of a size 4 code, the constant is 15 (decimal). Thus a size 4 parameter of 0111 (binary), i.e., 7 (decimal), would be interpreted as 7 − 15 = −8. A size 5 code has a constant of 31, so a transmitted code of 01010 (binary), i.e., 10 (decimal), would be interpreted as 10 − 31 = −21.
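The decoder-side interpretation might be sketched as follows; the constant for a given size is 2^size − 1, consistent with the examples above:

```python
def decode_coefficient(size, code):
    # 'code' is the transmitted pure binary value for a coefficient of the
    # given size. The upper half of the number range is positive; values in
    # the lower half are negative after subtracting the constant 2**size - 1.
    if code >= 1 << (size - 1):
        return code                      # positive: sent directly
    return code - ((1 << size) - 1)      # negative: subtract the constant

print(decode_coefficient(4, 0b0111))   # 7 - 15 = -8
print(decode_coefficient(5, 0b01010))  # 10 - 31 = -21
print(decode_coefficient(4, 0b1000))   # +8
```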

image

FIGURE 6.25

This technique saves a bit because, for example, the 63 values from −31 to +31 can all be sent with codes no longer than 5 bits, even though 5 bits offer only 32 combinations. This is possible because the extra bit is effectively encoded into the run/size parameter.

Figure 6.26 shows the whole spatial coding subsystem. Macroblocks are subdivided into DCT blocks and the DCT is calculated. The resulting coefficients are multiplied by the weighting matrix and then requantized. The coefficients are then reordered by the zig-zag scan so that full advantage can be taken of run-length and variable-length coding. The last non-zero coefficient in the scan is followed by the EOB symbol.

image

FIGURE 6.26

A complete spatial coding system, which can compress an I picture or the prediction error in P and B pictures. See text for details.

In predictive coding, sometimes the motion-compensated prediction is nearly exact and so the prediction error will be almost zero. This can also happen on still parts of the scene. MPEG-2 takes advantage of this by sending a code to tell the decoder there is no prediction error data for the macroblock concerned.

The success of temporal coding depends on the accuracy of the vectors. Trying to reduce the bit rate by reducing the accuracy of the vectors is false economy, as this simply increases the prediction error. Consequently, for a given GOP structure, it is only in the spatial coding that the overall bit rate can be adjusted. The RLC/VLC coding is lossless and so its contribution to the compression cannot be varied. If the bit rate is too high, the only option is to increase the size of the coefficient-requantizing steps. This has the effect of shortening the word length of large coefficients and rounding small coefficients to zero, so that the bit rate goes down. Clearly if taken too far the picture quality will suffer because at some point the noise floor will become visible as some form of artifact.

A BI-DIRECTIONAL CODER

MPEG-2 does not specify how an encoder is to be built or what coding decisions it should make. Instead it specifies the protocol of the bitstream at the output. As a result the coder shown in Figure 6.27 is only an example.

Figure 6.27a shows the component parts of the coder. At the input is a chain of picture stores, which can be bypassed for re-ordering purposes. This allows a picture to be encoded ahead of its normal timing when bi-directional coding is employed.

At the centre is a dual motion estimator, which can simultaneously measure motion between the input picture, an earlier picture, and a later picture. These reference pictures are held in frame stores. The vectors from the motion estimator are locally used to shift a picture in a frame store to form a predicted picture. This is subtracted from the input picture to produce a prediction error picture, which is then spatially coded.

image

FIGURE 6.27

A bi-directional coder. (a) The essential components.

The bi-directional encoding process will now be described. A GOP begins with an I picture, which is intra-coded. In Figure 6.27b the I picture emerges from the reordering delay. No prediction is possible on an I picture so the motion estimator is inactive. There is no predicted picture and so the prediction error subtractor is set simply to pass the input. The only processing that is active is the forward spatial coder, which describes the picture with DCT coefficients. The output of the forward spatial coder is locally decoded and stored in the past picture frame store.

The reason for the spatial encode/decode is that the past picture frame store now contains exactly what the decoder frame store will contain, including the effects of any requantizing errors. When the same picture is used as a reference at both ends of a differential coding system, the errors will cancel out.

image

FIGURE 6.27

(b) Signal flow when coding an I picture.

Having encoded the I picture, attention turns to the P picture. The input sequence is IBBP, but the transmitted sequence must be IPBB. Figure 6.27c shows that the reordering delay is bypassed to select the P picture. This passes to the motion estimator, which compares it with the I picture and outputs a vector for each macroblock. The forward predictor uses these vectors to shift the I picture so that it more closely resembles the P picture. The predicted picture is then subtracted from the actual picture to produce a forward prediction error. This is then spatially coded. Thus the P picture is transmitted as a set of vectors and a prediction error image.

The P picture is locally decoded in the right-hand decoder. This takes the forward-predicted picture and adds the decoded prediction error to obtain exactly what the decoder will obtain.

image

FIGURE 6.27

(c) Signal flow when coding a P picture.

Figure 6.27d shows that the encoder now contains an I picture in the left store and a P picture in the right store. The reordering delay is reselected so that the first B picture can be input. This passes to the motion estimator by which it is compared with both the I and the P pictures to produce forward and backward vectors. The forward vectors go to the forward predictor to make a B prediction from the I picture. The backward vectors go to the backward predictor to make a B prediction from the P picture. These predictions are simultaneously subtracted from the actual B picture to produce a forward prediction error and a backward prediction error. These are then spatially encoded. The encoder can then decide which direction of coding resulted in the best prediction, i.e., the smallest prediction error.

Not shown in the interests of clarity is a third signal path, which creates a predicted B picture from the average of forward and backward predictions. This is subtracted from the input picture to produce a third prediction error. In some circumstances this prediction error may use fewer data than either forward or backward prediction alone.
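Because MPEG-2 standardises only the bitstream, how the encoder makes this decision is left to the designer. One plausible sketch simply compares the prediction error of each candidate, using the sum of absolute differences as the error measure (an assumption, not a requirement of the standard):

```python
import numpy as np

def choose_prediction(actual, forward_pred, backward_pred):
    # Compare forward, backward, and averaged predictions of a picture
    # (2D numpy arrays) and return whichever leaves the smallest error.
    candidates = {
        'forward': forward_pred,
        'backward': backward_pred,
        'interpolated': (forward_pred + backward_pred) / 2,
    }
    errors = {name: np.abs(actual - pred).sum()
              for name, pred in candidates.items()}
    return min(errors, key=errors.get)
```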

image

FIGURE 6.27

(d) Signal flow when bi-directional coding.

As B pictures are never used to create other pictures, the encoder does not need to decode them locally. After decoding and displaying a B picture, the decoder will discard it. At the encoder, the I and P pictures remain in their frame stores and the second B picture is input from the reordering delay.

Following the encoding of the second B picture, the encoder must reorder again to encode the second P picture in the GOP. This will locally be decoded and will replace the I picture in the left store. The stores and predictors switch designation because the left store is now a future P picture and the right store is now a past P picture. B pictures between them are encoded as before.

SLICES

There is still some redundancy in the output of a bi-directional coder and MPEG-2 is remarkably diligent in finding it. In I pictures, the DC coefficient describes the average brightness of an entire DCT block. In real video the DC component of adjacent blocks will be similar much of the time. A saving in bit rate can be obtained by differentially coding the DC coefficient.

In P and B pictures this is not done because these are prediction errors, not actual images, and the statistics are different. However, P and B pictures send vectors, and the redundancy in these is exploited instead. In a large moving object, many macroblocks will be moving at the same velocity and their vectors will be the same. Thus differential vector coding will be advantageous.

As has been seen, differential coding cannot be used indiscriminately as it is prone to error propagation. Periodically absolute DC coefficients and vectors must be sent and the slice is the logical structure that supports this mechanism. In I pictures, the first DC coefficient in a slice is sent in absolute form, whereas the subsequent coefficients are sent differentially. In P or B pictures, the first vector in a slice is sent in absolute form, but the subsequent vectors are differential.
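The mechanism is ordinary differential coding restarted at each slice, as in this sketch (the values are invented):

```python
def code_slice(values):
    # The first value in the slice is absolute; the rest are differences
    # from their predecessor, which are typically small and cheap to code.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def decode_slice(coded):
    out = [coded[0]]
    for d in coded[1:]:
        out.append(out[-1] + d)         # add each difference back on
    return out

print(code_slice([118, 120, 121, 121, 90]))   # [118, 2, 1, 0, -31]
```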

Slices are horizontal picture strips that are one macroblock (16 pixels) high and that proceed from left to right across the screen. The sides of the picture must coincide with the beginning or the end of a slice in MPEG-2, but otherwise the encoder is free to decide how big slices should be and where they begin.

In the case of a central dark building silhouetted against a bright sky, there would be two large changes in the DC coefficients, one at each edge of the building. It may be advantageous to the encoder to break the width of the picture into three slices, one each for the left and right areas of sky and one for the building. In the case of a large moving object, different slices may be used for the object and the background.

Each slice contains its own synchronising pattern, so after a transmission error, correct decoding can resume at the next slice. Slice size can also be matched to the characteristics of the transmission channel. For example, in an error-free transmission system the use of a large number of slices in a packet simply wastes data capacity on surplus synchronising patterns. However, in a non-ideal system it might be advantageous to have frequent resynchronising.

HANDLING INTERLACED PICTURES

Spatial coding, predictive coding, and motion compensation can still be performed using interlaced source material at the cost of considerable complexity. Despite that complexity, MPEG-2 cannot be expected to perform as well with interlaced material.

Figure 6.28 shows that in an incoming interlaced frame there are two fields, each of which contains half of the lines in the frame. In MPEG-2 these are known as the top field and the bottom field. In video from a camera, these fields represent the state of the image at two different times. When there is little image motion, this is unimportant and the fields can be combined, obtaining more effective compression. However, in the presence of motion the fields become increasingly decorrelated because of the displacement of moving objects from one field to the next.

This characteristic determines that MPEG-2 must be able to handle fields independently or together. This dual approach permeates all aspects of MPEG-2 and affects the definition of pictures, macroblocks, DCT blocks, and zig-zag scanning.

Figure 6.28 also shows how MPEG-2 designates interlaced fields. In picture types I, P, and B, the two fields can be superimposed to make a frame picture or the two fields can be coded independently as two field pictures. As a third possibility, in I pictures only, the bottom field picture can be predictively coded from the top field picture to make an IP frame picture.

image

FIGURE 6.28

An interlaced frame consists of top and bottom fields. MPEG-2 can code a frame in the ways shown here.

A frame picture is one in which the macroblocks contain lines from both field types over a picture area 16 scan lines high. Each luminance macroblock contains the usual four DCT blocks but there are two ways in which these can be assembled. Figure 6.29a shows how a frame is divided into frame DCT blocks. This is identical to the progressive scan approach in that each DCT block contains 8 contiguous picture lines. In 4:2:0, the colour difference signals have been down sampled by a factor of 2 and shifted as was shown in Chapter 4. Figure 6.29a also shows how one 4:2:0 DCT block contains the chroma data from 16 lines in two fields.

Even small amounts of motion in any direction can destroy the correlation between odd and even lines and a frame DCT will result in an excessive number of coefficients. Figure 6.29b shows that instead the luminance component of a frame can also be divided into field DCT blocks. In this case one DCT block contains odd lines and the other contains even lines. In this mode the chroma still produces one DCT block from both fields as in Figure 6.29a.

When an input frame is designated as two field pictures, the macroblocks come from a screen area that is 32 lines high. Figure 6.29c shows that the DCT blocks contain the same data as if the input frame had been designated a frame picture but with field DCT. Consequently it is only frame pictures that have the option of field or frame DCT. These may be selected by the encoder on a macroblock-by-macroblock basis and, of course, the resultant bitstream must specify what has been done.

In a frame that contains a small moving area, it may be advantageous to encode as a frame picture with frame DCT except in the moving area where field DCT is used. This approach may result in fewer bits than coding as two field pictures. In a field picture and in a frame picture using field DCT, a DCT block contains lines from one field type only and this must have come from a screen area 16 scan lines high, whereas in progressive scan and frame DCT the area is only 8 scan lines high. A given vertical spatial frequency in the image is sampled at points twice as far apart, which is interpreted by the field DCT as a doubled spatial frequency, whereas there is no change in the horizontal spectrum.

Following the DCT calculation, the coefficient distribution will be different in field pictures and field DCT frame pictures. In these cases, the probability of coefficients is not a constant function of radius from the DC coefficient as it is in progressive scan, but is elliptical, in which the ellipse is twice as high as it is wide.

Using the standard 45° zig-zag scan with this different coefficient distribution would not have the required effect of putting all the significant coefficients at the beginning of the scan. To achieve this requires a different zig-zag scan, which is shown in Figure 6.30. This scan, sometimes known as the Yeltsin walk, attempts to match the elliptical probability of interlaced coefficients with a scan slanted at 67.5° to the vertical. This is clearly sub-optimal and is one of the reasons MPEG-2 does not work so well with interlaced video.

Motion estimation is more difficult in an interlaced system. Vertical detail can result in differences between fields and this reduces the quality of the match. Fields are vertically subsampled without filtering and so contain alias products. This aliasing will mean that the vertical waveform representing a moving object will not be the same in successive pictures and this will also reduce the quality of the match.

Even when the correct vector has been found, the match may be poor, so the estimator fails to recognize it. If it is recognized, a poor match means that the quality of the prediction in P and B pictures will be poor and so a large prediction error or residual has to be transmitted. In an attempt to reduce the residual, MPEG-2 allows field pictures to use motion-compensated prediction from either the adjacent field or the same field type in another frame. In this case the encoder will use the better match. This technique can also be used in areas of frame pictures that use field DCT.

image

image

FIGURE 6.29

(a) In frame DCT, a picture is effectively de-interlaced. (b) In field DCT, each DCT block contains lines from only one field, but over twice the screen area. (c) The same DCT content results when field pictures are assembled into blocks.

The motion compensation of MPEG-2 has half-pixel resolution and this is inherently compatible with interlace because an interpolator must be present to handle the half-pixel shifts. Figure 6.31a shows that in an interlaced system, each field contains half of the frame lines and so interpolating halfway between lines of one field type will actually create values lying on the sampling structure of the other field type. Thus it is equally possible for a predictive system to decode a given field type based on pixel data from the other field type or of the same type.

If when using predictive coding from the other field type the vertical motion vector contains a half-pixel component, then no interpolation is needed because the act of transferring pixels from one field to another results in such a shift.

Figure 6.31b shows that a macroblock in a given P field picture can be encoded using a vector that shifts data from the previous field or from the field before that, irrespective of which frames these fields occupy. As noted above, field-picture macroblocks come from an area of screen 32 lines high and this means that the vector density is halved, resulting in larger prediction errors at the boundaries of moving objects.

As an option, field pictures can restore the vector density by using 16 × 8 motion compensation, in which separate vectors are used for the top and bottom halves of the macroblock. Frame pictures can also use 16 × 8 motion compensation in conjunction with field DCT. Whilst the 2 × 2 DCT block luminance structure of a macroblock can easily be divided vertically in two, in 4:2:0 the same screen area is represented by only one chroma DCT block of each component type. As it cannot be divided in half, this chroma is deemed to belong to the luminance DCT blocks of the upper field. In 4:2:2 no such difficulty arises.

MPEG-2 supports interlace simply because interlaced video exists in legacy systems and there is a requirement to compress it. However, when the opportunity arises to define a new system, interlace should be avoided. Legacy interlaced source material should be handled using a motion-compensated de-interlacer prior to compression in the progressive domain.

image

FIGURE 6.30

The zig-zag scan for an interlaced image has to favour vertical frequencies twice as much as horizontal.

image

FIGURE 6.31

(a) Each field contains half of the frame lines and so interpolation is needed to create values lying on the sampling structure of the other field type. (b) Prediction can use data from the previous field or the one before that.

AN MPEG-2 CODER

Figure 6.32 shows the complete coder. The bi-directional coder outputs coefficients and vectors and the quantizing table in use. The vectors of P and B pictures and the DC coefficients of I pictures are differentially encoded in slices and the remaining coefficients are RLC/VLC coded. The multiplexer assembles all these data into a single bitstream called an elementary stream. The output of the encoder is a buffer that absorbs the variations in bit rate between different picture types. The buffer output has a constant bit rate determined by the demand clock. This comes from the transmission channel or storage device. If the bit rate is low, the buffer will tend to fill up, whereas if it is high the buffer will tend to empty. The buffer content is used to control the severity of the requantizing in the spatial coders. The more the buffer fills, the bigger the requantizing steps get.
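A crude sketch of the feedback idea follows. The linear law relating buffer fullness to step size is assumed purely for illustration; real encoders use considerably more refined rate-control models:

```python
def quantizer_step(occupancy, buffer_size, min_step=2, max_step=62):
    # The fuller the output buffer, the bigger the requantizing step,
    # which shrinks the coefficient data and lets the buffer drain.
    fullness = occupancy / buffer_size
    return round(min_step + fullness * (max_step - min_step))

print(quantizer_step(900, 1800))   # half full -> 32
```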

image

FIGURE 6.32

An MPEG-2 coder. See text for details.

The buffer in the decoder has a finite capacity and the encoder must model the occupancy of the decoder's buffer so that it neither overflows nor underflows. An overflow might occur if an I picture is transmitted when the buffer content is already high. The buffer occupancy of the decoder depends somewhat on the memory access strategy of the decoder. Instead of defining a specific buffer size, MPEG-2 defines the size of a particular mathematical model of a hypothetical buffer. The decoder designer can use any strategy that implements the model, and the encoder can use any strategy that does not overflow or underflow the model. The elementary stream has a parameter called the video buffer verifier (VBV), which defines the minimum buffering assumptions of the encoder. Buffering is one way of ensuring constant quality when picture entropy varies. An intelligent coder may run down the buffer contents in anticipation of a difficult picture sequence so that a large amount of data can be sent.

MPEG-2 does not define what a decoder should do if a buffer underflow or overflow occurs, but because both irrecoverably lose data it is obvious that there will be more or less of an interruption in the decoding. Even a small loss of data may cause loss of synchronisation and in the case of a long GOP the lost data may make the rest of the GOP undecodable. A decoder may choose to repeat the last properly decoded picture until it can begin to operate correctly again.

Buffer problems occur if the VBV model is violated. If this happens, more than one underflow or overflow can result from a single violation. Switching an MPEG bitstream can cause a violation because the two encoders concerned may have radically different buffer occupancy at the switch.

THE ELEMENTARY STREAM

Figure 6.33 shows the structure of the elementary stream from an MPEG-2 encoder. The structure begins with a set of coefficients representing a DCT block. Six or eight DCT blocks form the luminance and chroma content of one macroblock. In P and B pictures a macroblock will be associated with a vector for motion compensation. Macroblocks are associated into slices in which DC coefficients of I pictures and vectors in P and B pictures are differentially coded. An arbitrary number of slices forms a picture and this needs I/P/B flags describing the type of picture it is. The picture may also have a global vector that efficiently deals with pans.

Several pictures form a Group of Pictures (GOP). The GOP begins with an I picture and may or may not include P and B pictures in a structure that may vary dynamically.

image

FIGURE 6.33

The structure of an elementary stream. MPEG defines the syntax precisely.

Several GOPs form a sequence, which begins with a sequence header containing important data to help the decoder. The header may be repeated within a sequence, which helps the decoder lock up in random-access applications. The sequence header describes the MPEG-2 profile and level, whether the video is progressive or interlaced, whether the chroma is 4:2:0 or 4:2:2, the size of the picture, and the aspect ratio of the pixels. The quantizing matrix used in the spatial coder can also be sent. The sequence begins with a standardised bit pattern, which is detected by a decoder to synchronise the de-serialization.

AN MPEG-2 DECODER

The decoder is defined only by implication from the definitions of syntax, and any decoder that can correctly interpret all combinations of syntax at a particular profile will be deemed compliant, however it works. The first problem a decoder has is that the input is an endless bitstream that contains a huge range of parameters, many of which have variable length. Unique synchronising patterns must be placed periodically throughout the bitstream so that the decoder can identify a known starting point. The pictures that can be sent under MPEG-2 are so flexible that the decoder must first find a sequence header so that it can establish the size of the picture, the frame rate, the colour coding used, etc.

The decoder must also be supplied with a 27 MHz system clock. In a DVD player, this would come from a crystal, but in a transmission system this would be provided by a numerically locked loop running from the program clock reference parameter in the bitstream (see Chapter 10). Until this loop has achieved lock the decoder cannot function properly.

Figure 6.34 shows a bi-directional decoder. The decoder can begin decoding only with an I picture and, as this uses only intra-coding, there will be no vectors. An I picture is transmitted as a series of slices. These slices begin with subsidiary synchronising patterns. The first macroblock in the slice contains an absolute DC coefficient, but the remaining macroblocks code the DC coefficient differentially, so the decoder must add each differential value to the previous absolute value to obtain the new absolute value.

The AC coefficients are sent as Huffman-coded run/size parameters followed by coefficient value codes. The variable-length Huffman codes are decoded by using a lookup table and extending the number of bits considered until a match is obtained. This allows the zero-run-length and the coefficient size to be established. The right number of bits is taken from the bitstream corresponding to the coefficient code and this is decoded to the actual coefficient using the size parameter.

image

FIGURE 6.34

A bi-directional MPEG-2 decoder. See text for details.

If the correct number of bits has been taken from the stream, the next bit must be the beginning of the next run/size code and so on until the EOB symbol is reached. The decoder uses the coefficient values and the zero-run-lengths to populate a DCT coefficient block following the appropriate zig-zag scanning sequence. Following EOB, the bitstream then continues with the next DCT block. Clearly this Huffman decoding will work perfectly or not at all. A single bit slippage in synchronism or a single corrupted data bit can cause a spectacular failure.

Once a complete DCT coefficient block has been received, the coefficients need to be inverse quantized and inverse weighted. Then an inverse DCT can be performed and this will result in an 8 × 8-pixel block. A series of DCT blocks will allow the luminance and colour information for an entire macroblock to be decoded and this can be placed in a frame store. Decoding continues in this way until the end of the slice, when an absolute DC coefficient will once again be sent. Once all the slices have been decoded, an entire picture will be resident in the frame store.
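The decode path for one block can be sketched as follows, using SciPy's inverse DCT as a stand-in for the 8 × 8 inverse transform; details of real MPEG-2 inverse quantization, such as mismatch control, are again omitted:

```python
import numpy as np
from scipy.fft import idctn

def decode_block(levels, step, weight_matrix):
    # Inverse quantize (multiply by the step), inverse weight (multiply by
    # the weighting matrix), then inverse DCT back to an 8 x 8 pixel block.
    coeffs = levels * step * weight_matrix
    return idctn(coeffs, norm='ortho')
```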

The amount of data needed to decode the picture is variable and the decoder just keeps going until the last macroblock is found, obtaining data from the input buffer as required. In a constant bit rate transmission system, the decoder will remove more data to decode an I picture than arrive in one picture period, leaving the buffer emptier than it began. Subsequent P and B pictures need far fewer data and allow the buffer to fill again. The picture will be output when the time stamp sent with the picture matches the state of the decoder's time count.

Following the I picture may be another I picture or a P picture. Assuming a P picture, this will be predictively coded from the I picture. The P picture will be divided into slices as before. The first vector in a slice is absolute, but subsequent vectors are sent differentially. However, the DC coefficients are not differential.

Each macroblock may contain a forward vector. The decoder uses this to shift pixels from the I picture into the correct position for the predicted P picture. The vectors have half-pixel resolution and when a half-pixel shift is required, an interpolator will be used.

The DCT data are sent much as for an I picture; they will require inverse quantizing, but not inverse weighting because P and B coefficients are flat-weighted. When decoded this represents an error-cancelling picture, which is added pixel by pixel to the motion-predicted picture. This results in the output picture.

If bi-directional coding is being used, the P picture may be stored until one or more B pictures have been decoded. The B pictures are sent essentially as a P picture might be, except that the vectors can be forward, backward, or bi-directional. The decoder must take pixels from the I picture, the P picture, or both and shift them according to the vectors to make a predicted picture. The DCT data decode to produce an error-cancelling image as before.

In an interlaced system, the prediction mechanism may alternatively obtain pixel data from the previous field or the field before that. Vectors may relate to macroblocks or to 16 × 8-pixel areas. DCT blocks after decoding may represent frame lines or field lines. This adds up to a lot of different possibilities for a decoder handling an interlaced input.

MPEG-4 AND ADVANCED VIDEO CODING (AVC)

MPEG-4 advances the coding art in a number of ways. Whereas MPEG-1 and MPEG-2 were directed only to coding the video pictures that resulted from shooting natural scenes or from computer synthesis, MPEG-4 moves farther back in the process of how those scenes are created. For example, the rotation of a detailed three-dimensional object before a video camera produces huge changes in the video from picture to picture, which MPEG-2 would find difficult to code. Instead, if the three-dimensional object is re-created at the decoder, rotation can be portrayed by transmitting a trivially small amount of vector data. If the object is synthetic, effectively the synthesis or rendering process is completed in the decoder. However, a suitable if complex image processor at the encoder could identify such objects in natural scenes.

An MPEG-4 object is defined as a part of a scene that can independently be accessed or manipulated. An object is an entity that exists over a certain time span. The pictures of conventional imaging become object planes in MPEG-4. When an object intersects an object plane, it can be described by the coding system using intra-coding, forward prediction, or bi-directional prediction.

Figure 6.35 shows that MPEG-4 has four object types. A video object is an arbitrarily shaped planar pixel array describing the appearance or texture of part of a scene. A still texture object or sprite is a planar video object in which there is no change with respect to time. A mesh object describes a two- or three-dimensional shape as a set of points. The shape and its position can change with respect to time. Using computer graphics techniques, texture can be mapped onto meshes, a process known as warping, to produce rendered images. Using two-dimensional warping, a still texture object can be made to move. In three-dimensional graphic rendering, mesh coding allows an arbitrary solid shape to be created, which is then covered with texture.

Perspective computation then allows this three-dimensional object to be viewed in correct perspective from any viewpoint. MPEG-4 provides tools to allow two- or three-dimensional meshes to be created in the decoder and then oriented by vectors. Changing the vectors then allows realistic moving images to be created with an extremely low bit rate.

image

FIGURE 6.35

In MPEG-4 four types of objects are coded.

Face and body animation is a specialized subset of three-dimensional mesh coding in which the mesh represents a human face and/or body. As the subject moves, carefully defined vectors carry changes of expression, which allow rendering of an apparently moving face and/or body that has been almost entirely synthesized from a single still picture.

In addition to object coding, MPEG-4 refines the existing MPEG tools by increasing the efficiency of a number of processes using lossless prediction. AVC extends this concept further still. This improves the performance of both the motion compensation and the coefficient coding, allowing either a lower bit rate or improved quality. MPEG-4 also extends the idea of scalability introduced in MPEG-2. Multiple scalability is supported, in which a low-bit rate base-level picture may optionally be enhanced by adding information from one or more additional bitstreams. This approach is useful in network applications in which the content creator cannot know the bandwidth that a particular user will have available. Scalability allows the best quality in the available bandwidth.

Although most of the spatial compression of MPEG-4 is based on the DCT as in earlier MPEG standards, MPEG-4 also introduces wavelet coding of still objects. Wavelets are advantageous in scalable systems because they naturally decompose the original image into various resolutions.

In contrast to the rest of MPEG-4, AVC is intended for use with entire pictures and as such is more of an extension of MPEG-2. AVC adds refinement to the existing coding tools of MPEG and also introduces some new ones. The emphasis is on lossless coding to obtain performance similar to that of MPEG-2 at around half the bit rate.

TEXTURE CODING

In MPEG-1 and MPEG-2 the only way of representing an image is with pixels, and this needed no name. In MPEG-4 there are various types of image description tools and it becomes necessary to give the pixel representation of the earlier standards a name. This is texture coding, which is the part of MPEG-4 that operates on pixel-based areas of image. Coming later than MPEG-1 and MPEG-2, the MPEG-4 and AVC texture-coding systems can afford additional complexity in the search for higher performance. Figure 6.36 contrasts MPEG-2, MPEG-4, and AVC. Figure 6.36a shows the texture decoding system of MPEG-2, whereas (b) shows MPEG-4 and (c) shows AVC. The latter two are refinements of the earlier technique. These refinements are lossless in that the reduction in bit rate they allow does not result in a loss of quality. When inter-coding, there is always a compromise needed over the quantity of vector data. Clearly if the area steered by each vector is smaller, the motion compensation is more accurate, but the reduction in residual data is offset by the increase in vector data.

MPEG-2 (a) | MPEG-4 (b) | AVC (c)
DCT | DCT | Small-block-size integer coefficient transform
DC coefficient prediction in slice | DC coefficient prediction; AC coefficient prediction | Spatial prediction with edge direction adaptation
Single coefficient scan | 3 coefficient scans | Multiple coefficient scans
Fixed resolution weighting | Reduced resolution weighting | Reduced resolution weighting
VLC/RLC | VLC/RLC |
Differential vectors in slice, one vector/MB | Vector prediction, up to 4 vectors/MB | Vector prediction, up to 16 vectors/MB

FIGURE 6.36

(a) The texture coding system of MPEG-2. (b) Texture coding in MPEG-4. (c) Texture coding in AVC.

In MPEG-1 and MPEG-2 only a small amount of vector compression is used. In contrast, MPEG-4 and AVC use advanced forms of lossless vector compression, which can, without any bit rate penalty, increase the vector density to one vector per DCT block in MPEG-4 and to one vector per 4 × 4-pixel block in AVC. AVC also allows quarter-pixel-accurate vectors. In inter-coded pictures the prediction of the picture is improved so that the residual to be coded is smaller. When intra-coding, MPEG-4 looks for further redundancy between coefficients using prediction. When a given DCT block is to be intra-coded, certain of its coefficients will be predicted from adjacent blocks.

The choice of the most appropriate block is made by measuring the picture gradient, defined as the rate of change of the DC coefficient. Figure 6.37a shows that the three adjacent blocks, A, B, and C, are analysed to decide whether to predict from the DCT block above (vertical prediction) or to the left (horizontal prediction). Figure 6.37b shows that in vertical prediction the top row of coefficients is predicted from the block above so that only the differences between them need to be coded.

image

FIGURE 6.37

(a) DC coefficients are used to measure the picture gradient. (b) In vertical prediction the top row of coefficients is predicted using those above as a basis. (c) In horizontal prediction the left column of coefficients is predicted from those to the left.

Figure 6.37c shows that in horizontal prediction the left column of coefficients is predicted from the block on the left so that again only the differences need be coded. Choosing the blocks above and to the left is important because these blocks will already be available in both the encoder and the decoder. By making the same picture gradient measurement, the decoder can establish whether vertical or horizontal prediction has been used and so no flag is needed in the bitstream.
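A sketch of the gradient decision follows, with A the block to the left of the current block, B the block above-left, and C the block above. Both ends run the same test, so no flag is needed; the comparison shown follows the usual MPEG-4 DC prediction rule, but should be read as illustrative:

```python
def prediction_direction(dc_a, dc_b, dc_c):
    # Compare DC gradients of previously decoded neighbours: A (left),
    # B (above-left), C (above). A smaller horizontal gradient suggests
    # predicting from the block above (vertical prediction).
    if abs(dc_a - dc_b) < abs(dc_b - dc_c):
        return 'vertical'     # predict the top row from block C above
    return 'horizontal'       # predict the left column from block A
```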

Some extra steps are needed to handle the top row and the left column of a picture or object when true prediction is impossible. In these cases both encoder and decoder assume standardised constant values for the missing prediction coefficients. The picture gradient measurement determines the direction in which there is the least change from block to block. There will generally be fewer DCT coefficients present in this direction. There will be more coefficients in the other axis, where there is more change. Consequently it is advantageous to alter the scanning sequence so that the coefficients that are likely to exist are transmitted earlier in the sequence.

Figure 6.38 shows the two alternate scans for MPEG-4. The alternate horizontal scan concentrates on horizontal coefficients early in the scan and will be used in conjunction with vertical prediction. Conversely the alternate vertical scan concentrates on vertical coefficients early in the scan and will be used in conjunction with horizontal prediction. The decoder can establish which scan has been used in the encoder from the picture gradient.

image

FIGURE 6.38

The alternate zig-zag scans employed with vertical and horizontal prediction.

Coefficient prediction is not employed when inter-coding because the statistics of residual images are different. Instead of attempting to predict residual coefficients, in inter-coded texture, pixel-based prediction may be used to reduce the magnitude of texture residuals. This technique is known as overlapped block motion compensation (OBMC) and is used only in P-VOPs (predicted video object planes). With only one vector per DCT block, clearly in many cases the vector cannot apply to every pixel in the block. If the vector is considered to describe the motion of the centre of the block, the vector accuracy falls toward the edge of the block. A pixel in the corner of a block is almost equidistant from a vector in the centre of an adjacent block.

OBMC uses vectors from adjacent blocks, known as remote vectors, in addition to the vector of the current block for prediction. Figure 6.39 shows that the motion-compensation process of MPEG-1 and MPEG-2, which uses a single vector, is modified by the addition of the pixel prediction system, which considers three vectors. A given pixel in the block to be coded is predicted from the weighted sum of three motion-compensated pixels taken from the previous I- or P-VOP. One of these pixels is obtained in the normal way by accessing the previous VOP with a shift given by the vector of this block. The other two are obtained by accessing the same VOP pixels using the remote vectors of two adjacent blocks. The remote vectors that are used and the weighting factors are both a function of the pixel position in the block. Figure 6.39 shows that the block to be coded is divided into quadrants. The remote vectors are selected from the blocks closest to the quadrant in which the pixel resides.

image

FIGURE 6.39

(a) MPEG-4 inter-coded pixel values may be predicted from three remote vectors as well as the true vector. (b) The three remote vectors are selected according to the macroblock quadrant. Macroblocks may have one or four vectors and the prediction mechanism allows prediction between blocks of either type.

For example, a pixel in the bottom-right quadrant would be predicted using remote vectors from the DCT block immediately below and the block immediately to the right. Not all blocks can be coded in this way. In P-VOPs it is permissible to have blocks that are not coded or intrablocks that contain no vector. Remote vectors will not all be available at the boundaries of a VOP. In the normal sequence of macroblock transmission, vectors from macroblocks below the current block are not yet available. Some additional steps are needed to handle these conditions. Adjacent to boundaries where a remote vector is not available it is replaced by a copy of the actual vector.

This is also done when an adjacent block is intra-coded and for blocks at the bottom of a macroblock when the vectors for the macroblocks below will not be available yet. In the case of noncoded blocks the remote vector is set to zero.

Figure 6.40a shows that the weighting factors for pixels near the centre of a block favour that block. In the case of pixels at the corner of the block, the weighting is even between the value obtained from the true vector and the sum of the two pixel values obtained from remote vectors. The weighted sum produces a predicted pixel, which is subtracted from the actual pixel in the current VOP to be coded to produce a residual pixel. Blocks of residual pixels are DCT coded as usual. OBMC reduces the magnitude of residual pixels and gives a corresponding reduction in the number or magnitude of DCT coefficients to be coded.
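A highly simplified sketch of the OBMC weighted sum for one pixel follows; the weights and vector layout are invented stand-ins for the position-dependent tables MPEG-4 actually defines:

```python
def obmc_pixel(ref, x, y, own_vec, remote1, remote2, w=(4, 2, 2)):
    # Predicted pixel = weighted sum of three motion-compensated pixels
    # from the previous VOP: one via this block's own vector and two via
    # remote vectors from the blocks nearest this pixel's quadrant.
    vectors = (own_vec, remote1, remote2)
    total = sum(wi * ref[y + vy][x + vx]
                for wi, (vx, vy) in zip(w, vectors))
    return (total + sum(w) // 2) // sum(w)   # rounded weighted average
```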

image

FIGURE 6.40

Vector prediction.

OBMC is lossless because the decoder already has access to all the vectors and knows the weighting tables. Consequently the only overhead is the transmission of a flag that enables or disables the mechanism. MPEG-4 also has the ability to down sample prediction error or residual macroblocks that contain little detail. A 16 × 16 macroblock is down sampled to 8 × 8 and flagged. The decoder will identify the flag and interpolate back to 16 × 16.

In vector prediction, each macroblock may have either one or four vectors, as the coder decides. Consequently the prediction of a current vector may have to be made from either macroblock or DCT block vectors. In the case of predicting the single vector for an entire macroblock, or the top-left DCT block vector, the process shown in Figure 6.40b is used. Three earlier vectors, which may be macroblock or DCT block vectors as available, are used as the input to the prediction process. In the diagram the large squares show the macroblock vectors to be selected and the small squares show the DCT block vectors to be selected. The three vectors are passed to a median filter, which outputs the vector in the centre of the range unchanged.

A median filter is used because the same process can be performed in the decoder with no additional data transmission. The median vector is used as a prediction, and comparison with the actual vector enables a residual to be computed and coded for transmission. At the decoder the same prediction can be made and the received residual is added to re-create the original vector.
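In Python, the prediction might look like this, taking the component-wise median of the three candidate vectors:

```python
def predict_vector(v1, v2, v3):
    # Component-wise median of three neighbouring vectors. The decoder
    # holds the same three vectors, so it forms the identical prediction
    # and only the residual needs to be transmitted.
    median = lambda a, b, c: sorted((a, b, c))[1]
    return (median(v1[0], v2[0], v3[0]), median(v1[1], v2[1], v3[1]))

pred = predict_vector((3, 1), (4, 2), (-2, 1))   # -> (3, 1)
residual = (5 - pred[0], 1 - pred[1])            # actual (5, 1) -> (2, 0)
```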

The remaining parts of Figure 6.40b show how the remaining three DCT block vectors are predicted from adjacent DCT block vectors. If the relevant block is only macroblock coded, that vector will be substituted.

ADVANCED VIDEO CODING

AVC, or H.264, is intended to compress moving images that take the form of 8-bit 4:2:0 coded pixel arrays. As in MPEG-2, these may be progressive frames or fields from an interlaced signal. It does not support object-based coding. Incoming pixel arrays are subdivided into 16 × 16 macroblocks as in previous MPEG standards. In those previous standards, macroblocks were transmitted only in a raster-scan fashion. Whilst this is fine when the coded data are delivered via a reliable channel, AVC is designed to operate with imperfect channels that are subject to error or packet loss. One mechanism that supports this is known as FMO (flexible macroblock ordering).

When FMO is in use, the picture can be divided into different areas along horizontal or vertical macroblock boundaries. Figure 6.41a shows an approach in which macroblocks are chequerboarded. If the shaded macroblocks are sent in a different packet to the unshaded macroblocks, the loss of a packet will result in a degraded picture rather than no picture. Figure 6.41b shows another approach in which the important elements of the picture are placed in one area and less important elements in another. The important data may be afforded higher priority in a network. Note that when interlaced input is being coded, it may be necessary to constrain the FMO such that the smallest element becomes a macroblock pair in which one macroblock is vertically above the other.

In FMO these areas are known as slice groups that contain integer numbers of slices. Within slice groups, macroblocks are always sent in raster-scan fashion with respect to that slice group. The decoder must be able to establish the position of every received macroblock in the picture. This is the function of the macroblock to slice group map, which can be deduced by the decoder from picture header and slice header data.

image

FIGURE 6.41

(a) Chequerboarded macroblocks that have come from two different slice groups. Loss of one slice group allows a degraded picture to be seen. (b) Important picture content is placed in one slice, whereas background is in another.

Another advantage of AVC is that the bitstream is designed to be transmitted or recorded in a greater variety of ways, having distinct advantages in certain applications. AVC may wrap the raw coder output in a NAL (network abstraction layer), which formats the data in an appropriate manner.

In previous MPEG standards, prediction was used primarily between pictures. In MPEG-2 I pictures the only prediction was in DC coefficients, whereas in MPEG-4 some low-frequency coefficients were predicted. In AVC, I pictures are subject to spatial prediction and it is the prediction residual that is transform coded, not pixel data.

In I PCM mode, the prediction and transform stages are both bypassed and actual pixel values enter the remaining stages of the coder. In nontypical images such as noise, PCM may be more efficient. In addition, if the channel bit rate is high enough, a truly lossless coder may be obtained by the use of PCM. Figure 6.42 shows that the encoder contains a spatial predictor that is switched in for I pictures, whereas for P and B pictures the temporal predictor operates. The predictions are subtracted from the input picture and the residual is coded.

image

FIGURE 6.42

AVC encoder has a spatial predictor as well as a temporal predictor.

Spatial prediction works in two ways. In featureless parts of the picture, the DC component, or average brightness, is highly redundant. Edges between areas of differing brightness are also redundant. Figure 6.43a shows that in a picture having a strong vertical edge, rows of pixels traversing the edge are highly redundant, whereas (b) shows that in the case of a strong horizontal edge, columns of pixels are redundant. Sloping edges will result in redundancy on diagonals.

According to picture content, spatial prediction can operate on 4 × 4-pixel blocks or 16 × 16-pixel blocks. Figure 6.44a shows eight of the nine spatial prediction modes for 4 × 4 blocks. Mode 2, not shown, is the DC prediction that is directionless. Figure 6.44b shows that in 4 × 4 prediction, up to 13 pixel values above and to the left of the block will be used. This means that these pixel values are already known by the decoder because of the order in which decoding takes place. Spatial prediction cannot take place between different slices because the error recovery capability of a slice would be compromised if it depended on an earlier one for decoding.

Figure 6.44c shows that in vertical prediction (Mode 0), four pixel values above the block are copied downward so that all four rows of the predicted block are identical. Figure 6.44d shows that in horizontal prediction (Mode 1) four pixel values to the left are copied across so that all four columns of the predicted block are identical. Figure 6.44e shows how in diagonal prediction (Mode 4) seven pixel values are copied diagonally. Figure 6.44f shows that in DC prediction (Mode 2) pixel values above and to the left are averaged and the average value is copied into all 16 predicted pixel locations.
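Three of the nine 4 × 4 modes can be sketched as follows; "above" and "left" are the reconstructed reference pixels already available to the decoder, and the DC rounding shown should be read as illustrative:

```python
import numpy as np

def intra_4x4(mode, above, left):
    # 'above': 4 pixels above the block; 'left': 4 pixels to its left.
    if mode == 0:                         # vertical: copy the row downward
        return np.tile(np.asarray(above), (4, 1))
    if mode == 1:                         # horizontal: copy the column across
        return np.tile(np.asarray(left).reshape(4, 1), (1, 4))
    if mode == 2:                         # DC: average of reference pixels
        dc = (sum(above) + sum(left) + 4) // 8
        return np.full((4, 4), dc)
    raise NotImplementedError('remaining directional modes omitted')
```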

image

FIGURE 6.43

(a) In the case of a strong vertical edge, pixel values in rows tend to be similar.

(b) Horizontal edges result in columns of similar values.

With 16 × 16 blocks, only four modes are available: vertical, horizontal, DC, and plane. The first three of these are identical in principle to Modes 0, 1, and 2 with 4 × 4 blocks. Plane mode is a refinement of DC mode. Instead of setting every predicted pixel in the block to the same value by averaging the reference pixels, the predictor looks for trends in changing horizontal brightness in the top reference row and similar trends in vertical brightness in the left reference column and computes a predicted block whose values lie on a plane that may be tilted in the direction of the trend.

Clearly it is necessary for the encoder to have circuitry or software that identifies edges and their direction (or the lack of them) to select the appropriate mode. The standard does not suggest how this should work, only how its outputs should be encoded. In each case the predicted pixel block is subtracted from the actual pixel block to produce a residual. Spatial prediction is also used on chroma data.

image

FIGURE 6.44

(a) Spatial prediction works in eight different directions. (b) Adjacent pixels from which predictions will be made. (c) Vertical prediction copies rows of pixels downward.

(d) Horizontal prediction copies columns of pixels across. (e) Diagonal prediction.

(f) DC prediction copies the average of the reference pixels into every predicted pixel location.

image

FIGURE 6.45

The transform matrix used in AVC. The coefficients are all integers.

When spatial prediction is used, the statistics of the residual will be different from the statistics of the original pixels. When the prediction succeeds, the lower frequencies in the image are largely taken care of and so only higher frequencies remain in the residual. This suggests the use of a smaller transform than the 8 × 8 transform of previous systems. AVC uses a 4 × 4 transform. It is not, however, a DCT, but a DCT-like transform, using coefficients that are integers. This gives the advantages that coding and decoding require only shifting and addition and that the transform is perfectly reversible even when limited word length is used. Figure 6.45 shows the transform matrix of AVC.
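The forward version of this transform can be written directly in Python; H is the matrix of Figure 6.45, and the two-dimensional transform is H·X·Hᵀ. The scaling that the real codec folds into quantization is omitted here:

```python
import numpy as np

# The 4 x 4 integer transform matrix of AVC: only integers, so the
# transform needs nothing more than additions and shifts and is exactly
# reversible in limited word length.
H = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_transform(block):
    # Two-dimensional transform of a 4 x 4 residual block.
    return H @ block @ H.T
```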

One of the greatest deficiencies of earlier coders was blocking artifacts at transform block boundaries. AVC incorporates a deblocking filter. In operation, the filter examines sets of pixels across a block boundary. If it finds a step in value, this may or may not indicate a blocking artifact; it could be a genuine transition in the picture. However, the size of pixel value steps can be deduced from the degree of quantizing in use. If the step is bigger than the degree of quantizing would suggest, it is left alone. If the size of the step corresponds to the degree of quantizing, it is filtered or smoothed.

The adaptive deblocking algorithm is deterministic and must be the same in all decoders. This is because the encoder must also contain the same deblocking filter to prevent drift when temporal coding is used. This is known as in-loop deblocking. In other words, when, for example, a P picture is being predicted from an I picture, the I picture in both encoder and decoder will have been identically deblocked. Thus any errors due to imperfect deblocking are cancelled out by the P picture residual data.

Deblocking filters are modified when interlace is used because the vertical separation of pixels in a field is twice as great as in a frame. Figure 6.46 shows an in-loop deblocking system. The I picture is encoded and transmitted and is decoded and deblocked identically at both encoder and decoder. At the decoder the deblocked I picture forms the output as well as the reference with which a future P or I picture can be decoded. Thus when the encoder sends a residual, it will send the difference between the actual input picture and the deblocked I picture. The decoder adds this residual to its own deblocked I picture and recovers the actual picture.

image

FIGURE 6.46

The in-loop deblocking filter exists at both encoder and decoder to prevent drift.

MOTION COMPENSATION (MC) IN AVC

AVC has a more complex motion-compensation system than previous standards. Smaller picture areas are coded using vectors that may have quarter-pixel accuracy. The interpolation filter for subpixel motion compensation is specified so that the same filter is present in all encoders and decoders; the interpolator is then effectively in the loop, like the deblocking filter. Figure 6.47 shows that more than one previous reference picture may be used to decode a motion-compensated picture. Large numbers of future pictures are not used, as this would increase latency. The ability to select from a number of previous pictures is advantageous when a single nontypical picture is found inserted in normal material. An example is the white frame that results from a flashgun firing. MPEG-2 deals with this poorly, whereas AVC can deal with it well, simply by decoding the picture after the flash from the picture before the flash. Bi-directional coding is enhanced because the weighting of the contributions from earlier and later pictures can now be coded; thus a dissolve between two pictures can be coded efficiently by changing the weighting. In previous standards a B picture could not be used as a basis for any further decoding, but in AVC this is allowed.
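
As an illustration of the weighted bi-directional prediction just described, the sketch below mixes motion-compensated blocks from an earlier and a later reference picture. The function and weight names are assumptions; the point is that a dissolve can be tracked simply by changing the transmitted weights, leaving only a small residual to code.

```python
import numpy as np

def weighted_bipred(ref_past, ref_future, w_past, w_future):
    """Blend motion-compensated blocks from two reference pictures.

    During a dissolve the encoder can track the mix by sending weights
    that move from (1, 0) toward (0, 1), so the prediction follows the
    fade and the residual stays small.
    """
    pred = (w_past * ref_past.astype(float)
            + w_future * ref_future.astype(float))
    return np.clip(np.rint(pred), 0, 255).astype(np.uint8)
```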

AVC macroblocks may be coded with between 1 and 16 vectors. Prediction using 16 × 16 macroblocks fails when the edge of a moving object intersects the macroblock. In such cases it may be better to divide the macroblock according to the angle and position of the edge. Figure 6.48a shows the ways in which a 16 × 16 macroblock may be partitioned in AVC for motion-compensation purposes. There are four high-level partition schemes, one of which is to use four 8 × 8 blocks. When this mode is selected, the 8 × 8 blocks may be further partitioned as shown. This finer subdivision requires additional syntactical data to specify to the decoder what has been done. Clearly, if more vectors have to be transmitted, there must be a correspondingly greater reduction in residual data to make the extra vectors worthwhile. Thus the encoder needs to decide the partitioning intelligently; a toy version of that decision is sketched below. Figure 6.48b shows an example. There also needs to be an efficient vector coding scheme.
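
In this sketch, BITS_PER_VECTOR is an assumed average cost of sending one extra vector. A real encoder would use a proper rate-distortion measure, so this illustrates the trade-off rather than any standardised method.

```python
def choose_partition(costs):
    """Toy encoder decision for macroblock partitioning.

    `costs` maps a partition scheme to (residual_bits, vector_count).
    Each extra vector costs side-channel bits, so finer partitioning
    is only worthwhile if it shrinks the residual by more than the
    vector overhead.
    """
    BITS_PER_VECTOR = 12  # assumed figure, for illustration only

    def total(item):
        residual_bits, vectors = item[1]
        return residual_bits + vectors * BITS_PER_VECTOR

    return min(costs.items(), key=total)[0]

# Example: a block straddling a moving edge may compress better as
# four 8x8 partitions despite needing four vectors.
costs = {"16x16": (900, 1), "16x8": (700, 2),
         "8x16": (820, 2), "8x8": (450, 4)}
best = choose_partition(costs)  # -> "8x8"
```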

image

FIGURE 6.47

The motion compensation of AVC allows a picture to be built up from more than one previous picture.

image

FIGURE 6.48

(a) Macroblock partitioning for motion compensation in AVC. (b) How an encoder might partition macroblocks at the boundary of a moving object.

image

FIGURE 6.49

Vector inference. By assuming the optic flow axis to be straight, the vector for a B block can be computed from the vector for a P block.

In P coding, vectors are predicted from those in macroblocks already sent, provided that slice independence is not compromised. The predicted vector is the median of those to the left of, above, and above right of the macroblock to be coded. A different prediction is used if 16 × 8 or 8 × 16 partitions are used. Only the prediction error needs to be sent. In fact, if nothing is sent, as in the case of a skipped block, the decoder can predict the vector for itself.

In B coding, the vectors are predicted by inference from the previous P vectors. Figure 6.49 shows the principle. To create the P picture from the I picture, a vector must be sent for each moving area. As the B picture is at a known temporal location with respect to these anchor pictures, the vector for the corresponding area of the B picture can be predicted by assuming the optic flow axis is straight and performing a simple linear interpolation.
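
Both rules are simple enough to express directly. The sketch below gives the median prediction for P vectors and the straight-optic-flow inference for B vectors; the temporal distances t_b and t_p are names assumed here for illustration.

```python
def predict_p_vector(left, above, above_right):
    """Median prediction of a P-block vector from neighbouring
    macroblocks already sent. Each argument is an (x, y) vector; only
    the prediction error against this value need be transmitted.
    """
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))

def infer_b_vector(p_vector, t_b, t_p):
    """Vector inference for a B block (Figure 6.49).

    Assuming a straight optic flow axis, the B-picture vector is the
    P vector scaled by temporal position: t_b is the distance from the
    I picture to the B picture, t_p the distance from the I picture to
    the P picture.
    """
    scale = t_b / t_p
    return (p_vector[0] * scale, p_vector[1] * scale)

# Example: P picture 3 intervals after I, B picture 1 interval after I.
v_b = infer_b_vector((6.0, -3.0), t_b=1, t_p=3)  # -> (2.0, -1.0)
```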

AN AVC CODEC

Figure 6.50 shows an AVC coder–decoder pair. There is a good deal of general similarity with the previous standards. In the case of an I picture, there is no motion compensation and no previous picture is relevant. However, spatial prediction will be used. The prediction error will be transformed and quantized for transmission, but is also locally inverse quantized and inverse transformed prior to being added to the prediction to produce an unfiltered reconstructed macroblock. Thus the encoder has available exactly what the decoder will have and both use the same data to make predictions to avoid drift. The type of intra prediction used is determined from the characteristics of the input picture.

image

FIGURE 6.50

AVC coder and decoder. See text for details.

The locally reconstructed macroblocks are also input to the deblocking filter. This is identical to the decoder's deblocking filter and so the output will be identical to the output of the decoder. The deblocked, decoded I picture can then be used as a basis for encoding a P picture. Using this architecture the deblocking is in-loop for inter-coding purposes, but does not interfere with the intra prediction. Operation of the decoder should be obvious from what has gone before as the encoder effectively contains a decoder.

Like earlier formats, AVC uses lossless entropy coding to pack the data more efficiently. However, AVC takes the principle further: entropy coding is used to compress syntax data as well as coefficients. Syntax data take a variety of forms: vectors, slice headers, etc. A common exp-Golomb variable-length code is used for all syntax data. The different types of data are mapped appropriately for their statistics before that code is applied. Coefficients are coded using a system called CAVLC (context-adaptive variable-length coding).
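
The exp-Golomb code itself is easily stated: a code number n is sent as the binary form of n + 1, preceded by enough zeros to make the prefix one bit shorter than that binary form, so small values get short codewords. A minimal Python sketch:

```python
def exp_golomb_unsigned(n):
    """Encode a non-negative code number as an exp-Golomb codeword:
    0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...
    """
    bits = bin(n + 1)[2:]
    return "0" * (len(bits) - 1) + bits

def exp_golomb_signed(v):
    """Signed values (e.g. vector prediction errors) are first mapped
    to code numbers by alternating sign: 0, 1, -1, 2, -2, ...
    This mapping matches the usual AVC scheme.
    """
    n = 2 * v - 1 if v > 0 else -2 * v
    return exp_golomb_unsigned(n)
```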

Optionally, a further technique known as CABAC (context adaptive binary arithmetic coding) may be used in some profiles. This is a system that adjusts the coding dynamically according to the local statistics of the data instead of relying on statistics assumed at the design stage. It is more efficient and allows a coding gain of about 15% with more complexity. CAVLC performs the same function as RLC/VLC in MPEG-2 but it is more efficient. As in MPEG-2 it relies on the probability that coefficient values fall with increasing spatial frequency and that at the higher frequencies coefficients will be spaced apart by zero values.

image

FIGURE 6.51

CAVLC parameters used in AVC. See text for details.

The efficient prediction of AVC means that coefficients will typically be smaller than in earlier standards. It becomes useful to have specific means to code coefficients of value ±1 as well as zero. These are known as trailing ones (T1s). Figure 6.51 shows the parameters used in CAVLC.

The coefficients are encoded in the reverse order of the zig-zag scan. The number of nonzero coefficients N and the number of trailing ones are encoded into a single VLC symbol. The TotalZeros parameter gives the number of zero coefficients between the start of the scan and the last nonzero coefficient; in other words, it states how many zeros are embedded in the transmitted coefficient sequence, but it does not reveal where they are. That is the function of the RunBefore parameter, which is sent for any coefficient that is preceded by zeros in the transmission sequence. If N is 16 there can be no embedded zeros, so TotalZeros must be zero and is not sent, and no RunBefore parameters occur.

Coefficient values for trailing ones need only a single bit to denote the sign. Coefficients of magnitude greater than one have their polarity embedded in the value input to the VLC.

CAVLC obtains extra coding efficiency because it can select different codes according to circumstances. For example, if in a 16-coefficient block N is 7, then TotalZeros must lie between zero and nine, and the encoder selects a VLC table optimized for that range. The decoder can establish which table has been used by subtracting N from 16, so no extra data need be sent to switch tables. The N and T1s parameters can be coded using one of four tables selected by the values of N and T1s in nearby blocks. Six code tables are available for adaptive coefficient encoding.
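
The sketch below derives the Figure 6.51 parameters from a list of 16 coefficients in zig-zag scan order. It is a sketch only: table selection from neighbouring blocks is omitted, and the cap of three on trailing ones is an assumption matching the usual AVC limit.

```python
def cavlc_parameters(coeffs):
    """Derive N, T1s, TotalZeros, and the RunBefore values from 16
    coefficients given in zig-zag scan order.
    """
    # Trim zeros after the last nonzero coefficient; TotalZeros counts
    # only zeros embedded before it in the scan.
    last = max((i for i, c in enumerate(coeffs) if c != 0), default=-1)
    sequence = coeffs[:last + 1]
    nonzero = [c for c in sequence if c != 0]
    n = len(nonzero)
    total_zeros = len(sequence) - n
    # Trailing ones: +/-1 values at the high-frequency end of the scan,
    # capped at three (assumed to match the AVC limit).
    t1s = 0
    for c in reversed(nonzero):
        if abs(c) == 1 and t1s < 3:
            t1s += 1
        else:
            break
    # RunBefore: zeros immediately preceding each nonzero coefficient,
    # reported in reverse scan order, as they would be transmitted.
    runs, run = [], 0
    for c in sequence:
        if c == 0:
            run += 1
        else:
            runs.append(run)
            run = 0
    return n, t1s, total_zeros, list(reversed(runs))

# Example: [7, 0, -2, 1, 0, 0, 1] padded with zeros gives
# N = 4, T1s = 2, TotalZeros = 3, RunBefore = [2, 0, 1, 0].
```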

CODING ARTIFACTS

This section describes the visible results of imperfect coding. Imperfect coding may be where the coding algorithm is suboptimal, where the coder latency is too short, or where the compression factor in use is simply too great for the material.

In motion-compensated systems such as MPEG, the use of periodic intra fields means that the coding noise varies from picture to picture and this may be visible as noise pumping. Noise pumping may also be visible when the amount of motion changes. If a pan is observed, as the pan speed increases the motion vectors may become less accurate and reduce the quality of the prediction processes. The prediction errors will get larger and will have to be more coarsely quantized. Thus the picture gets noisier as the pan accelerates and the noise reduces as the pan slows down. The same result may be apparent at the edges of a picture during zooming. The problem is worse if the picture contains fine detail. Panning on grass, or trees waving in the wind, taxes most coders severely. Camera shake from a handheld camera also increases the motion vector data and results in more noise, as does film weave.

Input video noise or film grain degrades inter-coding as there is less redundancy between pictures and the difference data become larger, requiring coarse quantizing and adding to the existing noise.

When a codec is really fighting, the quantizing may become very coarse and as a result the video level at the edge of one DCT block may not match that of its neighbour. Therefore the DCT block structure becomes visible as a mosaicing or tiling effect. Coarse quantizing also causes some coefficients to be rounded up and appear larger than they should be. High-frequency coefficients may be eliminated by heavy quantizing and this forces the DCT to act as a steep-cut low-pass filter. This causes fringing or ringing around sharp edges and extra shadowy edges that were not in the original. This is most noticeable on text.

Excess compression may also result in colour bleed where fringing has taken place in the chroma or where high-frequency chroma coefficients have been discarded. Graduated colour areas may reveal banding or posterizing as the colour range is restricted by requantizing. These artifacts are almost impossible to measure with conventional test gear.

Neither noise pumping nor blocking is suffered by analog video recorders, and so it is nonsense to liken the performance of a codec to the quality of a VCR. In fact noise pumping is extremely objectionable because, unlike steady noise, it attracts attention in peripheral vision and may result in viewing fatigue.

In addition to highly detailed pictures with complex motion, certain types of video signal are difficult for MPEG-2 to handle and will usually result in a higher level of artifacts than usual. Noise has already been mentioned as a source of problems. Time base error from, for example, VCRs is undesirable because this puts successive lines in different horizontal positions. A straight vertical line becomes jagged and this results in high spatial frequencies in the DCT process. Spurious coefficients that need to be coded are created.

Much archive video is in composite form and MPEG-2 can handle this only after it has been decoded to components. Unfortunately many general-purpose composite decoders have a high level of residual subcarrier in the outputs. This is normally not a problem because the subcarrier is designed to be invisible to the naked eye. Figure 6.52 shows that in PAL and NTSC the subcarrier frequency is selected so that a phase reversal is achieved between successive lines and frames.

Whilst this makes the subcarrier invisible to the eye, it is not invisible to an MPEG decoder. The subcarrier waveform is interpreted as a horizontal frequency, the vertical phase reversals are interpreted as a vertical spatial frequency, and the picture-to-picture reversals increase the magnitude of the prediction errors. The subcarrier level may be low but it can be present over the whole screen and may require an excess of coefficients to describe it.

Composite video should not in general be used as a source for MPEG-2 encoding, but where this is inevitable the standard of the decoder must be much higher than average, especially in the residual subcarrier specification. Some MPEG pre-processors support high-grade composite decoding options.

Judder from conventional linear standards convertors degrades the performance of MPEG-2. The optic flow axis is corrupted and linear filtering causes multiple images, which confuse motion estimators and result in larger prediction errors. If standards conversion is necessary, the MPEG-2 system must be used to encode the signal in its original format and the standards convertor should be installed after the decoder. If a standards convertor has to be used before the encoder, then it must be a type that has effective motion compensation.

Film weave causes movement of one picture with respect to the next and this results in more vector activity and larger prediction errors. Movement of the centre of the film frame along the optical axis causes magnification changes that also result in excess prediction error data. Film grain has the same effect as noise: it is random and so cannot be compressed.

image

FIGURE 6.52

In composite video the subcarrier frequency is arranged so that inversions occur between adjacent lines and pictures to help reduce the visibility of the chroma.

Perhaps because image rotation is relatively uncommon, MPEG-2 cannot handle it well: the motion-compensation system is designed only for translational motion. When a rotating object is highly detailed, as in certain fairground rides, motion-compensation failure produces a significant amount of prediction error data, and if a suitable bit rate is not available the level of artifacts will rise.

Flashguns used by still photographers are a serious hazard to MPEG-2, especially when long GOPs are used. At a press conference where a series of flashes may occur, the resultant video contains intermittent white frames, which defeat prediction. A huge prediction error is required to turn the previous picture into a white picture, followed by another huge prediction error to return the white frame to the next picture. The output buffer fills and heavy requantizing is employed. After a few flashes the picture has generally gone to tiles.

PROCESSING MPEG-2 AND CONCATENATION

Concatenation loss occurs when the losses introduced by one codec are compounded by a second codec. All practical compressors, MPEG-2 included, are lossy because what comes out of the decoder is not bit-identical to what went into the encoder. The bit differences are controlled so that they have minimum visibility to a human viewer.

MPEG-2 is a toolbox that allows a variety of manipulations to be performed in both the spatial and the temporal domains. There is a limit to the compression that can be used on a single frame, and if higher compression factors are needed, temporal coding will have to be used. The longer the run of pictures considered, the lower the bit rate needed, but the harder it becomes to edit.

The most editable form of MPEG-2 is to use I pictures only. As there is no temporal coding, pure-cut edits can be made between pictures. The next best thing is to use a repeating IB structure that is locked to the odd/even field structure. Cut edits cannot be made as the B pictures are bi-directionally coded and need data from both adjacent I pictures for decoding. The B picture has to be decoded prior to the edit and re-encoded after the edit. This will cause a small concatenation loss.

Beyond the IB structure, processing gets harder. If a long GOP is used for the best compression factor, an IBBPBBP… structure results. Editing this is very difficult because the pictures are sent out of order so that bi-directional decoding can be used. MPEG allows closed GOPs, in which the last B picture is coded wholly from the previous pictures and does not need the I picture in the next GOP. The bitstream can be switched at this point, but only if the GOP structures in the two source video signals are synchronised (a requirement that makes colour framing seem easy). Consequently, in practice a long GOP bitstream will need to be decoded prior to any production step and re-encoded afterward.

This is known as naive concatenation and an enormous pitfall awaits. Unless the GOP structure of the output is identical to and synchronised with the input the results will be disappointing. The worst case is that in which an I picture is encoded from a picture that was formerly a B picture. It is easy enough to lock the GOP structure of a coder to a single input, but if an edit is made between two inputs, the GOP timings could well be different.

As there are so many structures allowed in MPEG, there will be a need to convert between them. If this has to be done, it should be only in the direction that increases the GOP length and reduces the bit rate. Going the other way is inadvisable. The ideal way of converting from, say, the IB structure of a news system to the IBBP structure of an emission system is to use a re-compressor. This is a kind of standards convertor that will give better results than a decode followed by an encode.

The DCT part of MPEG-2 itself is lossless. If all the coefficients are preserved intact an inverse transform yields the same pixel data. Unfortunately this does not give enough compression for many applications. In practice the coefficients are made less accurate by removing bits, starting at the least significant end and working upward. This process is weighted, or made progressively more aggressive as spatial frequency increases.

Small-value coefficients may be truncated to zero and large-value coefficients are most coarsely truncated at high spatial frequencies, where the effect is least visible.

Figure 6.53 shows what happens in the ideal case in which two identical coders are put in tandem and synchronised. The first coder quantizes the coefficients to finite accuracy and causes a loss on decoding. However, when the second coder performs the DCT calculation, the coefficients obtained will be identical to the quantized coefficients in the first coder and so if the second weighting and requantizing step is identical the same truncated coefficient data will result and there will be no further loss of quality.7
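
The no-further-loss property is easy to demonstrate. The sketch below applies a frequency-weighted requantizer twice with identical parameters; the second pass changes nothing, which is the ideal tandem case of Figure 6.53a. The step size and weights are illustrative numbers only.

```python
import numpy as np

def requantize(coeffs, base_step, weights):
    """Frequency-weighted coefficient truncation: the effective step
    size grows with spatial frequency, so high-frequency coefficients
    are the most coarsely represented.
    """
    steps = base_step * weights
    return np.round(coeffs / steps) * steps

# Four coefficients in ascending spatial frequency, with weights that
# coarsen the quantizing toward the high-frequency end.
weights = np.array([1.0, 1.5, 2.0, 3.0])
coeffs = np.array([100.3, 41.7, -12.2, 3.9])

once = requantize(coeffs, base_step=4, weights=weights)
twice = requantize(once, base_step=4, weights=weights)

# With identical weighting and step sizes the second pass changes
# nothing, so a synchronised identical tandem coder adds no loss.
assert np.array_equal(once, twice)
```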

In practice this ideal situation is elusive. If the two DCTs become nonidentical for any reason, the second requantizing step will introduce further error in the coefficients and the artifact level goes up. Figure 6.53b shows that non-identical concatenation can result from a large number of real-world effects.

An intermediate processing step such as a fade will change the pixel values and thereby the coefficients. A DVE (digital video effects generator) resize or shift will move pixels from one DCT block to another. Even if there is no processing step, the same effect will occur if the two codecs disagree on where the DCT block boundaries lie within the picture. Even with correctly aligned boundaries, there will still be concatenation loss if the two codecs use different weighting.

One problem with MPEG is that the compressor design is unspecified. Whilst this has advantages, it does mean that the chances of finding identical coders are minute, because each manufacturer will have its own views on the best compression algorithm. In a large system it may be worth obtaining the coders from a single supplier.

It is now increasingly accepted that concatenation of compression techniques is potentially damaging, and results are worse if the codecs are different. Clearly, feeding a digital coder such as MPEG-2 with a signal that has been subject to analog compression comes into the category of worse. Using interlaced video as a source for MPEG coding is suboptimal and using decoded composite video is even worse.

image

FIGURE 6.53

(a) Two identical coders in tandem that are synchronised make similar coding decisions and cause little loss. (b) There are various ways in which concatenated coders can produce nonideal performance.

One way of avoiding concatenation is to stay in the compressed data domain. If the goal is just to move pictures from one place to another, decoding to traditional video so an existing router can be used is not ideal, although substantially better than going through the analog domain.

Figure 6.54 shows some possibilities for picture transport. Clearly, if the pictures exist as a compressed file on a server, a file transfer is the right way to do it as there is no possibility of loss because there has been no concatenation. File transfer is also quite indifferent to the picture format. It doesn't care whether the pictures are interlaced or not or whether the colour is 4:2:0 or 4:2:2.

Decoding to SDI (serial digital interface) standard is sometimes done so that existing serial digital routing can be used. This is concatenation and has to be done carefully. SDI supports only interlaced video with nonsquare pixels and 4:2:2 colour coding, so if a compressed file uses 4:2:0 the chroma has to be interpolated up to 4:2:2 for SDI transfer and then subsampled back to 4:2:0 at the second coder, and this will cause generation loss. An SDI transfer can also be performed only in real time, thus negating one of the advantages of compression. In short, traditional SDI is not really at home with compression.

image

FIGURE 6.54

Compressed picture transport mechanisms contrasted.

As 4:2:0 progressive scan gains popularity and video production moves steadily toward non-format-specific hardware using computers and data networks, use of the serial digital interface will eventually decline. In the short term, if an existing SDI router has to be used, one solution is to produce a bitstream that is sufficiently similar to SDI that a router will pass it. In other words, the signal level, frequency, and impedance are pure SDI, but the data protocol is different so that a bit-accurate file transfer can be performed. This has two advantages over SDI. First, the compressed data format can be anything appropriate, and noninterlaced and/or 4:2:0 material can be handled in any picture size, aspect ratio, or frame rate. Second, a faster-than-real-time transfer can be used, depending on the compression factor of the file. Equipment that allows this is becoming available, and its use can mean that the full economic life of an SDI-routing installation can be obtained.

An improved way of reducing concatenation loss has emerged from the ATLANTIC research project.8 Figure 6.55 shows that the second encoder in a concatenated scheme does not make its own decisions from the incoming video, but is instead steered by information from the first bitstream. As the second encoder has less intelligence, it is known as a dim encoder.

The information bus carries all the structure of the original MPEG-2 bitstream, which would be lost in a conventional decoder. The ATLANTIC decoder does more than decode the pictures. It also places on the information bus all parameters needed to make the dim encoder reenact what the initial MPEG-2 encoder did as closely as possible.

The GOP structure is passed on so that pictures are re-encoded as the same type. Positions of macroblock boundaries become identical so that DCT blocks contain the same pixels and motion vectors relate to the same screen data. The weighting and quantizing tables are passed so that coefficient truncation is identical. Motion vectors from the original bitstream are passed on so that the dim encoder does not need to perform motion estimation. In this way predicted pictures will be identical to the original prediction and the prediction error data will be the same.

image

FIGURE 6.55

In an ATLANTIC system, the second encoder is steered by information from the decoder.

One application of this approach is in re-compression, in which an MPEG-2 bitstream has to have its bit rate reduced. This has to be done by heavier requantizing of coefficients, but if as many other parameters as possible, such as motion vectors, can be kept the same, the degradation will be minimized. In a simple re-compressor, just requantizing the coefficients means that the predictive coding will be impaired. In a proper encode, the quantizing error due to coding, say, an I picture is removed from the P picture by the prediction process: the prediction error of P is obtained by subtracting the decoded I picture rather than the original I picture.

In simple re-compression this does not happen and there may be a tolerance buildup known as drift.9 A more sophisticated re-compressor will need to repeat the prediction process using the decoded output pictures as the prediction reference.

MPEG-2 bitstreams will often be decoded for the purpose of switching. Local insertion of commercial breaks into a centrally originated bitstream is one obvious requirement. If the decoded video signal is switched, the information bus must also be switched. At the switch point identical re-encoding becomes impossible because prior pictures required for predictive coding will have disappeared. At this point the dim encoder has to become bright again because it has to create an MPEG-2 bitstream without assistance.

It is possible to encode the information bus into a form that allows it to be invisibly carried in the serial digital interface. When a production process such as a vision mixer or DVE performs no manipulation, i.e., becomes bit transparent, the subsequent encoder can extract the bus information and operate in “dim” mode. When a manipulation is performed, the information bus signal will be corrupted and the encoder has to work in “bright” mode. The encoded information signal is known as a “mole”10 because it burrows through the processing equipment!

There will be a generation loss at the switch point because the re-encode will be making different decisions in bright mode. This may be difficult to detect because the human visual system is slow to react to a vision cut, and defects in the first few pictures after a cut are masked.

In addition to the video computation required to perform a cut, the process has to consider the buffer occupancy of the decoder. A downstream decoder has finite buffer memory, and individual encoders model the decoder buffer occupancy to ensure that it neither overflows nor underflows. At any instant the decoder buffer can be nearly full or nearly empty without a problem, provided there is a subsequent correction. An encoder that is approaching a complex I picture may run down the buffer so it can send a lot of data to describe that picture. Figure 6.56a shows that if a decoder with a nearly full buffer is suddenly switched to an encoder that has been running down its buffer occupancy, the decoder buffer will overflow when the second encoder sends a lot of data.

An MPEG-2 switcher will need to monitor the buffer occupancy of its own output to avoid overflow of downstream decoders. When this is a possibility the second encoder will have to recompress to reduce the output bit rate temporarily. In practice there will be a recovery period in which the buffer occupancy of the newly selected signal is matched to that of the previous signal. This is shown in Figure 6.56b.
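
A toy model of the buffer bookkeeping involved is sketched below. The names and the simple per-picture accounting are assumptions for illustration; real encoders model the decoder buffer continuously against the channel rate.

```python
def decoder_buffer_occupancy(arrivals, picture_sizes, buffer_size, start):
    """Track downstream decoder buffer occupancy picture by picture.

    In each picture interval the channel delivers arrivals[i] bits
    into the buffer and the decoder removes picture_sizes[i] bits for
    one coded picture. A bitstream switch that joins streams at very
    different occupancy levels can overflow the buffer (Figure 6.56a)
    unless the splicer recompresses to pull occupancy back into line
    (Figure 6.56b).
    """
    occupancy = start
    history = []
    for arriving, leaving in zip(arrivals, picture_sizes):
        occupancy += arriving - leaving
        if occupancy > buffer_size:
            raise RuntimeError("decoder buffer overflow at switch point")
        if occupancy < 0:
            raise RuntimeError("decoder buffer underflow")
        history.append(occupancy)
    return history
```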

image

FIGURE 6.56

(a) A bitstream switch at a different level of buffer occupancy can cause a decoder overflow.

(b) Recompression after a switch to return to correct buffer occupancy.

References

1. MPEG Video Standard ISO/IEC 13818-2: Information technology: generic coding of moving pictures and associated audio information (video) (also ITU-T Rec. H.262) (1996).

2. Huffman, D.A. A method for the construction of minimum redundancy codes. Proc. IRE, 40, 1098–1101 (1952).

3. LeGall, D. MPEG: a video compression standard for multimedia applications. Commun. ACM, 34, No. 4, 46–58 (1991).

4. ISO/IEC JTC1/SC29/WG11 MPEG. International Standard ISO 11172: Coding of moving pictures and associated audio for digital storage media up to 1.5 Mbits/s. (1992).

5. ISO Joint Photographic Experts Group Standard JPEG-8-R8.

6. Wallace, G.K. Overview of the JPEG (ISO/CCITT) still image compression standard. ISO/JTC1/SC2/WG8 N932 (1989).

7. Stone, J., and Wilkinson, J. Concatenation of video compression systems. Presented at the 137th SMPTE Tech. Conf. (New Orleans) (1995).

8. Wells, N.D. The ATLANTIC project: models for program production and distribution. Proc. Eur. Conf. Multimedia Applications Services and Techniques (ECMAST), 243–253 (1996).

9. Werner, O. Drift analysis and drift reduction for multiresolution hybrid video coding. Image Commun., 8, 387–409 (1996).

10. Knee, M.J., and Wells, N.D. Seamless concatenation: a 21st century dream. Presented at the Int. Television Symp. (Montreux) (1997).
