14.5. Accelerating an MPEG-4 Decoder

One of the most computationally demanding parts of encoding MPEG-4 video data is motion estimation, which searches adjacent video frames for similar pixel blocks to detect inter-frame movement in the picture. The motion-estimation search algorithm’s inner loop contains an SAD (sum of absolute differences) operation consisting of a subtraction, an absolute value, and the addition of the resulting value to a running sum.
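As a rough illustration of that inner loop, here is a minimal scalar sketch in C. The function name `sad_16x16`, the 16 × 16 block size, and the `stride` parameter are assumptions made for this example; they are not taken from the reference design.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar SAD between two 16x16 pixel blocks (hypothetical helper).
 * `stride` is the distance in bytes between consecutive rows; this sketch
 * assumes both frames use the same stride. */
static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, size_t stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            /* Operation 1: subtraction */
            int diff = (int)cur[y * stride + x] - (int)ref[y * stride + x];
            /* Operation 2: absolute value; operation 3: accumulate */
            sum += (uint32_t)abs(diff);
        }
    }
    return sum;
}
```

Each pixel costs three dependent operations here, and an exhaustive search repeats the whole block comparison for every candidate position, which is why the operation count grows so quickly.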

For a QCIF (Quarter Common Intermediate Format, 176 × 144 pixels) image frame, a 15-Hz frame rate, and an exhaustive-search motion-estimation scheme, the SAD computation requires about 641 million operations/sec. As shown in Figure 14.3, it’s possible to use TIE to add SIMD SAD hardware that executes one SAD instruction per cycle across 16 pixels. (Note: Configuring the Xtensa processor’s memory bus to be 128 bits wide makes it possible to load 16 pixels’ worth of data using one load instruction.)

Figure 14.3. MPEG-4 SIMD SAD instruction execution hardware.


Executing all three SAD component operations (subtraction, absolute value, addition) at once for 16 pixel values reduces the 641 million operations/sec requirement to 14 million instructions/sec, a substantial reduction in cycle count that in turn permits a lower clock rate. This MPEG-4 motion-estimation accelerator is part of an MPEG-4 decoder reference design developed by Tensilica. The MPEG-4 decoder adds approximately 100,000 gates to the base Xtensa processor and implements a 2-way QCIF video codec operating at 15 frames/sec or a QCIF MPEG-4 decoder that operates at 30 frames/sec, using approximately 30 MIPS in either operational mode.
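The effect of the fused 16-pixel instruction can be approximated with a small software model. This is only a sketch: `sad16_step`, `sad_16x16_simd_model`, and `fused_rate_mops` are hypothetical names invented here, and the loop inside `sad16_step` stands in for work that the hardware in Figure 14.3 would perform in a single instruction. It is not TIE code.

```c
#include <stddef.h>
#include <stdint.h>

#define SAD_LANES 16  /* pixels delivered by one 128-bit load */

/* Software model of one fused SIMD SAD step (hypothetical name):
 * subtract, take the absolute value, and accumulate across 16 lanes.
 * In hardware this whole loop would be one instruction. */
static uint32_t sad16_step(uint32_t acc, const uint8_t *cur, const uint8_t *ref)
{
    for (int lane = 0; lane < SAD_LANES; lane++) {
        int diff = (int)cur[lane] - (int)ref[lane];
        acc += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return acc;
}

/* SAD of a 16x16 block, issuing one modeled instruction per row.
 * `instr_count` reports how many fused steps were issued. */
static uint32_t sad_16x16_simd_model(const uint8_t *cur, const uint8_t *ref,
                                     size_t stride, unsigned *instr_count)
{
    uint32_t acc = 0;
    *instr_count = 0;
    for (int row = 0; row < 16; row++) {
        acc = sad16_step(acc, cur + row * stride, ref + row * stride);
        (*instr_count)++;
    }
    return acc;
}

/* Convert a scalar operation rate (millions of ops/sec) into a fused
 * instruction rate: 3 operations per pixel, 16 pixels per instruction. */
static double fused_rate_mops(double scalar_mops)
{
    return scalar_mops / (3.0 * SAD_LANES);
}
```

Under this model a 16 × 16 block needs 16 fused instructions instead of 768 scalar operations, a factor of 48. Applying the same factor to 641 million operations/sec gives about 13.4 million instructions/sec, which is consistent with the roughly 14 million quoted above.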

Other MPEG-4 algorithms can also be accelerated, including variable-length decoding, iDCT, bitstream processing, dequantization, AC/DC prediction, color conversion, and post filtering. When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD engine within the tailored processor, the results can be quite surprising.

As Table 14.2 shows, the resulting SIMD engine acceleration cuts the number of cycles required to decode the MPEG-4 video clips from billions to tens or hundreds of millions and reduces the required processor operating frequency by roughly 30× to around 10 MHz. Without the additional, application-tailored instructions, the processor would need to run at roughly 300 MHz to perform the MPEG-4 decoding. Clearly, there is a substantial difference in power dissipation and process-technology cost between a 10 MHz and a 300 MHz processor. It’s unlikely that any amount of assembly language coding could produce similarly large drops in the clock rate.

Table 14.2. MPEG-4 decoder acceleration results from processor augmentation with MPEG-4 SIMD instructions
| Video clip | Original MPEG-4 decoder performance (# of execution cycles) | Optimized MPEG-4 decoder performance (# of execution cycles) | Clock frequency (15 frames/sec) | TIE speedup |
|---|---|---|---|---|
| Miss America | 3.126G cycles | 76.81M cycles | 7.7 MHz | 40.1× |
| Suzie | 3.389G cycles | 102.19M cycles | 10.3 MHz | 33.2× |
| Foreman | 10.045G cycles | 359.5M cycles | 13.5 MHz | 27.9× |
| Car phone | 9.222G cycles | 308.7M cycles | 12.2 MHz | 29.9× |
| Monsters Inc. | 29.327G cycles | 822.8M cycles | 8.6 MHz | 35.6× |
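The relationship between the table’s columns can be checked with a few lines of C. This sketch makes one assumption that the text does not state explicitly: at a fixed frame rate, the required clock frequency scales linearly with the cycle count. The struct and function names are illustrative only.

```c
#include <stddef.h>

/* One row of Table 14.2 (field names are illustrative). */
struct clip_result {
    const char *name;
    double original_cycles;   /* cycles without the added instructions */
    double optimized_cycles;  /* cycles with the added instructions */
    double clock_mhz;         /* required clock at 15 frames/sec */
};

static const struct clip_result clips[] = {
    {"Miss America",   3.126e9,  76.81e6,  7.7},
    {"Suzie",          3.389e9, 102.19e6, 10.3},
    {"Foreman",       10.045e9, 359.5e6,  13.5},
    {"Car phone",      9.222e9, 308.7e6,  12.2},
    {"Monsters Inc.", 29.327e9, 822.8e6,   8.6},
};

/* Speedup is the ratio of original to optimized cycle counts. */
static double speedup(const struct clip_result *c)
{
    return c->original_cycles / c->optimized_cycles;
}

/* Estimated clock needed without the added instructions, assuming the
 * clock requirement scales linearly with cycle count. */
static double unaccelerated_clock_mhz(const struct clip_result *c)
{
    return c->clock_mhz * speedup(c);
}
```

Recomputing the ratios from the rounded cycle counts gives values within about 2 percent of the printed speedup column. Under the linear-scaling assumption, the estimated unaccelerated clock falls between roughly 300 and 380 MHz for these clips, in line with the figure of roughly 300 MHz cited in the text.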

As shown in the examples above, it’s possible to accelerate the performance of embedded algorithms using configurable and extensible microprocessor cores. Designers can add precisely the resources (special-purpose registers, execution units, and wide data buses) required to achieve the desired algorithmic performance instead of attempting to shoehorn algorithms into the computational assets of a fixed-ISA processor.

This design approach requires only that the design team be able to profile existing algorithm code and find the critical inner loops in that profiled code (two tasks they already perform). From these profiles, the design team can then define new processor instructions and registers that accelerate these critical loops, which greatly improves algorithm performance. In most cases, designers can replace entire RTL blocks with configurable processors tuned for the exact application, saving valuable design and verification time and adding an extra level of flexibility because of the inherent programmability of this approach.
