Accelerating an MPEG-4 Decoder

14.5. Accelerating an MPEG-4 Decoder

One of the most computationally demanding parts of encoding MPEG-4 video data is motion estimation, which searches adjacent video frames for similar pixel blocks to detect inter-frame movement in the picture. The inner loop of the motion-estimation search algorithm contains a sum-of-absolute-differences (SAD) operation, which consists of a subtraction, an absolute value, and the addition of the result to the previously accumulated value.
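The three-step SAD operation described above can be sketched in plain C. This is a minimal illustration, not code from the reference design; the function name `sad_block`, its parameters, and the assumption of 8-bit pixels stored row by row are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two blocks of 8-bit pixels.
 * `stride` is the distance in bytes between the starts of successive rows.
 * (Hypothetical helper for illustration only.) */
uint32_t sad_block(const uint8_t *cur, const uint8_t *ref,
                   int width, int height, int stride)
{
    uint32_t acc = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* 1. subtraction (promoted to int, so it cannot wrap) */
            int diff = cur[y * stride + x] - ref[y * stride + x];
            /* 2. absolute value, 3. add to the running sum */
            acc += (uint32_t)abs(diff);
        }
    }
    return acc;
}
```

Each pixel pair costs three scalar operations here, which is why the loop dominates the motion-estimation workload on a processor without SAD support.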

For a QCIF (quarter common image format, 176 × 144 pixels) image frame, a 15-Hz frame rate, and an exhaustive-search motion-estimation scheme, the SAD computation requires about 641 million operations/sec. As shown in Figure 14.3, it's possible to use TIE to add SIMD SAD hardware that executes one SAD instruction per cycle, with each instruction operating on 16 pixels at once. (Note: Configuring the Xtensa processor's memory bus to be 128 bits wide makes it possible to load 16 pixels' worth of data, at 8 bits per pixel, using one load instruction.)

Figure 14.3. MPEG-4 SIMD SAD instruction execution hardware.
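The behavior of such a 16-pixel-wide SAD instruction can be modeled in software. The sketch below is only a functional model written in C, not TIE and not the hardware of Figure 14.3; the names `sad16_step` and `sad_macroblock` are hypothetical, and the use of a 16 × 16 macroblock with one modeled instruction per row is an assumption made for illustration.

```c
#include <stdint.h>

#define SAD_LANES 16  /* pixels handled by one modeled SAD instruction */

/* Functional model of one 16-lane SAD step: subtract, take the absolute
 * value, and accumulate for all 16 pixel pairs, then add the lane total
 * to the incoming accumulator. In hardware this would be one instruction. */
uint32_t sad16_step(uint32_t acc,
                    const uint8_t cur[SAD_LANES],
                    const uint8_t ref[SAD_LANES])
{
    uint32_t sum = 0;
    for (int lane = 0; lane < SAD_LANES; lane++) {
        uint8_t a = cur[lane];
        uint8_t b = ref[lane];
        sum += (a > b) ? (uint32_t)(a - b) : (uint32_t)(b - a);
    }
    return acc + sum;
}

/* SAD over a 16 x 16 macroblock: 16 modeled instructions, one per row,
 * each consuming the 16 pixels that a single 128-bit load would deliver. */
uint32_t sad_macroblock(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t acc = 0;
    for (int row = 0; row < 16; row++) {
        acc = sad16_step(acc, cur + row * stride, ref + row * stride);
    }
    return acc;
}
```

In this model a full macroblock comparison takes 16 steps instead of the 768 scalar operations (256 pixels × 3 operations) of a plain C loop.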

Executing all three SAD component operations (subtraction, absolute value, addition) for 16 pixel values simultaneously reduces the 641 million operations/sec requirement to 14 million instructions/sec. This substantial reduction in cycle count should in turn allow a reduced clock rate. This MPEG-4 motion-estimation accelerator is part of an MPEG-4 decoder reference design developed by Tensilica. The MPEG-4 decoder adds approximately 100,000 gates to the base Xtensa processor and implements either a 2-way QCIF video codec operating at 15 frames/sec or a QCIF MPEG-4 decoder operating at 30 frames/sec, using approximately 30 MIPS in either operational mode.
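The reduction quoted above can be checked with simple arithmetic. The sketch below assumes an idealized model in which every SIMD instruction replaces exactly 16 × 3 scalar operations; the function name `simd_instruction_rate` is hypothetical. Under that assumption the result is about 13.4 million instructions/sec, in line with the figure of roughly 14 million given in the text, whose exact accounting is not stated.

```c
/* Ideal instruction rate when one SIMD instruction performs
 * `ops_per_pixel` operations on `lanes` pixels at once.
 * (Idealized model: ignores loads, loop overhead, and partial vectors.) */
double simd_instruction_rate(double scalar_ops_per_sec,
                             int lanes, int ops_per_pixel)
{
    return scalar_ops_per_sec / ((double)lanes * (double)ops_per_pixel);
}
```

For example, `simd_instruction_rate(641e6, 16, 3)` divides 641 million by 48.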

Other MPEG-4 algorithms can also be accelerated, including variable-length decoding, iDCT, bitstream processing, dequantization, AC/DC prediction, color conversion, and post filtering. When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD engine within the tailored processor, the results can be quite surprising.

As Table 14.2 shows, the resulting SIMD engine cuts the number of cycles required to decode the MPEG-4 video clips from billions to millions, and it cuts the required processor operating frequency by roughly 30× to around 10 MHz. Without the additional, application-tailored instructions, the processor would need to run at roughly 300 MHz to perform the MPEG-4 decoding. Clearly, there is a substantial difference in power dissipation and process-technology cost between a 10 MHz and a 300 MHz processor. It's unlikely that any amount of assembly language coding could produce similarly large drops in the clock rate.

Table 14.2. MPEG-4 decoder acceleration results from processor augmentation with MPEG-4 SIMD instructions
Video clip	Original MPEG-4 decoder (execution cycles)	Optimized MPEG-4 decoder (execution cycles)	Clock frequency at 15 frames/sec	TIE speedup
Miss America	3.126G cycles	76.81M cycles	7.7 MHz	40.1×
Suzie	3.389G cycles	102.19M cycles	10.3 MHz	33.2×
Foreman	10.045G cycles	359.5M cycles	13.5 MHz	27.9×
Car phone	9.222G cycles	308.7M cycles	12.2 MHz	29.9×
Monsters Inc.	29.327G cycles	822.8M cycles	8.6 MHz	35.6×
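The clock rate an unaccelerated processor would need can be estimated from the last two columns of Table 14.2. The sketch below assumes that, for the same clip and the same real-time target, the required clock frequency scales in proportion to the cycle count, so the unaccelerated clock is the optimized clock multiplied by the TIE speedup. The struct and function names are hypothetical; the numbers are copied from the table.

```c
#include <stddef.h>

/* One row of Table 14.2 (clock and speedup columns only). */
struct clip_result {
    const char *name;
    double optimized_mhz;  /* clock frequency at 15 frames/sec */
    double speedup;        /* TIE speedup */
};

static const struct clip_result table_14_2[] = {
    {"Miss America",  7.7, 40.1},
    {"Suzie",        10.3, 33.2},
    {"Foreman",      13.5, 27.9},
    {"Car phone",    12.2, 29.9},
    {"Monsters Inc.", 8.6, 35.6},
};

/* Estimated clock needed without the added instructions, assuming the
 * required frequency is proportional to the cycle count. */
double unaccelerated_mhz(const struct clip_result *r)
{
    return r->optimized_mhz * r->speedup;
}

/* Mean of the estimates over `n` rows. */
double average_unaccelerated_mhz(const struct clip_result *rows, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += unaccelerated_mhz(&rows[i]);
    }
    return (n > 0) ? total / (double)n : 0.0;
}
```

Under this assumption the per-clip estimates fall between about 306 MHz and 377 MHz, which matches the order of magnitude of the 300 MHz figure cited in the text.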

As shown in the examples above, it’s possible to accelerate the performance of embedded algorithms using configurable and extensible microprocessor cores. Designers can add precisely the resources (special-purpose registers, execution units, and wide data buses) required to achieve the desired algorithmic performance instead of attempting to shoehorn algorithms into the computational assets of a fixed-ISA processor.

This design approach requires only that the design team profile existing algorithm code and find the critical inner loops in that profiled code (two tasks they already perform). From these profiles, the design team can then define new processor instructions and registers that accelerate these critical loops, which greatly improves algorithm performance. In most cases, designers can replace entire RTL blocks with configurable processors tuned for the exact application, saving valuable design and verification time and adding an extra level of flexibility because of the inherent programmability of this approach.
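To make the idea of isolating a critical inner loop concrete, the sketch below shows an exhaustive block-matching search in which the SAD kernel is passed in as a function pointer. It is a minimal illustration under stated assumptions, not the reference design: the names `sad_fn`, `sad16x16_c`, `motion_vector`, and `full_search`, the 16 × 16 block size, and the frame layout (8-bit pixels, row stride equal to the frame width) are all hypothetical. A profile of code like this would point at the kernel, which is the part a custom instruction would replace.

```c
#include <stdint.h>

/* Signature of a 16 x 16 SAD kernel; a version built on a custom SAD
 * instruction could be substituted for the plain C one below. */
typedef uint32_t (*sad_fn)(const uint8_t *cur, const uint8_t *ref, int stride);

/* Plain C kernel: the hot inner loop a profiler would identify. */
uint32_t sad16x16_c(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t acc = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            acc += (uint32_t)(d < 0 ? -d : d);
        }
    }
    return acc;
}

struct motion_vector {
    int dx;
    int dy;
    uint32_t sad;
};

/* Exhaustive search: try every offset within +/- range of the block at
 * (mbx, mby), skipping candidates that fall outside the reference frame,
 * and keep the offset with the smallest SAD. */
struct motion_vector full_search(const uint8_t *cur, const uint8_t *ref,
                                 int width, int height,
                                 int mbx, int mby, int range, sad_fn sad)
{
    struct motion_vector best = {0, 0, UINT32_MAX};
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mbx + dx;
            int ry = mby + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height) {
                continue;
            }
            uint32_t s = sad(cur + mby * width + mbx,
                             ref + ry * width + rx, width);
            if (s < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = s;
            }
        }
    }
    return best;
}
```

Because the search control code never changes, swapping the kernel is the only modification needed to measure the effect of an accelerated SAD.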
