Chapter 1.8. Techniques for Effective Vertex and Fragment Shading on the SPUs

Steven Tovey, Bizarre Creations Ltd.

When the Cell Broadband Engine was designed, Sony and the other corporations in the STI coalition always had one eye on the Cell’s ability to support a GPU in its processing activities [Shippy09]. The Cell has been with us for three years now, and like any new piece of hardware, it has taken time for developers to understand the best ways of pushing the hardware to its limits. The likes of Mike Acton and the Insomniac Games Technology Team have been instrumental in pushing general development and coding strategies for the Cell forward, but there has been little discussion about ways that the SPUs can support a GPU in its processing activities specifically. This chapter aims to introduce fundamental techniques that can be employed when developing code for the CBE that will allow it to aid the GPU in performing rendering tasks.

The CBE as Part of a Real-World System

Understanding Cell's place in a real-world system is useful to our discussion, and, as such, we will use Sony's PlayStation 3 as our case study. PlayStation 3 contains the Cell Broadband Engine, which was developed jointly by Sony Computer Entertainment, Toshiba Inc., and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell forms part of the overall architecture of the console along with the Reality Synthesizer (RSX) and two types of memory. Figure 1.8.1 shows a high-level view of the architecture.

Figure 1.8.1. The PlayStation 3 architecture (illustration modeled after [Möller08, Perthuis06]).

The Cell contains two distinctly different types of processor: the PowerPC Processing Element (PPE) and the Synergistic Processing Element (SPE). The PPE is essentially the brains of the chip [Shippy09] and is capable of running an operating system in addition to coordinating the processing activities of its counterpart processing elements, the SPEs. Inside PlayStation 3, there are eight SPEs. However, to increase chip yield, one is locked out, and Sony reserves another for their operating system, leaving a total of six SPEs available for application programmers. All processing elements in the Cell are connected by a token-ring bus, as shown in Figure 1.8.2.

Figure 1.8.2. The Cell Broadband Engine (modeled after [IBM08]).

Because the SPEs are the main focus of this chapter, they are discussed in much greater detail in the forthcoming sections.

The SPEs

Each Synergistic Processing Element is composed of two major components: the Synergistic Processing Unit (SPU) and the Memory Flow Controller (MFC).

The SPU

Detailed knowledge of the SPU instruction set and internal execution model is critical to achieving peak performance on the PlayStation 3. In the following sections, we will highlight some important facets of this unique processor.

The Synergistic Execution Unit and SPU ISA

The Synergistic Execution Unit (SXU), part of the SPU, is responsible for the execution of instructions. Inside the SXU are two pipelines: the odd pipeline and the even pipeline. Instructions are issued to exactly one of these pipelines, depending on the group the issued instruction falls into (see Table 1.8.1). The SXU supports the dual issue of instructions (one from each pipeline) if and only if a very strict set of requirements is met. We will discuss these requirements in detail later.

Table 1.8.1. A List of Instruction Groups Together with Their Associated Execution Pipes and Latencies

Instruction Group | Pipeline | Latency (Cycles) | Issue (Cycles)
Single precision floating-point operations | EVEN | 6 | 1
Double precision floating-point operations | EVEN | 7 | 6
Integer multiplies, integer/float conversions, and interpolation | EVEN | 7 | 1
Immediate loads, logical operations, integer addition/subtraction, carry/borrow generate | EVEN | 2 | 1
Element-wise rotates and shifts, special byte operations | EVEN | 4 | 1
Loads and stores, branch hints, channel operations | ODD | 6 | 1
Shuffle bytes, qword rotates and shifts, estimates, gather, selection mask formation and branches | ODD | 4 | 1

The SPU has a particularly large register file to facilitate the execution of pipelined, unrolled code without the need for excessive register spilling. Unlike its counterpart, the PPE, the register file of the SPU is unified. That is, floating-point, integer, and vector operations act on the same registers without having to move through memory. As the SPU is a vector processing unit at its heart, its Instruction Set Architecture (ISA) is designed specifically for vector processing [IBM08a]. All 128 of the SPU’s registers are 16 bytes in size, allowing for up to four 32-bit floating-point values or eight 16-bit integers to be processed with each instruction.

While a full analysis of the SPU’s ISA is beyond the scope of this gem, there are a number of instructions worth discussing in greater detail that are particularly important for efficient programming of the SPU. The first of these instructions is selb, or “select bits.” The selb instruction performs branchless selection on a bitwise basis and takes the form selb rt, ra, rb, rm. For each bit of a quadword, this instruction uses the mask register (rm) to determine which bits of the source registers (ra and rb) should be placed in the corresponding bits of the target register (rt). Comparison instructions all return a quadword selection mask that can be used with selb[1].
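
As a brief illustration, a branchless per-component maximum can be built from a floating-point compare and selb. The fragment below is a minimal sketch written with the si-style intrinsics used throughout this chapter; the variable names are purely illustrative.

// Branchless max(a, b) across four packed floats. The compare generates an
// all-ones mask in each word where a > b, and selb picks a there, b elsewhere.
qword gt_mask = si_fcgt(a, b);
qword max_ab  = si_selb(b, a, gt_mask);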

The shuffle bytes instruction, shufb, is the key instruction in data manipulation on the SPU. The shufb instruction takes four operands, all of which are registers. The first operand, rt, is the target register. The next two operands, ra and rb, are the two quadwords that will be manipulated by the quadword pattern from the fourth operand, rp. The manipulations controlled by this fourth operand, known as the shuffle pattern, are particularly interesting.

A shuffle pattern is a quadword value that works on a byte level. Each of the 16 bytes in the quadword controls the contents of the corresponding byte in the target register. For example, the 0th byte of the pattern quadword controls the value that will ultimately be placed into the 0th byte of the target register, the 1st byte controls the value placed into the 1st byte of the target register, and so on, for all 16 bytes of the quadword. Listing 1.8.1 provides an example shuffle pattern.

Listing 1.8.1. An example shuffle pattern

const vector unsigned char _example1 =
{ 0x00, 0x11, 0x02, 0x13,
  0x04, 0x15, 0x06, 0x17,
  0x08, 0x19, 0x0a, 0x1b,
  0x0c, 0x1d, 0x0e, 0x1f };

The above pattern performs a perfect shuffle, but on a byte level. (The term “perfect shuffle” typically refers to the interleaving of bits from two words.) The lower 4 bits of each byte can essentially be thought of as an index into the bytes of the first or second operand quadword. Similarly, the upper 4 bits can be thought of as an index into the registers referred to in the instruction’s operands. Since there are only two, we need only concern ourselves with the LSB of this 4-bit group—in other words, 0x0x (where x denotes some other value of the lower 4 bits of the byte) would index into the contents of the ra register, and 0x1x would access the second. It is worth noting that there are special case values that can be used to load constants with shufb; an interested reader can refer to [IBM08a] for details. A further example in Listing 1.8.2 will aid us in our understanding.

Listing 1.8.2. An example of using shufb

const vector unsigned char _example2 =
{ 0x00, 0x01, 0x02, 0x03,
  0x14, 0x15, 0x16, 0x17,
  0x08, 0x09, 0x0a, 0x0b,
  0x1c, 0x1d, 0x1e, 0x1f };

qword pattern = (const qword)_example2;
qword ra = si_ilhu(0x3f80); // ra contains: 1.0f, 1.0f, 1.0f, 1.0f
qword rb = si_ilhu(0x4000); // rb contains: 2.0f, 2.0f, 2.0f, 2.0f

// result contains: 1.0f, 2.0f, 1.0f, 2.0f
qword result = si_shufb(ra, rb, pattern);

In many programs, simply inlining shuffle patterns for data manipulation will suffice, but since the terminal operand to shufb is simply a register, there is nothing to stop you from computing patterns dynamically in your program or from forming them with the constant formation instructions (which should be preferred when they can be formed with lower latency than the 6-cycle load from the local store). As it turns out, dynamic shuffle pattern computation is actually critical to performing unaligned loads from the local store in a vaguely efficient manner, as we shall see later. In-depth details of the SPU ISA can be found in [IBM08a].

Local Store and Memory Flow Controller

As previously mentioned, each of the SPUs in the Cell is individually endowed with its own memory, known as its local store. The local store is (at least on current implementations of the CBE) 256 KB in size and can essentially be thought of as an L1 cache for the Synergistic Execution Unit. Data can be copied into and out of the local store by way of the DMA engine in the MFC, which resides on each SPE and operates asynchronously from the SXU. Loads and stores to and from the local store are always 16-byte aligned and sized. Hence, processing data smaller than 16 bytes requires use of a less-than-efficient load-modify-store pattern. Accesses to the local store are arbitrated by the SPU Store and Load unit (SLS) based on a priority; the DMA engine always has priority over the SXU for local store accesses.

Each DMA is part of a programmer-specified tag group. This provides a mechanism for a programmer to poll the state of the MFC to find out if a specific DMA has completed. A tag group is able to contain multiple DMAs. The tag group is denoted by a 5-bit value internally, and, as such, the MFC supports 32 distinct tag groups [Bader07]. The DMA queue (DMAQ) is 16 entries deep in current implementations of the CBE.
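
As a minimal sketch, waiting for a particular tag group to complete with the MFC functions from IBM's SDK looks something like the following (the tag value here is illustrative):

// Block until every DMA issued with tag group 'tag' has completed.
const unsigned int tag = 1;           // illustrative tag group number (0-31)
mfc_write_tag_mask(1 << tag);         // select the tag group(s) of interest
mfc_read_tag_status_all();            // stalls until all selected groups are complete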

Data Management

In many ways, the choice of data structure is more important than the efficiency of the operations that must be performed on it. In the following sections, we will describe a variety of data management strategies and their tradeoffs in the context of the SPU.

Multi-Buffering

All graphics programmers will be familiar with the concept of a double buffer. The multi-buffer is simply a term that generalizes the concept to an arbitrary number of buffers. In many cases two buffers will be sufficient, but sometimes a third buffer will be required to effectively hide the latency of transfers to and from the effective address space. Figure 1.8.3 shows the concept of multi-buffering.

Figure 1.8.3. Multi-buffering data to hide latency (modeled after [Bader07]).

Bader suggests that each buffer should use a separate tag group in order to prevent unnecessary stalling of the SPU waiting for data that will be processed sometime in the future. Barriers and fences should be used to order DMAs within a tag group and the DMA queue, respectively [Bader07]. Multi-buffering can yield significant performance increases, but it does have a downside: because the buffers are resident in the local store, SPE programs must be careful not to exceed the 256-KB limit.

Using a reasonable size for each of the buffers in your multi-buffer (about 16 KB) is a fine strategy, as it allows the SPU to process several vertices or pixels before requiring more data from the main address space. However, the pointer wrangling can become a little complicated if one's goal is to support a list of arbitrarily sized (and hence aligned) vertex formats. Conversely, when processing pixels, alignments tend to be a little more favorable and can be easily controlled by carefully selecting a reasonably sized unit of work.
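
The skeleton below sketches a double-buffered processing loop along these lines. It assumes 16-KB chunks, a hypothetical process_buffer() routine, and one tag group per buffer; a fenced get is used so that refilling a buffer is ordered after the put that drains it, per the advice above. All names and sizes are illustrative rather than taken from any particular codebase.

#include <stdint.h>
#include <spu_mfcio.h>

#define BUFFER_SIZE (16 * 1024)

extern void process_buffer(void* buffer, uint32_t size); // hypothetical work function

static char buffers[2][BUFFER_SIZE] __attribute__((aligned(128)));

void process_stream(uint64_t ea, uint32_t num_chunks)
{
    // Prime the pipeline: start fetching chunk 0 into buffer 0 (tag group 0).
    mfc_get(buffers[0], ea, BUFFER_SIZE, 0, 0, 0);

    for (uint32_t i = 0; i < num_chunks; ++i)
    {
        const uint32_t cur  = i & 1;
        const uint32_t next = cur ^ 1;

        // Kick off the transfer of the next chunk before touching this one.
        // The fence orders this get after any earlier put issued on the same
        // tag group for the same buffer.
        if (i + 1 < num_chunks)
            mfc_getf(buffers[next], ea + (uint64_t)(i + 1) * BUFFER_SIZE,
                     BUFFER_SIZE, next, 0, 0);

        // Wait only for the buffer we are about to process.
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_buffer(buffers[cur], BUFFER_SIZE);

        // Write the results back on the same tag group as the incoming transfer.
        mfc_put(buffers[cur], ea + (uint64_t)i * BUFFER_SIZE, BUFFER_SIZE, cur, 0, 0);
    }

    // Drain any outstanding transfers before returning.
    mfc_write_tag_mask((1 << 0) | (1 << 1));
    mfc_read_tag_status_all();
}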

Structure-of-Arrays versus Array-of-Structures

The design of data is paramount when hoping to write performant software for the SPU. Since the SPU is a SIMD vector processor, concepts familiar to those who have programmed with other vector ISAs, such as SSE on Intel chips, Altivec on PowerPC chips, or even the VU on the PlayStation 2, are immediately transferable to the SPU. One such concept is parallel array data layout, better known as Structure-of-Arrays (SOA). By laying data out in a format that is the transpose of its natural layout (Array-of-Structures), as can be seen in Figure 1.8.4, a programmer is often able to produce much more efficient code (most notably in those cases where vectorized data is interacting with scalar data).

Figure 1.8.4. An Array-of-Structures layout on the left is transposed into a Structure-of-Arrays layout (illustration modeled after [Tovey10]).

The benefits of using an SOA layout are substantial in a lot of common cases. Listing 1.8.3 illustrates this by way of computing the squared length of a vector.

Listing 1.8.3. Two versions of a function to calculate the squared length of a vector. The first assumes Array-of-Structures data layout, and the second Structure-of-Arrays layout.

// Version 1: AOS mode - 1 vector, ~18 cycles.
qword dot_xx    = si_fm(v, v);
qword dot_xx_r4 = si_rotqbyi(dot_xx, 4);
      dot_xx    = si_fa(dot_xx, dot_xx_r4);
qword dot_xx_r8 = si_rotqbyi(dot_xx, 8);
      dot_xx    = si_fa(dot_xx, dot_xx_r8);
return si_to_float(dot_xx);

// Version 2: SOA mode - 4 vectors, ~8 cycles.
qword dot_x = si_fm(x, x);
qword dot_y = si_fma(y, y, dot_x);
qword dot_z = si_fma(z, z, dot_y);
return dot_z;
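
Getting data into SOA form in the first place is itself a shuffle exercise. The fragment below is an illustrative sketch that transposes four AOS vertices (each quadword holding x, y, z, w) into the SOA layout of Figure 1.8.4 using static shuffle patterns; the pattern and variable names are not from any particular codebase.

// Interleave the first two words of each source quadword...
const vector unsigned char _lo_words =
{ 0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
  0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17 };
// ...and the last two words of each source quadword.
const vector unsigned char _hi_words =
{ 0x08, 0x09, 0x0a, 0x0b, 0x18, 0x19, 0x1a, 0x1b,
  0x0c, 0x0d, 0x0e, 0x0f, 0x1c, 0x1d, 0x1e, 0x1f };
// Take the first two words of ra followed by the first two words of rb...
const vector unsigned char _words_01 =
{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
  0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17 };
// ...or the last two words of each.
const vector unsigned char _words_23 =
{ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
  0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f };

// v0..v3 each hold one vertex as {x, y, z, w}.
qword tmp0 = si_shufb(v0, v1, (const qword)_lo_words); // {x0, x1, y0, y1}
qword tmp1 = si_shufb(v2, v3, (const qword)_lo_words); // {x2, x3, y2, y3}
qword tmp2 = si_shufb(v0, v1, (const qword)_hi_words); // {z0, z1, w0, w1}
qword tmp3 = si_shufb(v2, v3, (const qword)_hi_words); // {z2, z3, w2, w3}
qword x    = si_shufb(tmp0, tmp1, (const qword)_words_01); // {x0, x1, x2, x3}
qword y    = si_shufb(tmp0, tmp1, (const qword)_words_23); // {y0, y1, y2, y3}
qword z    = si_shufb(tmp2, tmp3, (const qword)_words_01); // {z0, z1, z2, z3}
qword w    = si_shufb(tmp2, tmp3, (const qword)_words_23); // {w0, w1, w2, w3}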

Branch-Free DMAs

The cost of badly predicted branches on the SPU is quite significant. Given that the SPU does not contain any dedicated branch prediction hardware[2], the burden of responsibility falls squarely on the shoulders of the programmer (or in the majority of cases, the compiler). There are built-in language extensions available in most SPU compilers that allow the programmer to supply branch hints, but such things assume that you have sufficient time in order to make the prediction (that is, more than 11 cycles) and that the branch is intrinsically predictable, which may not be the case. It is therefore recommended that programmers avoid branches entirely [Acton08]. Others have discussed this topic at length [Acton08, Kapoulkine09], so I will refrain from doing so here; however, I do wish to touch upon one common case where branch avoidance is not entirely obvious but is entirely trivial.

IBM’s SDK provides several MFC functions to initiate DMA without resorting to the manual writing of registers[3]. An unfortunate side effect of such functions is that they seem to actively encourage code such as that presented in Listing 1.8.4.

Listing 1.8.4. All-too-often encountered code to avoid issuing unwanted DMAs

if (si_to_uint(counter) > 0)
    mfc_put(si_to_uint(lsa),
            si_to_uint(ea),
            si_to_uint(size),
            si_to_uint(tag));

However, a little knowledge of the MFC can help avoid the branch in this case. The MFC contains the DMA queue (DMAQ). This queue contains SPU-initiated commands to the MFC’s DMA engine. Similar to a CPU or GPU, the MFC supports the concept of a NOP. A NOP is an operation that can be inserted into the DMAQ but doesn’t result in any data being transferred. A NOP for the MFC is denoted by any DMA command being written that has zero size. The resulting code looks something like Listing 1.8.5.

Listing 1.8.5. Branch-free issue of DMA

qword cmp_mask = si_cgti(counter, 0x0);
qword cmp      = si_andi(cmp_mask, 0x1); // bottom bit only.
qword dma_size = si_mpy(size, cmp);      // size < 2^16
mfc_put(si_to_uint(lsa),
        si_to_uint(ea),
        si_to_uint(dma_size),
        si_to_uint(tag));

Unfortunately, the hardware is not smart enough to discard zero-sized DMA commands immediately upon the command register being written, and these commands are inserted into the 16-entry DMAQ for processing. The entry is only discarded when the DMA engine attempts to process that element of the queue. This gives the technique a subtle downside when used for branch avoidance: SPE programs that issue a lot of DMAs can quickly back up the DMAQ, and issuing a zero-sized DMA can stall the SPU while it flushes the entire DMAQ. Luckily, this state of affairs can be almost entirely mitigated by a well-designed SPE program, which issues fewer, but larger, DMAs.

Vertex/Geometry Shading

The SPUs can also lend a hand in various vertex processing tasks and, because of their general nature, can help overcome some of the shortcomings of the GPU programming model. In Blur, we were able to use the SPU to deal with awkward vertex sizes and to optimize the vehicle damage system.

Handling Strange Alignments When Multi-Buffering

Vertex data comes in all shapes and sizes, and, as a result, multi-buffering this type of data presents some challenges. When vertex buffers are created, contiguous vertices are packed tightly together in the buffer to both save memory and improve the performance of the pre-transform cache on the GPU. This presents an SPU programmer with a challenge when attempting to process buffers whose per-vertex alignment may not be a multiple of 16 bytes. This is a problem for two reasons. First, the DMA engine in the MFC transfers 1, 2, 4, 8, or multiples of 16 bytes, meaning that we must be careful not to overwrite parts of the buffer that we do not mean to modify. Second, loads and stores performed by the SXU itself are always 16-byte aligned [IBM08].

There are a lot of cases where a single vertex will straddle the boundary of two multi-buffers, due to vertex structures that have alignments that are sub-optimal from an SPU processing point of view. The best way of coding around this problem is to simply copy the end of a multi-buffer to its nearest 16-byte boundary into the start of the second multi-buffer and offset the pointer to the element you are currently processing. This means that when the second multi-buffer is transferred back to the main address space, it will not corrupt the vertices you had previously processed and transferred out of the first multi-buffer, as shown in Figure 1.8.5. Listing 1.8.6 contains code demonstrating how to handle unaligned loads from the local store.
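
A minimal sketch of the hand-over described above follows; the function and variable names, and the use of memcpy, are illustrative assumptions rather than the actual implementation.

#include <stdint.h>
#include <string.h>

// 'bytes_used' is the offset of the first unprocessed vertex in the current
// multi-buffer; everything below that offset has already been processed.
// Returns a pointer to the first vertex to process in the next buffer.
static inline uint8_t* carry_straddling_bytes(const uint8_t* current_buffer,
                                              uint8_t*       next_buffer,
                                              uint32_t       bytes_used)
{
    const uint32_t aligned_end = bytes_used & ~15u;        // snap down to a 16-byte boundary
    const uint32_t straddle    = bytes_used - aligned_end; // processed bytes past that boundary

    // The next buffer's DMA begins at the 16-byte boundary, so seed its start
    // with the already-processed bytes; writing it back to the effective
    // address space then cannot clobber vertices processed in the previous buffer.
    memcpy(next_buffer, current_buffer + aligned_end, straddle);
    return next_buffer + straddle;
}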

Figure 1.8.5. Avoid buffer corruption by copying a small chunk from the end of one multi-buffer into the start of another.

Case Study: Car Damage in Blur

The car damage system in Blur works by manipulating a lattice of nodes that roughly represent the volume of the car. The GPU implementation makes use of a volume texture containing vectors representing the offset of these nodes’ positions from their original positions. This is then sampled based on the position of a vertex being processed relative to a volume that loosely represents the car in order to calculate position and normal offsets (see Figure 1.8.6). The texture is updated each time impacts are applied to the lattice, or when the car is repaired.

Figure 1.8.6. Position and normal offsets are applied to each vertex based on deltas stored in a volume texture.

The GPU performs the deformation every frame because the damage is stateless and a function of the volume texture and the undamaged vertex data. Given the amount of work involved and the additional performance hit from sampling textures in the vertex unit, the performance of rendering cars in Blur was heavily vertex limited. This was something we wanted to tackle, and the SPUs were useful in doing so. Porting the entire vertex shader to the SPU was not practical given the timeframe and memory budgets, so instead we focused on moving just the damage calculations to the SPUs. This meant that the car damage vertex processing would only occur when damage needed to be inflicted on the car (instead of every frame with the equivalent GPU implementation), and it would greatly reduce the complexity of the vertex shader running on the GPU.

The damage offsets are a function of the vertex's position and the state of the node lattice. Given the need for original position, we must transfer the vertex data for the cars to the local store via DMA and read the position data corresponding to each vertex. This is done using a multi-buffering strategy. Because different components of the car utilize different materials (and hence have different vertex formats), we were also forced to deal with a variety of vertex alignments, as described earlier. With the vertex data of the car in the SPU local store, we are able to calculate a position and normal offset for each vertex and write these out to a separate vertex buffer. Each of these values is stored as a float4, which means the additional vertex stream has a stride of 32 bytes per vertex. An astute GPU programmer will notice the potential to pack this data into fewer bits to improve cache utilization. This is undesirable, however. The data in its 32-bytes-per-vertex form is ideal for the DMA engine because the MFC natively works in 16-byte chunks, meaning from the point of view of other processing elements (in our case, the GPU), a given vertex is either deformed or it is not. This is one of the tradeoffs made to mitigate the use of a double buffer. Color Plate 5 has a screenshot of this technique.

To GPU Types and Back Again

For the most part, GPUs do their best to support common type formats found in CPUs. The IEEE754 floating-point format is (for better or worse) the de facto floating-point standard on pretty much all modern hardware that supports floating point[4].

However, in addition to the IEEE754 standard 32-bit floats and 64-bit doubles, most shading languages offer a 16-bit counterpart known as half. The format of the half is not defined by any standard, and, as such, chip designers are free to implement their own floating-point formats for this data type on their GPUs. Fortunately, almost all GPU vendors have adopted the half format formalized by Industrial Light & Magic for their OpenEXR HDR file format [ILM09]. This format uses a single bit to denote the sign of the number, 5 bits for the exponent, and the remaining 10 bits for its mantissa or significand.

Since the half type is regrettably absent from the C99 and C++ standards, it falls to the programmer to write routines to convert to other data types. Acton has made available an entirely branch-free version of these conversion functions at [Acton06]. For the general case, you would be hard-pressed to better Acton's code (assuming you don't have the memory for a lookup table as in [ILM09]). However, in many constrained cases, we have knowledge about our data that allows us to omit support for floating-point special cases that require heavyweight conversion logic (NaNs and de-normalized numbers). Listing 1.8.6 contains code to convert between an unaligned half4 and float4 but omits support for NaNs. This is an optimization that was employed in Blur's damage system. The inverse of this function is left as an exercise for the reader.
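
For a normalized value, the half encodes (-1)^sign x 2^(exponent - 15) x (1 + mantissa/1024), whereas a 32-bit float uses an 8-bit exponent biased by 127 and a 23-bit mantissa. Ignoring the special cases, the conversion is therefore just a matter of widening the fields and re-biasing the exponent by 127 - 15 = 112; the expo_bias constant in Listing 1.8.6 (0x3800 loaded into the upper halfword of each word, in other words 112 << 23) is exactly this re-bias positioned in the float's exponent field.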

Listing 1.8.6. Code to convert an unaligned half4 to a qword

static inline const qword ld_float16_4(void * __restrict__ addr)
{
     // Pattern to expand the four packed halves into the low halfword of each
     // 32-bit word (0x80 in a shuffle pattern yields a 0x00 byte).
     const vector unsigned char _loader =
     { 0x80, 0x80, 0x00, 0x01,
       0x80, 0x80, 0x02, 0x03,
       0x80, 0x80, 0x04, 0x05,
       0x80, 0x80, 0x06, 0x07 };
     // Identity pattern; biased by the load address below to form a dynamic
     // shuffle that rotates the unaligned data to the front of the quadword.
     const vector unsigned char _shft =
     { 0x00, 0x01, 0x02, 0x03,
       0x04, 0x05, 0x06, 0x07,
       0x08, 0x09, 0x0a, 0x0b,
       0x0c, 0x0d, 0x0e, 0x0f };

     qword target           = si_from_ptr(addr);
     qword val_lo           = si_lqd(target, 0x00); // quadword holding the start of the half4
     qword val_hi           = si_lqd(target, 0x10); // next quadword, in case the half4 straddles
     qword sign_bit_mask    = si_ilhu(0x0);
           sign_bit_mask    = si_iohl(sign_bit_mask, 0x8000); // 0x00008000 in each word
     qword mant_bit_mask    = si_ilhu(0x0);
           mant_bit_mask    = si_iohl(mant_bit_mask, 0x7fff); // 0x00007fff in each word
     qword expo_bias        = si_ilhu(0x3800);                // (127 - 15) << 23
     qword loader           = (const qword)_loader;
     qword shft             = (const qword)_shft;
     qword offset           = si_andi(target, 0x0f);          // misalignment within the quadword
     qword lo_byte_pat      = si_ilh(0x0303);
     qword offset_pat       = si_shufb(offset, offset, lo_byte_pat); // splat the offset byte
     qword mod_shuf         = si_a(shft, offset_pat);         // dynamic shuffle pattern
     qword val              = si_shufb(val_lo, val_hi, mod_shuf); // data rotated to the front
     qword result           = si_shufb(val, val, loader); // aligned
     qword sign_bit         = si_and(result, sign_bit_mask);
           sign_bit         = si_shli(sign_bit, 0x10);      // sign into bit 31
     qword significand      = si_and(result, mant_bit_mask);
           significand      = si_shli(significand, 0xd);    // exponent/mantissa into float position
     qword is_zero_mask     = si_cgti(significand, 0x0);    // don't re-bias a zero value
           expo_bias        = si_and(is_zero_mask, expo_bias);
     qword exponent_bias    = si_a(significand, expo_bias); // re-bias the exponent
     qword final_result     = si_or(exponent_bias, sign_bit);
     return final_result;
}

Benefits versus Drawbacks

Processing vertex data on the SPUs has a number of advantages; one of the most significant is that the rigidity of the GPU’s processing model is largely circumvented as you are performing processing on a general-purpose CPU. Access to mesh topology is supported, but one must be careful that these accesses do not introduce unwanted stalls as the data is fetched from the main address space. In addition, since we are using a CPU capable of general-purpose program execution, we are able to employ higher-level optimization tactics, such as early outs or faster code paths, which would be tricky or impossible under the rigid processing model adopted by GPUs. The ability to split workloads between the SPUs and the GPU is also useful in striking the ideal balance for a given application.

As with most things in graphics programming, there are some tradeoffs to be made. Vertex processing on the SPU can in many cases require that vertex buffers are double buffered, meaning a significantly increased memory footprint. The situation is only aggravated if there is a requirement to support multiple instances of the same model. In this case, each instance of the base model may also require a double buffer. This can be mitigated to some extent by carefully designing the vertex format to support atomic writes of individual elements by the DMA engine, but the practicality of this is highly application-specific and certainly doesn’t work in the case of instances. Clever use of a ring buffer can also solve this problem to some extent, but it introduces additional problems with SPU/GPU inter-processor communication.

Fragment Shading

Fragment shading in the traditional sense is heavily tied to the output of the GPU’s rasterizer. Arbitrarily “hooking into” the graphics pipeline to have the SPUs perform general-purpose fragment shading with current generations of graphics hardware is effectively impossible. However, performing the heavy lifting for certain types of fragment shading that do not necessarily require the use of the rasterizer, or even helping out the GPU with some pre-processing as in [Swoboda09], is certainly feasible and in our experience has yielded significant performance benefits in real-world applications [Tovey10]. This section discusses some of the techniques that will help you get the most out of the SPUs when shading fragments.

Batch! Batch! Batch!

It might be tempting with initial implementations of pixel processing code on the SPU to adopt the approach of video hardware, such as the RSX. RSX processes pixels in groups of four, known as quads [Möller08]. For sufficiently interleavable program code—in other words, program code that contains little dependency between operations that follow one another—this may be a good approach. However, in our experience, larger batches can produce better results with respect to pixel throughput because there is a greater volume of interleavable operations. Too few pixels result in large stalls between dependent operations, time that could be better spent performing pixel shading, while overly large batches cause high register pressure and ultimately spilling. Moreover, in many applications that have a fixed setup cost for each processing batch, you are doing more work for little to no extra setup overhead.

So, what is the upper bound on the number of pixels to process in a single batch of work? Can we simply process the entire buffer at once? The answer to this is not obvious and depends on a number of factors, including the complexity of your fragment program and the number of intermediate values that you have occupying registers at any one time. Typically, the two are inextricably linked.

As mentioned earlier, the SXU contains 128 registers, each 16 bytes in size. It is the task of the compiler to multiplex all live variables in your program onto a limited register file[5]. When there are more live variables than there are registers—in other words, when register pressure is high—the contents of some or all of the registers have to be written back to memory (on the SPU, the stack in the local store) and restored later. This is known as spilling registers. The more pixels one attempts to process in a batch, the higher the register pressure for that function will be, and the likelihood that the compiler will have to spill registers back to the stack becomes greater. Spilling registers can become very expensive if done to excess. The optimum batch size is hence the largest number of pixels that one can reasonably process without spilling any registers back to the local store and without adding expense to the setup code for the batch of pixels.

Pipeline Balance Is Key!

An efficient, well-written program will be limited by the number of instructions issued to the processor. Those processors with dual-issue capabilities, such as the SPU, have the potential to dramatically decrease the number of cycles that a program consumes. Pipeline balance between the odd and even execution pipelines is critical to achieving good performance with SPU programs. We will now discuss the requirements for instruction dual-issue and touch briefly on techniques to maximize instruction issue (through dual-issue) for those programmers writing in assembly.

The SPU can dual-issue instructions under a very specific set of circumstances. Instructions are fetched in pairs from two very small instruction buffers [Bader07], and the following must all be true if dual-issue is to occur:

  • The instructions in the fetch group must be capable of dispatch to separate execution pipelines.

  • The alignment of the instructions must be such that the even pipeline instruction occupies an even-aligned address in the fetch group, and the odd pipeline instruction occupies an odd-aligned address.

  • Finally, there must be no dependencies either between the two instructions in the fetch group or between any one of the instructions in the fetch group and another instruction currently being executed in either of the pipelines.

Programmers writing code with intrinsics rarely need to worry about instruction alignment. The addition of nops and lnops in intrinsic form does not typically help the compiler to better align your code for dual-issue, and, in many cases, the compiler will do a reasonable job of instruction balancing. However, if you’re programming in assembly language, the use of nop (and its odd-pipeline equivalent, lnop) will be useful in ensuring that code is correctly aligned for dual-issue. Of course, care must be taken not to overdo it and actually make the resulting code slower. A good rule of thumb is never to insert more than two nops/lnops.
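
For those writing assembly, a small illustrative fragment (not taken from any production code) might pair the pipelines like this:

# Even/odd pairs arranged so each fetch group can dual-issue.
fa      $10, $4, $5     # even pipeline: floating add
lqd     $11, 0($3)      # odd pipeline: load, can issue in the same cycle as the fa
fm      $12, $6, $7     # even pipeline: independent floating multiply
lnop                    # odd pipeline: padding keeps the following pairs aligned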

Case Study: Light Pre-Pass Rendering in Blur

Light pre-pass rendering is a variant of deferred shading first introduced by Wolfgang Engel on his blog [Engel08] and later in [Engel09, Engel09a]; around the same time, it was derived independently by Balestra et al. for use in Uncharted: Drake's Fortune [Balestra08]. The techniques behind light pre-pass rendering are well understood and are discussed elsewhere [Engel08, Balestra08, Engel09, Engel09a, Tovey10], so a brief summary will suffice here.

As with all deferred rendering, the shading of pixels is decoupled from scene complexity by rendering out “fat” frame buffers for use in an image space pass [Deering88, Saito90]. Light pre-pass rendering differs slightly from traditional deferred shading in that only the data required for lighting calculations is written to the frame buffer during an initial rendering pass of the scene. This has several advantages, including a warm Z-buffer and a reduced impact on bandwidth requirements, at the expense of rendering the scene geometry twice.

Because one of the main requirements for the new engine written for Blur was that it should be equipped to handle a large number of dynamic lights, the light pre-pass renderer was a very attractive option. After implementing a light pre-pass renderer for Blur (which ran on the RSX), it became apparent that we could get significant performance gains from offloading the screen-space lighting pass to the SPUs[6].

The lighting calculations in Blur are performed on the SPU in parallel with other non-dependent parts of the frame. This means that as long as we have enough rendering work for the RSX, the lighting has no impact on the latency of a frame. Processing of the lighting buffer is done in tiles, the selection of which is managed through the use of the SPE's atomic unit. Once the tiles have been processed, the RSX is free to access the lighting buffer during the rendering of the main pass. The results of our technique are shown in Color Plate 6 and discussed in greater detail in [Swoboda09, Tovey10].
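
As an illustration of the tile selection, the sketch below claims the next tile index by atomically incrementing a counter in the main address space using the SPE's atomic unit. The routine and counter layout are assumptions; the getllar/putllc intrinsics come from the SDK's spu_mfcio.h, and the status macro name is assumed to follow that header's conventions.

#include <stdint.h>
#include <spu_mfcio.h>

// Atomically claim the next lighting tile. 'counter_ea' is assumed to be the
// 128-byte-aligned effective address of a 32-bit tile counter.
static uint32_t claim_next_tile(uint64_t counter_ea)
{
    static volatile uint32_t line[32] __attribute__((aligned(128))); // one reservation granule
    uint32_t tile;
    do
    {
        mfc_getllar(line, counter_ea, 0, 0);  // load the line and set a reservation
        mfc_read_atomic_status();             // consume the getllar status
        tile = line[0];
        line[0] = tile + 1;                   // the next SPU to succeed gets the next tile
        mfc_putllc(line, counter_ea, 0, 0);   // conditional store; fails if the reservation was lost
    } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS); // retry on failure (macro name assumed)
    return tile;
}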

Benefits versus Drawbacks

The SPUs are powerful enough to perform fragment processing. This has been demonstrated by developers with deferred shading, post-processing, and so on [Swoboda09, van der Leeuw09, Tovey10]. While general-purpose fragment shading is not possible, it is possible to perform a plethora of image-space techniques on the SPUs, including motion blur, depth of field, shadowing, and lighting. Parallelization with other non-related rendering work on the GPU can provide an extra gain if one’s goal is to minimize frame latency. Such gains can even be made without the expense of an increased memory footprint.

Rasterization on the SPUs has been achieved by a number of studios with good results, but the use cases for this technique are somewhat restricted, usually being reserved for occlusion culling and the like rather than general-purpose rendering. Rasterization aside, the most serious drawback to performing fragment shading on the SPUs is the lack of dedicated texture-mapping hardware. Small textures may be feasible, as they will fit in the limited local store, but for larger textures or multiple textures, software caching is currently considered to be the best approach [Swoboda09].

Further Work

Due to the highly flexible nature of the SPUs in augmenting the processing power of the GPU, it is hard to suggest avenues of further work with any certainty. However, there are a few significant challenges that warrant additional research efforts in order to further improve the feasibility of some graphics techniques on the SPUs.

Texture mapping is one such avenue of research. Currently, the best that has been done is the use of a good software cache [Swoboda09] to try and minimize the latency of texture accesses from the SPUs. Taking inspiration from other convergent architectures, namely Intel Larrabee [Seiler08], we believe that the employment of user-level threads on the SPUs as a mechanism for hiding latency could certainly go some way to helping the prohibitively slow texture access speeds currently endured by graphics programmers seeking to help the GPU along with the SPUs. Running two to four copies of the same SPU program (albeit with offline modifications to the program's byte code) could allow a programmer to trade space in the local store for processing speed. The idea is simple: Each time a DMA is initiated, the programmer performs a lightweight context-switch to another version of the program residing in the local store, which can be done cheaply if the second copy does not make use of the same registers. The hope is that by the time we return to the original copy, the data we requested has arrived in the local store, allowing us to process it without delay. Such a scheme would impose some limitations but could be feasible for small stream kernels, such as shaders.

Conclusion

The SPUs are fast enough to perform high-end vertex and fragment processing. While they are almost certainly not going to beat the GPU in a like-for-like race (in other words, the implementation of a full graphics pipeline), they can be used in synergy with the GPU to supplement processing activities traditionally associated with rendering. The option to split work between the two processing elements makes them great tools for optimizing the rendering of specific objects in a scene. The deferred lighting and car damage systems in Blur demonstrate the potential of the SPUs to work harmoniously with the GPU to produce impressive results.

Looking to the future, the ever-growing popularity and prevalence of deferred rendering techniques in current generations of hardware further empower the SPUs to deliver impressive improvements to the latency of a frame and allow game developers to get closer to synthesizing reality than ever before.

Acknowledgements

I would like to thank the supremely talented individuals of the Bizarre Creations Core Technologies Team for being such a great bunch to work with, with special thanks reserved for Steve McAuley for being my partner in crime with our SPU lighting implementation. Thanks also go to Andrew Newton and Neil Purvey at Juice Games for our numerous discussions about SPU coding, to Matt Swoboda of SCEE R&D for our useful discussions about SPU-based image processing, and to Wade Brainerd of Activision Central Technology for his helpful comments, corrections, and suggestions. Last but not least, thanks also to Jason Mitchell of Valve for being an understanding and knowledgeable section editor!

References



[1] The fsmbi instruction is also very useful for efficiently constructing a selection mask for use with selb.

[2] The SXU adopts the default prediction strategy that all branches are not taken.

[3] SPU-initiated DMAs are performed by writing special-purpose registers in the MFC using the wrch instruction. There are six such registers that must be written in order to initiate a DMA. These may be written in any order, as long as the command register is written last [IBM08].

[4] Ironically, the SPUs do not offer full IEEE754 support, but it’s very close.

[5] The process of mapping multiple live variables onto a limited register file is known as register coloring. Register coloring is a topic in its own right, and we will not cover it in detail here.

[6] Coincidentally, it was around this time that Matt Swoboda presented his work in a similar area, in which he moved a fully deferred renderer to the SPUs in [Swoboda09]; Matt’s work and willingness to communicate with us was useful in laying the ground work for our implementation in Blur.
