440 28.ACache‐AwareHybridSorter
are targeting the L2 cache, then we might have to reserve a little more space in
L2 since the set-associative mechanism is not perfect.
For the counting phase, the only critical data that should be in the cache is
the counts table. Our use of radix-256 implies a table size of
256 *
sizeof(int)
, which is 1 kB or 2 kB. For a GPU-based sort that has limited
shared memory, one might consider a lower radix and a few extra passes.
28.2SubstreamSorting
At first glance, once we have partitioned the input, we can just use the “best” se-
rial sort within each partition. Our interest is in first developing a good usable
serial algorithm that is well suited for parallel usage. In a parallel context, we
would like each thread to finish in roughly the same time given a substream of
identical length. Keep in mind that one thread can stall the entire process since
other threads must wait for it to finish before the merge phase can commence.
Quicksort’s worst-case behavior can be
2
On
and is data dependent. Radix sort
is
On, and its run time is not dependent on the data.
Our radix sorter is a highly optimized descendent of Herf’s code, which is
built on top of Terdiman’s. It is a radix-256 least-significant-bit sort. The main
difficulty with using radix sort on game data is handling floating-point values.
An IEEE floating-point value uses a sign-exponent-mantissa format. The sign bit
makes all negative numbers appear to be larger than positive ones. For example,
2.0 is
0x40000000, and 2.0
is 0xC0000000 in hexadecimal, while 4.0
is
0xC0800000, which implies 4.0 2.0 2.0
, just the opposite of what we want.
The key idea is Herf’s rule for massaging floats:
1. Always invert the sign bit.
2. If the sign bit was set, then invert the exponent and mantissa too.
Rule 1 turns 2.0 into
0xC0000000, 2.0
into 0x40000000, and 4.0
into
0x40800000, which implies 2.0 4.0 2.0 . This is better, but still wrong. Ap-
plying rule 2 turns 2.0
into 0x3FFFFFFF and 4.0
into 0x3F7FFFFF, giving
2.0 2.0 4.0 , which is the result we want. This bit hack has great implica-
tions for radix sort since we can now treat all the digits the same. Interested read-
ers can compare Terdiman’s and Herf’s code to see how much this simple hack
helps to simplify the code.
The code in Listing 28.2 is taken from Herf’s webpage. Like Herf’s code, we
build all four histograms in one pass. If you plan on sorting a large number of
items and you are running on the CPU, you can consider Herf’s radix-2048 opti-