sure the input buffer always has extra space at the end, then we can avoid the
extra buffer and the copying. For example, we can define the sorter's interface
to require such padding, or we can take an extra input argument that gives the
actual size of the input buffer and copy only when no extra room has been
reserved.
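As a sketch of the latter option: the caller reports the buffer's true
capacity, and the sorter falls back to a scratch copy only when no padding was
reserved. The names SortFloats, SortWithPadding, and kPadding are illustrative
stand-ins rather than the sample code's, and std::sort substitutes for the
radix sort so that the sketch is self-contained.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr size_t kPadding = 16;  // assumed space for end-of-stream markers

// Stand-in for the padded radix sorter described in the text.
void SortWithPadding(float* buf, size_t count)
{
    std::sort(buf, buf + count);
}

// Sort 'count' floats held in a buffer of 'capacity' elements. If the
// caller reserved at least kPadding elements of slack, sort in place;
// otherwise copy to a padded scratch buffer and back.
void SortFloats(float* data, size_t count, size_t capacity)
{
    if (capacity >= count + kPadding) {
        SortWithPadding(data, count);                      // fast path
    } else {
        std::vector<float> scratch(count + kPadding);
        std::memcpy(scratch.data(), data, count * sizeof(float));
        SortWithPadding(scratch.data(), count);
        std::memcpy(data, scratch.data(), count * sizeof(float));
    }
}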
Our merge phase is very cache friendly since all the memory operations are
sequential. The sorted streams are accessed sequentially, and the output is
streamed out from the merger. The merger is small since it only needs space to
store one item from each stream. The size of the tree is 2 * kNumStreams and
therefore fits easily into the L1 cache. One can even consider keeping the
merger entirely within registers for maximum speed. For small datasets, our
sort is 2.1 to 3.5 times faster than STL sort; the relative disadvantage of
quicksort shrinks when the dataset is smaller and fits nicely into the caches.
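To make the footprint concrete, here is a minimal loser-tree merger in the
spirit of the one described above. It assumes kNumStreams is a power of two
and pads exhausted streams with a +infinity sentinel; it is an illustrative
sketch, not the implementation from the sample code.

#include <limits>
#include <utility>

constexpr int   kNumStreams = 8;   // assumed power of two
constexpr float kSentinel   = std::numeric_limits<float>::infinity();

struct Stream { const float* cur; const float* end; };

class LoserTree
{
public:
    // Play all matches bottom-up. The persistent state is only the
    // per-stream keys and loser indices, so it stays L1-resident.
    void Init(const Stream* streams)
    {
        for (int i = 0; i < kNumStreams; ++i) {
            st_[i]  = streams[i];
            key_[i] = (st_[i].cur < st_[i].end) ? *st_[i].cur : kSentinel;
        }
        int winner[2 * kNumStreams];
        for (int i = 0; i < kNumStreams; ++i)
            winner[kNumStreams + i] = i;                  // leaves
        for (int n = kNumStreams - 1; n > 0; --n) {
            int a = winner[2 * n], b = winner[2 * n + 1];
            winner[n] = (key_[a] <= key_[b]) ? a : b;     // winner moves up
            loser_[n] = (key_[a] <= key_[b]) ? b : a;     // loser stays put
        }
        loser_[0] = winner[1];                            // overall winner
    }

    // Pop the current minimum, refill from its source stream, and
    // replay only the matches on the leaf-to-root path: O(log k).
    float Pop()
    {
        int w = loser_[0];
        float v = key_[w];
        if (st_[w].cur < st_[w].end)
            ++st_[w].cur;
        key_[w] = (st_[w].cur < st_[w].end) ? *st_[w].cur : kSentinel;
        for (int t = (w + kNumStreams) / 2; t > 0; t /= 2)
            if (key_[loser_[t]] < key_[w])
                std::swap(w, loser_[t]);
        loser_[0] = w;
        return v;
    }

private:
    int    loser_[kNumStreams];   // loser_[0] holds the overall winner
    float  key_[kNumStreams];     // current head item of each stream
    Stream st_[kNumStreams];
};

Merging is then a single loop that calls Pop() once per input item and writes
the result to the output buffer, which is exactly the sequential access
pattern described above.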
The loser tree might not be the best approach if you are writing GPU-based
applications. An even-odd or bitonic merge network (sketched below) is
probably a better way to
exploit the wide SIMD parallelism. That being said, the merge phase is only a
small fraction of the total processing time (~25%). The sorter is encapsulated in a
class to hide details, like the stream markers and substream sizes, and to reuse
temporary buffers for multiple sorts.
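Returning to the merge networks mentioned above, the following scalar sketch
merges two sorted four-element runs with a bitonic network. Every
compare-exchange within a stage touches a disjoint pair, which is what maps
well onto wide SIMD or GPU lanes; this only illustrates the network's
structure and is not GPU code.

#include <algorithm>

// Compare-exchange: after the call, a <= b.
static inline void CmpSwap(float& a, float& b)
{
    if (b < a) std::swap(a, b);
}

// Merge two ascending 4-element runs, v[0..3] and v[4..7]. Reversing
// the second run makes the whole array bitonic; three stages of
// independent compare-exchanges (strides 4, 2, 1) then sort it.
void BitonicMerge8(float v[8])
{
    std::reverse(v + 4, v + 8);
    for (int stride = 4; stride > 0; stride /= 2)
        for (int i = 0; i < 8; ++i)
            if ((i & stride) == 0)
                CmpSwap(v[i], v[i | stride]);
}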
28.4 Multicore Implementation
The sorter is refactored to expose the initialization, substream sort, and merge as
separate methods. In the sample code on the website, the initialization function is
called from the main thread before firing off the worker threads, each of which
sorts one of the substreams. Some care is taken to create the threads up front and
to use auto-reset events to reduce the overhead of thread synchronization. The
class is carefully designed to leave the threading logic outside of the sorter
without impacting performance. The merging is performed on the main thread.
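A minimal sketch of how such an interface can be driven, assuming methods
named Init, SortSubstream, and Merge; unlike the sample code, it spawns
threads per sort rather than reusing persistent workers, and std::sort and
std::inplace_merge stand in for the radix sort and the loser-tree merge.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

class Sorter
{
public:
    // Split the input into numSubstreams contiguous ranges.
    void Init(float* data, size_t count, int numSubstreams)
    {
        data_ = data;
        bounds_.clear();
        for (int i = 0; i <= numSubstreams; ++i)
            bounds_.push_back(count * i / numSubstreams);
    }
    // Substreams are independent, so each is safe to sort on a worker.
    void SortSubstream(int i)
    {
        std::sort(data_ + bounds_[i], data_ + bounds_[i + 1]);
    }
    // Sequential merge of the sorted substreams.
    void Merge()
    {
        for (size_t i = 2; i < bounds_.size(); ++i)
            std::inplace_merge(data_, data_ + bounds_[i - 1],
                               data_ + bounds_[i]);
    }
private:
    float* data_ = nullptr;
    std::vector<size_t> bounds_;
};

void ParallelSort(Sorter& sorter, float* data, size_t count, int numThreads)
{
    sorter.Init(data, count, numThreads);        // on the main thread
    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i)
        workers.emplace_back([&sorter, i] { sorter.SortSubstream(i); });
    for (std::thread& t : workers)
        t.join();                                // wait for all substreams
    sorter.Merge();                              // merge on the main thread
}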
The data set used for our performance tests consists of 0.89 million floating-
point values that happen to be the x coordinates for the Stanford Dragon. The
one-stream times in Table 28.1 show that our threading code is efficient, and
only about 0.12 to 0.17 ms of overhead is introduced. One can see the per-core
scalability is not great on the Q6600: using four threads only doubles the
speed of the sort phase. This is somewhat understandable since radix sort is
very bandwidth hungry. Keep in mind we are sorting substreams that have been
spliced. If
we directly sort the whole input array using a single radix-256 sort, then the
runtime is 28 ms. The serial merge costs 5 to 6 ms, which gives us a 65 percent
improvement. Most of that improvement can be had with as little as two cores.
For a point of reference, STL sort runs in 76 ms, and while using four threads,