28.3 Stream Merging and Loser Tree
Our substreams are now sorted, and all we need is to merge them into one output stream. The most common approach is to use a priority queue (PQ). We insert the head of each substream into the PQ together with the stream_id. Then we remove the “min-member” and output it. We then take the next item from the substream in which the min-member was located, and insert it into the PQ. Finally, we re-establish the heap property and repeat. This approach was tried using one of the best available heap-based PQs, and the result was poor.
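For reference, the following is a minimal sketch of that priority-queue merge, not the sample code itself; the HeapEntry type and MergeWithPriorityQueue() are hypothetical names, and std::greater is used so the smallest key is popped first.

#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper type for illustration only.
struct HeapEntry
{
    float key;      // current head of the substream
    int stream_id;  // which substream it came from
    bool operator>(const HeapEntry& other) const { return (key > other.key); }
};

// Merge sorted substreams with a binary heap (the slower approach).
void MergeWithPriorityQueue(const std::vector<std::vector<float>>& substreams,
                            std::vector<float>& output)
{
    std::priority_queue<HeapEntry, std::vector<HeapEntry>,
                        std::greater<HeapEntry>> pq;
    std::vector<size_t> cursor(substreams.size(), 0);

    // Insert the head of each substream together with its stream_id.
    for (int s = 0; s < (int) substreams.size(); ++s)
        if (!substreams[s].empty())
            pq.push({substreams[s][0], s});

    while (!pq.empty())
    {
        HeapEntry top = pq.top();   // the "min-member"
        pq.pop();
        output.push_back(top.key);

        // Pull the next item from the substream the min-member came from.
        size_t next = ++cursor[top.stream_id];
        if (next < substreams[top.stream_id].size())
            pq.push({substreams[top.stream_id][next], top.stream_id});
    }
}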
The best approach we found is to utilize an old idea reported by Knuth [1973] called the loser tree.¹ A loser tree is a kind of tournament tree where pairs of players engage in rounds of a single-elimination tournament. The loser is kept in the tree while the winner moves on, in a register. Our tree node consists of a floating-point key and a payload of stream_id, as shown in Listing 28.3. The loser tree is stored as a linearized complete binary tree, which enables us to avoid storing pointers. Navigating up and down the tree is done by shifts and adds.
The loser tree is initialized by inserting the head of each substream, as shown in Listing 28.4. At the end of this, the winner literally rises to the top. We remove the winner and output the key to the output buffer, since this is the smallest among the heads of the sorted substreams. The winner node also carries the stream_id from which it came. We use that to pull the next item from the stream to take the place of the last winner. A key idea of a loser tree is that the new winner can only come from matches that the last winner had won—i.e., the path
typedef float Key;  // key type used throughout the loser tree

struct Node
{
    Key m_key;    // our key
    int m_value;  // the substream index 'm_key' is from
};
// Returns our parent's index:
int Parent(int index) { return (index >> 1); }
int Left(int index) { return (index << 1); }
int Right(int index) { return ((index << 1) + 1); }
Listing 28.3. Data structures and abstractions for a loser tree.
¹ For an excellent graphical illustration of a loser tree in action, please refer to the course notes by Dr. Thomas W. Bennet, available at http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html.
// Build the loser tree after populating all the leaf nodes:
int InitWinner(int root)
{
if (root >= kStreams)
{
// leaf reached
return (root); // leaf index
}
else
{
int left = InitWinner(root * 2);
int right = InitWinner(root * 2 + 1);
Key lk = m_nodes[left].m_key;
Key rk = m_nodes[right].m_key;
if (lk <= rk)
{
// right subtree loses
m_nodes[root] = m_nodes[right]; // store loser
return (left); // return winner
}
else
{
m_nodes[root] = m_nodes[left];
return (right);
}
}
}
Listing 28.4. Initialization for a loser tree.
from the root of the tree to the original position of the winner at the bottom. We repeat those matches with the new item, and a new winner emerges, as demonstrated in Listing 28.5.
The next little trick to speed up the merge is to mark each substream with an end-of-stream marker that is guaranteed to be larger than all keys; once the marker reaches the merger, it stops the merger from pulling further items from that substream. We chose infinity() as our marker.
// Update loser tree after storing 'nextval' at 'slot':
int Update(int slot, Key newKey)
{
m_nodes[slot].m_key = newKey;
// should always be the same stream
assert(m_nodes[slot].m_value == slot);
int loserslot = Parent(slot);
int winner = slot;
while (loserslot != 0)
{
float loser = m_nodes[loserslot].m_key; // previous loser
if (newKey > loser)
{
// newKey is losing to old loser
// new winner's key
newKey = loser;
// new winner's slot
int newwinner = m_nodes[loserslot].m_value;
// newKey is the new loser
m_nodes[loserslot] = m_nodes[winner];
winner = newwinner;
}
loserslot = Parent(loserslot);
}
return (winner);
}
Listing 28.5. Update method for a loser tree.
The markers also let us handle uneven substream lengths when the input is not exactly divisible by the number of substreams. In the sample code on the website, we copy the input and allocate the extra space needed by the markers. With some careful coding, we only need the extra space for the last substream. In a production environment, if one can be
sure the input buffer always has extra space at the end, then we can avoid the extra buffer and the copy. For example, we can define the sorter's interface to require that slack, or take an extra argument giving the actual size of the input buffer and copy only when no room is reserved.
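To show how the pieces fit together, here is a minimal merge driver built on Listings 28.3 through 28.5. It is a sketch, not the sample code: the LoserTree::Merge() method, the heads[] cursors, and the mapping of leaf slots to substreams are our assumptions, and each substream is presumed to end with the infinity() marker described above.

#include <cstddef>

// Hypothetical driver, written as a member of a LoserTree class that owns
// m_nodes[] (leaves at slots [kStreams, 2*kStreams)) and provides
// InitWinner() from Listing 28.4 and Update() from Listing 28.5.
// 'heads[s]' is a cursor into sorted substream s and is advanced in place.
void LoserTree::Merge(const float* heads[kStreams], float* out, size_t count)
{
    // Populate the leaves with the first item of each substream.
    for (int s = 0; s < kStreams; ++s)
    {
        m_nodes[kStreams + s].m_key = *heads[s]++;
        m_nodes[kStreams + s].m_value = kStreams + s;  // leaf slot doubles as id
    }

    int winner = InitWinner(1);  // play the initial tournament from the root

    for (size_t i = 0; i < count; ++i)
    {
        out[i] = m_nodes[winner].m_key;   // smallest remaining key
        int stream = winner - kStreams;   // substream the winner came from
        // Pull the next item (possibly the infinity() end-of-stream marker)
        // from that substream and replay only the matches on winner's path.
        winner = Update(winner, *heads[stream]++);
    }
}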
Our merge phase is very cache friendly since all the memory operations are
sequential. The sorted streams are accessed sequentially, and the output is
streamed out from the merger. The merger is small since it only needs space to
store one item from each stream. The tree holds 2 * kStreams nodes and therefore fits easily into the L1 cache. One can even consider keeping the merger
entirely within registers for maximum speed. For small datasets, our sort is 2.1 to
3.5 times faster than STL sort. The relative disadvantage of quicksort is smaller
when the dataset is smaller and fits nicely into the caches.
The loser tree might not be the best approach if you are writing GPU-based
applications. An even–odd or bitonic merge network is probably a better way to
exploit the wide SIMD parallelism. That being said, the merge phase is only a
small fraction of the total processing time (~25%). The sorter is encapsulated in a
class to hide details, like the stream markers and substream sizes, and to reuse
temporary buffers for multiple sorts.
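As a rough illustration only (the names below are ours, not the sample code's), such an encapsulation might look like this:

// Hypothetical single-call interface; the actual class in the sample code
// may differ in naming and detail.
class HybridSorter
{
public:
    explicit HybridSorter(size_t maxCount);  // temporary buffers allocated once,
                                             // reused across calls to Sort()
    void Sort(const float* input, float* output, size_t count);

private:
    // substream sizes, end-of-stream markers, and scratch buffers hidden here
};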
28.4 Multicore Implementation
The sorter is refactored to expose the initialization, substream sort, and merge as
separate methods. In the sample code on the website, the initialization function is
called from the main thread before firing off the worker threads, each of which
sorts one of the substreams. Some care is taken to create the threads up front and
to use auto-reset events to reduce the overhead of thread synchronization. The class is carefully designed to leave the threading logic outside of the sorter without impacting performance. The merging is performed on the main thread.
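The following sketch shows one possible shape of that flow using std::thread; it is a simplification, since the sample code creates its worker threads up front and signals them with auto-reset events. Init(), SortSubstream(), and Merge() are our placeholder names for the refactored initialization, substream-sort, and merge methods.

#include <cstddef>
#include <thread>
#include <vector>

// Simplified multicore flow: the main thread splices the input and merges,
// while one worker thread sorts each substream.
void SortMulticore(HybridSorter& sorter, const float* input, float* output,
                   size_t count, int numStreams)
{
    sorter.Init(input, count);   // main thread: splice input into substreams

    std::vector<std::thread> workers;
    for (int s = 0; s < numStreams; ++s)
        workers.emplace_back([&sorter, s] { sorter.SortSubstream(s); });
    for (std::thread& t : workers)
        t.join();                // wait for all substream sorts to finish

    sorter.Merge(output);        // main thread: loser-tree merge
}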
The data set used for our performance tests consists of 0.89 million floating-
point values that happen to be the x coordinates for the Stanford Dragon. The
one-stream times in Table 28.1 show that our threading code is efficient, and only about 0.12 to 0.17 ms of overhead is introduced. One can see the per-core scalability is not great on the Q6600—using four threads only doubles the speed of the sort phase. This is somewhat understandable since radix sort is very bandwidth hungry. Keep in mind we are sorting substreams that have been spliced. If
we directly sort the whole input array using a single radix-256 sort, then the
runtime is 28 ms. The serial merge costs 5 to 6 ms, which gives us a 65 percent
improvement. Most of that improvement can be had with as little as two cores.
For a point of reference, STL sort runs in 76 ms, and while using four threads,
             Threaded (Q6600)   Serial (Q6600)   Threaded (I7)   Serial (I7)
1 stream          5.12               5                3.89            3.62
2 streams         6.90              10.04              4.20            7.1
3 streams         8.08              15.07              4.56           10.69
4 streams        10.97              20.55              4.86           14.2
4 + merge        16.4               26.01              9.61           19.0

Table 28.1. Threading performance, in milliseconds.
our sort runs in 16 to 17 ms. The picture is very different on an I7, which shows better scalability: four threads are about three times faster than serially sorting the four substreams. The runtime for STL sort is 58 ms on an I7, which is a six-times speed-up for the hybrid sorter.
Parallel Loser Tree
The Update() operation of the loser tree is the key step during merging. Can we find a way to parallelize the Update() step? Superficially, in Listing 28.5, the loserslot appears to be dependent on the previous iteration's result. However, if we remember our invariants for a loser tree, the matches between entries in the tree have already been played earlier. Hence, any new loser must be a new player—i.e., the one we inserted, triggering the call to Update(). We arrive at the critical insight that all tournaments along the path from a new leaf to the root are independent of each other. In addition, the path to the root for a given leaf is always the same. This means that all the addressing needed is known at compile time. This kind of parallelization can never be derived automatically by a compiler.
Marin [1997] presented the following parallel Update() step for a binary tournament tree:
for n in [0..height-1] do in parallel
    parent = Parent(n)
    if (m_nodes[parent].m_key > loser)
        m_nodes[parent] = m_nodes[winner]
28.5 Conclusion
We now have a sorter that is fast in a single-threaded application and also functions well in a multicore setting. The sort is a stable sort and has a predictable run time.