sure the input buffer always has extra space at the end, then we can avoid the
extra buffer and the copying. For example, we can define the sorter's interface
to require such padding, or we can take an extra input argument that gives the
actual size of the input buffer and copy only when no extra room has been
reserved.
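As a sketch of the latter option: the caller reports the buffer's true
capacity, and the sorter falls back to a scratch copy only when no padding was
reserved. The names SortFloats, SortWithPadding, and kPadding are illustrative
stand-ins rather than the sample code's, and std::sort substitutes for the
radix sort so that the sketch is self-contained.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr size_t kPadding = 16;  // assumed space for end-of-stream markers

// Stand-in for the padded radix sorter described in the text.
void SortWithPadding(float* buf, size_t count)
{
    std::sort(buf, buf + count);
}

// Sort 'count' floats held in a buffer of 'capacity' elements. If the
// caller reserved at least kPadding elements of slack, sort in place;
// otherwise copy to a padded scratch buffer and back.
void SortFloats(float* data, size_t count, size_t capacity)
{
    if (capacity >= count + kPadding) {
        SortWithPadding(data, count);                      // fast path
    } else {
        std::vector<float> scratch(count + kPadding);
        std::memcpy(scratch.data(), data, count * sizeof(float));
        SortWithPadding(scratch.data(), count);
        std::memcpy(data, scratch.data(), count * sizeof(float));
    }
}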
Our merge phase is very cache friendly since all the memory operations are
sequential. The sorted streams are accessed sequentially, and the output is
streamed out from the merger. The merger is small since it only needs space to
store one item from each stream. The size of the tree is 2 * kNumStreams and
therefore fits easily into the L1 cache. One can even consider keeping the
merger entirely within registers for maximum speed. For small datasets, our
sort is 2.1 to 3.5 times faster than STL sort; the relative disadvantage of
quicksort shrinks when the dataset is smaller and fits nicely into the caches.
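To make the footprint concrete, here is a minimal loser-tree merger in the
spirit of the one described above. It assumes kNumStreams is a power of two
and pads exhausted streams with a +infinity sentinel; it is an illustrative
sketch, not the implementation from the sample code.

#include <limits>
#include <utility>

constexpr int   kNumStreams = 8;   // assumed power of two
constexpr float kSentinel   = std::numeric_limits<float>::infinity();

struct Stream { const float* cur; const float* end; };

class LoserTree
{
public:
    // Play all matches bottom-up. The persistent state is only the
    // per-stream keys and loser indices, so it stays L1-resident.
    void Init(const Stream* streams)
    {
        for (int i = 0; i < kNumStreams; ++i) {
            st_[i]  = streams[i];
            key_[i] = (st_[i].cur < st_[i].end) ? *st_[i].cur : kSentinel;
        }
        int winner[2 * kNumStreams];
        for (int i = 0; i < kNumStreams; ++i)
            winner[kNumStreams + i] = i;                  // leaves
        for (int n = kNumStreams - 1; n > 0; --n) {
            int a = winner[2 * n], b = winner[2 * n + 1];
            winner[n] = (key_[a] <= key_[b]) ? a : b;     // winner moves up
            loser_[n] = (key_[a] <= key_[b]) ? b : a;     // loser stays put
        }
        loser_[0] = winner[1];                            // overall winner
    }

    // Pop the current minimum, refill from its source stream, and
    // replay only the matches on the leaf-to-root path: O(log k).
    float Pop()
    {
        int w = loser_[0];
        float v = key_[w];
        if (st_[w].cur < st_[w].end)
            ++st_[w].cur;
        key_[w] = (st_[w].cur < st_[w].end) ? *st_[w].cur : kSentinel;
        for (int t = (w + kNumStreams) / 2; t > 0; t /= 2)
            if (key_[loser_[t]] < key_[w])
                std::swap(w, loser_[t]);
        loser_[0] = w;
        return v;
    }

private:
    int    loser_[kNumStreams];   // loser_[0] holds the overall winner
    float  key_[kNumStreams];     // current head item of each stream
    Stream st_[kNumStreams];
};

Merging is then a single loop that calls Pop() once per input item and writes
the result to the output buffer, which is exactly the sequential access
pattern described above.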
The loser tree might not be the best approach if you are writing GPU-based
applications. An even-odd or bitonic merge network (sketched below) is
probably a better way to
exploit the wide SIMD parallelism. That being said, the merge phase is only a
small fraction of the total processing time (~25%). The sorter is encapsulated in a
class to hide details, like the stream markers and substream sizes, and to reuse
temporary buffers for multiple sorts.
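Returning to the merge networks mentioned above, the following scalar sketch
merges two sorted four-element runs with a bitonic network. Every
compare-exchange within a stage touches a disjoint pair, which is what maps
well onto wide SIMD or GPU lanes; this only illustrates the network's
structure and is not GPU code.

#include <algorithm>

// Compare-exchange: after the call, a <= b.
static inline void CmpSwap(float& a, float& b)
{
    if (b < a) std::swap(a, b);
}

// Merge two ascending 4-element runs, v[0..3] and v[4..7]. Reversing
// the second run makes the whole array bitonic; three stages of
// independent compare-exchanges (strides 4, 2, 1) then sort it.
void BitonicMerge8(float v[8])
{
    std::reverse(v + 4, v + 8);
    for (int stride = 4; stride > 0; stride /= 2)
        for (int i = 0; i < 8; ++i)
            if ((i & stride) == 0)
                CmpSwap(v[i], v[i | stride]);
}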
28.4 Multicore Implementation
The sorter is refactored to expose the initialization, substream sort, and merge as
separate methods. In the sample code on the website, the initialization function is
called from the main thread before firing off the worker threads, each of which
sorts one of the substreams. Some care is taken to create the threads up front and
to use auto-reset events to reduce the overhead of thread synchronization. The
class is carefully designed to leave the threading logic outside of the sorter
without impacting performance. The merging is performed on the main thread.
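A minimal sketch of how such an interface can be driven, assuming methods
named Init, SortSubstream, and Merge; unlike the sample code, it spawns
threads per sort rather than reusing persistent workers, and std::sort and
std::inplace_merge stand in for the radix sort and the loser-tree merge.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

class Sorter
{
public:
    // Split the input into numSubstreams contiguous ranges.
    void Init(float* data, size_t count, int numSubstreams)
    {
        data_ = data;
        bounds_.clear();
        for (int i = 0; i <= numSubstreams; ++i)
            bounds_.push_back(count * i / numSubstreams);
    }
    // Substreams are independent, so each is safe to sort on a worker.
    void SortSubstream(int i)
    {
        std::sort(data_ + bounds_[i], data_ + bounds_[i + 1]);
    }
    // Sequential merge of the sorted substreams.
    void Merge()
    {
        for (size_t i = 2; i < bounds_.size(); ++i)
            std::inplace_merge(data_, data_ + bounds_[i - 1],
                               data_ + bounds_[i]);
    }
private:
    float* data_ = nullptr;
    std::vector<size_t> bounds_;
};

void ParallelSort(Sorter& sorter, float* data, size_t count, int numThreads)
{
    sorter.Init(data, count, numThreads);        // on the main thread
    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i)
        workers.emplace_back([&sorter, i] { sorter.SortSubstream(i); });
    for (std::thread& t : workers)
        t.join();                                // wait for all substreams
    sorter.Merge();                              // merge on the main thread
}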
The data set used for our performance tests consists of 0.89 million floating-
point values that happen to be the x coordinates for the Stanford Dragon. The
one-stream times in Table 28.1 show that our threading code is efficient, and
only about 0.12 to 0.17 ms of overhead is introduced. One can see the per-core
scalability is not great on the Q6600: using four threads only doubles the
speed of the sort phase. This is somewhat understandable since radix sort is
very bandwidth hungry. Keep in mind we are sorting substreams that have been
spliced. If
we directly sort the whole input array using a single radix-256 sort, then the
runtime is 28 ms. The serial merge costs 5 to 6 ms, which gives us a 65 percent
improvement. Most of that improvement can be had with as little as two cores.
For a point of reference, STL sort runs in 76 ms, and while using four threads,