Chapter 5. Counting Bits

5–1 Counting 1-Bits

The IBM Stretch computer (ca. 1960) had a means of counting the number of 1-bits in a word, as well as the number of leading 0’s. It produced these two quantities as a by-product of all logical operations! The former function is sometimes called population count (e.g., on Stretch and the SPARCv9).

For machines that don’t have this instruction, a good way to count the number of 1-bits is to first set each 2-bit field equal to the sum of the two single bits that were originally in the field, and then sum adjacent 2-bit fields, putting the results in each 4-bit field, and so on. A more complete discussion of this trick is in [RND]. The method is illustrated in Figure 5–1, in which the first row shows a computer word whose 1-bits are to be summed, and the last row shows the result (23 decimal).


FIGURE 5–1. Counting 1-bits, “divide and conquer” strategy.

This is an example of the “divide and conquer” strategy, in which the original problem (summing 32 bits) is divided into two problems (summing 16 bits), which are solved separately, and the results are combined (added, in this case). The strategy is applied recursively, breaking the 16-bit fields into 8-bit fields, and so on.

In the case at hand, the ultimate small problems (summing adjacent bits) can all be done in parallel, and combining adjacent sums can also be done in parallel in a fixed number of steps at each stage. The result is an algorithm that can be executed in log2(32) = 5 steps.

Other examples of divide and conquer are the well-known techniques of binary search, a sorting method known as quicksort, and a method for reversing the bits of a word, discussed on page 129.

The method illustrated in Figure 5–1 can be committed to C code as

x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);

The first line uses (x >> 1) & 0x55555555 rather than the perhaps more natural (x & 0xAAAAAAAA) >> 1, because the code shown avoids generating two large constants in a register. This would cost an instruction if the machine lacks the and not instruction. A similar remark applies to the other lines.

Clearly, the last and is unnecessary, and other and’s can be omitted when there is no danger that a field’s sum will carry over into the adjacent field. Furthermore, there is a way to code the first line that uses one fewer instruction. This leads to the simplification shown in Figure 5–2, which executes in 21 instructions and is branch-free.


int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}


FIGURE 5–2. Counting 1-bits in a word.

The first assignment to x is based on the first two terms of the rather surprising formula

pop(x) = x − ⌊x/2⌋ − ⌊x/4⌋ − ⌊x/8⌋ − ... − ⌊x/2³¹⌋          (1)

In Equation (1), we must have x ≥ 0. By treating x as an unsigned integer, Equation (1) can be implemented with a sequence of 31 shift right immediate’s of 1, and 31 subtract’s. The procedure of Figure 5–2 uses the first two terms of this on each 2-bit field, in parallel.
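Rendered literally in C, the computation looks like this (an illustrative sketch, not the method of Figure 5–2; the function name is ours):

int pop_eq1(unsigned x) {
   unsigned sum, t;
   int i;

   sum = x;
   t = x;
   for (i = 1; i <= 31; i++) {
      t = t >> 1;               // t = x/2**i (31 shifts of 1).
      sum = sum - t;            // 31 subtracts.
   }
   return sum;
}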

There is a simple proof of Equation (1), which is shown below for the case of a four-bit word. Let the word be b₃b₂b₁b₀, where each bᵢ = 0 or 1. Then,

x      = 8b₃ + 4b₂ + 2b₁ + b₀,
⌊x/2⌋ = 4b₃ + 2b₂ + b₁,
⌊x/4⌋ = 2b₃ + b₂,
⌊x/8⌋ = b₃,

so that

x − ⌊x/2⌋ − ⌊x/4⌋ − ⌊x/8⌋ = b₃ + b₂ + b₁ + b₀.

Alternatively, Equation (1) can be derived by noting that bit i of the binary representation of a nonnegative integer x is given by

bᵢ = ⌊x/2ⁱ⌋ − 2⌊x/2ⁱ⁺¹⌋

and summing this for i = 0 to 31. Work it out—the last term is 0 because x < 2³². Equation (1) generalizes to other bases. For base ten it is

(sum of the digits of x) = x − 9(⌊x/10⌋ + ⌊x/100⌋ + ⌊x/1000⌋ + ...),

where the terms are carried out until they are 0. For example, for x = 1234 this gives 1234 − 9(123 + 12 + 1) = 1234 − 1224 = 10 = 1 + 2 + 3 + 4. This can be proved by essentially the same technique used above.

A variation of the above algorithm is to use a base 4 analogue of Equation (1) as a substitute for the second executable line of Figure 5–2:

x = x - 3*((x >> 2) & 0x33333333);

This code, however, uses the same number of instructions as the line it replaces (six), and requires a fast multiply-by-3 instruction. (It works because each 4-bit field of x holds 4a + b, where a and b are the two 2-bit counts within the field, and 4a + b − 3a = a + b.)

An algorithm in HAKMEM memo [HAK, item 169] counts the number of 1-bits in a word by using the first three terms of (1) to produce a word of 3-bit fields, each of which contains the number of 1-bits that were in it. It then adds adjacent 3-bit fields to form 6-bit field sums, and then adds the 6-bit fields by computing the value of the word modulo 63. Expressed in C, the algorithm is (the long constants are in octal)

int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 033333333333;       // Count bits in
   x = x - n;                         // each 3-bit
   n = (n >> 1) & 033333333333;       // field.
   x = x - n;
   x = (x + (x >> 3)) & 030707070707; // 6-bit sums.
   return x%63;                       // Add 6-bit sums.
}

The last line uses the unsigned modulus function. (It could be either signed or unsigned if the word length were a multiple of 3.) That the modulus function sums the 6-bit fields becomes clear by regarding the word x as an integer written in base 64. The remainder upon dividing a base b integer by b – 1 is, for b ≥ 3, congruent mod b – 1 to the sum of the digits and, of course, is less than b – 1. Because the sum of the digits in this case must be less than or equal to 32, mod(x, 63) must be equal to the sum of the digits of x, which is to say equal to the number of 1-bits in the original x.

This algorithm requires only ten instructions on the DEC PDP-10, because that machine has an instruction for computing the remainder with its second operand directly referencing a fullword in memory. On a basic RISC, it requires about 13 instructions, assuming the machine has unsigned modulus as one instruction (but not directly referencing a fullword immediate or memory operand). It is probably not very fast, because division is almost always a slow operation. Also, it doesn’t apply to 64-bit word lengths by simply extending the constants, although it does work for word lengths up to 62.

The return statement in the code above can be replaced with the following, which runs faster on most machines, but is perhaps less elegant (octal notation again).

return ((x * 0404040404) >> 26) +   // Add 6-bit sums.
        (x >> 30);

A variation on the HAKMEM algorithm is to use Equation (1) to count the number of 1’s in each 4-bit field, working on all eight 4-bit fields in parallel [Hay1]. Then, the 4-bit sums can be converted to 8-bit sums in a straightforward way, and the four bytes can be added with a multiplication by 0x01010101. This gives

int pop(unsigned x) {
   unsigned n;

   n = (x >> 1) & 0x77777777;          // Count bits in
   x = x - n;                          // each 4-bit
   n = (n >> 1) & 0x77777777;          // field.
   x = x - n;
   n = (n >> 1) & 0x77777777;
   x = x - n;
   x = (x + (x >> 4)) & 0x0F0F0F0F;    // Get byte sums.
   x = x*0x01010101;                   // Add the bytes.
   return x >> 24;
}

This is 19 instructions on the basic RISC. It works well if the machine is two-address, because the first six lines can be done with only one move register instruction. Also, the repeated use of the mask 0x77777777 permits loading it into a register and referencing it with register-to-register instructions. Furthermore, most of the shifts are of only one position.

A quite different bit-counting method, illustrated in Figure 5–3, is to turn off the rightmost 1-bit repeatedly [Weg, RND], until the result is 0. It is very fast if the number of 1-bits is small, taking 2 + 5pop(x) instructions.


int pop(unsigned x) {
   int n;

   n = 0;
   while (x != 0) {
      n = n + 1;
      x = x & (x - 1);
   }
   return n;
}


FIGURE 5–3. Counting 1-bits in a sparsely populated word.

This has a dual algorithm that is applicable if the number of 1-bits is expected to be large. The dual algorithm keeps turning on the rightmost 0-bit with x = x | (x + 1), until the result is all 1’s (–1). Then, it returns 32 – n. (Alternatively, the original number x can be complemented, or n can be initialized to 32 and counted down.)
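In C, the dual might be coded as follows (a sketch; the function name is ours):

int popDual(unsigned x) {
   int n;

   n = 0;
   while (x != 0xFFFFFFFF) {
      n = n + 1;
      x = x | (x + 1);       // Turn on rightmost 0-bit.
   }
   return 32 - n;
}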

A rather amazing algorithm is to rotate x left one position, 31 times, adding the 32 terms [MM]. The sum is the negative of pop(x)! That is,

−pop(x) = x + rotatel(x, 1) + rotatel(x, 2) + ... + rotatel(x, 31),          (2)

where the additions are done modulo the word size, and the final sum is interpreted as a two’s-complement integer. This is just a novelty; it would not be useful on most machines, because the loop is executed 31 times and thus it requires 63 instructions, plus the loop-control overhead.

To see why Equation (2) works, consider what happens to a single 1-bit of x. It gets rotated to all positions, and when these 32 numbers are added, a word of all 1-bits results. This is –1. To illustrate, consider a 6-bit word size and x = 001001 (binary):

001001
010010
100100
001001
010010
100100
——————
111110   (the sum, modulo 2⁶, is −2 = −pop(x))

Of course, rotate-right would work just as well.

The method of Equation (1) is very similar to this “rotate and sum” method, which becomes clear by rewriting (1) as

pop(x) = x − (x >> 1) − (x >> 2) − ... − (x >> 31),

where the shifts are unsigned.

This gives a slightly better algorithm than Equation (2) provides. It is better because it uses shift right, which is more commonly available than rotate, and because the loop can be terminated when the shifted quantity becomes 0. This reduces the loop-control code and may save a few iterations. The two algorithms are contrasted in Figure 5–4.


int pop(unsigned x) {
   int i, sum;

// Rotate and sum method            // Shift right & subtract

   sum = x;                         // sum = x;
   for (i = 1; i <= 31; i++) {      // while (x != 0) {
       x = rotatel(x, 1);           //    x = x >> 1;
       sum = sum + x;               //    sum = sum - x;
   }                                // }
   return -sum;                     // return sum;
}


FIGURE 5–4. Two similar bit-counting algorithms.

A less interesting algorithm that may be competitive with all the algorithms for pop(x) in this section is to have a table that contains pop(x) for, say, x in the range 0 to 255. The table can be accessed four times, adding the four numbers obtained. A branch-free version of the algorithm looks like this:

int pop(unsigned x) {               // Table lookup.
   static char table[256] = {
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
      ...
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
   return table[x         & 0xFF] +
          table[(x >>  8) & 0xFF] +
          table[(x >> 16) & 0xFF] +
          table[(x >> 24)];
}

Item 167 in [HAK] contains a short algorithm for counting the number of 1-bits in a 9-bit quantity that is right-adjusted and isolated in a register. It works only on machines with registers of 36 or more bits. Below is a version of that algorithm that works on 32-bit machines, but only for 8-bit quantities.

x = x * 0x08040201;  // Make 4 copies.
x = x >> 3;          // So next step hits proper bits.
x = x & 0x11111111;  // Every 4th bit.
x = x * 0x11111111;  // Sum the digits (each 0 or 1).
x = x >> 28;         // Position the result.

A version for 7-bit quantities is

x = x * 0x02040810;  // Make 4 copies, left-adjusted.
x = x & 0x11111111;  // Every 4th bit.
x = x * 0x11111111;  // Sum the digits (each 0 or 1).
x = x >> 28;         // Position the result.

In these, the last two steps can be replaced with steps to compute the remainder of x modulo 15.

These are not particularly good; most programmers would probably prefer to use table lookup. The latter algorithm above, however, has a version that uses 64-bit arithmetic, which might be useful for a 64-bit machine that has fast multiplication. Its argument is a 15-bit quantity. (I don’t believe there is a similar algorithm that deals with 16-bit quantities, unless it is known that not all 16 bits are 1.) The data type long long is a C extension found in many C compilers, old and new, for 64-bit integers. It is made official in the C99 standard. The suffix ULL makes unsigned long long constants.

int pop(unsigned x) {
   unsigned long long y;
   y = x * 0x0002000400080010ULL;
   y = y & 0x1111111111111111ULL;
   y = y * 0x1111111111111111ULL;
   y = y >> 60;
   return y;
}

Sum and Difference of Population Counts of Two Words

To compute pop(x) + pop(y) (if your computer does not have the population count instruction), some time can be saved by using the first two lines of Figure 5–2 on x and y separately, adding x and y, and then executing the last three stages of the algorithm on the sum. After the first two lines of Figure 5–2 are executed, x and y consist of eight 4-bit fields, each containing a maximum value of 4. Thus, x and y can safely be added, because the maximum value in any 4-bit field of the sum would be 8, so no overflow occurs. The third executable line must be changed to x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F); and the 3F in the last line must be changed to 7F.
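Following this recipe, the combined computation might be coded as shown below (a sketch; the function name popSum is ours):

int popSum(unsigned x, unsigned y) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   y = y - ((y >> 1) & 0x55555555);
   y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
   x = x + y;                    // Max 8 in each 4-bit field.
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000007F;
}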

This idea also applies to subtraction. To compute pop(x) – pop(y), use

pop(x) − pop(y) = pop(x) − (32 − pop(¬y)) = pop(x) + pop(¬y) − 32.

Then, use the technique just described to compute pop(x) + pop(¬y). The code is shown in Figure 5–5. It uses 32 instructions, versus 43 for two applications of the code in Figure 5–2 followed by a subtraction.


int popDiff(unsigned x, unsigned y) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   y = ~y;
   y = y - ((y >> 1) & 0x55555555);
   y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
   x = x + y;
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = x + (x >> 8);
   x = x + (x >> 16);
   return (x & 0x0000007F) - 32;
}


FIGURE 5–5. Computing pop(x) – pop(y).

Comparing the Population Counts of Two Words

Sometimes one wants to know which of two words has the larger population count without regard to the actual counts. Can this be determined without doing a population count of the two words? Computing the difference of two population counts as in Figure 5–5, and comparing the result to 0 is one way, but there is another way that is preferable if either the population counts are expected to be low or if there is a strong correlation between the particular bits that are set in the two words.

The idea is to clear a single bit in each word until one of the words is all zero; the other word then has the larger (or same) population count. The process runs faster in its worst and average cases if the bits that are 1 at the same positions in each word are first cleared. The code is shown in Figure 5–6. The procedure returns a negative integer if pop(x) < pop(y), 0 if pop(x) = pop(y), and a positive integer (1) if pop(x) > pop(y).


int popCmpr(unsigned xp, unsigned yp) {
   unsigned x, y;
   x = xp & ~yp;                // Clear bits where
   y = yp & ~xp;                // both are 1.
   while (1) {
      if (x == 0) return y | -y;
      if (y == 0) return 1;
      x = x & (x - 1);          // Clear one bit
      y = y & (y - 1);          // from each.
   }
}


FIGURE 5–6. Comparing pop(x) with pop(y).

After clearing the common 1-bits in each 32-bit word, the maximum possible number of 1-bits in both words together is 32. Therefore, the word with the smaller number of 1-bits can have at most 16. Thus, the loop in Figure 5–6 is executed a maximum of 16 times, which gives a worst case of 119 instructions executed on the basic RISC (16 · 7 + 7). A simulation using uniformly distributed random 32-bit integers showed that the average population count of the word with the smaller population count is approximately 6.186, after clearing the common 1-bits. This gives an average execution time of about 50 instructions executed for random 32-bit inputs, not as good as using Figure 5–5. For this procedure to beat that of Figure 5–5, the number of 1-bits in either x or y, after clearing the common 1-bits, would have to be three or less.

Counting the 1-bits in an Array

The simplest way to count the number of 1-bits in an array (vector) of fullwords, in the absence of the population count instruction, is to use a procedure such as that of Figure 5–2 on page 82 on each word of the array and simply add the results. We call this the “naive” method. Ignoring loop control, the generation of constants, and loads from the array, it takes 16 instructions per word: 15 for the code of Figure 5–2, plus one for the addition. We assume the procedure is expanded in line, the masks are loaded outside the loop, and the machine has a sufficient number of registers to hold all the quantities used in the calculation.

Another way is to use the first two executable lines of Figure 5–2 on groups of three words in the array, adding the three partial results. Because each partial result has a maximum value of 4 in each four-bit field, the sum of the three has a maximum value of 12 in each four-bit field, so no overflow occurs. This idea can be applied to the 8- and 16-bit fields. Coding and compiling this method indicates that it gives about a 20% reduction over the naive method in total number of instructions executed on the basic RISC. Much of the savings are cancelled by the additional housekeeping instructions required. We will not dwell on this method because there is a much better way to do it.

The better way seems to have been invented by Robert Harley and David Seal in about 1996 [Seal1]. It is based on a circuit called a carry-save adder (CSA), or 3:2 compressor. A CSA is simply a sequence of independent full adders [H&P], and it is often used in binary multiplier circuits.

In Boolean algebra notation, the logic for each full adder is

h ← ab + ac + bc = ab + (a + b)c = ab + (a ⊕ b)c,
l ← (a ⊕ b) ⊕ c,

where a, b, and c are the 1-bit inputs, l is the low-bit output (sum), and h is the high-bit output (carry). Changing a + b on the first line to a ⊕ b is justified because when a and b are both 1, the term ab makes the value of the whole expression 1. By first assigning a ⊕ b to a temporary, the full adder logic can be evaluated in five logical instructions, each operating on 32 bits in parallel (on a 32-bit machine). We will refer to these five instructions as CSA(h, l, a, b, c). This is a “macro,” with h and l being outputs.

One way to use the CSA operation is to process elements of the array A in groups of three, reducing each group of three words to two, and applying the population count operation to these two words. In the loop, these two population counts are summed. After executing the loop, the total population count of the array is twice the accumulated population count of the CSA’s high-bit outputs, plus the accumulated population count of the low-bit outputs.

Let nc be the number of instructions required for the CSA steps and np be the number of instructions required to do the population count of one word. On a typical RISC machine nc = 5 and np = 15. Ignoring loads from the array and loop control (the code for which may vary quite a bit from one machine to another), the loop discussed above takes (nc + 2np + 2)/3 ≈ 12.33 instructions per word of the array (the “+2” is for the two additions in the loop). This is in contrast to the 16 instructions per word required by the naive method.
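For concreteness, the groups-of-three loop might be coded as shown below, using the CSA macro defined in Figure 5–7 (a sketch; the function name and the handling of the zero to two leftover words are ours):

int popArray3(unsigned A[], int n) {
   int tot, tot2, i;
   unsigned hi, lo;

   tot = 0;                     // Accumulates pop of the low bits.
   tot2 = 0;                    // Accumulates pop of the high bits.
   for (i = 0; i <= n - 3; i = i + 3) {
      CSA(hi, lo, A[i], A[i+1], A[i+2])
      tot = tot + pop(lo);
      tot2 = tot2 + pop(hi);
   }
   tot = 2*tot2 + tot;          // The high bits count double.

   for (i = i; i < n; i++)      // Add in the last
      tot = tot + pop(A[i]);    // 0 to 2 elements.
   return tot;
}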

There is another way to use the CSA operation that results in a program that’s more efficient and slightly more compact. This is shown in Figure 5–7. It takes (nc + np + 1)/2 = 10.5 instructions per word (ignoring loop control and loads). In this code, the CSA operation expands into


#define CSA(h,l, a,b,c) \
   {unsigned u = a ^ b; unsigned v = c; \
      h = (a & b) | (u & v); l = u ^ v;}

int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos;

   tot = 0;                      // Initialize.
   ones = 0;
   for (i = 0; i <= n - 2; i = i + 2) {
      CSA(twos, ones, ones, A[i], A[i+1])
      tot = tot + pop(twos);
   }
   tot = 2*tot + pop(ones);

   if (n & 1)                    // If there's a last one,
      tot = tot + pop(A[i]);      // add it in.

   return tot;
}


FIGURE 5–7. Array population count, processing elements in groups of two.

u = ones ^ A[i];
v = A[i+1];
twos = (ones & A[i]) | (u & v);
ones = u ^ v;

The code relies on the compiler to common the loads.

There are ways to use the CSA operation to further reduce the number of instructions required to compute the population count of an array. They are most easily understood by means of a circuit diagram. For example, Figure 5–8 illustrates a way to code a loop that takes array elements eight at a time and compresses them into four quantities, labeled eights, fours, twos, and ones. The fours, twos, and ones are fed back into the CSAs on the next loop iteration, and the 1-bits in eights are counted by an execution of the word-level population count function, and this count is accumulated. When all of the array has been processed, the total population count is

8pop(eights) + 4pop(fours) + 2pop(twos) + pop(ones).


FIGURE 5–8. A circuit for the array population count.

The code is shown in Figure 5–9, which uses the CSA macro defined in Figure 5–7. The numbering of the CSA blocks in Figure 5–8 corresponds to the order of the CSA macro calls in Figure 5–9. The execution time of the loop, exclusive of array loads and loop control, is (7nc + np + 1)/8 = 6.375 instructions per word of the array.


int popArray(unsigned A[], int n) {

   int tot, i;
   unsigned ones, twos, twosA, twosB,
      fours, foursA, foursB, eights;

   tot = 0;                     // Initialize.
   fours = twos = ones = 0;

   for (i = 0; i <= n - 8; i = i + 8) {
      CSA(twosA, ones, ones, A[i], A[i+1])
      CSA(twosB, ones, ones, A[i+2], A[i+3])
      CSA(foursA, twos, twos, twosA, twosB)
      CSA(twosA, ones, ones, A[i+4], A[i+5])
      CSA(twosB, ones, ones, A[i+6], A[i+7])
      CSA(foursB, twos, twos, twosA, twosB)
      CSA(eights, fours, fours, foursA, foursB)
      tot = tot + pop(eights);
   }
   tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

   for (i = i; i < n; i++)       // Simply add in the last
      tot = tot + pop(A[i]);    // 0 to 7 elements.
   return tot;
}


FIGURE 5–9. Array population count, processing elements in groups of eight.

The CSAs can be connected in many arrangements other than that shown in Figure 5–8. For example, increased parallelism might result from feeding the first three array elements into one CSA, and the next three into a second CSA, which allows the instructions of these two CSAs to execute in parallel. One might also be able to permute the three input operands of the CSA macros for increased parallelism. With the plan shown in Figure 5–8, one can easily see how to use only the first three CSAs to construct a program that processes array elements in groups of four, and also how to expand it to construct programs that process array elements in groups of 16 or more. The plan shown also spreads out the loads somewhat, which would be advantageous for a machine that has a relatively low limit on the number of loads that can be outstanding at any one time.

The plan of Figure 5–8 can be generalized so that very few word population counts are done. To sketch how this program might be constructed, it needs an array of m×2 words to hold two of each of the variables we have called ones, twos, fours, and so forth. For an array of size n, choosing m ≥ ⌊log2(n + 1)⌋ + 1 is sufficient (m = 31 is sufficient for any size array that can be held in a machine with a 32-bit byte-addressed space). A byte array of size m is also needed to keep track of how many (0, 1, or 2) values are currently in each row of the m×2 array. The program processes array elements in groups of two. For each group, the CSA is invoked to compress those two array elements with a saved value of ones, which is most conveniently kept in the [0,0] position of the m×2 array. In an inner loop, the resulting twos is saved in the array, by scanning down (usually not far at all) to find a row with fewer than two items. If the twos row is full, its two values are combined with twos (using the CSA). The twos output is put in the array, resetting its row count to 1. The scan continues with the fours output to find a place to put it, and so forth.

After completing the pass over the input array, the program next makes a pass over the (much shorter) m×2 array, compressing all full rows, so that all rows contain only one significant value. Lastly, the program invokes the word-level population count operation on the first element of each row until a row with a zero count is encountered, computing the total array population count as

pop(row 0) + 2pop(row 1) + 4pop(row 2) + ....

The value suggested above for m ensures that the last row will have a zero count, which can be used to terminate the scans.

The resulting program executes exactly ⌊log2(n + 3)⌋ word population counts. Unfortunately it is not practical, because the housekeeping steps for loading from and storing into the intermediate result arrays outweigh the computational instructions that are saved. An experimental program (without trying too hard to optimize it) ran in about 29 instructions per array word (counting all instructions in the loop). This is significantly worse than the naive method.

Table 5–1 summarizes the number of instructions executed by this plan for various group sizes. The values in the middle two columns ignore loads and loop control. The fourth column gives the total loop instruction execution count, per word of the input array, produced by a compiler for the basic RISC machine (which does not have indexed loads).

TABLE 5–1. INSTRUCTIONS PER WORD FOR THE ARRAY POPULATION COUNT


For small arrays, there are better plans than that of Figure 5–8. For example, for an array of seven words, the plan of Figure 5–10 is quite efficient [Seal1]. It executes in 4nc + 3np + 4 = 69 instructions, or 9.86 instructions per word. Similar plans exist that apply to arrays of size 2ᵏ − 1 words for any positive integer k. The plan for 15 words executes in 11nc + 4np + 6 = 121 instructions, or 8.07 instructions per word.


FIGURE 5–10. A circuit for the total population count of seven words.

Applications

An application of the population count function is in computing the “Hamming distance” between two bit vectors, a concept from the theory of error-correcting codes. The Hamming distance is simply the number of places where the vectors differ; that is,

dist(x, y) = pop(x ⊕ y).

See, for example, the chapter on error-correcting codes in [Dewd].
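In C this is a one-liner, assuming the pop function of Figure 5–2:

int dist(unsigned x, unsigned y) {
   return pop(x ^ y);     // Number of positions where x and y differ.
}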

Another application is to allow reasonably fast direct-indexed access to a moderately sparse array A that is represented in a certain compact way. In the compact representation, only the defined, or nonzero, elements of the array are stored. There is an auxiliary bit string array bits of 32-bit words, which has a 1-bit for each index i for which A[i] is defined. As a speedup device, there is also an array of words bitsum such that bitsum[j] is the total number of 1-bits in all the words of bits that precede entry j. This is illustrated below for an array in which elements 0, 2, 32, 47, 48, and 95 are defined.

   bits        bitsum
0x00000005       0
0x00018001       2
0x80000000       5

Given an index i, 0 ≤ i ≤ 95, the corresponding index sparse_i into the data array is given by the number of 1-bits in array bits that precede the bit corresponding to i. This can be calculated as follows:

j = i >> 5;                  // j = i/32.
k = i & 31;                  // k = rem(i, 32);
mask = 1 << k;               // A "1" at position k.
if ((bits[j] & mask) == 0) goto no_such_element;
mask = mask - 1;             // 1's to right of k.
sparse_i = bitsum[j] + pop(bits[j] & mask);

The cost of this representation is two bits per element of the full array.

The population function can be used to generate binomially distributed random integers. To generate an integer drawn from a population given by BINOMIAL(t, p) where t is the number of trials and p = 1/2, generate t random bits and count the number of 1’s in the t bits. This can be generalized to probabilities p other than 1/2; see for example [Knu2, sec. 3.4.1, prob. 27].
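A sketch for p = 1/2 (rand32, a source of 32 uniformly random bits, is an assumed primitive):

int binomial(int t) {            // BINOMIAL(t, 1/2), 1 <= t <= 32.
   unsigned r = rand32();        // 32 uniformly random bits (assumed).
   return pop(r >> (32 - t));    // Count the 1's in t random bits.
}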

Still another application of the population function is in computing the number of trailing 0’s in a word (see “Counting Trailing 0’s” on page 107).

According to computer folklore, the population count function is important to the National Security Agency. No one (outside of NSA) seems to know just what they use it for, but it may be in cryptography work or in searching huge amounts of material.

5–2 Parity

The “parity” of a string refers to whether it contains an odd or an even number of 1-bits. The string has “odd parity” if it contains an odd number of 1-bits; otherwise, it has “even parity.”

Computing the Parity of a Word

Here we mean to produce a 1 if a word x has odd parity, and a 0 if it has even parity. This is the sum, modulo 2, of the bits of x—that is, the exclusive or of all the bits of x.

One way to compute this is to compute pop(x); the parity is the rightmost bit of the result. This is fine if you have the population count instruction, but if not, there are better ways than using the code for pop(x).

A rather direct method is to compute

y = x ⊕ (x >> 1) ⊕ (x >> 2) ⊕ ... ⊕ (x >> (n − 1)),

where n is the word size, and then the parity of x is given by the rightmost bit of y. (Here ⊕ denotes exclusive or, but for this formula ordinary addition could be used.)

The parity can be computed much more quickly, for moderately large n, as follows (illustrated for n = 32; the shifts can be signed or unsigned):

y = x ^ (x >> 1);
y = y ^ (y >> 2);
y = y ^ (y >> 4);
y = y ^ (y >> 8);
y = y ^ (y >> 16);                                   (3)

This executes in ten instructions, as compared to 62 for the first method, even if the implied loop is completely unrolled. Again, the parity bit is the rightmost bit of y. In fact, with either of these, if the shifts are unsigned, then bit i of y gives the parity of the bits of x at and to the left of i. Furthermore, because exclusive or is its own inverse, yᵢ ⊕ yⱼ is the parity of bits i − 1 through j of x, for i ≥ j.

This is an example of the “parallel prefix,” or “scan” operation, which has applications in parallel computing [KRS; HS]. Given a sufficient number of processors, it can convert certain seemingly serial processes from O(n) to O(log2 n) time. For example, if you have an array of words and you wish to compute the exclusive or scan operation on the entire array of bits, you can first use (3) on the entire array, and then continue with shifts of 32 bits, 64 bits, and so on, doing exclusive or’s on the words of the array. This takes more elementary (word length) exclusive or operations than a simple left-to-right process, and hence it is not a good idea for a uniprocessor. But on a parallel computer with a sufficient number of processors, it can do the job in O(log2 n) rather than O(n) time (where n is the number of words in the array).

A direct application of (3) is the conversion of a Gray coded integer to binary (see page 312).

If the code (3) is changed to use left shifts, the parity of the whole word x winds up in the leftmost bit position, and bit i of y gives the parity of the bits of x at and to the right of position i. This is called the “parallel suffix” operation, because each bit is a function of itself and the bits that follow it.

If rotate shift’s are used, the result is a word of all 1’s if the parity of x is odd, and of all 0’s if even.

The five assignments in (3) can be done in any order (provided variable x is used in the first one). If they are done in reverse order, and if you are interested only in getting the parity in the low-order bit of y, then the last two lines:

y = y ^ (y >> 2);
y = y ^ (y >> 1);

can be replaced with [Huef]

y = 0x6996 >> (y & 0xF);

This is an “in-register table lookup” operation. On the basic RISC it saves one instruction, or two if the load of the constant is not counted. The low-order bit of y has the original word’s parity, but the other bits of y do not contain anything useful.
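Putting the pieces together, a complete parity function along these lines might read as follows (a sketch; the final and with 1 discards the useless high bits):

int parity(unsigned x) {
   unsigned y;

   y = x ^ (x >> 16);
   y = y ^ (y >> 8);
   y = y ^ (y >> 4);
   return (0x6996 >> (y & 0xF)) & 1;   // In-register table lookup.
}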

The following method executes in nine instructions and computes the parity of x as the integer 0 or 1 (the shifts are unsigned).

x = x ^ (x >> 1);
x = (x ^ (x >> 2)) & 0x11111111;
x = x*0x11111111;
p = (x >> 28) & 1;

After the second statement above, each hex digit of x is 0 or 1, according to the parity of the bits in that hex digit. The multiply adds these digits, putting the sum in the high-order hex digit. There can be no carry out of any hex column during the add part of the multiply, because the maximum sum of a column is 8.

The multiply and shift could be replaced by an instruction to compute the remainder after dividing x by 15, giving a (slow) solution in eight instructions, if the machine has remainder immediate.

On a 64-bit machine, the above code employing multiplication gives the correct result after making the obvious changes (expand the hex constants to 16 nibbles, each with value 1, and change the final shift amount from 28 to 60). In this case, the maximum sum in any 4-bit column of the partial products, other than the most significant column, is 15, so again no overflow occurs that affects the result in the most significant column. On the other hand, the variation that computes the remainder upon division by 15 does not work on a 64-bit machine, because the remainder is the sum of the nibbles modulo 15, and the sum may be as high as 16.
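Concretely, the 64-bit version reads as shown below (a direct transcription of the changes just described; the function name is ours):

int parity64(unsigned long long x) {
   x = x ^ (x >> 1);
   x = (x ^ (x >> 2)) & 0x1111111111111111ULL;
   x = x*0x1111111111111111ULL;
   return (x >> 60) & 1;
}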

Adding a Parity Bit to a 7-Bit Quantity

Item 167 in [HAK] contains a novel expression for putting even parity on a 7-bit quantity that is right-adjusted and isolated in a register. By this we mean to set the bit to the left of the seven bits, to make an 8-bit quantity with even parity. Their code is for a 36-bit machine, but it works on a 32-bit machine as well.

modu((x * 0x10204081) & 0x888888FF, 1920)

Here, modu(a, b) denotes the remainder of a upon division by b, with the arguments and result interpreted as unsigned integers, “*” denotes multiplication modulo 2³², and the constant 1920 is 15 · 2⁷. Actually, this computes the sum of the bits of x, and places the sum just to the left of the seven bits comprising x. For example, the expression maps 0x0000007F to 0x000003FF, and 0x00000055 to 0x00000255.
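In C the expression transcribes directly (the function name is ours; % on unsigned operands computes modu):

unsigned evenParity7(unsigned x) {     // x is a 7-bit quantity.
   return ((x * 0x10204081) & 0x888888FF) % 1920;
}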

Another ingenious formula from [HAK] is the following, which puts odd parity on a 7-bit integer:

modu((x * 0x00204081) | 0x3DB6DB00, 1152),

where 1152 = 9 · 2⁷. To understand this, it helps to know that the powers of 8 are ±1 modulo 9. If the 0x3DB6DB00 is changed to 0xBDB6DB00, this formula applies even parity.

These methods are not practical on today’s machines, because memory is cheap but division is still slow. Most programmers would compute these functions with a simple table lookup.

Applications

The parity operation is widely used to calculate a check bit to append to data. It is also useful in multiplying bit matrices in GF(2) (in which the add operation is exclusive or).

5–3 Counting Leading 0’s

There are several simple ways to count leading 0’s with a binary search technique. Below is a model that has several variations. It executes in 20 to 29 instructions on the basic RISC. The comparisons are “logical” (unsigned integers).

if (x == 0) return(32);
n = 0;
if (x <= 0x0000FFFF) {n = n +16; x = x <<16;}
if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;}
if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;}
if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;}
if (x <= 0x7FFFFFFF) {n = n + 1;}
return n;

One variation is to replace the comparisons with and’s:

if ((x & 0xFFFF0000) == 0) {n = n +16; x = x <<16;}
if ((x & 0xFF000000) == 0) {n = n + 8; x = x << 8;}
...

Another variation, which avoids large immediate values, is to use shift right instructions.

The last if statement is simply adding 1 to n if the high-order bit of x is 0, so an alternative, which saves a branch instruction, is:

n = n + 1 - (x >> 31);

The “+ 1” in this assignment can be omitted if n is initialized to 1 rather than to 0. These observations lead to the algorithm (12 to 20 instructions on the basic RISC) shown in Figure 5–11. A further improvement is possible for the case in which x begins with a 1-bit: change the first line to

if ((int)x <= 0) return (~x >> 26) & 32;


int nlz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x >> 16) == 0) {n = n +16; x = x <<16;}
   if ((x >> 24) == 0) {n = n + 8; x = x << 8;}
   if ((x >> 28) == 0) {n = n + 4; x = x << 4;}
   if ((x >> 30) == 0) {n = n + 2; x = x << 2;}
   n = n - (x >> 31);
   return n;
}


FIGURE 5–11. Number of leading zeros, binary search.

Figure 5–12 illustrates a sort of reversal of the above. It requires fewer operations the more leading 0’s there are, and avoids large immediate values and large shift amounts. It executes in 12 to 20 instructions on the basic RISC.


int nlz(unsigned x)  {
   unsigned y;
   int n;

   n = 32;
   y = x >>16; if (y != 0) {n = n -16; x = y;}
   y = x >> 8; if (y != 0) {n = n - 8; x = y;}
   y = x >> 4; if (y != 0) {n = n - 4; x = y;}
   y = x >> 2; if (y != 0) {n = n - 2; x = y;}
   y = x >> 1; if (y != 0) return n - 2;
   return n - x;
}


FIGURE 5–12. Number of leading zeros, binary search, counting down.

This algorithm is amenable to a “table assist”: the last four executable lines can be replaced by

static char table[256] = {0,1,2,2,3,3,3,3,4,4,...,8};
return n - table[x];

Many algorithms can be aided by table lookup, but this will not often be mentioned here.

For compactness, this and the preceding algorithms in this section can be coded as loops. For example, the algorithm of Figure 5–12 becomes the algorithm shown in Figure 5–13. This executes in 23 to 33 basic RISC instructions, ten of which are conditional branches.


int nlz(unsigned x) {
   unsigned y;
   int n, c;

   n = 32;
   c = 16;
   do {
      y = x >> c; if (y != 0) {n = n - c; x = y;}
      c = c >> 1;
   } while (c != 0);
   return n - x;
}


FIGURE 5–13. Number of leading zeros, binary search, coded as a loop.

One can, of course, simply shift left one place at a time, counting, until the sign bit is on; or shift right one place at a time until the word is all 0. These algorithms are compact and work well if the number of leading 0’s is expected to be small or large, respectively. One can combine the methods, as shown in Figure 5–14. We mention this because the technique of merging two algorithms and choosing the result of whichever one stops first is more generally applicable. It leads to code that runs fast on superscalar machines, because of the proximity of independent instructions. (These machines can execute two or more instructions simultaneously, provided they are independent.)


int nlz(int x) {
   int y, n;

   n = 0;
   y = x;
L: if (x < 0) return n;
   if (y == 0) return 32 - n;
   n = n + 1;
   x = x << 1;
   y = y >> 1;
   goto L;
}


FIGURE 5–14. Number of leading zeros, working both ends at the same time.

On the basic RISC, this executes in min(3 + 6nlz(x), 6 + 6(32 – nlz(x))) instructions, or 99 worst case. One can imagine a superscalar machine executing the entire loop body in one cycle if the comparison results are obtained as a by-product of the shifts, or in two cycles otherwise, plus the branch overhead.

It is straightforward to convert either of the algorithms of Figure 5–11 or Figure 5–12 to a branch-free counterpart. Figure 5–15 shows a version that does the job in 28 basic RISC instructions.


int nlz(unsigned x) {
   int y, m, n;

   y = -(x >> 16);      // If left half of x is 0,
   m = (y >> 16) & 16;  // set n = 16. If left half
   n = 16 - m;          // is nonzero, set n = 0 and
   x = x >> m;          // shift x right 16.
                        // Now x is of the form 0000xxxx.
   y = x - 0x100;       // If positions 8–15 are 0,
   m = (y >> 16) & 8;   // add 8 to n and shift x left 8.
   n = n + m;
   x = x << m;

   y = x - 0x1000;      // If positions 12–15 are 0,
   m = (y >> 16) & 4;   // add 4 to n and shift x left 4.
   n = n + m;
   x = x << m;

   y = x - 0x4000;      // If positions 14–15 are 0,
   m = (y >> 16) & 2;   // add 2 to n and shift x left 2.
   n = n + m;
   x = x << m;

   y = x >> 14;         // Set y = 0, 1, 2, or 3.
   m = y & ~(y >> 1);   // Set m = 0, 1, 2, or 2 resp.
   return n + 2 - m;
}


FIGURE 5–15. Number of leading zeros, branch-free binary search.

If your machine has the population count instruction, a good way to compute the number of leading zeros function is given in Figure 5–16. The five assignments to x can be reversed, or, in fact, done in any order. This is branch-free and takes 11 instructions. Even if population count is not available, this algorithm may be useful. Using the 21-instruction code for counting 1-bits given in Figure 5–2 on page 82, it executes in 32 branch-free basic RISC instructions.


int nlz(unsigned x) {
   int pop(unsigned x);

   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return pop(~x);
}


FIGURE 5–16. Number of leading zeros, right-propagate and count 1-bits.

Robert Harley [Harley] devised an algorithm for nlz(x) that is very similar to Seal’s algorithm for ntz(x) (see Figure 5–25 on page 111). Harley’s method propagates the most significant 1-bit to the right using shift’s and or’s, and multiplies modulo 2³² by a special constant, producing a product whose high-order six bits uniquely identify the number of leading 0’s in x. It then does a shift right and a table lookup (indexed load) to translate the six-bit identifier to the actual number of leading 0’s. As shown in Figure 5–17, it consists of 14 instructions, including a multiply, plus an indexed load. Table entries shown as u are unused.


int nlz(unsigned x) {

   static char table[64] =
     {32,31, u,16, u,30, 3, u,  15, u, u, u,29,10, 2, u,
       u, u,12,14,21, u,19, u,   u,28, u,25, u, 9, 1, u,
      17, u, 4, u, u, u,11, u,  13,22,20, u,26, u, u,18,
       5, u, u,23, u,27, u, 6,   u,24, 7, u, 8, u, 0, u};

   x = x | (x >> 1);     // Propagate leftmost
   x = x | (x >> 2);     // 1-bit to the right.
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   x = x*0x06EB14F9;     // Multiplier is 7*255**3.
   return table[x >> 26];
}


FIGURE 5–17. Number of leading zeros, Harley’s algorithm.

The multiplier is 7 · 255³, so the multiplication can be done as shown below. In this form, the function consists of 19 elementary instructions, plus an indexed load.

x = (x << 3) - x;     // Multiply by 7.
x = (x << 8) - x;     // Multiply by 255.
x = (x << 8) - x;     // Again.
x = (x << 8) - x;     // Again.

There are many multipliers that have the desired uniqueness property and whose factors are all of the form 2ᵏ ± 1. The smallest is 0x045BCED1 = 17 · 65 · 129 · 513. There are no such multipliers consisting of three factors if the table size is 64 or 128 entries. If the table size is 256 entries, however, there are a number of such multipliers. The smallest is 0x01033CBF = 65 · 255 · 1025 (using this would save two instructions at the expense of a larger table).

Julius Goryavsky [Gor] has found several variations of Harley’s algorithm that reduce the table size at the expense of a few instructions, or have improved parallelism, or have other desirable properties. One, shown in Figure 5–18, is a clear winner if the multiplication is done with shifts and adds. The code changes only the table and the lines that contain the shift right of 16 and the following multiply in Figure 5–17. If the machine has and not, this saves two instructions because the multiplier can be factored as 511 · 2047 · 16383 (mod 2³²), which can be done in six elementary instructions rather than eight. If the machine does not have and not, it saves one instruction.


...
static char table[64] =
  {32,20,19, u, u,18, u, 7,  10,17, u, u,14, u, 6, u,
    u, 9, u,16, u, u, 1,26,   u,13, u, u,24, 5, u, u,
    u,21, u, 8,11, u,15, u,   u, u, u, 2,27, 0,25, u,
   22, u,12, u, u, 3,28, u,  23, u, 4,29, u, u,30,31};
...
x = x & ~(x >> 16);
x = x*0xFD7049FF;
...


FIGURE 5–18. Number of leading zeros, Goryavsky’s variation of Harley’s algorithm.

Floating-Point Methods

The floating-point post-normalization facilities can be used to count leading zeros. It works out quite well with IEEE-format floating-point numbers. The idea is to convert the given unsigned integer to double-precision floating-point, extract the exponent, and subtract it from a constant. Figure 5–19 illustrates a complete procedure for this.


int nlz(unsigned k) {
   union {
      unsigned asInt[2];
      double asDouble;
   };
   int n;

   asDouble = (double)k + 0.5;
   n = 1054 - (asInt[LE] >> 20);
   return n;
}


FIGURE 5–19. Number of leading zeros, using IEEE floating-point.

The code uses the C++ “anonymous union” to overlay an integer with a double-precision floating-point quantity. Variable LE must be 1 for execution on a little-endian machine, and 0 for big-endian. The addition of 0.5, or some other small number, is necessary for the method to work when k = 0.

We will not attempt to assess the execution time of this code, because machines differ so much in their floating-point capabilities. For example, many machines have their floating-point registers separate from the integer registers, and on such machines data transfers through memory may be required to convert an integer to floating-point and then move the result to an integer register.

The code of Figure 5–19 is not valid C or C++ according to the ANSI standard, because it refers to the same memory locations as two different types. Thus, one cannot be sure it will work on a particular machine and compiler. It does work with IBM’s XLC compiler on AIX, and with the GCC compiler on AIX and on Windows 2000 and XP, at all optimization levels (as of this writing, anyway). If the code is altered to do the overlay defining with something like

xx = (double)k + 0.5;
n = 1054 - (*((unsigned *)&xx + LE) >> 20);

it does not work on these systems with optimization turned on. This code, incidentally, violates a second ANSI standard, namely, that pointer arithmetic can be performed only on pointers to array elements [Cohen]. The failure, however, is due to the first violation, involving overlay defining.

In spite of the flakiness of this code, three variations are given below.

asDouble = (double)k;
n = 1054 - (asInt[LE] >> 20);
n = (n & 31) + (n >> 9);

k = k & ~(k >> 1);
asFloat = (float)k + 0.5f;
n = 158 - (asInt >> 23);

k = k & ~(k >> 1);
asFloat = (float)k;
n = 158 - (asInt >> 23);
n = (n & 31) + (n >> 6);

In the first variation, the problem with k = 0 is fixed not by a floating-point addition of 0.5, but by integer arithmetic on the result n (which would be 1054, or 0x41E, if the correction were not done).

The next two variations use single-precision floating-point, with the “anonymous union” changed in an obvious way. Here there is a new problem: Rounding can throw off the result when the rounding mode is either round to nearest (almost universally used) or round toward +∞. For round to nearest mode, the rounding problem occurs for k in the ranges hexadecimal FFFFFF80 to FFFFFFFF, 7FFFFFC0 to 7FFFFFFF, 3FFFFFE0 to 3FFFFFFF, and so on. In rounding, an add of 1 carries all the way to the left, changing the position of the most significant 1-bit. The correction steps used above clear the bit to the right of the most significant 1-bit, blocking the carry. If k is a 64-bit quantity, this correction is also needed for the code of Figure 5–19 and for the first of the three variations given above.

The GNU C/C++ compiler has a unique feature that allows coding any of these schemes as a macro, giving in-line code for the function references [Stall]. This feature allows statements, including declarations, to be inserted in code where an expression is called for. The sequence of statements would usually end with an expression, which is taken to be the value of the construction. Such a macro definition is shown below, for the first single-precision variation. (In C, it is customary to use uppercase for macro names.)

#define NLZ(kp) \
   ({union {unsigned _asInt; float _asFloat;}; \
     unsigned _k = (kp), _kk = _k & ~(_k >> 1); \
     _asFloat = (float)_kk + 0.5f; \
     158 - (_asInt >> 23);})

The underscores are used to avoid name conflicts with parameter kp; presumably, user-defined names do not begin with underscores.

Comparing the Number of Leading Zeros of Two Words

There is a simple way to determine which of two words x and y has the larger number of leading zeros [Knu5] without actually computing nlz(x) or nlz(y). The methods are shown in the equivalences below. The three relations not shown are, of course, obtained by complementing the sense of the comparison on the right.

nlz(x) = nlz(y) if and only if (x ⊕ y) ≤ (x & y)
nlz(x) < nlz(y) if and only if (x & ¬y) > y
nlz(x) ≤ nlz(y) if and only if (y & ¬x) ≤ x

(The comparisons on the right are unsigned.)
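Coded as C predicates, the relations become (the function names are ours; all quantities unsigned):

int nlzEq(unsigned x, unsigned y)     {return (x ^ y) <= (x & y);}
int nlzLess(unsigned x, unsigned y)   {return (x & ~y) > y;}
int nlzLessEq(unsigned x, unsigned y) {return (y & ~x) <= x;}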

Relation to the Log Function

The “nlz” function is, essentially, the “integer log base 2” function. For unsigned x ≠ 0,

⌊log2(x)⌋ = 31 − nlz(x).

See also Section 11–4, “Integer Logarithm,” on page 291.

Another closely related function is bitsize, the number of bits required to represent its argument as a signed quantity in two’s-complement form. We take its definition to be

bitsize(x) = the smallest n such that −2ⁿ⁻¹ ≤ x ≤ 2ⁿ⁻¹ − 1.

From this definition, bitsize(x) = bitsize(−x − 1). But −x − 1 = ¬x, so an algorithm for bitsize is (where the shift is signed)

x = x ^ (x >> 31);        // If (x < 0) x = -x - 1;
return 33 - nlz(x);

An alternative, which is the same function as bitsize(x) except it gives the result 0 for x = 0, is

32 - nlz(x ^ (x << 1))

Applications

Two important applications of the number of leading zeros function are in simulating floating-point arithmetic operations and in various division algorithms (see Figure 9–1 on page 185 and Figure 9–3 on page 196). The instruction seems to have a miscellany of other uses.

It can be used to get the “x = y” predicate in only three instructions (see “Comparison Predicates” on page 23), and as an aid in computing certain elementary functions (see pages 281, 284, 290, and 294).

A novel application is to generate exponentially distributed random integers by generating uniformly distributed random integers and taking nlz of the result [GLS1]. The result is 0 with probability 1/2, 1 with probability 1/4, 2 with probability 1/8, and so on. Another application is as an aid in searching a word for a consecutive string of 1-bits (or 0-bits) of a certain length, a process that is used in some disk block allocation algorithms. For these last two applications, the number of trailing zeros function could also be used.

5–4 Counting Trailing 0’s

If the number of leading zeros instruction is available, then the best way to count trailing 0’s is, most likely, to convert it to a count leading 0’s problem:

32 − nlz(¬x & (x − 1)).

If population count is available, a slightly better method is to form a mask that identifies the trailing 0’s, and count the 1-bits in it [Hay2], such as

ntz(x) = pop(¬x & (x − 1)) = 32 − pop(x | −x).

Variations exist using other expressions for forming a mask that identifies the trailing zeros of x, such as those given in Section 2–1, “Manipulating Rightmost Bits,” on page 11. These methods are also reasonable even if the machine has none of the bit-counting instructions. Using the algorithm for pop(x) given in Figure 5–2 on page 82, the first expression above executes in about 3 + 21 = 24 instructions (branch-free).
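For example, the first expression transcribes directly as (assuming the pop function of Figure 5–2):

int ntz(unsigned x) {
   return pop(~x & (x - 1));
}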

Figure 5–20 shows an algorithm that does it directly, in 12 to 20 basic RISC instructions (for x ≠ 0).


int ntz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x & 0x0000FFFF) == 0) {n = n + 16; x = x >>16;}
   if ((x & 0x000000FF) == 0) {n = n +  8; x = x >> 8;}
   if ((x & 0x0000000F) == 0) {n = n +  4; x = x >> 4;}
   if ((x & 0x00000003) == 0) {n = n +  2; x = x >> 2;}
   return n - (x & 1);
}


FIGURE 5–20. Number of trailing zeros, binary search.

The n + 16 can be simplified to 17 if that helps, and if the compiler is not smart enough to do that for you (this does not affect the number of instructions as we are counting them).

Figure 5–21 shows a variation that uses smaller immediate values and simpler operations. It executes in 12 to 21 basic RISC instructions. Unlike the above procedure, when the number of trailing 0’s is small, the procedure of Figure 5–21 executes a larger number of instructions, but also a larger number of “fall-through” branches.


int ntz(unsigned x) {
   unsigned y;
   int n;

   if (x == 0) return 32;
   n = 31;
   y = x <<16;  if (y != 0) {n = n -16; x = y;}
   y = x << 8;  if (y != 0) {n = n - 8; x = y;}
   y = x << 4;  if (y != 0) {n = n - 4; x = y;}
   y = x << 2;  if (y != 0) {n = n - 2; x = y;}
   y = x << 1;  if (y != 0) {n = n - 1;}
   return n;
}


FIGURE 5–21. Number of trailing zeros, smaller immediate values.

The line just above the return statement can alternatively be coded

n = n - ((x << 1) >> 31);

which saves a branch, but not an instruction.

In terms of number of instructions executed, it is hard to beat the “search tree” [Aus2]. Figure 5–22 illustrates this procedure for an 8-bit argument. This procedure executes in seven instructions for all paths except the last two (return 7 or 8), which require nine. A 32-bit version would execute in 11 to 13 instructions. Unfortunately, for large word sizes, the program is quite large. The 8-bit version above is 12 lines of executable source code and would compile into about 41 instructions. A 32-bit version would be 48 lines and about 164 instructions, and a 64-bit version would be twice that.


int ntz(char x) {
   if (x & 15) {
      if (x & 3) {
         if (x & 1) return 0;
         else return 1;
      }
      else if (x & 4) return 2;
      else return 3;
   }
   else if (x & 0x30) {
      if (x & 0x10) return 4;
      else return 5;
   }
   else if (x & 0x40) return 6;
   else if (x) return 7;
   else return 8;
}


FIGURE 5–22. Number of trailing zeros, binary search tree.

If the number of trailing 0’s is expected to be small or large, then the simple loops shown in Figure 5–23 are quite fast. The algorithm on the left executes in 5 + 3ntz(x), and that on the right in 3 + 3(32 – ntz(x)) basic RISC instructions.


int ntz(unsigned x) {
   int n;

   x = ~x & (x - 1);
   n = 0;                        // n = 32;
   while (x != 0) {              // while (x != 0) {
      n = n + 1;                 //    n = n - 1;
      x = x >> 1;                //    x = x + x;
   }                             // }
   return n;                     // return n;
}


FIGURE 5–23. Number of trailing zeros, simple counting loops.

Dean Gaudet [Gaud] devised an algorithm that is interesting because with the right instructions it is branch-free, load-free (does not use table lookup), and has lots of parallelism. It is shown in Figure 5–24.


int ntz(unsigned x) {
   unsigned y, bz, b4, b3, b2, b1, b0;

   y = x & -x;               // Isolate rightmost 1-bit.
   bz = y ? 0 : 1;           // 1 if y = 0.
   b4 = (y & 0x0000FFFF) ? 0 : 16;
   b3 = (y & 0x00FF00FF) ? 0 :  8;
   b2 = (y & 0x0F0F0F0F) ? 0 :  4;
   b1 = (y & 0x33333333) ? 0 :  2;
   b0 = (y & 0x55555555) ? 0 :  1;
   return bz + b4 + b3 + b2 + b1 + b0;
}


FIGURE 5–24. Number of trailing zeros, Gaudet’s algorithm.

As shown, the code uses the C “conditional expression” in six places. This construct has the form a?b:c. Its value is b if a is true (nonzero), and c if a is false (zero). Although a conditional expression must, in general, be compiled into compares and branches, for the simple cases in Figure 5–24 branching can be avoided if the machine has a compare for equality to zero instruction that sets a target register to 1 if the operand is 0, and to 0 if the operand is nonzero. Branching can also be avoided by using conditional move instructions. Using compare, the assignment to b3 can be compiled into five instructions on the basic RISC: two to generate the hex constant, an and, the compare, and a shift left of 3. (The first, second, and last conditional expressions require one, three, and four instructions, respectively.)

The code can be compiled into a total of 30 instructions. All six lines with the conditional expressions can run in parallel. On a machine with a sufficient degree of parallelism, it executes in ten cycles. Present machines don’t have that much parallelism, so as a practical matter it might help to change the first two uses of y in the program to x. This permits the first three executable statements to run in parallel.

David Seal [Seal2] devised an algorithm for computing ntz(x) that is based on the idea of compressing the 2³² possible values of x to a small dense set of integers and doing a table lookup. He uses the expression x & −x to reduce the number of possible values to a small number. The value of this expression is a word that contains a single 1-bit at the position of the least significant 1-bit in x, or is 0 if x = 0. Thus, x & −x has only 33 possible values. But they are not dense; they range from 0 to 2³¹.

To produce a dense set of 33 integers that uniquely identify the 33 values of x & −x, Seal found a certain constant which, when multiplied by x & −x, produces the identifying value in the high-order six bits of the low-order half of the product. Since x & −x is an integral power of 2 or is 0, the multiplication amounts to a left shift of the constant, or it is a multiplication by 0. Using only the high-order five bits is not sufficient, because 33 distinct values are needed.

The code is shown in Figure 5–25, where table entries shown as u are unused.


int ntz(unsigned x) {

   static char table[64] =
     {32, 0, 1,12, 2, 6, u,13,   3, u, 7, u, u, u, u,14,
      10, 4, u, u, 8, u, u,25,   u, u, u, u, u,21,27,15,
      31,11, 5, u, u, u, u, u,   9, u, u,24, u, u,20,26,
      30, u, u, u, u,23, u,19,  29, u,22,18,28,17,16, u};

   x = (x & -x)*0x0450FBAF;
   return table[x >> 26];
}


FIGURE 5–25. Number of trailing zeros, Seal’s algorithm.

As an example, if x is an odd multiple of 16, then x & -x = 16, so the multiplication is simply a left shift of four positions. The high-order six bits of the low-order half of the product are then binary 010001, or 17 decimal. The table translates 17 to 4, which is the correct number of trailing 0’s for an odd multiple of 16.

There are thousands of constants that have the necessary uniqueness property. The smallest is 0x0431472F, and the largest is 0xFDE75C6D. Seal chose a constant for which the multiplication can be done with a small number of shifts and adds. Since 0x0450FBAF = 17 · 65 · 65535, the multiplication can be done as follows:

x = (x << 4) + x;        // x = x*17.
x = (x << 6) + x;        // x = x*65.
x = (x << 16) - x;       // x = x*65535.

With this substitution, the code of Figure 5–25 consists of nine elementary instructions, plus an indexed load. Seal was interested in the ARM instruction set, which can do a shift and add in one instruction. On that architecture, the code is six instructions, including the indexed load.
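
For reference, here is a sketch of Figure 5–25 with the multiplication expanded into those shifts and adds. The unused table entries (shown as u in the figure) are arbitrarily set to 0 here so that the code compiles.

int ntz(unsigned x) {

   static char table[64] =
     {32, 0, 1,12, 2, 6, 0,13,   3, 0, 7, 0, 0, 0, 0,14,
      10, 4, 0, 0, 8, 0, 0,25,   0, 0, 0, 0, 0,21,27,15,
      31,11, 5, 0, 0, 0, 0, 0,   9, 0, 0,24, 0, 0,20,26,
      30, 0, 0, 0, 0,23, 0,19,  29, 0,22,18,28,17,16, 0};

   x = x & -x;              // Isolate rightmost 1-bit.
   x = (x << 4) + x;        // x = x*17.
   x = (x << 6) + x;        // x = x*65.
   x = (x << 16) - x;       // x = x*65535.
   return table[x >> 26];
}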

To make the multiplication even easier to do with shifts and adds, one might hope to find a constant of the form (2^k1 ± 1)(2^k2 ± 1) that has the necessary uniqueness property. For a table size of 64, there are no such integers, and there is only one other suitable integer that is a product of three such factors: 0x08A1FBAF = 17 · 65 · 131071. Using a table size of 128 or 256 does not help. However, for a table size of 512 there are four suitable integers of the form (2^k1 ± 1)(2^k2 ± 1); the smallest is 0x0080FF7F = 129 · 65535. We leave it to the reader to determine the table associated with this constant.

There is a variation of Seal’s method that is based on de Bruijn cycles [LPR]. These are cyclic sequences over a given alphabet that contain as a subsequence every sequence of the letters of the alphabet of a given length exactly once. For example, a cycle that contains as a subsequence every sequence of {a, b, c} of length 2 is aabacbbcc. Notice that the sequence ca wraps around from the end to the beginning. If the alphabet size is k and the length is n, there are k^n sequences. For a cycle to contain all of these, it must be of length at least k^n, which would be its length if a different sequence started at each position. It can be shown that there is always a cycle of this minimum possible length that contains all k^n sequences.

For our purposes, the alphabet is {0, 1}, and for dealing with 32-bit words, we are interested in a cycle that contains all 32 sequences 00000, 00001, 00010, ..., 11111. Given such a cycle that begins with at least four 0’s, we can compute ntz(x) by first reducing x to a word that contains a single bit at the position of the least significant bit of x, as in Seal’s algorithm. Then, by multiplication, we can select a 5-bit field of the de Bruijn cycle, which will be a unique value for each multiplier. This can be mapped to give the number of trailing 0’s by a table lookup. The algorithm follows. The de Bruijn cycle used is

0000 0100 1101 0111 0110 0101 0001 1111.

It is in effect a cycle because, in use, the multiplication shifts it left and brings in 0’s on the right beyond the 32 bits shown above, which is effectively the same as wrapping around to the 0’s at the beginning.

There are 33 possible values of ntz(x) and only 32 five-bit subsequences in the de Bruijn cycle. Therefore, two words with different values of ntz(x) must map to the same number by the table lookup. The words that conflict are zero and words that end with a 1-bit. To resolve this, the code has a test for 0 and returns 32 in that case. A branch-free way to resolve it, useful if your computer has predicate comparison instructions, is to change the last statement to

return table[x >> 27] + 32*(x == 0);

To compare the two algorithms, Seal’s does not require the test for zero and it allows the alternative of doing the multiplication with six elementary instructions. The de Bruijn algorithm uses a smaller table. The de Bruijn cycle used in Figure 5–26, discovered by Danny Dubé [Dubé], is a good one because multiplication by it can be done with eight elementary instructions. The constant is 0x04D7651F = (2047 · 5 · 256 + 1) · 31, from which one can see the shifts, adds, and subtracts that do the job.


int ntz(unsigned x) {

   static char table[32] =
     { 0, 1, 2,24, 3,19, 6,25,   22, 4,20,10,16, 7,12,26,
      31,23,18, 5,21, 9,15,11,   30,17, 8,14,29,13,28,27};

   if (x == 0) return 32;
   x = (x & -x)*0x04D7651F;
   return table[x >> 27];
}


FIGURE 5–26. Number of trailing zeros using a de Bruijn cycle.
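
From that factorization, the multiplication in Figure 5–26 can be expanded into eight elementary instructions, as claimed. A sketch, in which x holds the isolated bit x & −x and y is a temporary (Exercise 1 asks for the complete function):

y = (x << 11) - x;       // y = x*2047.
y = (y << 2) + y;        // y = x*10235.
y = (y << 8) + x;        // y = x*2620161.
y = (y << 5) - y;        // y = x*0x04D7651F.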

John Reiser [Reiser] observed that there is another way to map the 33 values of the factor x & -x in Seal’s algorithm to a dense set of unique integers: divide and use the remainder. The smallest divisor that has the necessary uniqueness property is 37. The resulting code is shown in Figure 5–27, where table entries shown as u are unused.


int ntz(unsigned x) {

   static char table[37] = {32, 0, 1, 26, 2, 23, 27,
                 u, 3, 16, 24, 30, 28, 11, u, 13, 4,
                 7, 17, u, 25, 22, 31, 15, 29, 10, 12,
                 6, u, 21, 14, 9, 5, 20, 8, 19, 18};

   x = (x & -x)%37;
   return table[x];
}


FIGURE 5–27. Number of trailing zeros, Reiser’s algorithm.

It is interesting to note that if the numbers x are uniformly distributed, then the average number of trailing 0’s is, very nearly, 1.0. To see this, sum the products p_n·n, where p_n is the probability that x has exactly n trailing 0’s. Because a uniformly distributed x ends in exactly n 0’s with probability very nearly 1/2^(n+1), the sum is

S = 1·(1/4) + 2·(1/8) + 3·(1/16) + 4·(1/32) + ...

To evaluate this sum, consider the following array:

1/4   1/8   1/16   1/32   ...
      1/8   1/16   1/32   ...
            1/16   1/32   ...
                   1/32   ...

The sum of each column is a term of the series for S. Hence S is the sum of all the numbers in the array. The sums of the rows are

1/4 + 1/8 + 1/16 + 1/32 + ... = 1/2
1/8 + 1/16 + 1/32 + 1/64 + ... = 1/4
1/16 + 1/32 + 1/64 + 1/128 + ... = 1/8
...

and the sum of these is 1/2 + 1/4 + 1/8 + ... = 1. The absolute convergence of the original series justifies the rearrangement.

Sometimes, a function similar to ntz(x) is wanted, but a 0 argument is a special case, perhaps an error, that should be identified with a value of the function that’s easily distinguished from the “normal” values of the function. For example, let us define “the number of factors of 2 in x” to be

ntz(x), if x ≠ 0, and
−1,     if x = 0.

This can be calculated from

31 − nlz(x & −x).
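
A sketch in C, using the nlz (number of leading zeros) function defined earlier in this chapter; the function name nfact2 is ours:

int nfact2(unsigned x) {
   return 31 - nlz(x & -x);     // ntz(x) if x != 0, else -1.
}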

Applications

[GLS1] points out some interesting applications of the number of trailing zeros function. It has been called the “ruler function” because it gives the height of a tick mark on a ruler that’s divided into halves, quarters, eighths, and so on.

It has an application in R. W. Gosper’s loop-detection algorithm, which will now be described in some detail, because it is quite elegant and it does more than might at first seem possible.

Suppose a sequence X_0, X_1, X_2, ... is defined by X_{n+1} = f(X_n). If the range of f is finite, the sequence is necessarily periodic. That is, it consists of a leader X_0, X_1, ..., X_{μ−1} followed by a cycle X_μ, X_{μ+1}, ..., X_{μ+λ−1} that repeats without limit (X_μ = X_{μ+λ}, X_{μ+1} = X_{μ+λ+1}, and so on, where λ is the period of the cycle). Given the function f, the loop-detection problem is to find the index μ of the first element that repeats, and the period λ. Loop detection has applications in testing random number generators and detecting a cycle in a linked list.

One could save all the values of the sequence as they are produced and compare each new element with all the preceding ones. This would immediately show where the second cycle starts. But algorithms exist that are much more efficient in space and time.

Perhaps the simplest is due to R. W. Floyd [Knu2, sec. 3.1, prob. 6]. This algorithm iterates the process

x = f(x)
y = f(f(y))

with x and y initialized to X_0. After the nth step, x = X_n and y = X_{2n}. These are compared, and if equal, it is known that X_n and X_{2n} are separated by an integral multiple of the period λ; that is, 2n − n = n is a multiple of λ. Then μ can be determined by regenerating the sequence from the beginning, comparing X_0 to X_n, then X_1 to X_{n+1}, and so on. Equality occurs when X_μ is compared to X_{n+μ}. Finally, λ can be determined by regenerating more elements, comparing X_μ to X_{μ+1}, X_{μ+2}, .... This algorithm requires only a small and bounded amount of space, but it evaluates f many times.
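
A sketch of Floyd’s algorithm in C, following the description above; the name ld_Floyd and the interface, which mimics that of Figure 5–28, are ours:

void ld_Floyd(int (*f)(int), int X0, int *mu, int *lambda) {
   int x, y;

   x = X0; y = X0;
   do {                          // After n steps,
      x = f(x);                  // x = X_n and
      y = f(f(y));               // y = X_2n.
   } while (x != y);

   x = X0;                       // Regenerate from the start,
   *mu = 0;                      // keeping y ahead of x by n,
   while (x != y) {              // a multiple of lambda.
      x = f(x);
      y = f(y);
      *mu = *mu + 1;             // Meet at x = y = X_mu.
   }

   *lambda = 1;                  // Go around the cycle once
   y = f(x);                     // to measure its length.
   while (x != y) {
      y = f(y);
      *lambda = *lambda + 1;
   }
}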

Gosper’s algorithm [HAK, item 132; Knu2, Answers to Exercises for Section 3.1, exercise 7] finds the period λ, but not the starting point μ of the first cycle. Its main feature is that it never backs up to reevaluate f, and it is quite economical in space and time. It is not bounded in space; it requires a table of size ⌈log2 Λ⌉ + 1, where Λ is the largest possible period. This is not a lot of space; for example, if it is known a priori that Λ ≤ 2^32, then 33 words suffice.

Gosper’s algorithm, coded in C, is shown in Figure 5–28. This C function is given the function f being analyzed and a starting value X_0. It returns lower and upper bounds on μ, and the period λ. (Although Gosper’s algorithm cannot compute μ, it can compute lower and upper bounds μ_l and μ_u such that μ_u − μ_l + 1 ≤ max(λ − 1, 1).) The algorithm works by comparing X_n, for n = 1, 2, ..., to a subset of size ⌊log2 n⌋ + 1 of the elements of the sequence that precede X_n. The elements of the subset are the closest preceding X_i such that i + 1 ends in a 1-bit (that is, i is the even number preceding n), the closest preceding X_i such that i + 1 ends in exactly one 0-bit, the closest preceding X_i such that i + 1 ends in exactly two 0-bits, and so on.


void ld_Gosper(int (*f)(int), int X0, int *mu_l,
               int *mu_u, int *lambda) {
   int Xn, k, m, kmax, n, lgl;
   int T[33];

   T[0] = X0;
   Xn = X0;
   for (n = 1; ; n++) {
      Xn = f(Xn);
      kmax = 31 - nlz(n);           // Floor(log2 n).
      for (k = 0; k <= kmax; k++) {
         if (Xn == T[k]) goto L;
      }
      T[ntz(n+1)] = Xn;            // No match.
   }
L:
   // Compute m = max{i | i < n and ntz(i+1) = k}.

   m = ((((n >> k) - 1) | 1) << k) - 1;
   *lambda = n - m;
   lgl = 31 - nlz(*lambda - 1); // Ceil(log2 lambda) - 1.
   *mu_u = m;                       // Upper bound on mu.
   *mu_l = m - max(1, 1 << lgl) + 1;// Lower bound on mu.
}


FIGURE 5–28. Gosper’s loop-detection algorithm.

Thus, the comparisons proceed as follows:

X_1 with X_0
X_2 with X_0, X_1
X_3 with X_2, X_1
X_4 with X_2, X_1, X_3
X_5 with X_4, X_1, X_3
X_6 with X_4, X_5, X_3
X_7 with X_6, X_5, X_3
X_8 with X_6, X_5, X_3, X_7
...

It can be shown that the algorithm always terminates with n somewhere in the second cycle—that is, with n < μ + 2λ. See [Knu2] for further details.

The ruler function reveals how to solve the Tower of Hanoi puzzle. Number the n disks from 0 to n − 1. At each move k, as k goes from 1 to 2^n − 1, move disk ntz(k) the minimum permitted distance to the right, in a circular manner.
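
A minimal sketch, which prints the disk to move at each step; it assumes the ntz function of this section, and which peg the disk goes to follows from the circular rule just stated:

#include <stdio.h>

void hanoi(int n) {              // Solve for n <= 31 disks.
   unsigned k;
   for (k = 1; k < (1u << n); k++)
      printf("Move %u: disk %d\n", k, ntz(k));
}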

The ruler function can be used to generate a reflected binary Gray code (see Section 13–1 on page 311). Start with an arbitrary n-bit word, and at each step k, as k goes from 1 to 2^n − 1, flip bit ntz(k).
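
Again as a sketch, starting from the all-zero word and assuming the same ntz function:

#include <stdio.h>

void gray(int n) {               // Print a 2**n-word Gray code.
   unsigned g = 0, k;            // Arbitrary starting word.
   printf("%08x\n", g);
   for (k = 1; k < (1u << n); k++) {
      g = g ^ (1u << ntz(k));    // Flip bit ntz(k).
      printf("%08x\n", g);      // Next code word.
   }
}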

Exercises

1. Code Dubé’s algorithm for the ntz function, expanding the multiplication.

2. Code the “right justify” function, x >> ntz(x), for x ≠ 0, in three basic RISC instructions.

3. Are the parallel prefix and suffix (with XOR) operations invertible? If so, how would you compute the inverse functions?
