Chapter 6. Searching Words

6–1 Find First 0-Byte

The need for this function stems mainly from the way character strings are represented in the C language. They have no explicit length stored with them; instead, the end of the string is denoted by an all-0 byte. To find the length of a string, a C program uses the “strlen” (string length) function. This function searches the string, from left to right, for the 0-byte, and returns the number of bytes scanned, not counting the 0-byte.

A fast implementation of “strlen” might load and test single bytes until a word boundary is reached, and then load a word at a time into a register, and test the register for the presence of a 0-byte. On big-endian machines, we want a function that returns the index of the first 0-byte from the left. A convenient encoding is values from 0 to 3 denoting bytes 0 to 3, and a value of 4 denoting that there is no 0-byte in the word. This is the value to add to the string length, as successive words are searched, if the string length is initialized to 0. On little-endian machines, one wants the index of the first 0-byte from the right end of the register, because little-endian machines reverse the four bytes when a word is loaded into a register. Specifically, we are interested in the following functions, where “00” denotes a 0-byte, “nn” denotes a nonzero byte, and “xx” denotes a byte that may be 0 or nonzero.

Image

Our first procedure for the find leftmost 0-byte function, shown in Figure 6–1, simply tests each byte, in left-to-right order, and returns the result when the first 0-byte is found.


int zbytel(unsigned x) {
   if             ((x >> 24) == 0) return 0;
   else if ((x & 0x00FF0000) == 0) return 1;
   else if ((x & 0x0000FF00) == 0) return 2;
   else if ((x & 0x000000FF) == 0) return 3;
   else return 4;
}


FIGURE 6–1. Find leftmost 0-byte, simple sequence of tests.

This executes in two to 11 basic RISC instructions, 11 in the case that the word has no 0-bytes (which is the important case for the “strlen” function). A very similar program will handle the problem of finding the rightmost 0-byte.

Figure 6–2 shows a branch-free procedure for this function. The idea is to convert each 0-byte to 0x80, and each nonzero byte to 0x00, and then use number of leading zeros. This procedure executes in eight instructions, if the machine has the number of leading zeros and nor instructions. Some similar tricks are described in [Lamp].


int zbytel(unsigned x) {
   unsigned y;
   int n;
                          // Original byte: 00 80 other
   y = (x & 0x7F7F7F7F)+ 0x7F7F7F7F;     // 7F 7F 1xxxxxxx
   y = ~(y 1 x 1 0x7F7F7F7F);            // 80 00 00000000
   n = nlz(y) >> 3;              // n = 0 ... 4, 4 if x
   return n;                     // has no 0-byte.
}


FIGURE 6–2. Find leftmost 0-byte, branch-free code.

The position of the rightmost 0-byte is given by the number of trailing 0’s in the final value of y computed above, divided by 8 (with fraction discarded). Using the expression for computing the number of trailing 0’s by means of the number of leading zeros instruction (see Section 5–4, “Counting Trailing 0’s,” on page 107), this can be computed by replacing the assignment to n in the procedure above with:

n = (32 - nlz(~y & (y - 1))) >> 3;

This is a 12-instruction solution, if the machine has nor and and not.

In most situations on PowerPC, incidentally, a procedure to find the rightmost 0-byte would not be needed. Instead, the words can be loaded with the load word byte-reverse instruction (lwbrx).

The procedure of Figure 6–2 is more valuable on a 64-bit machine than on a 32-bit one, because on a 64-bit machine the procedure (with obvious modifications) requires about the same number of instructions (seven or ten, depending upon how the constant is generated), whereas the technique of Figure 6–1 requires 23 instructions worst case.

If only a test for the presence of a 0-byte is wanted, then a branch on zero (or nonzero) can be inserted just after the second assignment to y.

A method similar to that of Figure 6–2, but for finding the rightmost 0-byte in a word x (zbyter(x)), is [Mycro]:

y = (x - 0x01010101) & ~x & 0x80808080;
n = ntz(y) >> 3;

This executes in only five instructions exclusive of loading the constants if the machine has the and not and number of trailing zeros instructions. It cannot be used to compute zbytel(x), because of a problem with borrows. It would be most useful for finding the first 0-byte in a character string on a little-endian machine, or to simply test for a 0-byte (using only the assignment to y) on a machine of either endianness.

If the nlz instruction is not available, there does not seem to be any really good way to compute the find first 0-byte function. Figure 6–3 shows a possibility (only the executable part of the code is shown).

This executes in ten to 13 basic RISC instructions, ten in the all-nonzero case. Thus, it is probably not as good as the code of Figure 6–1, although it does have fewer branch instructions. It does not scale very well to 64-bit machines, unfortunately.

There are other possibilities for avoiding the nlz function. The value of y computed by the code of Figure 6–3 consists of four bytes, each of which is either 0x00 or 0x80. The remainder after dividing such a number by 0x7F is the original value with the up-to-four 1-bits moved and compressed to the four rightmost positions. Thus, the remainder ranges from 0 to 15 and uniquely identifies the original number. For example,

Image

This value can be used to index a table, 16 bytes in size, to get the desired result. Thus, the code beginning if (y == 0) can be replaced with

static char table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                         0, 0, 0, 0, 0, 0, 0, 0};
return table[y%127];

where y is unsigned. The number 31 can be used in place of 127, but with a different table.


                    // Original byte: 00 80 other
y = (x & 0x7F7F7F7F) + 0x7F7F7F7F; // 7F 7F 1xxxxxxx
y = ~(y | x | 0x7F7F7F7F);         // 80 00 00000000
                                   // These steps map:
if (y == 0) return 4;              // 00000000 ==> 4,
else if (y > 0x0000FFFF)           // 80xxxxxx ==> 0,
   return (y >> 31) ^ 1;           // 0080xxxx ==> 1,
else                               // 000080xx ==> 2,
   return (y >> 15) ^ 3;           // 00000080 ==> 3.


FIGURE 6–3. Find leftmost 0-byte, not using nlz.

These methods involving dividing by 127 or 31 are really just curiosities, because the remainder function is apt to require 20 cycles or more, even if directly implemented in hardware. However, below are two more efficient replacements for the code in Figure 6–3 beginning with if (y == 0):


return table[hopu(y, 0x02040810) & 15];
return table[y*0x00204081 >> 28];

Here, hopu(a, b) denotes the high-order 32 bits of the unsigned product of a and b. In the second line, we assume the usual HLL convention that the value of the multiplication is the low-order 32 bits of the complete product. This might be a practical method, if either the machine has a fast multiply or the multiplication by 0x204081 is done by shift-and-add’s. It can be done in four such instructions, as suggested by

y (1 + 27 + 214 + 221) = y (1 + 27)(1 + 214).

Using this 4-cycle way to do the multiplication, the total time for the procedure comes to 13 cycles (7 to compute y, plus 4 for the shift-and-add’s, plus 2 for the shift right of 28 and the table index), and of course it is branch-free.

These scale reasonably well to a 64-bit machine. For the “modulus” method, use

return table[y%511];

where table is of size 256, with values 8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, ... (i.e., table[i] = number of trailing 0’s in i).

For the multiplicative methods, use either

return table[hopu(y, 0x02040810 20408100) & 255]; or
return table[(y*0x0002 0408 1020 4081) >> 56];

where table is of size 256, with values 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 3, ....

The multiplication by 0x20408 10204081 can be done with

Image

which gives a 13-cycle solution.

All these variations using the table can, of course, implement the find rightmost 0-byte function by simply changing the data in the table.

If the machine does not have the nor instruction, the not in the second assignment to y in Figure 6–3 can be omitted, in the case of a 32-bit machine, by using one of the three return statements given above, with table[i] = 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4. This scheme does not quite work on a 64-bit machine.

Here is an interesting variation on the procedure of Figure 6–2, again aimed at machines that do not have number of leading zeros. Let a, b, c, and d be 1-bit variables for the predicates “the first byte of x is nonzero,” “the second byte of x is nonzero,” and so on. Then,

zbytel(x) = a + ab + abc + abcd.

The multiplications can be done with and’s, leading to the procedure shown in Figure 6–4 (only the executable code is shown). This comes to 15 instructions on the basic RISC, which is not particularly fast, but there is a certain amount of parallelism. On a superscalar machine that can execute up to three arithmetic instructions in parallel, provided they are independent, it comes to only ten cycles.


y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;
y = y | x;           // Leading 1 on nonzero bytes.

t1 = y >> 31;                // tl = a.
t2 = (y >> 23) & tl;         // t2 = ab.
t3 = (y >> 15) & t2;         // t3 = abc.
t4 = (y >>  7) & t3;          // t4 = abcd.
return t1 + t2 + t3 + t4;


FIGURE 6–4. Find leftmost 0-byte by evaluating a polynomial.

A simple variation of this does the find rightmost 0-byte function, based on

zbyter(x) = abcd + bcd + cd + d.

(This requires one more and than the code of Figure 6–4.)

Some Simple Generalizations

Functions zbytel and zbyter can be used to search for a byte equal to any particular value, by first exclusive or’ing the argument x with a word consisting of the desired value replicated in each byte position. For example, to search x for an ASCII blank (0x20), search x0x 20202020 for a 0-byte.

Similarly, to search for a byte position in which two words x and y are equal, search xy for a 0-byte.

There is nothing special about byte boundaries in the code of Figure 6–2 and its variants. For example, to search a word for a 0-value in any of the first four bits, the next 12, or the last 16, use the code of Figure 6–2 with the mask replaced by 0x77FF7FFF [PHO]. (If a field length is 1, use a 0 in the mask at that position.)

Searching for a Value in a Given Range

The code of Figure 6–2 can easily be modified to search for a byte in the range 0 to any specified value less than 128. To illustrate, the following code finds the index of the leftmost byte having value from 0 to 9:

y = (x & 0x7F7F7F7F) + 0x76767676;
y = y | x;
y = y | 0x7F7F7F7F;          // Bytes > 9 are 0xFF.
y = ~y;                      // Bytes > 9 are 0x00,
                             // bytes <= 9 are 0x80.
n = nlz(y) >> 3;

More generally, suppose you want to find the leftmost byte in a word that is in the range a to b, where the difference between a and b is less than 128. For example, the uppercase letters encoded in ASCII range from 0x41 to 0x5A. To find the first uppercase letter in a word, subtract 0x41414141 in such a way that the borrow does not propagate across byte boundaries, and then use the above code to identify bytes having value from 0 to 0x19 (0x5A – 0x41). Using the formulas for subtraction given in Section 2–18, “Multibyte Add, Subtract, Absolute Value,” on page 40, with obvious simplifications possible with y = 0x41414141, gives

d = (x | 0x80808080) - 0x41414141;
d = ~((x | 0x7F7F7F7F) ^ d);
y = (d & 0x7F7F7F7F) + 0x66666666;
y = y | d;
y = y | 0x7F7F7F7F;    // Bytes not from 41–5A are FF.
y = ~y;                // Bytes not from 41–5A are 00,
                       // bytes from 41–5A are 80.
n = nlz(y) >> 3;

For some ranges of values, simpler code exists. For example, to find the first byte whose value is 0x30 to 0x39 (a decimal digit encoded in ASCII), simply exclusive or the input word with 0x30303030 and then use the code given above to search for a value in the range 0 to 9. (This simplification is applicable when the upper and lower limits have n high-order bits in common, and the lower limit ends with 8 – n 0’s.)

These techniques can be adapted to handle ranges of 128 or larger with no additional instructions. For example, to find the index of the leftmost byte whose value is in the range 0 to 137 (0x89), simply change the line y = y | x to y = y & x in the code above for searching for a value from 0 to 9.

Similarly, changing the line y = y | d to y = y & d in the code for finding the leftmost byte whose value is in the range 0x41 to 0x5A causes it to find the leftmost byte whose value is in the range 0x41 to 0xDA.

6–2 Find First String of 1-Bits of a Given Length

The problem here is to search a word in a register for the first string of 1-bits of a given length n or longer, and to return its position, with some special indication if no such string exists. Variants are to return only the yes/no indication and to locate the first string of exactly n 1-bits. This problem has application in disk-allocation programs, particularly for disk compaction (rearranging data on a disk so that all blocks used to store a file are contiguous). The problem was suggested to me by Albert Chang, who pointed out that it is one of the uses for the number of leading zeros instruction.

We assume here that the number of leading zeros instruction, or a suitable subroutine for that function, is available.

An algorithm that immediately comes to mind is to first count the number of leading 0’s and skip over them by shifting left by the number obtained. Then count the leading 1’s by inverting and counting leading 0’s. If this is of sufficient length, we are done. Otherwise, shift left by the number obtained and repeat from the beginning. This algorithm might be coded as shown below. If n consecutive 1-bits are found, it returns a number from 0 to 31, giving the position of the leftmost 1-bit in the leftmost such sequence. Otherwise, it returns 32 as a “not found” indication.

int ffstr1(unsigned x, int n) {
   int k, p;

   p = 0;                  // Initialize position to return.
   while (x != 0) {
      k = nlz(x);          // Skip over initial 0's
      x = x << k;          // (if any).
      p = p + k;
      k = nlz(~x);         // Count first/next group of 1's.
      if (k >= n)          // If enough,
         return p;         // return.
      x = x << k;          // Not enough 1's, skip over
      p = p + k;           // them.
   }
   return 32;
}

This algorithm is reasonable if it is expected that the loop will not be executed very many times—for example, if it is expected that x will have long sequences of 1’s and of 0’s. This might very well be the expectation in the disk-allocation application. Its worst-case execution time, however, is not very good; for example, about 178 full RISC instructions executed for x = 0x55555555 and n ≥ 2.

An algorithm that is better in worst-case execution time is based on a sequence of shift left and and instructions. To see how this works, consider searching for a string of eight or more consecutive 1-bits in a 32-bit word x. This might be done as follows:

Image

After the first assignment, the 1’s in x indicate the starting positions of strings of length 2. After the second assignment, the 1’s in x indicate the starting positions of strings of length 4 (a string of length 2 followed by another string of length 2). After the third assignment, the 1’s in x indicate the starting positions of strings of length 8. Executing number of leading zeros on this word gives the position of the first string of length 8 (or more), or 32 if none exists.

To develop an algorithm that works for any length n from 1 to 32, we will look at this a little differently. First, observe that the above three assignments can be done in any order. Reverse order will be more convenient. To illustrate the general method, consider the case n = 10:

Image

The first statement shifts by n/2. After executing it, the problem is reduced to finding a string of five consecutive 1-bits in x1. This can be done by shifting left by 5/2 = 2, and’ing, and searching the result for a string of length 3 (5 – 2). The last two statements identify where the strings of length 3 are in x2. The sum of the shift amounts is always n– 1. The algorithm is shown in Figure 6–5. The execution time ranges from 3 to 36 full RISC instructions, as n ranges from 1 to 32.


int ffstr1(unsigned x, int n) {
   int s;

   while (n > 1) {
      s = n >> 1;
      x = x & (x >> s);
      n = n − s;
   }
   return nlz(x);
}


FIGURE 6–5. Find first string of n 1’s, shift-and-and sequence.

If n is often moderately large, it is not unreasonable to unroll this loop by repeating the loop body five times and omitting the test n > 1. (Five is always sufficient for a 32-bit machine.) This gives a branch-free algorithm that runs in a constant time of 20 instructions executed (the last assignment to n can be omitted). Although for small values of n, the three assignments are executed more than necessary, the result is unchanged by the extra steps, because variable n sticks at the value 1, and for this value the three steps have no effect on x or n. The unrolled version is faster than the looping version for n ≥ 5, in terms of number of instructions executed.

A string of exactly n 1-bits can be found in six more instructions (four if and not is available). The quantity x computed by the algorithm of Figure 6–5 has 1-bits wherever a string of length n or more 1-bits begins. Hence, using the final value of x computed by that algorithm, the expression

Image

contains a 1-bit wherever the final x contains an isolated 1-bit, which is to say wherever the original x began a string of exactly n 1-bits.

The algorithm is also easily adapted to finding strings of length n that begin at certain locations. For example, to find strings that begin at byte boundaries, simply and the final x with 0x80808080.

It can be used to find strings of 0-bits either by complementing x at the start, or by changing the and’s to or’s and complementing x just before invoking nlz. For example, below is an algorithm for finding the first (leftmost) 0-byte (see Section 6–1, “Find First 0-Byte,” on page 117, for a precise definition of this problem).

Image

This executes in 12 instructions on the full RISC (not as good as the algorithm of Figure 6–2 on page 118, which executes in eight instructions).

6–3 Find Longest String of 1-Bits

The nicely concise function shown in Figure 6–6 returns the length of the longest string of 1-bits in x [Hsieh].


int maxstr1(unsigned x) {
   int k;
   for (k = 0; x ! = 0; k++) x = x & 2*x;
   return k;
}


FIGURE 6–6. Find length of longest string of 1’s.

It executes in 4n + 3 instructions on the basic RISC, where n is the length of the longest string of 1’s, or 131 instructions in the worst case.

To reduce the worst-case execution time, a “logarithmic” version is possible. It works by propagating 0’s one, two, four, eight, and 16 positions to the left, stopping at the last nonzero word, and then backtracking to find the length of the longest contiguous string of 1’s.

For example, suppose

 x = 0011 1111 1111 0011 1111 0011 1111 1000

Then

 x2 = 0011 1111 1110 0011 1110 0011 1111 0000
 x4 = 0011 1111 1000 0011 1000 0011 1100 0000
 x8 = 0011 1000 0000 0000 0000 0000 0000 0000
x16 = all 0's

In this case, the last nonzero word is x8. Observe that each 1-bit in x8 indicates the leftmost position of a string of eight 1’s. Thus, the longest string of 1’s begins at the leftmost position of a 1-bit in x8, bit position 29 in the example. To test for a string of length 12, one can test the bit at position 21 (29 – 8) in x4. Since that is 0, there is no string of length 12. To test for a string of length 10, one can test the bit at position 21 in x2. Since that is 1, position 29 is the start of a string of length 10 (or more). Last, to test for a string of length 11, one can test the bit at position 19 (21 – 2) in x. Because that is 0, the longest string is of length 10, and it starts at position 29.

This scheme is coded in Figure 6–7, except the code uses only two variables, x and y, instead of the five variables x, x2, x4, x8, and x16. This code finds both the length and position of the longest string of 1’s, with the position being measured from the left end of the string. The scheme does not work if x is 0 or all 1’s. These are special-cased, with the latter possibility being handled in a place that is not executed frequently.


int fmaxstr1(unsigned x, int *apos) {
   unsigned y;
   int s;

   if (x == 0) {*apos = 32; return 0;}
   y = x & (x << 1);
   if (y == 0) {s = 1; goto L1;}
   x = y & (y << 2);
   if (x == 0) {s = 2; x = y; goto L2;}
   y = x & (x << 4);
   if (y == 0) {s = 4; goto L4;}
   x = y & (y << 8);
   if (x == 0) {s = 8; x = y; goto L8;}
   if (x == 0xFFFF8000) {*apos = 0; return 32;}
   s = 16;

L16: y = x & (x << 8);
     if (y != 0) {s = s + 8; x = y;}
L8:  y = x & (x << 4);
     if (y != 0) {s = s + 4; x = y;}
L4:  y = x & (x << 2);
     if (y != 0) {s = s + 2; x = y;}
L2:  y = x & (x << 1);
     if (y != 0) {s = s + 1; x = y;}
L1:  *apos = nlz(x);
   return s;
}


FIGURE 6–7. Find length and position of longest string of 1’s.

The worst-case execution time on the basic RISC is 39 instructions, plus those required for the nlz function. If only the length of the longest string of 1’s is wanted, there is no significant savings in execution time, except for omitting the use of the nlz function.

6–4 Find Shortest String of 1-Bits

It is more difficult to find the shortest string of 1-bits in a word. One way to do it is to mark the beginnings of all strings of 1’s in a word b and the ends of all such strings in a word e. Then, if b & e is nonzero, the shortest string is of length 1. Otherwise, shift e left one position and test again. For example, if

x = 0011 1111 1111 0011 1111 0011 1111 1000

then

b = 0010 0000 0000 0010 0000 0010 0000 0000
e = 0000 0000 0001 0000 0001 0000 0000 1000

After shifting e left five places, b & e is nonzero. This means that the shortest string of 1-bits is of length 6.

This idea is embodied in the code shown in Figure 6–8. As in the preceding material, the position of the string is measured from the left, and if there are two or more minimal length strings of equal length, this function finds the leftmost one. For example, if x = 0x00FF0FF0 it returns length 8, position 8.


int fminstr1(unsigned x, int *apos) {
   int k;
   unsigned b, e;       // Beginnings, ends.

   if (x == 0) {*apos = 32; return 0;}
   b = ~(x >> 1) & x;   // 0–1 transitions.
   e = x & ~(x << 1);   // 1–0 transitions.
   for (k = 1; (b & e) == 0; k++)
      e = e << 1;
   *apos = nlz(b & e);
   return k;
}


FIGURE 6–8. Find length and position of shortest string of 1’s.

The function executes in 8 + 4 n instructions on the basic RISC (without andc), plus the time for the nlz function, for n ≥ 2, where n is the length of the shortest contiguous string of 1’s in x.

Perhaps the ultimate problem in this class is to find the length and position of the shortest string of 1’s in x that is at least as long as a given integer n> 0. In terms of the storage allocation problem, this is a “best fit” algorithm. This can be done by first left-propagating the 0’s in x by n − 1 positions and then finding the shortest string of 1’s in the revised x. See the exercises.

Exercises

1. Code an elaboration of Hsieh’s algorithm that will find both the length and position of the longest string of 1’s in a word x. You may use the nlz function.

2. Code a function for finding the length and position of the shortest string of 1’s in a word x that is at least as long as a given integer n.

3. Another way to find the shortest string of 1’s in a word x is to successively turn off the rightmost string of 1’s in x and observe the change in population count at each step. Code a function for the full RISC that uses this idea and also finds the position of a shortest string of 1’s.

4. For “completely random” 32-bit words x (each bit independently 0 or 1 with probability 0.5), what is the average number of strings of 1’s in x? The answer determines the average execution time of the function of exercise 3, for such input data.

5. Again, for “completely random” 32-bit words x, what is the average length of the shortest contiguous string of 1’s in x? The answer determines the average execution time of function fminstr1 in Figure 6–8 for such input data. Compute this with a Monte Carlo or exhaustive enumeration program.

6. Of the 2n binary words of length n, for how many is their shortest contained string of 1’s of length 1? That is, how many n-bit words begin with 10, or end with 01, or contain the sequence 010? Find a closed-form solution or a recursion, not an exhaustive enumeration program.

7. Similarly, of the 2n binary words of length n, for how many is their shortest contained string of 1’s of length 2?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset