Chapter 3. Power-of-2 Boundaries

3–1 Rounding Up/Down to a Multiple of a Known Power of 2

Rounding an unsigned integer x down to, for example, the next smaller multiple of 8 is trivial: x & −8 does it. An alternative is Image. These work for signed integers as well, provided “round down” means to round in the negative direction (e.g., (−37) & (−8) = −40).

Rounding up is almost as easy. For example, an unsigned integer x can be rounded up to the next greater multiple of 8 with either of

Image

These expressions are correct for signed integers as well, provided “round up” means to round in the positive direction. The second term of the second expression is useful if you want to know how much you must add to x to make it a multiple of 8 [Gold].

To round a signed integer to the nearest multiple of 8 toward 0, you can combine the two expressions above in an obvious way:

Image

An alternative for the first line is Image, which is useful if the machine lacks and immediate, or if the constant is too large for its immediate field.

Sometimes the rounding factor is given as the log2 of the alignment amount (e.g., a value of 3 means to round to a multiple of 8). In this case, code such as the following can be used, where k = log2(alignment amount):

Image

3–2 Rounding Up/Down to the Next Power of 2

We define two functions that are similar to floor and ceiling, but which are directed roundings to the closest integral power of 2, rather than to the closest integer. Mathematically, they are defined by

Image

The initial letters of the function names are intended to suggest “floor” and “ceiling.” Thus, flp2(x) is the greatest power of 2 that is x, and clp2(x) is the least power of 2 that is x. These definitions make sense even when x is not an integer (e.g., flp2(0.1) = 0.0625). The functions satisfy several relations analogous to those involving floor and ceiling, such as those shown below, where n is an integer.

Image

Computationally, we deal only with the case in which x is an integer, and we take it to be unsigned, so the functions are well defined for all x. We require the value computed to be the arithmetically correct value modulo 232 (that is, we take clp2(x) to be 0 for x > 231). The functions are tabulated below for a few values of x.

Image

Functions flp2 and clp2 are connected by the relations shown below. These can be used to compute one from the other, subject to the indicated restrictions.

Image

The round-up and round-down functions can be computed quite easily with the number of leading zeros instruction, as shown below. However, for these relations to hold for x = 0 and x > 231, the computer must have its shift instructions defined to produce 0 for shift amounts of –1, 32, and 63. Many machines (e.g., PowerPC) have “mod-64” shifts, which do this. In the case of −1, it is adequate if the machine shifts in the opposite direction (that is, a shift left of –1 becomes a shift right of 1).

Image

Rounding Down

Figure 3–1 illustrates a branch-free algorithm that might be useful if number of leading zeros is not available. This algorithm is based on right-propagating the leftmost 1-bit, and executes in 12 instructions.


unsigned flp2(unsigned x) {
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >> 16);
   return x - (x >> 1);
}


FIGURE 3–1. Greatest power of 2 less than or equal to x, branch free.

Figure 3–2 shows two simple loops that compute the same function. All variables are unsigned integers. The loop on the right keeps turning off the rightmost 1-bit of x until x = 0, and then returns the previous value of x.


y = 0x80000000;             do {
while (y > x)                  y = x;
   y = y >> 1;                 x = x & (x - 1);
return y;                   } while(x != 0);
                            return y;


FIGURE 3–2. Greatest power of 2 less than or equal to x, simple loops.

The loop on the left executes in 4nlz(x) + 3 instructions. The loop on the right, for x ≠ 0, executes in 4 pop(x) instructions,1 if the comparison to 0 is zero-cost.

Rounding Up

The right-propagation trick yields a good algorithm for rounding up to the next power of 2. This algorithm, shown in Figure 3–3, is branch free and runs in 12 instructions.


unsigned clp2(unsigned x) {
   x = x − 1;
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >> 16);
   return x + 1;
}


FIGURE 3–3. Least power of 2 greater than or equal to x.

An attempt to compute this with the obvious loop does not work out very well:


y = 1;

while (y < x)        // Unsigned comparison.
   y = 2*y;
return y;

This code returns 1 for x = 0, which is probably not what you want, loops forever for x > 231, and executes in 4n + 3 instructions, where n is the power of 2 of the returned integer. Thus, it is slower than the branch-free code, in terms of instructions executed, for n≥ 3 (x≥ 8).

3–3 Detecting a Power-of-2 Boundary Crossing

Assume memory is divided into blocks that are a power of 2 in size, starting at address 0. The blocks may be words, doublewords, pages, and so on. Then, given a starting address a and a length l, we wish to determine whether or not the address range from a to a + l –1, l ≥ 2, crosses a block boundary. The quantities a and l are unsigned and any values that fit in a register are possible.

If l = 0 or 1, a boundary crossing does not occur, regardless of a. If l exceeds the block size, a boundary crossing does occur, regardless of a. For very large values of l (wraparound is possible), a boundary crossing can occur even if the first and last bytes of the address range are in the same block.

There is a surprisingly concise way to detect boundary crossings on the IBM System/370 [CJS]. This method is illustrated below for a block size of 4096 bytes (a common page size).

O   RA,=A(-4096)
ALR RA,RL
BO  CROSSES

The first instruction forms the logical or of RA (which contains the starting address a) and the number 0xFFFFF000. The second instruction adds in the length and sets the machine’s 2-bit condition code. For the add logical instruction, the first bit of the condition code is set to 1 if a carry occurred, and the second bit is set to 1 if the 32-bit register result is nonzero. The last instruction branches if both bits are set. At the branch target, RA will contain the length that extends beyond the first page (this is an extra feature that was not asked for).

If, for example, a = 0 and l = 4096, a carry occurs, but the register result is 0, so the program properly does not branch to label CROSSES.

Let us see how this method can be adapted to RISC machines, which generally do not have branch on carry and register result nonzero. Using a block size of 8 for notational simplicity, the method of [CJS] branches to CROSSES if a carry occurred ((a | –8) + l≥ 232) and the register result is nonzero ((a | –8) + l ≠ 232). Thus, it is equivalent to the predicate

(a | –8) + l > 232.

This in turn is equivalent to getting a carry in the final addition in evaluating ((a | –8) – 1) + l. If the machine has branch on carry, this can be used directly, giving a solution in about five instructions, counting a load of the constant −8.

If the machine does not have branch on carry, we can use the fact that carry occurs in x + y iff Image (see “Unsigned Add/Subtract” on page 31) to obtain the expression

Image

Using various identities such as ¬(x1) = −x gives the following equivalent expressions for the “boundary crossed” predicate:

Image

These can be evaluated in five or six instructions on most RISC computers, counting the final conditional branch.

Using another tack, clearly an 8-byte boundary is crossed iff

(a & 7) + l − 1 ≥ 8.

This cannot be directly evaluated because of the possibility of overflow (which occurs if l is very large), but it is easily rearranged to 8 − (a & 7) < l, which can be directly evaluated on the computer (no part of it overflows). This gives the expression

Image

which can be evaluated in five instructions on most RISCs (four if it has subtract from immediate). If a boundary crossing occurs, the length that extends beyond the first block is given by l − (8 − (a & 7)), which can be calculated with one additional instruction (subtract).

This formula can be easily understood from the figure below [Kumar], which illustrates that a & 7 is the offset of a in its block, and thus 8 − (a & 7) is the space remaining in the block.

Image

Exercises

1. Show how to round an unsigned integer to the nearest multiple of 8, with the halfway case (a) rounding up, (b) rounding down, and (c) rounding up or down, whichever makes the next bit to the left a zero (“unbiased” rounding).

2. Show how to round an unsigned integer to the nearest multiple of 10, with the halfway case (a) rounding up, (b) rounding down, and (c) rounding up or down, whichever results in an even multiple of 10. Feel free to use division, remaindering, and multiplication instructions, and don’t be concerned about values very close to the largest unsigned integer.

3. Code a function in C that does an “unaligned load.” The function is given an address a and it loads the four bytes from addresses a through a + 3 into a 32-bit register, as if those four bytes contained an integer. Parameter a addresses the low-order byte (that is, the machine is little-endian). The function should be branch free, it should execute at most two load instructions and, if a is full-word aligned, it must not attempt to load from address a + 4, because that may be in a read-protected block.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset