Chapter 10. Integer Division By Constants

On many computers, division is very time consuming and is to be avoided when possible. A value of 20 or more elementary add times is not uncommon, and the execution time is usually the same large value even when the operands are small. This chapter gives some methods for avoiding the divide instruction when the divisor is a constant.

10–1 Signed Division by a Known Power of 2

Apparently, many people have made the mistake of assuming that a shift right signed of k positions divides a number by 2k, using the usual truncating form of division [GLS2]. It’s a little more complicated than that. The code shown below computes q = n ÷ 2k, for 1 ≤ k ≤ 31 [Hop].

shrsi t,n,k-1        Form the integer
shri  t,t,32-k       2**k – 1 if n < 0, else 0.
add   t,n,t          Add it to n,
shrsi q,t,k          and shift right (signed).

It is branch free. It simplifies to three instructions in the common case of division by 2 (k = 1). It does, however, rely on the machine’s being able to shift by a large amount in a short time. The case k = 31 does not make too much sense, because the number 231 is not representable in the machine. Nevertheless, the code does produce the correct result in that case (which is q = –1 if n = –231and q = 0 for all other n).

To divide by –2k, the above code can be followed by a negate instruction. There does not seem to be any better way to do it.

The more straightforward code for dividing by 2k is

        bge   n,label       Branch if n >= 0.
        addi  n,n,2**k-1    Add 2**k - 1 to n,
 label  shrsi n,n,k         and shift right (signed).

This would be preferable on a machine with slow shifts and fast branches.

PowerPC has an unusual device for speeding up division by a power of 2 [GGS]. The shift right signed instructions set the machine’s carry bit if the number being shifted is negative and one or more 1-bits are shifted out. That machine also has an instruction for adding the carry bit to a register, denoted addze. This allows division by any (positive) power of 2 to be done in two instructions:

shrsi q,n,k    
addze q,q

A single shrsi of k positions does a kind of signed division by 2k that coincides with both modulus and floor division. This suggests that one of these might be preferable to truncating division for computers and HLL’s to use. That is, modulus and floor division mesh with shrsi better than does truncating division, permitting a compiler to translate the expression n / 2 to an shrsi. Furthermore, shrsi followed by neg (negate) does modulus division by –2k, which is a hint that maybe modulus division is best. (This is mainly an aesthetic issue. It is of little practical significance, because division by a negative constant is no doubt extremely rare.)

10–2 Signed Remainder from Division by a Known Power of 2

If both the quotient and remainder of n ÷ 2k are wanted, it is simplest to compute the remainder r from r = nq * 2k This requires only two instructions after computing the quotient q:

shli  r,q,k
sub   r,n,r

To compute only the remainder seems to require about four or five instructions. One way to compute it is to use the four-instruction sequence above for signed division by 2k, followed by the two instructions shown immediately above to obtain the remainder. This results in two consecutive shift instructions that can be replaced by an and, giving a solution in five instructions (four if k = 1):

shrsi t,n,k-1      Form the integer
shri t,t,32-k      2**k - 1 if n < 0, else 0.
add t,n,t          Add it to n,
andi t,t,-2**k     clear rightmost k bits,     
sub r,n,t          and subtract it from n.

Another method is based on

Image

To use this, first compute Image, and then

r ← ((abs(n) & (2k1)) ⊕ t) – t

(five instructions) or, for k = 1, since (– n) & 1 = n & 1,

r ← ((n & 1) ⊕ t) – t

(four instructions). This method is not very good for k > 1 if the machine does not have absolute value (computing the remainder would then require six instructions).

Still another method is based on

Image

This leads to

Image

(five instructions for k > 1, four for k = 1).

The above methods all work for 1 ≤ k ≤ 31.

Incidentally, if shift right signed is not available, the value that is 2k – 1 for n < 0 and 0 for n ≥ 0 can be constructed from

Image

which adds only one instruction.

10–3 Signed Division and Remainder by Non-Powers of 2

The basic trick is to multiply by a sort of reciprocal of the divisor d, approximately 232/d, and then to extract the leftmost 32 bits of the product. The details, however, are more complicated, particularly for certain divisors such as 7.

Let us first consider a few specific examples. These illustrate the code that will be generated by the general method. We denote registers as follows:

n – the input integer (numerator)
M – loaded with a “magic number”
t - a temporary register
q - will contain the quotient
r - will contain the remainder

Division by 3

li      M,0x55555556    Load magic number, (2**32+2)/3.
mulhs   q,M,n           q = floor(M*n/2**32).

shri    t,n,31          Add 1 to q if
add     q,q,t           n is negative.

muli    t,q,3           Compute remainder from
sub     r,n,t           r = n - q*3.

Proof. The multiply high signed operation (mulhs) cannot overflow, as the product of two 32-bit integers can always be represented in 64 bits and mulhs gives the high-order 32 bits of the 64-bit product. This is equivalent to dividing the 64-bit product by 232 and taking the floor of the result, and this is true whether the product is positive or negative. Thus, for n ≥ 0 the above code computes

Image

Now, n < 231, because 231 – 1 is the largest representable positive number. Hence, the “error” term 2n / (3 · 232) is less than 1/3 (and is nonnegative), so by Theorem D4 (page 183) we have q = n / 3, which is the desired result (Equation (1) on page 182).

For n < 0, there is an addition of 1 to the quotient. Hence the code computes

Image

where we have used Theorem D2. Hence

Image

For –231n ≤ –1,

Image

The error term is nonpositive and greater than –1 / 3, so by Theorem D4 q = n / 3, which is the desired result (Equation (1) on page 182).

This establishes that the quotient is correct. That the remainder is correct follows easily from the fact that the remainder must satisfy

n = qd + r,

the multiplication by 3 cannot overflow (because –231 / 3 ≤ q ≤ (231 – 1) / 3), and the subtract cannot overflow because the result must be in the range –2 to +2.

The multiply immediate can be done with two add’s, or a shift and an add, if either gives an improvement in execution time.

On many present-day RISC computers, the quotient can be computed as shown above in nine or ten cycles, whereas the divide instruction might take 20 cycles or so.

Division by 5

For division by 5, we would like to use the same code as for division by 3, except with a multiplier of (232 + 4) / 5. Unfortunately, the error term is then too large; the result is off by 1 for about 1/5 of the values of n ≥ 230 in magnitude. However, we can use a multiplier of (233 + 3) / 5 and add a shift right signed instruction. The code is

li    M,0x66666667  Load magic number, (2**33+3)/5.
mulhs q,M,n         q = floor(M*n/2**32).
shrsi q,q,1
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.

muli  t,q,5         Compute remainder from
sub   r,n,t         r = n - q*5.

Proof. The mulhs produces the leftmost 32 bits of the 64-bit product, and then the code shifts this right by one position, signed (or “arithmetically”). This is equivalent to dividing the product by 233 and then taking the floor of the result. Thus, for n ≥ 0 the code computes

Image

For 0 ≤ n < 231, the error term 3n / 5 · 233 is nonnegative and less than 1/5, so by Theorem D4, q = n / 5.

For n < 0, the above code computes

Image

The error term is nonpositive and greater than –1/5, so q = n / 5.
That the remainder is correct follows as in the case of division by 3.
The multiply immediate can be done with a shift left of two and an add.

Division by 7

Dividing by 7 creates a new problem. Multipliers of (232 + 3) / 7 and (233 + 6) / 7 give error terms that are too large. A multiplier of (234 + 5) / 7 would work, but it’s too large to represent in a 32-bit signed word. We can multiply by this large number by multiplying by (234 + 5) / 7 – 232 (a negative number), and then correcting the product by inserting an add. The code is

li    M,0x92492493  Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
add   q,q,n         q = floor(M*n/2**32) + n.
shrsi q,q,2         q = floor(q/4).
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

Proof. It is important to note that the instruction “add q,q,n” above cannot overflow. This is because q and n have opposite signs, due to the multiplication by a negative number. Therefore, this “computer arithmetic” addition is the same as real number addition. Hence for n ≥ 0 the above code computes

Image

where we have used the corollary of Theorem D3.

For 0 ≤ n ≤ 231, the error term 5n/ 7 · 234 is nonnegative and less than 1/7, so q = n / 7.

For n < 0, the above code computes

Image

The error term is nonpositive and greater than –1/7, so q = n / 7

The multiply immediate can be done with a shift left of three and a subtract.

10–4 Signed Division by Divisors2

At this point you may wonder if other divisors present other problems. We see in this section that they do not; the three examples given illustrate the only cases that arise (for d ≥ 2).

Some of the proofs are a bit complicated, so to be cautious, the work is done in terms of a general word size W.

Given a word size W ≥ 3 and a divisor d, 2 ≤ d ≤ 2W – 1 we wish to find the least integer m and integer p such that

Image
Image

with 0 ≤ m< 2W and pW.

The reason we want the least integer m is that a smaller multiplier may give a smaller shift amount (possibly zero) or may yield code similar to the “divide by 5” example, rather than the “divide by 7” example. We must have m ≤ 2W– 1 so the code has no more instructions than that of the “divide by 7” example (that is, we can handle a multiplier in the range 2W –1 to 2W– 1 by means of the add that was inserted in the “divide by 7” example, but we would rather not deal with larger multipliers). We must have pW, because the generated code extracts the left half of the product mn, which is equivalent to shifting right W positions. Thus, the total right shift is W or more positions.

There is a distinction between the multiplier m and the “magic number,” denoted M. The magic number is the value used in the multiply instruction. It is given by

Image

Because (1b) must hold for n = –d,md/ 2p + 1 = –1, which implies

Image

Let nc be the largest (positive) value of n such that rem(nc, d) = d – 1. nc exists because one possibility is nc = d – 1. It can be calculated from nc = 2W – 1 / d d – 1 = 2W – 1 – rem(2W –1, d) – 1. nc is one of the highest d admissible values of n, so

Image

and, clearly

Image

Because (1a) must hold for n = nc

Image

or

Image

Combining this with (2) gives

Image

Because m is to be the least integer satisfying (4), it is the next integer greater than 2p / d; that is,

Image

Combining this with the right half of (4) and simplifying gives

Image

The Algorithm

Thus, the algorithm to find the magic number M and the shift amount s from d is to first compute nc, and then solve (6) for p by trying successively larger values. If p < W, set p = W (the theorem below shows that this value of p also satisfies (6)). When the smallest pW satisfying (6) is found, m is calculated from (5). This is the smallest possible value of m, because we found the smallest acceptable p, and from (4) clearly smaller values of p yield smaller values of m. Finally, s = pW and M is simply a reinterpretation of m as a signed integer (which is how the mulhs instruction interprets it).

Forcing p to be at least W is justified by the following:

THEOREM DC1. If(6) is true for some value of p, then it is true for all larger values of p.

Proof. Suppose (6) is true for p = p0. Multiplying (6) by 2 gives

2p0 +1 > nc(2 d – 2rem(2p0, d)).

From Theorem D5, rem(2p0 +1, d) ≥ 2rem(2p0, d) – d. Combining gives

2p00 +1 > nc((2d – (rem(2p0 + 1, d) + d)), or
2p 0 +1 > nc(d – rem(2p0 +1, d)).

Therefore, (6) is true for p = p0 + 1, and hence for all larger values.

Thus, one could solve (6) by a binary search, although a simple linear search (starting with p = W) is probably preferable, because usually d is small, and small values of d give small values of p.

Proof That the Algorithm Is Feasible

We must show that (6) always has a solution and that 0 ≤ m < 2W. (It is not necessary to show that pW, because that is forced.)

We show that (6) always has a solution by getting an upper bound on p. As a matter of general interest, we also derive a lower bound under the assumption that p is not forced to be at least W. To get these bounds on p, observe that for any positive integer x, there is a power of 2 greater than x and less than or equal to 2x. Hence, from (6),

nc(d – rem(2p, d)) < 2p ≤ 2nc((d – rem(2p, d)).

Because 0 ≤ rem(2p, d) ≤ d – 1,

Image

From (3a) and (3b), nc ≥ max(2W – 1d, d – 1). The lines f1(d) = 2W –1d and f2(d) = d – 1 cross at d = (2W –1 + 1) / 2. Hence nc ≥ (2W –1 – 1) / 2. Because nc is an integer, nc ≥ 2W –2. Because nc, d ≤ 2W – 1 – 1, (7) becomes

2w –2 + 1 ≤ 2p ≤ 2(2w – 1 – 1)2

or

Image

The lower bound p = W – 1 can occur (e.g., for W = 32, d = 3), but in that case we set p = W.

If p is not forced to equal W, then from (4) and (7),

Image

Using (3b) gives

Image

Because nc ≤ 2W –1 – 1 (3a),

2 ≤ m ≤ 2w – 1.

If p is forced to equal W, then from (4),

Image

Because 2 ≤ d ≤ 2W –1 – 1 and nc ≥ 2W –2,

Image

Hence in either case m is within limits for the code schema illustrated by the “divide by 7” example.

Proof That the Product Is Correct

We must show that if p and m are calculated from (6) and (5), then Equations (1a) and (1b) are satisfied.

Equation (5) and inequality (6) are easily seen to imply (4). (In the case that p is forced to be equal to W, (6) still holds, as shown by Theorem DC1.) In what follows, we consider separately the following five ranges of values of n:

Image

From (4), because m is an integer,

Image

Multiplying by n / 2p, for n ≥ 0 this becomes

Image

For 0 ≤ nnc, 0 ≤ (2p – 1) n / (2pdnc) < 1 / d, so by Theorem D4,

Image

Hence (1a) is satisfied in this case (0 ≤ nnc).

For n > nc, n is limited to the range

Image

because nnc + d contradicts the choice of nc as the largest value of n such that rem(nc, d) = d – 1 (alternatively, from (3a), nnc + d implies n ≥ 2W – 1). From (4), for n ≥ 0,

Image

By elementary algebra, this can be written

Image

From (9), 1 ≤ nncd – 1, so

Image

Because ncd – 1 (by (3b)) and (nc + 1) / nc has its maximum when nc has its minimum,

Image

In (10), the term (nc + 1) / d is an integer. The term (nnc)(nc + 1) / dnc is less than or equal to 1. Therefore, (10) becomes

Image

For all n in the range (9),n/ d = (nc + 1) / d. Hence, (1a) is satisfied in this case (nc + 1 ≤ nnc + d – 1).

For n < 0, from (4) we have, because m is an integer,

Image

Multiplying by n / 2p, for n < 0 this becomes

Image

or

Image

Using Theorem D2 gives

Image

Because n + 1 ≤ 0, the right inequality can be weakened, giving

Image

For –ncn ≤ –1,

Image

Hence, by Theorem D4,

Image

so that (1b) is satisfied in this case (–ncn ≤ –1).

For n < –nc, n is limited to the range

Image

(From (3a), n < – ncd implies that n < –2W – 1, which is impossible.) Performing elementary algebraic manipulation of the left comparand of (11) gives

Image

For – ncd + 1 ≤ n ≤ – nc – 1,

Image

The ratio (nc + 1) / nc is a maximum when nc is a minimum; that is, nc = d – 1.

Therefore,

Image

From (13), because (– nc – 1) / d is an integer and the quantity added to it is between 0 and –1,

Image

For n in the range – ncd + 1 ≤ n ≤ – nc – 1,

Image

Hence, mn/ 2p + 1 = n / d—that is, (1b) is satisfied.

The last case, n = – ncd, can occur only for certain values of d. From (3a), – ncd ≤ –2W – 1, so if n takes on this value, we must have n = – ncd = –2W –1, and hence nc = 2W –1d. Therefore, rem(2W – 1, d) = rem(nc + d, d) = d – 1 (that is, d divides 2W –1 + 1).

For this case (n = – ncd), (6) has the solution p = W – 1 (the smallest possible value of p), because for p = W – 1,

Image

Then from (5),

Image

Therefore,

Image

so that (1b) is satisfied.

This completes the proof that if m and p are calculated from (5) and (6), then Equations (1a) and (1b) hold for all admissible values of n.

10–5 Signed Division by Divisors ≤ –2

Because signed integer division satisfies n ÷ (–d) = –(n ÷ d), it is adequate to generate code for n ÷ |d| and follow it with an instruction to negate the quotient. (This does not give the correct result for d = –2W –1, but for this and other negative powers of 2, you can use the code in Section 10–1, “Signed Division by a Known Power of 2,” on page 205, followed by a negating instruction.) It will not do to negate the dividend, because of the possibility that it is the maximum negative number.

It is possible to avoid the negating instruction. The scheme is to compute

Image

Adding 1 if n > 0 is awkward (because one cannot simply use the sign bit of n), so the code will instead add 1 if q < 0. This is equivalent, because the multiplier m is negative (as will be seen).

The code to be generated is illustrated below for the case W = 32, d = –7.

li    M,0x6DB6DB6D  Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
sub   q,q,n         q = floor(M*n/2**32) - n.
shrsi q,q,2         q = floor(q/4).
shri  t,q,31        Add 1 to q if
add   q,q,t         q is negative (n is positive).

muli  t,q,-7        Compute remainder from
sub   r,n,t         r = n - q*(-7).

This code is the same as that for division by +7, except that it uses the negative of the multiplier for +7, and a sub rather than an add after the multiply, and the shri of 31 must use q rather than n, as discussed above. (The case of d = +7 could also use q here, but there would be less parallelism in the code.) The subtract will not overflow, because the operands have the same sign. This scheme, however, does not always work! Although the code above for W = 32, d = –7 is correct, the analogous alteration of the “divide by 3” code to produce code to divide by –3 does not give the correct result for W = 32, n = –231.

Let us look at the situation more closely.

Given a word size W ≥ 3 and a divisor d, –2W –1d ≤ –2, we wish to find the least (in absolute value) integer m and integer p such that

Image
Image

with –2Wm ≤ 0 and pW.

Proceeding similarly to the case of division by a positive divisor, let nc be the most negative value of n such that nc = kd + 1 for some integer k. nc exists, because one possibility is nc = d + 1. It can be calculated from nc = (– 2W –1 – 1) / d d + 1 = – 2W – 1 + rem(2W –1 + 1, d). nc is one of the least |d| admissible values of n, so

Image

and, clearly

Image

Because (14b) must hold for n = – d, and (14a) must hold for n = nc, we obtain, analogous to (4),

Image

Because m is to be the greatest integer satisfying (16), it is the next integer less than 2p / d—that is,

Image

Combining this with the left half of (16) and simplifying gives

Image

The proof that the algorithm suggested by (17) and (18) is feasible, and that the product is correct, is similar to that for a positive divisor, and will not be repeated. A difficulty arises in trying to prove that – 2Wm ≤ 0. To prove this, consider separately the cases in which d is the negative of a power of 2, or some other number. For d = –2k, it is easy to show that nc = –2w – 1 + 1, p = W + k – 1, and m = – 2w –1 –1 (which is within range). For d not of the form –2k, it is straightforward to alter the earlier proof.

For Which Divisors Is m (–d) ≠ – m (d)?

By m(d) we mean the multiplier corresponding to a divisor d. If m(–d) = –m(d), code for division by a negative divisor can be generated by calculating the multiplier for |d|, negating it, and then generating code similar to that of the “divide by –7” case illustrated above.

By comparing (18) with (6) and (17) with (5), it can be seen that if the value of nc for –d is the negative of that for d, then m(–d) = –m(d). Hence, m(–d) ≠ m(d) can occur only when the value of nc calculated for the negative divisor is the maximum negative number, –2W – 1. Such divisors are the negatives of the factors of 2W –1 + 1. These numbers are fairly rare, as illustrated by the factorings below (obtained from Scratchpad).

215 + 1 = 32 · 11 · 331

231 + 1 = 3 · 715,827,883

263 + 1 = 33 · 19 · 43 · 5419 · 77,158,673,929

For all these factors, m(–d) ≠ m(d). Proof sketch: For d > 0 we have nc = 2w – 1d. Because rem(2w – 1, d) = d1, (6) is satisfied by p = W – 1 and hence also by p = W. For d < 0, however, we have nc = –2W – 1 and rem(2w – 1, d) = |d| –1. Hence, (18) is not satisfied for p = W – 1 or for p = W, so p > W.

10–6 Incorporation into a Compiler

For a compiler to change division by a constant into a multiplication, it must compute the magic number M and the shift amount s, given a divisor d. The straightforward computation is to evaluate (6) or (18) for p = W, W +1, ... until it is satisfied. Then, m is calculated from (5) or (17). M is simply a reinterpretation of m as a signed integer, and s = pW.

The scheme described below handles positive and negative d with only a little extra code, and it avoids doubleword arithmetic.

Recall that nc is given by

Image

Hence, |nc| can be computed from

Image

The remainder must be evaluated using unsigned division, because of the magnitude of the arguments. We have written rem(t, |d|) rather than the equivalent rem(t, d), to emphasize that the program must deal with two positive (and unsigned) arguments.

From (6) and (18), p can be calculated from

Image

and then |m| can be calculated from (c.f. (5) and (17)):

Image

Direct evaluation of rem(2p, |d|) in (19) requires “long division” (dividing a 2W-bit dividend by a W-bit divisor, giving a W-bit quotient and remainder), and, in fact, it must be unsigned long division. There is a way to solve (19), and to do all the calculations, that avoids long division and can easily be implemented in a conventional HLL using only W-bit arithmetic. We do, however, need unsigned division and unsigned comparisons.

We can calculate rem(2p, |d|) incrementally, by initializing two variables q and r to the quotient and remainder of 2p divided by |d| with p = W – 1, and then updating q and r as p increases.

As the search progresses—that is, when p is incremented by 1—q and r are updated from (see Theorem D5(a))

q = 2*q;
r = 2*r;
if (r>= abs(d)) {
   q = q + 1;
   r = r - abs(d);}

The left half of inequality (4) and the right half of (16), together with the bounds proved for m, imply that q = 2p /|d| < 2W, so q is representable as a W-bit unsigned integer. Also, 0 ≤ r < |d|, so r is representable as a W-bit signed or unsigned integer. (Caution: The intermediate result 2r can exceed 2W –1 – 1, so r should be unsigned and the comparison above should also be unsigned.)

Next, calculate δ = |d| – r. Both terms of the subtraction are representable as W-bit unsigned integers, and the result is also (1 ≤ δ ≤ |d|), so there is no difficulty here.

To avoid the long multiplication of (19), rewrite it as

Image

The quantity 2p / |nc| is representable as a W-bit unsigned integer (similar to (7), from (19) it can be shown that 2p ≤ 2|nc| · |d| and, for d = –2W – 1, nc = –2w – 1 + 1 and p = 2W – 2, so that 2p / |nc| = 22W – 2 / (2w – 1 − 1) < 2W for W ≥ 3). Also, it is easily calculated incrementally (as p increases) in the same manner as for rem(2p, |d|). The comparison should be unsigned, for the case 2p / |nc| ≥ 2W – 1 (which can occur, for large d).

To compute m, we need not evaluate (20) directly (which would require long division). Observe that

Image

The loop closure test 2p / |nc| > δ is awkward to evaluate. The quantity 2p/ |nc| is available only in the form of a quotient q1 and a remainder r1. 2p / |nc| may or may not be an integer (it is an integer only for d = 2W – 2 + 1 and a few negative values of d). The test 2p / |nc| ≤ δ can be coded as

q1 < δ | (q1 = δ & r1 = 0).

The complete procedure for computing M and s from d is shown in Figure 10–1, coded in C, for W = 32. There are a few places where overflow can occur, but the correct result is obtained if overflow is ignored.

To use the results of this program, the compiler should generate the li and mulhs instructions, generate the add if d > 0 and M < 0, or the sub if d < 0 and M > 0, and generate the shrsi if s > 0. Then, the shri and final add must be generated.

For W = 32, handling a negative divisor can be avoided by simply returning a precomputed result for d = 3 and d = 715,827,883, and using m(– d) = – m(d) for other negative divisors. However, that program would not be significantly shorter, if at all, than the one given in Figure 10–1.


struct ms {int M;          // Magic number
          int s;};          // and shift amount.

struct ms magic(int d) {    // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;      // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;      // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;          // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;     // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;           // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;      // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;            // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {      // (Must be an unsigned
         q1 = q1 + 1;       // comparison here.)
         r1 = r1 - anc;}
      q2 = 2*q2;            // Update q2 = 2**p/|d|.
      r2 = 2*r2;            // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {       // (Must be an unsigned
         q2 = q2 + 1;       // comparison here.)
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M; // Magic number and
   mag.s = p - 32;             // shift amount to return.
   return mag;
}


FIGURE 10–1. Computing the magic number for signed division.

10–7 Miscellaneous Topics

THEOREM DC2. The least multiplier m is odd if p is not forced to equal W.

Proof. Assume that Equations (1a) and (1b) are satisfied with least (not forced) integer p, and m even. Then clearly m could be divided by 2 and p could be decreased by 1, and (1a) and (1b) would still be satisfied. This contradicts the assumption that p is minimal.

Uniqueness

The magic number for a given divisor is sometimes unique (e.g., for W = 32, d = 7), but often it is not. In fact, experimentation suggests that it is usually not unique. For example, for W = 32, d = 6, there are four magic numbers:

Image

Nevertheless, there is the following uniqueness property:

THEOREM DC3. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.

Proof. First consider the case d > 0. The difference between the upper and lower limits of inequality (4) is 2p/ dnc. We have already proved (7) that if p is minimal, then 2p/dnc ≤ 2. Therefore, there can be at most two values of m satisfying (4). Let m be the smaller of these values, given by (5); then m + 1 is the other.

Let p0 be the least value of p for which m + 1 satisfies the right half of (4) (p0 is not forced to equal W). Then

Image

This simplifies to

2p0 > nc (2 d – rem(2p0, d)).

Dividing by 2 gives

Image

Because rem(2p0, d) ≤ 2rem(2p0 – 1, d) (by Theorem D5 on page 184),

2p0 – 1 > nc (d – rem(2p0 –1, d)),

contradicting the assumption that p0 is minimal.

The proof for d < 0 is similar and will not be given.

The Divisors with the Best Programs

The program for d = 3, W = 32 is particularly short, because there is no add or shrsi after the mulhs. What other divisors have this short program?

We consider only positive divisors. We wish to find integers m and p that satisfy Equations (1a) and (1b), and for which p = W and 0 ≤ m < 2W –1. Because any integers m and p that satisfy equations (1a) and (1b) must also satisfy (4), it suffices to find those divisors d for which (4) has a solution with p = W and 0 ≤ m < 2W –1. All solutions of (4) with p = W are given by

Image

Combining this with the right half of (4) and simplifying gives

Image

The weakest restriction on rem(2W, d) is with k = 1 and nc at its minimal value of 2W –2. Hence, we must have

rem(2W, d) > d – 4;

that is, d divides 2W+ 1, 2W+ 2, or 2W+ 3.

Now let us see which of these factors actually have optimal programs.

If d divides 2W+ 1, then rem(2W, d) = d – 1. Then a solution of (6) is p = W, because the inequality becomes

2W > nc (d – (d – 1)) = nc,

which is obviously true, because nc < 2W –1. Then in the calculation of m we have

Image

which is less than 2W –1 for d ≥ 3 (d ≠ 2 because d divides 2W + 1). Hence, all the factors of 2W+ 1 have optimal programs.

Similarly, if d divides 2W + 2, then rem(2W, d) = d – 2. Again, a solution of (6) is p = W, because the inequality becomes

2W > nc (d – (d – 2)) = 2nc,

which is obviously true. Then in the calculation of m we have

Image

which exceeds 2W –1 – 1 for d = 2, but which is less than or equal to 2W – 1 – 1 for W ≥ 3, d ≥ 3 (the case W = 3 and d = 3 does not occur, because 3 is not a factor of 23 + 2 = 10). Hence all factors of 2W + 2, except for 2 and the cofactor of 2, have optimal programs. (The cofactor of 2 is (2W+ 2) / 2, which is not representable as a W-bit signed integer).

If d divides 2W + 3, the following argument shows that d does not have an optimal program. Because rem(2W, d) = d – 3, inequality (21) implies that we must have

Image

for some k = 1, 2, 3, .... The weakest restriction is with k = 1, so we must have nc < 2W / 3.

From (3a), nc ≥ 2W –1d, or d ≥ 2W – 1nc. Hence, it is necessary that

Image

Also, because 2, 3, and 4 do not divide 2W + 3, the smallest possible factor of 2W + 3 is 5. Therefore, the largest possible factor is (2W + 3) / 5. Thus, if d divides 2W + 3 and d has an optimal program, it is necessary that

Image

Taking reciprocals of this with respect to 2W + 3 shows that the cofactor of d, (2W + 3) / d, has the limits

Image

For W ≥ 5, this implies that the only possible cofactors are 5 and 6. For W < 5, it is easily verified that there are no factors of 2W + 3. Because 6 cannot be a factor of 2W + 3, the only possibility is 5. Therefore, the only possible factor of 2W + 3 that might have an optimal program is (2W + 3) / 5.

For d = (2W + 3) / 5,

Image

For W ≥ 4,

Image

so

Image

This exceeds (2W / 3), so d = (2W + 3) / 5 does not have an optimal program. Because for W < 4 there are no factors of 2W + 3, we conclude that no factors of 2W + 3 have optimal programs.

In summary, all the factors of 2W + 1 and of 2W + 2, except for 2 and (2W + 2) / 2, have optimal programs, and no other numbers do. Furthermore, the above proof shows that algorithm magic (Figure 10–1 on page 223) always produces the optimal program when it exists.

Let us consider the specific cases W = 16, 32, and 64. The relevant factorizations are shown below.

Image

The result for W = 16 is that there are 20 divisors that have optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22, 33, 66, and 99.

For W = 32, there are six such divisors: 3, 6, 641, 6,700,417, 715,827,883, and 1,431,655,766.

For W = 64, there are 126 such divisors. The ones less than 100 are 3, 6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.

10–8 Unsigned Division

Unsigned division by a power of 2 is, of course, implemented by a single shift right logical instruction, and remainder by and immediate.

It might seem that handling other divisors will be simple: Just use the results for signed division with d > 0, omitting the two instructions that add 1 if the quotient is negative. We will see, however, that some of the details are actually more complicated in the case of unsigned division.

Unsigned Division by 3

For a non-power of 2, let us first consider unsigned division by 3 on a 32-bit machine. Because the dividend n can now be as large as 232 – 1, the multiplier (232 + 2) / 3 is inadequate, because the error term 2 n / 3 · 232 (see “divide by 3” example above) can exceed 1/3. However, the multiplier (233 + 1) / 3 is adequate. The code is

li    M,0xAAAAAAAB  Load magic number, (2**33+1)/3.
mulhu q,M,n         q = floor(M*n/2**32).
shri  q,q,1

muli  t,q,3         Compute remainder from
sub   r,n,t         r = n - q*3.

An instruction that gives the high-order 32 bits of a 64-bit unsigned product is required, which we show above as mulhu.

To see that the code is correct, observe that it computes

Image

For 0 ≤ n < 232, 0 ≤ n / (3 · 233) < 1 / 3, so by Theorem D4, q = n/ 3.

In computing the remainder, the multiply immediate can overflow if we regard the operands as signed integers, but it does not overflow if we regard them and the result as unsigned. Also, the subtract cannot overflow, because the result is in the range 0 to 2, so the remainder is correct.

Unsigned Division by 7

For unsigned division by 7 on a 32-bit machine, the multipliers (232 + 3) / 7, (233 + 6) / 7, and (234 + 5) / 7 are all inadequate, because they give too large an error term. The multiplier (235 + 3) / 7 is acceptable, but it’s too large to represent in a 32-bit unsigned word. We can multiply by this large number by multiplying by (235 + 3) / 7 – 232 and then correcting the product by inserting an add. The code is

li    M,0x24924925  Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n         q = floor(M*n/2**32).
add   q,q,n         Can overflow (sets carry).
shrxi q,q,3         Shift right with carry bit.

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

Here we have a problem: The add can overflow. To allow for this, we have invented the new instruction shift right extended immediate (shrxi), which treats the carry from the add and the 32 bits of register q as a single 33-bit quantity, and shifts it right with 0-fill. On the Motorola 68000 family, this can be done with two instructions: rotate with extend right one position, followed by a logical right shift of two (roxr actually uses the X bit, but the add sets the X bit the same as the carry bit). On most machines, it will take more. For example, on PowerPC it takes three instructions: clear rightmost three bits of q, add carry to q, and rotate right three positions.

With shrxi implemented somehow, the code above computes

Image

For 0 ≤ n < 232, 0 ≤ 3 n /(7 · 235) < 1/7, so by Theorem D4, q = n / 7 .

Granlund and Montgomery [GM] have a clever scheme for avoiding the shrxi instruction. It requires the same number of instructions as the above three-instruction sequence for shrxi, but it employs only elementary instructions that almost any machine would have, and it does not cause overflow at all. It uses the identity

Image

Applying this to our problem, with q = Mn / 232 where 0 ≤ M <232, the subtraction will not overflow, because

Image

so that, clearly, 0 ≤ n – q < 232. Also, the addition will not overflow, because

Image

and 0 ≤ n,q < 232.

Using this idea gives the following code for unsigned division by 7:

li    M,0x24924925  Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n         q = floor(M*n/2**32).
sub   t,n,q         t = n - q.
shri  t,t,1         t = (n - q)/2.
add   t,t,q         t = (n - q)/2 + q = (n + q)/2.
shri  q,t,2         q = (n+Mn/2**32)/8 = floor(n/7).

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

For this to work, the shift amount for the hypothetical shrxi instruction must be greater than 0. It can be shown that if d > 1 and the multiplier m ≥ 232 (so that the shrxi instruction is needed), then the shift amount is greater than 0.

10–9 Unsigned Division by Divisors1

Given a word size W ≥ 1 and a divisor d, 1 ≤ d < 2W, we wish to find the least integer m and integer p such that

Image

with 0 ≤ m < 2W+1 and pW.

In the unsigned case, the magic number M is given by

Image

Because (22) must hold for n = d, md/ 2p = 1, or

Image

As in the signed case, let nc be the largest value of n such that rem(nc, d) = d – 1. It can be calculated from nc = 2W / d d – 1 = 2W – rem(2W, d) – 1. Then

Image

and

Image

These imply that nc ≥ 2W – 1.

Because (22) must hold for n = nc

Image

or

Image

Combining this with (23) gives

Image

Because m is to be the least integer satisfying (25), it is the next integer greater than or equal to 2p / d—that is,

Image

Combining this with the right half of (25) and simplifying gives

Image

The Algorithm (Unsigned)

Thus, the algorithm is to find by trial and error the least pW satisfying (27). Then, m is calculated from (26). This is the smallest possible value of m satisfying (22) with pW. As in the signed case, if (27) is true for some value of p, then it is true for all larger values of p. The proof is essentially the same as that of Theorem DC1, except Theorem D5(b) is used instead of Theorem D5(a).

Proof That the Algorithm Is Feasible (Unsigned)

We must show that (27) always has a solution and that 0 ≤ m < 2W +1.

Because for any nonnegative integer x there is a power of 2 greater than x and less than or equal to 2 x + 1, from (27),

nc(d – 1 – rem(2p – 1, d)) < 2p ≤ 2nc(d – 1 – rem(2p – 1, d)) + 1.

Because 0 ≤ rem(2p– 1, d) ≤ d – 1,

Image

Because nc, d ≤ 2W – 1, this becomes

1 ≤ 2p ≤ 2(2W – 1)(2W – 2) + 1,

or

Image

Thus, (27) always has a solution.

If p is not forced to equal W, then from (25) and (28),

Image

If p is forced to equal W, then from (25),

Image

Because 1 ≤ d ≤ 2W– 1 and nc ≥ 2W – 1,

Image

In either case m is within limits for the code schema illustrated by the “unsigned divide by 7” example.

Proof That the Product Is Correct (Unsigned)

We must show that if p and m are calculated from (27) and (26), then (22) is satisfied.

Equation (26) and inequality (27) are easily seen to imply (25). Inequality (25) is nearly the same as (4), and the remainder of the proof is nearly identical to that for signed division with n ≥ 0.

10–10 Incorporation into a Compiler (Unsigned)

There is a difficulty in implementing an algorithm based on direct evaluation of the expressions used in this proof. Although p ≤ 2 W, which is proved above, the case p = 2 W can occur (e.g., for d = 2W – 2 with W ≥ 4). When p = 2 W, it is difficult to calculate m, because the dividend in (26) does not fit in a 2W-bit word.

However, it can be implemented by the “incremental division and remainder” technique of algorithm magic. The algorithm is given in Figure 10–2 for W = 32. It passes back an indicator a, which tells whether or not to generate an add instruction. (In the case of signed division, the caller recognizes this by M and d having opposite signs.)

Some key points in understanding this algorithm are as follows:

• Unsigned overflow can occur at several places and should be ignored.

nc = 2w – rem (2w,d) – 1 = (2W – 1) – rem(2Wd, d).

• The quotient and remainder of dividing 2p by nc cannot be updated in the same way as is done in algorithm magic, because here the quantity 2*r1 can overflow. Hence, the algorithm has the test “if (r1 > = nc – r l),” whereas “if (2*rl >= nc)” would be more natural. A similar remark applies to computing the quotient and remainder of 2P –1 divided by d.

• 0 ≤ δ ≤ d – 1, so δ is representable as a 32-bit unsigned integer.


struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu(unsigned d) {
                            // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (q1 >= 0x80000000) gt = 1; // Means q1 > delta.
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;             // Update q1.
         r1 = 2*r1 - nc;}           // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;             // Update q2.
         r2 = 2*r2 + 1 - d;}        // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}


FIGURE 10–2. Computing the magic number for unsigned division.

m = (2p +d – 1 – rem(2p – 1, d)) / d = (2p – 1) / d +1 = q2 + 1.

• The subtraction of 2W when the magic number M exceeds 2W – 1 is not explicit in the program; it occurs if the computation of q2 overflows.

• The “add” indicator, magu.a, cannot be set by a straightforward comparison of M to 232, or of q2 to 232 – 1, because of overflow. Instead, the program tests q2 before overflow can occur. If q2 ever gets as large as 232 – 1, so that M will be greater than or equal to 232, then magu.a is set equal to 1. If q2 stays below 232 – 1, then magu.a is left at its initial value of 0.

• Inequality (27) is equivalent to 2p/ nc > δ.

• The loop test needs the condition p < 64, because without it, overflow of q1 would cause the program to loop too many times, giving incorrect results.

To use the results of this program, the compiler should generate the li and mulhu instructions and, if the “add” indicator a = 0, generate the shri of s (if s > 0), as illustrated by the example of “Unsigned Division by 3,” on page 227. If a = 1 and the machine has the shrxi instruction, the compiler should generate the add and shrxi of s as illustrated by the example of “Unsigned Division by 7,” on page 228. If a = 1 and the machine does not have the shrxi instruction, use the example on page 229: generate the sub, the shri of 1, the add, and finally the shri of s – 1 (if s – 1 > 0; s will not be 0 at this point except in the trivial case of division by 1, which we assume the compiler deletes).

10–11 Miscellaneous Topics (Unsigned)

THEOREM DC2U. The least multiplier m is odd if p is not forced to equal W.

THEOREM DC3U. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.

The proofs of these theorems follow very closely the corresponding proofs for signed division.

The Divisors with the Best Programs (Unsigned)

For unsigned division, to find the divisors (if any) with optimal programs of two instructions to obtain the quotient (li, mulhu), we can do an analysis similar to that of the signed case (see “The Divisors with the Best Programs” on page 225). The result is that such divisors are the factors of 2W or 2W + 1, except for d = 1. For the common word sizes, this leaves very few nontrivial divisors that have optimal programs for unsigned division. For W = 16, there are none. For W = 32, there are only two: 641 and 6,700,417. For W = 64, again there are only two: 274,177 and 67,280,421,310,721.

The case d = 2k, k = 1, 2, ..., deserves special mention. In this case, algorithm magicu produces p = W (forced), m = 232 – k. This is the minimal value of m, but it is not the minimal value of M. Better code results if p = W + k is used, if sufficient simplifications are done. Then, m = 2W, M = 0, a = 1, and s = k. The generated code involves a multiplication by 0 and can be simplified to a single shift right k instruction. As a practical matter, divisors that are a power of 2 would probably be special-cased without using magicu. (This phenomenon does not occur for signed division, because for signed division m cannot be a power of 2. Proof: For d > 0, inequality (4) combined with (3b) implies that d – 1 < 2p / m < d. Therefore, 2p / m cannot be an integer. For d < 0, the result follows similarly from (16) combined with (15b).)

For unsigned division, the code for the case m ≥ 2W is considerably worse than the code for the case m < 2W if the machine does not have shrxi. It is of interest to have some idea of how often the large multipliers arise. For W = 32, among the integers less than or equal to 100, there are 31 “bad” divisors: 1, 7, 14, 19, 21, 27, 28, 31, 35, 37, 38, 39, 42, 45, 53, 54, 55, 56, 57, 62, 63, 70, 73, 74, 76, 78, 84, 90, 91, 95, and 97.

Using Signed in Place of Unsigned Multiply, and the Reverse

If your machine does not have mulhu, but it does have mulhs (or signed long multiplication), the trick given in “High-Order Product Signed from/to Unsigned,” on page 174, might make our method of doing unsigned division by a constant still useful.

That section gives a seven-instruction sequence for getting mulhu from mulhs. However, for this application it simplifies, because the magic number M is known. Thus, the compiler can test the most significant bit of the magic number, and generate code such as the following for the operation “mulhu q,M,n.” Here t denotes a temporary register.

           M31 = 0           M31 = 1
         mulhs q,M,n       mulhs q,M,n
         shrsi t,n,31      shrsi t,n,31
         and   t,t,M       and   t,t,M
         add   q,q,t       add   t,t,n
                           add   q,q,t

Accounting for the other instructions used with mulhu, this uses a total of six to eight instructions to obtain the quotient of unsigned division by a constant on a machine that does not have unsigned multiply.

This trick can be inverted, to get mulhs in terms of mulhu. The code is the same as that above, except the mulhs is changed to mulhu and the final add in each column is changed to sub.

A Simpler Algorithm (Unsigned)

Dropping the requirement that the magic number be minimal yields a simpler algorithm. In place of (27) we can use

Image

and then use (26) to compute m, as before.

It should be clear that this algorithm is formally correct (that is, that the value of m computed does satisfy Equation (22)), because its only difference from the previous algorithm is that it computes a value of p that, for some values of d, is unnecessarily large. It can be proved that the value of m computed from (30) and (26) is less than 2W +1. We omit the proof and simply give the algorithm (Figure 10–3).


struct mu {unsigned M;      // Magic number,
          int a;            // "add" indicator,
          int s;};          // and shift amount.

struct mu magicu2(unsigned d) {
                            // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;
   magu.a = 0;              // Initialize "add" indicator.
   p = 31;                  // Initialize p.
   q = 0x7FFFFFFF/d;        // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;    // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);
   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}


FIGURE 10–3. Simplified algorithm for computing the magic number, unsigned division.

Alverson [Alv] gives a much simpler algorithm, discussed in the next section, but it gives somewhat large values for m. The point of algorithm magicu2 is that it nearly always gives the minimal value for m when d ≤ 2W –1. For W = 32, the smallest divisor for which magicu2 does not give the minimal multiplier is d = 102,807, for which magicu calculates m = 2,737,896,999 and magicu2 calculates m = 5,475,793,997.

There is an analog of magicu2 for signed division by positive divisors, but it does not work out very well for signed division by arbitrary divisors.

10–12 Applicability to Modulus and Floor Division

It might seem that turning modulus or floor division by a constant into multiplication would be simpler, in that the “add 1 if the dividend is negative” step could be omitted. This is not the case. The methods given above do not apply in any obvious way to modulus and floor division. Perhaps something could be worked out; it might involve altering the multiplier m slightly, depending upon the sign of the dividend.

10–13 Similar Methods

Rather than coding algorithm magic, we can provide a table that gives the magic numbers and shift amounts for a few small divisors. Divisors equal to the tabulated ones multiplied by a power of 2 are easily handled as follows:

1. Count the number of trailing 0’s in d, and let this be denoted by k.

2. Use as the lookup argument d / 2k (shift right k).

3. Use the magic number found in the table.

4. Use the shift amount found in the table, increased by k.

Thus, if the table contains the divisors 3, 5, 25, and so on, divisors of 6, 10, 100, and so forth can be handled.

This procedure usually gives the smallest magic number, but not always. The smallest positive divisor for which it fails in this respect for W = 32 is d = 334,972, for which it computes m = 3,361,176,179 and s = 18. However, the minimal magic number for d = 334,972 is m = 840,294,045, with s = 16. The procedure also fails to give the minimal magic number for d = –6. In both these cases, output code quality is affected.

Alverson [Alv] is the first known to the author to state that the method described here works with complete accuracy for all divisors. Using our notation, his method for unsigned integer division by d is to set the shift amount p = W + log2 d , and the multiplier m = 2p / d and then do the division by n ÷ d = mn / 2p (that is, multiply and shift right). He proves that the multiplier m is less than, 2W+1 and that the method gets the exact quotient for all n expressible in W bits.

Alverson’s method is a simpler variation of ours in that it doesn’t require trial and error to determine p, and is therefore more suitable for building in hardware, which is his primary interest. His multiplier m is always greater than or equal to 2W, and hence for the software application always gives the code illustrated by the “unsigned divide by 7” example (that is, always has the add and shrxi, or the alternative four instructions). Because most small divisors can be handled with a multiplier less than 2W, it seems worthwhile to look for these cases.

For signed division, Alverson suggests finding the multiplier for |d| and a word length of W – 1 (then 2W –1m < 2W), multiplying the dividend by it, and negating the result if the operands have opposite signs. (The multiplier must be such that it gives the correct result when the dividend is 2W –1, the absolute value of the maximum negative number.) It seems possible that this suggestion might give better code than what has been given here in the case that the multiplier m ≥ 2W. Applying it to signed division by 7 gives the following code, where we have used the relation –x = Image +1 to avoid a branch:

abs   an,n
li    M,0x92492493  Magic number, (2**34+5)/7.
mulhu q,M,an        q = floor(M*an/2**32).
shri  q,q,2
shrsi t,n,31        These three instructions
xor   q,q,t         negate q if n is
sub   q,q,t         negative.

This is not quite as good as the code we gave for signed division by 7 (six versus seven instructions), but it would be useful on a machine that has abs and mulhu, but not mulhs.

The next section gives some representative magic numbers.

10–14 Sample Magic Numbers

TABLE10–1. SOME MAGIC NUMBERS FOR W = 32

Image

TABLE 10–2. SOME MAGIC NUMBERS FOR W = 64

Image

10–15 Simple Code in Python

Computing a magic number is greatly simplified if one is not limited to doing the calculations in the same word size as that of the environment in which the magic number will be used. For the unsigned case, for example, in Python it is straightforward to compute nc and then evaluate Equations (27) and (26), as described in Section 10–9. Figure 10–4 shows such a function.


def magicgu(nmax, d):
   nc = ((nmax + 1)//d*d - 1
   nbits = len(bin(nmax)) - 2
   for p in range(0, 2*nbits + 1):
      if 2**p > nc*(d - 1 - (2**p - 1)%d):
         m = (2**p + d - 1 - (2**p - 1)%d)//d
         return (m, p)
   print "Can't find p, something is wrong."
   sys.exit(1)


FIGURE 10–4. Python code for computing the magic number for unsigned division.

The function is given the maximum value of the dividend nmax and the divisor d. It returns a pair of integers: the magic number m and a shift amount p. To divide a dividend x by d, one multiplies x by m and then shifts the (full length) product right p bits.

This program is more general than the others in this chapter in two ways: (1) one specifies the maximum value of the dividend (nmax), rather than the number of bits required for the dividend, and (2) the program can be used for arbitrarily large dividends and divisors (“bignums”). The advantage of specifying the maximum value of the dividend is that one sometimes gets a smaller magic number than would be obtained if the next power of two less 1 were used for the maximum value. For example, suppose the maximum value of the dividend is 90, and the divisor is 7. Then function magicgu returns (37, 8), meaning that the magic number is 37 (a 6-bit number) and the shift amount is 8. But if we asked for a magic number that can handle divisors up to 127, then the result is (147, 10), and 147 is an 8-bit number.

10–16 Exact Division by Constants

By “exact division,” we mean division in which it is known beforehand, somehow, that the remainder is 0. Although this situation is not common, it does arise, for example, when subtracting two pointers in the C language. In C, the result of pq, where p and q are pointers, is well defined and portable only if p and q point to objects in the same array [H&S, sec. 7.6.2]. If the array element size is s, the object code for the difference pq computes (pq) / s.

The material in this section was motivated by [GM, sec. 9].

The method to be given applies to both signed and unsigned exact division, and is based on the following theorem.

THEOREM MI. If a and m are relatively prime integers, then there exists an integer ā, 1 ≤ ā < m, such that

aā ≡ 1 (mod m).

That is, ā is a multiplicative inverse of a, modulo m. There are several ways to prove this theorem; three proofs are given in [NZM, p. 52]. The proof below requires only a very basic familiarity with congruences.

Proof. We will prove something a little more general than the theorem. If a and m are relatively prime (therefore nonzero), then as x ranges over all m distinct values modulo m, ax takes on all m distinct values modulo m. For example, if a = 3 and m = 8, then as x ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21 or, reduced modulo 8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all values from 0 to 7 are present in the last sequence.

To see this in general, assume that it is not true. Then there exist distinct integers that map to the same value when multiplied by a; that is, there exist x and y, with x Image y(mod m), such that

axay (mod m).

Then there exists an integer k such that

axay = km, or

a (xy) = km.

Because a has no factor in common with m, it must be that xy is a multiple of m; that is,

xy (mod m).

This contradicts the hypothesis.

Now, because ax takes on all m distinct values modulo m, as x ranges over the m values, it must take on the value 1 for some x.

The proof shows that there is only one value (modulo m) of x such that ax ≡ 1 (mod m)—that is, the multiplicative inverse is unique, apart from additive multiples of m. It also shows that there is a unique (modulo m) integer x such that axb (mod m), where b is any integer.

As an example, consider the case m = 16. Then Image, because 3 · 11 = 33 ≡ 1 (mod 16). We could just as well take Image, because 3 · (–5) = –15 ≡ 1 (mod 16). Similarly Image, because (–3) · 5 = –15 ≡ 1 (mod 16).

These observations are important because they show that the concepts apply to both signed and unsigned numbers. If we are working in the domain of unsigned integers on a 4-bit machine, we take Image. In the domain of signed integers, we take Image. But 11 and –5 have the same representation in two’s-complement (because they differ by 16), so the same computer word contents can serve in both domains as the multiplicative inverse.

The theorem applies directly to the problem of division (signed and unsigned) by an odd integer d on a W-bit computer. Because any odd integer is relatively prime to 2W, the theorem says that if d is odd, there exists an integer Image(unique in the range 0 to 2W– 1 or in the range –2W – 1 to 2W –1 – 1) such that

Image

Hence, for any integer n that is a multiple of d,

Image

In other words, n/d can be calculated by multiplying n by Image and retaining only the rightmost W bits of the product.

If the divisor d is even, let d = do · 2k, where do is odd and k ≥ 1. Then, simply shift n right k positions (shifting out 0’s), and then multiply by Image (the shift could be done after the multiplication as well).

Below is the code for division of n by 7, where n is a multiple of 7. This code gives the correct result whether it is considered to be signed or unsigned division.

li    M,0xB6DB6DB7  Mult. inverse, (5*2**32 + 1)/7.
mul   q,M,n         q = n/7.

Computing the Multiplicative Inverse by the Euclidean Algorithm

How can we compute the multiplicative inverse? The standard method is by means of the “extended Euclidean algorithm.” This is briefly discussed below as it applies to our problem, and the interested reader is referred to [NZM, p. 13] and to [Knu2, 4.5.2] for a more complete discussion.

Given an odd divisor d, we wish to solve for x

dx ≡ 1(mod/m),

where, in our application, m = 2W and W is the word size of the machine. This will be accomplished if we can solve for integers x and y (positive, negative, or 0) the equation

dx + my = 1.

Toward this end, first make d positive by adding a sufficient number of multiples of m to it. (d and d + km have the same multiplicative inverse.) Second, write the following equations (in which d, m > 0):

Image

If d = 1, we are done, because (ii) shows that x = 1. Otherwise, compute

Image

Third, multiply Equation (ii) by q and subtract it from (i). This gives

d (– 1 – q) + m(1) = mdqd = rem(md, d).

This equation holds because we have simply multiplied one equation by a constant and subtracted it from another. If rem(md, d) = 1, we are done; this last equation is the solution and x = – 1 – q.

Repeat this process on the last two equations, obtaining a fourth, and continue until the right-hand side of the equation is 1. The multiplier of d, reduced modulo m, is then the desired inverse of d.

Incidentally, if md < d, so that the first quotient is 0, then the third row will be a copy of the first, so that the second quotient will be nonzero. Furthermore, most texts start with the first row being

d(0) + m(1) = m,

but in our application m = 2W is not representable in the machine.

The process is best illustrated by an example: Let m = 256 and d = 7. Then the calculation proceeds as follows. To get the third row, note that q = 249 / 7 = 35.

 7(-1) + 256( 1) = 249
 7( 1) + 256( 0) = 7
7(-36) + 256( 1) = 4
7( 37) + 256(-1) = 3
7(-73) + 256( 2) = 1

Thus, the multiplicative inverse of 7, modulo 256, is –73 or, expressed in the range 0 to 255, is 183. Check: 7 · 183 = 1281 ≡ 1 (mod 256).

From the third row on, the integers in the right-hand column are all remainders of dividing the number above it into the number two rows above it, so they form a sequence of strictly decreasing nonnegative integers. Therefore, the sequence must end in 0 (as the above would if carried one more step). Furthermore, the value just before the 0 must be 1, for the following reason. Suppose the sequence ends in b followed by 0, with b ≠ 1. Then, the integer preceding the b must be a multiple of b, let’s say k1b, for the next remainder to be 0. The integer preceding k1 b must be of the form k1k2 b+ b, for the next remainder to be b. Continuing up the sequence, every number must be a multiple of b, including the first two (in the positions of the 249 and the 7 in the above example). This is impossible, because the first two integers are md and d, which are relatively prime.

This constitutes an informal proof that the above process terminates, with a value of 1 in the right-hand column, and hence it finds the multiplicative inverse of d.

To carry this out on a computer, first note that if d < 0, we should add 2W to it. With two’s-complement arithmetic it is not necessary to actually do anything here; simply interpret d as an unsigned number, regardless of how the application interprets it.

The computation of q must use unsigned division.

Observe that the calculations can be done modulo m, because this does not change the right-hand column (these values are in the range 0 to m – 1 anyway). This is important, because it enables the calculations to be done in “single precision,” using the computer’s modulo-2W unsigned arithmetic.

Most of the quantities in the table need not be represented. The column of multiples of 256 need not be represented, because in solving dx + my = 1, we do not need the value of y. There is no need to represent d in the first column. Reduced to its bare essentials, then, the calculation of the above example is carried out as follows:

255  249
  1    7
220    4
 37    3
183    1

A C program for performing this computation is shown in Figure 10–5.


unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;      v1 = -d;
   x2 = 1;              v2 = d;
   while (v2 > 1) {
      q = v1/v2;
      x3 = x1 - q*x2;   v3 = v1 - q*v2;
      x1 = x2;           v1 = v2;
      x2 = x3;           v2 = v3;
   }
   return x2;
}


FIGURE 10–5. Multiplicative inverse modulo 232 by the Euclidean algorithm.

The reason the loop continuation condition is (v2 > 1), rather than the more natural (v2 != 1), is that if the latter condition were used, the loop would never terminate if the program were invoked with an even argument. It is best that programs not loop forever even if misused. (If the argument d is even, v2 never takes on the value 1, but it does become 0.)

What does the program compute if given an even argument? As written, it computes a number x such that dx ≡ 0 (mod 232), which is probably not useful. However, with the minor modification of changing the loop continuation condition to (v2 != 0) and returning x1 rather than x2, it computes a number x such that dxg (mod 232), where g is the greatest common divisor of d and 232—that is, the greatest power of 2 that divides d. The modified program still computes the multiplicative inverse of d for d odd, but it requires one more iteration than the unmodified program.

As for the number of iterations (divisions) required by the above program, for d odd and less than 20, it requires a maximum of 3 and an average of 1.7. For d in the neighborhood of 1000, it requires a maximum of 11 and an average of about 6.

Computing the Multiplicative Inverse by Newton’s Method

It is well known that, over the real numbers, 1 / d, for d ≠ 0, can be calculated to ever-increasing accuracy by iteratively evaluating

Image

provided the initial estimate x0 is sufficiently close to 1/d. The number of digits of accuracy approximately doubles with each iteration.

It is not so well known that this same formula can be used to find the multiplicative inverse modulo any power of 2!. For example, to find the multiplicative inverse of 3, modulo 256, start with x0 = 1 (any odd number will do). Then,

Image

The iteration has reached a fixed point modulo 256, so –85, or 171, is the multiplicative inverse of 3 (modulo 256). All calculations can be done modulo 256.

Why does this work? Because if xn satisfies

dxn ≡ 1 (mod m)

and if xn + 1 is defined by (31), then

dxn +1 ≡ 1 (mod m2).

To see this, let dxn = 1 + km. Then

Image

In our application, m is a power of 2, say 2N. In this case, if

Image

In a sense, if xn is regarded as a sort of approximation to Image, then each iteration of (31) doubles the number of bits of “accuracy” of the approximation.

It happens that modulo 8, the multiplicative inverse of any (odd) number d is d itself. Thus, taking x0 = d is a reasonable and simple initial guess at Image. Then, (31) will give values of x1, x2, ..., such that

Image

Thus, four iterations suffice to find the multiplicative inverse modulo 232 (if x ≡ 1 (mod 248), then x ≡ 1 (mod 2n) for n ≤ 48). This leads to the C program in Figure 10–6, in which all computations are done modulo 232.

For about half the values of d, this program takes 4.5 iterations, or nine multiplications. For the other half (those for which the initial value of xn is “correct to 4 bits”—that is, d2 ≡ 1 (mod 16)), it takes seven or fewer, usually seven, multiplications. Thus, it takes about eight multiplications on average.


unsigned mulinv(unsigned d) {       // d must be odd.
   unsigned xn, t;

   xn = d;
loop: t = d*xn;
      if (t == 1) return xn;
      xn = xn*(2 - t);
      goto loop;
}


FIGURE 10–6. Multiplicative inverse modulo 232 by Newton’s method.

A variation is to simply execute the loop four times, regardless of d, perhaps “strung out” to eliminate the loop control (eight multiplications). Another variation is to somehow make the initial estimate x0 “correct to 4 bits” (that is, find x0 that satisfies dx0 ≡ 1 (mod 16)). Then, only three loop iterations are required. Some ways to set the initial estimate are

Image

Here, the multiplication by 2 is a left shift, and the computations are done modulo 232 (ignoring overflow). Because the second formula uses a multiplication, it saves only one.

This concern about execution time is, of course, totally unimportant for the compiler application. For that application, the routine would be so seldom used that it should be coded for minimum space. But there may be applications in which it is desirable to compute the multiplicative inverse quickly.

The “Newton method” described here applies only when (1) the modulus is an integral power of some number a, and (2) the multiplicative inverse of d modulo a is known. It works particularly well for a = 2, because then the multiplicative inverse of any (odd) number d modulo 2 is known immediately—it is 1.

Sample Multiplicative Inverses

We conclude this section with a listing of some multiplicative inverses in Table 10–3.

TABLE 10–3. SAMPLE MULTIPLICATIVE INVERSES

Image

You may notice that in several cases (d = 3, 5, 9, 11), the multiplicative inverse of d is the same as the magic number for unsigned division by d (see Section 10–14, “Sample Magic Numbers,” on page 238). This is more or less a coincidence. It happens that for these numbers, the magic number M is equal to the multiplier m, and these are of the form (2p + 1) / d, with p ≥ 32. In this case, notice that

Image

so that MImage(mod 232).

10–17 Test for Zero Remainder after Division by a Constant

The multiplicative inverse of a divisor d can be used to test for a zero remainder after division by d[GM].

Unsigned

First, consider unsigned division with the divisor d odd. Denote by Image the multiplicative inverse of d. Then, because Image, where W is the machine’s word size in bits, Image is also odd. Thus, Image is relatively prime to 2W, and as shown in the proof of theorem MI in the preceding section, as n ranges over all 2W distinct values modulo 2W, Image takes on all 2W distinct values modulo 2W.

It was shown in the preceding section that if n is a multiple of d,

Image

That is, for n = 0, d, 2d, ..., (2W – 1) / d d, Image ≡ 0, 1, 2, ..., (2W – 1) / d (mod 2W). Therefore, for n not a multiple of d, the value of Image, reduced modulo 2W to the range 0 to 2W – 1, must exceed (2W – 1) / d .

This can be used to test for a zero remainder. For example, to test if an integer n is a multiple of 25, multiply n by Image and compare the rightmost W bits to (2W– 1) / 25 . On our basic RISC:

li      M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n          q = right half of M*n.
li      c,0x0A3D70A3  c = floor((2**32-1)/25).
cmpleu t,q,c          Compare q and c, and branch
bt     t,is_mult      if n is a multiple of 25.

To extend this to even divisors, let d = do · 2k, where do is odd and k ≥ 1. Then, because an integer is divisible by d if and only if it is divisible by do and by 2k, and because n and Image have the same number of trailing zeros (Image is odd), the test that n is a multiple of d is

Image

where the mod function is understood to reduce Image to the interval [0, 2W –1].

Direct implementation of this requires two tests and conditional branches, but it can be reduced to one compare-branch quite efficiently if the machine has the rotate-shift instruction. This follows from the following theorem, in which Image denotes the computer word a rotated right k positions (0 ≤ k ≤ 32).

THEOREM ZRU. Image and x ends in k 0-bits if and only if Image

Proof. (Assume a 32-bit machine.) Suppose Imageand x ends in k 0-bits. Then, because Image, Image But Image Therefore, Image If x does not end in k 0-bits, then Image does not begin with k 0-bits, whereas a /2k does, so Image Lastly, if Image and x ends in k 0-bits, then the integer formed from the first 32 – k bits of x must exceed that formed from the first 32 - k bits of a, so that Image

Using this theorem, the test that n is a multiple of d, where n and d > 1 are unsigned integers and d = do · 2k with do odd, is

Image

Here we used (2W– 1) / do / 2k = (2W– 1) / (do · 2k) = (2W– 1) / d.

As an example, the following code tests an unsigned integer n to see if it is a multiple of 100:

li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
shrri  q,q,2         Rotate right two positions.
li     c,0x028F5C28  c = floor((2**32-1)/100).
cmpleu t,q,c         Compare q and c, and branch
bt     t,is_mult     if n is a multiple of 100.

Signed, Divisor ≥ 2

For signed division, it was shown in the preceding section that if n is a multiple of d and d is odd, then

Image

Thus, for n = –2W – 1 /d · d, ...,–d,0, d, ...,(2W – 1 – 1) / d · d, we have Image– 2W – 1 / d,...,–1,0, 1, ..., (2W –1 – 1) / d (mod 2W). Furthermore, because d is relatively prime to 2W, as n ranges over all 2W distinct values modulo 2W, Image takes on all 2W distinct values modulo 2W. Therefore, n is a multiple of d if and only if

Image

where the mod function is understood to reduce Image to the interval [–2W – 1 2W – 1 –1]

This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1, it does not divide 2W–1. Therefore,

– 2W – 1 / d = (–2W – 1 + 1) d = - (2W–1 –1)/d.

Thus, for signed numbers, the test that n is a multiple of d, where d = do · 2k and do is odd, is

Set q = mod(Image, 2W);

- (2W –1 – 1) / do q (2W – 1 – 1)/do and q ends in k or more 0-bits.

On the surface, this would seem to require three tests and branches. However, as in the unsigned case, it can be reduced to one compare-branch by use of the following theorem:

THEOREM ZRS. If a ≥ 0, the following assertions are equivalent:

Image

where a′ is a with its rightmost k bits set to 0 (that is, a ′ = a & –2k).

Proof. (Assume a 32-bit machine). To see that (1) is equivalent to (2), clearly the assertion – axa is equivalent to abs(x) ≤ a. Then, Theorem ZRU applies, because both sides of this inequality are nonnegative.

To see that (1) is equivalent to (3), note that assertion (1) is equivalent to itself with a replaced with a′. Then, by the theorem on bounds checking on page 68, this in turn is equivalent to

Image

Because x + a′ ends in k 0-bits if and only if x does, Theorem ZRU applies, giving the result.

Using part (3) of this theorem, the test that n is a multiple of d, where n and d ≥ 2 are signed integers and d = do · 2k with do odd, is

Image

(a′ can be computed at compile time, because d is a constant.)

As an example, the following code tests a signed integer n to see if it is a multiple of 100. Notice that the constant 2a′ / 2k can always be derived from the constant a′ by a shift of k – 1 bits, saving an instruction or a load from memory to develop the comparand.

li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
li     c,0x051EB850  c = floor((2**31 – 1)/25) & –4.
add    q,q,c         Add c.
shrri  q,q,2         Rotate right two positions.
shri   c,c,1         Compute const. for comparison.
cmpleu t,q,c         Compare q and c, and
bt     t,is_mult     branch if n is a mult. of 100.

10–18 Methods Not Using Multiply High

In this section we consider some methods for dividing by constants that do not use the multiply high instruction, or a multiplication instruction that gives a double-word result. We show how to change division by a constant into a sequence of shift and add instructions, or shift, add, and multiply for more compact code.

Unsigned Division

For these methods, unsigned division is simpler than signed division, so we deal with unsigned division first. One method is to use the techniques given that use the multiply high instruction, but use the code shown in Figure 8–2 on page 174 to do the multiply high operation. Figure 10–7 shows how this works out for the case of (unsigned) division by 3. This is a combination of the code on page 228 and Figure 8–2 with “int” changed to “unsigned.” The code is 15 instructions, including four multiplications. The multiplications are by large constants and would take quite a few instructions if converted to shift’s and add’s. Very similar code can be devised for the signed case. This method is not particularly good and won’t be discussed further.

Another method [GLS1] is to compute in advance the reciprocal of the divisor, and multiply the dividend by that with a series of shift right and add instructions. This gives an approximation to the quotient. It is merely an approximation, because the reciprocal of the divisor (which we assume is not an exact power of two) is not expressed exactly in 32 bits, and also because each shift right discards bits of the dividend. Next, the remainder with respect to the approximate quotient is computed, and that is divided by the divisor to form a correction, which is added to the approximate quotient, giving the exact quotient. The remainder is generally small compared to the divisor (a few multiples thereof), so there is often a simple way to compute the correction without using a divide instruction.

To illustrate this method, consider dividing by 3, that is, computing n / 3 where 0 ≤ n < 232. The reciprocal of 3, in binary, is approximately

0.0101 0101 0101 0101 0101 0101 0101 0101.

To compute the approximate product of that and n, we could use

Image

unsigned divu3(unsigned n) {
   unsigned n0, n1, w0, w1, w2, t, q;

   n0 = n & 0xFFFF;
   n1 = n >> 16;
   w0 = n0*0xAAAB;
   t = n1*0xAAAB + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = n0*0xAAAA + w1;
   q = n1*0xAAAA + w2 + (w1 >> 16);
   return q >> 1;
}


FIGURE 10–7. Unsigned divide by 3 using simulated multiply high unsigned.

(29 instructions; the last 1 in the reciprocal is ignored because it would add the term Image which is obviously 0). However, the simple repeating pattern of 1’s and 0’s in the reciprocal permits a method that is both faster (nine instructions) and more accurate:

Image

To compare these methods for their accuracy, consider the bits that are shifted out by each term of (32), if n is all 1-bits. The first term shifts out two 1-bits, the next four 1-bits, and so on. Each of these contributes an error of almost 1 in the least significant bit. Since there are 16 terms (counting the term we ignored), the shifts contribute an error of almost 16. There is an additional error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 16.

For procedure (1), each right shift also contributes an error of almost 1 in the least significant bit. But there are only five shift operations. They contribute an error of almost 5, and there is a further error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 5.

After computing the estimated quotient q, the remainder r is computed from

rnq * 3.

The remainder cannot be negative, because q is never larger than the exact quotient. We need to know how large r can be to devise the simplest possible method for computing Image In general, for a divisor d and an estimated quotient q too low by k, the remainder will range from k*d to k*d + d – 1. (The upper limit is conservative; it may not actually be attained.) Thus, using (1), for which q is too low by at most 5, we expect the remainder to be at most 5*3 + 2 = 17. Experimentation reveals that it is actually at most 15. Thus, for the correction we must compute (exactly)

Image

Since r is small compared to the largest value that a register can hold, this can be approximated by multiplying r by some approximation to 1/3 of the form a/b where b is a power of 2. This is easy to compute, because the division is simply a shift. The value of a/ b must be slightly larger than 1/3, so that after shifting the result will agree with truncated division. A sequence of such approximations is:

1/2, 2/4, 3/8, 6/16, 11/32, 22/64, 43/128, 86/256, 171/512, 342/1024, ....

Usually, the smaller fractions in the sequence are easier to compute, so we choose the smallest one that works; in the case at hand this is 11/32. Therefore, the final, exact, quotient is given by

Image

The solution involves two multiplications by small numbers (3 and 11); these can be changed to shift’s and add’s.

Figure 10–8 shows the entire solution in C. As shown, it consists of 14 instructions, including two multiplications. If the multiplications are changed to shift’s and add’s, it amounts to 18 elementary instructions. However, if it is desired to avoid the multiplications, then either alternative return statement shown gives a solution in 17 elementary instructions. Alternative 2 has just a little instruction-level parallelism, but in truth this method generally has very little of that.

A more accurate estimate of the quotient can be obtained by changing the first executable line to

q = (n >> 1) + (n >> 3);

(which makes q too large by a factor of 2, but it has one more bit of accuracy), and then inserting just before the assignment to r,

q = q >> 1;

With this variation, the remainder is at most 9. However, there does not seem to be any better code for calculating Image with r limited to 9 than there is for r limited to 15 (four elementary instructions in either case). Thus, using the idea would cost one instruction. This possibility is mentioned because it does give a code improvement for most divisors.


unsigned divu3(unsigned n) {
   unsigned q, r;

   q = (n >> 2) + (n >> 4);     // q = n*0.0101 (approx).
   q = q + (q >> 4);            // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                 // 0 <= r <= 15.
   return q + (11*r >> 5);      // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}


FIGURE 10–8. Unsigned divide by 3.

Figure 10–9 shows two variations of this method for dividing by 5. The reciprocal of 5, in binary, is

0.0011 0011 0011 0011 0011 0011 0011 0011.

As in the case of division by 3, the simple repeating pattern of 1’s and 0’s allows a fairly efficient and accurate computing of the quotient estimate. The estimate of the quotient computed by the code on the left can be off by at most 5, and it turns out that the remainder is at most 25. The code on the right retains two additional bits of accuracy in computing the quotient estimate, which is off by at most 2. The remainder in this case is at most 10. The smaller maximum remainder permits approximating 1/5 by 7/32 rather than 13/64, which gives a slightly more efficient program if the multiplications are done by shift’s and add’s. The instruction counts are, for the code on the left: 14 instructions including two multiplications, or 18 elementary instructions; for the code on the right: 15 instructions including two multiplications, or 17 elementary instructions. The alternative code in the return statement is useful only if your machine has comparison predicate instructions. It doesn’t reduce the instruction count, but merely has a little instruction-level parallelism.

For division by 6, the divide-by-3 code can be used, followed by a shift right of 1. However, the extra instruction can be saved by doing the computation directly, using the binary approximation

4/6 ≈ 0.1010 1010 1010 1010 1010 1010 1010 1010.

Image

FIGURE 10–9. Unsigned divide by 5.

The code is shown in Figure 10–10. The version on the left multiplies by an approximation to 1/6 and then corrects with a multiplication by 11/64. The version on the right takes advantage of the fact that by multiplying by an approximation to 4/6, the quotient estimate is off by only 1 at most. This permits simpler code for the correction; it simply adds 1 to q if r ≥ 6. The code in the second return statement is appropriate if the machine has the comparison predicate instructions. Function divu6b is 15 instructions, including one multiply, as shown, or 17 elementary instructions if the multiplication by 6 is changed to shift’s and add’s.

Image

FIGURE 10–10. Unsigned divide by 6.

For larger divisors, usually it seems to be best to use an approximation to 1/ d that is shifted left so that its most significant bit is 1. It seems that the quotient is then off by at most 1 usually (possibly always, this writer does not know), which permits efficient code for the correction step. Figure 10–11 shows code for dividing by 7 and 9, using the binary approximations

Image

If the multiplications by 7 and 9 are expanded into shift’s and add’s, these functions take 16 and 15 elementary instructions, respectively.

Image

FIGURE 10–11. Unsigned divide by 7 and 9.

Figures 10–12 and 10–13 show code for dividing by 10, 11, 12, and 13. These are based on the binary approximations:

Image

If the multiplications are expanded into shift’s and add’s, these functions take 17, 20, 17, and 20 elementary instructions, respectively.

Image

FIGURE 10–12. Unsigned divide by 10 and 11.

Image

FIGURE 10–13. Unsigned divide by 12 and 13.

The case of dividing by 13 is instructive because it shows how you must look for repeating strings in the binary expansion of the reciprocal of the divisor. The first assignment sets q equal to n*0.1001. The second assignment to q adds n*0.00001001 and n*0.000001001. At this point, q is (approximately) equal to n*0.100111011. The third assignment to q adds in repetitions of this pattern. It sometimes helps to use subtraction, as in the case of divu9 above. However, you must use care with subtraction, because it may cause the quotient estimate to be too large, in which case the remainder is negative and the method breaks down. It is quite complicated to get optimal code, and we don’t have a general cookbook method that you can put in a compiler to handle any divisor.

The examples above are able to economize on instructions, because the reciprocals have simple repeating patterns, and because the multiplication in the computation of the remainder r is by a small constant, which can be done with only a few shift’s and add’s. One might wonder how successful this method is for larger divisors. To roughly assess this, Figures 10–14 and 10–15 show code for dividing by 100 and by 1000 (decimal). The relevant reciprocals are

Image

If the multiplications are expanded into shift’s and add’s, these functions take 25 and 23 elementary instructions, respectively.


unsigned divu100(unsigned n) {
   unsigned q, r;

   q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) +
       (n >> 12) + (n >> 13) - (n >> 16);
   q = q + (q >> 20);
   q = q >> 6;
   r = n - q*100;
   return q + ((r + 28) >> 7);
// return q + (r > 99);
}


FIGURE 10–14. Unsigned divide by 100.


unsigned divu1000(unsigned n) {
   unsigned q, r, t;

   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}


FIGURE 10–15. Unsigned divide by 1000.

In the case of dividing by 1000, the least significant eight bits of the reciprocal estimate are nearly ignored. The code of Figure 10–15 replaces the binary 1001 0111 with 0100 0000, and still the quotient estimate is within one of the true quotient. Thus, it appears that although large divisors might have very little repetition in the binary representation of the reciprocal estimate, at least some bits can be ignored, which helps hold down the number of shift’s and add’s required to compute the quotient estimate.

This section has shown, in a somewhat imprecise way, how unsigned division by a constant can be reduced to a sequence of, typically, about 20 elementary instructions. It is nontrivial to get an algorithm that generates these code sequences that is suitable for incorporation into a compiler, because of three difficulties in getting optimal code.

1. It is necessary to search the reciprocal estimate bit string for repeating patterns.

2. Negative terms (as in divu10 and divu100) can be used sometimes, but the error analysis required to determine just when they can be used is difficult.

3. Sometimes some of the least significant bits of the reciprocal estimate can be ignored (how many?).

Another difficulty for some target machines is that there are many variations on the code examples given that have more instructions, but that would execute faster on a machine with multiple shift and add units.

The code of Figures 10–7 through 10–15 has been tested for all 232 values of the dividends.

Signed Division

The methods given above can be made to apply to signed division. The right shift instructions in computing the quotient estimate become signed right shift instructions, which compute floor division by powers of 2. Thus, the quotient estimate is too low (algebraically), so the remainder is nonnegative, as in the unsigned case.

The code most naturally computes the floor division result, so we need a correction to make it compute the conventional truncated-toward-0 result. This can be done with three computational instructions by adding d – 1 to the dividend if the dividend is negative. For example, if the divisor is 6, the code begins with (the shift here is a signed shift)

n = n + (n >> 31 & 5);

Other than this, the code is very similar to that of the unsigned case. The number of elementary operations required is usually three more than in the corresponding unsigned division function. Several examples are given in Figures 10–16 through 10–22. All have been exhaustively tested.


int divs3(int n) {
   int q, r;

   n = n + (n>>31 & 2);         // Add 2 if n < 0.
   q = (n >> 2) + (n >> 4);     // q = n*0.0101 (approx).
   q = q + (q >> 4);            // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                 // 0 <= r <= 14.
   return q + (11*r >> 5);      // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}


FIGURE 10–16. Signed divide by 3.

Image

FIGURE 10–17. Signed divide by 5 and 6.

Image

FIGURE 10–18. Signed divide by 7 and 9.

Image

FIGURE 10–19. Signed divide by 10 and 11.

Image

FIGURE 10–20. Signed divide by 12 and 13.

Image

FIGURE 10–21. Signed divide by 100.


int divs1000(int n) {
   int q, r, t;

   n = n + (n >> 31 & 999);
   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) +
       (n >> 26) + (t >> 21);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}


FIGURE 10-22. Signed divide by 1000.

10-19 Remainder by Summing Digits

This section addresses the problem of computing the remainder of division by a constant without computing the quotient. The methods of this section apply only to divisors of the form 2k ± 1, for k an integer greater than or equal to 2, and in most cases the code resorts to a table lookup (an indexed load instruction) after a fairly short calculation.

We will make frequent use of the following elementary property of congruences:

THEOREM C. If ab (mod m) and cd (mod m), then

Image

The unsigned case is simpler and is dealt with first.

Unsigned Remainder

For a divisor of 3, multiplying the trivial congruence 1 ≡ 1 (mod 3) repeatedly by the congruence 2 ≡ –1 (mod 3), we conclude by Theorem C that

Image

Therefore, a number n written in binary as ...b3 b2 b1 b0 satisfies

n = ... + b3 · 23 + b2 · 22 + b1 · 2 + b0 ≡ ...– b3 + b2b1 + b0 (mod 3),

which is derived by using Theorem C repeatedly. Thus, we can alternately add and subtract the bits in the binary representation of the number to obtain a smaller number that has the same remainder upon division by 3. If the sum is negative, you must add a multiple of 3 to make it nonnegative. The process can then be repeated until the result is in the range 0 to 2.

The same trick works for finding the remainder after dividing a decimal number by 11.

Thus, if the machine has the population count instruction, a function that computes the remainder modulo 3 of an unsigned number n might begin with

n = pop(n & 0x55555555) - pop(n & 0xAAAAAAAA);

This can be simplified by using the following surprising identity discovered by Paolo Bonzini [Bonz]:

Image
Image

Since the references to 32 (the word size) cancel out, the result holds for any word size. Another way to prove (2) is to observe that it holds for x = 0, and if a 0-bit in x is changed to a 1 where m is 1, then both sides of (2) decrease by 1, and if a 0-bit of x is changed to a 1 where m is 0, then both sides of (2) increase by 1.

Applying (2) to the line of C code above gives

n = pop(n ^ 0xAAAAAAAA) - 16;

We want to apply this transformation again, until n is in the range 0 to 2, if possible. It is best to avoid producing a negative value of n, because the sign bit would not be treated properly on the next round. A negative value can be avoided by adding a sufficiently large multiple of 3 to n. Bonzini’s code, shown in Figure 10–23, increases the constant by 39. This is larger than necessary to make n nonnegative, but it causes n to range from –3 to 2 (rather than –3 to 3) after the second round of reduction. This simplifies the code on the return statement, which is adding 3 if n is negative. The function executes in 11 instructions, counting two to load the large constant.

Figure 10–24 shows a variation that executes in four instructions, plus a simple table lookup operation (e.g., an indexed load byte instruction).


int remu3(unsigned n) {
   n = pop(n ^ 0xAAAAAAAA) + 23;       // Now 23 <= n <= 55.
   n = pop(n ^ 0x2A) - 3;              // Now -3 <= n <= 2.
   return n + (((int)n >> 31) & 3);    // (Signed shift).
}


FIGURE 10–23. Unsigned remainder modulo 3, using population count.


int remu3(unsigned n) {

   static char table[33] = {2, 0,1,2, 0,1,2, 0,1,2,
          0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
          0,1,2, 0,1};

   n = pop(n ^ 0xAAAAAAAA);
   return table[n];
}


FIGURE 10–24. Unsigned remainder modulo 3, using population count and a table lookup.

To avoid the population count instruction, notice that because 4 ≡ 1 (mod 3), 4k ≡ 1 (mod 3). A binary number can be viewed as a base 4 number by taking its bits in pairs and interpreting the bits 00 to 11 as a base 4 digit ranging from 0 to 3. The pairs of bits can be summed using the code of Figure 5–2 on page 82, omitting the first executable line (overflow does not occur in the additions). The final sum ranges from 0 to 48, and a table lookup can be used to reduce this to the range 0 to 2. The resulting function is 16 elementary instructions, plus an indexed load.

There is a similar, but slightly better, way. As a first step, n can be reduced to a smaller number that is in the same congruence class modulo 3 with

n = (n >> 16) + (n & 0xFFFF);

This splits the number into two 16-bit portions, which are added together. The contribution modulo 3 of the left 16 bits of n is not altered by shifting them right 16 positions, because the shifted number, multiplied by 216, is the original number, and 216 ≡ 1 (mod 3). More generally, 2k ≡ 1 (mod 3) if k is even. This is used repeatedly (five times) in the code shown in Figure 10–25. This code is 19 instructions. The instruction count can be reduced by cutting off the digit summing earlier and using an in-memory table lookup, as illustrated in Figure 10–26 (nine instructions, plus an indexed load). The instruction count can be reduced to six (plus an indexed load) by using a table of size 0x2FE = 766 bytes.

To compute the unsigned remainder modulo 5, the code of Figure 10–27 uses the relations 16k ≡ 1 (mod 5) and 4 ≡ –1 (mod 5). It is 21 elementary instructions, assuming the multiplication by 3 is expanded into a shift and add.


int remu3(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);       // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);       // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);       // Max 0x3D.
   n = (n >>  2) + (n & 0x0003);       // Max 0x11.
   n = (n >>  2) + (n & 0x0003);       // Max 0x6.
   return (0x0924 >> (n << 1)) & 3;
}


FIGURE 10–25. Unsigned remainder modulo 3, digit summing and an in-register lookup.


int remu3(unsigned n) {
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2,  0,1};

   n = (n >> 16) + (n & 0xFFFF);    // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);    // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);    // Max 0x3D.
   return table[n];
}


FIGURE 10–26. Unsigned remainder modulo 3, digit summing and an in-memory lookup.


int remu5(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);            // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);            // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);            // Max 0x3D.
   n = (n>>4) - ((n>>2) & 3) + (n & 3);     // -3 to 6.
   return (01043210432 >> 3*(n + 3)) & 7;   // Octal const.
}


FIGURE 10–27. Unsigned remainder modulo 5, digit summing method.

The instruction count can be reduced by using a table, similar to what is done in Figure 10–26. In fact, the code is identical, except the table is:

static char table[62] = {0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1};

For the unsigned remainder modulo 7, the code of Figure 10–28 uses the relation 8k ≡ 1 (mod 7) (nine elementary instructions, plus an indexed load).

As a final example, the code of Figure 10–29 computes the remainder of unsigned division by 9. It is based on the relation 8 ≡ –1 (mod 9). As shown, it is nine elementary instructions, plus an indexed load. The elementary instruction count can be reduced to six by using a table of size 831 (decimal).


int remu7(unsigned n) {

   static char table[75] = {0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4};

   n = (n >> 15) + (n & 0x7FFF);        // Max 0x27FFE.
   n = (n >>  9) + (n & 0x001FF);       // Max 0x33D.
   n = (n >>  6) + (n & 0x0003F);       // Max 0x4A.
   return table[n];
}


FIGURE 10–28. Unsigned remainder modulo 7, digit summing method.


int remu9(unsigned n) {

   int r;
   static char table[75] = {0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2};

   r = (n & 0x7FFF) - (n >> 15);    // FFFE0001 to 7FFF.
   r = (r & 0x01FF) - (r >>  9);    // FFFFFFC1 to 2FF.
   r = (r & 0x003F) + (r >>  6);    // 0 to 4A.
   return table[r];
}


FIGURE 10–29. Unsigned remainder modulo 9, digit summing method.

Signed Remainder

The digit summing method can be adapted to compute the remainder resulting from signed division. There seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. Two corrections are necessary: (1) correct for a different interpretation of the sign bit, and (2) add or subtract a multiple of the divisor d to get the result in the range 0 to – (d – 1).

For division by 3, the unsigned remainder code interprets the sign bit of the dividend n as contributing 2 to the remainder (because 231 mod 3 = 2). For the remainder of signed division, the sign bit contributes only 1 (because (–231) mod 3 = 1). Therefore, we can use the code for an unsigned remainder and correct its result by subtracting 1. Then, the result must be put in the range 0 to –2. That is, the result of the unsigned remainder code must be mapped as follows:

(0, 1, 2) ⇒ (–1, 0, 1) ⇒ (–1, 0, –2).

This adjustment can be done fairly efficiently by subtracting 1 from the unsigned remainder if it is 0 or 1, and subtracting 4 if it is 2 (when the dividend is negative). The code must not alter the dividend n, because it is needed in this last step.

This procedure can easily be applied to any of the functions given for the unsigned remainder modulo 3. For example, applying it to Figure 10–26 on page 265 gives the function shown in Figure 10–30. It is 13 elementary instructions, plus an indexed load. The instruction count can be reduced by using a larger table.


int rems3(int n) {
   unsigned r;
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1};

   r = n;
   r = (r >> 16) + (r & 0xFFFF);    // Max 0x1FFFE.
   r = (r >>  8) + (r & 0x00FF);    // Max 0x2FD.
   r = (r >>  4) + (r & 0x000F);    // Max 0x3D.
   r = table[r];
   return r - (((unsigned)n >> 31) << (r & 2));
}


FIGURE 10–30. Signed remainder modulo 3, digit summing method.

Figures 10–31 to 10–33 show similar code for computing the signed remainder of division by 5, 7, and 9. All the functions consist of 15 elementary operations, plus an indexed load. They use signed right shifts, and the final adjustment consists of subtracting the modulus if the dividend is negative and the remainder is nonzero. The number of instructions can be reduced by using larger tables.


int rems5(int n) {
   int r;
   static char table[62] = {2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3};

   r = (n >> 16) + (n & 0xFFFF);  // FFFF8000 to 17FFE.
   r = (r >>  8) + (r & 0x00FF);  // FFFFFF80 to 27D.
   r = (r >>  4) + (r & 0x000F);  // -8 to 53 (decimal).
   r = table[r + 8];
   return r - (((int)(n & -r) >> 31) & 5);
}


FIGURE 10–31. Signed remainder modulo 5, digit summing method.


int rems7(int n) {
   int r;
   static char table[75] =   {5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2};

   r = (n >> 15) + (n & 0x7FFF);   // FFFF0000 to 17FFE.
   r = (r >>  9) + (r & 0x001FF);  // FFFFFF80 to 2BD.
   r = (r >>  6) + (r & 0x0003F);  // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 7);
}


FIGURE 10–32. Signed remainder modulo 7, digit summing method.


int rems9(int n) {
   int r;
   static char table[75] = {7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0};

   r = (n & 0x7FFF) - (n >> 15);   // FFFF7001 to 17FFF.
   r = (r & 0x01FF) - (r >>  9);   // FFFFFF41 to 0x27F.
   r = (r & 0x003F) + (r >>  6);   // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 9);
}


FIGURE 10–33. Signed remainder modulo 9, digit summing method.

10–20 Remainder by Multiplication and Shifting Right

The method described in this section applies, in principle, to all integer divisors greater than 2, but as a practical matter only to fairly small divisors and to divisors of the form 2k – 1. As in the preceding section, in most cases the code resorts to a table lookup after a fairly short calculation.

Unsigned Remainder

This section uses the mathematical (not computer algebra) notation a mod b, where a and b are integers and b > 0, to denote the integer x, 0 ≤ x < b, that satisfies xa (mod b).

To compute n mod 3, observe that

Image

Proof: Let n = 3 k + δ, where δ and k are integers and 0 ≤ δ ≤ 2. Then

Image

Clearly, the value of the last expression is 0, 1, or 2 for δ = 0, 1, or 2 respectively. This allows changing the problem of computing the remainder modulo 3 to one of computing the remainder modulo 4, which is of course much easier on a binary computer.

Relations like (3) do not hold for all moduli, but similar relations do hold if the modulus is of the form 2k – 1, for k an integer greater than 1. For example, it is easy to show that

Image

For numbers not of the form 2k – 1, there is no such simple relation, but there is a certain uniqueness property that can be used to compute the remainder for other divisors. For example, if the divisor is 10 (decimal), consider the expression

Image

Let n = 10 k + δ where 0 ≤ δ ≤ 9. Then

Image

For δ = 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, the last expression takes on the values 0, 1, 3, 4, 6, 8, 9, 11, 12, and 14 respectively. The latter numbers are all distinct. Therefore, if we can find a reasonably easy way to compute (4), we can translate 0 to 0, 1 to 1, 3 to 2, 4 to 3, and so on, to obtain the remainder of division by 10. This will generally require a translation table of size equal to the next power of 2 greater than the divisor, so the method is practical only for fairly small divisors (and for divisors of the form 2k – 1, for which table lookup is not required).

The code to be shown was derived by using a little of the above theory and a lot of trial and error.

Consider the remainder of unsigned division by 3. Following (3), we wish to compute the rightmost two bits of the integer part of 4n/ 3. This can be done approximately by multiplying by 232 / 3 and then dividing by 230 using a shift right instruction. When the multiplication by 232 / 3 is done (using the multiply instruction that gives the low-order 32 bits of the product), high-order bits will be lost. But that doesn’t matter, and, in fact, it’s helpful, because we want the result modulo 4. Therefore, because 232 / 3 = 0x55555555, a possible plan is to compute

Image

Experiment indicates that this works for n in the range 0 to 230 + 2. It almost works, I should say; if n is nonzero and a multiple of 3, it gives the result 3. Therefore, it must be followed by a translation step that translates (0, 1, 2, 3) to (0, 1, 2, 0) respectively.

To extend the range of applicability, the multiplication must be done more accurately. Two more bits of accuracy suffice (that is, multiplying by 0x55555555.4). The following calculation, followed by the translation step, works for all n representable as an unsigned 32-bit integer:

Image

It is, of course, possible to give a formal proof of this, but the algebra is quite lengthy and error prone.

The translation step can be done in three or four instructions on most machines, but there is a way to avoid it at a cost of two instructions. The above expression for computing r estimates low. If you estimate slightly high, the result is always 0, 1, or 2. This gives the C function shown in Figure 10–34 (eight instructions, including a multiply).


int remu3(unsigned n) {
   return (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
}


FIGURE 10–34. Unsigned remainder modulo 3, multiplication method.

The multiplication can be expanded, giving the 13-instruction function shown in Figure 10–35 that uses only shift’s and add’s.


int remu3(unsigned n) {
   unsigned r;

   r = n + (n << 2);
   r = r + (r << 4);
   r = r + (r << 8);
   r = r + (r << 16);
   r = r + (n >> 1);
   r = r - (n >> 3);
   return r >> 30;
}


FIGURE 10–35. Unsigned remainder modulo 3, multiplication (expanded) method.

The remainder of unsigned division by 5 can be computed very similarly to the remainder of division by 3. Let n = 5 k + r with 0 ≤ r ≤ 4. Then (8 / 5)n mod 8 = (8 / 5)(5 k+ r) mod 8 = (8 / 5)r mod 8. For r = 0, 1, 2, 3, and 4, this takes on the values 0, 1, 3, 4, and 6 respectively. Since 232 / 5 = 0x33333333, this leads to the function shown in Figure 10–36 (11 instructions, including a multiply). The last step (code on the return statement) is mapping (0, 1, 3, 4, 6, 7) to (0, 1, 2, 3, 4, 0) respectively, using an in-register method rather than an indexed load from memory. By also mapping 2 to 2 and 5 to 4, the precision required in the multiplication by 232 / 5 is reduced to using just the term n >> 3 to approximate the missing part of the multiplier (hexadecimal 0.333...). If the “accuracy” term n >> 3 is omitted, the code still works for n ranging from 0 to 0x60000004.


int remu5(unsigned n) {
   n = (0x33333333*n + (n >> 3)) >> 29;
   return (0x04432210 >> (n << 2)) & 7;
}


FIGURE 10–36. Unsigned remainder modulo 5, multiplication method.

The code for computing the unsigned remainder modulo 7 is similar, but the mapping step is simpler; it is necessary only to convert 7 to 0. One way to code it is shown in Figure 10–37 (11 instructions, including a multiply). If the accuracy term n >> 4 is omitted, the code still works for n up to 0x40000006. With both accuracy terms omitted, it works for n up to 0x08000006.


int remu7(unsigned n) {
   n = (0x24924924*n + (n >> 1) + (n >> 4)) >> 29;
   return n & ((int)(n - 7) >> 31);
}


FIGURE 10–37. Unsigned remainder modulo 7, multiplication method.

Code for computing the unsigned remainder modulo 9 is shown in Figure 10–38. It is six instructions, including a multiply, plus an indexed load. If the accuracy term n >> 1 is omitted and the multiplier is changed to 0x1C71C71D, the function works for n up to 0x1999999E.


int remu9(unsigned n) {
   static char table[16] = {0, 1, 1, 2, 2, 3, 3, 4,
                            5, 5, 6, 6, 7, 7, 8, 8};

   n = (0x1C71C71C*n + (n >> 1)) >> 28;
   return table[n];
}


FIGURE 10–38. Unsigned remainder modulo 9, multiplication method.

Figure 10–39 shows a way to compute the unsigned remainder modulo 10. It is eight instructions, including a multiply, plus an indexed load instruction. If the accuracy term n >> 3 is omitted, the code works for n up to 0x40000004. If both accuracy terms are omitted, it works for n up to 0x0AAAAAAD.


int remu10(unsigned n) {
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   return table[n];
}


FIGURE 10–39. Unsigned remainder modulo 10, multiplication method.

As a final example, consider the computation of the remainder modulo 63. This function is used by the population count program at the top of page 84. Joe Keane [Keane] has come up with the rather mysterious code shown in Figure 10–40. It is 12 elementary instructions on the basic RISC.


int remu63(unsigned n) {
   unsigned t;

   t = (((n >> 12) + n) >> 10) + (n << 2);
   t = ((t >> 6) + t + 3) & 0xFF;
   return (t - (t >> 6)) >> 2;
}


FIGURE 10–40. Unsigned remainder modulo 63, Keane’s method.

The “multiply and shift right” method leads to the code shown in Figure 10–41. This is 11 instructions on the basic RISC, one being a multiply. This would not be as fast as Keane’s method, unless the machine has a very fast multiply and the load of the constant 0x04104104 can move out of a loop.


int remu63(unsigned n) {
   n = (0x04104104*n + (n >> 4) + (n >> 10)) >> 26;
   return n & ((n - 63) >> 6);   // Change 63 to 0.
}


FIGURE 10–41. Unsigned remainder modulo 63, multiplication method.

On some machines, an improvement can result from expanding the multiplication into shifts and adds as follows (15 elementary instructions for the whole function):

r = (n << 2) + (n << 8);       // r = 0x104*n.
r = r + (r << 12);             // r = 0x104104*n.
r = r + (n << 26);             // r = 0x04104104*n.

Signed Remainder

As in the case of the digit summing method, the “multiply and shift right” method can be adapted to compute the remainder resulting from signed division. Again, there seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. For example, the code shown in Figure 10–42 is derived from Figure 10–34 on page 270 (12 instructions, including a multiply).


int rems3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   return r - (((unsigned)n >> 31) << (r & 2));
}


FIGURE 10–42. Signed remainder modulo 3, multiplication method.

Some plausible ways to compute the remainder of signed division by 5, 7, 9, and 10 are shown in Figures 10–43 to 10–46. The code for a divisor of 7 uses quite a few extra instructions (19 in all, including a multiply); it might be preferable to use a table similar to that shown for the cases in which the divisor is 5, 9, or 10. In the latter cases, the table used for unsigned division is doubled in size, with the sign bit of the divisor factored in to index the table. Entries shown as u are unused.


int rems5(int n) {
   unsigned r;
   static signed char table[16] = {0, 1, 2, 2, 3, u, 4, 0,
                                   u, 0,-4, u,-3,-2,-2,-1};
   r = n;
   r = ((0x33333333*r) + (r >> 3)) >> 29;
   return table[r + (((unsigned)n >> 31) << 3)];
}


FIGURE 10–43. Signed remainder modulo 5, multiplication method.


int rems7(int n) {
   unsigned r;

   r = n - (((unsigned)n >> 31) << 2);  // Fix for sign.
   r = ((0x24924924*r) + (r >> 1) + (r >> 4)) >> 29;
   r = r & ((int)(r - 7) >> 31);        // Change 7 to 0.
   return r - (((int)(n&-r) >> 31) & 7);// Fix n<0 case.
}


FIGURE 10–44. Signed remainder modulo 7, multiplication method.


int rems9(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, 1, 2, u, 3, u, 4,
                                   5, 5, 6, 6, 7, u, 8, u,
                                  -4, u,-3, u,-2,-1,-1, 0,
                                   u,-8, u,-7,-6,-6,-5,-5};
   r = n;
   r = (0x1C71C71C*r + (r >> 1)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


FIGURE 10–45. Signed remainder modulo 9, multiplication method.


int rems10(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, u, 2, 3, u, 4, 5,
                                   5, 6, u, 7, 8, u, 9, u,
                                  -6,-5, u,-4,-3,-3,-2, u,
                                  -1, 0, u,-9, u,-8,-7, u};
   r = n;
   r = (0x19999999*r + (r >> 1) + (r >> 3)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}


FIGURE 10–46. Signed remainder modulo 10, multiplication method.

10–21 Converting to Exact Division

Since the remainder can be computed without computing the quotient, the possibility arises of computing the quotient q = n/d by first computing the remainder, subtracting this from the dividend n, and then dividing the difference by the divisor d. This last division is an exact division, and it can be done by multiplying by the multiplicative inverse of d (see Section 10–16, “Exact Division by Constants,” on page 240). This method would be particularly attractive if both the quotient and remainder are wanted.

Let us try this for the case of unsigned division by 3. Computing the remainder by the multiplication method (Figure 10–34 on page 270) leads to the function shown in Figure 10–47.


unsigned divu3(unsigned n) {
   unsigned r;

   r = (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
   return (n - r)*0xAAAAAAAB;
}


FIGURE 10–47. Unsigned remainder and quotient with divisor = 3, using exact division.

This is 11 instructions, including two multiplications by large numbers. (The constant 0x55555555 can be generated by shifting the constant 0xAAAAAAAB right one position.) In contrast, the more straightforward method of computing the quotient q using (for example) the code of Figure 10–8 on page 254, requires 14 instructions, including two multiplications by small numbers, or 17 elementary operations if the multiplications are expanded into shift’s and add’s. If the remainder is also wanted, and it is computed from r = n - q*3, the more straightforward method requires 16 instructions, including three multiplications by small numbers, or 20 elementary instructions if the multiplications are expanded into shift’s and add’s.

The code of Figure 10–47 is not attractive if the multiplications are expanded into shift’s and add’s; the result is 24 elementary instructions. Thus, the exact division method might be a good one on a machine that does not have multiply high but does have a fast modulo 232 multiply and slow divide, particularly if it can easily deal with the large constants.

For signed division by 3, the exact division method might be coded as shown in Figure 10–48. It is 15 instructions, including two multiplications by large constants.


int divs3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   r = r - (((unsigned)n >> 31) << (r & 2));
   return (n - r)*0xAAAAAAAB;
}


FIGURE 10–48. Signed remainder and quotient with divisor = 3, using exact division.

As a final example, Figure 10–49 shows code for computing the quotient and remainder for unsigned division by 10. It is 12 instructions, including two multiplications by large constants, plus an indexed load instruction.


unsigned divu10(unsigned n) {
   unsigned r;
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   r = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   r = table[r];
   return ((n - r) >> 1)*0xCCCCCCCD;
}


FIGURE 10–49. Signed remainder and quotient with divisor = 10, using exact division.

10–22 A Timing Test

Many machines have a 32×32 ⇒ 64 multiply instruction, so one would expect that to divide by a constant such as 3, the code shown on page 228 would be fastest. If that multiply instruction is not present, but the machine has a fast 32×32 ⇒ 32 multiply instruction, then the exact division method might be a good one if the machine has a slow divide and a fast multiply. To test this conjecture, an assembly language program was constructed to compare four methods of dividing by 3. The results are shown in Table 10–4. The machine used was a 667 MHz Pentium III (ca. 2000), and one would expect similar results on many other machines.

TABLE 10–4. UNSIGNED DIVIDE BY 3 ON A PENTIUM III

Image

The first row gives the time in cycles for just two instructions: an xorl to clear the left half of the 64-bit source register, and the divl instruction, which evidently takes 40 cycles. The second row also gives the time for just two instructions: multiply and shift right 1 (mull and shrl). The third row gives the time for a sequence of 21 elementary instructions. It is the code of Figure 10–8 on page 254 using alternative 2, and with the multiplication by 3 done with a single instruction (leal). Several move instructions are necessary because the machine is (basically) two-address. The last row gives the time for a sequence of 10 instructions: two multiplications (imull) and the rest elementary. The two imull instructions use 4-byte immediate fields for the large constants. (The signed multiply instruction imull is used rather than its unsigned counterpart mull, because they give the same result in the low-order 32 bits, and imull has more addressing modes available.)

The exact division method would be even more favorable compared to the second and third methods if both the quotient and remainder were wanted, because they would require additional code for the computation rnq*3. (The divl instruction produces the remainder as well as the quotient.)

10–23 A Circuit for Dividing by 3

There is a simple circuit for dividing by 3 that is about as complex as an adder. It can be constructed very similarly to the elementary way one constructs an n-bit adder from n 1-bit “full adder” circuits. However, in the divider signals flow from most significant to least significant bit.

Consider dividing by 3 the way it is taught in grade school, but in binary. To produce each bit of the quotient, you divide 3 into the next bit, but the bit is preceded by a remainder of 0, 1, or 2 from the previous stage. The logic is shown in Table 10–5. Here the remainder is represented by two bits ri and si, with ri being the most significant bit. The remainder is never 3, so the last two rows of the table represent “don’t care” cases.

A circuit for 32-bit division by 3 is shown in Figure 10–50. The quotient is the word consisting of bits y31 through y0, and the remainder is 2r0 + s0.

Another way to implement the divide-by-3 operation in hardware is to use the multiplier to multiply the dividend by the reciprocal of 3 (binary 0.010101...), with appropriate rounding and scaling. This is the technique shown on pages 207 and 228.

TABLE 10–5. LOGIC FOR DIVIDING BY 3

Image
Image

FIGURE 10–50. Logic circuit for dividing by 3.

Exercises

1. Show that for unsigned division by an even number, the shrxi instruction (or equivalent code) can be avoided by first (a) turning off the low-order bit of the dividend (and operation) [CavWer] or (b) dividing the dividend by 2 (shift right 1 instruction) and then dividing by half the divisor.

2. Code a function in Python similar to that of Figure 10–4 on page 240, but for computing the magic number for signed division. Consider only positive divisors.

3. Show how you would use Newton’s method to calculate the multiplicative inverse of an integer d modulo 81. Show the calculations for d = 146.

I think that I shall never envision

An op unlovely as division.

An op whose answer must be guessed

And then, through multiply, assessed;

An op for which we dearly pay,

In cycles wasted every day.

Division code is often hairy;

Long division’s downright scary.

The proofs can overtax your brain,

The ceiling and floor may drive you insane.

Good code to divide takes a Knuthian hero,

But even God can’t divide by zero!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset