Chapter 9. Integer Division

9–1 Preliminaries

This chapter and the following one give a number of tricks and algorithms involving “computer division” of integers. In mathematical formulas we use the expression x / y to denote ordinary rational division, x ÷ y to denote signed computer division of integers (truncating toward 0), and Image to denote unsigned computer division of integers. Within C code, x/y, of course, denotes computer division, unsigned if either operand is unsigned, and signed if both operands are signed.

Division is a complex process, and the algorithms involving it are often not very elegant. It is even a matter of judgment as to just how signed integer division should be defined. Most high-level languages and most computer instruction sets define the result to be the rational result truncated toward 0. This and two other possibilities are illustrated below.


                truncating   modulus   floor
7÷3         =   2 rem 1      2 rem 1   2 rem 1
(–7)÷3      =  –2 rem –1    -3 rem 2  –3 rem 2
7÷–(–3)     =  –2 rem 1     –2 rem 1  –3 rem –2
(–7)÷– (–3) =   2 rem –1     3 rem 2   2 rem –1

The relation dividend = quotient * divisor + remainder holds for all three possibilities. We define “modulus” division by requiring that the remainder be nonnegative.1 We define “floor” division by requiring that the quotient be the floor of the rational result. For positive divisors, modulus and floor division are equivalent. A fourth possibility, seldom used, rounds the quotient to the nearest integer.

One advantage of modulus and floor division is that most of the tricks simplify. For example, division by 2n can be replaced by a shift right signed of n positions, and the remainder of dividing x by 2n is given by the logical and of x and 2n – 1. I suspect that modulus and floor division more often give the result you want. For example, suppose you are writing a program to graph an integer-valued function, and the values range from imin to imax. You want to set up the extremes of the ordinate to be the smallest multiples of 10 that include imin and imax. Then the extreme values are simply (imin ÷ 10) * 10 and ((imax + 9) ÷ 10) * 10 if modulus or floor division is used. If conventional division is used, you must evaluate something like:

if (imin >= 0)   gmin = (imin/10)*10;
else             gmin = ((imin - 9)/10)*10;
if (imax >= 0)   gmax = ((imax + 9)/10)*10;
else             gmax = (imax/10)*10;

Besides the quotient being more useful with modulus or floor division than with truncating division, we speculate that the nonnegative remainder is probably wanted more often than a remainder that can be negative.

It is hard to choose between modulus and floor division, because they differ only when the divisor is negative, which is unusual. Appealing to existing high-level languages does not help, because they almost universally use truncating division for x/y when the operands are signed integers. A few give floating-point numbers, or rational numbers, for the result. Looking at remainders, there is confusion. In Fortran 90, the MOD function gives the remainder of truncating division and MODULO gives the remainder of floor division (which can be negative). Similarly, in Common Lisp and ADA, REM is the remainder of truncating division, and MOD is the remainder of floor division. In PL/I, MOD is always nonnegative (it is the remainder of modulus division). In Pascal, A mod B is defined only for B > 0, and then it is the nonnegative value (the remainder of either modulus or floor division).

Anyway, we cannot change the world even if we knew how we wanted to change it,2 so in what follows we will use the usual definition (truncating) for x ÷ y.

A nice property of truncating division is that it satisfies

(–n) ÷ d = n ÷ (–d) = –(n ÷ d), for d ≠0.

Care must be exercised when applying this to transform programs, because if n or d is the maximum negative number, –n or –d cannot be represented in 32 bits. The operation (–231) ÷ (–1) is an overflow (the result cannot be expressed as a signed quantity in two’s-complement notation), and on most machines the result is undefined or the operation is suppressed.

Signed integer (truncating) division is related to ordinary rational division by

Image

Unsigned integer division—that is, division in which both n and d are interpreted as unsigned integers—satisfies the upper portion of (1).

In the discussion that follows, we make use of the following elementary properties of arithmetic, which we don’t prove here. See [Knu1] and [GKP] for interesting discussions of the floor and ceiling functions.

THEOREM D1. For x real, k an integer,

Image

THEOREM D2. For n, d integers, d > 0,

Image

If d < 0:

Image

THEOREM D3. For x real, d an integer > 0:

Image

COROLLARY. For a, b real, b ≠ 0, d an integer > 0,

Image

THEOREM D4. For n, d integers, d ≠ 0, and x real,

Image

In the theorems below, rem(n, d) denotes the remainder of n divided by d. For negative d, it is defined by rem(n, –d) = rem(n, d), as in truncating and modulus division. We do not use rem(n, d) with n < 0. Thus, for our use, the remainder is always nonnegative.

THEOREM D5. For n ≥ 0, d ≠0,

Image

(whichever value is greater than or equal to 0 and less than |d|).

THEOREM D6. For n ≥ 0, d ≠ 0,

rem(2n, 2d) = 2rem(n, d).

Theorems D5 and D6 are easily proved from the basic definition of remainder—that is, that for some integer q it satisfies

n = qd + rem(n, d) with 0 ≤ rem(n, d) < |d|,

provided n ≥ 0 and d ≠ 0 (n and d can be non-integers, but we will use these theorems only for integers).

9–2 Multiword Division

As in the case of multiword multiplication, multiword division can be done by the traditional grade-school method. The details, however, are surprisingly complicated. Figure 9–1 is Knuth’s Algorithm D [Knu2, 4.3.1], coded in C. The underlying form of division it uses is Image. (Actually, the quotient of these underlying division operations is at most 17 bits long.)


int divmnu(unsigned short q[], unsigned short r[],
     const unsigned short u[], const unsigned short v[],
     int m, int n) {

   const unsigned b = 65536;   // Number base (16 bits).
   unsigned short *un, *vn;    // Normalized form of u, v.
   unsigned qhat;              // Estimated quotient digit.
   unsigned rhat;              // A remainder.
   unsigned p;                 // Product of two digits.
   int s, i, j, t, k;

   if (m < n || n <= 0 || v[n-1] == 0)
      return 1;                // Return if invalid param.

   if (n == 1) {                      // Take care of
      k = 0;                          // the case of a
      for (j = m - 1; j >= 0; j--) {  // single-digit
         q[j] = (k*b + u[j])/v[0];    // divisor here.
         k = (k*b + u[j]) - q[j]*v[0];
      }
      if (r != NULL) r[0] = k;
      return 0;
   }

   // Normalize by shifting v left just enough so that
   // its high-order bit is on, and shift u left the
   // same amount. We may have to append a high-order
   // digit on the dividend; we do that unconditionally.

   s = nlz(v[n-1]) - 16;       // 0 <= s <= 16.
   vn = (unsigned short *)alloca(2*n);
   for (i = n - 1; i > 0; i--)
      vn[i] = (v[i] << s) | (v[i-1] >> 16-s);
   vn[0] = v[0] << s;

   un = (unsigned short *)alloca(2*(m + 1));
   un[m] = u[m-1] >> 16-s;
   for (i = m - 1; i > 0; i--)
      un[i] = (u[i] << s) | (u[i-1] >> 16-s);
   un[0] = u[0] << s;
   for (j = m - n; j >= 0; j--) {    // Main loop.
      // Compute estimate qhat of q[j].
      qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
      rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];
again:
      if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])
      { qhat = qhat - 1;
        rhat = rhat + vn[n-1];
        if (rhat < b) goto again;
      }
   
      // Multiply and subtract.
      k = 0;
      for (i = 0; i < n; i++) {
         p = qhat*vn[i];
         t = un[i+j] - k - (p & 0xFFFF);
         un[i+j] = t;
         k = (p >> 16) - (t >> 16);
      }
      t = un[j+n] - k;
      un[j+n] = t;
   
      q[j] = qhat;         // Store quotient digit.
      if (t < 0) {         // If we subtracted too
         q[j] = q[j] - 1;   // much, add back.
         k = 0;
         for (i = 0; i < n; i++) {
            t = un[i+j] + vn[i] + k;
            un[i+j] = t;
            k = t >> 16;
         }
         un[j+n] = un[j+n] + k;
       }
    } // End j.
    // If the caller wants the remainder, unnormalize
    // it and pass it back.
    if (r != NULL) {
       for (i = 0; i < n - 1; i++)
          r[i] = (un[i] >> s) | (un[i + 1] << 16-s);
          r[n-1] = un[n-1] >> s;
    }
    return 0;
}


FIGURE 9–1. Multiword integer division, unsigned.

The algorithm processes its inputs and outputs a halfword at a time. Of course, we would prefer to process a fullword at a time, but it seems that such an algorithm would require an instruction that does Image division. We assume here that either the machine does not have that instruction or it is hard to access from our high-level language. Although we generally assume the machine has Image division, for this problem Image suffices.

Thus, for this implementation of Knuth’s algorithm, the base b is 65536. See [Knu2] for most of the explanation of this algorithm.

The dividend u and the divisor v are in “little-endian” order—that is, u[0] and v[0] are the least significant digits. (The code works correctly on both big- and little-endian machines.) Parameters m and n are the number of halfwords in u and v, respectively (Knuth defines m to be the length of the quotient). The caller supplies space for the quotient q and, optionally, for the remainder r. The space for the quotient must be at least m – n + 1 halfwords, and for the remainder, n halfwords. Alternatively, a value of NULL can be given for the address of the remainder to signify that the remainder is not wanted.

The algorithm requires that the most significant digit of the divisor, v[n – 1], be nonzero. This simplifies the normalization steps and helps to ensure that the caller has allocated sufficient space for the quotient. The code checks that v[n – 1] is nonzero, and also the requirements that n ≥ 1 and m ≥ n. If any of these conditions are violated, it returns with an error code (return value 1).

After these checks, the code performs the division for the simple case in which the divisor is of length 1. This case is not singled out for speed; the rest of the algorithm requires that the divisor be of length 2 or more.

If the divisor is of length 2 or more, the algorithm normalizes the divisor by shifting it left just enough so that its high-order bit is 1. The dividend is shifted left the same amount, so the quotient is not changed by these shifts. As explained by Knuth, these steps are necessary to make it easy to guess each quotient digit with good accuracy. The number of leading zeros function, nlz(x), is used to determine the shift amount.

In the normalization steps, new space is allocated for the normalized dividend and divisor. This is done because it is generally undesirable, from the caller’s point of view, to alter these input arguments, and because it may be impossible to alter them—they may be constants in read-only memory. Furthermore, the dividend may need an additional high-order digit. C’s “alloca” function is ideal for allocating this space. It is usually implemented very efficiently, requiring only two or three in-line instructions to allocate the space and no instructions at all to free it. The space is allocated on the program’s stack, in such a way that it is freed automatically upon subroutine return.

In the main loop, the quotient digits are cranked out one per loop iteration, and the dividend is reduced until it becomes the remainder. The estimate qhat of each quotient digit, after being refined by the steps in the loop labeled again, is always either exact or too high by 1.

The next steps multiply qhat by the divisor and subtract the product from the current remainder, as in the grade-school method. If the remainder is negative, it is necessary to decrease the quotient digit by 1 and either re-multiply and subtract or, more simply, adjust the remainder by adding the divisor to it. This need be done at most once, because the quotient digit was either exact or 1 too high.

Lastly, the remainder is given back to the caller if the address of where to put it is non-null. The remainder must be shifted right by the normalization shift amount s.

The “add back” steps are executed only rarely. To see this, observe that the first calculation of each estimated quotient digit qhat is done by dividing the most significant two digits of the current remainder by the most significant digit of the divisor. The steps in the “again” loop amount to refining qhat to be the result of dividing the most significant three digits of the current remainder by the most significant two digits of the divisor (proof omitted; convince yourself of this by trying some examples using b = 10). Note that the divisor is greater than or equal to b/2 (because of normalization), and the dividend is less than or equal to b times the divisor (because each remainder is less than the divisor).

How accurate is the quotient estimated by using only three dividend digits and two divisor digits? Because normalization was done, it can be shown to be quite accurate. To see this somewhat intuitively (not a formal proof), consider estimating u / v in this way for base ten arithmetic. It can be shown that the estimate is always high (or exact). Thus, the worst case occurs if truncation of the divisor to two digits decreases the divisor by as much as possible in the sense of relative error, and truncation of the dividend to three digits decreases it by as little as possible (which is 0), and if the dividend is as large as possible. This occurs for the case 49900...0/5099...9, which we estimate by 499/50 = 9.98. The true result is approximately 499/51 ≈ 9.7843. The difference of 0.1957 reveals that the estimated quotient digit and the true quotient digit, which are the floor functions of these ratios, will differ by at most 1, and this will occur about 20% of the time (assuming the quotient digits are uniformly distributed). This, in turn, means that the “add back” steps will be executed about 20% of the time.

Carrying out this (non-rigorous) analysis for a general base b yields the result that the estimated and true quotients differ by at most 2 / b. For b = 65536, we again obtain the result that the difference between the estimated and true quotient digits is at most 1, and this occurs with probability 2/65536 ≈ 0.00003. Thus, the “add back” steps are executed for only about 0.003% of the quotient digits.

An example that requires the add back step is, in decimal, 4500/501. A similar example for base 65536 is 0x7FFF8000 00000000/0x8000 00000001.

We will not attempt to estimate the running time of this entire program, but simply note that for large m and n, the execution time is dominated by the multiply/subtract loop. On a good compiler this will compile into about 16 basic RISC instructions, one of which is multiply. The “for j” loop is executed mn + 1 times, and the multiply/subtract loop n times, giving an execution time for this part of the program of (15 + mul)n(mn + 1) cycles, where mul is the time to multiply two 16-bit variables. The program also executes mn + 1 divide instructions and one number of leading zeros instruction.

Signed Multiword Division

We do not give an algorithm specifically for signed multiword division, but merely point out that the unsigned algorithm can be adapted for this purpose as follows:

1. Negate the dividend if it is negative, and similarly for the divisor.

2. Convert the dividend and divisor to unsigned representation.

3. Use the unsigned multiword division algorithm.

4. Convert the quotient and remainder to signed representation.

5. Negate the quotient if the dividend and divisor had opposite signs.

6. Negate the remainder if the dividend was negative.

These steps sometimes require adding or deleting a most significant digit. For example, assume for simplicity that the numbers are represented in base 256 (one byte per digit), and that in the signed representation, the high-order bit of the sequence of digits is the sign bit. This is much like ordinary two’s-complement representation. Then, a divisor of 255, which has signed representation 0x00FF, must be shortened in step 2 to 0xFF. Similarly, if the quotient from step 3 begins with a 1-bit, it must be provided with a leading 0-byte for correct representation as a signed quantity.

9–3 Unsigned Short Division from Signed Division

By “short division” we mean the division of one single word by another (e.g., 32÷32 ⇒ 32). It is the form of division provided by the “/” operator, when the operands are integers, in C and many other high-level languages. C has both signed and unsigned short division, but some computers provide only signed division in their instruction repertoire. How can you implement unsigned division on such a machine? There does not seem to be any really slick way to do it, but we offer some possibilities here.

Using Signed Long Division

Even if the machine has signed long division (64÷32 ⇒ 32), unsigned short division is not as simple as you might think. In the XLC compiler for the IBM RS/6000, it is implemented as illustrated below for Image.

Image

The third line is really testing to see if Image. If d is algebraically less than or equal to 1 at this point, then because it is not equal to 1 (from the second line), it must be algebraically less than or equal to 0. We don’t care about the case d = 0, so for the cases of interest, if the test on the third line evaluates to true, the sign bit of d is on, that is, Image. Because from the first line it is known that Image, and because n cannot exceed 232 – 1, Image.

The notation on the fourth line means to form the double-length integer consisting of 32 0-bits followed by the 32-bit quantity n, and divide it by d. The test for d = 1 (second line) is necessary to ensure that this division does not overflow (it would overflow if Image, and then the quotient would be undefined).

By commoning the comparisons on the second and third lines,3 the above can be implemented in 11 instructions, three of which are branches. If it is necessary that the divide be executed when d = 0, to get the overflow interrupt, then the third line can be changed to “else if d < 0 then q ← 1,” giving a 12-instruction solution on the RS/6000.

It is a simple matter to alter the above code so that the probable usual cases Image do not go through so many tests (begin with if d ≤ 1 ...), but the code volume increases slightly.

Using Signed Short Division

This section is written for a 32-bit machine, but it applies to a 64-bit machine (that is, getting unsigned 64÷64 ⇒ 64 division from the same form of signed division) by changing all occurrences of 31 to 63. It can be used to get unsigned division in Java, which lacks unsigned integers.

If signed long division is not available, but signed short division is, then Image can be implemented by somehow reducing the problem to the case n, d < 231 and using the machine’s divide instruction. If Image, then Image can only be 0 or 1, so this case is easily dispensed with. Then, we can reduce the dividend by using the fact that the expression (Image approximates Image with an error of only 0 or 1. This leads to the following method:

Image

The test d < 0 on line 1 is really testing to determine if Image. If Image, then the largest the quotient could be is (232 – 1) ÷ 231 = 1, so the first two lines compute the correct quotient.

Line 4 represents the code shift right unsigned 1, divide, shift left 1. Clearly, Image, and at this point Image as well, so these quantities can be used in the computer’s signed division instruction. (If d = 0, overflow will be signaled here.)

The estimate computed at line 4 is

Image

where we have used the corollary of Theorem D3. Line 5 computes the remainder corresponding to the estimated quotient. It is

Image

Thus, 0r < 2d. If r < d, then q is the correct quotient. If rd, then adding 1 to q gives the correct quotient (the program must use an unsigned comparison here, because of the possibility that r ≥ 231).

By moving the load immediate of 0 into q ahead of the comparison Image, and coding the assignment q ← 1 in line 2 as a branch to the assignment qq + 1 in line 6, this can be coded in 14 instructions on most machines, four of which are branches. It is straightforward to augment the code to produce the remainder as well: to line 1 append rn, to line 2 append rnd, and to the “then” clause in line 6 append rrd. (Or, at the cost of a multiply, simply append rnqd to the end of the whole sequence.)

An alternative for lines 1 and 2 is

Image

which can be coded a little more compactly, for a total of 13 instructions, three of which are branches. But it executes more instructions in what is probably the usual case (small numbers with n > d).

Using predicate expressions, the program can be written

Image

which saves two branches if there is a way to evaluate the predicates without branching. On the basic RISC they can be evaluated in one instruction (CMPGEU); on MIPS they take two (SLTU, XORI). On most computers, they can be evaluated in four instructions each (three if equipped with a full set of logic instructions), by using the expression for Image given in “Comparison Predicates” on page 23, and simplifying because on line 1 of the program above it is known that d31 = 1, and on line 5 it is known that d31 = 0. The expression simplifies to

Image

We can get branch-free code by forcing the dividend to be 0 when Image. Then, the divisor can be used in the machine’s signed divide instruction, because when it is misinterpreted as a negative number, the result is set to 0, which is within 1 of being correct. We’ll still handle the case of a large dividend by shifting it one position to the right before the division, and then shifting the quotient one position to the left after the division. This gives the following program (ten basic RISC instructions):

Image

9–4 Unsigned Long Division

By “long division” we mean the division of a doubleword by a single word. For a 32-bit machine, this is Image division, with the result unspecified in the overflow cases, including division by 0.

Some 32-bit machines provide an instruction for unsigned long division. Its full capability, however, gets little use, because only Image division is accessible with most high-level languages. Therefore, a computer designer might elect to provide only Image division and would probably want an estimate of the execution time of a subroutine that implements the missing function. Here we give two algorithms for providing this missing function.

Hardware Shift-and-Subtract Algorithms

As a first attempt at doing long division, we consider doing what the hardware does. There are two algorithms commonly used, called restoring and nonrestoring division [H&P, sec. A-2; EL]. They are both basically “shift-and-subtract” algorithms. In the restoring version, shown below, the restoring step consists of adding back the divisor when the subtraction gives a negative result. Here x, y, and z are held in 32-bit registers. Initially, the double-length dividend is x || y, and the divisor is z. We need a single-bit register c to hold the overflow from the subtraction.

Image

Upon completion, the quotient is in register y and the remainder is in register x.

The algorithm does not give a useful result in the overflow cases. For division of the doubleword quantity x || y by 0, the quotient obtained is the one’s-complement of x, and the remainder obtained is y. In particular, Image rem 0. The other overflow cases are difficult to characterize.

It might be useful if, for nonzero divisors, the algorithm would give the correct quotient modulo 232, and the correct remainder. The only way to do this seems to be to make the register represented by c || x || y above 97 bits long, and do the loop 64 times. This is doing Image division. The subtractions would still be 33-bit operations, but the additional hardware and execution time make this refinement probably not worthwhile.

This algorithm is difficult to implement exactly in software, because most machines do not have the 33-bit register that we have represented by c || x. Figure 9–2, however, illustrates a shift-and-subtract algorithm that reflects the hardware algorithm to some extent.

The variable t is used for a device to make the comparison come out right. We want to do a 33-bit comparison after shifting x || y. If the first bit of x is 1 (before the shift), then certainly the 33-bit quantity is greater than the divisor (32 bits). In this case, x | t is all 1’s, so the comparison gives the correct result (true). On the other hand, if the first bit of x is 0, then a 32-bit comparison is sufficient.

The code of the algorithm in Figure 9–2 executes in 321 to 385 basic RISC instructions, depending upon how often the comparison is true. If the machine has shift left double, the shifting operation can be done in one instruction, rather than the four used above. This would reduce the execution time to about 225 to 289 instructions (we are allowing two instructions per iteration for loop control).

The algorithm in Figure 9–2 can be used to do Image division by supplying x = 0. The only simplification that results is that the variable t can be omitted, as its value would always be 0.


unsigned divlu(unsigned x, unsigned y, unsigned z) {
   // Divides (x || y) by z.
   int i;
   unsigned t;

   for (i = 1; i <= 32; i++) {
      t = (int)x >> 31;          // All 1’s if x(31) = 1.
      x = (x << 1) | (y >> 31);  // Shift x || y left
      y = y << 1;                // one bit.
      if ((x | t) >= z) {
         x = x - z;
         y = y + 1;
      }
   }
   return y;                     // Remainder is x.
}


FIGURE 9–2. Divide long unsigned, shift-and-subtract algorithm.

On the next page is the nonrestoring hardware division algorithm (unsigned). The basic idea is that, after subtracting the divisor z from the 33-bit quantity that we denote by c || x, there is no need to add back z if the result was negative. Instead, it suffices to add on the next iteration rather than subtract. This is because adding z (to correct the error of having subtracted z on the previous iteration), shifting left, and subtracting z is equivalent to adding z(2(u + z) – z = 2 u + z). The advantage to hardware is that there is only one add or subtract operation on each loop iteration, and the adder is likely to be the slowest circuit in the loop.4 An adjustment to the remainder is needed at the end if it is negative. (No corresponding adjustment of the quotient is required.)

The input dividend is the doubleword quantity x || y, and the divisor is z. Upon completion, the quotient is in register y and the remainder is in register x.

Image

This does not seem to adapt very well to a 32-bit algorithm.

The 801 minicomputer (an early experimental RISC machine built by IBM) had a divide step instruction that essentially performed the steps in the body of the loop above. It used the machine’s carry status bit to hold c and the MQ (a 32-bit register) to hold y. A 33-bit adder/subtracter is needed for its implementation. The 801’s divide step instruction was a little more complicated than the loop above, because it performed signed division and it had an overflow check. Using it, a division subroutine can be written that consists essentially of 32 consecutive divide step instructions followed by some adjustments to the quotient and remainder to make the remainder have the desired sign.

Using Short Division

An algorithm for Image division can be obtained from the multiword division algorithm of Figure 9–1 on page 185, by specializing it to the case m = 4, n = 2. Several other changes are necessary. The parameters should be fullwords passed by value, rather than arrays of halfwords. The overflow condition is different; it occurs if the quotient cannot be contained in a single fullword. It turns out that many simplifications to the routine are possible. It can be shown that the guess qhat is always exact; it is exact if the divisor consists of only two halfword digits. This means that the “add back” steps can be omitted. If the “main loop” of Figure 9–1 and the loop within it are unrolled, some minor simplifications become possible.

The result of these transformations is shown in Figure 9–3. The dividend is in u1 and u0, with u1 containing the most significant word. The divisor is parameter v. The quotient is the returned value of the function. If the caller provides a non-null pointer in parameter r, the function will return the remainder in the word to which r points.

For an overflow indication, the program returns a remainder equal to the maximum unsigned integer. This is an impossible remainder for a valid division operation, because the remainder must be less than the divisor. In the overflow case, the program also returns a quotient equal to the maximum unsigned integer, which may be an adequate indicator in some cases in which the remainder is not wanted.

The strange expression (-s >> 31) in the assignment to un32 is supplied to make the program work for the case s = 0 on machines that have mod 32 shifts (e.g., Intel x86).

Experimentation with uniformly distributed random numbers suggests that the bodies of the “again” loops are each executed about 0.38 times for each execution of the function. This gives an execution time, if the remainder is not wanted, of about 52 instructions. Of these instructions, one is number of leading zeros, two are divide, and 6.5 are multiply (not counting the multiplications by b, which are shift’s). If the remainder is wanted, add six instructions (counting the store of r), one of which is multiply.

What about a signed version of divlu? It would probably be difficult to modify the code of Figure 9–3, step by step, to produce a signed variant. That algorithm, however, can be used for signed division by taking the absolute value of the arguments, running divlu, and then complementing the result if the signs of the original arguments differ. There is no problem with extreme values such as the maximum negative number, because the absolute value of any signed integer has a correct representation as an unsigned integer. This algorithm is shown in Figure 9–4.

It is hard to devise really good code to detect overflow in the signed case. The algorithm shown in Figure 9–4 makes a preliminary determination identical to that used by the unsigned long division routine, which ensures that |u / v| < 232. After that, it is necessary only to ensure that the quotient has the proper sign or is 0.


unsigned divlu(unsigned u1, unsigned u0, unsigned v,
               unsigned *r) {
   const unsigned b = 65536;        // Number base (16 bits).
   unsigned un1, un0,               // Norm. dividend LSD’s.
            vn1, vn0,               // Norm. divisor digits.
            q1, q0,                 // Quotient digits.
            un32, un21, un10,       // Dividend digit pairs.
            rhat;                   // A remainder.
   int s;                           // Shift amount for norm.
 
   if (u1 >= v) {                   // If overflow, set rem.
      if (r != NULL)                // to an impossible value,
         *r = 0xFFFFFFFF;           // and return the largest
      return 0xFFFFFFFF;}           // possible quotient.
 
   s = nlz(v);                      // 0 <= s <= 31.
   v = v << s;                      // Normalize divisor.
   vn1 = v >> 16;                   // Break divisor up into
   vn0 = v & 0xFFFF;                // two 16-bit digits.
 
   un32 = (u1 << s) | (u0 >> 32 - s) & (-s >> 31);
   un10 = u0 << s;                  // Shift dividend left.
 
   un1 = un10 >> 16;                // Break right half of
   un0 = un10 & 0xFFFF;             // dividend into two digits.
 
   q1 = un32/vn1;                   // Compute the first
   rhat = un32 - q1*vn1;            // quotient digit, q1.
 again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
      q1 = q1 - 1;
      rhat = rhat + vn1;
      if (rhat < b) goto again1;}
 
   un21 = un32*b + un1 - q1*v;         // Multiply and subtract.
 
   q0 = un21/vn1;                    // Compute the second
   rhat = un21 - q0*vn1;             // quotient digit, q0.
 again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}
 
   if (r != NULL)                    // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
 }


FIGURE 9–3. Divide long unsigned, using fullword division instruction.


int divls(int u1, unsigned u0, int v, int *r) {
  int q, uneg, vneg, diff, borrow;

  uneg = u1 >> 31;        // -1 if u < 0.
  if (uneg) {             // Compute the absolute
     u0 = -u0;            // value of the dividend u.
     borrow = (u0 != 0);
     u1 = -u1 - borrow;}

  vneg = v >> 31;         // -1 if v < 0.
  v = (v ^ vneg) - vneg;  // Absolute value of v.

  if ((unsigned)u1 >= (unsigned)v) goto overflow;

  q = divlu(u1, u0, v, (unsigned *)r);

  diff = uneg ^ vneg;     // Negate q if signs of
  q = (q ^ diff) - diff;  // u and v differed.
  if (uneg && r != NULL)
     *r = -*r;

  if ((diff ^ q) < 0 && q != 0) {  // If overflow,
overflow:                 // set remainder
     if (r != NULL)       // to an impossible value,
       *r = 0x80000000;   // and return the largest
     q = 0x80000000;}     // possible neg. quotient.
  return q;
}


FIGURE 9–4. Divide long signed, using divide long unsigned.

9–5 Doubleword Division from Long Division

This section considers how to do 64 ÷ 64 ⇒ 64 division from 64 ÷ 32 ⇒ 32 division, for both the unsigned and signed cases. The algorithms that follow are most suited to a machine that has an instruction for long division (64 ÷ 32), at least for the unsigned case. It is also helpful if the machine has the number of leading zeros instruction. The machine may have either 32-bit or 64-bit registers, but we will assume that if it has 32-bit registers, then the compiler implements basic operations such as adds and shifts on 64-bit operands (the “long long” data type in C).

These functions are known as “_ _udivdi3” and “_ _divdi3” in the GNU C world, and similar names are used here.

Unsigned Doubleword Division

A procedure for this operation is shown in Figure 9–5.


unsigned long long udivdi3(unsigned long long u,
                           unsigned long long v) {

   unsigned long long u0, u1, v1, q0, q1, k, n;

   if (v >> 32 == 0) {        // If v < 2**32:
      if (u >> 32 < v)        // If u/v cannot overflow,
         return DIVU(u, v)    // just do one division.
            & 0xFFFFFFFF;
      else {                  // If u/v would overflow:
         u1 = u >> 32;        // Break u up into two
         u0 = u & 0xFFFFFFFF; // halves.
         q1 = DIVU(u1, v)     // First quotient digit.
            & 0xFFFFFFFF;
         k = u1 - q1*v;       // First remainder, < v.
         q0 = DIVU((k << 32) + u0, v)  // 2nd quot. digit.
            & 0xFFFFFFFF;
         return (q1 << 32) + q0;
      }
   }
                              // Here v >= 2**32.
   n = nlz64(v);              // 0 <= n <= 31.
   v1 = (v << n) >> 32;       // Normalize the divisor
                              // so its MSB is 1.
   u1 = u >> 1;               // To ensure no overflow.
   q1 = DIVU(u1, v1)          // Get quotient from
      & 0xFFFFFFFF;           // divide unsigned insn.
   q0 = (q1 << n) >> 31;      // Undo normalization and
                              // division of u by 2.
   if (q0 != 0)               // Make q0 correct or
      q0 = q0 - 1;            // too small by 1.
   if ((u - q0*v) >= v)
      q0 = q0 + 1;            // Now q0 is correct.
   return q0;
}


FIGURE 9–5. Unsigned doubleword division from long division.

This code distinguishes three cases: (1) the case in which a single execution of the machine’s unsigned long division instruction (DIVU) can be used, (2) the case in which (1) does not apply, but the divisor is a 32-bit quantity, and (3) the cases in which the divisor cannot be represented in 32 bits. It is not too hard to see that the above code is correct for cases (1) and (2). For case (2), think of the grade-school method of doing long division.

Case (3), though, deserves proof, because it is very close to not working in some cases. Notice that in this case only a single execution of DIVU is needed, but the number of leading zeros and multiply operations are needed.

For the proof, we need these basics (for integer variables):

Image
Image

From the first line in the section of the procedure of interest (we assume that v ≠ 0),

0 ≤ n ≤ 31.

In computing v1, the left shift clearly cannot overflow. Therefore,

Image

In computing q1, u1 and v1 are in range for the DIVU instruction and it cannot overflow. Hence,

q1 = u1 / v1.

In the first computation of q0, the left shift cannot overflow because q1 < 232 (because the maximum value of u1 is 263 – 1 and the minimum value of v1 is 231). Therefore,

q0 = q1/231 – n.

Now, for the main part of the proof, we want to show that

u / v q0u / v + 1,

which is to say, the first computation of q0 is the desired result or is that plus 1.

Using Equation (2) twice gives

Image

Using Equation (3) gives

Image

Using algebra to get this in the form u / v + something:

Image

This is of the form

Image

and we will now show that δ < 1.

δ is largest when rem(v, 232 – n) is as large as possible and, given that, when v is as small as possible. The maximum value of rem(v, 232 – n) is 232 – n – 1. Because of the way n is defined in terms of v, v ≥ 263 – n. Thus, the smallest value of v having that remainder is

263 – n + 232 – n – 1.

Therefore,

Image

By inspection, for n in its range of 0 to 31,

Image

Since u is at most 264 – 1, δ < 1. Because Image and δ < 1 (and obviously δ ≥ 0),

Image

To correct this result by subtracting 1 when necessary, we would like to code

if (u < q0*v) q0 = q0 - 1;

(i.e., if the remainder uq0v is negative, subtract 1 from q0). However, this doesn’t quite work, because q0 v can overflow (e.g., for u = 264 – 1 and v = 232 + 3). Instead, we subtract 1 from q0, so that it is either correct or too small by 1. Then q0 v will not overflow. We must avoid subtracting 1 if q0 = 0 (if q0 = 0, it is already the correct quotient).

Then the final correction is:

if ((u - q0*v) >= v) q0 = q0 - 1;

To see that this is a valid computation, we already noted that q0v does not overflow. It is easy to show that

0 ≤ uq0v < 2v.

If v is very large (≥ 263), can the subtraction overflow by trying to produce a result greater than v? No, because u < 264 and q0v ≥ 0.

Incidentally, there are alternatives to the lines

if (q0 != 0)     // Make q0 correct or
   q0 = q0 - 1   // too small by 1.

that may be preferable on some machines. One is to replace them with

if (q0 == 0) return 0;

Another is to place at the beginning of this section of the procedure, or at the beginning of the whole procedure, the line

if (u < v) return 0; // Avoid a problem later.

These alternatives are preferable if branches are not costly. The code shown in Figure 9–5 works well if the machine’s comparison instructions produce a 0/1 integer result in a general register. Then, the compiler can change it to, in effect,

q0 = q0 - (q0 != 0);

(or you can code it that way if your compiler doesn’t do this optimization). This is just a compare and subtract on such machines.

Signed Doubleword Division

In the signed case, there seems to be no better way to do doubleword division than to divide the absolute values of the operands, using function udivdi3, and then negate the sign of the quotient if the operands have different signs. If the machine has a signed long division instruction, which we designate here as DIVS, then it may be advantageous to single out the cases in which DIVS can be used rather than invoking udivdi3. This presumes that these cases are common. Such a function is shown in Figure 9–6.

The “#define” in the code in Figure 9–6 uses the GCC facility of enclosing a compound statement in parentheses to construct an expression, a facility that most C compilers do not have. Some other compilers may have llabs(x) as a built-in function.


#define llabs(x)
({unsigned long long t = (x) >> 63; ((x) ^ t) - t;})

long long divdi3(long long u, long long v) {

   unsigned long long au, av;
   long long q, t;

   au = llabs(u);
   av = llabs(v);
   if (av >> 31 == 0) {         // If |v| < 2**31 and
      if (au < av << 31) {      // |u|/|v| cannot
         q = DIVS(u, v);        // overflow, use DIVS.
         return (q << 32) >> 32;
      }
   }
   q = au/av;                   // Invoke udivdi3.
   t = (u ^ v) >> 63;           // If u, v have different
   return (q ^ t) - t;          // signs, negate q.
}


FIGURE 9–6. Signed doubleword division from unsigned doubleword division.

The test that v is in range is not precise; it misses the case in which v = –231. If it is important to use the DIVS instruction in that case, the test

if ((v << 32) >> 32 == v) { // If v is in range and

can be used in place of the third executable line in Figure 9–6 (at a cost of one instruction). Similarly, the test that |u| / |v| cannot overflow is simplified and a few “corner cases” will be missed; the code amounts to using δ = 0 in the signed division overflow test scheme shown in “Division” on page 34.

Exercises

1. Show that for real x, x = – x .

2. Find branch-free code for computing the quotient and remainder of modulus division on a basic RISC that has division and remainder instructions for truncating division.

3. Similarly, find branch-free code for computing the quotient and remainder of floor division on a basic RISC that has division and remainder instructions for truncating division.

4. How would you compute n / d for unsigned integers n and d, 0 ≤ n ≤ 232 – 1 and 1 ≤ d ≤ 232 – 1? Assume your machine has an unsigned divide instruction that computes n / d .

5. Theorem D3 states that for x real and d an integer, x / d = x / d. Show that, more generally, if a function f(x) is (a) continuous, (b) monotonically increasing, and (c) has the property that if f(x) is an integer then x is an integer, then f(x) = f(x) [GKP].

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset