First analyze the main arithmetic operations and generate the corresponding computation algorithms.
Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$, their sum $s \cdot B^e$ is computed as follows.
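Assume that $e_1$ is greater than or equal to $e_2$; then (alignment) the sum of $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$ can be expressed in the form $s \cdot B^e$, where $s = s_1 + s_2 / B^{e_1 - e_2}$ and $e = e_1$.
The value of $s$ belongs to the interval $1 \le s \le 2(B - \mathrm{ulp})$,
so that $s$ could be greater than or equal to $B$. If that is the case, that is, if $B \le s \le 2(B - \mathrm{ulp})$,
then (normalization) substitute $s$ by $s/B$ and $e$ by $e + 1$, so that the value of $s \cdot B^e$ is the same as before, and the new value of $s$ satisfies $1 \le s \le 2 - 2\,\mathrm{ulp}/B \le B - \mathrm{ulp}$.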
The significands $s_1$ and $s_2$ of the operands are multiples of ulp. If $e_1$ is greater than $e_2$, the value of $s$ may no longer be a multiple of ulp, and some rounding function must be applied to $s$. Assume that $s' < s < s''$,
$s'$ and $s''$ being two successive multiples of ulp. Then the rounding function associates to $s$ either $s'$ or $s''$, according to some rounding strategy. According to (16.9) and to the fact that 1 and $B - \mathrm{ulp}$ are multiples of ulp, it follows that $1 \le \mathrm{rounding}(s) \le B - \mathrm{ulp}$.
Nevertheless, if condition (16.8) does not hold, that is, if $s < B$,
$s$ could belong to the interval $B - \mathrm{ulp} < s < B$,
so that rounding($s$) could be equal to $B$. A new normalization step would then be necessary, that is, substitution of $s = B$ by $s = 1$ and of $e$ by $e + 1$.
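The choice between $s'$ and $s''$ can be illustrated with a short Python sketch. The names `neighbours` and `rounding`, the three modes, and the tie rule (ties go up) are assumptions made for illustration only, since the text does not fix a rounding strategy. Exact rationals are used so that binary floating-point error does not interfere.

```python
from fractions import Fraction
import math

ULP = Fraction(1, 10**4)  # assumed: ulp = 10^-4, as in the examples of this section

def neighbours(s):
    """Return (s', s''): the multiples of ulp just below and just above s."""
    lo = math.floor(s / ULP) * ULP
    hi = math.ceil(s / ULP) * ULP
    return lo, hi

def rounding(s, mode="nearest"):
    """Map s to s' or s'' according to an (assumed) rounding strategy."""
    lo, hi = neighbours(s)
    if mode == "down":
        return lo
    if mode == "up":
        return hi
    # nearest multiple of ulp; a tie is rounded up (assumption)
    return lo if s - lo < hi - s else hi
```

With $B = 10$, the value 9.99997 lies between 9.9999 and 10, so rounding up or to nearest returns $B$ itself and the extra normalization step described above is required.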
Algorithm 16.1 Sum of Positive Numbers
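if e1>=e2 then
  e:=e1; s:=s1+(s2/B**(e1-e2));  -- align s2 to the larger exponent
else
  e:=e2; s:=(s1/B**(e2-e1))+s2;  -- align s1 to the larger exponent
end if;
if s>=B then e:=e+1; s:=s/B; end if;  -- normalization
s:=round(s);
if s>=B then e:=e+1; s:=s/B; end if;  -- renormalization if rounding produced B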
Examples 16.2 Assume that $B = 10$ and $\mathrm{ulp} = 10^{-4}$, so that the numbers are represented in the form $s \cdot 10^e$, where $1 \le s \le 9.9999$.
1. Compute $z = (3.4375 \times 10^3) + (2.5491 \times 10^{-1})$:
2. Compute $z = (9.4375 \times 10^3) + (8.6247 \times 10^2)$:
3. Compute $z = (9.4375 \times 10^3) + (5.6247 \times 10^2)$:
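Algorithm 16.1 can be traced on these three examples with the following Python sketch. It assumes round-to-nearest with ties rounded up (the text leaves the rounding strategy open), and the names `add_positive` and `round_to_ulp` are hypothetical. Significands are exact rationals.

```python
from fractions import Fraction
import math

B = 10
ULP = Fraction(1, 10**4)

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return math.floor(s / ULP + Fraction(1, 2)) * ULP

def add_positive(s1, e1, s2, e2):
    """Sum of two positive numbers s1*B**e1 and s2*B**e2; returns (s, e)."""
    if e1 >= e2:                      # alignment on the larger exponent
        e, s = e1, s1 + s2 / B**(e1 - e2)
    else:
        e, s = e2, s1 / B**(e2 - e1) + s2
    if s >= B:                        # normalization
        e, s = e + 1, s / B
    s = round_to_ulp(s)
    if s >= B:                        # rounding produced B: normalize again
        e, s = e + 1, s / B
    return s, e
```

Under these assumptions the three examples give $3.4378 \times 10^3$, $1.0300 \times 10^4$ (one normalization before rounding), and $1.0000 \times 10^4$ (rounding yields 10.0000, so a second normalization is needed).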
Comment 16.1 The addition of two positive numbers could produce an overflow, as the final value of $e$ could be greater than $e_{max}$.
Given two positive floating-point numbers $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$, their difference $s \cdot B^e$ is computed as follows.
Assume that $e_1$ is greater than or equal to $e_2$; then (alignment) the difference between $s_1 \cdot B^{e_1}$ and $s_2 \cdot B^{e_2}$ can be expressed in the form $s \cdot B^e$, where $s = s_1 - s_2 / B^{e_1 - e_2}$ and $e = e_1$.
The value of $s$ belongs to the interval $-(B - \mathrm{ulp}) < s < B - \mathrm{ulp}$.
If $s$ is negative, then it is substituted by $-s$ and the sign of the final result is modified accordingly. If $s$ is equal to 0, an exception equal_zero could be raised. It remains to consider the case where $0 < s < B - \mathrm{ulp}$.
The value of $s$ could be smaller than 1. In order to normalize the significand, a procedure
procedure leading_zeroes(s: in fixed_point; k: out natural)
must be executed: it counts the number of leading zeros of the representation of $s$. In other words, it looks for the minimum exponent $k$ such that $s \cdot B^k \ge 1$. Then $s$ is substituted by $s \cdot B^k$ and $e$ by $e - k$. Thus, the relation (16.10) holds, that is, $1 \le s < B$.
It remains to round (up or down) the significand and to normalize it if necessary.
Algorithm 16.2 Difference of Positive Numbers
if e1>=e2 then
  e:=e1; s:=s1-(s2/B**(e1-e2));
else
  e:=e2; s:=(s1/B**(e2-e1))-s2;
end if;
if s<0 then s:=-s; sign:=1; end if;
leading_zeroes(s, k); s:=s*(B**k); e:=e-k;  -- normalization
s:=round(s);
if s>=B then e:=e+1; s:=s/B; end if;
Examples 16.3 Assume again that $B = 10$ and $\mathrm{ulp} = 10^{-4}$, so that the numbers are represented in the form $s \cdot 10^e$, where $1 \le s \le 9.9999$. For computing the difference, the 10's complement system is used.
1. Compute $z = (3.4518 \times 10^{-1}) - (7.2471 \times 10^3)$:
2. Compute $z = (1.0014 \times 10^3) - (9.9491 \times 10^2)$:
3. Compute $z = (1.0714 \times 10^4) - (7.1403 \times 10^2)$:
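The steps of Algorithm 16.2 can be followed on these examples with a Python sketch. It is not the 10's complement datapath mentioned above: signed exact rationals are used instead, which gives the same values. Round-to-nearest with ties up, the function names, and the use of `ArithmeticError` for the equal_zero exception are assumptions.

```python
from fractions import Fraction
import math

B = 10
ULP = Fraction(1, 10**4)

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return math.floor(s / ULP + Fraction(1, 2)) * ULP

def leading_zeroes(s):
    """Minimum k such that s * B**k >= 1 (s must be positive)."""
    k = 0
    while s * B**k < 1:
        k += 1
    return k

def sub_positive(s1, e1, s2, e2):
    """Difference s1*B**e1 - s2*B**e2; returns (sign, s, e)."""
    if e1 >= e2:
        e, s = e1, s1 - s2 / B**(e1 - e2)
    else:
        e, s = e2, s1 / B**(e2 - e1) - s2
    sign = 0
    if s < 0:
        s, sign = -s, 1
    if s == 0:
        raise ArithmeticError("equal_zero")
    k = leading_zeroes(s)             # normalization of a small significand
    s, e = s * B**k, e - k
    s = round_to_ulp(s)
    if s >= B:
        e, s = e + 1, s / B
    return sign, s, e
```

With these assumptions the results are $-7.2468 \times 10^3$, $6.4900 \times 10^0$ (three leading zeros removed), and $1.0000 \times 10^4$ (one leading zero removed, then rounding to 10.0000 forces a final normalization).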
Comment 16.2 The difference of two positive numbers could produce an underflow, as the final value of $e$ could be smaller than $e_{min}$.
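Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and $(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, and a control variable operation, an algorithm is defined for computing $z = (-1)^{sign} \cdot s \cdot B^e$, equal to the sum of the two numbers if operation = 0 and to their difference if operation = 1.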
Once the significands have been aligned, the actual operation (addition or subtraction of the significands) depends on the values of operation, sign1, and sign2 (Table 16.1).
The following algorithm, based on Algorithms 16.1 and 16.2 as well as Table 16.1, computes z.
Algorithm 16.3 Addition and Subtraction
if e1>=e2 then
  e:=e1; s2:=s2/B**(e1-e2);
else
  e:=e2; s1:=s1/B**(e2-e1);
end if;
sign:=sign1;
if (operation xor sign1 xor sign2)=0 then
  s:=s1+s2;  -- effective addition
  if s>=B then e:=e+1; s:=s/B; end if;
  s:=round(s);
  if s>=B then e:=e+1; s:=s/B; end if;
else
  s:=s1-s2;  -- effective subtraction
  if s<0 then s:=-s; sign:=1-sign; end if;
  leading_zeroes(s, k); s:=s*(B**k); e:=e-k;
  s:=round(s);
  if s>=B then e:=e+1; s:=s/B; end if;
end if;
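For a hardware implementation, the following equivalent algorithm is better suited: the operands are swapped so that only $s_2$ has to be aligned, and the rounding and final normalization steps are shared by both branches.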
Algorithm 16.4 Addition and Subtraction, Second Version
if operation=1 then sign2:=1-sign2; end if;
if e1<e2 then swap(sign1, sign2); swap(s1, s2); swap(e1, e2); end if;
e:=e1; s2:=s2/B**(e1-e2);  -- single alignment shift
sign:=sign1;
if (sign xor sign2)=0 then
  s:=s1+s2;
  if s>=B then e:=e+1; s:=s/B; end if;
else
  if (e1=e2) and (s1<s2) then swap(s1, s2); sign:=1-sign; end if;
  s:=s1-s2;
  leading_zeroes(s, k); s:=s*(B**k); e:=e-k;
end if;
s:=round(s);  -- shared rounding and normalization
if s>=B then e:=e+1; s:=s/B; end if;
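A Python sketch of Algorithm 16.4 may help to check the sign handling. Two points are assumptions that go beyond the algorithm as written: a zero result is returned explicitly as `(0, 0, 0)` (otherwise the search for leading zeros would not terminate), and rounding is to nearest with ties up. The function names are hypothetical, with $B = 10$ and $\mathrm{ulp} = 10^{-4}$ as in the examples.

```python
from fractions import Fraction
import math

B = 10
ULP = Fraction(1, 10**4)

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return math.floor(s / ULP + Fraction(1, 2)) * ULP

def leading_zeroes(s):
    k = 0
    while s * B**k < 1:
        k += 1
    return k

def add_sub(operation, sign1, s1, e1, sign2, s2, e2):
    """operation = 0: sum, operation = 1: difference; returns (sign, s, e)."""
    if operation == 1:
        sign2 = 1 - sign2             # x - y = x + (-y)
    if e1 < e2:                       # make operand 1 the one with larger exponent
        sign1, sign2 = sign2, sign1
        s1, s2 = s2, s1
        e1, e2 = e2, e1
    e = e1
    s2 = s2 / B**(e1 - e2)            # single alignment shift
    sign = sign1
    if sign == sign2:                 # effective addition
        s = s1 + s2
        if s >= B:
            e, s = e + 1, s / B
    else:                             # effective subtraction
        if e1 == e2 and s1 < s2:
            s1, s2 = s2, s1
            sign = 1 - sign
        s = s1 - s2
        if s == 0:
            return 0, Fraction(0), 0  # assumed convention for a zero result
        k = leading_zeroes(s)
        s, e = s * B**k, e - k
    s = round_to_ulp(s)               # shared rounding and normalization
    if s >= B:
        e, s = e + 1, s / B
    return sign, s, e
```

For instance, `add_sub(1, 0, Fraction("3.4518"), -1, 0, Fraction("7.2471"), 3)` swaps the operands, performs an effective subtraction, and returns sign 1 with significand 7.2468 and exponent 3.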
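Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and $(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their product $(-1)^{sign} \cdot s \cdot B^e$ is computed as follows: $sign = sign_1 \oplus sign_2$, $s = s_1 \cdot s_2$, and $e = e_1 + e_2$.
The value of $s$ belongs to the interval $1 \le s \le (B - \mathrm{ulp})^2$
and could be greater than or equal to $B$. If that is the case, that is, if $B \le s \le (B - \mathrm{ulp})^2$,
then (normalization) substitute $s$ by $s/B$ and $e$ by $e + 1$. The new value of $s$ satisfies $1 \le s \le (B - \mathrm{ulp})^2 / B = B - (2 - \mathrm{ulp}/B) \cdot \mathrm{ulp} < B - \mathrm{ulp}$
($\mathrm{ulp} < B$, so that $2 - \mathrm{ulp}/B > 1$).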
It remains to round the significand and to normalize if necessary.
Algorithm 16.5 Multiplication
sign:=sign1 xor sign2;
s:=s1*s2; e:=e1+e2;
if s>=B then e:=e+1; s:=s/B; end if;  -- normalization
s:=round(s);
if s>=B then e:=e+1; s:=s/B; end if;  -- renormalization if rounding produced B
Examples 16.4 Assume that $B = 10$ and $\mathrm{ulp} = 10^{-4}$, so that the numbers are represented in the form $s \cdot 10^e$, where $1 \le s \le 9.9999$.
1. Compute $z = (3.4382 \times 10^3) \times (2.5471 \times 10^{-1})$:
2. Compute $z = (9.4300 \times 10^3) \times (8.6200 \times 10^2)$:
3. Compute $z = (4.7619 \times 10^2) \times (2.1000 \times 10^3)$:
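Algorithm 16.5 can be applied to these examples with the Python sketch below. As before, round-to-nearest with ties up is an assumption, and `multiply` and `round_to_ulp` are hypothetical names; the significand product is computed exactly before rounding.

```python
from fractions import Fraction
import math

B = 10
ULP = Fraction(1, 10**4)

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return math.floor(s / ULP + Fraction(1, 2)) * ULP

def multiply(sign1, s1, e1, sign2, s2, e2):
    """Product of two floating-point numbers; returns (sign, s, e)."""
    sign = sign1 ^ sign2
    s = s1 * s2                       # exact product, 1 <= s <= (B - ulp)**2
    e = e1 + e2
    if s >= B:                        # normalization
        e, s = e + 1, s / B
    s = round_to_ulp(s)
    if s >= B:                        # rounding produced B
        e, s = e + 1, s / B
    return sign, s, e
```

Under these assumptions the exact significand products are 8.75743922, 81.2866, and 9.99999, and the results are $8.7574 \times 10^2$, $8.1287 \times 10^6$, and $1.0000 \times 10^6$ respectively; the third case needs the normalization after rounding.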
Comment 16.3 The product of two real numbers could produce an overflow, as the final value of $e$ could be greater than $e_{max}$.
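Given two floating-point numbers $(-1)^{sign_1} \cdot s_1 \cdot B^{e_1}$ and $(-1)^{sign_2} \cdot s_2 \cdot B^{e_2}$, their quotient $(-1)^{sign} \cdot s \cdot B^e$ is computed as follows: $sign = sign_1 \oplus sign_2$, $s = s_1 / s_2$, and $e = e_1 - e_2$.
The value of $s$ belongs to the interval $1/B < s < B$
and could be smaller than 1. If that is the case, that is, if $s = s_1/s_2 < 1$, then $s_1 < s_2$, so that $s_1 \le s_2 - \mathrm{ulp}$,
and $s = s_1/s_2 \le 1 - \mathrm{ulp}/s_2 < 1 - \mathrm{ulp}/B$.
Then (normalization) substitute $s$ by $s \cdot B$ and $e$ by $e - 1$. The new value of $s$ satisfies $1 < s < B - \mathrm{ulp}$.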
It remains to round the significand.
Algorithm 16.6 Division
sign:=sign1 xor sign2;
s:=s1/s2; e:=e1-e2;
if s<1 then e:=e-1; s:=s*B; end if;  -- normalization
s:=round(s);
Examples 16.5 Assume that $B = 10$ and $\mathrm{ulp} = 10^{-4}$, so that the numbers are represented in the form $s \cdot 10^e$, where $1 \le s \le 9.9999$.
1. Compute $z = (3.4375 \times 10^3) / (2.5491 \times 10^{-1})$:
2. Compute $z = (2.5491 \times 10^{-1}) / (3.4375 \times 10^3)$:
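The two quotients can be evaluated with a Python sketch of Algorithm 16.6. The quotient of the significands is kept as an exact rational and rounded once; round-to-nearest with ties up and the names `divide` and `round_to_ulp` are assumptions.

```python
from fractions import Fraction
import math

B = 10
ULP = Fraction(1, 10**4)

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return math.floor(s / ULP + Fraction(1, 2)) * ULP

def divide(sign1, s1, e1, sign2, s2, e2):
    """Quotient of two floating-point numbers; returns (sign, s, e).

    Both significands are assumed normalized (1 <= s < B), so s2 != 0.
    """
    sign = sign1 ^ sign2
    s = s1 / s2                       # exact quotient, 1/B < s < B
    e = e1 - e2
    if s < 1:                         # normalization
        e, s = e - 1, s * B
    return sign, round_to_ulp(s), e
```

Under these assumptions the first quotient is $1.3485 \times 10^4$ (no normalization) and the second is $7.4156 \times 10^{-5}$ (the significand quotient 0.74155... is below 1, so it is multiplied by $B$ and the exponent is decremented).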
Comment 16.4 The quotient of two real numbers could produce an underflow, as the final value of $e$ could be smaller than $e_{min}$.
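Given a positive floating-point number $s_1 \cdot B^{e_1}$, its square root $s \cdot B^e$ is computed as follows: if $e_1$ is even (first case), then $s = \sqrt{s_1}$ and $e = e_1/2$; if $e_1$ is odd (second case), then $s = \sqrt{s_1/B}$ and $e = (e_1 + 1)/2$.
In the first case (16.22), $1 \le s \le \sqrt{B - \mathrm{ulp}}$.
In the second case (16.23), $1/\sqrt{B} \le s < 1$,
and (normalization) $s$ must be substituted by $s \cdot B$ and $e$ by $e - 1$, so that $\sqrt{B} \le s < B$.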
It remains to round the significand and to normalize if necessary.
Algorithm 16.7 Square Root
if (e1 mod 2)=1 then s1:=s1/B; e1:=e1+1; end if;  -- make the exponent even
s:=square_root(s1); e:=e1/2;
if s<1 then e:=e-1; s:=s*B; end if;  -- normalization
s:=round(s);
if s>=B then e:=e+1; s:=s/B; end if;  -- renormalization if rounding produced B
Examples 16.6 Assume that $B = 10$ and $\mathrm{ulp} = 10^{-4}$, so that the numbers are represented in the form $s \cdot 10^e$, where $1 \le s \le 9.9999$.
1. Compute $z = (9.9491 \times 10^2)^{1/2}$:
2. Compute $z = (3.4518 \times 10^{-1})^{1/2}$:
3. Compute $z = (9.9999 \times 10^3)^{1/2}$:
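A Python sketch of Algorithm 16.7 follows. Since a square root is generally irrational, it uses the standard `decimal` module with 50 significant digits as a stand-in for an exact `square_root`, and then rounds to the nearest multiple of ulp with ties up. Both choices, as well as the name `fp_sqrt`, are assumptions; the final result depends on the rounding strategy actually adopted.

```python
from decimal import Decimal, getcontext, ROUND_HALF_UP

getcontext().prec = 50        # assumed working precision for square_root
B = Decimal(10)
ULP = Decimal("0.0001")

def round_to_ulp(s):
    # nearest multiple of ulp, ties rounded up (assumed strategy)
    return s.quantize(ULP, rounding=ROUND_HALF_UP)

def fp_sqrt(s1, e1):
    """Square root of the positive number s1*B**e1; returns (s, e)."""
    if e1 % 2 == 1:           # odd exponent: make it even first
        s1, e1 = s1 / B, e1 + 1
    s = s1.sqrt()
    e = e1 // 2
    if s < 1:                 # normalization (only after an odd exponent)
        e, s = e - 1, s * B
    s = round_to_ulp(s)
    if s >= B:                # rounding produced B
        e, s = e + 1, s / B
    return s, e
```

With round-to-nearest, the three examples give $3.1542 \times 10^1$, $5.8752 \times 10^{-1}$, and $9.9999 \times 10^1$. In the third example the normalized significand is 9.999949999..., just below the midpoint between 9.9999 and 10, so only a strategy that rounds up would return $B$ and trigger the final normalization to $1.0000 \times 10^2$.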
Comment 16.5 The square root of a real number could produce an underflow, as the final value of $e$ could be smaller than $e_{min}$.