For the real numbers, there are two kinds of approximate representation: fixed-point and floating-point numeration systems. The fixed-point system is a simple extension of the integer representation system; it allows the representation of a relatively small range of numbers with a constant absolute precision. The floating-point system allows the representation of a very large range of numbers with a constant relative precision.
Definitions 3.9
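In a fixed-point numeration system with $p$ fractional digits, the number represented by a sequence of digits is $x/B^p$, where $x$ is the integer represented by the same sequence of digits without the point.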
A real number $x$ can be represented in the form (3.21) with some error, equal to the absolute value of the difference between $x$ and its representation.
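Example 3.6 The range of numbers $x$ that can be represented in B's complement, with $B = 10$, $n = 9$ digits, and ulp $= 10^{-3}$, is
$$-500000.000 \le x \le 499999.999.$$
The following numbers can be exactly represented:
$$-500000.000,\ -499999.999,\ \ldots,\ -0.001,\ 0.000,\ 0.001,\ \ldots,\ 499999.999.$$
The distance between them is equal to ulp = 0.001.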
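The figures of Example 3.6 can be checked with a short Python sketch. It assumes an even base $B$, $n$ digits in B's complement of which $p$ are fractional, and rounding to the nearest multiple of the ulp; the function names `fixed_point_range` and `quantize` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def fixed_point_range(B, n, p):
    """Return (min, max, ulp) of an n-digit B's complement fixed-point
    system with p fractional digits, as exact fractions (B assumed even)."""
    ulp = Fraction(1, B ** p)
    half = B ** n // 2  # underlying integers range over [-B^n/2, B^n/2 - 1]
    return -half * ulp, (half - 1) * ulp, ulp

def quantize(x, B, p):
    """Round x to the nearest multiple of the ulp.
    Returns (represented value, absolute error)."""
    ulp = Fraction(1, B ** p)
    x = Fraction(x)
    k = round(x / ulp)  # integer represented by the digits without the point
    r = k * ulp
    return r, abs(x - r)

lo, hi, ulp = fixed_point_range(10, 9, 3)
print(float(lo), float(hi), float(ulp))

value, error = quantize(Fraction("3.14159"), 10, 3)
print(float(value), float(error))
```

Exact rationals are used instead of binary floats so that decimal quantities such as 0.001 are not themselves subject to rounding in the check.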
and the minimum absolute value of a represented number is
Comment 3.6 In a floating-point system, with q digits for representing the absolute value s of the significand and t digits for representing the exponent, the range of positive numbers is
and the maximum relative error is equal to .
In a fixed-point system with q + t digits, the range of positive numbers is
the maximum error is equal to
and the maximum relative error is equal to .
In order to compare both systems, one can compute the quotient rr (relative range) between the maximum and the minimum value of x (x positive). In the floating-point system
and in the fixed-point system
Taking into account that it is obvious that
Nevertheless, the maximum relative errors are equal. As regards the maximum errors, their values depend on the ulp (not necessarily the same value in both cases).
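As a rough numerical illustration of the comparison in Comment 3.6, the following Python sketch computes the relative range rr of both systems under stated assumptions: the significand magnitude is taken to be an unnormalized $q$-digit integer between 1 and $B^q - 1$, the $t$-digit exponent is taken to cover $B^t$ consecutive values, and the fixed-point magnitudes are taken to run from 1 to $B^{q+t} - 1$ ulps. These conventions, and the function names, are assumptions made for the illustration.

```python
def rr_floating(B, q, t):
    """Relative range max/min of positive values, assuming a significand
    in [1, B^q - 1] and an exponent spanning B^t consecutive values."""
    return (B ** q - 1) * B ** (B ** t - 1)

def rr_fixed(B, q, t):
    """Relative range max/min of positive values for q + t fixed-point
    digits, assuming magnitudes from 1 to B^(q+t) - 1 ulps."""
    return B ** (q + t) - 1

# Same total number of digits, very different relative ranges.
for B, q, t in [(2, 4, 3), (2, 24, 8), (10, 6, 2)]:
    ratio = rr_floating(B, q, t) // rr_fixed(B, q, t)
    print(B, q, t, "floating/fixed ratio has", len(str(ratio)), "decimal digits")
```

Under these assumptions the floating-point relative range grows with $B^{B^t}$ while the fixed-point one grows only with $B^{q+t}$, which is the gap the comment points to.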
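Example 3.7 In the ANSI/IEEE ([ANS1985]) single-precision floating-point system, the significand is a sign-magnitude fixed-point number
$$\pm 1.s_{-1} s_{-2} \ldots s_{-23},$$
where $s_{-1} s_{-2} \ldots s_{-23}$ is called the mantissa, and the exponent is an excess-127 integer $e_7 e_6 \ldots e_0$. The 32-bit word
$$\text{sign}\ \ e_7 e_6 \ldots e_0\ \ s_{-1} s_{-2} \ldots s_{-23}$$
represents the number
$$(-1)^{\text{sign}} \cdot (1.s_{-1} s_{-2} \ldots s_{-23}) \cdot 2^{e},$$
where
$$e = e_7 \cdot 2^7 + e_6 \cdot 2^6 + \cdots + e_0 \cdot 2^0 - 127.$$
Thus
$$e_{\min} = -127, \quad e_{\max} = 128.$$
Nevertheless, $e_{\min}$ and $e_{\max}$ are not used for representing ordinary numbers; they are used for representing $\pm 0$, $\pm\infty$, and other nonordinary numbers. The actual minimum and maximum values are
$$e_{\min} = -126, \quad e_{\max} = 127,$$
so that the range of represented numbers is
$$-(2 - 2^{-23}) \cdot 2^{127} \le x \le (2 - 2^{-23}) \cdot 2^{127},$$
that is, approximately $-3.4 \cdot 10^{38} \le x \le 3.4 \cdot 10^{38}$, and the minimum positive represented number is $1 \cdot 2^{-126}$.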
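The single-precision layout (1 sign bit, 8 exponent bits in excess-127, 23 mantissa bits) can be inspected with Python's standard `struct` module. The sketch below is an illustration only; `decode_single` and `value_of` are hypothetical helper names, and `value_of` handles ordinary (normalized) numbers only.

```python
import struct

def decode_single(x):
    """Split the 32-bit single-precision encoding of x into
    (sign bit, stored 8-bit exponent field, 23-bit mantissa field)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    stored_exp = (word >> 23) & 0xFF
    mantissa = word & 0x7FFFFF
    return sign, stored_exp, mantissa

def value_of(sign, stored_exp, mantissa):
    """Value of an ordinary number: (-1)^sign * 1.mantissa * 2^(stored_exp - 127).
    Stored exponents 0 and 255 are reserved for nonordinary numbers."""
    if not 1 <= stored_exp <= 254:
        raise ValueError("reserved exponent field: not an ordinary number")
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (stored_exp - 127)

print(decode_single(1.0))
print(decode_single(-0.15625))
print(value_of(0, 1, 0))           # smallest positive ordinary number
print(value_of(0, 254, 0x7FFFFF))  # largest ordinary number
```

The stored exponent field runs from 1 to 254 for ordinary numbers, which corresponds to actual exponents from -126 to 127 once the excess of 127 is subtracted.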