3.3 REAL NUMBERS

Real numbers can only be approximated, and there are two types of approximation: fixed-point and floating-point numeration systems. The fixed-point system is a simple extension of the integer representation system; it allows the representation of a relatively small range of numbers with a constant absolute precision. The floating-point system allows the representation of a very large range of numbers with a constant relative precision.

Definitions 3.9

  1. In a fixed-point numeration system, the number represented in the form

    $x_{n-p-1}\, x_{n-p-2} \ldots x_0 \,.\, x_{-1}\, x_{-2} \ldots x_{-p}$    (3.21)

    is $x/B^p$, where x is the integer represented by the same sequence of digits without the point.

  2. Let $x_{\min}$ and $x_{\max}$ be the minimum and maximum integers that can be represented with n digits, that is, $x_{\min} = 1 - B^{n-1}$ and $x_{\max} = B^{n-1} - 1$ in sign-magnitude representation, and $x_{\min} = -B^n/2$ and $x_{\max} = B^n/2 - 1$ in B's complement or excess-$B^n/2$ representation. Then, any real number x belonging to the interval

    $x_{\min} \cdot B^{-p} \le x \le x_{\max} \cdot B^{-p}$

    can be represented in the form (3.21) with some error equal to the absolute value of the difference between x and its representation.

  3. The distance d between exactly represented numbers is equal to the unit in the least significant position (ulp), that is, $B^{-p}$, so that the maximum error is equal to

    $d/2 = \mathrm{ulp}/2 = B^{-p}/2$.

  4. The maximum relative error is equal to image then image so that the maximum relative error is less than or equal to image.

Example 3.6 The range of numbers x that can be represented in B's complement, with B = 10, n = 9 digits, and ulp = $10^{-3}$ is

$-500000 \le x \le 499999.999$.

The following numbers can be exactly represented:

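$-500000.000,\ -499999.999,\ -499999.998,\ \ldots,\ -0.001,\ 0,\ 0.001,\ \ldots,\ 499999.998,\ 499999.999$.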

The distance between them is equal to ulp = 0.001.
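
The figures of Example 3.6 can be checked with a short Python sketch. It is only an illustration under stated assumptions: the helper name `to_fixed` is hypothetical, rounding to the nearest representable value is assumed as the approximation rule, and exact rational arithmetic (`fractions.Fraction`) is used so that the check itself introduces no rounding error.

```python
from fractions import Fraction

B, n, p = 10, 9, 3            # base, total digits, fractional digits (Example 3.6)
ulp = Fraction(1, B**p)       # distance d between exactly represented numbers

# B's complement integer range, scaled by B**(-p)
x_min = Fraction(-(B**n) // 2, B**p)
x_max = Fraction(B**n // 2 - 1, B**p)

def to_fixed(x):
    """Round x to the nearest multiple of ulp (hypothetical helper)."""
    x = Fraction(x)
    if not x_min <= x <= x_max:
        raise OverflowError("x is outside the representable interval")
    return round(x / ulp) * ulp

x = Fraction("123.45678")
r = to_fixed(x)
error = abs(x - r)

print("interval:", float(x_min), float(x_max))
print("representation:", float(r), "error:", float(error))
assert error <= ulp / 2       # maximum error is ulp/2
```

Whatever value of x is chosen inside the interval, the final assertion holds, which is the bound stated in Definition 3.9(3).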

Definitions 3.10

  1. In a floating-point numeration system, the representation consists of two numbers: a fixed-point number (the significand) +s or −s, where s is a nonnegative number, and an integer (the exponent) e. The corresponding number is $\pm s \cdot b^e$, where b is the chosen base (not necessarily equal to B).
  2. Let $s_{\min}$, $s_{\max}$, $e_{\min}$, and $e_{\max}$ be the minimum and maximum values of s and e, respectively. The range of represented numbers is

    $-s_{\max} \cdot b^{e_{\max}} \le x \le s_{\max} \cdot b^{e_{\max}}$

    and the minimum absolute value of a represented number is

    $s_{\min} \cdot b^{e_{\min}}$.

  3. Let ulp be the unit in the least significant position of the significand. Then the distance D between exactly represented numbers is $D = d \cdot b^e$, where d = ulp is the distance between two successive values of the significand. Thus the value of D depends on the exponent e. The maximum error is equal to

    $D/2 = \mathrm{ulp} \cdot b^e / 2$.

  4. The maximum relative error is equal to $D/(2|x|) = \mathrm{ulp} \cdot b^e / (2 \cdot s \cdot b^e) = \mathrm{ulp}/(2s)$. As in the preceding case (Definition 3.9(4)), the maximum relative error is less than or equal to image.
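
The dependence of D on the exponent can be observed on ordinary Python floats. The sketch below assumes that Python's `float` is an IEEE binary64 number (b = 2, significand of the form 1.f with 52 fraction bits), which holds on common platforms but is an assumption here; the function name `spacing` is hypothetical, and only normal, nonzero values are considered.

```python
import math

FRACTION_BITS = 52                 # assumed binary64 layout: significand 1.f, 52 fraction bits
SIG_ULP = 2.0 ** -FRACTION_BITS    # d = ulp of the significand

def spacing(x):
    """Return (e, s, D) for a normal, nonzero x = +/- s * 2**e with 1 <= s < 2."""
    m, k = math.frexp(abs(x))      # abs(x) = m * 2**k with 0.5 <= m < 1
    e = k - 1                      # rescale so that 1 <= s < 2
    s = 2.0 * m
    D = math.ldexp(SIG_ULP, e)     # D = d * b**e
    return e, s, D

for x in (1.0, 1.5, 1024.0, 1.0e20):
    e, s, D = spacing(x)
    rel = D / (2 * abs(x))         # equals ulp / (2 * s)
    print(f"x={x:g}  e={e}  s={s:g}  D={D:g}  max relative error={rel:g}")
```

The distance D grows with the exponent, while the relative error ulp/(2s) depends only on the significand, as stated in Definitions 3.10(3) and 3.10(4).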

Comment 3.6 In a floating-point system, with q digits for representing the absolute value s of the significand and t digits for representing the exponent, the range of positive numbers is

image

the maximum error is equal to

image

and the maximum relative error is equal to image.

In a fixed-point system with q + t digits, the range of positive numbers is

image

the maximum error is equal to

image

and the maximum relative error is equal to image.

In order to compare both systems, one can compute the quotient rr (relative range) between the maximum and the minimum value of x (x positive). In the floating-point system

image

and in the fixed-point system

image

Taking into account that image it is obvious that

image

Nevertheless, the maximum relative errors are equal. As regards the maximum errors, their values depend on the ulp (not necessarily the same value in both cases).
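
The difference in relative range can be illustrated numerically. The following Python sketch rests on assumptions that are not taken from the text: b = B, sign digits are ignored, the smallest positive significand is one ulp, the largest is $(B^q - 1)$ ulp, and the t-digit exponent takes $B^t$ consecutive integer values. The function names `rr_fixed` and `rr_floating` are hypothetical.

```python
def rr_fixed(B, q, t):
    """Relative range of a fixed-point system with q + t digits (assumed model)."""
    # positive values run from 1 ulp to (B**(q + t) - 1) ulp
    return B ** (q + t) - 1

def rr_floating(B, q, t):
    """Relative range of a floating-point system, q-digit significand, t-digit exponent."""
    # significand from 1 ulp to (B**q - 1) ulp; e_max - e_min = B**t - 1
    return (B ** q - 1) * B ** (B ** t - 1)

for B, q, t in ((10, 2, 1), (2, 24, 8)):
    fixed, floating = rr_fixed(B, q, t), rr_floating(B, q, t)
    print(f"B={B} q={q} t={t}: fixed rr={fixed:.3e}, floating rr={floating:.3e}")
```

Under these assumptions, with the same total number of digits, the floating-point range exceeds the fixed-point range by many orders of magnitude as soon as t is more than a few digits.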

Example 3.7 In the ANSI/IEEE ([ANS1985]) single-precision floating-point system, the significand is a sign-magnitude fixed-point number

$\pm 1.s_{-1}\, s_{-2} \ldots s_{-23}$

where $s_{-1}\, s_{-2} \ldots s_{-23}$ is called the mantissa, and the exponent is an excess-127 integer $e_7\, e_6 \ldots e_0$. The 32-bit word

$\mathit{sign}\ \ e_7\, e_6 \ldots e_0\ \ s_{-1}\, s_{-2} \ldots s_{-23}$

represents the number

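$(-1)^{\mathit{sign}} \cdot s \cdot 2^e$

where $\mathit{sign}$ is the sign bit, $s = 1 + s_{-1} \cdot 2^{-1} + s_{-2} \cdot 2^{-2} + \ldots + s_{-23} \cdot 2^{-23}$ and $e = e_7 \cdot 2^7 + e_6 \cdot 2^6 + \ldots + e_0 \cdot 2^0 - 127$.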

Thus

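$b = 2$, $\mathrm{ulp} = 2^{-23}$, $s_{\min} = 1$, $s_{\max} = 2 - 2^{-23}$, $e_{\min} = -127$, $e_{\max} = 128$.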

Nevertheless, $e_{\min}$ and $e_{\max}$ are not used for representing ordinary numbers; they are used for representing

$\pm 0$, $\pm\infty$,

and other nonordinary numbers. The actual minimum and maximum values are

$e_{\min} = -126$, $e_{\max} = 127$,

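so that the range of represented numbers is $-s_{\max} \cdot 2^{127} \le x \le s_{\max} \cdot 2^{127}$, that is

$-(2 - 2^{-23}) \cdot 2^{127} \le x \le (2 - 2^{-23}) \cdot 2^{127}$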

and the minimum positive represented number is $1 \cdot 2^{-126}$.
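
The field layout of Example 3.7 can be inspected with Python's standard `struct` module, which packs a value into the 32-bit single-precision format. This is a minimal sketch: the function name `decode_single` is hypothetical, and only ordinary (normalized) numbers are handled; the two reserved exponent codes are rejected.

```python
import struct

def decode_single(x):
    """Return (sign, e, s) such that x, rounded to single precision,
    equals (-1)**sign * s * 2**e. Ordinary numbers only."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))  # the 32-bit word
    sign = word >> 31                   # 1 sign bit
    biased = (word >> 23) & 0xFF        # 8 exponent bits e7 ... e0
    fraction = word & 0x7FFFFF          # 23 mantissa bits s-1 ... s-23
    if biased in (0, 255):
        raise ValueError("reserved exponent code: not an ordinary number")
    e = biased - 127                    # excess-127 exponent
    s = 1 + fraction / 2**23            # significand 1.s-1 s-2 ... s-23
    return sign, e, s

smallest = 2.0 ** -126                        # minimum positive ordinary number
largest = (2 - 2.0 ** -23) * 2.0 ** 127       # maximum ordinary number

for value in (1.0, -6.5, smallest, largest):
    sign, e, s = decode_single(value)
    print(f"{value!r}: sign={sign} e={e} s={s!r}")
```

Decoding the two extreme values returns e = −126 with s = 1 and e = 127 with s = 2 − 2^(−23), in agreement with the range given in the example.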
