3.3 REAL NUMBERS


Real numbers can only be approximated, and there are two types of approximation: fixed-point and floating-point numeration systems. The fixed-point system is a simple extension of the integer representation system; it allows the representation of a relatively small range of numbers with a constant absolute precision. The floating-point system allows the representation of a very large range of numbers with a constant relative precision.

Definitions 3.9

1. In a fixed-point numeration system, the number represented in the form

x_(n−p−1) x_(n−p−2) … x_1 x_0 . x_(−1) x_(−2) … x_(−p)   (3.21)

is x/B^p, where x is the integer represented by the same sequence of digits without the point.
2. Let x_min and x_max be the minimum and maximum integers that can be represented with n digits, that is, x_min = 1 − Bⁿ⁻¹ and x_max = Bⁿ⁻¹ − 1 in sign-magnitude representation, and x_min = −Bⁿ/2 and x_max = Bⁿ/2 − 1 in B's complement or excess-Bⁿ/2 representation. Then any real number x belonging to the interval

x_min.B^−p ≤ x ≤ x_max.B^−p

can be represented in the form (3.21) with some error equal to the absolute value of the difference between x and its representation.
3. The distance d between exactly represented numbers is equal to the unit in the least significant position (ulp), that is, B^−p, so that the maximum error is equal to d/2 = B^−p/2 = ulp/2.
4. The maximum relative error is equal to d/(2.|x|) = B^−p/(2.|x|). If |x| ≥ B^−p, then B^−p/(2.|x|) ≤ 1/2, so that the maximum relative error is less than or equal to 1/2.
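
The rounding described in Definitions 3.9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the text: the function names fixed_point_round and abs_error are hypothetical, exact rationals (fractions.Fraction) stand in for real numbers, and the range check against x_min and x_max is omitted.

```python
from fractions import Fraction

def fixed_point_round(x, B=10, p=3):
    """Round x to the nearest multiple of ulp = B**-p.

    Returns (code, value): code is the integer given by the digits
    without the point, value = code / B**p is the represented number.
    """
    ulp = Fraction(1, B**p)
    code = round(Fraction(x) / ulp)  # nearest integer (ties go to even)
    return code, code * ulp

def abs_error(x, B=10, p=3):
    """Absolute difference between x and its fixed-point representation."""
    _, value = fixed_point_round(x, B, p)
    return abs(Fraction(x) - value)

x = Fraction("3.14159")
code, value = fixed_point_round(x)
print(code, value, abs_error(x))
```

For x = 3.14159 with B = 10 and p = 3, the integer code is 3142, the represented value is 3.142, and the error 0.00041 stays below ulp/2 = 0.0005.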

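Example 3.6 The range of numbers x that can be represented in B's complement, with B = 10, n = 9 digits, and ulp = 10⁻³ is

−500000.000 ≤ x ≤ 499999.999.

The following numbers can be exactly represented:

−500000.000, −499999.999, −499999.998, …, −0.001, 0.000, 0.001, …, 499999.998, 499999.999.

The distance between them is equal to ulp = 0.001.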
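
The parameters of Example 3.6 can be checked numerically. The sketch below assumes B's complement with an even Bⁿ; the helper name fixed_point_range is hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

def fixed_point_range(B, n, p):
    """Return (lowest, highest) representable values for n digits in
    B's complement with p fractional digits (assumes B**n is even)."""
    ulp = Fraction(1, B**p)
    x_min = -(B**n // 2)    # minimum integer: -B^n / 2
    x_max = B**n // 2 - 1   # maximum integer: B^n / 2 - 1
    return x_min * ulp, x_max * ulp

low, high = fixed_point_range(B=10, n=9, p=3)
print(float(low), float(high))
```

With B = 10, n = 9, and p = 3 the two bounds are −500000 and 499999.999, and the interval contains 10⁹ exactly represented values spaced 0.001 apart.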

Definitions 3.10

1. In a floating-point numeration system, the representation consists of two numbers: a fixed-point number (the significand) +s or −s, where s is a nonnegative number, and an integer (the exponent) e. The corresponding number is ±s.b^e, where b is the chosen base (not necessarily equal to B).
2. Let s_min, s_max, e_min, and e_max be the minimum and maximum values of s and e, respectively. The range of represented numbers is

−s_max.b^(e_max) ≤ x ≤ s_max.b^(e_max),

and the minimum absolute value of a represented number is s_min.b^(e_min).
3. Let ulp be the unit in the least significant position of the significand. Then the distance D between exactly represented numbers is D = d.b^e, where d = ulp is the distance between two successive values of the significand. Thus the value of D depends on the exponent e. The maximum error is equal to D/2 = ulp.b^e/2.
4. The maximum relative error is equal to D/(2.|x|) = ulp.b^e/(2.s.b^e) = ulp/(2.s). As in the preceding case (Definition 3.9(4)), the maximum relative error is less than or equal to 1/2 as long as s ≥ ulp.
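
The dependence of the spacing D on the exponent can be illustrated with a small sketch. The toy parameters (b = 2, ulp = 1/8) and the function names spacing and max_relative_error are assumptions made for this illustration; they do not come from the text.

```python
from fractions import Fraction

def spacing(ulp, b, e):
    """Distance D = ulp * b**e between neighbouring represented numbers."""
    return Fraction(ulp) * Fraction(b) ** e

def max_relative_error(ulp, s):
    """Bound ulp / (2 * s) on the relative error for a significand s > 0."""
    return Fraction(ulp) / (2 * Fraction(s))

ULP = Fraction(1, 8)  # hypothetical: three fractional binary digits
for e in (-2, 0, 4):
    # The absolute spacing grows with the exponent ...
    print("e =", e, "D =", spacing(ULP, 2, e))
for s in (Fraction(1), Fraction(15, 8)):
    # ... while the relative error bound depends only on the significand.
    print("s =", s, "bound =", max_relative_error(ULP, s))
```

In this toy system the spacing grows from 1/32 at e = −2 to 2 at e = 4, whereas the relative error bound ulp/(2.s) is the same for every exponent.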

Comment 3.6 In a floating-point system, with q digits for representing the absolute value s of the significand and t digits for representing the exponent, the range of positive numbers (taking s_min = ulp and s_max = (B^q − 1).ulp) is

ulp.b^(e_min) ≤ x ≤ (B^q − 1).ulp.b^(e_max),

the maximum error is equal to

ulp.b^e/2 (at most ulp.b^(e_max)/2),

and the maximum relative error is equal to ulp/(2.s_min) = 1/2.

In a fixed-point system with q + t digits, the range of positive numbers is

ulp ≤ x ≤ (B^(q+t) − 1).ulp,

the maximum error is equal to

ulp/2,

and the maximum relative error is equal to ulp/(2.ulp) = 1/2.

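In order to compare both systems, one can compute the quotient rr (relative range) between the maximum and the minimum value of x (x positive). In the floating-point system, with s_min = ulp and s_max = (B^q − 1).ulp,

rr = (s_max/s_min).b^(e_max − e_min) = (B^q − 1).b^(e_max − e_min),

and in the fixed-point system

rr = B^(q+t) − 1.

Taking into account that e_max − e_min is of the order of B^t, so that b^(e_max − e_min) is much greater than B^t, it is obvious that rr (floating-point) >> rr (fixed-point).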

Nevertheless, the maximum relative errors are equal. As regards the maximum errors, their values depend on the ulp (not necessarily the same value in both cases).
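
The comparison of relative ranges can be made concrete with a short sketch. It assumes that the smallest positive significand equals one ulp and the largest equals (B^q − 1) ulps, so that the floating-point quotient is (B^q − 1).b^(e_max − e_min) and the fixed-point quotient is B^(q+t) − 1. The function names and the sample parameters (B = b = 2, q = 8, t = 4, exponents from −8 to 7) are hypothetical.

```python
def rr_floating(B, q, b, e_min, e_max):
    """Relative range of a floating-point system with a q-digit significand."""
    return (B**q - 1) * b**(e_max - e_min)

def rr_fixed(B, q, t):
    """Relative range of a fixed-point system with q + t digits."""
    return B**(q + t) - 1

# Hypothetical 12-digit binary formats: 8 significand digits and 4 exponent
# digits in the floating-point case, 12 digits in the fixed-point case.
floating = rr_floating(B=2, q=8, b=2, e_min=-8, e_max=7)
fixed = rr_fixed(B=2, q=8, t=4)
print(floating, fixed, floating // fixed)
```

With the same total of 12 binary digits, the floating-point quotient is 8355840 against 4095 for the fixed-point system, a ratio of about 2000.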

Example 3.7 In the ANSI/IEEE ([ANS1985]) single-precision floating-point system, the significand is a sign-magnitude integer