Name

Double Type

Syntax

type Double;

Description

The Double type is an IEEE standard floating-point type that uses 8 bytes to store a sign bit, an 11-bit exponent, and a 52-bit mantissa. The exponent is stored in biased form: the stored value is the true exponent plus 1023. The mantissa is usually normalized, that is, it has an implicit 1 bit before the most significant bit. If the exponent is zero, however, the mantissa is denormalized, that is, it has no implicit 1 bit. Thus, the numerical value of +0.0 is represented by all zero bits. An exponent of all 1 bits represents infinity (mantissa is zero) or not-a-number (mantissa is not zero).
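The layout described above can be inspected directly. The following is a minimal sketch, not part of the original entry: ShowFields is a hypothetical helper that overlays the Double with an Int64 and prints the sign, exponent field, and mantissa field. It assumes a compiler that supports Int64 and the SysUtils IntToHex function.

```pascal
program DoubleFields;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: print the three fields of a Double.
procedure ShowFields(X: Double);
var
  Bits: Int64 absolute X;   // the same 8 bytes, viewed as an integer
  Exponent: Integer;
  Mantissa: Int64;
begin
  Exponent := Integer((Bits shr 52) and $7FF);   // 11 bits below the sign bit
  Mantissa := Bits and $000FFFFFFFFFFFFF;        // low 52 bits
  Writeln('bits=$', IntToHex(Bits, 16),
          ' sign=', Ord(Bits < 0),
          ' exponent=$', IntToHex(Exponent, 3),
          ' mantissa=$', IntToHex(Mantissa, 13));
end;

begin
  ShowFields(0.0);    // all zero bits
  ShowFields(1.0);    // exponent field $3FF (the bias), mantissa 0
  ShowFields(-2.0);   // sign 1, exponent field $400, mantissa 0
  ShowFields(0.75);   // 1.5 x 2^-1: exponent field $3FE, top mantissa bit set
end.
```

The value 0.75 shows the implicit 1 bit at work: only the fractional part of 1.5 is stored, so a single mantissa bit is set.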

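The limits of the Double type are approximately 2.23 × 10^-308 (the smallest positive normalized value) to 1.79 × 10^308, with about 15 decimal digits of precision. Denormalized values extend the range down to about 4.94 × 10^-324, with reduced precision. Table 5-1 shows the detailed format of finite and special Double values.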

Table 5-1. Format of Double Floating-Point Numbers

Numeric class      Sign   Exponent Bits      Mantissa Bits
-----------------  ----   ----------------   ----------------
Positive
  Normalized       0      0...1 to 1...10    0...0 to 1...1
  Denormalized     0      0...0              0...1 to 1...1
  Zero             0      0...0              0...0
  Infinity         0      1...1              0...0
  Signaling NaN    0      1...1              0...1 to 01...1
  Quiet NaN        0      1...1              1...0 to 1...1
Negative
  Normalized       1      0...1 to 1...10    0...0 to 1...1
  Denormalized     1      0...0              0...1 to 1...1
  Zero             1      0...0              0...0
  Infinity         1      1...1              0...0
  Signaling NaN    1      1...1              0...1 to 01...1
  Quiet NaN        1      1...1              1...0 to 1...1
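The limits of the type correspond to specific rows of the table. The following is a rough sketch under stated assumptions: BitsToDouble is a hypothetical helper, not a library routine, that reinterprets a 64-bit pattern as a Double. The program builds the boundary patterns and prints the resulting values in the default Writeln format.

```pascal
program DoubleLimits;
{$APPTYPE CONSOLE}

// Hypothetical helper: reinterpret a 64-bit pattern as a Double.
function BitsToDouble(Bits: Int64): Double;
var
  X: Double absolute Bits;
begin
  Result := X;
end;

begin
  // Smallest positive normalized value: exponent field 1, mantissa 0.
  Writeln(BitsToDouble($0010000000000000));
  // Largest finite value: exponent field $7FE, mantissa all 1 bits.
  Writeln(BitsToDouble($7FEFFFFFFFFFFFFF));
  // Smallest positive denormalized value: exponent field 0, mantissa 1.
  Writeln(BitsToDouble($0000000000000001));
end.
```

The three results should be close to 2.23 × 10^-308, 1.79 × 10^308, and 4.94 × 10^-324, respectively.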

Tips and Tricks

  • Double is a popular type that provides a good balance between performance and precision.

  • The Double type corresponds to the double type in Java, C, and C++.

  • Refer to the Intel architecture manuals (such as the Pentium Developer’s Manual, volume 3, Architecture and Programming Manual) or IEEE standard 754 for more information about infinity and NaN (not a number). In Delphi, use of a signaling NaN raises runtime error 6 (EInvalidOp).
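For everyday tests, inspecting the bits by hand is often unnecessary. The sketch below assumes a Delphi release whose Math unit declares the NaN and Infinity constants and the IsNan and IsInfinite functions; earlier releases lack them, in which case a routine such as the one in the Example section can be used instead.

```pascal
program CheckSpecial;
{$APPTYPE CONSOLE}

uses
  Math;   // assumed to declare NaN, Infinity, IsNan, and IsInfinite

var
  X: Double;
begin
  X := Infinity;
  Writeln(IsInfinite(X));               // TRUE
  X := NaN;                             // a quiet NaN
  Writeln(IsNan(X));                    // TRUE
  X := 1.5;
  Writeln(IsNan(X) or IsInfinite(X));   // FALSE: an ordinary finite value
end.
```

These functions examine the stored value rather than performing arithmetic on it, so they are a safer test than comparing a suspect value with itself.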

Example

type
  TDouble = packed record
    case Integer of
    0: (Float: Double;);
    1: (Bytes: array[0..7] of Byte;);
    2: (Words: array[0..3] of Word;);
    3: (LongWords: array[0..1] of LongWord;);
    4: (Int64s: array[0..0] of Int64;);
  end;
  TFloatClass = (fcPosNorm, fcNegNorm, fcPosDenorm, fcNegDenorm,
            fcPosZero, fcNegZero, fcPosInf, fcNegInf, fcQNaN, fcSNaN);

// Return the class of a floating-point number: finite, infinity,
// not-a-number; also positive or negative, normalized or denormalized.
// Determine the class by examining the exponent, sign bit,
// and mantissa separately.
function fp_class(X: Double): TFloatClass; overload;
var
  XParts: TDouble absolute X;
  Negative: Boolean;
  Exponent: Word;
  Mantissa: Int64;
begin
  // LongWords[1] holds the high-order 32 bits (little-endian layout, as on x86).
  Negative := (XParts.LongWords[1] and $80000000) <> 0;
  Exponent := (XParts.LongWords[1] and $7FF00000) shr 20;
  Mantissa := XParts.Int64s[0] and $000FFFFFFFFFFFFF;

  // The first three cases can be positive or negative.
  // Assume positive, and test the sign bit later.
  if (Mantissa = 0) and (Exponent = 0) then
    // Mantissa and exponent are both zero, so the number is zero.
    Result := fcPosZero
  else if Exponent = 0 then
    // If the exponent is zero, but the mantissa is not,
    // the number is finite but denormalized.
    Result := fcPosDenorm
  else if Exponent <> $7FF then
    // Otherwise, if the exponent is not all 1, the number is normalized.
    Result := fcPosNorm
  else if Mantissa = 0 then
    // Exponent is all 1, and mantissa is all 0 means infinity.
    Result := fcPosInf
  else
  begin
    // Exponent is all 1, and mantissa is non-zero, so the value
    // is not a number. Test for quiet or signaling NaN.    
    if (Mantissa and $8000000000000) <> 0 then
      Result := fcQNaN
    else
      Result := fcSNaN;
    Exit; // Do not distinguish negative NaNs.
  end;

  if Negative then
    Inc(Result);
end;

See Also

CompToDouble Function, DoubleToComp Procedure, Extended Type, Real Type, Single Type