Double Type
The Double type is an IEEE standard floating-point type that uses 8 bytes to store a sign bit, an 11-bit exponent, and a 52-bit mantissa. The mantissa is usually normalized, that is, it has an implicit 1 bit before the most significant bit. If the exponent is zero, however, the mantissa is denormalized, that is, stored without the implicit 1 bit. Thus, the numerical value of +0.0 is represented by all zero bits. An exponent of all 1 bits represents infinity (mantissa is zero) or not-a-number (mantissa is not zero).
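The layout described above can be sketched in Delphi by splitting a value into its three fields and rebuilding it. This is a minimal illustration, not a definitive implementation: SplitDouble, Scale2, and JoinDouble are hypothetical helper names that are not part of the Delphi RTL, and the sketch assumes a little-endian target such as Intel x86 and handles only finite values.

```pascal
program DoubleFields;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// SplitDouble, Scale2, and JoinDouble are hypothetical helper names
// chosen for this sketch; they are not part of the Delphi RTL.

// Split a Double into its sign bit, 11-bit exponent, and 52-bit mantissa.
procedure SplitDouble(X: Double; var Negative: Boolean;
  var Exponent: Integer; var Mantissa: Int64);
var
  Bits: Int64 absolute X;
begin
  Negative := Bits < 0;                          // bit 63 is the sign bit
  Exponent := Integer((Bits shr 52) and $7FF);   // bits 52..62
  Mantissa := Bits and $000FFFFFFFFFFFFF;        // bits 0..51
end;

// Multiply V by 2 to the power P, one exact doubling or halving at a time.
function Scale2(V: Double; P: Integer): Double;
begin
  Result := V;
  while P > 0 do
  begin
    Result := Result * 2.0;
    Dec(P);
  end;
  while P < 0 do
  begin
    Result := Result / 2.0;
    Inc(P);
  end;
end;

// Rebuild a finite value from its fields (Exponent must not be $7FF).
function JoinDouble(Negative: Boolean; Exponent: Integer;
  Mantissa: Int64): Double;
var
  Significand: Int64;
begin
  if Exponent = 0 then
  begin
    // Denormalized (or zero): no implicit 1 bit; scale as if Exponent = 1.
    Significand := Mantissa;
    Exponent := 1;
  end
  else
    // Normalized: restore the implicit 1 bit just above the mantissa.
    Significand := Mantissa or (Int64(1) shl 52);
  // The exponent bias is 1023, and the mantissa holds 52 fraction bits.
  Result := Scale2(Significand, Exponent - 1023 - 52);
  if Negative then
    Result := -Result;
end;

var
  Negative: Boolean;
  Exponent: Integer;
  Mantissa: Int64;
begin
  // -2.5 = -1.25 * 2^1: the sign bit is set, the exponent field is
  // 1 + 1023 = 1024, and the mantissa stores only the fraction 0.25.
  SplitDouble(-2.5, Negative, Exponent, Mantissa);
  Writeln('Negative = ', Negative);
  Writeln('Exponent = ', Exponent);
  Writeln('Mantissa = ', Mantissa);
  Writeln('Rebuilt  = ', JoinDouble(Negative, Exponent, Mantissa):0:2);
end.
```

Scale2 uses repeated doubling and halving rather than a library power function so that every intermediate step is exact, including in the denormalized range.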
The limits of the Double type are approximately 2.23 × 10^-308 to 1.79 × 10^308 in magnitude for normalized values, with about 15 decimal digits of precision. Table 5-1 shows the detailed format of finite and special Double values.
Numeric class | Sign | Exponent Bits | Mantissa Bits
--- | --- | --- | ---
Positive | | |
Normalized | 0 | 0...1 to 1...10 | 0...0 to 1...1
Denormalized | 0 | 0...0 | 0...1 to 1...1
Zero | 0 | 0...0 | 0...0
Infinity | 0 | 1...1 | 0...0
Signaling NaN | 0 | 1...1 | 0...1 to 01...1
Quiet NaN | 0 | 1...1 | 1...0 to 1...1
Negative | | |
Normalized | 1 | 0...1 to 1...10 | 0...0 to 1...1
Denormalized | 1 | 0...0 | 0...1 to 1...1
Zero | 1 | 0...0 | 0...0
Infinity | 1 | 1...1 | 0...0
Signaling NaN | 1 | 1...1 | 0...1 to 01...1
Quiet NaN | 1 | 1...1 | 1...0 to 1...1
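A few of the bit patterns in the table can be tried directly by overlaying an Int64 on a Double. The following is a sketch under stated assumptions: BitsToDouble is a hypothetical helper name, and the IsNan and IsInfinite functions are assumed to be available from the Math unit (as in Delphi 6 and later, or Free Pascal). Signaling NaNs are deliberately left out, because loading one into the floating-point unit can raise an exception.

```pascal
program SpecialDoubles;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Math;

// BitsToDouble is a hypothetical helper name, not an RTL routine.
// It reinterprets the 64 bits of an Int64 as a Double.
function BitsToDouble(Bits: Int64): Double;
begin
  Move(Bits, Result, SizeOf(Result));
end;

begin
  // Exponent all 1s, mantissa zero: positive infinity.
  Writeln('Is infinity:  ', IsInfinite(BitsToDouble($7FF0000000000000)));
  // Exponent all 1s, most significant mantissa bit set: quiet NaN.
  Writeln('Is NaN:       ', IsNan(BitsToDouble($7FF8000000000000)));
  // Exponent zero, mantissa 0...1: the smallest positive denormalized value.
  Writeln('Denormal > 0: ', BitsToDouble(1) > 0.0);
  // Only the sign bit set (Low(Int64)): negative zero, which still
  // compares equal to +0.0.
  Writeln('-0.0 = +0.0:  ', BitsToDouble(Low(Int64)) = 0.0);
end.
```

Move copies the raw bytes, so the bit pattern itself is never converted as a number on its way into the result.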
Double is a popular type that provides a good balance between performance and precision.
The Double type corresponds to the double type in Java, C, and C++.
Refer to the Intel architecture manuals (such as the Pentium Developer’s Manual, volume 3, Architecture and Programming Manual) or IEEE standard 754 for more information about infinity and NaN (not a number). In Delphi, use of a signaling NaN raises runtime error 6 (EInvalidOp).
type
  TDouble = packed record
    case Integer of
      0: (Float: Double;);
      1: (Bytes: array[0..7] of Byte;);
      2: (Words: array[0..3] of Word;);
      3: (LongWords: array[0..1] of LongWord;);
      4: (Int64s: array[0..0] of Int64;);
  end;

  TFloatClass = (fcPosNorm, fcNegNorm, fcPosDenorm, fcNegDenorm,
                 fcPosZero, fcNegZero, fcPosInf, fcNegInf, fcQNaN, fcSNaN);

// Return the class of a floating-point number: finite, infinity,
// not-a-number; also positive or negative, normalized or denormalized.
// Determine the class by examining the exponent, sign bit,
// and mantissa separately.
function fp_class(X: Double): TFloatClass; overload;
var
  XParts: TDouble absolute X;
  Negative: Boolean;
  Exponent: Word;
  Mantissa: Int64;
begin
  Negative := (XParts.LongWords[1] and $80000000) <> 0;
  Exponent := (XParts.LongWords[1] and $7FF00000) shr 20;
  Mantissa := XParts.Int64s[0] and $000FFFFFFFFFFFFF;

  // The first three cases can be positive or negative.
  // Assume positive, and test the sign bit later.
  if (Mantissa = 0) and (Exponent = 0) then
    // Mantissa and exponent are both zero, so the number is zero.
    Result := fcPosZero
  else if Exponent = 0 then
    // If the exponent is zero, but the mantissa is not,
    // the number is finite but denormalized.
    Result := fcPosDenorm
  else if Exponent <> $7FF then
    // Otherwise, if the exponent is not all 1, the number is normalized.
    Result := fcPosNorm
  else if Mantissa = 0 then
    // Exponent is all 1, and mantissa is all 0 means infinity.
    Result := fcPosInf
  else
  begin
    // Exponent is all 1, and mantissa is non-zero, so the value
    // is not a number. Test for quiet or signaling NaN.
    if (Mantissa and $8000000000000) <> 0 then
      Result := fcQNaN
    else
      Result := fcSNaN;
    Exit; // Do not distinguish negative NaNs.
  end;

  if Negative then
    Inc(Result);
end;