EXPLORATION 26

Very Big and Very Little Numbers

Even the longest long long cannot represent truly large numbers, such as Avogadro’s number (6.02 × 10²³) or extremely small numbers, such as the mass of an electron (9.1 × 10⁻³¹ kg). Scientists and engineers use scientific notation, which consists of a mantissa (such as 6.02 or 9.1) and an exponent (such as 23 or –31), relative to a base (10).

Computers represent very large and very small numbers using a similar representation, known as floating-point. I know many of you have been waiting eagerly for this Exploration, as you’ve probably grown tired of using only integers, so let’s jump in.

Floating-Point Numbers

Computers use floating-point numbers for very large and very small values. By sacrificing precision, you can gain a greatly extended range. However, never forget that the range and precision are limited. Floating-point numbers are not the same as mathematical real numbers, although they can often serve as useful approximations of real numbers.
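To make the trade-off concrete, here is a minimal sketch. It assumes that float is a binary type with a 24-bit mantissa (as in the IEC 60559 format described later in this Exploration) and that the compiler does not carry float arithmetic at a higher precision. The function name is made up for this illustration.

```cpp
#include <limits>

/// Return true if adding 1 to 2^24 is lost to rounding in type float.
/// Assumption: float is binary with a 24-bit mantissa.
bool float_absorbs_one()
{
  static_assert(std::numeric_limits<float>::radix == 2 &&
                std::numeric_limits<float>::digits == 24,
                "this sketch assumes a 24-bit binary mantissa");
  float const big{16777216.0F};  // 2 to the 24th power
  float const sum{big + 1.0F};   // 16,777,217 needs 25 bits, so it is rounded
  return sum == big;
}
```

An ordinary 32-bit int stores 16,777,217 exactly. A float of the same size reaches values near 10³⁸, but pays for that range by skipping over integers such as this one.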

Like its scientific notation counterpart, a floating-point number has a mantissa, also called a significand, a sign, and an exponent. The mantissa and exponent use a common base or radix. Although integers in C++ are always binary in their representation, floating-point numbers can use any base. Binary is a popular base, but some computers use 16 or even 10 as the base. The precise details are, as always, dependent upon the implementation. In other words, each C++ implementation uses its native floating-point format for maximum performance.

Floating-point values often come in multiple flavors. C++ offers single, double, and extended precision, called float, double, and long double, respectively. The difference is that float usually has less precision and a smaller range than double, and double usually has less precision and smaller range than long double. In exchange, long double usually requires more memory and computation time than double, which usually consumes more memory and computation time than float. On the other hand, an implementation is free to use the same representation for all three types.

Use double, unless there is some reason not to. Use float when memory is at a premium and you can afford to lose precision or long double when you absolutely need the extra precision or range and can afford to give up memory and performance.

A common binary representation of floating-point numbers is the IEC 60559 standard, which is better known as IEEE 754. Most likely, your desktop system has hardware that implements the IEC 60559 standard. For the sake of convenience, the following discussion describes only IEC 60559; however, never forget that C++ permits many floating-point representations. Mainframes and DSPs, for example, often use other representations.

An IEC 60559 float occupies 32 bits, of which 23 bits make up the mantissa and 8 bits form the exponent, leaving one bit for the mantissa’s sign. The radix is 2, so the range of an IEC 60559 float is roughly 2⁻¹²⁷ to 2¹²⁷, or 10⁻³⁸ to 10³⁸. (I lied. Smaller numbers are possible, but the details are not germane to C++. If you are curious, look up denormalization in your favorite computer science reference.)

The IEC 60559 standard reserves some bit patterns for special values. In particular, if the exponent is all one bits, and the mantissa is all zero bits, the value is considered “infinity.” It’s not quite a mathematical infinity, but it does its best to pretend. Adding any finite value to infinity, for example, yields an answer of infinity. Positive infinity is always greater than any finite value, and negative infinity is always smaller than finite values.

If the exponent is all one bits, and the mantissa is not all zero bits, the value is considered not-a-number, or NaN. NaN comes in two varieties: quiet and signaling. Arithmetic with a quiet NaN always yields a NaN result. Using a signaling NaN results in a machine interrupt. How that interrupt manifests itself in your program is up to the implementation. In general, you should expect your program to terminate abruptly. Consult your compiler’s documentation to learn the details. Certain arithmetic operations that have no meaningful result can also yield NaN, such as adding positive infinity to negative infinity.

Test whether a value is NaN by calling std::isnan (declared in <cmath>). Similar functions exist to test for infinity and other properties of floating-point numbers.
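The following sketch exercises the special values described above. It assumes that double follows IEC 60559 (so infinity and quiet NaN exist) and that the compiler's default floating-point settings are in effect. The function names are hypothetical.

```cpp
#include <cmath>
#include <limits>

/// Adding a finite value to infinity yields infinity, and infinity
/// is greater than the largest finite value.
bool infinity_absorbs_finite()
{
  double const inf{std::numeric_limits<double>::infinity()};
  return std::isinf(inf + 1.0e300) && inf > std::numeric_limits<double>::max();
}

/// Adding positive infinity to negative infinity has no meaningful
/// result, so the answer is NaN.
bool opposite_infinities_make_nan()
{
  double const inf{std::numeric_limits<double>::infinity()};
  return std::isnan(inf + -inf);
}

/// Under IEC 60559 rules, NaN compares unequal to every value, even itself,
/// which is why std::isnan is the reliable test.
bool nan_differs_from_itself()
{
  double const nan{std::numeric_limits<double>::quiet_NaN()};
  return std::isnan(nan) && !(nan == nan);
}
```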

A double is similar in structure to a float, except it takes up 64 bits: 52 bits for the mantissa, 11 bits for the exponent, and 1 sign bit. A double can also have infinity and NaN values, with the same structural representation (that is, exponent all ones).

A long double is even longer than double. The IEC 60559 standard permits an extended double-precision format that requires at least 79 bits. Many desktop and workstation systems implement extended-precision, floating-point numbers using 80 bits (64 for the mantissa, 15 for the exponent, and 1 sign bit).

Floating-Point Literals

Any numeric literal with a decimal point or a decimal exponent represents a floating-point number. The decimal point is always '.', regardless of locale. The exponent starts with the letter e or E and can be signed. No spaces are permitted in a numeric literal. For example:

3.1415926535897
31415926535897e-13
0.000314159265e4

By default, a floating-point literal has type double. To write a float literal, add the letter f or F after the number. For a long double, use the letter l or L, as in the following examples:

3.141592f
31415926535897E-13l
0.000314159265E+420L

As with long int literals, I prefer uppercase L, to avoid confusion with the digit 1. Feel free to use f or F, but I recommend you pick one and stick with it. For uniformity with L, I prefer to use F.
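One way to convince yourself of a literal's type is to let overload resolution report it. The sketch below uses three overloads of a made-up function, type_name; the compiler picks the overload that matches the literal's type exactly.

```cpp
#include <string>

/// Report the floating-point type of the argument (hypothetical helper).
std::string type_name(float)       { return "float"; }
std::string type_name(double)      { return "double"; }
std::string type_name(long double) { return "long double"; }
```

With these overloads, type_name(3.141592F) returns "float", type_name(31415926535897e-13) returns "double", and type_name(0.000314159265E4L) returns "long double".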

If a floating-point literal exceeds the range of the type, the compiler will tell you. If you ask for a value at greater precision than the type supports, the compiler will silently give you as much precision as it can. Another possibility is that you request a value that the type cannot represent exactly. In that case, the compiler gives you the next higher or lower value.

For example, your program may have the literal 0.2F, which seems like a perfectly fine real number, but as a binary floating-point value, it has no exact representation. Instead, it is approximately 0.0011001100₂. The difference between the decimal value and the internal value can give rise to unexpected results, the most common of which is when you expect two numbers to be equal and they are not. Read Listing 26-1 and predict the outcome.

Listing 26-1.  Floating-Point Numbers Do Not Always Behave As You Expect

#include <cassert>
int main()
{
  float a{0.03F};
  float b{10.0F};
  float c{0.3F};
  assert(a * b == c);
}

What is your prediction?

_____________________________________________________________

What is the actual outcome?

_____________________________________________________________

Were you correct? ________________

The problem is that 0.03 and 0.3 do not have exact representations in binary, so if your floating-point format is binary (and most are), the values the computer uses are approximations of the real values. Multiplying 0.03 by 10 gives a result that is very close to 0.3, but the binary representation differs from that obtained by converting 0.3 to binary. (In IEC 60559 single-precision format, 0.03 * 10.0 gives 0.0100110011001100110011001₂ and 0.3 is 0.0100110011001100110011010₂. The numbers are very close, but they differ by one unit in the last of the 24 significant bits: the 23 stored mantissa bits plus an implicit leading 1.)
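You can check how close two values are with std::nextafter (declared in <cmath>), which returns the next representable value after its first argument in the direction of its second. The sketch below assumes IEC 60559 single precision with float arithmetic carried out in float; the function name is hypothetical.

```cpp
#include <cmath>

/// Return true if @p x and @p y are different values with no other
/// float between them.
bool adjacent_but_unequal(float x, float y)
{
  return x != y && std::nextafter(x, y) == y;
}
```

Under those assumptions, calling adjacent_but_unequal with the product a * b and the value c from Listing 26-1 returns true: the two values are neighbors, but they are not equal.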

Some programmers mistakenly believe that floating-point arithmetic is therefore “imprecise.” On the contrary, floating-point arithmetic is precisely defined. The problem lies only in the programmer’s expectations, if you expect floating-point arithmetic to follow the rules of real-number arithmetic. If you realize that the compiler converts your decimal literals to other values, and computes with those other values, and if you understand the rules that the processor uses when it performs limited-precision arithmetic with those values, you can know exactly what the results will be. If this level of detail is critical for your application, you have to take the time to perform this level of analysis.

The rest of us, however, can continue to pretend that floating-point numbers and arithmetic are nearly real, without worrying overmuch about the differences. Just don’t compare floating-point numbers for exact equality. (How to compare numbers for approximate equality is beyond the scope of this book. Visit the web site for links and references.)
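Purely as an illustration, and not as a general-purpose solution, here is one simple way to test for approximate equality: a relative tolerance. The function name and the default tolerance of 1e-5 are arbitrary assumptions; real applications must choose a tolerance (and often a different technique) to suit their data.

```cpp
#include <algorithm>
#include <cmath>

/// Return true if @p x and @p y differ by no more than @p tolerance,
/// measured relative to the larger of the two magnitudes.
/// The default tolerance is an arbitrary choice for this sketch.
bool nearly_equal(float x, float y, float tolerance = 1e-5F)
{
  float const scale{std::max(std::fabs(x), std::fabs(y))};
  return std::fabs(x - y) <= tolerance * scale;
}
```

With this helper, nearly_equal(0.03F * 10.0F, 0.3F) is true even though the exact comparison in Listing 26-1 fails. Note that a purely relative test behaves poorly when the values are very near zero.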

Floating-Point Traits

You can query numeric_limits to reveal the size and limits of a floating-point type. You can also determine whether the type allows infinity or NaN. Listing 26-2 shows some code that displays information about a floating-point type.

Listing 26-2.  Discovering the Attributes of a Floating-Point Type

#include <iostream>
#include <limits>
#include <locale>
 
int main()
{
  std::cout.imbue(std::locale{""});
  std::cout << std::boolalpha;
  // Change float to double or long double to learn about those types.
  typedef float T;
  std::cout << "min=" << std::numeric_limits<T>::min() << '\n'
       << "max=" << std::numeric_limits<T>::max() << '\n'
       << "IEC 60559? " << std::numeric_limits<T>::is_iec559 << '\n'
       << "max exponent=" << std::numeric_limits<T>::max_exponent << '\n'
       << "min exponent=" << std::numeric_limits<T>::min_exponent << '\n'
       << "mantissa places=" << std::numeric_limits<T>::digits << '\n'
       << "radix=" << std::numeric_limits<T>::radix << '\n'
       << "has infinity? " << std::numeric_limits<T>::has_infinity << '\n'
       << "has quiet NaN? " << std::numeric_limits<T>::has_quiet_NaN << '\n'
       << "has signaling NaN? " << std::numeric_limits<T>::has_signaling_NaN << '\n';
 
  if (std::numeric_limits<T>::has_infinity)
  {
    T zero{0};
    T one{1};
    T inf{std::numeric_limits<T>::infinity()};
    if (one/zero == inf)
      std::cout << "1.0/0.0 = infinity\n";
    if (inf + inf == inf)
      std::cout << "infinity + infinity = infinity\n";
  }
  if (std::numeric_limits<T>::has_quiet_NaN)
  {
    // There's no guarantee that your environment produces quiet NaNs for
    // these illegal arithmetic operations. It's possible that your compiler's
    // default is to produce signaling NaNs, or to terminate the program
    // in some other way.
    T zero{};
    T inf{std::numeric_limits<T>::infinity()};
    std::cout << "zero/zero = " << zero/zero << '\n';
    std::cout << "inf/inf = " << inf/inf << '\n';
  }
}

Modify the program so it prints information about double. Run it. Modify it again for long double, and run it. Do the results match your expectations? ________________

Floating-Point I/O

Reading and writing floating-point values depend on the locale. In the classic locale, the input format is the same as for an integer or floating-point literal. In a native locale, you must write the input according to the rules of the locale. In particular, the decimal separator must be that of the locale. Thousands-separators are optional, but if you use them, you must use the locale-specific character and correct placement.
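As a small sketch of input in the classic locale, the following helper reads one floating-point value from a string. The function name is hypothetical; it imbues the stream with the classic locale explicitly, so the decimal point must be '.' no matter what the user's native locale is.

```cpp
#include <locale>
#include <sstream>
#include <string>

/// Read a floating-point value from @p text, using the classic locale.
/// @param text the characters to read
/// @param[out] result the value that was read, if successful
/// @return true for success or false if @p text does not start with a number
bool read_double(std::string const& text, double& result)
{
  std::istringstream in{text};
  in.imbue(std::locale::classic());
  return static_cast<bool>(in >> result);
}
```

The accepted input looks just like a literal without a suffix, so "2.5e2", "250", and "-0.125" are all readable.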

Output is more complicated.

In addition to the field width and fill character, floating-point output also depends on the precision—the number of places after the decimal point—and the format, which can be fixed-point (without an exponent), scientific (with an exponent), or general (uses an exponent only when necessary). The default is general. Depending on the locale, the number may include separators for groups of thousands.

In the scientific and fixed formats (which you specify with a manipulator of the same name), the precision is the number of digits after the decimal point. In the general format, it is the maximum number of significant digits. Set the stream’s precision with the precision member function or setprecision manipulator. The default precision is six. As usual, the manipulators that do not take arguments are declared in <ios>, so you get them for free with <iostream>, but setprecision requires that you include <iomanip>.

double const pi{3.141592653589793};
std::cout.precision(12);
std::cout << pi << '\n';
std::cout << std::setprecision(4) << pi << '\n';

In scientific format, the exponent is printed with a lowercase 'e' (or 'E', if you use the uppercase manipulator), followed by the base 10 exponent. The exponent always has a sign (+ or -), and at least two digits, even if the exponent is zero. The mantissa is written with one digit before the decimal point. The precision determines the number of places after the decimal point.

In fixed format, no exponent is printed. The number is printed with as many digits before the decimal point as needed. The precision determines the number of places after the decimal point. The decimal point is printed unless the precision is zero; use the showpoint manipulator to force a decimal point even then.
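The two helpers below are a sketch of these rules in action. They write to a string stream in the classic locale, so you can inspect the text that would be printed; the function names are invented for this example.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

/// Format @p value in fixed format with @p precision places after the decimal point.
std::string to_fixed(double value, int precision)
{
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out.precision(precision);
  out << std::fixed << value;
  return out.str();
}

/// Format @p value in scientific format with @p precision places after the decimal point.
std::string to_scientific(double value, int precision)
{
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out.precision(precision);
  out << std::scientific << value;
  return out.str();
}
```

For example, to_fixed(1234.5, 3) yields "1234.500", and to_scientific(12345.678, 3) yields "1.235e+04": one digit before the decimal point, three after, and a signed exponent of at least two digits.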

The default format is the general format, which means printing numbers nicely without sacrificing information. If the exponent is less than –4, or if it is greater than or equal to the precision, the number is printed in scientific format. Otherwise, it is printed without an exponent. However, unlike conventional fixed-point output, trailing zeros are removed after the decimal point. If after removal of the trailing zeros the decimal point becomes the last character, it is also removed.

When necessary, values are rounded off to fit within the allotted precision.
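To see where the general format switches between styles, you can try values on either side of the thresholds. This sketch uses a hypothetical helper that formats into a string in the classic locale, leaving the stream in its default general format.

```cpp
#include <locale>
#include <sstream>
#include <string>

/// Format @p value in the general format with @p precision significant digits.
std::string to_general(double value, int precision = 6)
{
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out.precision(precision);
  out << value;  // a new stream starts in the general format
  return out.str();
}
```

With the default precision of six, to_general(0.0001) yields "0.0001", but to_general(0.00001) yields "1e-05". At the other end, to_general(123456.0) yields "123456", and to_general(1234567.0) yields "1.23457e+06", rounded to six significant digits.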

A new format in C++ 11 is hexfloat. The value is printed in hexadecimal, which lets you discover the exact value on systems with binary or base 16 representations. Because the letter 'e' is a valid hexadecimal value, the exponent is marked with the letters 'p' or 'P'.
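Because hexadecimal output captures a binary value exactly, it is a handy way to save and restore a number without any loss. The sketch below assumes a library that writes the full value for hexfloat and a std::strtod that accepts hexadecimal input, as C++ 11 requires; both function names are hypothetical.

```cpp
#include <cstdlib>
#include <ios>
#include <locale>
#include <sstream>
#include <string>

/// Format @p value in hexadecimal floating-point format.
std::string to_hexfloat(double value)
{
  std::ostringstream out;
  out.imbue(std::locale::classic());
  out << std::hexfloat << value;
  return out.str();
}

/// Convert hexadecimal floating-point text back to a double.
double from_hexfloat(std::string const& text)
{
  return std::strtod(text.c_str(), nullptr);
}
```

The exact text varies among libraries, but the round trip from_hexfloat(to_hexfloat(x)) should give back x, even for a value such as 0.1 that has no exact decimal representation of reasonable length.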

The easiest way to specify a particular output format is with a manipulator: scientific, fixed, or hexfloat. Like the precision, the format persists in the stream’s state until you change it. (Only width resets after an output operation.) Unfortunately, once you set the format, there is no easy way to revert to the default general format. To do that, you must use a member function, and a clumsy one at that, as shown in the following:

std::cout << std::scientific << large_number << '\n';
std::cout << std::fixed << small_number << '\n';
std::cout.unsetf(std::ios_base::floatfield);
std::cout << number_in_general_format << '\n';

Complete Table 26-1, showing exactly how each value would be printed in each format, in the classic locale. I filled in the first row for your convenience.

Table 26-1. Floating-Point Output

[Table 26-1 was an image in the original and is not reproduced here.]

After you have filled in the table with your predictions, write a program that will test your predictions, then run it and see how well you did. Compare your program with Listing 26-3.

Listing 26-3.  Demonstrating Floating-Point Output

#include <iostream>
 
/// Print a floating-point number in three different formats.
/// @param precision the precision to use when printing @p value
/// @param value the floating-point number to print
void print(int precision, float value)
{
  std::cout.precision(precision);
  std::cout << std::scientific << value << '\t'
            << std::fixed      << value << '\t'
            << std::hexfloat   << value << '\t';
 
  // Set the format to general.
  std::cout.unsetf(std::ios_base::floatfield);
  std::cout << value << '\n';
}
 
/// Main program.
int main()
{
  print(6, 123456.789f);
  print(4, 1.23456789f);
  print(2, 123456789.f);
  print(5, -1234.5678e9f);
}

The precise values can differ from one system to another, depending on the floating-point representation. For example, float on most systems cannot support the full precision of nine decimal digits, so you should expect some fuzziness in the least significant digits of the printed result. In other words, unless you want to sit down and do some serious binary computation, you cannot easily predict exactly what the output will be in every case. Table 26-2 shows the output from Listing 26-3, when run on a typical IEC 60559–compliant system.

Table 26-2. Results of Printing Floating-Point Numbers

[Table 26-2 was an image in the original and is not reproduced here.]

Some applications are never required to use floating-point numbers; others need them a lot. Scientists and engineers, for example, depend on floating-point arithmetic and math functions and must understand the subtleties of working with these numbers. C++ has everything you need for computation-intensive programming. Although the details are beyond the scope of this book, interested readers should consult a reference for the <cmath> header and the transcendental and other functions that it provides. The <cfenv> header contains functions and related declarations to let you adjust the rounding mode and other aspects of the floating-point environment. If you cannot find information about <cfenv> in a C++ reference, consult a C99 reference for the <fenv.h> header.

The next Exploration takes a side trip to a completely different topic, explaining the strange comments—the extra slashes (///) and stars (/**)—that I’ve used in so many programs.
