Most often, functions such as logarithms, exponentials, and trigonometric functions are computed by software-implemented algorithms applied to floating-point representations. Hardware or microprogrammed implementations are mainly justified for special-purpose computing devices such as ASICs or embedded systems. As it is generally not possible to obtain an exact result, approximation methods have to be used together with error-estimation techniques. Newton–Raphson iteration, Taylor–MacLaurin series, and polynomial approximations are the most common approaches to computing these functions. For trigonometric functions, CORDIC algorithms (linear convergence) are well suited. Arguments in the range [1, 2[ (floating-point IEEE standard) are suitable for most approximation methods that need to limit the range of the argument. Whenever a specific range is imposed on the operands, a prescaling operation may be necessary, so an initial step may be included in the algorithmic procedure. Crucial questions for approximation methods are error estimation and effective rounding techniques; these problems run from table design (first-approximation LUT) up to the final result. Numerous methods, algorithms, and implementations have been proposed in the literature ([SPE1965], [CHE1972], [ERC1977], [COD1980], [KOS1991], [MAR1990], [TAN1991], [ERC1995], [PAL2000], [CAO2001]). As for the basic operations, the choice depends on speed/cost compromises and other constraints imposed on the designer. Approximation methods usually assume the availability of the four basic operations as arithmetic primitives, together with look-up tables providing a first "reasonably good" approximation to start from.
The classical Taylor expansion (at point a) formula can be written
f(x) = f(a) + f′(a)·(x − a) + f″(a)·(x − a)^2/2! + … + f^(n)(a)·(x − a)^n/n! + …
When this expansion converges over a certain range of x, that is, when the remainder term tends to zero as n grows,
the expansion is referred to as the Taylor series; if a = 0, it is called the MacLaurin series (Chapter 2).
Consider the convergence behavior of the exponential functions expressed by the following series at x = 0:
e^x = 1 + x + x^2/2! + x^3/3! + …,   (7.13)
e^(−x) = 1 − x + x^2/2! − x^3/3! + …,   (7.14)
a^x = e^(x·ln a) = 1 + x·ln a + (x·ln a)^2/2! + ….   (7.15)
The above expressions can be factorized (Hörner scheme) to suggest a simple iterative computational step; for example, (7.13) for n = 8 can be written
e^x ≅ 1 + x·(1 + (x/2)·(1 + (x/3)·(… (1 + (x/7)·(1 + x/8)) …))).   (7.16)
Formula (7.16) can be computed in 8 divide-and-add steps with an error less than (1 + 0.1)·x^9/9! > R_8; that is, for 0 ≤ x ≤ 1/2, a maximum error ε (reached at x = 1/2) given by
ε ≅ 1.1·(1/2)^9/9! ≅ 5.9·10^−9.   (7.17)
Actually, (7.16) computed at x = 1/2 yields (rounded down) 1.6487212650; a ten-decimal precision computation of e^(1/2) yields (rounded up) 1.6487212708, that is, an error less than the value computed at (7.17). Obviously, the convergence is faster as x becomes smaller, so a prescaling operation could be necessary if the number of algorithmic steps is fixed. Expression (7.14) can be factorized (Hörner scheme) as
e^(−x) ≅ 1 − x·(1 − (x/2)·(1 − (x/3)·(… (1 − (x/7)·(1 − x/8)) …))).   (7.18)
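The figures above are easy to reproduce in ordinary floating-point arithmetic. A minimal Python sketch of the divide-and-add evaluation of (7.16) (the function name is ours):

```python
import math

def horner_exp(x, n):
    """Degree-n truncated Taylor series of e**x, evaluated by the
    Horner scheme 1 + x(1 + x/2(1 + ... (1 + x/n)...))."""
    acc = 1.0
    for k in range(n, 0, -1):   # n divide-and-add steps
        acc = 1.0 + acc * x / k
    return acc

print(horner_exp(0.5, 8))               # ~ 1.6487212650, as in the text
print(math.exp(0.5) - horner_exp(0.5, 8))  # error of the order of (7.17)
```

The observed error at x = 1/2 is about 5.7·10^−9, consistent with the bound (7.17).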
Formula (7.18) is computed in the same way as (7.16), but because of the alternating signs, the absolute value of the error ε_max is somewhat smaller, actually x^9/9!. Expression (7.15) is quite similar to (7.14) except for the prescaling: the range of x·ln(a) has to be considered instead of just that of x.
Logarithmic function series are also easy to compute; in particular, ln(x) can be written
ln(x) = (x − 1) − (x − 1)^2/2 + (x − 1)^3/3 − …, 0 < x ≤ 2.   (7.19)
or, for n = 10 (Hörner scheme),
A close analysis of (7.19) shows that in the range [1, 2] the convergence rate can be very slow; typically, for x = 2, the error ε_max after n steps is
ε_max ≅ 1/(n + 1).
If the range of (x − 1) is restricted to [0, 1/2], the maximum error after n steps will be
ε_max ≅ (x − 1)^(n+1)/(n + 1),
that is, for x − 1 = 1/2, less than 10^−4 after 10 steps.
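Assuming (7.19) is the expansion of ln(x) in powers of (x − 1), both error figures can be checked numerically; a small Python sketch (names ours):

```python
import math

def ln_series(x, n):
    """Partial sum of ln(x) = (x-1) - (x-1)**2/2 + (x-1)**3/3 - ...,
    truncated after n terms (valid for 0 < x <= 2)."""
    u = x - 1.0
    return sum((-1.0) ** (k + 1) * u ** k / k for k in range(1, n + 1))

# Fast case: x - 1 = 1/2, ten terms
print(abs(ln_series(1.5, 10) - math.log(1.5)))  # about 3e-5, below 1e-4
# Slow case: x = 2 (alternating harmonic series), ten terms
print(abs(ln_series(2.0, 10) - math.log(2.0)))  # still around 5e-2
```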
Prescaling the argument of logarithmic functions, together with the corresponding result adjustment, does not involve a significant increase in the overall complexity. Actually, if the argument has been initially divided by k, a correction term +ln(k) has to be added to adjust the final result. For exponential functions, the prescaling problem is generally not as straightforward. The particular case of the function 2^N, with N expressed in floating point, can be processed as follows. Let m·2^k be the floating-point representation of N, where m is the mantissa and k the exponent. The scaling process consists of reducing N as
So, assuming E to be the integer part of N = m·2^k, and Ns = N − E its fractional part, one can write
where 2^Ns is the mantissa of 2^N and E the exponent. So the scaling process reduces the computation of 2^N to that of 2^Ns.
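Under the integer/fraction split described above, the scaling step can be mirrored in a few lines of Python (names ours):

```python
import math

def pow2_scaled(N):
    """Compute 2**N by splitting N into an integer exponent E and a
    fractional part Ns, so that 2**Ns in [1, 2[ is the mantissa."""
    E = math.floor(N)                # exponent of the result
    Ns = N - E                       # fractional part, 0 <= Ns < 1
    mantissa = 2.0 ** Ns             # in [1, 2[ -- the only nontrivial part
    return math.ldexp(mantissa, E)   # mantissa * 2**E

print(pow2_scaled(4.7))   # matches 2**4.7
```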
Trigonometric functions are also readily expressed as series; in particular,
sin(x) = x − x^3/3! + x^5/5! − x^7/7! + …
or, for n = 6 (Hörner scheme)
To get a fast convergence rate, scaling is also required in this case. Since trigonometric functions are periodic, the scaling can be a straight reduction of the argument to a limited range using trigonometric equivalence relations; this can thus be processed without calling for any postcorrection of the final result. If the argument lies in such a range, the convergence rate is fair: ε_max can be expressed as
That is, for
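Taking the sine expansion as the series in question, a Hörner-style evaluation in powers of x^2 can be sketched as follows (the function name and the step count n are ours):

```python
import math

def horner_sin(x, n):
    """Degree-(2n+1) truncated sine series, evaluated by a Horner
    scheme in x**2: x(1 - x^2/(2*3)(1 - x^2/(4*5)(...)))."""
    x2 = x * x
    acc = 1.0
    for k in range(n, 0, -1):
        acc = 1.0 - acc * x2 / ((2 * k) * (2 * k + 1))
    return x * acc

print(abs(horner_sin(0.5, 6) - math.sin(0.5)))  # tiny for a scaled argument
```

For a prescaled argument such as x = 1/2, six steps already push the truncation error far below double precision.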
Basically, polynomial approximation methods consist of building a polynomial P(x) of degree n that matches the target function at n + 1 points. The principle is general for continuous functions; Taylor–MacLaurin series are particular cases. Neither the choice of the matching points, nor the degree of the polynomial, nor the number of polynomials is fixed (e.g., several polynomials may approximate the same function on separate ranges). A number of methods and algorithms have been proposed in the literature. Moreover, prescaling is often recommended to speed up the convergence. A classical numerical calculus technique is used to compute the coefficients of the approximation polynomial once the degree and the desired matching points {x_k} are selected; it basically consists of solving n + 1 equations, linear in the {c_i}, of the form
where P(x_k) and f(x_k) are, respectively, the degree-n polynomial and the function to approximate, both evaluated at x_k. Several alternative methods exist to obtain these coefficients without actually solving (7.27). The polynomial coefficients are computed once, then stored; the way the polynomial is then evaluated for a given value of the argument defines the algorithm computing the approximated function at that argument. A Hörner-type factorization is a first approach to defining an iterative computation step.
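For small degrees, the linear system (7.27) can be solved directly. The following sketch (all names ours) fits a polynomial through chosen matching points by Gaussian elimination on the Vandermonde system:

```python
import math

def fit_poly(points, f):
    """Solve the (n+1)x(n+1) system P(x_k) = f(x_k) for the coefficients
    c_0..c_n of a degree-n polynomial (Vandermonde matrix)."""
    n = len(points)
    A = [[xk ** i for i in range(n)] + [f(xk)] for xk in points]
    for col in range(n):                        # elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= m * A[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):              # back-substitution
        s = A[r][n] - sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = s / A[r][r]
    return coeffs                               # c_0, c_1, ..., c_n

# Degree-2 approximation of sin matching three points:
c = fit_poly([0.0, 0.4, 0.8], math.sin)
```

By construction, the resulting polynomial agrees with the function at the matching points; the error between those points depends on the choice of points, which is precisely the degree of freedom the text mentions.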
Comment 7.3 Hörner Scheme Revisited In Sections 7.3.1 and 7.3.2, the Hörner scheme was mentioned as a possible iterative way to implement algorithms. Whenever pipelined or parallel arithmetic operators are available, this approach can be improved significantly; for instance, if multiple multiply-and-add and squaring units are available, a useful generalized Hörner scheme may be defined as follows.
Let
be the Hörner expansion of the polynomial P(x).
Let
An extended Hörner expression can be written
Assuming available operators computing (7.29) and squaring units, the number of steps is roughly divided by two. Expressions (7.29) and (7.30) can be generalized further; assuming
one can write
called the generalized Hörner expansion (GHE), which takes advantage of higher-level primitive polynomial operators. Expansion (7.32) corresponds to a polynomial of degree k·m − 1. Therefore, thanks to the availability of parallel and/or pipelined operators, multiple-level Hörner computation schemes can be built to speed up the overall process. For instance, a degree-63 polynomial can be computed in 9 multiply-and-add steps if a 3-level scheme is used. First, 16 degree-3 polynomials are computed (3 steps); four degree-15 polynomials are then worked out using the degree-3 polynomials as primitives (3 steps); another 3 steps are finally needed to compute the degree-63 polynomial using the degree-15 ones as primitives. A practical implementation of this circuit is shown in Chapter 14.
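The grouping idea can be sketched in a few lines; `horner` and `ghe` are our names, and the parameter k corresponds to the degree-(k − 1) primitives of the GHE:

```python
def horner(coeffs, x):
    """Evaluate c_0 + c_1*x + c_2*x**2 + ... by the classical Horner scheme."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def ghe(coeffs, x, k):
    """Generalized Horner expansion: split the coefficients into groups of k,
    evaluate each group as a degree-(k-1) primitive polynomial, then apply
    Horner in x**k. The primitives are independent, hence computable in
    parallel (or in a pipeline)."""
    prims = [horner(coeffs[j:j + k], x) for j in range(0, len(coeffs), k)]
    return horner(prims, x ** k)

coeffs = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0, -0.5]   # a degree-7 example
```

Both groupings (k = 2, k = 4) reproduce the plain Hörner value; in hardware, the gain is that the m primitive evaluations proceed concurrently, leaving only the short Hörner chain in x^k on the critical path.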
Convergence methods consist of two parallel processes on two related sequences; typically, one sequence converges to 1 (multiplicative normalization) or to 0 (additive normalization), while the other converges to the function to approximate. Division using Goldschmidt's algorithm is an example of multiplicative normalization: while the divisor sequence converges to 1, the dividend converges to the desired quotient.
Define
as the multiplicative normalizing function, where ai is selected in such a way that the sequence
converges toward 1. Then the sequence
can be set to converge toward the result ln(x). If y(0) and x(0) are, respectively, set to 0 and to the argument x, and assuming x(p) converges to 1, one can write
To make the convergence of (7.34) possible, the argument x needs to be in a range such that
that is,
This means that the argument x may need to be prescaled to fall within the range (7.37). An argument x in the range [1, 2[ (such as, e.g., a floating-point mantissa) fits perfectly; otherwise a straightforward prescaling operation replaces x by x′ such that x = x′·2^s (x′ in [1, 2[). The algorithm computes ln(x′); then a final additive correction of s·ln(2) is applied. Observe that the lower bound of (7.37) can be lowered to 0.21, as (1 + 2^0) can be accepted as a first normalizing factor for computing x(1).
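The range bounds follow from the extreme products of the normalizing factors (1 + a_i·2^−i); a quick numerical check (variable names ours):

```python
# Extreme products of the factors (1 + a_i * 2**-i), i >= 1:
grow = 1.0     # all a_i = +1
shrink = 1.0   # all a_i = -1
for i in range(1, 60):
    grow *= 1.0 + 2.0 ** (-i)
    shrink *= 1.0 - 2.0 ** (-i)

lo = 1.0 / grow     # smallest argument that can be driven to 1
hi = 1.0 / shrink   # largest argument that can be driven to 1
print(lo, hi)       # roughly 0.419 and 3.463
# Allowing the extra factor (1 + 2**0) = 2 halves the lower bound:
print(lo / 2)       # roughly 0.21, as noted in the text
```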
In practical implementations of this algorithm, look-up tables are used to read out the successive values of ln(1 ± 2^−i), needed to compute y(i + 1) of (7.35). For x in [1/2, 2[, a_i can be selected according to the following rules:
The above rules are justified by the following two lemmas, which also show that precision p is reached after p steps (linear convergence).
Lemma 7.1 Let
be the n-bit auxiliary sequence vector at step k; then
Proof The left inequality is trivial: it corresponds to ε = 0. The right inequality is deduced from the computation of x(k)·(1 − 2^−k) for ε maximum, that is, 2^−k − 2^−n.
The practical interpretation of (7.42) is the impact of rule (7.39) on x(k + 1) whenever x(k) is greater than one with a fractional part made up of a (k − 1)-zero string followed by a one at position k. Then x(k + 1) will be either greater than one, exhibiting a similar pattern with at least one more zero, or less than one (x_0(k + 1) = 0) with at least 2k ones heading the fractional part. In both cases, the target value x(p) = 1 is approximated by x(k + 1) with at least one more bit.
Lemma 7.2 Let
be the n-bit auxiliary sequence vector at step k, then
Proof The right inequality is trivial: it corresponds to ε = 0. The left inequality is deduced from the computation of x(k)·(1 + 2^−k) for ε maximum, that is, 2^−k − 2^−n.
The practical interpretation of (7.44) is the impact of rule (7.40) on x(k + 1) whenever x(k) is less than one with a fractional part made up of a k-one string followed by a zero at position k + 1. Then x(k + 1) will be either less than one, exhibiting a similar pattern with at least 2k ones heading the fractional part, or greater than one (x_0(k + 1) = 1) with at least k + 1 zeros heading the fractional part. In both cases, the target value x(p) = 1 is approximated by x(k + 1) with at least one more bit.
Comment 7.4
Algorithm 7.8 Logarithm Computation by Multiplicative Normalization
The argument x is in [1/2, 2[: x = x(0).x(1)x(2) … x(n). Let xx(i, j) be component j of xx(i) = xx(i, 0).xx(i, 1)xx(i, 2) … xx(i, n). Let lut(i) = ln(1 + a(i)·2^−i), read from the table.
a(0):=0; c(0):=1; xx(1):=x; yy(1):=0;
for i in 1..p-1 loop
  if xx(i)=1 then exit; end if;
  if xx(i)>1 then a(i):=-xx(i,i);
  else a(i):=xx(i,i)*not(xx(i,i+1));
  end if;
  c(i):=1+a(i)*2**(-i);
  xx(i+1):=xx(i)*c(i);
  yy(i+1):=yy(i)-lut(i);
end loop;
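A floating-point transcription of the method can be sketched in Python. Instead of the bit-pattern selection rules, this version (our own code, not the book's) greedily picks the a_i in {−1, 0, 1} that drives x(i) closest to 1, which has the same normalizing effect:

```python
import math

def ln_mult_norm(x, p):
    """Logarithm by multiplicative normalization: drive x toward 1 with
    factors (1 + a_i*2**-i) while accumulating y -= ln(1 + a_i*2**-i).
    Valid for x roughly in [1/2, 2[."""
    y = 0.0
    for i in range(1, p):
        # greedy stand-in for rules (7.38)-(7.40): bring x closest to 1
        a = min((-1, 0, 1),
                key=lambda s: abs(x * (1.0 + s * 2.0 ** (-i)) - 1.0))
        if a != 0:
            x *= 1.0 + a * 2.0 ** (-i)             # auxiliary sequence -> 1
            y -= math.log(1.0 + a * 2.0 ** (-i))   # table value lut(i)
    return y

print(ln_mult_norm(1.71875, 40))   # close to ln(1.71875) = 0.5415972...
```

As in the lemmas, each step secures roughly one more bit, so p = 40 yields far more than the 8- or 10-bit precision of the worked examples.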
Example 7.8 In this example the auxiliary sequence x(i) is computed in binary while, for readability, the sequence y(i) is computed in decimal; the precision can then be readily verified. The functional values ln(1 ± 2^−i) are assumed to be given by look-up tables. x is in [1, 2[.
Let x = 1.71875 = 1.10111000 (binary).
Compute ln (x) with precision p = 8.
The actual decimal value of ln(1.71875) is 0.541597282 ± 10^−9; the difference from the computed result is less than 8·10^−4 < 2^−10.
As appears in the preceding example, whenever a_i = 0, the only effect of step i on the computation process is to increment the step number; both sequences x(i) and y(i) remain unchanged. So, by detecting strings of 0's or 1's in x(i), one can readily jump to the next nontrivial computation step. The following example illustrates this feature.
Example 7.9 As in the preceding example, the auxiliary sequence x(i) is computed in binary, while the sequence y(i) is computed in decimal. The functional values ln(1 ± 2^−i) are given by look-up tables. x is now in [1/2, 2[. Strings 00… or 11… are highlighted in bold type.
Let x = 0.59375 = 0.10011000 (binary).
Compute ln (x) with precision p = 10 (see Table 7.3).
The actual decimal value of ln(0.59375) is −0.521296923 ± 10^−9; the difference from the computed result is less than 4·10^−6 < 2^−10.
Define
as the additive normalizing function, where a_i is selected in such a way that the sequence
converges toward 0. Then the sequence (main sequence)
can be set to converge toward the result e^x. If y(0) and x(0) are, respectively, set to 1 and to the argument x, and assuming x(p) converges to 0, one can write
To make the convergence of (7.46) possible, the argument x needs to be in a range such that
that is,
Observe that the upper bound of (7.48) can be raised to 1.562, as ln(1 + 2^0) can be accepted as the first normalizing factor for computing x(1). This extends the range (7.49) to ]−1.242, 1.562]. Here again, the argument x may need to be prescaled to fall within the (extended) range (7.49). The range [1, 2[ (floating-point mantissa) no longer fits; the range [−1, 1[ does. A possible prescaling operation replaces x by x′ such that x = x′·2^s (x′ in [−1, 1[). The algorithm first computes e^x′; then e^x is obtained after an s-fold squaring (square rooting if s < 0) correction on e^x′. In particular, a single shift reduces the range [1, 2[ to [1/2, 1[: x = x′·2 → x′ in [1/2, 1[; squaring e^x′ provides e^x.
In practical implementations of this algorithm, look-up tables are used to read out the successive values of ln(1 ± 2^−i), needed to compute x(i + 1) of (7.46). For x in [−1, 1[, a_i can be selected according to the following rule:
whenever nonzero, a_i has the same sign as x(i); then
This simple rule ensures that, at each step, |x(i)| will either decrease or stay unchanged; a_i = 0 prevents x(i) from increasing. Precision p is reached after p steps.
An interesting feature of the method comes from the following relation:
ln(1 ± 2^−i) = ±2^−i − 2^−2i/2 ± 2^−3i/3 − …
±2^−i is the first term of the Taylor–MacLaurin series (7.19). The approximation ln(1 ± 2^−i) ≅ ±2^−i produces an error bounded by 2^−2i, that is, less than 2^−20 after 10 steps. This very fast convergence behavior (see Table 7.3) can be used to speed up the algorithm. If the desired precision p is large enough, the algorithmic procedure can be stopped after p/2 iteration steps; at this point one can write
corresponding to the selection
Actually (7.52), the convergence of x(p) to zero, is settled in one final step: for p/2 large enough, the residue x(p/2) = Σ_{i>p/2} x_i·2^−i can be approximated by ln(1 + x(p/2)).
On the other hand, for p large enough (2^−i small enough),
As a consequence, the final p/2 iteration steps can be replaced by the single step
y(p) = y(p/2)·(1 + x(p/2)).   (7.54)
The auxiliary sequence is built up to x(p/2), then settled to x(p) = 0 by (7.52); the main sequence y(p) can be computed from y(p/2) in just one computation step (7.54). Algorithm 7.9 summarizes the general procedure. Chapter 14 describes an implementation including the above-mentioned one-step simplification for the last p/2 steps.
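The shortcut is easy to check numerically: run the additive normalization for p/2 steps only, then finish with the single multiplication (7.54). A Python sketch (our own naming), using the sign rule stated above:

```python
import math

def exp_fast_tail(x, p):
    """Additive normalization for p/2 steps, then the one-step finish
    y(p) = y(p/2)*(1 + x(p/2)), valid because ln(1+u) ~ u for small u."""
    y = 1.0
    for i in range(p // 2):
        if x == 0.0:
            break
        s = 1 if x > 0 else -1
        if s < 0 and i == 0:
            continue                      # ln(1 - 2**0) is -inf: a_0 = 0
        step = math.log(1.0 + s * 2.0 ** (-i))
        if abs(step) <= abs(2.0 * x):     # otherwise a_i = 0 (no overshoot)
            x -= step
            y *= 1.0 + s * 2.0 ** (-i)
    return y * (1.0 + x)                  # single final step (7.54)

print(exp_fast_tail(0.65625, 32))   # close to e**0.65625
```

After p/2 steps the residue is of the order of 2^−p/2, so the quadratic error of the final approximation is of the order of 2^−p, as the text argues.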
Algorithm 7.9 Exponential Computation by Additive Normalization
The argument x is in [−1, 1[. Let lutn(i) = (1 − 2^−i), lutp(i) = (1 + 2^−i), lutlnn(i) = ln(1 − 2^−i), and lutlnp(i) = ln(1 + 2^−i) be read from look-up tables.
xx(0):=x; yy(0):=1;
for i in 0..p-1 loop
  if xx(i)=0 then exit; end if;
  if xx(i)<0 then
    if abs(lutlnn(i))<=abs(2*xx(i)) then
      xx(i+1):=xx(i)-lutlnn(i); yy(i+1):=yy(i)*lutn(i);
    else
      xx(i+1):=xx(i); yy(i+1):=yy(i);
    end if;
  else
    if abs(lutlnp(i))<=abs(2*xx(i)) then
      xx(i+1):=xx(i)-lutlnp(i); yy(i+1):=yy(i)*lutp(i);
    else
      xx(i+1):=xx(i); yy(i+1):=yy(i);
    end if;
  end if;
end loop;
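A direct floating-point transcription of Algorithm 7.9 in Python, reading the tables as calls to `math.log` (the guard at i = 0 for negative arguments reflects that ln(1 − 2^0) is −∞, so a_0 = 0 there; function name ours):

```python
import math

def exp_add_norm(x, p):
    """Exponential by additive normalization (after Algorithm 7.9):
    drive x toward 0 by subtracting ln(1 +/- 2**-i) while the main
    sequence accumulates the matching factors (1 +/- 2**-i)."""
    y = 1.0
    for i in range(p):
        if x == 0.0:
            break
        s = 1 if x > 0 else -1                    # a_i has the sign of x(i)
        if s < 0 and i == 0:
            continue                              # lutlnn(0) = -inf: a_0 = 0
        lutln = math.log(1.0 + s * 2.0 ** (-i))   # table entry
        if abs(lutln) <= abs(2.0 * x):            # otherwise a_i = 0
            x -= lutln
            y *= 1.0 + s * 2.0 ** (-i)
    return y

print(exp_add_norm(0.65625, 50))   # close to e**0.65625 = 1.9275504...
```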
Example 7.10 For the sake of readability, the following example is treated in decimal. Table 7.4 exhibits the values of ln(1 + 2^−i) and ln(1 − 2^−i) for i = 0, 1, …, 15. This table is used to compute the successive values of x(i).
The problem at hand is to compute e^0.65625 with precision 2^−32. The first 16 steps are displayed in Table 7.5. The desired precision is reached through a final single step computing (7.54) with p/2 = 16. The result is 1.92755045011, while the exact result rounded to the 11th decimal is e^0.65625 = 1.92755045016, a difference of around 5·10^−11 < 2^−32.
Comment 7.5 The general exponentiation function x^y may be computed as
x^y = e^(y·ln x),
which may be calculated by a logarithm computation, a multiplication, and an exponential. The particular cases of integer powers (squaring, cubing, etc.) may be calculated through customized multiplication schemes. Convergence methods, such as the one described above, are most often used to implement exponentiation algorithms. An important particular case is the square root (power 1/2), for which a classical digit recurrence algorithm, very similar to division, can easily be adapted. Square rooting is reviewed in Section 7.4.
CORDIC algorithms ([VOL1959], [ERC1987], [DEL1989], [VOL2000]) belong to the class of linear convergence methods. They are particularly well suited to the evaluation of trigonometric functions, although they apply to a number of other functions such as the logarithm, the exponential, or the square root. Historically, the Coordinate Rotation Digital Computer was designed as a special-purpose digital computer for real-time airborne computation. CORDIC techniques were then extended to solving trigonometric relationships involved in plane coordinate rotation and in conversion from Cartesian to polar coordinates. CORDIC algorithms converge linearly (one bit per cycle) and are rather simple to implement in practice.
Conceptually, the method consists of computing the new coordinates of a given vector after rotation by a given angle α; after rotation, the unit vector (0, 0) − (1, 0) would get its new end point coordinates as (cos α, sin α).
Let (0, 0) − (x_1, y_1) be the vector (0, 0) − (x_0, y_0) rotated by an angle α (Figure 7.1); the basic rotation formulas are given by
x_1 = x_0·cos α − y_0·sin α,   (7.56)
y_1 = x_0·sin α + y_0·cos α.   (7.57)
The CORDIC algorithm defines, as the auxiliary sequence, the residual rotation remaining to be completed after step i, namely,
a_{i+1} = a_i − α_i.   (7.58)
If a_0 is the desired rotation angle, the auxiliary sequence will be set to converge to zero while the main sequence converges toward the function to evaluate. The method replaces rotations as defined by formulas (7.56) and (7.57) by pseudorotations, as illustrated in Figure 7.2. Pseudorotations expand coordinates by a factor of 1/cos α_i = (1 + tan^2 α_i)^(1/2); (7.56) and (7.57) for step i are replaced by
and
Assuming x0 and y0 to be the initial coordinates x and y, after p rotations by angle αi, the recursive use of (7.59) and (7.60) leads to
and
On the other hand, assuming a to be the overall rotation angle, that is, the argument of the trigonometric function to be evaluated, then the auxiliary sequence is written
The set {α_i} is selected in order to make (7.61) and (7.62) easy to calculate and, in particular, to allow the scale factor k = Π_i (1 + tan^2 α_i)^(1/2) to be precomputed. To this end, {α_i} is preset except for the sign; convergence is then achieved by a suitable sign-selection strategy. In the following, an algorithm is developed for computing the sine and cosine functions.
Let the set {α_i} be such that
tan α_i = s_i·2^−i, s_i ∈ {−1, +1}.   (7.63)
Equations (7.59) and (7.60) can then be written
x_{i+1} = x_i − s_i·y_i·2^−i
and
y_{i+1} = y_i + s_i·x_i·2^−i,
while (7.58) becomes
a_{i+1} = a_i − s_i·tan^−1(2^−i).
After p iterations, a_p is assumed close enough to zero; the values (7.63), computed iteratively by (7.65) and (7.66), are introduced in (7.61) and (7.62), respectively, leading to
and
where the factor k = Π_i (1 + tan^2 α_i)^(1/2) can be precomputed, since tan^2 α_i does not depend on the sign of α_i. Then, setting the initial values x_0 = x = 1/k and y_0 = y = 0, the cosine and sine functions are given by x_p and y_p, respectively.
Precomputed values of tan^−1(2^−i) are stored in a look-up table; the size of the look-up table sets the precision p. Table 7.6 shows the first 25 values of tan^−1(2^−i), together with the fast-converging values k = 1.646760258121 and x_0 = x = 1/k = 0.607252935009. The strategy for selecting the sign s_i is trivial: since each angle α_i (Table 7.6) is smaller than the preceding angle α_{i−1} but greater than half of it, convergence may be achieved by selecting s_i as the sign of the residual angle a_i. The domain of convergence is greater than [−90°, 90°], so any trigonometric value can be computed (using trigonometric relations for angle reduction where needed). More precisely, for 25-step precision, the range is [−99.88°, 99.88°], ± the sum of the α_i listed in Table 7.6. Algorithm 7.10 describes the CORDIC procedure for computing sin a and cos a. Without loss of generality, the decimal system has been used for better readability.
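The quoted scale factor can be reproduced directly from its defining product (variable names ours):

```python
import math

# k = product over i >= 0 of (1 + tan^2(alpha_i))**(1/2), with
# tan(alpha_i) = 2**-i, so each factor is sqrt(1 + 4**-i).
k = 1.0
for i in range(30):
    k *= math.sqrt(1.0 + 4.0 ** (-i))

print(k, 1.0 / k)   # 1.646760258121... and 0.607252935009...
```

Thirty factors suffice: the remaining terms differ from 1 by less than 4^−30, which is below double-precision resolution.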
Algorithm 7.10 Sine and Cosine CORDIC Procedure
Let lut(i) be α_i = tan^−1(2^−i), read from the table.
x(0):=0.607252935009; y(0):=0; a(0):=a;
if a(0)=0 then x(0):=1; y(0):=0; exit; end if;
for i in 0..p-1 loop
  if a(i)=0 then exit; end if;
  if a(i)>0 then s(i):=1; else s(i):=-1; end if;
  x(i+1):=x(i)-(y(i)*s(i)*2**(-i));
  y(i+1):=y(i)+(x(i)*s(i)*2**(-i));
  a(i+1):=a(i)-(s(i)*lut(i));
end loop;
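A floating-point transcription of Algorithm 7.10 in Python; the scale factor is recomputed instead of hard-coded, and the names are ours:

```python
import math

def cordic_sin_cos(a, p):
    """Return (sin a, cos a) for a residual angle a (radians) within the
    CORDIC convergence range (about +/-1.74 rad), using p pseudorotations."""
    k = 1.0
    for i in range(p):
        k *= math.sqrt(1.0 + 4.0 ** (-i))   # pseudorotation gain
    x, y = 1.0 / k, 0.0                     # x(0) = 1/k, y(0) = 0
    for i in range(p):
        s = 1 if a >= 0 else -1             # s_i = sign of residual angle
        # simultaneous update: both new coordinates use the old (x, y)
        x, y = x - s * y * 2.0 ** (-i), y + s * x * 2.0 ** (-i)
        a -= s * math.atan(2.0 ** (-i))     # lut(i) = arctan(2**-i)
    return y, x                             # (sin a, cos a)

s75, c75 = cordic_sin_cos(math.radians(75), 40)
print(s75, c75)   # close to sin 75 deg and cos 75 deg
```

Note the simultaneous tuple assignment: updating x before y (or vice versa) would silently change the recurrence.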
In binary, the operations involved in one step of Algorithm 7.10 are two i-bit shifts, one table look-up, and three signed additions. The computations of x(i + 1), y(i + 1), and a(i + 1) are independent; hardware implementations can thus profit from this potential parallelism. In that case, the elementary step delay is that of a table look-up (or a shift, whichever is slower) plus a signed addition. Algorithm 7.10 is illustrated in Example 7.11.
Example 7.11 Let a = 75°; compute the binary expressions of cos a and sin a. Table 7.7 displays the first steps of the algorithm. Since the binary system is at hand, the initial value of x, namely x(0) = 0.607252935009 (decimal), has to be expressed in binary with the appropriate precision. In this example, 13 bits have been used (x(0) = 0.1001101101110) to set the procedure for a 12-bit precision. The results after 5 steps (Table 7.7) are cos 75° = x(5) = 0.0100001011011 and sin 75° = y(5) = 0.1111011011111. The first seven bits are correct, but this precision cannot yet be relied upon because too few steps have been completed. The average convergence rate (1 bit per step) is reached after a sufficient number of steps, namely, whenever tan^−1 2^−i is close enough to 2^−i; Table 7.6 shows that for i = 10, tan^−1 2^−i may be set to 2^−i with an error less than 4·10^−7.
The next decimal example runs 28 computational steps (Table 7.8). In this example, the results are cos 30° = x(28) = 0.8660254031 and sin 30° = y(28) = 0.5000000012. In both cases, the error is O(10^−9), that is, O(2^−28).