Chapter 14. Floating-Point Deux

Now that floating-point values have been discussed, it is time to look at some of the operations that can be performed with them, such as bit masking and comparisons.

Why would someone wish to generate a bit mask for a floating-point number? Because a floating-point value is stored as separate sign, exponent, and mantissa bit fields, it can be manipulated directly through its bits; clearing the sign bit, for example, yields the absolute value.

Workbench Files: \Bench\x86\chap14\project\platform

                 project    platform
3D (Special)     vsf3d      vc6
4vec (Special)   qvsf3d     vc.net

SQRT — Square Root

The reciprocal and square root are two mathematical operations that have special functionality with vector processors. The division operation is typically performed by multiplying the reciprocal of the denominator by the numerator. A square root is not always just a square root; sometimes it is a reciprocal square root. So first we examine some simple forms of these.

Equation 14-1. Reciprocal

$$ \frac{y}{x} = y \times \frac{1}{x} $$

Another way to remember this is:

Equation 14-2. Square root

Square root

The simplified form of this parallel instruction individually calculates the square root of each of the packed floating-point values, and returns the result in the destination. Some processors support the square root instruction directly, but some processors, such as the 3DNow! instruction set, actually support it indirectly through instructional stages. And some processors support it as a reciprocal square root.

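So now I pose a little problem. We hopefully all know that a negative number should never be passed into a square root because computers go BOOM, as they have no idea how to deal with the imaginary unit (i). In practice the hardware signals an invalid operation and, if that exception is masked, returns a NaN.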

$$ \sqrt{-1} = i $$

With that in mind, what is wrong with a reciprocal square root? Remember your calculus and limits?

$$ \lim_{x \to 0^{+}} \frac{1}{\sqrt{x}} $$

Okay, how about this one?

$$ \frac{1}{\sqrt{0}} = \frac{1}{0} $$

Do you see it now? You cannot divide by zero; the result is undefined, and as x approaches zero the quotient grows toward infinity. So what has to be done is to trap x when it is too close to zero and then substitute the value of one as the result of the reciprocal square root.

  y = (x < 0.0000001) ? 1.0 : (1 / sqrt(x)); // Too close to zero

It is not perfect, but it is a solution. When x is that close to zero, the true reciprocal is too large to be useful, so the substituted value of one simply passes the other operand of the multiplication through unchanged. That is the multiplicative identity: 1 × n = n. But how to deal with this in vectors? Well, you just learned the trick in this chapter! Remember the packed comparison? It is just a matter of using masking and bit blending. And when only a reciprocal square root is available, the square root itself is easily obtained by multiplying that result by the original x value. Recall that the square of a square root is the original value.

$$ \sqrt{x} = x \times \frac{1}{\sqrt{x}}, \qquad \left(\sqrt{x}\right)^{2} = x $$
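The packed comparison, masking, and bit blending mentioned above can be sketched with SSE intrinsics. This is only an illustration under stated assumptions: the helper name GuardedRsqrt4 and the threshold constant are made up for this sketch and are not from the book's workbench files, and _mm_rsqrt_ps returns an approximation rather than a full-precision result.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the book). For each of the four lanes:
//   y = (x < kTiny) ? 1.0f : approximately 1/sqrt(x)
// built from a packed comparison and a bitwise blend, with no branches.
inline __m128 GuardedRsqrt4(__m128 x)
{
    const __m128 kTiny = _mm_set1_ps(0.0000001f);  // assumed "too close to zero" threshold
    const __m128 kOne  = _mm_set1_ps(1.0f);

    const __m128 mask = _mm_cmplt_ps(x, kTiny);    // all ones in lanes where x < kTiny
    const __m128 est  = _mm_rsqrt_ps(x);           // estimate in every lane (inf/NaN in bad lanes)

    // Blend: 1.0 where the mask is set, the estimate everywhere else.
    return _mm_or_ps(_mm_and_ps(mask, kOne),
                     _mm_andnot_ps(mask, est));
}
```

Lanes holding zero or a negative value never expose the infinity or NaN that the estimate produced, because the AND-NOT clears those lanes before the OR merges in the substitute value of one.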

1×SPFP Scalar Square Root

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
SQRTSS     -   -     -    -     -      ✓     ✓      ✓     ✓      ✓

SSE    sqrtss xmmDst, xmmSrc/m32    Single-precision    128

This SIMD instruction is a 128-bit scalar operation that calculates the square root of only the lowest single-precision floating-point block of xmmSrc (or of the 32-bit memory operand). The result is stored in the lowest single-precision floating-point block of destination xmmDst, and the remaining bit blocks of xmmDst are left intact.

4×SPFP Square Root

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
SQRTPS     -   -     -    -     -      ✓     ✓      ✓     ✓      ✓

SSE    sqrtps xmmDst, xmmSrc/m128    Single-precision    128

This SIMD instruction is a 128-bit parallel operation that calculates the square root of the four single-precision floating-point blocks contained within xmmSrc, and stores the result in the single-precision floating-point blocks at destination xmmDst.
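The difference between the scalar and packed forms can be seen through the compiler intrinsics that map to them. The following is a small sketch, not one of the book's listings; the wrapper names SqrtLowLane and SqrtAllLanes are assumptions chosen for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Illustrative wrappers (names are assumptions, not from the book).

// SQRTSS: square root of the lowest lane only; lanes 1..3 pass through unchanged.
inline __m128 SqrtLowLane(__m128 v)
{
    return _mm_sqrt_ss(v);
}

// SQRTPS: square root of all four lanes in parallel.
inline __m128 SqrtAllLanes(__m128 v)
{
    return _mm_sqrt_ps(v);
}
```

Given the four floats {4, 9, 16, 25} with 4 in the lowest lane, the scalar form yields {2, 9, 16, 25} while the packed form yields {2, 3, 4, 5}.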


1×DPFP Scalar Square Root

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
SQRTSD     -   -     -    -     -      -     ✓      ✓     ✓      ✓

SSE2    sqrtsd xmmDst, xmmSrc/m64    Double-precision    128

This SIMD instruction is a 128-bit scalar operation that calculates the square root of only the lowest double-precision floating-point block of xmmSrc (or of the 64-bit memory operand), and stores the result in the lowest double-precision floating-point block of destination xmmDst. The remaining bit blocks of xmmDst are left intact.

2×DPFP Square Root

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
SQRTPD     -   -     -    -     -      -     ✓      ✓     ✓      ✓

SSE2    sqrtpd xmmDst, xmmSrc/m128    Double-precision    128

This SIMD instruction is a 128-bit parallel operation that calculates the square root of the two double-precision floating-point blocks contained within xmmSrc, and stores the result in the double-precision floating-point blocks at destination xmmDst.


1×SPFP Scalar Reciprocal Square Root (15-Bit)

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
PFRSQRT    -   -     -    ✓     ✓      -     -      ✓     -      -
RSQRTPS    -   -     -    -     -      ✓     ✓      ✓     ✓      ✓
RSQRTSS    -   -     -    -     -      ✓     ✓      ✓     ✓      ✓

3DNow!   pfrsqrt mmxDst, mmxSrc/m32     Single-precision   32/64
SSE      rsqrtss xmmDst, xmmSrc/m32     Single-precision   32
         rsqrtps xmmDst, xmmSrc/m128    Single-precision   128

The PFRSQRT instruction is a 32-bit scalar operation that calculates an approximate reciprocal square root of only the lowest single-precision floating-point block of mmxSrc, and stores the duplicate result in the low and high single-precision floating-point blocks at destination mmxDst. RSQRTSS does the same for the lowest block of an XMM register and leaves the upper blocks of xmmDst intact, while RSQRTPS estimates the reciprocal square root of all four blocks.

Pseudo Vec

(Float) Square Root

Example 14-1. ...\chap14\fsf\Fsf.cpp

   void vmp_FSqrt(float * const pfD, float fA)
   {
     ASSERT_PTR4(pfD);
     ASSERT_NEG(fA);          // Watch for negative

     *pfD = sqrtf(fA);        // D = sqrt(A)
   }

Pseudo Vec (x86)

vmp_FSqrt (3DNow!) Fast Float 15-Bit Precision

A square root is time consuming and should be avoided whenever possible. If it is indeed needed, then the next choice is between an imprecise but quick calculation and a more accurate but slower one. The following code is a simple scalar square root with 15-bit accuracy, supported by the 3DNow! instruction set.

        movd mm0,fA       ; {0 fA}
        mov  edx,pfD      ; float destination

Example 14-2. ...\chap14\fsf\FsfX86M.asm

SPFP Square Root (2 Stage) (24-Bit)

A more precise version of the previous calculation takes advantage of the two-stage vector instructions PFRSQIT1 and PFRCPIT2, in conjunction with the result of the reciprocal square root instruction PFRSQRT, to achieve a higher 24-bit precision. It uses a variation of the Newton-Raphson reciprocal square root approximation.

Mnemonic   P   PII   K6   3D!   3Mx+   SSE   SSE2   A64   SSE3   E64T
PFRCPIT2   -   -     -    ✓     ✓      -     -      ✓     -      -
PFRSQIT1   -   -     -    ✓     ✓      -     -      ✓     -      -

First stage for 24-bit reciprocal square root:

3DNow!   pfrsqit1 mmxDst, scalar(mmx/m32)   Single-precision   64

Second stage for 24-bit reciprocal and/or square root (see reciprocals in Chapter 13):

3DNow!   pfrcpit2 mmxDst, scalar(mmx/m32)   Single-precision   64
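The idea behind the two refinement stages can be sketched in plain C++ as a single textbook Newton-Raphson step for the reciprocal square root. This is an illustration under assumptions: the names RefineRsqrt and SqrtFromRsqrt are hypothetical, and the arithmetic shown is the generic iteration, not necessarily the exact operations that PFRSQIT1 and PFRCPIT2 perform internally.

```cpp
#include <cassert>

// One Newton-Raphson step for y ~ 1/sqrt(x):
//   y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
// Each step roughly doubles the number of correct bits in the estimate.
// Hypothetical helper; not the exact arithmetic of PFRSQIT1/PFRCPIT2.
inline float RefineRsqrt(float x, float y0)
{
    assert(x > 0.0f);  // the iteration is only meaningful for positive x
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}

// Square root recovered from a reciprocal square root: sqrt(x) = x * (1/sqrt(x)).
inline float SqrtFromRsqrt(float x, float rsqrt)
{
    return x * rsqrt;
}
```

Starting from a rough estimate such as 0.5001 for x = 4, one step brings the estimate to within float rounding of 0.5, and multiplying by x then gives a square root very close to 2.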

vmp_FSqrt (3DNow!) Standard Float 24-Bit Precision

The following is the same as the previous scalar square root algorithm but is coded for 24-bit precision. Note the addition of the PFRSQIT1 and PFRCPIT2 instructions.

        mov     edx,pfD ; (float) destination

Example 14-3. ...\chap14\fsf\FsfX86M.asm

vmp_FSqrt (SSE) Float Sqrt 24-Bit Precision

For SSE it is merely a scalar square root instruction.

Example 14-4. ...\chap14\fsf\FsfX86M.asm

Vector Square Root

Are you nuts? Vector square roots? What are you thinking?

Unless you have a top-of-the-line supercomputer, I would recommend you stay away from vector square roots. Instead, you will typically only need a single square root. If you really need vector-based square roots, remember that a square root is a long-latency operation and that on many processors the unit that computes it is not pipelined, so your code will have to wait for one to complete before the next can begin. That could take almost forever! Well, not quite. But it is still not a great idea. Also, do not forget about preventing negative numbers from being processed by a square root. They raise an invalid-operation exception, or produce a NaN when that exception is masked!

Pseudo Vec

Vector Square Root

Example 14-5. ...\chap14\vsf3d\Vsf3D.cpp

   void vmp_VecSqrt(vmp3DVector * const pvD,
              const vmp3DVector * const pvA)
   {
     pvD->x = sqrtf(pvA->x);
     pvD->y = sqrtf(pvA->y);
     pvD->z = sqrtf(pvA->z);
   }

Quad Vector Square Root

Example 14-6. ...\chap14\qvsf3d\QVsf3D.cpp

   void vmp_QVecSqrt(vmp3DQVector * const pvD,
               const vmp3DQVector * const pvA)
   {
     pvD->x = sqrtf(pvA->x);
     pvD->y = sqrtf(pvA->y);
     pvD->z = sqrtf(pvA->z);
     pvD->w = sqrtf(pvA->w);
   }

Similar to an estimated reciprocal for a division, a square root sometimes is available as an estimate as well. Be warned that the estimated square root is faster but has a lower precision. But if the lower precision is viable for your application, then investigate using the estimated square root instead.

Pseudo Vec (x86)

The 3DNow! instruction set operates on 64-bit registers, so a quad vector requires two loads and two stores, but it is a simple matter of processing each pair of floats in turn.

    mov eax,vA  ; Vector A
    mov edx,vD  ; Vector destination

vmp_QVecSqrt (3DNow!) Fast Quad Float SQRT 15-Bit Precision

Example 14-7. ...\chap14\qvsf3d\QVsf3DX86M.asm

vmp_QVecSqrt (3DNow!) Quad Float Sqrt 24-Bit Precision

In the previous code there is a comment in bold marking the insertion point for 24-bit precision. Inserting the following code there achieves the higher accuracy. It uses the Newton-Raphson reciprocal square root approximation.

Example 14-8. ...\chap14\qvsf3d\QVsf3DX86M.asm

vmp_QVecSqrt (SSE) Float Sqrt 24-Bit Precision

For SSE there is a 24-bit precision quad square root. For unaligned memory, substitute MOVUPS for the MOVAPS.

Example 14-9. ...\chap14\qvsf3d\QVsf3DX86M.asm

vmp_QVecSqrtFast (SSE) Float Sqrt Approximate

The following is a fast reciprocal square root.

Example 14-10. ...\chap14\qvsf3d\QVsf3DX86M.asm
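For readers working in C++ rather than assembly, an approximate quad square root built on the reciprocal square root estimate might look like the following. It is a hedged sketch rather than the book's listing: the name FastSqrt4 and the way zero lanes are handled are assumptions made here.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the book): approximate sqrt of four floats
// computed as x * (1/sqrt(x)) using the RSQRTPS estimate.
inline __m128 FastSqrt4(__m128 x)
{
    // x * ~1/sqrt(x); a lane holding 0 becomes 0 * inf = NaN at this point.
    const __m128 est = _mm_mul_ps(x, _mm_rsqrt_ps(x));

    // All ones in lanes where x != 0, all zeros where x == 0.
    const __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());

    // Masking forces the zero lanes back to 0.0, which is the correct root.
    return _mm_and_ps(nonzero, est);
}
```

The result carries only the precision of the estimate, so it suits cases where the text's warning about lower precision is acceptable. Negative inputs are not handled and still produce NaN.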

Graphics 101 — Vector Magnitude (aka 3D Pythagorean Theorem)

Ever hear that the shortest distance between two points is a straight line? The square of the hypotenuse of a right triangle is equal to the sum of the squares of its other two sides, and the same idea extends from 2D to 3D space. The Pythagorean equation is essentially the distance between two points, in essence the magnitude of their differences.

The first rule of a square root operation is to not use it unless you really have to, as it is a time-intensive mathematical operation. One method typically used for calculating the length of a line between two points, whether it exists in 2D or 3D space, is to use the Pythagorean equation.

2D Distance

Figure 14-1. 2D right triangle representing a 2D distance

Equation 14.3. 2D distance

$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

3D Distance

Figure 14-2. Right triangle within 3D Cartesian coordinate system representing a 3D distance and thus its magnitude

Equation 14.4. 3D distance (magnitude)

$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} $$

Mathematical Formula:

$$ \|v\| = \sqrt{x^2 + y^2 + z^2} $$

So if the dot product dp = x² + y² + z² approaches zero, the value of 1/√dp gets closer to infinity. Once dp becomes zero, the solution becomes undefined, as it would require a division by zero (1/0). When a number is that close to zero and is passed to a reciprocal square root, the accuracy becomes lost. So instead of being forced to divide by zero, the reciprocal is set to a value of one (y × 1 = y); thus the original value is preserved.

The Pythagorean equation is the distance between two points, in essence, the magnitude of their differences. In a terrain-following algorithm for creature AI, the distance between each of the creatures and the main character would be compared to make an idle, run, or flee determination. The coordinates of each object are known but their distances would have to be calculated and then compared to each other as part of a solution. Let's examine a simplistic equation utilizing r to represent the distance between the player and four monsters {mA through mD}:

Figure 14-3. Monster to player 2D distance calculations

$$ r_i = \sqrt{(x_i - x_p)^2 + (y_i - y_p)^2}, \qquad i \in \{A, B, C, D\} $$

Here (x_p, y_p) is the position of the player and (x_i, y_i) the position of monster i. Because distances are never negative, squaring both sides of each equation preserves their ordering, so the square root can be removed and the comparisons will still give the same answer.

$$ r_i^2 = (x_i - x_p)^2 + (y_i - y_p)^2, \qquad i \in \{A, B, C, D\} $$

Does this look a little similar to the sum of absolute differences operation discussed in Chapter 7? They are different by the sum of absolutes versus the sum of the squares, but they nevertheless have a similarity. The point is that there is no need to use the square root operation each time in this kind of problem. Neat, huh! It is an old trick but still an effective one.

Now suppose it has been discovered that Monster C is the closest monster. Take the square root only for that one to calculate its distance, not forgetting to use the estimated square root version if accuracy is unnecessary.
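The comparison trick can be sketched as follows: order the monsters by squared distance and take one square root at the end. The type Point2D and the functions DistSq and Closest are hypothetical names introduced for this illustration and do not come from the book's source files.

```cpp
#include <cmath>
#include <cstddef>

struct Point2D { float x, y; };  // hypothetical type for this sketch

// Squared distance: sufficient for ordering, no square root needed.
inline float DistSq(const Point2D& a, const Point2D& b)
{
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Returns the index of the monster nearest the player and writes the true
// distance to *pDist, taking exactly one square root. Assumes count > 0.
inline std::size_t Closest(const Point2D& player, const Point2D* monsters,
                           std::size_t count, float* pDist)
{
    std::size_t best = 0;
    float bestSq = DistSq(player, monsters[0]);
    for (std::size_t i = 1; i < count; ++i) {
        const float dSq = DistSq(player, monsters[i]);
        if (dSq < bestSq) {
            bestSq = dSq;
            best = i;
        }
    }
    *pDist = std::sqrt(bestSq);  // the only square root in the whole search
    return best;
}
```

With four monsters the loop performs four sums of squares and a single square root instead of four.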

Pseudo Vec

Example 14-11. ...\chap14\vsf3d\Vsf3D.cpp

  void vmp_VecMagnitude(float * const pfD,
            const vmp3DVector * const pvA)
  {
    *pfD=sqrtf(pvA->x * pvA->x
             + pvA->y * pvA->y
             + pvA->z * pvA->z);
  }

Pseudo Vec (x86)

The 3DNow! instruction set operates on 64-bit registers, so two floats are loaded or stored at a time, but the result is a simple matter of adding the two pairs of floats to each other.

     mov  eax,vA                  ; Vector A
     mov  edx,vD                  ; Vector destination

vmp_VecMagnitude (3DNow!)

Example 14-12. ...\chap14\vsf3d\Vsf3DX86M.asm

vmp_VecMagnitude (SSE) Aligned

Replace MOVAPS with MOVUPS for unaligned memory.

Example 14-13. ...\chap14\vsf3d\Vsf3DX86M.asm

Vector Normalize

Pseudo Vec

Example 14-14. ...\chap14\vsf3d\Vsf3D.cpp
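Normalizing scales a vector by the reciprocal of its magnitude, which is exactly where the too-close-to-zero guard described earlier applies. The following is a minimal sketch and not the book's listing: Vec3 is a hypothetical stand-in for vmp3DVector, and the function name and threshold are assumptions.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for the book's vmp3DVector

// Scales *pA to unit length and stores the result in *pD.
// If the squared length is too close to zero, the reciprocal is replaced
// by 1.0 so the input passes through unchanged instead of dividing by zero.
inline void Normalize(Vec3* pD, const Vec3* pA)
{
    const float dp = pA->x * pA->x + pA->y * pA->y + pA->z * pA->z;
    const float r  = (dp < 0.0000001f) ? 1.0f : 1.0f / std::sqrt(dp);

    pD->x = pA->x * r;
    pD->y = pA->y * r;
    pD->z = pA->z * r;
}
```

For example, the vector (3, 0, 4) has magnitude 5 and normalizes to approximately (0.6, 0, 0.8), while a zero vector is returned as is.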

Pseudo Vec (x86)

The 3DNow! instruction set operates on 64-bit registers, so two floats are loaded or stored at a time, but it is a simple matter of adding the two pairs of floats to each other.

   mov  eax,vA          ; Vector A
   mov  edx,vD          ; Vector destination

vmp_VecNormalize (3DNow!)

Example 14-15. ...\chap14\vsf3d\Vsf3DX86M.asm

vmp_VecNormalize (SSE) Aligned

If the data is unaligned, change the MOVAPS instruction to MOVUPS.

Example 14-16. ...\chap14\vsf3d\Vsf3DX86M.asm