Chapter 7. Interger Math

This chapter involves math related to whole numbers that you learned in grade school: the processes of addition and subtraction. There you learned about the number line in which numbers just go on forever in positive and negative directions. In computers, as the numbers increase in a positive or negative amount they actually approach a limit (the end of their finite space). Upon reaching the end of the world (Flat Earth Society), they overflow and wrap to the opposite end of the number line (quantum physics, or was that string theory?). Well, anyway, your integer range of numbers is limited by the size of the data used to store them.

Workbench Files:Benchx86chap07projectplatform

 

project

platform

Boolean Logic

pas

vc6

  

vc.net

General Integer Math

We learned negation with the NEG instruction in Chapter 4, "Bit Mangling," and binary multiplication (2N) and division using bit shifting in Chapter 5, "Bit Wrangling."

We can add 8-, 16-, 32-, and 64-bit numbers, and 4-bit if you include BCD (binary-coded decimal) discussed later in Chapter 15.

The EFLAGS Overflow, Sign, Zero, Auxiliary Carry, Parity, and Carry are all altered when using the general-purpose instructions.

Overflow

Whether the numbers overflow the binary limit of the destination operand

Sign

Set to the resulting value of the MSB (most significant bit)

Zero

Set if the result is a value of zero

Aux. Carry

Set as a result of a carry of the low-order nibble

Parity

Set to 1 if the bits add up to an even number; 0 if not

Carry

Any mathematical carry results of the mathematical operation

ADD — Add

ADC — Add with Carry

ADD destination, source

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

ADC

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADD

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

ADC — Add with Carry

add

rmDst(8/16/32/64), #(8/16/32)

Signed

add

rmDst, rSrc(8/16/32/64)

 

add

rDst, rmSrc(8/16/32/64)

 

The ADD general-purpose instruction logically sums the 8-, 16-, 32-, or 64-bit source to the destination, saves the result in destination, and sets the flags accordingly. An 8-bit source immediate value can be sign extended to 16-, 32-, or 64-bit value. A 32-bit source immediate value can be sign extended to a 64-bit destination.

d = a + b

d = b

so

d = a + d = d + a

d += a

ADC — Add with Carry

adc

rmDst(8/16/32/64), #(8/16/32)

Signed

adc

rmDst, rSrc(8/16/32/64)

 

adc

rDst, rmSrc(8/16/32/64)

 

The ADC general-purpose instruction does exactly the same operation but with one difference. The carry value of 0 or 1 is added to the ending result: d = a + b + (carry)

Flags: The flags are set as a result of the addition operation.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

×

×

×

×

×

ADC — Add with Carry

When adding a series of numbers one typically uses the non-carry ADD first, followed by ADC for each additional calculation, taking into account the resulting Carry flag.

a   =  a  +  b  +  c  +  d  +  e
a   =  a ADD b ADC c ADC d ADC e

          mov   eax,0000a5a5h
          add   eax,00000ff0h
          adc   eax,0000ff00h
          adc   eax,000ff000h
          adc   eax,00ff0000h

; Add   two 64 #'s     ecx:ebx + edx:eax
          add     ebx,eax       ; Carry results from lower 32 bits.
          adc     ecx,edx       ; Add upper 32 bits + carry

; or when using a 64-bit processor
        add     rbx,rax       ; 64 bits
; and then add with carry for a 128-bit summation.
        adc     rcx,rdx

Sums for really big numbers can be calculated in this way. Need a number 8000 bits long? You can do it in assembly by continuously adding the carry of previous arithmetic operations onto a current operation.

INC — Increment by 1

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

INC

INC — Increment by 1

INC — Increment by 1

INC — Increment by 1

INC — Increment by 1

INC — Increment by 1

INC — Increment by 1

INC — Increment by 1

32

INC — Increment by 1

32

inc

rmDst(8/16/32/64)

(Un)signed

The INC general-purpose instruction is very similar to the ADD instruction, adding a value of 1 to the 8-, 16-, 32-, or 64-bit destination. The result is saved in the destination: d = a + 1.

Flags: The flags are set as a result of the addition operation. Note that the Carry flag is unaffected.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

×

×

×

×

-

It is recommended that an ADD instruction be used instead of an INC since flag dependencies can cause stalling due to previous writes to the EFLAG registers.

inc     eax

Use this instead:

add     eax,1

Various forms of indirect memory reference, such as the following, are still allowed:

inc     [eax]
inc     [eax+ebx*4]

This applies to all the general-purpose registers.

No 64-bit

Note

There is no Carry flag set for this instruction. If a carry is needed, use ADD reg/mem, 1.

Tip

The 80×86 is one of the few processors that can perform an operation upon a memory location, not necessarily only in a register. An instruction cannot be preempted by another thread or process in the middle of its execution. It must complete its execution first!

Thread-Safe Increment

 #ifdef _M_IX86
 #define  QUICK_INCREMENT(v) __asm inc v;
 #else
 #define  QUICK_INCREMENT(v) 
     EnterMutex() 
     v++;    
     ExitMutex()
 #endif

XADD — Exchange and Add

XADD destination, source

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

XADD

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

XADD — Exchange and Add

xadd

rmDst, rSrc(8/16/32/64)

Signed

The XADD general-purpose instruction first exchanges the values in the source general-purpose register and destination just like the XCHG instruction and then logically sums the 8-, 16-, 32-, or 64-bit source to the destination, saves the result in destination, and sets the flags accordingly.

This instruction is rarely used except in the case where a register exchange needs to take place so as to preserve the old value. This is equivalent to four separate instructions:

 

edx = D

eax = A

ebx = tmp

 

OLD

NEW

 

mov ebx,eax

OLD

NEW

NEW

mov eax,edx

OLD

OLD

NEW

mov edx,ebx

NEW

OLD

NEW

add edx,eax

NEW+OLD

OLD

 
 

NEW+OLD

OLD

 

...or two instructions by using the XCHG instruction.

 

edx = D

eax = A

 

OLD

NEW

xchg edx,eax

NEW

OLD

add edx,eax

NEW+OLD

OLD

 

NEW+OLD

OLD

As you can see, the old value is preserved in the source register, and the destination memory gets the sum of the new and old values.

Flags: The flags are set as a result of the addition operation.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

×

×

×

×

×

SUB — Subtract

SBB — Subtract with Borrow

SUB destination, source

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

SUB

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

SBB — Subtract with Borrow

sub

rmDst(8/16/32/64), #(8/16/32)

Signed

sub

rmDst, rSrc(8/16/32/64)

 

sub

rDst, rmSrc(8/16/32/64)

 

The SUB general-purpose instruction logically subtracts the 8-, 16-, 32-, or 64-bit source from the destination, saves the result in destination, and sets the flags accordingly. An 8-bit immediate source value can be sign extended to 16-, 32-, or 64-bit value. A 32-bit immediate source value can be sign extended to a 64-bit destination: d = b – a.

sbb

rmDst(8/16/32/64), #(8/16/32)

Signed

sbb

rmDst, rSrc(8/16/32/64)

 

sbb

rDst, rmSrc(8/16/32/64)

 

The SBB general-purpose instruction does exactly the same thing but with one difference: The Carry flag is referred to as the borrow flag with a value of 0 or 1, which is subtracted from the ending result.

6d = b – a – (Carry)

Flags: The flags are set as a result of the subtraction operation. When carry is set (1) it indicates a borrow is needed.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

×

×

×

×

×

SBB — Subtract with Borrow

When subtracting a series of numbers one typically uses SUB for the first subtraction, followed by SBB for each additional subtraction, taking into account the Carry flag, which indicates a borrow is required from the next calculation.

a = a – b – c – d – e

a = a SUB b SBB c SBB d SBB e

        mov     eax,0000a5a5h
        sub     eax,00000ff0h
        sbb     eax,0000ff00h
        sbb     eax,000ff000h
        sbb     eax,00ff0000h

DEC — Decrement by 1

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

DEC

DEC — Decrement by 1

DEC — Decrement by 1

DEC — Decrement by 1

DEC — Decrement by 1

DEC — Decrement by 1

DEC — Decrement by 1

DEC — Decrement by 1

32

DEC — Decrement by 1

32

dec

rmDst(8/16/32/64)

(Un)signed

The DEC general-purpose instruction is very similar to the SUB instruction, subtracting a value of 1 from the 8-, 16-, 32-, or 64-bit destination. The result is saved in the destination: d = a – 1.

Flags: The flags are set as a result of the subtraction operation. Note that the carry is unaffected.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

×

×

×

×

-

It is recommended that a SUB instruction be used instead of a DEC since flag dependencies can cause stalling due to previous writes to the EFLAG registers.

dec     eax

Use this instead:

sub     eax,1

The various forms of indirect memory reference, such as the following, are still allowed:

dec     [eax]
dec     tbl[ebx+ecx*2]

This applies to all the general-purpose registers.

No 64-bit

Thread-Safe Decrement

#ifdef _M_IX86
#define  QUICK_DECREMENT(v) __asm dec v;
#else
#define  QUICK_DECREMENT(v) 
    EnterMutex()   
    v--;            
    ExitMutex()
#endif

Some programmers will have code such as follows:

     inc al
     jnz $L1        ; Jump if not saturated!
     dec al
$L1:

But that introduces some branching and misprediction. (Branch misprediction will be discussed in a later chapter.) A branchless code methodology can be alternatively utilized where the state of the Carry flag is taken to advantage. Remember, the INC does not affect the Carry flag but the ADD instruction does.

add al,1
sbb al,0

Instead of using an increment, a non-carry addition is used. A normal advancement of one to a value ranging from [0, 255) has a carry of zero, but upon 255+1, a carry is generated. Immediately subtracting a zero with no carry has no effect. But subtracting a zero with a carry takes that one increment back off. Normally d = a + 1 – 0c, then upon rollover d = a + 1 – 1c. In essence, a saturation is created. This same logic can be applied for 216, 232, or 264, whatever your data size is. The code is smaller and much faster.

Contrarily, a down count to the floor value of zero can be implemented utilizing this same rollback method:

sub al,1
adc al,0

The number is decremented until the floor is reached, thus initiating a borrow, and then the borrow is summed back in. Normally, d = a – 1 + 0c, then upon rollover, d = a – 1 + 1c.

Packed Addition and Subtraction

At this point the focus turns to the integer addition and subtraction of numbers in parallel. With the general-purpose instructions of a processor, normal calculations of addition and subtraction take place one at a time as scalar operations. These can instead be pipelined so that multiple integer calculations can occur simultaneously, but when performing large numbers of similar type calculations there is a bottleneck of processor calculation time. By using vector calculations, multiple like calculations can be performed simultaneously. The only trick here is to remember the key phrase "multiple like calculations."

If, for example, four pairs of 32-bit words are being calculated simultaneously such as in the following addition:

Packed Addition and Subtraction

...or subtraction:

Packed Addition and Subtraction

...the point is that the calculations all need to use the same operator. There is an exception, but this is too early to discuss it. There are workarounds, such as if only a couple of expressions need a calculated adjustment while others do not, then adding or subtracting a zero would keep their result neutral.

Packed Addition and Subtraction

Remember your algebraic law:

Packed Addition and Subtraction

It is in essence a wasted calculation, but its use as a placeholder helps make SIMD instructions easier to use.

Packed Addition and Subtraction

The other little item to remember is that subtraction is merely the addition of a value's additive inverse:

a–b = a+(–b)

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDx destination, source

PADDB/PADDW/PADDD/PADDQ Integer Addition

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PADDx

 

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDQ

      

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

PADDB/PADDW/PADDD/PADDQ Integer Addition

MMX

padd(b/w/d/q) mmxDst, mmxSrc/m64

[Un]signed

64

SSE2

padd(b/w/d/q) xmmDst, xmmSrc/m128

[Un]signed

128

The PADDx instruction is a parallel operation that uses an adder on each of the source bit blocks aSrc (xmmSrc) and bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).

The instructions may be labeled as packed, parallel, or vector, but each block of bits is in reality isolated from each other. The following is a 32-bit example consisting of four unsigned 8-bit values:

PADDB/PADDW/PADDD/PADDQ Integer Addition

...and four signed 8-bit values:

PADDB/PADDW/PADDD/PADDQ Integer Addition

Regardless of the decimal representation of unsigned or signed, the hex values of the two examples remains the same, which is the reason for these being [Un]signed, thus sign neutral.

Notice in the following additions of 7-bit signed values that with the limit range of –64...63 the worst case of negative and positive limit values results with no overflow.

PADDB/PADDW/PADDD/PADDQ Integer Addition

Of course, positive and negative signed values could also be intermixed without an overflow. For a 7-bit unsigned value, 0...127, there would be no overflow.

PADDB/PADDW/PADDD/PADDQ Integer Addition

The eighth unused bit is in reality used as a buffer to prevent any overflow to a ninth bit.

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PADDSx

 

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

PADDUSx

 

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

MMX

paddus(b/w) mmxDst, mmxSrc/m64

Unsigned

64

"

padds(b/w) mmxDst, mmxSrc/m64

Signed

 

SSE2

paddus(b/w) xmmDst, xmmSrc/m128

Unsigned

128

"

padds(b/w) xmmDst, xmmSrc/m128

Signed

 

The PADDSx (signed) and PADDUSx (unsigned) instructions are a parallel operation that uses an adder on each of the source bit block registers aSrc (xmmSrc) and bSrc (xmmDst) and stores the result in the destination Dst (xmmDst) using saturation logic to prevent any possible wraparound.

Each calculation limits the value to the extents of the related data type so that if the limit is exceeded, it is clipped inclusively to that limit. This is handled differently whether it is signed or unsigned, as they both use different limit values. Effectively, the result of the summation is compared to the upper limit with a Min expression and compared to the lower limit with a Max expression. Notice in the previous section that when two signed 8-bit values of 0x7F (127) are summed, a value of 0xFE (254) results but is clipped to the maximum value of 0x7f (127). The same applies if, for example, two values of 0x80 (–128) are summed, resulting in –256 but clipped to the minimum value of 0x80 (–128).

The instructions may be labeled as packed, parallel, or vector but each block of bits is in reality isolated from each other.

Vector_{8/16/32/64}-Bit_Int_Addition_with_Saturation

A sample use of this instruction would be for sound mixing where two sound waves are mixed into a single wave for output. The saturation point keeps the amplitude of the wave from wrapping from a positive or high level into a negative or low one, thus creating a pulse encoded harmonic, or distortion.

For saturation, the limits are different for the data size as well as for signed and unsigned.

 

8-bit

16-bit

signed

-128...127

-32768...32767

unsigned

0...255

0...65535

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBx destination, source D –= A

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PSUBx

 

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBQ

      

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

MMX

psub(b/w/d/q) mmxDst, mmxSrc/m64

[Un]signed

64

SSE2

psub(b/w/d/q) xmmDst, xmmSrc/m128

[Un]signed

128

This vector instruction is a parallel operation that subtracts each of the source bit blocks aSrc (xmmSrc) from bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).

Note

Be careful here! The register and operator ordering is as follows:

PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction
PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

PSUBSx destination, source D –= A

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation
Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PSUBSx

 

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

PSUBUSx

 

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

MMX

psubus(b/w) mmxDst, mmxSrc/m64

Unsigned

64

 

psubs(b/w) mmxDst, mmxSrc/m64

Signed

 

SSE2

psubus(b/w) xmmDst, xmmSrc/m128

Unsigned

128

 

psubs(b/w) xmmDst, xmmSrc/m128

Signed

 

This vector instruction is a parallel operation that subtracts each of the source bit blocks aSrc (xmmSrc) from bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).

Vector {8/16/32/64}-Bit Integer Subtraction with Saturation

Vector Addition and Subtraction (Fixed Point)

For most of the number crunching in your games or tools you will most likely use single-precision floating-point. For artificial intelligence (AI) and other high-precision calculations, you may wish to use the higher precision double-precision, but it only exists in scalar form on the FPU, except for the case of the SSE2 or above, so functionality must be emulated in a sequential fashion whenever possible. But even with the higher precision, there is still a bit of an accuracy problem.

An alternative would be to use integer calculations in a fixed-point format of zero or more places. If the data size is large enough to contain the number, then there is no precision loss!

Pseudo Vec

These can get pretty verbose, as for fixed-point (integer) addition there would be support for 8-, 16-, and 32-bit data elements within a 128-bit vector and these would be signed and unsigned, with and without saturation. The interesting thing about adding signed and unsigned numbers, other than the carry or borrow, is that the resulting value will be exactly the same and thus the same equation can be used. This can be viewed in the following 8-bit example:

Pseudo Vec

Notice that the resulting bits from the 8-bit calculation are all the same. Only the carry is different and the resulting bits are only interpreted as being signed or unsigned.

Pseudo Vec (x86)

Now let's examine these functions closer. MMX and SSE2 have the biggest payoff, as 3DNow! and SSE are primarily for floating-point support.

  mov     ebx,pbB         ; Vector B
  mov     eax,pbA         ; Vector A
  mov     edx,pbD         ; Vector Destination

The following is a 16×8-bit addition but substituting a PSUBB for the PADDB will transform it into a subtraction.

vmp_paddB (MMX) 16×8-Bit

Example 7-1. ...chap07pasPAddX86M.asm

movq    mm0,[ebx+0]   ; Read B Data {B7...B0}
movq    mm1,[ebx+8]   ;              {BF...B8}
movq    mm2,[eax+0]   ; Read A Data {A7...A0}
movq    mm3,[eax+8]   ;              {AF...A8}
paddb   mm0,mm2       ; lower 64 bits {A7+B7 ... A0+B0}
paddb   mm1,mm3       ; upper 64 bits {AF+BF ... A8+B8}
movq    [edx+0],mm0
movq    [edx+8],mm1

For SSE, it is essentially the same function wrapper, keeping in mind aligned memory MOVDQA versus non-aligned memory MOVDQU.

vmp_paddB (SSE2) 16×8-Bit

Example 7-2. ...chap07pasPAddX86M.asm

movdqa xmm0,[ebx]     ; Read B Data {BF...B0 }
movdqa xmm1,[eax]     ; Read A Data {AF...A0 }
paddb  xmm0,xmm1      ; {vA+vB} 128 bits {AF+BF   ... A0+B0}
movdqa [edx],xmm0     ; Write D Data

The following is a master substitution table for change of functionality, addition versus subtraction (inclusive/exclusive of saturation).

 

Add

Sub

Add

Sub

Add

Sub

8-bit

paddb

psubb

paddsb

psubsb

paddusb

psubusb

16-bit

paddw

psubw

paddsw

psubsw

paddusw

psubusw

32-bit

paddd

psubd

    

64-bit

paddq

psubq

    

Averages

VD[] = (vA[] + vB[] + 1) ÷ 2;

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PAVGB

    

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGUSB

   

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

  

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

  

3DNow!

pavgusbmmxDst, mmxSrc/m64

Unsigned

64

MMX+

pavgb

mmxDst, mmxSrc/m64

Unsigned

64

SSE

pavgb

mmxDst, mmxSrc/m64

Unsigned

64

SSE2

pavgb

xmmDst, xmmSrc/m128

Unsigned

128

This SIMD instruction is a 64 (128)-bit parallel operation that sums the eight (16) individual 8-bit source integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), adds one, then divides by two and returns the lower 8 bits with the result being stored in the destination Dst (xmmDst).

PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average

Tip

These two instructions are remnants from the processor wars. They have two different mnemonics with two different opcodes, but they have the same functionality. PAVGB was added to the AMD instruction set, which matched the SSE instruction PAVGB but had the same functionality as AMD's PAVGUSB. If the target is 3DNow! and MMX extensions or better, use the instruction PAVGB.

PAVGW — N×16-Bit [Un]signed Integer Average

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PAVGW

    

PAVGW — N×16-Bit [Un]signed Integer Average

PAVGW — N×16-Bit [Un]signed Integer Average

PAVGW — N×16-Bit [Un]signed Integer Average

PAVGW — N×16-Bit [Un]signed Integer Average

PAVGW — N×16-Bit [Un]signed Integer Average

PAVGW — N×16-Bit [Un]signed Integer Average

MMX+

pavgw mmxDst, mmxSrc/m64

Unsigned

64

SSE

pavgw mmxDst, mmxSrc/m64

Unsigned

64

SSE2

pavgw xmmDst, xmmSrc/m128

Unsigned

128

This SIMD instruction is a 64 (128)-bit parallel operation that sums the four (eight) individual 16-bit source integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), adds one, then divides by two and returns the lower 16 bits with the result being stored in the destination Dst (xmmDst).

PAVGW — N×16-Bit [Un]signed Integer Average

Sum of Absolute Differences

The simplified form of this parallel instruction individually calculates the differences of each of the packed bits, then sums the absolute value for all of them, and returns the result in the destination.

Equation 7.1. 

Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PSADBW

    

PSADBW — N×8-Bit Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

PSADBW — N×8-Bit Sum of Absolute Differences

MMX+

psadbw mmxDst, mmxSrc/m64

Unsigned

64

SSE

psadbw mmxDst, mmxSrc/m64

Unsigned

64

SSE2

psadbw xmmDst, xmmSrc/m128

Unsigned

128

8×8-Bit Sum of Absolute Differences

Dst(15...0) = abs(Src(7...0) - Dst(7...0)) + abs(Src(15...8) -
          Dst(15...8))
     + abs(Src(23...16) - Dst(23...16)) + abs(Src(31...24) - Dst(31...24))
     + abs(Src(39...32) - Dst(39...32)) + abs(Src(47...40) - Dst(47...40))
     + abs(Src(55...48) - Dst(55...48)) + abs(Src(63...56) - Dst(63...56))
Dst(63...16) =  0;

16×8-Bit Sum of Absolute Differences

Dst(31...16) = abs(Src(71...64) - Dst(71...64)) + abs(Src(79...72) -
          Dst(79...72))
     + abs(Src(87...80) - Dst(87...80)) + abs(Src(95...88) - Dst(95...88))
     + abs(Src(103...96) - Dst(103...96)) + abs(Src(111...104) -
          Dst(111...104))
     + abs(Src(119...112) - Dst(119...112)) + abs(Src(127...120) -
          Dst(127...120))
Dst(127...32) = 0

Integer Multiplication

Integer data is expanded upon the calculation of a product between two operands, and thus two operands are needed to store the result. Depending on whether the source values are small enough, the operand receiving the upper bits of the result may be ignored as it always contains a predictable amount.

MUL — Unsigned Muliplication (Scalar)

MUL operand

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

MUL

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

MUL — Unsigned Muliplication (Scalar)

mul

rmDst(8/16/32/64)

(Un)signed

The MUL general-purpose instruction multiplies the unsigned 8-, 16-, 32-, or 64-bit operand by the unsigned AL/AX/EAX/RAX register, saves the result based upon the bit size of the operand as indicated in the table below, and sets the flags accordingly.

 

8-bit

16-bit

32-bit

64-bit

operation

AL×BL

AX×BX

EAX×EBX

RAX×RBX

result

AX

DX:AX

EDX:EAX

RDX:RAX

Flags: If the upper half of the resulting bits are 0, then the Overflow and Carry bits are set to 0; else they are set to 1. The other flags are undefined.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

-

-

-

-

×

A square operation (x2) is actually pretty simple!

        mov     rax,7
        mul     rax             ; rdx:rax = rax x rax

IMUL — Signed Multiplication (Scalar)

IMUL

operand

IMUL

operand1, operand2

IMUL

operand1, operand2, operand3

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

IMUL

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

IMUL — Signed Multiplication (Scalar)

imul rmDst(8/16/32/64)

Signed

imul rDst(16/32/64), rmSrc(16/32/64)

imul rDst(8/16/32/64), #(8/16/32)

imul rDst(16/32/64), rmSrc(16/32/64), #(8/16/32)

The IMUL general-purpose instruction has three forms based upon the number of operands.

  • One Operand — Similar to that of the MUL instruction but deals with signed numbers. That is, it multiplies the 8-, 16-, 32-, or 64-bit signed operand by the unsigned AL/AX/EAX/RAX register, saves the result based upon the bit size of the operand as indicated in the table below, and sets the flags accordingly.

  • Two Operands — Used for multiplication of 16-, 32-, or 64-bit signed numbers in operand1 with the value in operand2 and stores the result in operand1. Unlike MUL, AX/EAX/RAX is not used unless it is one of the two operands. If operand2 is an immediate value, then it is sign extended to match operand1.

  • Three Operands — Used for multiplication of 16-, 32-, or 64-bit signed numbers in operand2 with the sign extended immediate value in operand3. The result is stored in operand1. Unlike MUL, AX/EAX/RAX is not used unless it is one of the three operands.

 

8-bit

16-bit

32-bit

64-bit

result

AX

DX:AX

EDX:EAX

RDX:RAX

Flags: If the upper half of the resulting bits are 0 then the Overflow and Carry bits are set to 0, else they are set to 1. The other flags are undefined.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

×

-

-

-

-

×

Packed Integer Multiplication

Packed integer multiplication is one of the mathematical equations that you will tend to use in your SIMD application either as fixed-point or just parallel integer processing. This works out nicely when it is necessary to increase the magnitude of a series of integers. The problem here comes up because fixed-point multiplication is not like floating-point multiplication. In floating-point, there is a precision loss with each calculation since a numerical value is stored in an exponential form. With fixed-point, there is no precision loss, which is great but leads to another problem. When two integers are used in a summation, the most significant bits are carried into an additional (n+1) bit. With a multiplication of two integers, the resulting storage required is (n+n=2n) bits. This poses a problem of how to deal with the resulting solution. Since the data size increases, there are multiple solutions to contain the result of the calculation.

HI( int16×int16 )

HI( uint16×uint16 )

LO( int16×int16 )

pmulhw

pmulhuw

pmullw

  • Store upper bits.

  • Store lower bits.

  • Store upper/lower bits into two vectors.

  • Store even n bit elements into 2n bit elements.

  • Store odd n bit elements into 2n bit elements.

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW destination, source

PMULLW — N×16-Bit Parallel Multiplication (Lower)

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PMULLW

 

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULLW — N×16-Bit Parallel Multiplication (Lower)

MMX

pmullw mmxDst, mmxSrc/m64

[Un]signed

64

MMX+

pmullw mmxDst, mmxSrc/m64

[Un]signed

64

SSE2

pmullw xmmDst, xmmSrc/m128

[Un]signed

128

These vector instructions use a 64 (128)-bit data path and so four (eight) operations occur in parallel. The product is calculated using each of the 16-bit half-words of the multiplicand mmxSrc (xmmSrc) and the 16-bit half-words of the multiplier mmxDst (xmmDst) for each 16-bit block, and stores the lower 16 bits of each of the results in the original 16-bit half-words of the destination mmxDst (xmmDst).

PMULLW — N×16-Bit Parallel Multiplication (Lower)
PMULLW — N×16-Bit Parallel Multiplication (Lower)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW destination, source

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PMULHUW

    

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW

 

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

MMX

pmulhuw

mmxDst, mmxSrc/m64

Unsigned

64

 

pmulhw

mmxDst, mmxSrc/m64

Signed

 

MMX+

pmulhuw

mmxDst, mmxSrc/m64

Unsigned

64

SSE2

pmulhuw

xmmDst, xmmSrc/m128

Unsigned

128

 

pmulhw

xmmDst, xmmSrc/m128

Signed

 

These vector instructions use a 64 (128)-bit data path and so four (eight) operations occur in parallel. The product is calculated using each of the 16-bit half-word of the multiplicand mmxSrc (xmmSrc) and the 16-bit half-word of the multiplier mmxDst (xmmDst) for each 16-bit block, and stores the upper 16 bits of each of the results in the original 16-bit half-words of the destination mmxDst (xmmDst).

PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)
PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)
PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)

PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

PMULHRW destination, source

PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PMULHRW

   

PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

  

PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

  

3DNow!

pmulhrw mmxDst, mmxSrc/m64

Signed

64

This vector instruction uses a 64-bit data path and so four operations occur in parallel. The product is calculated using each of the unsigned 16-bit half-words of the multiplicand mmxDst and the 16-bit half-word of the multiplier mmxSrc for each 16-bit block, sums 00008000 hex to the 32-bit product, and stores the resulting upper 16 bits in the destination mmxDst (xmmDst).

Dst(15...0) = UPPER16((Dst(15...0)   x Src(15...0)) + 0x8000)
Dst(31...16) = UPPER16((Dst(31...16) x Src(31...16)) + 0x8000)
Dst(47...32) = UPPER16((Dst(47...32) x Src(47...32)) + 0x8000)
Dst(63...48) = UPPER16((Dst(63...48) x Src(63...48)) + 0x8000)
PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper)

Pseudo Vec (x86)

An interesting thing about the SSE instruction set is that it is primarily designed for floating-point operations. The MMX instruction set handles most of the packed integer processing except the case of unsigned 16-bit multiplication. This was not resolved until a later release of extensions for the MMX by AMD and the SSE by Intel, but this is only 64 bits. Intel came back with the SSE2 instruction set with a complete set of packed instructions for supporting 128 bits.

mov     ebx,phB         ; Vector B
mov     eax,phA         ; Vector A
mov     edx,phD         ; Vector Destination

(MMX Mul Low) 4×16-Bit

The instruction PMULLW is the [un]signed multiplication storing the lower 16 bits; PMULHW is the signed multiplication storing the upper 16 bits; and PMULHUW is the unsigned multiplication storing the upper 16 bits. The following function is designed to return only 16 bits of the resulting product, and selection of the appropriate instruction will return that value. Using the following code shell but with the PMULLW instruction calculates the lower bits, while PMULHW calculates the upper bits. PMULHW calculates the upper signed bits and PMULHUW the upper unsigned bits. The following code is designed for use with the 64-bit MMX register and is used in pairs for 128-bit support.

Example 7-3. ...chap7pmdPMulX86M.asm

movq     mm0,[ebx+0] ; Read B Data {3...0}
movq     mm1,[ebx+8] ;             {7...4}
movq     mm2,[eax+0] ; Read A Data {3...0}
movq     mm3,[eax+8] ;             {7...4}

pmullw   mm0,mm2
pmullw   mm1,mm3
movq     [edx+0],mm0 ; Write D ??? 16 bits  {3...0}
movq     [edx+8],mm1 ;   (upper or lower)   {7...4}

To return 16 high signed integer bits, use PMULH instead of PMULLW.

pmulhw   mm0,mm2 ; SIGNED HIGH   {3...0}
pmulhw   mm1,mm3 ;               {7...4}

To return 16 unsigned high integer bits, use PMULHUW instead of PMULLW.

pmulhuw  mm0,mm2 ; UNSIGNED HIGH {3...0}
pmulhuw  mm1,mm3 ;               {7...4}

(SSE2 Mul Low) 8×16-Bit

With the release of the SSE2 instruction set, Intel handles 128 bits simultaneously using the XMM registers. These same instructions can be used with 128-bit data.

Example 7-4. ...chap7pmdPMulX86M.asm

movdqa  xmm0,[ebx]      ; Read B Data   {7...0}
movdqa  xmm1,[eax]      ; Read A Data   {7...0}
pmullw  xmm0,xmm1
movdqa  [edx],xmm0      ; Write D lower 16 bits {7...0}

PMULUDQ — Unsigned N×32-Bit Multiply Even

PMULUDQ destination, source

PMULUDQ — Unsigned N×32-Bit Multiply Even

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PMULUDQ

      

PMULUDQ — Unsigned N×32-Bit Multiply Even

PMULUDQ — Unsigned N×32-Bit Multiply Even

PMULUDQ — Unsigned N×32-Bit Multiply Even

PMULUDQ — Unsigned N×32-Bit Multiply Even

SSE2

pmuludq mmxDst, mmxSrc/m64

Unsigned

64

 

pmuludq xmmDst, xmmSrc/m128

Unsigned

128

This vector instruction calculates the product of the (even) lower 32 bits of each set of 64 (128) bits of the multiplicand aSrc (xmmSrc) and the lower 32 bits of the multiplier bSrc (xmmDst) for each 64 (128)-bit block, and stores each full 64 (128)-bit integer result in the destination Dst (xmmDst).

PMULUDQ — Unsigned N×32-Bit Multiply Even

In the following 64-bit table, the upper 32 bits {63...32}of mmxSrc and mmxDst are ignored. Note that this is not a SIMD reference but included for reference.

PMULUDQ — Unsigned N×32-Bit Multiply Even

In the following 128-bit table, the upper odd pairs of 32 bits {127...96, 63...32} of xmmDst and xmmSrc are ignored.

PMULUDQ — Unsigned N×32-Bit Multiply Even

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD destination, source

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

PMADDWD

 

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD

MMX

pmaddwd mmxDst, mmxSrc/m64

Signed

64

SSE2

pmaddwd xmmDst, xmmSrc/m128

Signed

128

This vector instruction calculates the 32-bit signed products of two pairs of two 16-bit multiplicands aSrc(xmmSrc) and the multiplier bSrc(xmmDst) for each bit block. The first and second 32-bit products are summed and stored in the lower 32 bits of the Dst(xmmDst). The third and fourth 32-bit products are summed and stored in the next 32 bits of the destination Dst(xmmDst). The same is repeated for the upper 64 bits for the 128-bit data model.

No 64-bit
No 64-bit
No 64-bit

Integer Division

Integer Division

DIV — Unsigned Division

DIV operand

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

DIV

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

DIV — Unsigned Division

div

rmDst(8/16/32/64)

(Un)signed

The DIV general-purpose instruction divides the unsigned 16-, 32-, 64-, or 128-bit AX, DX:AX, EDX:EAX, RDX:RAX pair into the 8-, 16-, 32-, or 64-bit operand and saves the result based upon the bit size of the operand as indicated in the table below and sets the flags accordingly.

 

8-bit

16-bit

32-bit

64-bit

Dividend

AX

DX:AX

EDX:EAX

RDX:RAX

Divisor

div bl

div bx

div ebx

div rbx

Quotient

AL

AX

EAX

RAX

Remainder

AH

DX

EDX

RDX

Flags: All the flags are undefined.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

-

-

-

-

-

-

div ebx

IDIV — Signed Division

IDIV operand

Mnemonic

P

PII

K6

3D!

3Mx+

SSE

SSE2

A64

SSE3

E64T

IDIV

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

IDIV — Signed Division

idiv

rmDst(8/16/32/64)

Signed

The IDIV general-purpose instruction divides the signed 16-, 32-, 64-, or 128-bit AX, DX:AX, EDX:EAX, RDX:RAX pair into the 8-, 16-, 32-, or 64-bit operand and saves the result based upon the bit size of the operand as indicated in the table below and sets the flags accordingly.

 

8-bit

16-bit

32-bit

64-bit

Dividend

AX

DX:AX

EDX:EAX

RDX:RAX

Divisor

idiv bl

idiv bx

idiv ebx

idiv rbx

Quotient

AL

AX

EAX

RAX

Remainder

AH

DX

EDX

RDX

Flags: All the flags are undefined.

Flags

O.flow

Sign

Zero

Aux

Parity

Carry

 

-

-

-

-

-

-

The sign of the remainder always matches the sign of the divisor. An overflow is indicated by a Divide Error Exception instead of the Overflow flag.

Exercises

  1. Write a code snippet to saturate an unsigned 8-bit incrementing count at a value of 99.

  2. The result of the summation of two integers increases the size of the source by one bit (the carry). How many bits are needed from the product of two 32-bit values?

  3. What are different output methods to store the various results of a product of two integers?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset