Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 5. Bit Wrangling

Think of this chapter as a continuation of the previous chapter, "Bit Mangling."

This chapter deals with bit wrangling and thus the movement of bits. They can be shifted into oblivion, rotated, and watched going around and around. As Bugs Bunny would say, "Dizzing, ain't it!"

Bit wrangling actually occurs with the rotation and logical and arithmetic shifting of bits within each parallel bit range of packed bits. Just as in the scalar point of view of a similar general-purpose CPU instruction, the bits can be used for masking, base two multiplication and division, and other functionality.

It must be reiterated that you watch the alignment of your data objects in memory very closely. It takes extra overhead to adjust the memory into an aligned state, and it is a lot more efficient to ensure that they are aligned in the first place. Your code will be smaller and faster! This will be made obvious by the sample code included in this chapter.

Workbench Files:Benchx86chap05projectplatform

	project	platform
Shift/Rotations	prot	vc6
		vc.net

Logical Left Shifting

It is recommended that you use an addition instead of a shift on newer processors for optimization.

      A + A  = 2A = A << 1
     2A + 2A = 4A = A << 2

SHL/SAL - Shift (Logical/Arithmetic) Left

SHL destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SAL
SHL

shl	rmDst(8/16/32/64), 1	[Un]signed
shl	rmDst(8/16/32/64), #(2ⁿ-1)
shl	rmDst(8/16/32/64), cl

This instruction is a multiplier (2ⁿ) by shifting the destination to the left and shifting the most significant bit (MSB) into the carry. The previous content of the carry is shifted out and lost, and a 0 is shifted into the least significant bit (LSB). The count indicating the number of bits to shift is invalid if greater than or equal to width of data (2ⁿ). If shifted enough times, the data will be emptied, being set to all zeroes. If count = 0, then there is no shift and thus the destination and flags are not changed. Shift Logical Left and Shift Arithmetic Left are the same identical instruction.

Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow = 0 if Carry = MSB.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

datd dd      32
datw dw      16
datb db      8

     mov     dx,1010010110100101b  ;0a5a5h  carry=0
     mov     cl,2        ; ×4 = 2²

     shl     datd,4      ; datd = 512 = 32 × 16 = 32 × (2⁴)
     shl     dx,1        ; dx = dx × 2
     ; dx = 0100101101001010b  04b4ah  carry=1
     shl     eax,cl      ; eax = eax × 4

With Pentium processors a multiply is pretty quick, but for a 486 processor or older it was sometimes necessary and faster to do shifting and adding to compensate. For example, a video display with a resolution of 640×480x8-bit would need to calculate an actual screen (memory pointer) related to an X and Y coordinate.

    mov     ecx,X
    mov     edx,Y
    mov     edi,edx
    shl     edi,2           ; 4x
    add     edi,edx         ; 5x = (4x+x)
    shl     edi,7           ; 640x = (5x)×(128)
    add     edi,ecx         ; edi = memory index
    mov     es:[edi],al     ; Write a byte pixel

SHLD — Shift (Logical) Left (Double)

SHLD destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SHLD

shld	rmDst(16/32/64), rSrc(16/32/64), #(2ⁿ-1)	[Un]signed
shld	rmDst(16/32/64), rSrc(16/32/64), cl

This instruction shifts the upper half of an unsigned 64-bit integer by 2ⁿ, which shifts the source to the left into the destination to the left and shifts the most significant bit (MSB) into the carry. The previous content of the carry is shifted out and lost. (It should be noted that in reality the source is not altered, only the destination.) The count indicating the number of bits to shift is invalid if it is greater than or equal to width of data (2ⁿ). If you exceed the operand size, the result in destination is undefined. If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow = 1 if a sign change occurred; else it is 0.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

datd    dd      32
datw    dw      16
datb    db      8

        mov     ax,0001001000110100b 01234h
        mov     dx,1010010110100101b 0a5a5h
        mov     cl,4

        shld    dx,ax,4
        shl     ax,4
        ; dx = 0101101001010001b 05a51h carry=0

PSLLx — Parallel Shift Left (Logical)

Figure 5-1. Miscellaneous examples of data types being shifted to the left by one bit

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
PSLLx
PSLLDQ

MMX	psll(w/d/q)	mmxDst, count(#/mmx/m64)	[Un]signed	64
SSE2	psll(w/d/q)	xmmDst, count(#/xmm/m128)	[Un]signed	128
"	pslldq	xmmDst, count(#)	[Un]signed	128

These multimedia and SIMD extension instructions are parallel operations that logically left shift each of the data bit blocks, by a count of bits. Depending upon the processor and the instruction, block sizes of 16, 32, 64, or 128 bits can be shifted by the specified count.

This is a multiplier (2ⁿ) by shifting a 0 into the LSB of each packed value of the source aSrc (xmmSrc) and causing all bits to be shifted to the left and the MSB of each packed value being lost. The result is stored in the destination Dst (xmmDst). There is no Carry flag to save the bit. If the count indicating the number of bits to shift is more than packed bits - 1 — 15 for (Words), 31 for (Double words), 63 for (Quad words), or 127 for (Double Quad words) — then the destination will be typically set to a value of zero. This is most effective when used in an integer math function where multiple numbers need to be multiplied by 2ⁿ concurrently.

Pseudo Vec

Although logical left, right, or arithmetic shifting is supported by almost all processors, parallel shifting is not. The same parallel effect can be simulated in those particular cases of non-SIMD supporting processors. The following C code demonstrates it by concurrently shifting four packed 8-bit values to the left. All 32 bits are shifted to the left in unison but as the remaining bits become corrupted by the adjacent byte, a logical AND using a mask in the table lookup forces those vacated bits to zero.

 #define eMASKFE 0xfefefefe       // Mask FE

 uint32 val;

 val = (val << 1) & eMASKFE;

As you saw in the previous example, an absolute shift value is the most simplistic way to go, as it is merely a matter of shifting the n number of bits and masking with the appropriate value.

   11111111  11111111
   11111110  01111111
   11111100  00111111
   11111000  00011111
   11110000  00001111
   11100000  00000111
   11000000  00000011
   10000000  00000001

Table 5-1. An 8-bit mask for stripping bits in conjunction with shifting data (0...7) bits to the left or to the right

8-bit Shift	0	1	2	3	4	5	6	7
Left	FF	FE	FC	F8	F0	E0	C0	80
Right	FF	7F	3F	1F	0F	07	03	01

If the shift factor is unknown, then the algorithm required, such as the following, becomes more complicated as a table lookup is needed for the masking value.

uint32 llMaskBD[] = {  //  Left Shift 32-bit 4×8-bit mask
       0xffffffff, 0xfefefefe, 0xfcfcfcfc, 0xf8f8f8f8,
       0xf0f0f0f0, 0xe0e0e0e0, 0xc0c0c0c0, 0x80808080 };

uint32 rrMaskBD[] = {  //  Right Shift 32-bit 4×8-bit mask
       0xffffffff, 0x7f7f7f7f, 0x3f3f3f3f, 0x1f1f1f1f,
       0x0f0f0f0f, 0x07070707, 0x03030303, 0x01010101 };

Since packed 8-bit logical shift is not supported, let us first examine a block of four 8-bit bytes being shifted simultaneously. Similar to the processors, the shift count needs to be truncated to a value of n-1, thus 0...7 in this particular case. Different processors have different results (especially with the various C compilers due to being a feature of the C language). For example, on some compilers a shift of an 8-bit byte by a value of nine would result in a zero, on some a one, and some, if directly transposing to assembly instructions, would tend to ignore the extraneous bits. For the 8-bit value only the three least significant bits would be used, for a shift value of {0...7}. A shift by nine would be effectively a shift by one since a value of nine in Boolean (1001b) would have its upper bits ignored (1001b) and so only the lower three bits (001b) would be used as a shift count; thus it becomes a value of one.

This also helps to prevent an out of bounds memory reference in regard to the mask lookup. The value is then shifted by the adjusted count, and then logical AND'd with the mask. This effectively processes all four values simultaneously as mentioned!

Packed Shift Left Logical 16×8-bit by n:{0...7}

Example 5-1. ...chap05protPRot.cpp

    void vmp_psllB(uint8 * pbD, uint8 * pbA, uint count)
    {
      uint32 msk, *pD, *pA;

      pD=(uint32*) pvD;
      pA=(uint32*) pvA;

      msk = llMaskBD[count];
      count &= (8-1);        // Clip count to data bit size -1

      *(pD+0) = (*(pA+0) & msk) << count;
      *(pD+1) = (*(pA+1) & msk) << count;
      *(pD+2) = (*(pA+2) & msk) << count;
      *(pD+3) = (*(pA+3) & msk) << count;
    }

Pseudo Vec (x86)

The following is an example needed for the x86 processor because it supports packed 16, 32, and 64 bits, but not a packed 8-bit logical shift. This uses a similar table of masks for the zero bit fill as viewed previously.

  llMaskBQ dq 0ffffffffffffffffh, 0fefefefefefefefeh,
           dq 0fcfcfcfcfcfcfcfch, 0f8f8f8f8f8f8f8f8h,
           dq 0f0f0f0f0f0f0f0f0h, 0e0e0e0e0e0e0e0e0h,
           dq 0c0c0c0c0c0c0c0c0h, 08080808080808080h,

As an MMX register is only 64 bits, the 128-bit value is handled using register pairs. This is actually faster and easier than these 128-bit C pseudocode samples.

vmp_psllB (MMX) 16×8-bit Vector

Example 5-2. ...chap05protpsllX86M.asm

 mov     ecx,dword ptr count     ; # of bits to shift
 mov     edx,pbD
 mov     eax,pbA
 and     ecx,8-1                 ; Clip count to 0...7 bits

 movq    mm0,[eax+0]             ; Read data
 movq    mm1,[eax+8]
 movd    mm2,ecx                 ; mm0=64 bits mm1=64 bits
                                 ; mm2=# of bits to shift
 psllq   mm0,mm2                 ; val << count
 psllq   mm1,mm2
 pand    mm0,llMaskBQ[ecx*8]     ; Strip lower bits
 pand    mm1,llMaskBQ[ecx*8]
 movq    [edx+0],mm0             ; Write data
 movq    [edx+8],mm1

Of course with this instruction, if only a shift by one is needed, the AND and PSLLQ can be replaced with only a PADDB. The MMX and SSE2 instructions support a byte add, and (A<<1) is equivalent to (A+A)!

So replace this:

   psllq  mm0,1                ; val << 1
   psllq  mm1,1
   pand   mm0,llMaskBQ[1*8]    ; Strip lower bit
   pand   mm1,llMaskBQ[1*8]

...with this:

   paddb mm0,mm0
   paddb mm1,mm1

One last thing to note is the actual operation. There is no difference between a pre-mask vs. a post-mask. Only the mask!

If you compare the left and right trace, you should see that they have the same result.

Logical Right Shifting

SHR — Shift (Logical) Right

SHR destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SHR

shr	rmDst(8/16/32/64), 1	[Un]signed
shr	rmDst(8/16/32/64), #(2ⁿ-1)
shr	rmDst(8/16/32/64), cl

This instruction is an unsigned divide by 2ⁿthat shifts the destination to the right and shifts the least significant bit (LSB) into the carry. The previous content of the carry is shifted out and lost, and a 0 is shifted into the most significant bit (MSB). The count indicating the number of bits to shift is invalid if greater than or equal to the width of data (2ⁿ). If you shift enough times, you will empty the data, setting it to all 0's. If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow is set to 0; else Overflow is set to the starting MSB.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

  datd    dd    32
  datw    dw    16
  datb    db    8

          mov   dx,1010010110100101b ;0a5a5h carry=0
          mov   cl,2        ;  ×4 = 2²

          shr   datd,4      ; datd = 2 = 32 ÷ 16 = 32 ÷ (2⁴)
          shr   dx,1        ; dx = dx × 2
                            ; dx = 0101001011010010b 052d2h carry=1
          shr   eax,cl      ; eax = eax ÷ 4

This is typically used as a divisor for use in masking for table lookups such as:

          mov   eax,nWidth   ; # of bytes to alter
          shr   eax,3        ; divide by (2³)=(n÷8)
                             ; At this point eax is # of 64-bit #'s

SHRD — Shift (Logical) Right (Double)

SHRD destination, source, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SHRD

shrd	rmDst(16/32), rSrc(16/32), #(2ⁿ-1) [Un]signed
shrd	rmDst(16/32), rSrc(16/32), cl

This instruction is the lower half of an unsigned 64-bit integer divide by 2ⁿ that shifts the source to the right into the destination to the right and shifts the least significant bit (LSB) into the carry. The previous content of the carry is shifted out and lost. (It should be noted that in reality the source is not altered, only the destination.) The count indicating the number of bits to shift is invalid if it is greater than or equal to the width of data (2ⁿ). If count = 0, then there is no shift, and the destination and flags are not changed.

Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow = 1 if a sign change occurred; else it is 0.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

   datd dd      32
   datw dw      16
   datb db      8

        mov     ax,0001001000110100b 01234h
        mov     dx,1010010110100101b 0a5a5h
        mov     cl,4

        shrd    dx,ax,4    ; lower bits
        shr     ax,4       ; upper bits
                           ; dx = 0100101001011010b 04a5ah carry=0

PSRLx — Parallel Shift Right (Logical)

Figure 5-2. Miscellaneous examples of data types being logical shifted to the right by one bit

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
PSRLx
PSRLDQ

MMX	psrl(w/d/q) mmxDst, count(#/mmx/m64)	[Un]signed	64
SSE2	psrl(w/d/q) xmmDst, count(#/xmm/m128)	[Un]signed	128
"	psrldq xmmDst, count(#)	[Un]signed	128

These multimedia and SIMD extension instructions are parallel operations that logically right shift each of the data bit blocks by a count of bits. Depending on the processor and the instruction, block sizes of 16, 32, 64, or 128 bits can be shifted by the specified count.

This is a divisor (2ⁿ) of unsigned values by shifting a 0 into the MSB of each packed value of the source aSrc (xmmSrc), which cause all bits to be shifted to the right and the LSB of each packed value being lost; the result is stored in the destination Dst (xmmDst). There is no Carry flag to save the bit. If the count indicating the number of bits to shift is more than packed bits - 1 — 15 for (Words), 31 for (Double words), 63 for (Quad words), or 127 for (Double Quad words) — then the destination will be typically set to a value of zero. This is most effective when used in an integer math function where multiple unsigned numbers need to be divided by 2ⁿ concurrently.

Pseudo Vec

The C code simulating the functionality is almost identical to the Parallel Shift (Logical) Left instruction, previously discussed in this chapter, except in this case a different mask and a shift to the right are used instead. This should look very similar to you as previously reviewed in SLL; only the bold areas should be different!

Packed Shift Right Logical 16×8-bit by n:{0...7}

Example 5-3. ...chap05protPRot.cpp

   void vmp_psrlB(uint8 * pbD, uint8 * pbA, uint count)
   {
     uint32 msk, *pD, *pA;

     pD=(uint32*) pvD;
     pA=(uint32*) pvA;

     count &= (8-1);        // Clip count to data bit size-1
     msk = llMaskBD[count];

     *(pD+0) = (*(pA+0) & msk) >>count;

     *(pD+1) = (*(pA+1) & msk) >>count;
     *(pD+2) = (*(pA+2) & msk) >>count;
     *(pD+3) = (*(pA+3) & msk) >>count;
   }

There are similarities between left and right logical shifting for the various processor instructions. The function shell code is very similar as lower versus upper bits are masked. The bits are shifted to the left versus to the right. In assembly code, one merely needs to substitute the correct instruction for the same functionality. In simulated C code, the correct mask needs to be chosen. Those instructions, along with their complement, are reflected in the following table.

Table 5-2. Instruction substitution table to convert a previous SLL (Shift Left Logical) instruction into a SRL (Shift Right Logical), as well as masks and their complement

Instructions		Masks
SLL	SRL	SLL	SRL
psllq	psrlq	rrMaskBQ	llMaskBQ
psllw	psrlw
pslld	psrld	llMaskBO	rrMaskBO

Arithmetic Right Shifting

SAR — Shift (Arithmetic) Right

SAR destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SAR

sar	rmDst(8/16/32/64), 1	[Un]signed
sar	rmDst(8/16/32/64), #(2ⁿ-1)
sar	rmDst(8/16/32/64), cl

This instruction is a signed divide by 2ⁿthat shifts the destination to the right and shifts the least significant bit (LSB) into the carry. The previous content of the carry is shifted out and lost, and a 0 is shifted into the most significant bit (MSB). The count indicating the number of bits to shift is invalid if it is greater than or equal to the width of data (2ⁿ). If you shift enough times you will empty the data, setting it to all 0's. If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow is cleared to 0; else the Overflow is undefined.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

  datd  dd      32
  datw  dw      16
  datb  db      8
        
        mov     dx,1010010110100101b  ;0a5a5h carry=0
        mov     cl,2        ;  x⁴ = 2²
        
        sar     datd,4      ; datd = 2 = 32 ÷ 16 = 32 ÷ (2⁴)
        sar     dx,1        ; dx = dx × 2
                ; dx = 1101001011010010b 0d2d2h carry=1 Z=0 OF=0
        sar     eax,cl      ; eax = eax ÷ 4

This instruction is typically used as a signed divisor. Its use of the "sticky" MSB is great for use in generating masks. As an arithmetic shift retains the MSB while data is being shifted down, this "retained" bit is stuck. Thus, it sticks in position!

By using:

        sar     eax,31

a negative value sets EAX with all bits set to 1 and a positive value sets all 0's. With the use of masking bit logic, some algorithms could run faster due to not needing to branch. Investigate Chapter 11, "Branch-less," for more information.

PSRAx — Packed Shift Right (Arithmetic)

Figure 5-3. Miscellaneous examples of data types being arithmetically shifted to the right by one bit

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
PSRAx

MMX	psra(w/d) mmxDst, count(#/mmx/m64)	Signed	64
SSE2	psra(w/d) xmmDst, count(#/xmm/m128)	Signed	128

These multimedia and SIMD extension instructions are parallel operations that arithmetically right shift each of the data bit blocks by a count of bits. Depending on the processor and the instruction, block sizes of 16 or 32 bits can be shifted by the specified count.

This is a divisor (2ⁿ) by shifting the MSB, a "sticky" bit, of each signed packed value of the source aSrc (xmmSrc), causing all bits to be shifted to the right and the LSB of each packed value being lost; the result is stored in the destination Dst (xmmDst). There is typically no Carry flag to save the bit. If the count indicating the number of bits to shift is more than (packed bits - 1) — 15 for (Words) or 31 for (DWords) — then the destination will be typically set to the same value as the MSB, a value of zero. This is most effective when used in an integer math function where multiple signed numbers need to be divided by 2ⁿconcurrently.

Pseudo Vec

Pseudo vectors do not work well with this instruction due to the nature of the sticky bit. The technique demonstrated in a previous section of this chapter related to left shift discussed emulated vector processing by shifting followed by a logical AND, but it does not work with this instruction. If you really wish to investigate this further, my Vector Game Math Processors book has more information.

Rotate Left (or n-Right)

A multimedia extension instruction for a parallel rotation does not exist at this time. To rotate data is a process of shifting up the lower bits, shifting down the upper bits, masking, and blending by the specified count. It is recommended to only rotate by one instead of an immediate or by another register.

ROL — Rotate Left

ROL destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
ROL

rol	rmDst(8/16/32/64), 1	[Un]signed
rol	rmDst(8/16/32/64), #(2ⁿ-1)
rol	rmDst(8/16/32/64), cl

This instruction shifts the destination to the left and the most significant bit (MSB) is rotated into the carry and the least significant bit (LSB). The previous content of the carry is lost. This is like a wagon wheel with the bits going counterclockwise around and around shooting off bits like sparks. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of data (2ⁿ). If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit rotated out of MSB, Zero = 1 if all destination bits are 0, thus destination = 0. If using a count of 1, then Overflow = the exclusive OR of the two MSBs before the rotate. (This is the same as the Carry XOR MSB after the rotate!) If the rotation is greater than 1, Overflow is undefined.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

   datd      dd      32
   datw      dw      16
   datb      db      8
             
             mov     dx,1010010110100101b 0a5a5h carry=0

             rol     dx,1
                   ; dx = 0100101101001011b 04b4bh carry=1 OF=1

RCL — Rotate Carry Left

RCL destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RCL

rcl	rmDst(8/16/32/64), 1	[Un]signed
rcl	rmDst(8/16/32/64), #(2ⁿ-1)
rcl	rmDst(8/16/32/64), cl

This instruction shifts the destination to the left, rotates the most significant bit (MSB) into the carry, and rotates the previous contents of the carry into the least significant bit (LSB). This is like a wagon wheel with an extra spoke. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of data (2ⁿ). If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0, thus destination = 0. If using a count of 1, then Overflow = the exclusive OR of the two MSBs before the rotate. (This is the same as the Carry XOR MSB after the rotate!) If the rotation is greater than 1, Overflow is undefined.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

  datd     dd      32
  datw     dw      16
  datb     db      8
           
           mov     dx,1010010110100101b 0a5a5h carry=0

           rcl     dx,1
                 ; dx = 0100101101001010b 04b45h carry=1 Z=0 OF=1

This is effective for the shifting of monochrome data or masking planes, effectively making an image appear to move to the right.

; actual nibble mask    f00000000000ffff00e300080ff00ef7
Msk:    dd      00000000fh,0ffff0000h,080003e00h,07fe00ff0h


$Beg:   mov     esi,offset Msk+(16-4)
        add     edi,16-4
        mov     ecx,3-1               ; 4 arguments
        
        mov     eax,[esi]
        shl     eax,1
        mov     [edi],eax

$L1:    mov     eax,[esi]
        rcl     eax,1
        mov     [edi],eax
        sub     edi,4
        sub     esi,4
        dec     ecx
        jne     $L1

Rotate Right

ROR — Rotate Right

ROR destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
ROR

ror	rmDst(8/16/32/64), 1	[Un]signed
ror	rmDst(8/16/32/64), #(2ⁿ-1)
ror	rmDst(8/16/32/64), cl

This instruction shifts the destination to the right and rotates the least significant bit (LSB) into the carry and the most significant bit (MSB). The previous content of the carry is lost. This is similar to a wagon wheel with the bits going clockwise around and around shooting off bits like sparks. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2ⁿ). If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0; thus destination = 0. If using a count of 1, then Overflow = the exclusive OR of the two MSBs after the rotate. If the rotation is greater than 1, Overflow is undefined.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

  datd  dd      32
  datw  dw      16
  datb  db      8

        mov     dx,1010010110100101b 0a5a5h carry=0

        ror     dx,1
              ; dx = 1101001011010010b 0d2d2h carry=1 Z=0 OF=0

RCR — Rotate Carry Right

RCR destination, count

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RCR

rcr	rmDst(8/16/32/64), 1	[Un]signed
rcr	rmDst(8/16/32/64), #(2ⁿ-1)
rcr	rmDst(8/16/32/64), cl

This instruction shifts the destination to the right and rotates the least significant bit (LSB) into the carry. The previous content of the carry is rotated into the most significant bit (MSB). This is like a wagon wheel with an extra spoke. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2ⁿ). If count = 0, then there is no shift and the destination and flags are not changed.

Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0; thus destination = 0. If using a count of 1, then Overflow = the exclusive OR of the two MSBs after the rotate. If rotation is greater than 1, Overflow is undefined.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	×	×	×	-	×	×

  datd  dd      32
  datw  dw      16
  datb  db      8

        mov     dx,1010010110100101b 0a5a5h carry=0

        rcr     dx,1
              ; dx = 0100101101001010b 0d2d2h carry=1 Z=0 OF=1

Bit Scanning

BSF — Bit Scan Forward

BSF destination, source

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
BSF

bsf

rDst, rmSrc(16/32/64)

[Un]signed

This instruction scans the source for the least significant bit (LSB); if found, it sets the destination to the index (0...n). If the source is set to 0, the destination is undefined.

Flags: If the source is zero, then the Zero flag is set to 1; else it is set to 0.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	-	-	×	-	-	-

                                  ; ebx = some value
        mov     ecx,0             ; Preset to zero in case ebx is 0.
        mov     ebx,eax
        bsf     ecx,ebx           ; ecx contains the bit setting
        shr     ebx,cl            ; Down shift the data

Note

Preset the BSF destination to zero if the source is not guaranteed to be not zero (unless you plan to branch if zero).

BSR — Bit Scan Reverse

BSR destination, source

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
BSR

bsr

rDst, rmSrc(16/32/64)

[Un]signed

This instruction scans the source for the most significant bit (MSB); if found, it sets the destination to the index (0..n). If the source is set to 0, the destination is undefined.

Flags: If the source is zero, then the Zero flag is set to 1; else it is set to 0.
Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	-	-	×	-	-	-

ToolBox Snippet — Get Bit Count

Here is an obvious use for an 80×86 instruction versus generic C code. In various compression and data packing methods, it is sometimes necessary to find out the maximum number of bits needed to store a value from a list of values, such as how many bits it takes to compress a piece of data. For example, if we take a histogram of all the colors in a bitmap and the highest count is 53 colors (53 - 1 = 52) (034h) (110100b), then it would take 6 bits to store it:

The following calculates the number of bits needed to store an unsigned value.

 uint CountBits(uint val)
 {
       uint nCnt;

 #if 01
              // Use this! VERY MUCH FASTER than a bit scanner!
              // Reverse scan for MSB
       __asm  {
       mov    eax,0                 // Set to zero, in case val is zero.
              bsr eax,val           // N...0 Bit index
              inc eax               // N+1...1 # of bits
              mov nCnt,eax
       };
 #else
   uint nBit = 0x80000000;
   nCnt = 32;

   while ((nCnt > 1) && (0 == (nBit & val)))
   {
        nCnt--;
        nBit >>= 1;
   }
 #endif

 return nCnt;

So if you had numerous numbers to crunch, which of the code snippet algorithms just shown would you rather utilize? A couple instructions of assembly in the upper conditional code or looping up to 31 times maximum for each 32-bit integer value as shown in the lower conditional code?

Graphics 101 — Palette Bits

There is a problem when writing graphics-based applications for Win32. If graphics is 256 color, then the palette is a table lookup. If 24-or 32-bit true color, then the RGB bit orientation is standardized, but if in 16-bit hi-color, then you have a problem. There is no standard, however, although there seem to be two orientations: {5:5:5} and {5:6:5}. But which does that card you are trying to run your application on really use? Direct Draw solved this problem, as you can request the bit and mask information, but what if you are writing a Windows application running under GDI, or a DOS application, or whatever? If your source art is 256 color, you would need to create a conversion table to match your graphics mode. If you have Targa art files that are {5:5:5} and the graphics card is {5:6:5}, you would need to detect this and compensate for it. You will have to paste this code section into your own C code. So, my apologies to those only interested in pure assembly.

Tip

These days there is typically not a good reason for pure assembly code. It is a waste of programming resources.

Sorry if I am offending any of you speed freaks. I personally happen to like to squeeze every last drop of clock cycle out of my CPU, but development projects are about balance: code size vs. code speed. Project budget vs. time constraints. There are also portability issues, but these are usually resolved through abstraction where function pointers are used to vector to generic C or custom assembler based upon processor requirements.

// Color Masking information

typedef struct RGBInfoType
{
    uint RedMask;
    uint GreenMask;
    uint BlueMask;
    uint RedShift;
    uint GreenShift;
    uint BlueShift;
    uint zRedShift;
    uint zGreenShift;
    uint zBlueShift;
} RGBInfo;

// If a fixed palette then get 16/24/32-bit color masking information.

// hDC = (HDC) DC to window
// wgInfo = (BITMAPINFOHEADER)
// nDepth = BitsPerPixel;

HBITMAP hBitmap;
byte *pBMap;
RGBInfo info;

hBitmap = CreateDIBSection(hDC, (BITMAPINFO *) &wgInfo,
             DIB_PAL_COLORS, (VOID **) &pBMap,
             (HANDLE) NULL, (DWORD) NULL);
if (hBitmap != NULL)
{
    HBRUSH hBrush;
    RECT rect;

    rect.left = rect.top = 0;
    rect.right = rect.bottom = 2;

        // Test for Red Mask bits
    hBrush = CreateSolidBrush(0x000000ff); // 0x00bbggrr
    FillRect(hDC, &rect, hBrush);
    info.RedMask = *(DWORD *)(pBMap);

        // Test for Green Mask bits
    hBrush = CreateSolidBrush(0x0000ff00); // 0x00bbggrr
    FillRect(hDC, &rect, hBrush);
    info.GreenMask = *(DWORD *)(pBMap);

        // Test for Blue Mask bits
    hBrush = CreateSolidBrush(0x00ff0000); // 0x00bbggrr
    FillRect(hDC, &rect, hBrush);
    info.BlueMask = *(DWORD *)(pBMap);

        // Now analyze
    gfxGetColorMasks(nDepth, &info);
}

Now we come to the fun part. Normally we would only call these function pairs once during program initialization, so it does not need to be very fast. However, the BSF and BSR instructions are two that do not translate well to C code. They would need a slowpoke bit-shifting loop, so I thought it would be a good example and a function that you would be able to use in your own graphic applications.

  ; Color masking information

  RGBInfo STRUCT 4
      RedMask    dd   0     ; Red color mask
      GreenMask  dd   0     ; Green color mask

      BlueMask   dd    0       ; Blue color mask
  ; These bits are used to shift the bits into position
      RedShft    dd    0       ; Red shift bits up
      GreenShft  dd    0       ; Green shift bits up
      BlueShft   dd    0       ; Blue shift bits up
  ; These bits are used to reduce the bits needed
      zRedShft   dd    0       ; Red shift bits down
      zGreenShft dd    0       ; Green shift bits down
      zBlueShft  dd    0       ; Blue shift bits down
  RGBInfo     ENDS

  ;  Get fixed palette color mask information

  ;bool gfxGetColorMasks(uint nDepth, RGBInfo *pInfo)

      public  gfxGetColorMasks
  gfxGetColorMasks proc near
      push    ebp
      mov     ebp,esp

      mov     eax,[ebp+arg1]     ; nDepth - Get pixel depth
      push    esi
      shr     eax,3              ; n÷8
      push    ebx 
      mov     ebx,RGBMsks[eax*4] ; Get Premask
      mov     esi,[ebp+arg2]     ; pInfo - Get (RGBInfo *)
      test    ebx,ebx 
      jz      $Xit               ; false (failure)

      ; Save masks. Get position of first and last
      ; bits in mask for each color!

      ; Green bits
      mov     eax,(RGBInfo PTR [esi]).GreenMask
      and     eax,ebx            ; AND with mask
      mov     (RGBInfo PTR [esi]).GreenMask,eax
      bsf     ecx,eax            ; get LSB set to 1
      bsr     edx,eax            ; get MSB set to 1
      mov     (RGBInfo PTR [esi]).GreenShft,ecx
      sub     edx,ecx
      add     edx,-7
      neg     edx
      mov     (RGBInfo PTR [esi]).zGreenShft,edx

      ;Blue bits
      mov     eax,(RGBInfo PTR [esi]).BlueMask
      and     eax,ebx
      mov     (RGBInfo PTR [esi]).BlueMask,eax
      bsf     ecx,eax            ; get LSB set to 1
      bsr     edx,eax            ; get MSB set to 1
      mov     (RGBInfo PTR [esi]).BlueShft,ecx
      sub     edx,ecx
      add     edx,-7
      neg     edx

      mov     (RGBInfo PTR [esi]).zBlueShft,edx
      ;Red bits
      mov     eax,(RGBInfo PTR [esi]).RedMask
      and     eax,ebx
      mov     (RGBInfo PTR [esi]).RedMask,eax
      bsf     ecx,eax            ; get LSB set to 1
      bsr     edx,eax            ; get MSB set to 1
      mov     (RGBInfo PTR [esi]).RedShft,ecx
      sub     edx,ecx
      add     edx,-7
      neg     edx
      mov     (RGBInfo PTR [esi]).zRedShft,edx

      mov     eax,1              ; true (success)

  $Xit: pop     ebx
        pop     esi
        pop     ebp
        ret                      ; Return eax
  gfxGetColorMasks endp

Exercises

With a 32-bit data element size, what is the result of a logical right shifting of the following data by 34 bits? With an arithmetic right shift? With a logical left shift?
0xB83DE7820
Write a function to count the number of bits to contain a signed value.
Write a packed 16× 8-bit SAR, using pseudo vector code.
Write a packed 16× 8-bit SAR, using SIMD instructions.
Write a packed ROL using SIMD.
1. 8-bit
2. 16-bit
3. 32-bit
Write a packed ROR using SIMD.
1. 8-bit
2. 16-bit
3. 32-bit

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 5. Bit Wrangling

Create new playlist

Sign In

Sign Up

Chapter 5. Bit Wrangling

Logical Left Shifting

SHL/SAL - Shift (Logical/Arithmetic) Left

SHLD — Shift (Logical) Left (Double)

PSLLx — Parallel Shift Left (Logical)

Pseudo Vec

Packed Shift Left Logical 16×8-bit by n:{0...7}

Pseudo Vec (x86)

vmp_psllB (MMX) 16×8-bit Vector

Logical Right Shifting

SHR — Shift (Logical) Right

SHRD — Shift (Logical) Right (Double)

PSRLx — Parallel Shift Right (Logical)

Pseudo Vec

Packed Shift Right Logical 16×8-bit by n:{0...7}

Arithmetic Right Shifting

SAR — Shift (Arithmetic) Right

PSRAx — Packed Shift Right (Arithmetic)

Pseudo Vec

Rotate Left (or n-Right)

ROL — Rotate Left

RCL — Rotate Carry Left

Rotate Right

ROR — Rotate Right

RCR — Rotate Carry Right

Bit Scanning

BSF — Bit Scan Forward

Note

BSR — Bit Scan Reverse

ToolBox Snippet — Get Bit Count

Graphics 101 — Palette Bits

Tip

Exercises

Table of Contents for
5. Bit Wrangling