Chapter 11. Branchless

Okay, we learned how to branch in the last chapter. We also learned about some branchless coding methods such as the butterfly switch and value setting using bit blending. Now let's learn how to make decisions without branching. You have already learned instructions such as the bit test and set. Here you will be shown how to use masking logic to get the same result as one would with the use of branching. The first item to learn is that the MSB (negative bit #31) and Carry bits are your friends! Two values can be subtracted to obtain a positive or negative threshold, and then the resulting state of the MSB can be arithmetically shifted to create a mask of all ones or zeros. That mask can be used to mask or blend bits.

Branchless

There is only one trick, however. The values need to be one bit less than the data size so as to take advantage of the MSB. Data could be shifted by the bit width of the data, but on some older embedded processors there are penalties for each shifted bit.

  #define UINT_MAX_BITS   (sizeof(uint) <<3)          // 32-bit value = 32
  #define INT_MAX_BITS    ((sizeof(int) <<3)-1)       // 32-bit value = 31
  #define UCHAR_MAX_BITS  (sizeof(uchar) <<3)         // 8-bit value = 8
  #define CHAR_MAX_BITS   ((sizeof(char) <<3)-1)      // 8-bit value = 7

Function y=ABS(x) 'Absolute' D = | A |

Floating-point FABS was discussed in Chapter 8, "Floating-Point Anyone?" Here we will investigate packed integer-based floating-point. There is sometimes found in C code a macro definition similar to:

        #define ABS(x) ((x) < 0 ? -(x) : (x))

But it can be easily replaced in C without the conditional test:

   __inline int ABS(x)
   {
       int y;
       y = x >> INT_MAX_BITS;
       return (x ^ y) - y;
   }

In the previous chapter on branching, you found the following example of the signed integer absolute number function y = | x |, which uses a Jcc instruction.

        test    eax,eax  ; Test if negative
        jns     $Abs     ; Jump if positive

        neg     eax      ; Invert number, two's complement

$Abs:                    ; eax is positive

This can also be done without a Jcc instruction:

   mov  ecx,eax
   sar  ecx,INT_MAX_BITS ; all one's if neg, else 0's
   xor  eax,ecx          ; At this point we're one's complement
   sub  eax,ecx          ; n-(-1)=n+1 = two's complement

Voilà, an ABS() function without any Jcc instructions. Just some sleight of hand using general-purpose instructions.

Let's look a little closer at how this works. Note that it will work on 8-, 16-, or 32-bit numbers in the same way. Below you will find an 8-bit negative number on the left and a positive 8-bit number on the right for a side-by-side comparison.

   mov  cl,al   10001010b 8Ah (-118)   mov dl,bl       01111011b    7Bh     (123)
   sar  cl,7    11111111b FFh          sar dl,7        00000000b    00h

The SAR instruction shifts the MSB arithmetically into all the other bit positions, so a negative number becomes all FFs, and a positive number becomes all 00s. Now XOR those bits with their original XOR value. You learned in a previous chapter that 1Function y=ABS(x) 'Absolute' D = | A |1=0 and 1Function y=ABS(x) 'Absolute' D = | A |0=1.

        ; xor al,cl 01110101b 75h       xor bl,dl 01111011b 7Bh

A negative value would actually flip the original value's bits with a one's complement, and a positive value would keep the value intact. We then subtract the new value with that mask, which effectively adds a +1 or +0, thus a two's complement is performed, since n–(–1)=n+1 = two's complement.

   ;    cl=       11111111b FFh       dl=       00000000b 00h
   ;    sub al,cl 01110110b 76h (118) sub bl,dl 01111011b 7Bh (123)

Pretty cool, huh! And no branching!

Function y=MIN(p, q) 'Minimum'

In C code you may find a macro definition similar to:

   #define MIN(p, q) ((p) < (q) ? (p) : (q))

But it can be easily replaced in C without the condition test:

  __inline MIN(int p, int q)
  {
      int r;
          // r=(p > q) ? p : q
      r = (p-q) >> INT_MAX_BITS;     // (-)=0xFFFFFFFF
                                     // (+)=0x00000000
      return (p & r) | (q & (r^-1)); // keep lower of p or q
  }

Normally there would be a single branch test, but as you will note in the C example above and in the assembly below, there is no branch.

 MIN PROC C PUBLIC p:DWORD, q:DWORD
     mov edx,p               ; edx = p
     mov ecx,q               ; ecx = q

         ; r=(p < q) ? p : q
         ; r = (p-q) >> INT_MAX_BITS;
         ; (-)=0xFFFFFFFF (+)=0x00000000
     mov eax,edx
     sub eax,ecx
     sar eax,INT_MAX_BITS    ; (int >> 31)

         ; return (p Function y=MIN(p, q) 'Minimum'  r) Function y=MIN(p, q) 'Minimum'  (q Function y=MIN(p, q) 'Minimum'  (Function y=MIN(p, q) 'Minimum' r))
     and edx,eax             ; (p Function y=MIN(p, q) 'Minimum'  r)
     xor eax,-1              ; (Function y=MIN(p, q) 'Minimum' )
     and eax,ecx             ; q Function y=MIN(p, q) 'Minimum'(Function y=MIN(p, q) 'Minimum' r)
     or eax,edx
     ret                     ; result is in eax
 MIN endp

This function is similar to the ABS() function above. The argument q is subtracted from p to force a negative value if p is less than q. Upon that negative result, a bit mask of all one's is created. By ANDing that mask with p, ANDing the inverse of that mask with q, and ORing the results, the minimum value is retrieved.

Function y=MAX(p, q) 'Maximum'

The function MAX() would have a similar implementation except that p and q would be internally swapped as shown below.

   MAX  PROC C PUBLIC p:DWORD,  q:DWORD
        mov edx,q                ; edx = q
        mov ecx,p                ; ecx = p

            ; r=(q < p) ? q : p
            ; r = (q-p) >> INT_MAX_BITS;
            ; (-)=0xFFFFFFFF (+)=0x00000000
        mov eax,edx
        sub eax,ecx
        sar eax,INT_MAX_BITS    ; (int >> 31)

            ; return (q Function y=MAX(p, q) 'Maximum'  r) Function y=MAX(p, q) 'Maximum'  (p Function y=MAX(p, q) 'Maximum'  (Function y=MAX(p, q) 'Maximum' r))
        and edx,eax             ; (q Function y=MAX(p, q) 'Maximum'  r)
        xor eax,-1              ; (Function y=MAX(p, q) 'Maximum' r)
        and eax,ecx             ; p Function y=MAX(p, q) 'Maximum'  (Function y=MAX(p, q) 'Maximum' r)
        or eax,edx
        ret                     ; result is in eax
   MAX  endp

There is also something to note in this code. This is a single MIN() function test. As such there is no branch, but there are dependency stalls. The EAX register is constantly being reassigned or operated upon and thus stalling any pipelining. The good news is that it is still faster than the branch. If a particular pair of arrays needed to have a similar operation performed, the code could be intermixed in such a way that pipelining could be used to advantage.

Then again, remember that job interview question in the previous chapter? It was related to writing a STRUPR() function. Let's try that again!

  ;       strupr snippet
  xyzzy   db     "Quick brown fox jumped!",0

        mov    al,[edx]  ; Get a character
 if 0                    ; *** Old Branching Code ***

 $L1:   cmp     al,'a'
        jb      $L2      ; (1) Jump if symbols or uppercase
        cmp     al,'z'
        ja      $L2      ; (2) Jump if extended ASCII

        sub     al,20h   ; Convert to uppercase
        mov     [edx],al ; Save altered character

 $L2:   inc     edx      ; Nothing to do; next!
        mov     al,[edx] ; Get a character
        test    al,al
        jnz     $L1      ; 0 terminator?

 else                    ; *** New Branchless Code ***

 $L1:   mov     ah,al
        sub     ah,'a'
        sub     al,'z'+1
        shr     ah,2     ; xx?xxxxxb <'a'=0x20 else 0x00
        shr     al,2     ; xx?xxxxxb ≤ 'z'=0x20 else 0x00
        xor     ah,0ffh  ; Flip the bits
        and     ah,al    ; Blend the ˜'a' and 'z' masks
        and     ah,020h  ; 00X00000b strip masking bit
        sub     [edx],ah ; -00h or -20h adjustment to character
        inc     edx      ; Nothing to do; next!

        mov     al,[edx] ; Get the next character
        test    al,al
        jnz     $L1      ; 0 terminator?
 endif

In the old branching code section you will notice the two branches, which can be costly in mispredictions. Statistically, there will be more lowercase than uppercase and symbols, and an even higher misprediction rate if Single-Byte Character System (SBCS) international characters are involved. The branchless code is not very optimized, but you will notice some pairing between the AL and AH registers processing the lower and upper limits of the inclusive character range of a to z. There are still some stalls near the bottom of the loop, but there are no prediction errors or CPU cycle hits from having to take a branch; thus, the new code is faster. By spending some optimization time intermixing the looping logic with the actual masking logic, the register dependencies can be reduced even further, allowing the code to run even faster.

   $L1: mov     ah,al
        sub     al,'z'+1
        sub     ah,'a'
        shr     al,2        ; xx?xxxxxb  ≤ 'z'=0x20 else 0x00
        shr     ah,2        ; xx?xxxxxb  <'a'=0x20 else 0x00
        inc     edx         ; Nothing to do; next!
        not     ah          ; Flip the bit
   ;ah stall
        and     ah,al       ; Blend the ˜'a' and 'z' masks
        mov     al,[edx]    ; Get the next character
        and     ah,020h     ; 00X00000b strip masking bit
   ;ah stall
        sub     [edx-1],ah  ; -00h or -20h adjustment to character

        test    al,al
        jnz     $L1         ; 0 terminator?

8-bit values can be mirrored, loaded into a 32-bit register, and then processed simultaneously, but there would be a significant stall in doing so. Experiment and give it a try anyway!

So now you are probably thinking, "That was silly, since it will probably never be needed!" Just remember that it just might be your next job interview test question. When writing code in optimized assembly, one has to initially decide whether it is worthwhile and an efficient use of time. I am sorry to say that string converters are typically not; however, the principles you learned here can be applied to both the C and assembly world of programming, making your code faster and more efficient, and earning the admiration of your fellow programming team, the recognition (or lack thereof) by your manager, and hopefully a better performance review and hefty pay hike! Or not!

Graphics 101 — Quick 2D Distance

Quick 2D distance formulas are used for calculating waypoints in an isometric (flat) terrain-based computer world, artificial intelligence (AI) in path finding, and in (pre-) collision detection and other tasks, but these calculations need not be very accurate, only quick. Normally, to find a 2D distance one merely uses the Pythagorean theorem: dx2+dy2=r2, hence r=sqrt(dx2+dy2). You remember the hypotenuse of the triangle from a geometry lesson in school: the shortest line between two points. A 3D distance would use the equation dx2+dy2+dz2=r2. The problem is that the 2D version would require two multiplications, an addition, and a costly square root calculation. That would take too much time when a series of these calculations need to be performed. Since accuracy is not required, a formula such as the following could be used as an alternative. Note that this has a compiled code length of 127 bytes.

   int Quick2DDist(int x1, int y1, int x2, int y2)
   {
       int dx, dy;

       dx = (x2-x1) < 0 ? (x1-x2) : (x2-x1); // dx= |x2-x1 |
       dy = (y2-y1) < 0 ? (y1-y2) : (y2-y1); // dy= |y2-y1 |
       return((dx > dy) ? dy+(dx>>1) : dx+(dy>>1));
   }

The equation returns a summation of the longest axis and half the shortest axis. It is fairly efficient but contains three conditional branches that can be removed and instead use masking logic to make it even faster. The ABS() function was discussed in detail earlier in this chapter. The new algorithm is a comparison with a substitution. So applying what we have learned about branchless coding we get this new function:

    int Quick2DDist(int x1, int y1, int x2, int y2)
    {
        int x, y, dx, dy;

            // dx=ABS(x2-x1); dy=ABS(y2-y1);
        x = (x2-x1);
        y = (y2-y1);
        dx = x >> INT_MAX_BITS;
        dy = y >> INT_MAX_BITS;
        dx = (x ^ dx) - dx;
        dy = (y ^ dy) - dy;

            // return( (dx < dy) ? dy+(dx>>1) : dx+(dy>>1));
        x = dy + (dx >> 1);
        y = dx + (dy >> 1);
        dx = (dx-dy) >> INT_MAX_BITS;
        return (x & dx) | (y & (dx^-1));
    }

Note that this function has a length of 133 bytes and is only slightly larger than the previous, but runs much faster. There is no branching, and it's been set up for pipelining although the compiler will optimize it in its own way. This would normally be good enough but if there is a need for more speed, then check out the following conversion of this optimized C code to assembly.

    Quick2DDist PROC C PUBLIC x1:DWORD, y1:DWORD, x2:DWORD, y2:DWORD
        push ebx
           ; dx = |x2-x1|    dy = |y2-y1|
        mov  ecx,y2          ; y = (y2-y1)
        mov  ebx,x2          ; x = (x2-x1)
        sub  ecx,y1
        sub  ebx,x1
           ; dx = x >> INT_MAX_BITS; dy = y >> INT_MAX_BITS
        mov edx,ecx
        mov eax,ebx
        sar edx,INT_MAX_BITS
        sar eax,INT_MAX_BITS

           ; dx = (x Graphics 101 — Quick 2D Distance  dx) - dx; dy = (y Graphics 101 — Quick 2D Distance  dy) - dy
        xor ecx,edx          ; (yGraphics 101 — Quick 2D Distancedy)
        xor ebx,eax          ; (xGraphics 101 — Quick 2D Distancedx)
        sub ecx,edx          ; ecx=dy
        sub ebx,eax          ; ebx=dx

           ; return((dx < dy) ? dy+(dx>>1) : dx+(dy>>1));
        mov edx,ecx
        mov eax,ebx
        shr edx,1            ; (dy>>1)
        shr eax,1            ; (dx>>1)
        add edx,ebx          ; y = dx + (dy >> 1);

        sub ebx,ecx          ; r=(dx-dy)

        add eax,ecx          ; x = dy + (dx >> 1);

        sar ebx,INT_MAX_BITS ; r=(+)=00's or (-)=FF's
            ; ebx=r (dx mask) eax=x edx=y

            ; ebx stall
            ; return (x Graphics 101 — Quick 2D Distance  r) Graphics 101 — Quick 2D Distance  (y Graphics 101 — Quick 2D Distance  (Graphics 101 — Quick 2D Distancer))
        mov ecx,ebx          ; (r)
        and eax,ebx          ; (x Graphics 101 — Quick 2D Distance r)
        not ecx              ; Flip mask
            ; ecx stall

        and edx,ecx          ; (y Graphics 101 — Quick 2D Distance  (Graphics 101 — Quick 2D Distancer))
        pop ebx
        or  eax,edx          ; eax = distance
        ret
    Quick2DDist endp

Note that this only requires half the length of either version of the C code as it is only 66 bytes in length. Since it was hand optimized assembly and not C, it is that much faster because of reduced code cache fetches and fewer stalls. Keep in mind that code written in assembly is not always smaller; in fact, it is typically larger, but this just demonstrates some hidden benefits. Again, you will note the blended instructions to support two pipelines, but unfortunately there are two remaining register stalls. By not using the proc macro, the POP EBP and POP EBX instructions can be moved up, removing one of the stalls. If you look carefully, the code is visually paired until you get about halfway down, at which point the logic gets a little foggy. That is because lower code was moved up to remove additional register stalls that would have occurred, and thus making the code less readable.

Again, please remember that you should rewrite your C code in an attempt to break it into simpler components, keeping in mind that it should be readable. Once this is proven to run, then work on the assembly code using this fresher C version as a template to follow. Use the Carry or one of the data bits such as the MSB to smear the bit pattern across the other bits as required. The alternative method would be to use the SET instruction to obtain a 0 or 1, then a decrement to get a –1 or 0 for a mask.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset