Okay, we learned how to branch in the last chapter. We also learned about some branchless coding methods such as the butterfly switch and value setting using bit blending. Now let's learn how to make decisions without branching. You have already learned instructions such as the bit test and set. Here you will be shown how to use masking logic to get the same result as one would with the use of branching. The first item to learn is that the MSB (negative bit #31) and Carry bits are your friends! Two values can be subtracted to obtain a positive or negative threshold, and then the resulting state of the MSB can be arithmetically shifted to create a mask of all ones or zeros. That mask can be used to mask or blend bits.
There is only one trick, however. The values need to be one bit less than the data size so as to take advantage of the MSB. Data could be shifted by the bit width of the data, but on some older embedded processors there are penalties for each shifted bit.
#define UINT_MAX_BITS (sizeof(uint) <<3) // 32-bit value = 32 #define INT_MAX_BITS ((sizeof(int) <<3)-1) // 32-bit value = 31 #define UCHAR_MAX_BITS (sizeof(uchar) <<3) // 8-bit value = 8 #define CHAR_MAX_BITS ((sizeof(char) <<3)-1) // 8-bit value = 7
Floating-point FABS was discussed in Chapter 8, "Floating-Point Anyone?" Here we will investigate packed integer-based floating-point. There is sometimes found in C code a macro definition similar to:
#define ABS(x) ((x) < 0 ? -(x) : (x))
But it can be easily replaced in C without the conditional test:
__inline int ABS(x) { int y; y = x >> INT_MAX_BITS; return (x ^ y) - y; }
In the previous chapter on branching, you found the following example of the signed integer absolute number function y = | x |, which uses a Jcc instruction.
test eax,eax ; Test if negative jns $Abs ; Jump if positive neg eax ; Invert number, two's complement $Abs: ; eax is positive
This can also be done without a Jcc instruction:
mov ecx,eax sar ecx,INT_MAX_BITS ; all one's if neg, else 0's xor eax,ecx ; At this point we're one's complement sub eax,ecx ; n-(-1)=n+1 = two's complement
Voilà, an ABS() function without any Jcc instructions. Just some sleight of hand using general-purpose instructions.
Let's look a little closer at how this works. Note that it will work on 8-, 16-, or 32-bit numbers in the same way. Below you will find an 8-bit negative number on the left and a positive 8-bit number on the right for a side-by-side comparison.
mov cl,al1
0001010b 8Ah (-118) mov dl,bl0
1111011b 7Bh (123) sar cl,711111111
b FFh sar dl,700000000
b 00h
The SAR instruction shifts the MSB arithmetically into all the other bit positions, so a negative number becomes all FFs, and a positive number becomes all 00s. Now XOR those bits with their original XOR value. You learned in a previous chapter that 11=0 and 10=1.
; xor al,cl 01110101b 75h xor bl,dl 01111011b 7Bh
A negative value would actually flip the original value's bits with a one's complement, and a positive value would keep the value intact. We then subtract the new value with that mask, which effectively adds a +1 or +0, thus a two's complement is performed, since n–(–1)=n+1 = two's complement.
; cl=11111111b
FFh dl=00000000
b 00h ; sub al,cl 01110110b 76h (118) sub bl,dl 01111011b 7Bh (123)
Pretty cool, huh! And no branching!
In C code you may find a macro definition similar to:
#define MIN(p, q) ((p) < (q) ? (p) : (q))
But it can be easily replaced in C without the condition test:
__inline MIN(int p, int q) { int r; // r=(p > q) ? p : q r = (p-q) >> INT_MAX_BITS; // (-)=0xFFFFFFFF // (+)=0x00000000 return (p & r) | (q & (r^-1)); // keep lower of p or q }
Normally there would be a single branch test, but as you will note in the C example above and in the assembly below, there is no branch.
MIN PROC C PUBLIC p:DWORD, q:DWORD mov edx,p ; edx = p mov ecx,q ; ecx = q ; r=(p < q) ? p : q ; r = (p-q) >> INT_MAX_BITS; ; (-)=0xFFFFFFFF (+)=0x00000000 mov eax,edx sub eax,ecx sar eax,INT_MAX_BITS ; (int >> 31) ; return (p r) (q ( r)) and edx,eax ; (p r) xor eax,-1 ; ( ) and eax,ecx ; q ( r) or eax,edx ret ; result is in eax MIN endp
This function is similar to the ABS() function above. The argument q is subtracted from p to force a negative value if p is less than q. Upon that negative result, a bit mask of all one's is created. By ANDing that mask with p, ANDing the inverse of that mask with q, and ORing the results, the minimum value is retrieved.
The function MAX() would have a similar implementation except that p and q would be internally swapped as shown below.
MAX PROC C PUBLIC p:DWORD, q:DWORD mov edx,q ; edx = q mov ecx,p ; ecx = p ; r=(q < p) ? q : p ; r = (q-p) >> INT_MAX_BITS; ; (-)=0xFFFFFFFF (+)=0x00000000 mov eax,edx sub eax,ecx sar eax,INT_MAX_BITS ; (int >> 31) ; return (q r) (p ( r)) and edx,eax ; (q r) xor eax,-1 ; ( r) and eax,ecx ; p ( r) or eax,edx ret ; result is in eax MAX endp
There is also something to note in this code. This is a single MIN() function test. As such there is no branch, but there are dependency stalls. The EAX register is constantly being reassigned or operated upon and thus stalling any pipelining. The good news is that it is still faster than the branch. If a particular pair of arrays needed to have a similar operation performed, the code could be intermixed in such a way that pipelining could be used to advantage.
Then again, remember that job interview question in the previous chapter? It was related to writing a STRUPR() function. Let's try that again!
; strupr snippet xyzzy db "Quick brown fox jumped!",0 mov al,[edx] ; Get a character
if 0 ; *** Old Branching Code *** $L1: cmp al,'a' jb $L2 ; (1) Jump if symbols or uppercase cmp al,'z' ja $L2 ; (2) Jump if extended ASCII sub al,20h ; Convert to uppercase mov [edx],al ; Save altered character $L2: inc edx ; Nothing to do; next! mov al,[edx] ; Get a character test al,al jnz $L1 ; 0 terminator? else ; *** New Branchless Code *** $L1: mov ah,al sub ah,'a' sub al,'z'+1 shr ah,2 ; xx?xxxxxb <'a'=0x20 else 0x00 shr al,2 ; xx?xxxxxb ≤ 'z'=0x20 else 0x00 xor ah,0ffh ; Flip the bits and ah,al ; Blend the ˜'a' and 'z' masks and ah,020h ; 00X00000b strip masking bit sub [edx],ah ; -00h or -20h adjustment to character inc edx ; Nothing to do; next! mov al,[edx] ; Get the next character test al,al jnz $L1 ; 0 terminator? endif
In the old branching code section you will notice the two branches, which can be costly in mispredictions. Statistically, there will be more lowercase than uppercase and symbols, and an even higher misprediction rate if Single-Byte Character System (SBCS) international characters are involved. The branchless code is not very optimized, but you will notice some pairing between the AL and AH registers processing the lower and upper limits of the inclusive character range of a to z. There are still some stalls near the bottom of the loop, but there are no prediction errors or CPU cycle hits from having to take a branch; thus, the new code is faster. By spending some optimization time intermixing the looping logic with the actual masking logic, the register dependencies can be reduced even further, allowing the code to run even faster.
$L1: mov ah,al sub al,'z'+1 sub ah,'a'
shr al,2 ; xx?xxxxxb ≤ 'z'=0x20 else 0x00 shr ah,2 ; xx?xxxxxb <'a'=0x20 else 0x00 inc edx ; Nothing to do; next! not ah ; Flip the bit ;ah stall and ah,al ; Blend the ˜'a' and 'z' masks mov al,[edx] ; Get the next character and ah,020h ; 00X00000b strip masking bit ;ah stall sub [edx-1],ah ; -00h or -20h adjustment to character test al,al jnz $L1 ; 0 terminator?
8-bit values can be mirrored, loaded into a 32-bit register, and then processed simultaneously, but there would be a significant stall in doing so. Experiment and give it a try anyway!
So now you are probably thinking, "That was silly, since it will probably never be needed!" Just remember that it just might be your next job interview test question. When writing code in optimized assembly, one has to initially decide whether it is worthwhile and an efficient use of time. I am sorry to say that string converters are typically not; however, the principles you learned here can be applied to both the C and assembly world of programming, making your code faster and more efficient, and earning the admiration of your fellow programming team, the recognition (or lack thereof) by your manager, and hopefully a better performance review and hefty pay hike! Or not!
Quick 2D distance formulas are used for calculating waypoints in an isometric (flat) terrain-based computer world, artificial intelligence (AI) in path finding, and in (pre-) collision detection and other tasks, but these calculations need not be very accurate, only quick. Normally, to find a 2D distance one merely uses the Pythagorean theorem: dx2+dy2=r2, hence r=sqrt(dx2+dy2). You remember the hypotenuse of the triangle from a geometry lesson in school: the shortest line between two points. A 3D distance would use the equation dx2+dy2+dz2=r2. The problem is that the 2D version would require two multiplications, an addition, and a costly square root calculation. That would take too much time when a series of these calculations need to be performed. Since accuracy is not required, a formula such as the following could be used as an alternative. Note that this has a compiled code length of 127 bytes.
int Quick2DDist(int x1, int y1, int x2, int y2) { int dx, dy; dx = (x2-x1) < 0 ? (x1-x2) : (x2-x1); // dx= |x2-x1 | dy = (y2-y1) < 0 ? (y1-y2) : (y2-y1); // dy= |y2-y1 | return((dx > dy) ? dy+(dx>>1) : dx+(dy>>1)); }
The equation returns a summation of the longest axis and half the shortest axis. It is fairly efficient but contains three conditional branches that can be removed and instead use masking logic to make it even faster. The ABS() function was discussed in detail earlier in this chapter. The new algorithm is a comparison with a substitution. So applying what we have learned about branchless coding we get this new function:
int Quick2DDist(int x1, int y1, int x2, int y2) { int x, y, dx, dy; // dx=ABS(x2-x1); dy=ABS(y2-y1); x = (x2-x1); y = (y2-y1); dx = x >> INT_MAX_BITS; dy = y >> INT_MAX_BITS; dx = (x ^ dx) - dx; dy = (y ^ dy) - dy; // return( (dx < dy) ? dy+(dx>>1) : dx+(dy>>1)); x = dy + (dx >> 1); y = dx + (dy >> 1); dx = (dx-dy) >> INT_MAX_BITS; return (x & dx) | (y & (dx^-1)); }
Note that this function has a length of 133 bytes and is only slightly larger than the previous, but runs much faster. There is no branching, and it's been set up for pipelining although the compiler will optimize it in its own way. This would normally be good enough but if there is a need for more speed, then check out the following conversion of this optimized C code to assembly.
Quick2DDist PROC C PUBLIC x1:DWORD, y1:DWORD, x2:DWORD, y2:DWORD push ebx ; dx = |x2-x1| dy = |y2-y1| mov ecx,y2 ; y = (y2-y1) mov ebx,x2 ; x = (x2-x1) sub ecx,y1 sub ebx,x1
; dx = x >> INT_MAX_BITS; dy = y >> INT_MAX_BITS mov edx,ecx mov eax,ebx sar edx,INT_MAX_BITS sar eax,INT_MAX_BITS ; dx = (x dx) - dx; dy = (y dy) - dy xor ecx,edx ; (ydy) xor ebx,eax ; (xdx) sub ecx,edx ; ecx=dy sub ebx,eax ; ebx=dx ; return((dx < dy) ? dy+(dx>>1) : dx+(dy>>1)); mov edx,ecx mov eax,ebx shr edx,1 ; (dy>>1) shr eax,1 ; (dx>>1) add edx,ebx ; y = dx + (dy >> 1); sub ebx,ecx ; r=(dx-dy) add eax,ecx ; x = dy + (dx >> 1); sar ebx,INT_MAX_BITS ; r=(+)=00's or (-)=FF's ; ebx=r (dx mask) eax=x edx=y ; ebx stall ; return (x r) (y (r)) mov ecx,ebx ; (r) and eax,ebx ; (x r) not ecx ; Flip mask ; ecx stall and edx,ecx ; (y (r)) pop ebx or eax,edx ; eax = distance ret Quick2DDist endp
Note that this only requires half the length of either version of the C code as it is only 66 bytes in length. Since it was hand optimized assembly and not C, it is that much faster because of reduced code cache fetches and fewer stalls. Keep in mind that code written in assembly is not always smaller; in fact, it is typically larger, but this just demonstrates some hidden benefits. Again, you will note the blended instructions to support two pipelines, but unfortunately there are two remaining register stalls. By not using the proc macro, the POP EBP and POP EBX instructions can be moved up, removing one of the stalls. If you look carefully, the code is visually paired until you get about halfway down, at which point the logic gets a little foggy. That is because lower code was moved up to remove additional register stalls that would have occurred, and thus making the code less readable.
Again, please remember that you should rewrite your C code in an attempt to break it into simpler components, keeping in mind that it should be readable. Once this is proven to run, then work on the assembly code using this fresher C version as a template to follow. Use the Carry or one of the data bits such as the MSB to smear the bit pattern across the other bits as required. The alternative method would be to use the SET instruction to obtain a 0 or 1, then a decrement to get a –1 or 0 for a mask.