This is one of two chapters that are probably the reason you bought this book. You were probably working with a graphics library and ran into a little bind: there did not seem to be a library function available with that special functionality you needed. Your schedule is sliding, your nerves are wrecked, and paranoia is setting in as you believe your project lead or manager is beginning to doubt your abilities and that pink slip is only about a week away. Out of desperation you have escaped to your refuge, that favorite technical bookstore that has rescued you so many times in the past. You find lots of graphics books, but not what you are looking for. There are a few assembly books that you have seen over the years, but they have always been targeted at beginners; you have bought them anyway for your personal library as resource books. (I have always been impressed with someone who has multiple large bookcases of dog-eared books in their office. There is a fine art to making brand new books look well used.) And then you see this book. You flip through it, this chapter catches your gleaming eye, and you whisper in euphoria to yourself, "This book will save my butt!" At that point you look around and see everyone in the bookstore staring at you as you skulk toward the sales clerk.
Of course, there are other code samples in my Vector Game Math Processors book so do not forget to buy that book as well.
For those of you C programmers out there, this chapter is very similar to those heavily used functions memset() and memcpy(). They are used in almost every application for a large variety of purposes, but their behavior typically is not useful or fast enough for clearing or blitting graphic images. Some of you are probably thinking, "Why isn't this guy using the hardware blitter on a graphics card?" Well, in some cases the blitter is hidden from you in the bowels of drivers such as DirectDraw, but that is all it is, a blitter: a hardware device designed to move video card memory to video card memory, and you only have 64 to 256 MB to play with. Okay, okay, that is much better than a couple of years ago when you only had 2 to 8 MB. What we are trying to do here is learn the optimal method of moving memory around the computer system and from system memory to video memory. Also, just where did those images come from? Whose file format, and what compression type? How did they get loaded? Where do you get the driver? There is also more to life than displaying video games! What about streaming media such as MPEG-4 and DivX, video analysis, scientific research, speech recognition, stereoscopic vision, etc.? The list is endless, and new reasons are being invented all the time. Now that I am off my soapbox, we can continue.
If you happen to be constructing 8-bit images, then the memset() function can work pretty well for you as the transparency value can be anywhere from 0 to 255. If working in 16-, 24-, or 32-bit colors this function is only useful if the transparency color that you are trying to set just happens to be 0; if it is any other value, you have a serious problem. You have put together a C function to do the task but even though this function is not called that often, its speed is not up to your needs.
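To make the problem concrete, here is a minimal C sketch of a word-wide fill loop. The function name fill16 and its signature are mine, for illustration only; the point is that memset() can only replicate a single byte, so a 16-bit color whose high and low bytes differ cannot be produced with it:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a 16-bit pixel buffer with an arbitrary color value.
   memset() replicates one byte, so a transparency color such as
   0x7C1F (high byte != low byte) requires a word-wide loop. */
static void fill16(uint16_t *pPix, uint16_t color, size_t nCnt)
{
    while (nCnt--)
        *pPix++ = color;
}
```

The same shape extends to 32-bit pixels with a uint32_t pointer; 24-bit pixels need a 3-byte pattern loop and are the awkward case.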
Your older 32-bit C libraries typically have the following library function for clearing a block of memory:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near
        push    ebp
        mov     ebp,esp
        push    edi

        mov     ecx,[ebp+arg3]     ; nCnt
        mov     edi,[ebp+arg1]     ; pMem
        mov     eax,[ebp+arg2]     ; val

        ; Insert one of the following code examples here!

$xit:   mov     eax,[ebp+arg1]     ; Return pointer
        pop     edi
        pop     ebp
        ret
memset  endp
The following loop is really inefficient code except when used on the old 8086 processors.
        test    ecx,ecx
        jz      $xit
$L0:    stosb
        loop    $L0
The following code is relatively small but pretty inefficient, as it uses the repeating string instruction to write a series of 8-bit bytes. On Pentium processors the payoff only comes with a repeat count of 64 or more.
rep stosb
With a repeat factor of less than 64, use the following. Note that when using ES:[EDI], the ES: segment is already the default for EDI, so we do not really need to write it in the code.
        test    ecx,ecx
        jz      $xit               ; jump if len of 0
$L1:    mov     es:[edi],al        ; set a byte
        inc     edi
        dec     ecx
        jne     $L1
An alternate method that is a lot more efficient than those listed above is to divide the total number of bytes by four to get the number of 4-byte blocks, loop on that, and then handle the 1 to 3 remainder bytes.
        test    ecx,ecx
        jz      $xit               ; jump if len of 0

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned, so we build a 32-bit value to write (al = val)

        mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx              ; eax = replicated byte x4

        mov     edx,ecx            ; Get # of bytes to set
        shr     edx,2              ; n = n / 4
        jz      $L2                ; Jump if 1..3 bytes

; edx = # of 32-bit writes
$L1:    mov     [edi],eax          ; set 4 bytes
        add     edi,4              ; advance pointer by 4 bytes
        dec     edx
        jne     $L1                ; Loop for DWORDS
; Remainders (1..3) bytes
$L2:    and     ecx,00000011b      ; Mask remainder bits (0..3)
        jz      $L4                ; Jump if no remainders

; 1 to 3 bytes to set
$L3:    mov     [edi],al           ; set 1 byte
        inc     edi                ; advance pointer by 1 byte
        dec     ecx
        jne     $L3                ; Loop for 1's
$L4:
There are more sophisticated methods that you can employ but this is a good start.
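For reference, the dword-splitting technique above maps directly to C. This is a sketch only (the name memset4 is mine); the memcpy() of the replicated dword keeps the 4-byte store well-defined even when the pointer is not aligned:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Set nCnt bytes to val, writing 4 bytes at a time where possible,
   then mopping up the 1..3 remainder bytes. */
static void *memset4(void *pMem, int val, size_t nCnt)
{
    uint8_t *p = (uint8_t *)pMem;
    uint8_t  b = (uint8_t)val;
    uint32_t d = (uint32_t)b * 0x01010101u;  /* replicate byte x4 */
    size_t   n = nCnt >> 2;                  /* # of 4-byte writes */

    while (n--) {              /* body: 4 bytes per write */
        memcpy(p, &d, 4);      /* well-defined even if p misaligned */
        p += 4;
    }
    nCnt &= 3;                 /* remainder 0..3 bytes */
    while (nCnt--)
        *p++ = b;
    return pMem;
}
```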
For optimal performance all data reads and writes must be on a 32-bit boundary. In a copy situation, if the source and destination are misaligned, there is not much that can be done about it. But in the case of setting memory that is misaligned, it is a snap to fix it.
Figure 19-1. Imagine these four differently aligned memory strands as eels. We pull out our sushi knife and finely chop off their heads into little 8-bit (1-byte) chunks, chop off the tails into 8-bit (1-byte) chunks, and then coarsely chop the bodies into larger 32-bit (4-byte) chunks, and serve raw.
The first three memory strands had their heads misaligned, but the fourth was aligned properly. On the other hand, the tails of the last three were misaligned and the first one was aligned properly. Now that they're sliced and diced, their midsections are all properly aligned for best data handling.
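The eel-chopping above can be sketched in C as a head/body/tail split. The name memset_aligned is mine, and this is an illustration of the idea rather than tuned code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Chop the eel": byte-fill the misaligned head, dword-fill the
   aligned body, then byte-fill the 1..3 tail bytes. */
static void memset_aligned(uint8_t *p, uint8_t val, size_t n)
{
    uint32_t d = (uint32_t)val * 0x01010101u;   /* replicated byte x4 */

    /* Head: one byte at a time until p lands on a 4-byte boundary */
    while (n && ((uintptr_t)p & 3)) {
        *p++ = val;
        --n;
    }
    /* Body: p is now aligned, so write 4 bytes per iteration */
    while (n >= 4) {
        memcpy(p, &d, 4);
        p += 4;
        n -= 4;
    }
    /* Tail: the final 1..3 bytes */
    while (n--)
        *p++ = val;
}
```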
The latest C runtime libraries use something a lot more elaborate such as the following function:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near

$BSHFT  =       2                  ; Shift count
$BCNT   =       4                  ; Byte count

        push    ebp
        mov     ebp,esp

; Unlike the code above, flow does not have to fall through
; if a size of 0 was passed, and so we need to test for it.
; The lines are adjusted to help prevent a stall.

        mov     ecx,[ebp+arg3]     ; nCnt
        push    edi

; Older programmers will say hey, why didn't you 'OR ecx,ecx',
; but that is a read/write operation that will cost you
; time for the write. The 'TEST ecx,ecx' is a read only!

        test    ecx,ecx
        push    ebx
        jz      $Xit               ; jump if size is 0

        mov     edi,[ebp+arg1]     ; pMem

; If the size is (1..3) bytes long, then handle as tail bytes
        test    ecx,NOT ($BCNT-1)
        mov     eax,[ebp+arg2]     ; val
        jz      $Tail

; If already aligned on an (address mod 4)==0 boundary
        mov     edx,edi
        and     edx,($BCNT-1)
        jz      $SetD

; The memory attempting to be set may not be properly aligned on
; a 4-byte boundary, and thus if the block is 4 bytes in size or
; greater, the 32-bit writes will have clock penalties on each
; write, so first adjust to be properly aligned.

        sub     edx,$BCNT          ; edx = -(# of lead bytes)
        add     ecx,edx            ; Reduce # of bytes to set
$Lead:  mov     [edi],al           ; Set a byte
        inc     edi
        inc     edx
        jne     $Lead              ; Loop for those {1..3} bytes

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned, so build a 32-bit value to write (al = val)

$SetD:  mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx              ; eax = replicated byte x4

; Now we set the bytes four at a time
        mov     edx,ecx
        shr     edx,$BSHFT         ; (n/4) = # of 32-bit writes
        jz      $Rem               ; fewer than 4 bytes remain
$SetD1: mov     [edi],eax
        add     edi,$BCNT
        dec     edx
        jne     $SetD1

$Rem:   and     ecx,($BCNT-1)
        jz      $Xit               ; jump if no trailing bytes

; Write any trailing bytes
$Tail:  mov     [edi],al           ; set a byte
        inc     edi
        dec     ecx
        jne     $Tail              ; loop for trailing bytes

$Xit:   pop     ebx
        pop     edi
        mov     eax,[ebp+arg1]     ; Return destination pointer
        pop     ebp
        ret
memset  endp
As you can see, that simple memory set function became a lot bigger, but its execution speed became a lot quicker. With very short lengths to set, such as fewer than four bytes, this code is actually slower, but it quickly gains speed as the memory lengths increase, especially if aligned on 4-byte boundaries. For extra efficiency on a size of 256 bytes or more, the STOSD instruction would be best.
You should use the string functions such as STOSD only if the repeat factor is 64 or more.
These numbers aren't exactly right, as this function has not been tuned for its optimal timing yet, but I leave that to you. Besides, what would be the fun in it if I gave you all the answers? As versatile as the MMX instruction set is, the linear setting or copying of memory is no more efficient than the integer instructions. In fact, a STOSD/MOVSD string set/copy with a repeat of 64 or more is actually faster than the equivalent MMX instructions on legacy processors. This would also leave the MMX/XMM registers free for math-related solutions. It turns out that we are actually pumping data very close to or at the bus speed. For experimental purposes, and to get some MMX practice, one alternative would be to use the MMX instruction MOVQ in the $SetD section of the code so eight bytes would be written at a time.
Alter the $BSHFT and $BCNT to the new values:
$BSHFT  =       3                  ; Shift count x8 = (1<<3)
$BCNT   =       8                  ; Byte count

; A lookup table that replicates an 8-bit byte into a 64-bit qword.
; It saves a lot of shifting and ORing and only costs
; 256x8 = 2048 bytes and one cycle.
;
; 00000000h,00000000h, 01010101h,01010101h, 02020202h,02020202h,
; etc.

Replicate64 label DWORD
        .XLIST
foo     =       0
        REPEAT  256
        DD      foo,foo
foo     =       foo + 01010101h
        ENDM
        .LIST

$SetD:  and     eax,0FFh           ; Index on the low byte of val only
        lea     eax,Replicate64[eax*8]
        movq    mm7,[eax]

        mov     edx,ecx
        shr     edx,$BSHFT         ; (n/8) = # of 64-bit writes
$SetD1: movq    [edi],mm7
        add     edi,$BCNT
        dec     edx
        jne     $SetD1
And call EMMS at the appropriate point, but only if your thread also has floating-point operations to handle:
Emms
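As an aside, the 2K lookup table trades memory for speed; the same eight-way replication can also be computed with a single multiply, which is how I would sketch it in C (on the Pentium-era targets above, the table may still win, so treat this as an alternative, not a replacement):

```c
#include <stdint.h>

/* Replicate an 8-bit value into all eight bytes of a 64-bit qword.
   Multiplying by 0x0101010101010101 copies the byte into every lane. */
static uint64_t replicate8(uint8_t val)
{
    return (uint64_t)val * 0x0101010101010101ULL;
}
```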
I recommend the use of the ZeroMemory() function instead. It saves passing an extra argument value of 0, or the time to replicate the single byte to four bytes.
A few years ago I was working on a project that was required to run on a 386 processor but typically ran on a 486, and it had this little squirrely problem. One of the in-house computer systems that I tested the application on ran the code extremely slowly. I spent quite a while on it, and when doing some benchmark testing to isolate the problem I found that the memory copy algorithm, which was used to blit graphical sprites onto the screen, was the culprit. Sprites could appear on screen with any kind of data alignment as they moved horizontally across the screen. Upon deeper investigation I found that this computer system was running DOS like all the others, but in this particular case it was running on an AMD 386SX processor. AMD usually has pretty good processors, but I was intrigued, and so I ordered and received their Am386 data book unique to that model processor. Upon reading the book I found out to my horror that this processor had a little zinger. As it is a 32-bit processor with a 16-bit bus, if your source and destination pointers are not properly aligned, then a single 32-bit memory access carries an additional eight-clock penalty for that misaligned access. And so we come to my next rule.
That little problem required detecting not only the exact manufacturer but also the model of processor, and routing function calls to special code to handle each. In most cases the code could be shared, but some isolated instances required the special code. The following is an older style of the C function memcpy().
; void *memcpy(void *pDst, const void *pSrc, uint nSize)
;
; Note: if nSize is set to 0 then no bytes will be copied!
        public  memcpy
memcpy  proc    near
        push    ebp
        mov     ebp,esp
        push    esi
        push    edi

        mov     edi,[ebp+arg1]     ; pDst
        mov     esi,[ebp+arg2]     ; pSrc
        mov     ecx,[ebp+arg3]     ; nSize

        ; Insert one of the following code examples here!

        mov     eax,[ebp+arg1]     ; Return (destination) pointer
        pop     edi
        pop     esi
        pop     ebp
        ret
memcpy  endp
This loop is really inefficient code except when used on the old 8086 processors.
$L0:    movsb
        loop    $L0
The following code is relatively small but pretty inefficient as it is using the repeating string function to write a series of 8-bit bytes. The payoff on a Pentium only comes with a repeat of 64 or more.
rep movsb
With a repeat factor of less than 64, use the following. Note that we do not need to write DS: or ES:, as the default segment for the ESI source register is DS, and the default for the EDI destination register is ES.
$L1:    mov     al,[esi]           ; al,ds:[esi]
        mov     [edi],al           ; es:[edi],al
        inc     esi
        inc     edi
        dec     ecx
        jne     $L1
In the above example we actually get a dependency penalty: we set the AL register but have to wait before we can execute the next instruction that uses it. If we adjust the function as follows, we no longer have that problem. You will note that the "inc esi" line was moved up to separate the instruction that loads AL from the one that reads it.
$L1:    mov     al,ds:[esi]
        inc     esi                ; removes dependency penalty
        mov     es:[edi],al
        inc     edi
        dec     ecx
        jne     $L1
Another method that is a lot more efficient than those listed above uses the same techniques we learned for setting memory. We divide the total number of bytes by four to get the number of 4-byte blocks, loop on that, and then handle the remainders. We handle the dependency penalty at $L1 in the same way.
        mov     edx,ecx            ; Get # of bytes to copy
        shr     edx,2              ; n = n / 4
        jz      $L2                ; Jump if 1..3 bytes

; DWORDS (uint32)
$L1:    mov     eax,[esi]          ; 1 μOP: read 32 bits
        add     esi,4
        mov     [edi],eax          ; 2 μOPs: write 32 bits
        add     edi,4
        dec     edx
        jne     $L1                ; Loop for DWORDS

; Remainders
$L2:    and     ecx,00000011b      ; Mask remainder bits (0..3)
        jz      $L4                ; Jump if no remainders

; 1 to 3 bytes to copy
$L3:    mov     al,[esi]
        inc     esi
        mov     [edi],al
        inc     edi
        dec     ecx
        jne     $L3                ; Loop for 1's
$L4:
This following method is significantly faster as it moves eight bytes at a time instead of four. There is no dependency penalty since the register being set is not being used immediately.
        mov     ecx,[ebp+arg3]     ; nSize
        shr     ecx,3              ; n = n / 8
        jz      $L2                ; Jump if 1..7 bytes

; QWORDS (uint64)
$L1:    mov     eax,[esi]          ; 1 μOP: read 32 bits
        mov     edx,[esi+4]        ; read next 32 bits
        mov     [edi],eax          ; 2 μOPs: write 32 bits
        mov     [edi+4],edx        ; write next 32 bits
        add     esi,8
        add     edi,8
        dec     ecx
        jne     $L1                ; Loop for QWORDS

; Remainders
$L2:    mov     ecx,[ebp+arg3]     ; nSize
        and     ecx,00000111b      ; Mask remainder bits (0..7)
        jz      $L4                ; Jump if no remainders

; 1 to 7 bytes to copy
$L3:    mov     al,[esi]           ; read a byte
        inc     esi
        mov     [edi],al           ; write byte
        inc     edi
        dec     ecx
        jne     $L3                ; Loop for 1's
$L4:
This code is just about as fast as a copy using MMX. An example would be to replace $L1 with the following code:
$L1:    movq    mm7,[esi]          ; read 64 bits
        add     esi,8
        movq    [edi],mm7          ; write 64 bits
        add     edi,8
        dec     ecx
        jne     $L1                ; Loop for QWORDS
There are more sophisticated methods that you can employ, but this is a good start.
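The block-copy-with-remainder scheme can also be written in portable C. A sketch, with a name of my own choosing; the memcpy() calls on the 8-byte chunks keep the wide accesses well-defined regardless of alignment, and a decent compiler turns them into single loads and stores:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy nSize bytes: move 8 bytes at a time, then mop up the
   1..7 remainder bytes, mirroring the assembly above. */
static void *memcpy8(void *pDst, const void *pSrc, size_t nSize)
{
    uint8_t       *d = (uint8_t *)pDst;
    const uint8_t *s = (const uint8_t *)pSrc;
    size_t         n = nSize >> 3;     /* # of 8-byte moves */

    while (n--) {
        uint64_t q;
        memcpy(&q, s, 8);              /* read 64 bits  */
        memcpy(d, &q, 8);              /* write 64 bits */
        s += 8;
        d += 8;
    }
    nSize &= 7;                        /* remainder 0..7 bytes */
    while (nSize--)
        *d++ = *s++;
    return pDst;
}
```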
It is important for memory to be aligned, as a problem occurs when the source and/or destination are misaligned. Memory movement (copy) functions should try to reorient the source and destination pointers. You are lucky if the source and destination are either both properly aligned or misaligned by exactly the same amount:
If ((pSrc AND 00000111b) == (pDst AND 00000111b))
...then adjust them. If both masked values are 0, no adjustment is needed. If they are equal but nonzero, copy by 1's to get into the alignment position. If they are out of alignment by different amounts, obtain a speed increase by putting at least one of them into alignment (preferably the destination):
        mov     edx,edi            ; At least align the destination!
        and     edx,00000111b
        jz      $Mid               ; Jump if properly aligned

; Copy the misaligned lead bytes
        sub     edx,8              ; edx = -(# of lead bytes) {-7..-1}
$lead:  mov     al,[esi]           ; read byte
        inc     esi
        mov     [edi],al           ; write byte
        inc     edi
        dec     ecx                ; reduce total to move
        inc     edx                ; increment to 0
        jne     $lead              ; loop for lead bytes
$Mid:
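The compatibility test above is trivial to express as a C helper (the name same_alignment is mine). Two pointers can be brought into mutual alignment by byte copies only when their low address bits already agree:

```c
#include <stdint.h>

/* True if source and destination share the same position within an
   8-byte block, i.e., lead-byte copies can align both at once. */
static int same_alignment(const void *pSrc, const void *pDst)
{
    return ((uintptr_t)pSrc & 7) == ((uintptr_t)pDst & 7);
}
```

When this returns false, the best you can do is align one of the two (preferably the destination, since misaligned writes usually cost more than misaligned reads).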
For the actual memory movement operation there are various techniques that can be used, each with its own benefit or drawback.
The best method is a preventative one. If the memory you're dealing with is for video images, then not only should (width mod 8) be zero, but the source and destination pointers should also be properly aligned. In this way there is no problem of clock penalties on each memory access, and no extra and possibly futile effort spent trying to align them.
In 8-bit images, moving (blitting) sprite memory can be difficult, as sprites can land on any byte alignment as they move. In 32-bit images, where one pixel is 32 bits, alignment is a snap, as every pixel is properly aligned.
#ifdef __cplusplus
extern "C" void CopyBlit8x8Asm(byte *pDst, byte *pSrc,
                  uint nStride, uint nWidth, uint nHeight);
#endif

// Comment this line out for 'C' code
#define USE_ASM_COPYBLIT_8X8

// 8-bit to 8-bit Copy Blit
//
// This function is pre-clipped to copy an 8-bit color
// pixel from the buffer pointed to by the source
// pointer to an identical sized destination buffer.

#ifdef USE_ASM_COPYBLIT_8X8
#define CopyBlit8x8 CopyBlit8x8Asm
#else
void CopyBlit8x8(byte *pDst, byte *pSrc,
                 uint nStride, uint nWidth, uint nHeight)
{
    // If width is the stride, then copy the entire image
    if (nWidth == nStride)
    {
        memcpy(pDst, pSrc, nStride * nHeight);
    }
    else
    {
        // Copy image one scanline at a time
        do {
            memcpy(pDst, pSrc, nWidth);
            pSrc += nStride;       // Source stride adjustment
            pDst += nStride;       // Destination stride adjustment
        } while (--nHeight);       // Loop for height
    }
}
#endif
As you probably noted, there is extra logic checking whether the width and stride are the same. If so, the scanline loop collapses into a single copy of the whole image, making the code even more efficient.
Goal: Try to write the listed function in assembly, optimized for your processor, or for multiple processors.
The code size would increase, but using a vector table such as the following would allow you to unroll your (remainder) loops. With normal code only four entries would be required, but for MMX all eight would be best.
        mov     eax,ecx            ; Get width
        and     eax,00000111b
        jmp     $SetTbl[eax*4]

; At the bottom of the assembly source file insert the vector table
; so it doesn't interfere with your memory caches.

        Align 16
$SetTbl:
        dd      $SetQ              ; (n mod 8) = 0
        dd      $Set1              ; (n mod 8) = 1
        dd      $Set2              ; (n mod 8) = 2
        dd      $Set3              ; (n mod 8) = 3
        dd      $SetD              ; (n mod 8) = 4
        dd      $Set1              ; (n mod 8) = 5
        dd      $Set2              ; (n mod 8) = 6
        dd      $Set3              ; (n mod 8) = 7
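In C, the equivalent dispatch falls out of a switch on (width mod 8); a decent compiler builds the same kind of jump table, and the fall-through cases give you the unrolled remainder writes. A sketch, with a helper name of my own:

```c
#include <stdint.h>

/* Unrolled remainder fill: dispatch once on (width mod 8), then
   write the 0..7 leftover bytes with straight-line code. */
static void set_tail(uint8_t *p, uint8_t val, unsigned width)
{
    switch (width & 7) {       /* compiler builds the jump table */
    case 7: p[6] = val;        /* fall through */
    case 6: p[5] = val;        /* fall through */
    case 5: p[4] = val;        /* fall through */
    case 4: p[3] = val;        /* fall through */
    case 3: p[2] = val;        /* fall through */
    case 2: p[1] = val;        /* fall through */
    case 1: p[0] = val;        /* fall through */
    case 0: break;             /* nothing left to write */
    }
}
```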
When dealing with graphic images there are various parameters defining an image's geometry.
memptr— The base pointer: the physical memory address corresponding to a coordinate within the image.
bits per pixel— The number of bits per pixel used to represent the image. Typically 1/4/8/16/24/32-bit but pretty much only 8- to 32-bit are used these days.
width— The width of the image in pixels.
height— The height of the image in pixels.
stride— The number of bytes from the start of one row of pixels to the start of the next. It should be noted that there may be extra bytes between the last visible pixel of one row and the start of the next row. For example, in Figure 19-2 the 640-pixel scanline has an overage of 384 bytes. That means after you write that 640th pixel you need to add 384 to get to the start of the next scanline (640+384=1024).
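The parameters above combine into one addressing formula. Here is a small C sketch (the helper name is mine) that computes the byte offset of a pixel; with the 640-wide, 1024-stride example, stepping past the last pixel of a row plus the 384 padding bytes lands exactly on the next row:

```c
#include <stddef.h>

/* Byte offset of pixel (x, y) inside an image with the given
   stride and bytes-per-pixel. The stride, not the width, carries
   you from one row to the next. */
static size_t pixel_offset(unsigned x, unsigned y,
                           size_t nStride, unsigned nBytesPerPixel)
{
    return (size_t)y * nStride + (size_t)x * nBytesPerPixel;
}
```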
So now let's use this information in some real code.
#ifdef __cplusplus
extern "C" void gfxClrAsm(byte *pDst, uint nStride,
                          uint nWidth, uint nHeight);
#endif

// Comment this line out for C code
#define USE_ASM_GFXCLR

// Graphics Clear
//
// This is a pre-clipped function used to clear a bitmap
// pointed to by the destination pointer.
// Note: This can be used to clear 8/16/24/32-bit pixels.

#ifdef USE_ASM_GFXCLR
#define gfxClr gfxClrAsm
#else
void gfxClr(byte *pDst, uint nStride, uint nWidth, uint nHeight)
{
    do {
        memset(pDst, 0, nWidth);
        pDst += nStride;
    } while (--nHeight);
}
#endif
Project:
Using what you've learned, try to write the C function above in assembly optimized for your processor.
void gfxClrAsm(byte *pDst, uint nStride, uint nWidth, uint nHeight);
There are different methods one can choose to blit (bit block copy) a graphics image, including a pure blit, where the image is merely copied pixel by pixel, and a transparent blit, such as the one detailed here.
A transparent pixel is referred to by a variety of names, including transparent, color key, skip color, invisible color, and non-displayed pixel. This is a pixel containing no image color data that allows the color of the pixel directly underneath it to be displayed. It is typically set to an unusual color that helps the artists and programmers easily identify it in relation to the rest of the colors.
If you watch the news you see this process every day, courtesy of the weatherman. He is shot on a green screen, being careful not to wear a color similar to the color key, so the electronics can make him appear in front of an image such as a map, and that composite image is transmitted to your television. If he wore the same shade of color as the color key in the middle of his chest, he would appear to have a big hole where you would be able to see through his body.
When using film, moviemakers shoot models or actors on a blue screen, as the color of blue is actually clear on the film negative. Oversimplifying this explanation, the non-clear areas would be converted into a mask and the images would be cut into a composite typically using a matte backdrop.
When using digitized graphics in a computer, movie/game makers shoot actors on a green screen and digitally map the images into a single image using some sort of backdrop.
Your transparency color can be any color. I typically pick a dark shade of blue. For instance, in an RGB range of (0 to 255) {red:0, green:0, blue:108}. This allows me to differentiate between the color of black and transparency and still have the transparent color dark enough so as not to detract from the art. When I am nearly done with the image and almost ready to test it for any stray transparent pixels, I set them to a bright purple {red:255, green:0, blue:255} as that particular color of bright purple is not usually found in my art images and it really stands out. It does not matter what color you use as long as the image does not contain that particular color.
In a 2D graphics application, there is typically a need to composite images and so this leads to how to handle a transparent blit.
A few years ago, I taught a College for Kids program during the summer titled "The Art of Computer/Video Game Design." For that class, I had put together a small program that reinforced the need for computer games to have foreign language support. This particular game was called "Monster Punch." A language would be selected and then various living fruit with their eyes moving around would drop down from the top of the screen and pass through the opening out of view at the bottom of the screen. After all the fruit had fallen, the display would snap to a view of a blender, at which point all the fruit would be blended, while screaming, into monster punch where the blender comes alive, à la "Monster Punch!" (Okay, maybe I am a little warped, but you should have been able to figure that out by now!)
The following sections use Monster Punch to demonstrate blitting.
The following sprite imagery is that of a copy blit, where a rectangular image is copied to the destination and overwrites any overlapped pixel.
Figure 19-3. Monster Punch — Copy blit of strawberry image on the right into the blender on the left.
Using efficiently optimized code, up to eight bytes at a time can be copied with 64-bit access, which corresponds to simultaneously writing eight 8-bit pixels, or four 16-bit pixels, or almost three 24-bit pixels, or only two 32-bit pixels. With 128-bit, up to 16 bytes can be accessed, thus 16 8-bit pixels, or eight 16-bit pixels, or slightly over five 24-bit pixels, or only four 32-bit pixels.
As the following sprite image portrays, all pixels from the source that match the transparent color are not copied, thus causing the sprite to be seamlessly pasted into the background.
Figure 19-4. Monster Punch — Transparent blit of strawberry image on the right into the blender on the left.
Normally when dealing with transparencies, only one pixel at a time can be tested to detect whether it is transparent, and so we wind up introducing inefficiencies such as branch mispredictions. That is where the sample in the following section comes in handy.
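The naive per-pixel approach looks like this in C (a sketch, with my own function name). Note the conditional branch inside the loop, taken or not depending on the pixel data, which is exactly what the branch predictor cannot guess:

```c
#include <stddef.h>
#include <stdint.h>

/* The obvious one-pixel-at-a-time transparent copy: every pixel
   costs a compare and a data-dependent branch the CPU may
   mispredict. */
static void tblit1(uint8_t *pDst, const uint8_t *pSrc,
                   size_t nCnt, uint8_t tcolor)
{
    while (nCnt--) {
        if (*pSrc != tcolor)       /* branch per pixel */
            *pDst = *pSrc;         /* opaque: copy sprite pixel   */
        ++pSrc;                    /* transparent: keep background */
        ++pDst;
    }
}
```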
The following code is a sample of a transparent blit, where a scanline of ECX quadwords (eight 8-bit pixels each) is copied from one graphic source row [ESI] to a destination graphic row [EDI].
This eight 8-bit transparent pixel copy uses MMX code. Note that there is only one branch loop every eighth pixel.
tcolor  qword   03f3f3f3f3f3f3f3fh ; 03fh = transparent pixel

; esi=source  edi=destination  ecx=# of qwords

        movq    mm7,tcolor         ; Get replicated transparency
$T0:    movq    mm5,[esi]          ; Get 8 source pixels
        movq    mm4,[edi]          ; Get background
        movq    mm6,mm5            ; Copy 8 source pixels

; Compare each pixel's color to the transparency color and if
; a match, set each pixel in the mask to FF, else 00!
        pcmpeqb mm5,mm7            ; Create masks for transparency
        add     esi,8              ; Adjust source pointer

; Only keep the pixels in the destination that correspond
; to the transparent pixels of the source!
        pand    mm4,mm5

; Using the same mask, flip it, then AND it with the
; source pixels, keeping the non-transparent pixels.
        pandn   mm5,mm6            ; erase transparent pixels

; OR the destination pixels with the source pixels.
        por     mm4,mm5            ; blend 8 pixels into art
        movq    [edi],mm4          ; Save new background
        add     edi,8              ; Adjust destination pointer
        dec     ecx                ; any pixels left?
        jne     $T0                ; Loop for eight 8-bit pixels
There is no transparency testing or branching, only the masking and blending of data, which makes the process of a transparent blit much faster. These two different blits (copy, transparent) are typically designed for a graphic environment such as in Figure 19-5 where the background seen on the right is kept in a separate buffer like wallpaper.
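The mask-and-blend idea also works one pixel at a time in C, without any data-dependent branch. This is a scalar sketch of the same technique the MMX code performs eight pixels at a time: build an all-ones or all-zeros mask from the compare (the scalar twin of PCMPEQB), keep the background where the mask is set, keep the sprite where it is clear:

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless transparent copy: (compare == 1) negated gives an
   0xFF/0x00 mask, mirroring pcmpeqb/pand/pandn/por above. */
static void tblit_mask(uint8_t *pDst, const uint8_t *pSrc,
                       size_t nCnt, uint8_t tcolor)
{
    while (nCnt--) {
        uint8_t m = (uint8_t)-(*pSrc == tcolor); /* FF=transparent 00=opaque */
        *pDst = (uint8_t)((*pDst & m) | (*pSrc & (uint8_t)~m));
        ++pSrc;
        ++pDst;
    }
}
```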
Figure 19-5. Transparent copy blit of strawberry sprite and blender image background to achieve composite result of both.
The background is CopyBlit to the working surface as seen on the left, and the sprite image is Transparent Blit in front of it. When the sprite image is animated, the area being changed is "erased" from the working surface by a rectangular CopyBlit of that area from the background to the working surface, and then the updated sprite image is Transparent Blit in front. This is a layered approach typically used in a video game that has a number of animated objects moving around the display.
Another graphic sprite environment method is one where the area under the sprite is remembered in a buffer attached to the sprite before the sprite image is Transparent Blit. The save and the blit typically occur in the same pass to reduce the amount of calculation work.
This is typically called an "overlay" method, used by Windows and some sprite engines. The drawback to this method is that overlapping of sprites needs to be minimized, because erasing one sprite requires erasing all the other intersecting sprites visible above it. The list of sprites needs to be traversed to find out which sprites intersect the area and need to be erased, replacing the image under each intersecting sprite in the image buffer with the corresponding original background image. The list of sprites then needs to be traversed again, this time drawing the sprites back into the scene.
Figure 19-6. The blit of a rectangular blender image to a storage buffer, then the transparent blit of a strawberry into blender. A blit of the saved blender image back into blender effectively erases the strawberry.
tcolor  qword   03f3f3f3f3f3f3f3fh ; 03fh = transparent pixel

; esi=source  edi=destination  ebx=buffer  ecx=# of qwords

        movq    mm7,tcolor         ; Get replicated transparency
$T0:    movq    mm5,[esi]          ; Get 8 source pixels
        movq    mm4,[edi]          ; Get 8 background pixels
        movq    mm6,mm5            ; Copy 8 source pixels

; Compare each pixel's color to the transparency color and if
; a match, set each pixel in the mask to FF, else 00!
        pcmpeqb mm5,mm7            ; Create masks for transparency
        movq    [ebx],mm4          ; Save BGnd in buffer

; Only keep the pixels in the destination that correspond
; to the transparent pixels of the source!
        pand    mm4,mm5

; Using the same mask, flip it, then AND it with the
; source pixels, keeping the non-transparent pixels.
        pandn   mm5,mm6            ; erase transparent pixels
        add     ebx,8              ; Adjust buffer pointer

; OR the destination pixels with the source pixels.
        por     mm4,mm5            ; Blend 8 pixels into art
        add     esi,8              ; Adjust source pointer
        movq    [edi],mm4          ; Save new background
        add     edi,8              ; Adjust destination pointer
        dec     ecx                ; Any pixels left?
        jne     $T0                ; Loop for eight 8-bit pixels
The same trick of using inverse logic can be used for expanding image clipping planes.
In the image on the left, no matter how it's encoded (8/16/24/32 bits), only a single bit in the clipping plane image on the right would be needed to represent a single pixel. If black=0 and white=1, then a sprite object could appear to pass in front of the fence as well as behind it but in front of the distant background. This could be done in a variety of ways. One would be to use masks where both the sprite pixel and the background pixel are masked so only one has a non-zero value. The resulting color is written to the destination buffer.
; esi = sprite image pointer
; ebx = clipping plane pointer
; ebp = background image pointer
; edi = destination image buffer pointer

        mov     edx,[ebx]          ; Get 32 clipping-plane bits
        mov     ch,32              ; 32 pixels at a time
$L1:    mov     al,[esi]           ; Source sprite pixel
        inc     esi                ; Next sprite pixel pointer
        mov     cl,[ebp]           ; Source background pixel
        inc     ebp                ; Next src background pixel
        shr     edx,1              ; Get a masking bit into carry
        setnc   ah                 ; 1=background 0=foreground
        cmp     al,03Fh            ; Transparent color (tcolor)?
        je      $T1                ; Jump if transparent pixel
        dec     ah                 ; 00=background FF=foreground
        and     cl,ah              ; FF=keep bgnd 00=kill it
        not     ah                 ; Flip masking bits
        and     al,ah              ; FF=keep sprite 00=kill it
        or      cl,al              ; (XOR type) blend pixels
$T1:    mov     [edi],cl           ; Save new pixel to destination
        inc     edi                ; Next dst working pixel
        dec     ch                 ; 1 less pixel in run
        jnz     $L1                ; Loop
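The same pixel selection can be sketched in portable C. This follows the bit semantics of the assembly comments above (clip bit 1 = foreground, so the background pixel, e.g. the fence, stays on top; bit 0 lets the sprite show through); the names and the scalar form are mine:

```c
#include <stddef.h>
#include <stdint.h>

/* One-bit clipping plane blit: for each pixel consume one clip
   bit (LSB first). Foreground bits and transparent sprite pixels
   keep the background; everything else takes the sprite pixel. */
static void clipblit(uint8_t *pDst, const uint8_t *pSprite,
                     const uint8_t *pBgnd, uint32_t clipBits,
                     size_t nCnt, uint8_t tcolor)
{
    while (nCnt--) {
        uint8_t  s  = *pSprite++;
        uint8_t  b  = *pBgnd++;
        unsigned fg = clipBits & 1;   /* 1 = foreground wins */
        clipBits >>= 1;               /* consume one clip bit per pixel */
        *pDst++ = (fg || s == tcolor) ? b : s;
    }
}
```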