Chapter 19. Gfx 'R' Asm

This is one of two chapters that are probably the reason you bought this book. You are probably working with a graphics library and then ran into a little bind in which there did not seem to be a software library function available with that special needed functionality. Your schedules are sliding, nerves are wrecked, and paranoia is beginning to set in as you believe that your project lead or manager is beginning to doubt your abilities and that pink slip is only about a week away. Out of desperation you have escaped to your refuge, that favorite technical bookstore that has rescued you so many times in the past. You find lots of graphics books but not what you are looking for. There are a few assembly books that you have seen over the years, but they have always been targeted for beginners, but you have bought them anyway for your personal library as another resource book. (I have always been impressed with someone who has multiple large bookcases with multitudes of dog-eared books in their office. There is a fine art to making brand new books look well used.) And then you see this book. You flip through it and this chapter catches your gleaming eye and then you whisper in euphoria to yourself, "This book will save my butt!" At that point you look around and see everyone in the bookstore staring at you as you skulk toward the sales clerk.

Of course, there are other code samples in my Vector Game Math Processors book so do not forget to buy that book as well.

For those of you C programmers out there, this chapter covers ground very similar to those heavily used functions memset() and memcpy(). They are used in almost every application for a large variety of purposes, but their behavior typically is not that useful or fast enough in the clearing or blitting of graphic images. Some of you are probably thinking "Why isn't this guy using the hardware blitter on a graphics card?" Well, in some cases, the blitter is hidden from you in the bowels of drivers such as DirectDraw, but that's all it is: a blitter, a hardware device designed to move video card memory to video card memory, and you only have 64 to 256 MB to play with. Okay, okay, that is much better than a couple of years ago when you only had 2 to 8 MB. What we are trying to do here is learn the optimal method of moving memory around the computer system and from system memory to video memory. Also, just where did those images come from? Whose file format, and what compression type? How did they get loaded? Where do you get the driver, etc.? There is also more to life than displaying video games! What about streaming media such as MPEG-4 and DivX, video analysis, scientific research, speech recognition, stereoscopic vision, etc.? The list is endless and new reasons are being invented all the time. Now that I am off my soap-box we can continue.

Setting Memory

If you happen to be constructing 8-bit images, then the memset() function can work pretty well for you as the transparency value can be anywhere from 0 to 255. If working in 16-, 24-, or 32-bit colors this function is only useful if the transparency color that you are trying to set just happens to be 0; if it is any other value, you have a serious problem. You have put together a C function to do the task but even though this function is not called that often, its speed is not up to your needs.
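
As a point of reference, a plain C function of the kind described might look like the following sketch for 16-bit pixels. The name fill16 and its parameters are hypothetical, chosen only for illustration. It shows why memset() falls short here: memset() replicates one byte, so it can only produce a 16-bit value whose two bytes are identical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill nPixels 16-bit pixels with any color value.
   memset() cannot do this in general because it writes the same byte
   everywhere, which only works for values such as 0x0000. */
void fill16(uint16_t *pDst, uint16_t color, size_t nPixels)
{
    for (size_t i = 0; i < nPixels; i++)
        pDst[i] = color;            /* one 16-bit store per pixel */
}
```

This is the slow but obviously correct baseline that the assembly in this section sets out to beat.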

Your older 32-bit C libraries typically have the following library function for clearing a block of memory:

; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near
        push    ebp
        mov     ebp,esp
        push    edi

        mov     ecx,[ebp+arg3]          ; nCnt
        mov     edi,[ebp+arg1]          ; pMem
        mov     eax,[ebp+arg2]          ; val

        ; Insert one of the following code examples here!

$xit:   mov     eax,[ebp+arg1]          ; Return pointer
        pop     edi
        pop     ebp
        ret
memset  endp

Warning

The following loop is really inefficient code except when used on the old 8086 processors.

        test       ecx,ecx
        jz         $xit

$L0:    stosb
        loop       $L0

The following code is relatively small but pretty inefficient as it is using the repeating string function to write a series of 8-bit bytes. The payoff on Pentium processors only comes with a repeat of 64 or more.

rep  stosb

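With a repeat factor of less than 64, use the following. Note that ES is the default segment only for string instructions such as STOSB; a plain MOV through EDI defaults to DS. In a flat memory model DS and ES address the same memory, so the ES: override shown here is not really needed in the code.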

          test       ecx,ecx
          jz         $xit         ; jump if len of 0

$L1:      mov        es:[edi],al  ; set a byte
          inc        edi
          dec        ecx
          jne        $L1

An alternate method that is a lot more efficient than those listed above is to divide our total number of bytes into the number of 4-byte blocks, then loop on that, not forgetting to handle the remainders.

        test    ecx,ecx
        jz      $xit              ; jump if len of 0

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned so we build a 32-bit value to write (al = val)

        mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx             ; eax=replicated byte ×4

        mov     edx,ecx           ; Get # of bytes to set
        shr     edx,2             ; n = n ÷ 4
        jz      $L2               ; Jump if 1..3 bytes

; edx = # of 32-bit writes

$L1:    mov     [edi],eax         ; set 4 bytes
        add     edi,4             ; advance pointer by 4 bytes
        dec     edx
        jne     $L1               ; Loop for DWORDS
; Remainders (1..3) bytes

$L2:    and     ecx,00000011b     ; Mask remainder bits (0..3)
        jz      $L4               ; Jump if no remainders

;       1 to 3 bytes to set

$L3:    mov     [edi],al          ; set 1 byte
        inc     edi               ; advance pointer by 1 byte
        dec     ecx
        jne     $L3               ; Loop for 1's

$L4:

There are more sophisticated methods that you can employ but this is a good start.

For optimal performance all data reads and writes must be on a 32-bit boundary. In a copy situation, if the source and destination are misaligned, there is not much that can be done about it. But in the case of setting memory that is misaligned, it is a snap to fix it.

Figure 19-1. Imagine these four differently aligned memory strands as eels. We pull out our sushi knife and finely chop off their heads into little 8-bit (1-byte) chunks, chop off the tails into 8-bit (1-byte) chunks, and then coarsely chop the bodies into larger 32-bit (4-byte) chunks, and serve raw.

The first three memory strands had their heads misaligned, but the fourth was aligned properly. On the other hand, the tails of the last three were misaligned and the first one was aligned properly. Now that they're sliced and diced, their midsections are all properly aligned for best data handling.
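
The head, body, and tail slicing can be sketched in C as follows. This is only an illustration of the idea, not the runtime library's code; the name set_aligned is made up, and memcpy() of four bytes stands in for a single aligned 32-bit store (compilers typically reduce it to one).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: set nCnt bytes, writing the aligned middle
   section four bytes at a time. */
void *set_aligned(void *pMem, int val, size_t nCnt)
{
    unsigned char *p = (unsigned char *)pMem;
    unsigned char  b = (unsigned char)val;
    uint32_t pattern = b * 0x01010101u;     /* replicate byte x4 */

    /* Head: single bytes until p sits on a 4-byte boundary. */
    while (nCnt > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = b;
        nCnt--;
    }

    /* Body: aligned 32-bit writes. */
    while (nCnt >= 4) {
        memcpy(p, &pattern, 4);
        p += 4;
        nCnt -= 4;
    }

    /* Tail: the 0..3 bytes left over. */
    while (nCnt > 0) {
        *p++ = b;
        nCnt--;
    }
    return pMem;
}
```

Note that the body loop is guarded by a count test, so a short block that is used up entirely by the head bytes never reaches a 32-bit write.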

The latest C runtime libraries use something a lot more elaborate such as the following function:

; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near

$BSHFT  =       2                 ; Shift count
$BCNT   =       4                 ; Byte count

        push    ebp
        mov     ebp,esp

; Unlike the code above, flow does not have to fall through
; if a size of 0 was passed, and so we need to test for it.
; The lines are adjusted to help prevent a stall.

        mov     ecx,[ebp+arg3]    ; nCnt
        push    edi

; Older programmers will say hey, why didn't you 'OR ecx,ecx'
; but this is a read/write instruction that will cost you
; time for the write. The 'TEST ecx,ecx' is a read only!

        test    ecx,ecx
        push    ebx
        jz      $Xit              ; jump if size is 0


        mov     edi,[ebp+arg1]    ; pMem

; If the size is (1...3) bytes long, then handle as tail bytes

        test    ecx,NOT ($BCNT-1)
        mov     eax,[ebp+arg2]    ; val
        jz      $Tail

; If already aligned on a (n mod 4)==0 boundary

        mov    edx,edi
        and    edx,($BCNT-1)
        jz     $SetD

; The memory attempting to be set may not be properly aligned on
; a 4-byte boundary and thus if the block is 4 bytes in size or
; greater, then the 32-bit writes will have clock penalties on
; each write and so first adjust to be properly aligned.
        sub    edx,$BCNT
        add    ecx,edx            ; Reduce # of bytes to set

$Lead:  mov    [edi],al           ; Set a byte
        inc    edi
        inc    edx
        jne    $Lead              ; Loop for those {1..3} bytes

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned so build a 32-bit value to write   (al = val)

$SetD:  mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx             ; eax=replicated byte ×4

; Now we set the bytes four at a time

        mov     edx,ecx
        shr     edx,$BSHFT        ; (n÷4) = # of  32-bit writes
        jz      $SetT             ; Jump if aligning left < 4 bytes

$SetD1: mov     [edi],eax
        add     edi,$BCNT
        dec     edx
        jne     $SetD1

$SetT:  and     ecx,($BCNT-1)
        jz      $Xit              ; jump if size is 0

; Write any trailing bytes

$Tail:  mov    [edi],al           ; set a byte
        inc    edi
        dec    ecx
        jne    $Tail              ; loop for trailing bytes

$Xit:   pop     ebx
        pop     edi
        mov     eax,[ebp+arg1]    ; Return destination pointer
        pop     ebp
        ret
memset  endp

As you can see, that simple memory set function became a lot bigger, but its execution speed became a lot quicker. With very short lengths of bytes to set, such as sizes of fewer than four bytes, this code is actually slower but it quickly gains in speed as the memory lengths increase in size, especially if aligned on 4-byte boundaries. For an extra efficiency on a size of 256 bytes or more, using the STOSD instruction would be best.

Note

You should use the string functions such as STOSD only if the repeat factor is 64 or more.

These numbers aren't exactly right as this function has not been tuned for its optimal timing yet, but I leave that to you. Besides, what would be the fun in it if I gave you all the answers? As versatile as the MMX instruction set is, the linear setting or copying of memory is no more efficient than the integer instructions. In fact, a STOSD/MOVSD string set/copy with a repeat of 64 or more is actually faster than the equivalent MMX instructions on legacy processors. This would also leave the XMM register for math related solutions. It turns out that we are actually pumping data very close to or at the bus speed. For experimental purposes and to have some MMX practice, one alternative would be the use of the MMX instruction MOVQ in the $SetD section of the code so eight bytes would be written at one time.

Alter the $BSHFT and $BCNT to the new values:

$BSHFT  =       3                 ; Shift count ×8=(1<<3)
$BCNT   =       8                 ; Byte count

; This lookup table replicates an 8-bit byte into a 64-bit qword.
; It saves a lot of shifting and ORing and only costs
; 256x8 = 2048 bytes and 1 time cycle.
;
; 00000000h,00000000h,01010101h,01010101h,02020202h,02020202h,
;   etc.

Replicate64 label   DWORD
       .XLIST
  foo   =       0
  REPEAT        256
        DD      foo,foo
        foo     =   foo + 01010101h
  ENDM
        .LIST


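$SetD:  and     eax,0FFh           ; Index by byte value; AL stays intact for $Tail
        movq    mm7,qword ptr Replicate64[eax*8]

        mov     edx,ecx
        shr     edx,$BSHFT         ; (n÷8) = # of  64-bit writes
        jz      $SetD2             ; Jump if aligning left < 8 bytes

$SetD1: movq    [edi],mm7
        add     edi,$BCNT
        dec     edx
        jne     $SetD1
$SetD2: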

And call at the appropriate time only if your CPU thread has floating-point operations to handle:

        emms

Note

When the fill value is zero, I recommend the use of the ZeroMemory() function instead. It saves passing an extra argument value of 0, or the time to replicate the single byte to four bytes.
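
The byte replication that the $SetD sections perform can be written in C in a couple of ways. The sketch below is illustrative only, and the function names are hypothetical. The first function mirrors the shift-and-merge steps of the 32-bit code; the second produces the same value a Replicate64 table entry would hold, using a multiply instead of a lookup.

```c
#include <stdint.h>

/* Same idea as "mov ah,al / shl eax,16 / mov ax,dx":
   copy the byte into all four byte positions of a 32-bit value. */
uint32_t replicate32(uint8_t b)
{
    uint32_t v = b;
    v |= v << 8;                    /* 000000bb -> 0000bbbb */
    v |= v << 16;                   /* 0000bbbb -> bbbbbbbb */
    return v;
}

/* The 64-bit value the lookup table stores for byte b,
   computed with a single multiply. */
uint64_t replicate64(uint8_t b)
{
    return (uint64_t)b * 0x0101010101010101ull;
}
```

Whether the table, the shifts, or the multiply is fastest depends on the processor, so it is worth timing on your target.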

Copying Memory

A few years ago I was working on a project that was required to run on a 386 processor but typically ran on a 486 and had this little squirrely problem. One of the in-house computer systems that I tested the application on ran the code extremely slowly. I spent quite a while on it and when doing some benchmark testing to isolate the problem I found that the memory copy algorithm, which was used to blit graphical sprites onto the screen, was the culprit. Sprites could appear on screen with any kind of data alignment as they moved horizontally across the screen. Upon deeper investigation I found that this computer system was running DOS like all the others but in this particular case, it was running on an AMD 386SX processor. AMD usually has pretty good processors but I was intrigued and so I ordered and received their AM386 data book unique to that model processor. Upon reading the book I found out to my horror that this processor had a little zinger. As it is a 32-bit processor with a 16-bit bus, if your source and destination pointers are not properly aligned, then a single 32-bit memory access has an additional eight clock penalty for that misaligned access. And so we come to my next rule.

That little problem made it necessary to detect not only the exact manufacturer but also the model of processor, and to route function calls to special code for each. In most cases the code could be shared, but some isolated instances required the special code. The following is an older style of the C function memcpy().

; void *memcpy(void *pDst, const void *pSrc, uint nSize)
;
; Note: if nSize is set to 0 then no bytes will be copied!
          public  memcpy
memcpy    proc    near
          push    ebp
          mov     ebp,esp
          push    esi
          push    edi

          mov     esi,[ebp+arg2]          ; pSrc
          mov     edi,[ebp+arg1]          ; pDst
          mov     ecx,[ebp+arg3]          ; nSize

          ; Insert one of the following code examples here!

          mov     eax,[ebp+arg1]          ; Return pointer
          pop     edi
          pop     esi
          pop     ebp
          ret
memcpy    endp

Warning

This loop is really inefficient code except when used on the old 8086 processors.

$L0:    movsb
        loop      $L0

The following code is relatively small but pretty inefficient as it is using the repeating string function to write a series of 8-bit bytes. The payoff on a Pentium only comes with a repeat of 64 or more.

rep movsb

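With a repeat factor of less than 64 use the following. Note that we do not need to write the DS: or ES: overrides. DS is the default segment for a plain MOV through either ESI or EDI (ES is the default destination segment only for string instructions such as MOVSB), and in a flat memory model DS and ES address the same memory.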

$L1:   mov   al,[esi]         ; al,ds:[esi]
       mov   [edi],al         ; es:[edi],al
       inc   esi
       inc   edi
       dec   ecx
       jne   $L1

In the above example we actually get a dependency penalty as we set the AL register but have to wait before we can actually execute the next instruction, which reads it. If we adjust the function as follows, we no longer have that problem. You will note that the "inc esi" line was moved up to separate the load of AL from the store of AL.

$L1:    mov    al,ds:[esi]
        inc    esi            ; removes dependency penalty
        mov    es:[edi],al
        inc    edi
        dec    ecx
        jne    $L1

Another method that is a lot more efficient than those listed above uses the same techniques we learned for setting memory. We divide our total number of bytes into the number of 4-byte blocks, then loop on that, not forgetting to handle the remainders. We handle the dependency penalty at $L1 in the same way.

       mov     edx,ecx         ; Get # of bytes to copy
       shr     edx,2           ; n = n ÷ 4
       jz      $L2             ; Jump if 1..3 bytes

;      DWORDS (uint32)

$L1:   mov     eax,[esi]       ; 1μOP read 32 bits
       add     esi,4
       mov     [edi],eax       ; 2μOP write 32 bits
       add     edi,4
       dec     edx
       jne     $L1             ; Loop for DWORDS

;      Remainders

$L2:   and     ecx,00000011b   ; Mask remainder bits (0..3)
       jz      $L4             ; Jump if no remainders

;      1 to 3 bytes to copy

$L3:   mov     al,[esi]
       inc     esi
       mov     [edi],al
       inc     edi
       dec     ecx
       jne     $L3             ; Loop for 1's

$L4:

The following method is significantly faster as it moves eight bytes at a time instead of four. There is no dependency penalty since the register being set is not being used immediately.

       mov     ecx,[ebp+arg3]  ; nSize
       shr     ecx,3           ; n = n ÷ 8
       jz      $L2             ; Jump if 1..7 bytes

;      QWORDS (uint64)

$L1:   mov     eax,[esi]       ; 1μOP read 32 bits
       mov     edx,[esi+4]     ; read next 32 bits
       mov     [edi],eax       ; 2μOP write 32 bits
       mov     [edi+4],edx     ; write next 32 bits
       add     esi,8
       add     edi,8
       dec     ecx
       jne     $L1             ; Loop for QWORDS

;      Remainders

$L2:   mov     ecx,[ebp+arg3]  ; nSize
       and     ecx,00000111b   ; Mask remainder bits (0..7)
       jz      $L4             ; Jump if no remainders

;      1 to 7 bytes to copy

$L3:   mov     al,[esi]        ; read a byte
       inc     esi
       mov     [edi],al        ; write byte
       inc     edi
       dec     ecx
       jne     $L3             ; Loop for 1's

$L4:

This code is just about as fast as a copy using MMX. An example would be to replace $L1 with the following code:

$L1:   movq    mm7,[esi]       ; read 64 bits
       add     esi,8
       movq    [edi],mm7       ; write 64 bits
       add     edi,8
       dec     ecx
       jne     $L1             ; Loop for QWORDS

There are more sophisticated methods that you can employ, but this is a good start.
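
For comparison, the eight-bytes-at-a-time copy with a byte loop for the remainder can be sketched in C. The name copy_qwords is hypothetical, and memcpy() of eight bytes stands in for one 64-bit read or write so that the sketch stays valid for any pointer alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy nSize bytes as (nSize / 8) qwords
   followed by the (nSize mod 8) leftover bytes. */
void *copy_qwords(void *pDst, const void *pSrc, size_t nSize)
{
    unsigned char       *d = (unsigned char *)pDst;
    const unsigned char *s = (const unsigned char *)pSrc;
    size_t nQwords = nSize >> 3;        /* n / 8   */
    size_t nRem    = nSize & 7u;        /* n mod 8 */

    while (nQwords--) {
        uint64_t q;
        memcpy(&q, s, 8);               /* read 64 bits  */
        memcpy(d, &q, 8);               /* write 64 bits */
        s += 8;
        d += 8;
    }
    while (nRem--)
        *d++ = *s++;                    /* 0..7 trailing bytes */
    return pDst;
}
```

Like memcpy(), this assumes the source and destination do not overlap.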

It is important for memory to be aligned, as a problem occurs when the source and/or destination are misaligned. Memory movement (copy) functions should try to reorient source and destination pointers. This only works completely if one is lucky enough that the source and destination are either both properly aligned or misaligned by exactly the same amount:

If ((pSrc AND 00000000111b) == (pDst AND 00000000111b))

...then adjust them together. If their logically AND'ed values are 0, no adjustment is needed. If the misalignment is the same, copy single bytes until both reach an aligned position. If the two are misaligned differently, they can never both be aligned at once, but a speed increase can still be obtained by putting at least one of them into alignment (preferably the destination):

       mov     edx,edi         ; At least align destination!
       and     edx,0000111b
       jz      $Mid            ; Jump if properly aligned

       ; Remove misaligned bytes

       add     edx,0fffffff8h  ; -8 (edx = -7..-1 lead bytes)

$lead: mov     al,[esi]        ; read byte
       inc     esi
       mov     [edi],al        ; write byte
       inc     edi
       dec     ecx             ; reduce total to move
       inc     edx             ; increment to 0
       jne     $lead           ; loop for lead bytes

$Mid:
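
The alignment bookkeeping is easy to get wrong, so here is the same arithmetic as a small C sketch. Both function names are hypothetical. The first returns how many single bytes must be moved before the destination reaches an 8-byte boundary (never more than the total size); the second is the "misaligned exactly the same" test from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to copy one at a time before pDst is 8-byte aligned,
   limited to nSize so that short copies are not overrun. */
size_t lead_bytes(const void *pDst, size_t nSize)
{
    size_t lead = (8u - ((uintptr_t)pDst & 7u)) & 7u;   /* 0..7 */
    return lead < nSize ? lead : nSize;
}

/* Nonzero when both pointers have the same offset within a qword,
   so aligning one of them aligns the other as well. */
int same_alignment(const void *pDst, const void *pSrc)
{
    return (((uintptr_t)pDst ^ (uintptr_t)pSrc) & 7u) == 0;
}
```

The clamp to nSize is the check the assembly lead loop leaves to the caller.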

For the actual memory movement operation there are various techniques that can be used, each with its own benefit or drawback.

The best method is a preventative one. If the memory you're dealing with is for video images, then not only should the (width mod 8) equal a remainder of zero but the source and destination pointers should also be properly aligned. In this way, there is no problem of clock penalties for each memory access and no extra and possibly futile effort trying to align them.

In 8-bit images, moving (blitting) sprite memory can be difficult as a sprite can start at any byte address and so will usually be misaligned. In 32-bit images where one pixel is 32 bits, alignment is a snap, as every pixel is properly aligned.

#ifdef __cplusplus
extern "C" void gfxCopyBlit8x8Asm(byte *pDst, byte *pSrc,
        uint nStride, uint nWidth, uint nHeight);
#endif


    // Comment this line out for 'C' code

#define USE_ASM_COPYBLIT_8X8

    // 8-bit to 8-bit Copy Blit
    //
    // This function is pre-clipped to copy an 8-bit color
    // pixel from the buffer pointed to by the source
    // pointer to an identical sized destination buffer.

#ifdef USE_ASM_COPYBLIT_8X8
#define CopyBlit8x8 gfxCopyBlit8x8Asm
#else
void CopyBlit8x8(byte *pDst, byte *pSrc, uint nStride,
        uint nWidth, uint nHeight)
{
     // If width is the stride then copy entire image
   if (nWidth == nStride)
     {
       memcpy(pDst, pSrc, nStride * nHeight);
     }
   else
     {   // Copy image 1 scanline at a time.
       do {
           memcpy(pDst, pSrc, nWidth);

           pSrc += nStride;      // Source stride adjustment
           pDst += nStride;      // Destination Stride adj.
         } while(--nHeight);     // Loop for height
     }
}
#endif

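As you probably noted, there is extra logic checking if width and stride are the same. If so, the scanlines are contiguous in memory and the whole image can be copied with a single call, which is even more efficient than looping one scanline at a time.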

Goal: Try to write the listed function in assembly optimized for your processor, or for multiple processors.

Speed Freak

The code size would increase but using a vector table such as follows would allow you to unroll your (remainder) loops. With normal code, four states would be required but for MMX all eight would be best.

         mov     eax,ecx          ; Get Width
         and     eax,0000111b
         jmp     $SetTbl[eax*4]

; At bottom of assembly source file insert the vector table so
; it doesn't interfere with your memory caches.

         Align 16
$SetTbl: dd       $SetQ           ; (n mod 8) = 0
         dd       $Set1           ; (n mod 4) = 1
         dd       $Set2           ; (n mod 4) = 2
         dd       $Set3           ; (n mod 4) = 3
         dd       $SetD           ; (n mod 4) = 0
         dd       $Set1           ; (n mod 4) = 1
         dd       $Set2           ; (n mod 4) = 2
         dd       $Set3           ; (n mod 4) = 3

Graphics 101 — Frame Buffer

When dealing with graphic images there are various parameters defining their geometry.

  • memptr: The base pointer to a coordinate within the image related to its physical memory address.

  • bits per pixel: The number of bits per pixel used to represent the image. Typically 1/4/8/16/24/32-bit but pretty much only 8- to 32-bit are used these days.

  • width: The width of the image in pixels.

  • height: The height of the image in pixels.

  • stride: The number of bytes from the start of one row of pixels to the start of the next. It should be noted that there may be extra bytes between the last visible pixel and the start of the next row of pixels. For example, in Figure 19-2 the 640-pixel scanline (at one byte per pixel) has an overage of 384 bytes. That means that after you write the 640th pixel you need to add 384 to get to the start of the next scanline (640+384=1024).

Figure 19-2. Bitmap dimension information
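
These parameters combine into a pixel address in the obvious way. The following C sketch shows the arithmetic; the Bitmap structure and the pixel_ptr name are hypothetical and assume the bits per pixel is a multiple of eight.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a frame buffer. */
typedef struct {
    uint8_t *memptr;        /* base address of the image       */
    unsigned bitsPerPixel;  /* assumed to be 8, 16, 24, or 32  */
    unsigned width;         /* visible pixels per row          */
    unsigned height;        /* number of rows                  */
    size_t   stride;        /* bytes from one row to the next  */
} Bitmap;

/* Address of pixel (x, y): rows advance by the stride, not the width. */
uint8_t *pixel_ptr(const Bitmap *bm, unsigned x, unsigned y)
{
    return bm->memptr
         + (size_t)y * bm->stride
         + (size_t)x * (bm->bitsPerPixel / 8u);
}
```

With an 8-bit, 640-pixel-wide image and a stride of 1024, the byte after the last visible pixel of a row plus the 384-byte overage lands exactly on the first pixel of the next row.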

So now let's use this information in some real code.

#ifdef __cplusplus
extern "C" void gfxClrAsm(byte *pDst, uint nStride,
 uint nWidth, uint nHeight);
#endif


    // Comment this line out for C code
#define USE_ASM_GFXCLR

    // Graphics Clear
    //
    // This is a pre-clipped function used to clear a bitmap
    // pointed to by the destination pointer.
    // Note: This can be used to clear 8/16/24/32-bit pixels.

#ifdef USE_ASM_GFXCLR
#define gfxClr gfxClrAsm
#else

void gfxClr(byte *pDst, uint nStride, uint nWidth, uint nHeight)
{
    do {
        memset(pDst, 0, nWidth);
        pDst += nStride;
      } while (--nHeight);
}
#endif

Project:

Using what you've learned, try to write the C function above in assembly optimized for your processor.

void gfxClrAsm(byte *pDst, uint nStride, uint nWidth, uint nHeight);

Graphics 101 — Blit

There are different methods one can choose to blit or bit field copy a graphics image, including a pure blit where the image is merely copied pixel by pixel or a transparent copy such as detailed here.

A transparent pixel is referred to by a variety of names, including transparent, color key, skip color, invisible color, and non-displayed pixel. This is a pixel containing no image color data that allows the color of the pixel directly underneath it to be displayed. It is typically set to an unusual color that helps the artists and programmers easily identify it in relation to the rest of the colors.

If you watch the news you see this process every day compliments of the weatherman. He is shot on a green screen, being careful not to wear a color similar to the color key, so the electronics will make him appear in front of an image such as a map and that composite image is transmitted to your television. If he wore the same shade of color as the color key, in the middle of his chest he would appear to have a big hole where you would be able to see through his body.

When using film, moviemakers shoot models or actors on a blue screen, as the color of blue is actually clear on the film negative. Oversimplifying this explanation, the non-clear areas would be converted into a mask and the images would be cut into a composite typically using a matte backdrop.

When using digitized graphics in a computer, movie/game makers shoot actors on a green screen and digitally map the images into a single image using some sort of backdrop.

Your transparency color can be any color. I typically pick a dark shade of blue. For instance, in an RGB range of (0 to 255) {red:0, green:0, blue:108}. This allows me to differentiate between the color of black and transparency and still have the transparent color dark enough so as not to detract from the art. When I am nearly done with the image and almost ready to test it for any stray transparent pixels, I set them to a bright purple {red:255, green:0, blue:255} as that particular color of bright purple is not usually found in my art images and it really stands out. It does not matter what color you use as long as the image does not contain that particular color.

In a 2D graphics application, there is typically a need to composite images and so this leads to how to handle a transparent blit.

A few years ago, I taught a College for Kids program during the summer titled "The Art of Computer/Video Game Design." For that class, I had put together a small program that reinforced the need for computer games to have foreign language support. This particular game was called "Monster Punch." A language would be selected and then various living fruit with their eyes moving around would drop down from the top of the screen and pass through the opening out of view at the bottom of the screen. After all the fruit had fallen, the display would snap to a view of a blender, at which point all the fruit would be blended, while screaming, into monster punch where the blender comes alive, à la "Monster Punch!" (Okay, maybe I am a little warped, but you should have been able to figure that out by now!)

The following sections use Monster Punch to demonstrate blitting.

Copy Blit

The following sprite imagery is that of a copy blit, where a rectangular image is copied to the destination and overwrites any overlapped pixel.

Figure 19-3. Monster Punch — Copy blit of strawberry image on the right into the blender on the left.

Using efficiently optimized code, up to eight bytes at a time can be copied with 64-bit access, which corresponds to simultaneously writing eight 8-bit pixels, or four 16-bit pixels, or almost three 24-bit pixels, or only two 32-bit pixels. With 128-bit, up to 16 bytes can be accessed, thus 16 8-bit pixels, or eight 16-bit pixels, or slightly over five 24-bit pixels, or only four 32-bit pixels.

Transparent Blit

As the following sprite image portrays, all pixels from the source that match the transparent color are not copied, thus causing the sprite to be seamlessly pasted into the background.

Figure 19-4. Monster Punch — Transparent blit of strawberry image on the right into the blender on the left.

Normally when dealing with transparencies, only one pixel at a time can be tested to detect whether it is transparent or not, which winds up introducing inefficiencies such as branch mispredictions. That is where the sample in the following section comes in handy.

Graphics 101 — Blit (MMX)

The following code is a sample of a transparent blit, where a scanline of a count of ECX 8-bit bytes is copied from one graphic source row [ESI] to a destination graphic row [EDI] one pixel at a time.
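
A one-pixel-at-a-time transparent copy of a scanline can be sketched in C as follows. This is an illustrative stand-in written for this discussion, with a hypothetical function name; it plays the role of the count in ECX, the source row at [ESI], and the destination row at [EDI].

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nCount 8-bit pixels from pSrc to pDst, skipping every
   source pixel that matches the transparency color. */
void transparent_row8(uint8_t *pDst, const uint8_t *pSrc,
                      size_t nCount, uint8_t tcolor)
{
    for (size_t i = 0; i < nCount; i++) {
        if (pSrc[i] != tcolor)      /* one test and branch per pixel */
            pDst[i] = pSrc[i];
    }
}
```

The test inside the loop is the per-pixel branch that the MMX version is designed to remove.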

Graphics Engine — Sprite Layered

This eight 8-bit transparent pixel copy uses MMX code. Note that there is only one branch loop every eighth pixel.

tcolor qword 03f3f3f3f3f3f3f3fh ; 03fh = transparent pixel

; esi=source edi=destination ecx=# of qwords

     movq   mm7,tcolor     ; Get replicated transparency

$T0: movq   mm5,[esi]      ; Get 8 source pixels
     movq   mm4,[edi]      ; Get background
     movq   mm6,mm5        ; Copy 8 source pixels

; Compare each pixel's color to transparency color and if
; a match, set each pixel in the mask to FF else 00!

     pcmpeqb mm5,mm7       ; Create masks for transparency

     add     esi,8         ; Adjust source pointer

;   Only keep the pixels in the destination that correspond
;   to the transparent pixels of the source!

     pand    mm4,mm5

; Using the same mask, flip it, then AND it with the
; source pixels, keeping the non-transparent pixels.

     pandn   mm5,mm6       ; erase transparent pixels

; Or the destination pixels with the source pixels.

     por     mm4,mm5       ; blend 8 pixels into art
     movq    [edi],mm4     ; Save new background

     add     edi,8         ; Adjust destination pointer
     dec     ecx           ; any pixels left?
     jne     $T0           ; Loop for eight 8-bit pixels
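
The mask-and-merge logic of that loop can be mirrored in portable C for readers without an MMX assembler at hand. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the byte comparison uses a SWAR (SIMD within a register) trick on a 64-bit integer as a substitute for PCMPEQB rather than the instruction itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOW7 0x7F7F7F7F7F7F7F7Full

/* FF in every byte lane where a and b match, 00 elsewhere
   (the mask PCMPEQB would produce). */
static uint64_t byte_eq_mask(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                 /* lane is 00 where bytes match    */
    uint64_t t = (x & LOW7) + LOW7;     /* bit 7 set if low 7 bits nonzero */
    t = ~(t | x | LOW7);                /* 80 in lanes that were 00        */
    return (t >> 7) * 0xFFu;            /* expand 01 to FF per lane        */
}

/* Transparent blit of nQwords groups of eight 8-bit pixels. */
void transparent_qwords(uint8_t *pDst, const uint8_t *pSrc,
                        size_t nQwords, uint8_t tcolor)
{
    uint64_t key = (uint64_t)tcolor * 0x0101010101010101ull;

    while (nQwords--) {
        uint64_t src, dst, mask;
        memcpy(&src, pSrc, 8);                  /* 8 source pixels        */
        memcpy(&dst, pDst, 8);                  /* 8 background pixels    */
        mask = byte_eq_mask(src, key);
        dst  = (dst & mask) | (src & ~mask);    /* pand, pandn, por       */
        memcpy(pDst, &dst, 8);
        pSrc += 8;
        pDst += 8;
    }
}
```

As in the assembly, the only branch left is the loop itself, once per eight pixels.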

There is no transparency testing or branching, only the masking and blending of data, which makes the process of a transparent blit much faster. These two different blits (copy, transparent) are typically designed for a graphic environment such as in Figure 19-5 where the background seen on the right is kept in a separate buffer like wallpaper.

Figure 19-5. Transparent copy blit of strawberry sprite and blender image background to achieve composite result of both.

The background is copied with a CopyBlit to the working surface as seen on the left, and the sprite image is drawn with a Transparent Blit in front of it. When the sprite image is animated, the area being changed is "erased" from the working surface by a rectangular CopyBlit of that area from the background to the working surface, and then the updated sprite image is drawn over that rectangular area with a Transparent Blit. This is a layered approach typically used in a video game that has a number of animated objects moving around the display.
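
One animation step of this layered approach can be sketched in C. Everything here is a simplified assumption for illustration: the function names are hypothetical, both surfaces are 8-bit with the same stride, the sprite is tightly packed, and all rectangles are pre-clipped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy blit of a w x h rectangle between surfaces sharing one stride. */
static void copy_rect8(uint8_t *pDst, const uint8_t *pSrc,
                       size_t stride, size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        memcpy(pDst + y * stride, pSrc + y * stride, w);
}

/* Transparent blit of a tightly packed w x h sprite. */
static void transparent_rect8(uint8_t *pDst, size_t stride,
                              const uint8_t *pSprite,
                              size_t w, size_t h, uint8_t tcolor)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            if (pSprite[y * w + x] != tcolor)
                pDst[y * stride + x] = pSprite[y * w + x];
}

/* Erase the sprite at its old position by restoring the background,
   then draw it at the new position. */
void move_sprite(uint8_t *pWork, const uint8_t *pBack, size_t stride,
                 const uint8_t *pSprite, size_t w, size_t h,
                 size_t oldX, size_t oldY, size_t newX, size_t newY,
                 uint8_t tcolor)
{
    size_t oldOff = oldY * stride + oldX;

    copy_rect8(pWork + oldOff, pBack + oldOff, stride, w, h);
    transparent_rect8(pWork + newY * stride + newX, stride,
                      pSprite, w, h, tcolor);
}
```

Because the background buffer is never modified, overlapping sprites need no special handling beyond drawing them in back-to-front order.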

Graphics Engine — Sprite Overlay

Another graphic sprite environment method is where the area under the sprite is remembered in a buffer attached to the sprite before the sprite image is Transparent Blit. This operation typically occurs simultaneously to reduce the amount of calculation work.

This is typically called an "overlay" method used by Windows and some sprite engines. The drawback to this method is that overlapping of sprites needs to be minimized because erasing one requires all the other intersecting sprites visible above that sprite to be erased. The list of sprites needs to be traversed to find out which sprites intersect the area and need to be erased and repainted by replacing the image under each intersecting sprite in the image buffer with the corresponding original background image. The list of sprites then needs to be traversed again, this time drawing the sprites back into the scene.

Figure 19-6. The blit of a rectangular blender image to a storage buffer, then the transparent blit of a strawberry into blender. A blit of the saved blender image back into blender effectively erases the strawberry.

tcolor qword 03f3f3f3f3f3f3f3fh ; 03fh = transparent pixel

; esi=source  edi=destination  ebx=buffer  ecx=# of qwords

      movq   mm7,tcolor    ; Get replicated transparency

$T0:  movq   mm5,[esi]     ; Get 8 source pixels
      movq   mm4,[edi]     ; Get 8 background pixels
      movq   mm6,mm5       ; Copy 8 source pixels

; Compare each pixel's color to transparency color and if
; a match, set each pixel in the mask to FF, else 00!

      pcmpeqb mm5,mm7      ; Create masks for transparency
      movq    [ebx],mm4    ; Save BGnd in buffer

; Only keep the pixels in the destination that correspond
; to the transparent pixels of the source!

      pand   mm4,mm5

; Using the same mask, flip it then AND it with the
; source pixels, keeping the non-transparent pixels.

      pandn  mm5,mm6       ; erase transparent pixels

; Or the destination pixels with the source pixels.

      add    ebx,8         ; Adjust buffer pointer
      por    mm4,mm5       ; Blend 8 pixels into art
      add    esi,8         ; Adjust source pointer
      movq   [edi],mm4     ; Save new background
      add    edi,8         ; Adjust destination pointer

      dec    ecx           ; Any pixels left?
      jne    $T0           ; Loop for eight 8-bit pixels
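
For comparison, here is a portable C sketch of the same loop that also handles eight 8-bit pixels per pass. It is not the author's code. It substitutes a well-known bit trick for pcmpeqb, since C has no packed byte compare, and the names overlay_blit8, transparent_mask, TCOLOR8, and LOW7 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCOLOR8 UINT64_C(0x3F3F3F3F3F3F3F3F)  /* same value as tcolor */
#define LOW7    UINT64_C(0x7F7F7F7F7F7F7F7F)

/* Stand-in for pcmpeqb: FF in every byte where v holds 3F, else 00. */
static uint64_t transparent_mask(uint64_t v)
{
    uint64_t x = v ^ TCOLOR8;        /* a zero byte marks a transparent pixel */
    uint64_t t = (x & LOW7) + LOW7;  /* bit 7 set if the low 7 bits are nonzero */
    t = ~(t | x | LOW7);             /* 80 where the byte was zero, else 00 */
    return (t >> 7) * 0xFF;          /* widen 01 to FF in each byte */
}

/* dst = working image, src = sprite, saved = buffer attached to the
   sprite, qwords = number of 8-pixel groups (the role of ecx). */
static void overlay_blit8(uint8_t *dst, const uint8_t *src,
                          uint8_t *saved, size_t qwords)
{
    for (size_t i = 0; i < qwords; i++) {
        uint64_t s, d, m;

        memcpy(&s, src + 8 * i, 8);     /* get 8 source pixels */
        memcpy(&d, dst + 8 * i, 8);     /* get 8 background pixels */
        memcpy(saved + 8 * i, &d, 8);   /* save background in buffer */

        m = transparent_mask(s);
        d = (d & m) | (s & ~m);         /* pand, pandn, por */

        memcpy(dst + 8 * i, &d, 8);     /* save new background */
    }
}
```

Copying the saved buffer back over the same destination range erases the sprite, as in Figure 19-6. The memcpy calls are there only to sidestep alignment and aliasing rules; a compiler will typically turn each one into a single 64-bit load or store.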

Graphics 101 — Clipping Blit

Figure 19-7. 2D bitmap on left with 2-bit clipping plane on right

The same trick of using inverse logic can be used for expanding image clipping planes.

No matter how the image on the left is encoded (8/16/24/32 bits), only a single bit in the clipping plane image on the right is needed to represent each pixel. If black=0 and white=1, then a sprite object can appear to pass in front of the fence as well as behind it, while always staying in front of the distant background. This can be done in a variety of ways. One is to use masks, where the sprite pixel and the background pixel are both masked so that only one of them has a non-zero value. The resulting color is written to the destination buffer.

; esi = sprite image pointer
; ebx = clipping plane
; ebp = background image pointer
; edi = destination image buffer pointer

        mov    edx,[ebx]          ; Get clipping plane
        mov    ch,32              ; 32 pixels at a time

$L1:    mov    al,[esi]           ; Source sprite image
        inc    esi                ; Next sprite pixel pointer
        mov    cl,[ebp]           ; Source background image
        inc    ebp                ; Next src background pixel

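; Every pixel uses up one clipping bit, even a transparent one,
; so shift the plane before testing for transparency.

        shr    edx,1              ; Get a masking bit into carry
        setnc  ah                 ; 1=background 0=foreground

        cmp    al,03fh            ; 03fh = transparent color
        je     $T1                ; Jump if transparent pixel

        dec    ah                 ; 00=background ff=foreground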
        and    cl,ah              ; ff=keep bgnd 00=kill it
        not    ah                 ; Flip masking bits
        and    al,ah              ; ff=keep sprite 00=kill it
        or     cl,al              ; (XOR type) Blend pixels
$T1:    mov    [edi],cl           ; Save new pixel to destination
        inc    edi                ; Next dst working pixel
        dec    ch                 ; 1 less pixel in run
        jnz    $L1                ; Loop
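
The same 32-pixel run can be written in C without the branch. Treat this as a sketch under stated assumptions: 8-bit pixels, 03fh as the transparent color, the least significant bit of the clipping word belonging to the first pixel, and one clipping bit used per pixel. The name clip_blit32 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TCOLOR 0x3Fu  /* assumed transparent color index */

/* Composite up to 32 pixels using one 32-bit word of the clipping plane.
   A set bit means the pixel belongs to the foreground layer (the fence),
   so the background image pixel wins over the sprite pixel. */
static void clip_blit32(uint8_t *dst, const uint8_t *sprite,
                        const uint8_t *background,
                        uint32_t plane, size_t count)
{
    for (size_t i = 0; i < count && i < 32; i++) {
        uint8_t s = sprite[i];
        uint8_t b = background[i];

        uint8_t fg = (uint8_t)-(plane & 1u);          /* FF = fence pixel */
        uint8_t transparent = (uint8_t)-(s == TCOLOR); /* FF = no sprite here */
        uint8_t keep_bg = (uint8_t)(fg | transparent);

        plane >>= 1;                                  /* next clipping bit */

        /* Only one of the two terms is non-zero after masking. */
        dst[i] = (uint8_t)((b & keep_bg) | (s & (uint8_t)~keep_bg));
    }
}
```

Folding the transparency test into the mask means the background is kept whenever either condition holds, so both cases go through the same masked OR.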