Chapter 18. System

System "Lite"

There are other instructions available in your processor, but they have little to no relationship to your application code. As mentioned at the beginning of this book, there are basically three types of instructions: general-purpose, floating-point, and system instructions. (Note that I am oversimplifying here!) The latter exist for writing system-level code, that is, operating system code. They are not typically accessible to or needed by programmers writing non-operating system code. As this book is not targeted at that market, there is no need to make the standard application programmer wade through them. But as some of you may just cry foul, I have included a very light overview of these instructions. Besides, there are some tidbits in here for all of you!

Chapter 3, "Processor Differential Insight," as well as Chapter 16, "What CPUID?" gave some background on the processor. We shall now continue with that information. Some of what is included here is not necessarily just for system programmers as some features of the 80×86 are system related but are accessible from the application level. Note the System "Lite" part? Keep in mind that this is a superficial overview. If you need an in-depth explanation, please refer to documentation direct from the manufacturer.

System Timing Instructions

RDPMC — Read Performance — Monitoring Counters

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RDPMC

rdpmc

This instruction loads the 40-bit performance monitoring counter indexed by ECX into the EDX:EAX register pair. For 64-bit mode, RDX[0...31]:RAX[0...31]=[RCX]. This instruction is accessible from any layer inclusive of the application layer only if the PCE flag in CR4 is set. When the flag is clear, this instruction can only be run from privilege level 0.

RDTSC — Read Time-Stamp Counter

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RDTSC

rdtsc

This system instruction reads the 64-bit time-stamp counter and loads the value into the EDX:EAX registers. The counter is incremented every clock cycle and is cleared (reset) to zero upon the processor being reset. This instruction is accessible from any layer inclusive of the application layer unless the TSD flag in CR4 is set. So far while running under Win32 the flag has been clear as a default, thus allowing an application to access this instruction.

  ;       void CpuDelaySet(void)

         public  CpuDelaySet
  CpuDelaySet    proc    near
          rdtsc                      ; Read time-stamp counter

          mov     tclkl,eax          ; Save low 32 bits
          mov     tclkh,edx          ; Save high 32 bits
          ret
  CpuDelaySet    endp

  ; long int CpuDelayCalc(void)
  ;
  ; This function is called after CpuDelaySet() to get the
  ; elapsed interval in clock cycles.
  ;
  ; Note: On a 400MHz computer, reading only the lower 32 bits
  ; gives a maximum sampling of about 10 seconds before rollover.

           public   CpuDelayCalc
  CpuDelayCalc proc     near
           rdtsc                     ; Read time-stamp counter

           sub      eax,tclkl
           sbb      edx,tclkh        ; edx:eax = total elapsed interval

           ret                       ; return edx:eax = 64 bits of info.
  CpuDelayCalc endp

These two functions can be used for time trials while optimizing code. Due to multithreaded environments, another thread or interrupt can steal your time slice while you are trying to do time analysis on a bit of code. You could divide the number of loops into the total delay to get an average loop delay count. What I like to do is run a benchmark of executing the same code a few thousand times, ignoring the effects the prefetch has on these times or the fact the Nth time around the data is already sitting in memory. One time I took the governor off an MPEG decoder so it would run full speed, allowing code to be optimized so that it would run faster and faster.
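The subtract-with-borrow in CpuDelayCalc and the averaging idea described above can be sketched in C. This is only an illustration under stated assumptions: `elapsed_cycles` and `average_cycles` are hypothetical names, and the two halves of each time stamp are passed in as plain integers rather than read with RDTSC.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates SUB eax,tclkl / SBB edx,tclkh on two 64-bit time stamps,
   each held as a high:low pair of 32-bit halves. */
uint64_t elapsed_cycles(uint32_t start_lo, uint32_t start_hi,
                        uint32_t end_lo, uint32_t end_hi)
{
    uint32_t lo = end_lo - start_lo;           /* SUB: wraps on borrow    */
    uint32_t borrow = (end_lo < start_lo);     /* carry flag after SUB    */
    uint32_t hi = end_hi - start_hi - borrow;  /* SBB: subtract + borrow  */
    return ((uint64_t)hi << 32) | lo;
}

/* Average cost of one loop iteration, in clock cycles. */
uint64_t average_cycles(uint64_t total_cycles, uint32_t loops)
{
    assert(loops != 0);
    return total_cycles / loops;
}
```

Averaging over a few thousand iterations does not remove the cost of a stolen time slice, but it does dilute it.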

Calculating Processor Speed

The following code snippet can be included within your own code for determining processor speed. The computer quite often is not running at the speed you may think. I had a weird problem in an application running on my laptop, and it did not make any sense until I wrote this code. Even then I thought it had a bug until I realized the laptop had a thermal problem and had dropped its clock speed by 50% or more so as to run cooler. Clients running your application may have some weird problems or be misinformed about their machines' capabilities, and this code can give you or customer support representatives more debugging insight.

  typedef struct SpeedDataType
  {
    uint tSpeed;
    uint tSpeedState;
    uint nCnt;
    uint wTimerID;

  } SpeedData;


  // Win32 Timer - Calculate CPU Speed

  void CALLBACK SpeedCalcTimer(UINT wTimerID, UINT msg, DWORD dwUser,
                               DWORD dw1, DWORD dw2)
  {
     SpeedData *sp = (SpeedData *)dwUser;

     if (sp->wTimerID != wTimerID) // Is this our timer ID?
     {
        return;
     }

     switch(sp->tSpeedState)
     {
     case 2:                      // 2nd tick (avg of the two intervals)
        sp->tSpeed = (CpuDelayCalc() + sp->tSpeed) >> 1;
        sp->nCnt++;
        CpuDelaySet();
        break;

     case 1:                      // 1st tick
        sp->tSpeed = CpuDelayCalc();
        sp->nCnt++;

        // Allow flow through!

     case 0:                      // Starting tick
        CpuDelaySet();
        sp->tSpeedState++;
        break;

     default:
        break;
     }
  }


  // Be VERY careful when this is called, as your OS may not like it!

  uint SpeedCalc(void)
  {
     TIMECAPS tc;
     uint wTimerRes, nCnt;
     SpeedData sd;

     wTimerRes = 1;

     // Set the timer resolution for the multimedia timer
     if (TIMERR_NOERROR == timeGetDevCaps(&tc, sizeof(TIMECAPS)))
     {
        wTimerRes = min(max(tc.wPeriodMin, 1), tc.wPeriodMax);
        timeBeginPeriod(wTimerRes); // 1ms resolution
     }

     sd.nCnt = sd.tSpeed = sd.tSpeedState = 0;

     sd.wTimerID = timeSetEvent(1, wTimerRes, SpeedCalcTimer,
                                (DWORD)&sd, TIME_PERIODIC | TIME_KILL_SYNCHRONOUS);

     if (sd.wTimerID)               // If we were given a TimerId
     {                              // (Should not fail!)
        do {
           nCnt = sd.nCnt;
           Sleep(10);               // Sleep 10ms

           if (sd.nCnt > 100)       // Cycle 100 times
           {
              timeKillEvent(sd.wTimerID);
              return sd.tSpeed/1000;
           }
        } while (nCnt != sd.nCnt);  // If the same, the timer failed!

        timeKillEvent(sd.wTimerID);
     }

     // Didn't work? Try it the really not-so-accurate way!

     CpuDelaySet();
     Sleep(10);
     return CpuDelayCalc()/10000;
  }
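
The two divisors in SpeedCalc (1000 for a 1ms timer tick, 10000 for a 10ms sleep) follow from unit conversion: cycles per millisecond is kilohertz, and dividing by a further 1000 gives megahertz. A minimal sketch of that arithmetic, using a hypothetical helper name `cycles_to_mhz` that is not part of the listing above:

```c
#include <assert.h>
#include <stdint.h>

/* cycles / interval_ms = kHz; dividing by 1000 more gives MHz. */
uint32_t cycles_to_mhz(uint64_t cycles, uint32_t interval_ms)
{
    assert(interval_ms != 0);
    return (uint32_t)(cycles / ((uint64_t)interval_ms * 1000u));
}
```

A 400MHz processor counts about 400,000 cycles in 1ms and about 4,000,000 in 10ms, so both paths through SpeedCalc report the same figure.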

80×86 Architecture

The Intel and AMD processors have similar functional architecture. Different processors have different numbers of caches, on chip cache, off chip cache, different speeds, different instruction sets, different methods of pipelining instructions. All this book is interested in is helping you, the application programmer, make your code go fast by writing it in assembly. You have no control over what flavor of processor the user of your application chooses to run their applications on. (Of course you could program your application to check these parameters and refuse to run on a system you do not like! But that would be evil!)

test  ebx,ebx
mov   ecx,ebx
mov   esi,ebx
mov   edi,ebx
test  ebx,ebx

Using full registers (as in the above 32-bit code snippet in Protected Mode) allows the instructions to execute on the same clock.

Partial stalls occur if a short version of a register is written to and then immediately followed by a larger version. For example:

mov al,9
add bx,ax     ; clock stall

mov al,9
add ebx,eax   ; clock stall

mov ax,9
add ebx,eax   ; clock stall

Writing to the AL register will cause the next instruction to have a partial stall if that instruction reads a larger form of the same register, such as AX, EAX, or RAX. This is like being at a red signal light in your car and when the light turns green you slam down on the accelerator; your car will sputter, spit a little, hesitate (stall), and then finally accelerate.

CPU Status Registers (32-Bit EFLAGS/64-Bit RFLAGS)

Figure 18-1. CPU status register

EFLAG	Code	Bit	Flag Descriptions
EFLAGS_CF	000000001h	0	Carry
	000000002h	1	1
EFLAGS_PF	000000004h	2	Parity
	000000008h	3	0
EFLAGS_AF	000000010h	4	Auxiliary Carry
	000000020h	5	0
EFLAGS_ZF	000000040h	6	Zero
EFLAGS_SF	000000080h	7	Sign
EFLAGS_TF	000000100h	8	Trap
EFLAGS_IF	000000200h	9	Interrupt Enable
EFLAGS_DF	000000400h	10	Direction
EFLAGS_OF	000000800h	11	Overflow
EFLAGS_IOPL	000003000h	12, 13	I/O Privilege Level
EFLAGS_NT	000004000h	14	Nested Task
	000008000h	15	0
EFLAGS_RF	000010000h	16	Resume
EFLAGS_VM	000020000h	17	Virtual-8086 Mode
EFLAGS_AC	000040000h	18	Alignment Check
EFLAGS_VIF	000080000h	19	Virtual Interrupt
EFLAGS_VIP	000100000h	20	Virtual Interrupt Pending
EFLAGS_ID	000200000h	21	CPUID
		22...31	0

And in 64-bit mode the upper 32 bits of the RFLAGS register (0:EFLAGS):

32...63

RFLAG (extra) bits
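
The codes in the table are bit masks, so testing a flag is a single AND. The following is a minimal sketch, assuming a flags value that has already been captured (for example by PUSHFD/POP); the helper names `eflags_is_set` and `eflags_iopl` are hypothetical, and only a subset of the masks is repeated here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks copied from the EFLAGS table (a subset, for illustration). */
#define EFLAGS_CF   0x00000001u   /* Carry                */
#define EFLAGS_PF   0x00000004u   /* Parity               */
#define EFLAGS_ZF   0x00000040u   /* Zero                 */
#define EFLAGS_SF   0x00000080u   /* Sign                 */
#define EFLAGS_IF   0x00000200u   /* Interrupt Enable     */
#define EFLAGS_DF   0x00000400u   /* Direction            */
#define EFLAGS_OF   0x00000800u   /* Overflow             */
#define EFLAGS_IOPL 0x00003000u   /* I/O Privilege Level  */
#define EFLAGS_ID   0x00200000u   /* CPUID                */

/* True when every bit in mask is a single flag that is set in eflags. */
bool eflags_is_set(uint32_t eflags, uint32_t mask)
{
    return (eflags & mask) != 0;
}

/* IOPL is a two-bit field (bits 12 and 13), so it is shifted down. */
unsigned eflags_iopl(uint32_t eflags)
{
    return (eflags & EFLAGS_IOPL) >> 12;
}
```

For the value 00000246h, the Parity, Zero, and Interrupt Enable flags test as set, along with the always-one bit 1.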

Protection Rings

The 386 and above have layers of protection referred to as protection rings.

Figure 18-2. Protection rings

The inner ring #0 contains the operating system kernel. The two middle rings (#1 and #2) contain the operating system services (device drivers), and the outer ring #3 is where the application (user code) resides. The ring numbers are also referred to as privilege levels with 0 being the highest and 3 being the lowest.

An application can access functions in the other rings by means of a gate. The SYSCALL and SYSENTER instructions are two such methods. This protection system shields the inner rings from the outer. You know, to keep the riffraff out! Any attempt to access an inner ring without going through a gate will cause a general protection fault.

Control Registers

There are four control registers {CR0, CR2, CR3, CR4} that control system level operations. Note that CR1 is reserved.

Table 18-1. Control register 0 (CR0) extensions

CR0	Code	Bit	Flag Descriptions
CR0_PE	000000001h	0	Protection Enable
CR0_MP	000000002h	1	Monitor Coprocessor
CR0_EM	000000004h	2	Emulation
CR0_TS	000000008h	3	Task Switched
CR0_ET	000000010h	4	Extension Type
CR0_NE	000000020h	5	Numeric Error
		6...15
CR0_WP	000010000h	16	Write Protected
		17
CR0_AM	000040000h	18	Alignment Mask
		19...28
CR0_NW	020000000h	29	Not Write-Through
CR0_CD	040000000h	30	Cache Disable
CR0_PG	080000000h	31	Paging

And in 64-bit mode the upper 32 bits of the CR0 register (0:CR0):

32...63

Control register 2 (CR2) is a 32/64-bit page fault linear address.

Table 18-2. Control register 3 (CR3) extensions

CR3	Code	Bit	Flag Descriptions
		0...2
CR3_PWT	000000008h	3	Page Writes Transparent
CR3_PCD	000000010h	4	Page Cache Disable
Page Dir.Base		12...31

And in 64-bit mode the upper 32 bits of the CR3 register (0:CR3):

32...63

CR3 (extra) bits

Table 18-3. Control register 4 (CR4) extensions

CR4	Code	Bit	Flag Descriptions
CR4_VME	000000001h	0	Virtual-8086 Mode Ext.
CR4_PVI	000000002h	1	Protected Virtual Int.
CR4_TSD	000000004h	2	Time Stamp Disable
CR4_DE	000000008h	3	Debugging Extensions
CR4_PSE	000000010h	4	Page Size Extension
CR4_PAE	000000020h	5	Physical Address Ext.
CR4_MCE	000000040h	6	Machine Check Enable
CR4_PGE	000000080h	7	Global Page Enable
CR4_PCE	000000100h	8	RDPMC Enabled
CR4_OSFXSR	000000200h	9	FXSAVE, FXRSTOR
CR4_OSXMMEXCPT	000000400h	10	Unmasked SIMD FP Exception

And in 64-bit mode the upper 32 bits of the CR4 register (0:CR4):

32...63

CR4 (extra) bits
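
The two CR4 flags mentioned earlier for RDTSC and RDPMC work in opposite senses: TSD restricts an instruction when set, while PCE grants access when set. The sketch below models that rule only; an application cannot read CR4 itself, so the `cr4` argument is assumed to be supplied from elsewhere, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_TSD 0x00000004u   /* Time Stamp Disable */
#define CR4_PCE 0x00000100u   /* RDPMC Enabled      */

/* RDTSC runs at any privilege level unless TSD is set;
   with TSD set it is restricted to privilege level 0. */
bool rdtsc_allowed(uint32_t cr4, unsigned privilege_level)
{
    return privilege_level == 0 || (cr4 & CR4_TSD) == 0;
}

/* RDPMC runs only at privilege level 0 unless PCE is set. */
bool rdpmc_allowed(uint32_t cr4, unsigned privilege_level)
{
    return privilege_level == 0 || (cr4 & CR4_PCE) != 0;
}
```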

(TPR) Task Priority Registers — (CR8)

Table 18-4. Control register 8 (CR8) extensions. This is new for EM64T.

CR8	Bit	Flag Descriptions
CR8_APSC	0...3	Arbitration Priority Sub-class
CR8_AP	4...7	Arbitration Priority
	8...63	CR8 (extra) bits

Debug Registers

There are eight debug registers: {DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7}. Knowing them is unimportant as you are most likely using a debugger to develop your application, not building a debugger. These are privileged resources and only accessible at the system level to set up and monitor the breakpoints {0...3}.

Cache Manipulation

Several mechanisms have been put into place to squeeze optimal throughput from the processors. One method of cache manipulation discussed in Chapter 10, "Branching," is Intel's hint as to the prediction of logic flow through branches counter to the static prediction logic. Another mechanism is a hint to the processor about cache behavior so as to give the processor insight into how a particular piece of code is utilizing memory access. Here is a brief review of some terms that have already been discussed:

Temporal data — Memory that requires multiple accesses and therefore needs to be loaded into a cache for better throughput.
Non-temporal hint — A hint (an indicator) to the processor that memory only requires a single access (one shot). This would be similar to copying a block of memory or performing a calculation, but the result is not going to be needed for a while so there is no need to write it into the cache. Thus, the memory access has no need to read and load cache, and therefore the code can be faster.

For speed and efficiency, when memory is accessed for read or write a cache line containing that data (whose length is dependent upon manufacturer and version) is copied from system memory to high-speed cache memory. The processor performs read/write operations on the cache memory. When a cache line is invalidated, the write back of that cache line to system memory occurs. In a multiprocessor system, this occurs frequently due to non-sharing of internal caches. The second stage of writing the cache line back to system memory is called a "write back."

Cache Sizes

Different processors have different cache sizes for data and for code. These are dependent upon processor model, manufacturer, etc., as shown below:

CPU	L1 Cache (Data / Code)	L2 Cache
Celeron	16KB / 16KB	256KB
Pentium 4	8KB / 12Kμops	512KB
Athlon XP	64KB / 64KB	256KB
Duron	64KB / 64KB	64KB
Pentium M	32KB / 32KB	1024KB
Xeon		512KB

Depending on your code and level of optimization, the size of the cache may be of importance. For the purposes of this book, however, it is being ignored, as that topic is more suitable for a book very specifically targeting heavy-duty optimization. This book, however, is interested in the cache line size as that is more along the lightweight optimization that has been touched on from time to time. It should be noted that AMD uses a minimum size of 32 bytes.

Cache Line Sizes

The (code/data) cache line size determines how many instruction/data bytes can be preloaded.

Intel	Cache Line Size (bytes)
PIII	32
Pentium M	64
P4	64
Xeon	64

AMD	Cache Line Size (bytes)
Athlon	64
Opteron	64
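
The line size matters because a buffer that straddles a line boundary costs an extra line load. As a rough sketch (the function name `cache_lines_touched` is hypothetical, and the line size is assumed to be a power of two such as 32 or 64), the number of lines a byte range covers can be computed like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines covered by the byte range [addr, addr + len).
   line_size must be a power of two, such as 32 or 64. */
size_t cache_lines_touched(uintptr_t addr, size_t len, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = addr & mask;              /* line holding first byte */
    uintptr_t last  = (addr + len - 1) & mask;  /* line holding last byte  */
    return (size_t)((last - first) / line_size) + 1;
}
```

An 8-byte value at offset 60 touches two 64-byte lines, while the same value at offset 56 touches only one.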

The cache line size can be obtained by using the CPUID instruction with EAX set to 1. The following calculation will give you the actual cache line size.

mov    eax,1
cpuid

and    ebx,00000FF00h
shr    ebx,8-3                   ; ebx = size of cache line
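
The shift by 8-3 works because bits 8...15 of EBX hold the line size in units of 8 bytes: shifting right by 8 extracts the field, and shifting left by 3 multiplies it by 8. A small C sketch of the same calculation, taking the EBX value as a parameter (how it is obtained from CPUID is left out, and both function names are hypothetical):

```c
#include <stdint.h>

/* ebx is the EBX value returned by CPUID with EAX = 1.
   Bits 8...15 hold the line size in 8-byte units. */
uint32_t cache_line_bytes(uint32_t ebx)
{
    return ((ebx & 0x0000FF00u) >> 8) * 8u;
}

/* The single-shift form used in the assembly above: same result. */
uint32_t cache_line_bytes_one_shift(uint32_t ebx)
{
    return (ebx & 0x0000FF00u) >> (8 - 3);
}
```

A field value of 8 therefore means a 64-byte line, and a value of 4 means a 32-byte line.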

PREFETCH_x — Prefetch Data into Caches

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
PREFETCH
PREFETCHNTA
PREFETCHT0
PREFETCHT1
PREFETCHT2
PREFETCHW

3DNow!	prefetch	mSrc8
	prefetchw	mSrc8
SSE	prefetcht0	mSrc8
	prefetcht1	mSrc8
	prefetcht2	mSrc8
	prefetchnta	mSrc8

The PREFETCHNTA instruction performs a non-temporal hint to the processor with respect to all the caches, to load from system memory mSrc8 into the first-level cache for a PIII or a second-level cache for a P4 or Xeon processor.

The PREFETCHT0 instruction performs a temporal hint to the processor to load from system memory mSrc8 into the first- or second-level cache for a PIII, or a second-level cache for a P4 or Xeon processor.

The PREFETCHT1 instruction performs a temporal hint to the processor with respect to the first-level cache to load from system memory mSrc8 into the second-level cache for PIII, P4, or Xeon processor.

The PREFETCHT2 instruction performs a temporal hint to the processor with respect to the second-level cache to load from system memory mSrc8 into the second-level cache for a PIII, P4, or Xeon processor.

If data is already loaded at the same or higher cache, then no operation is performed.

AMD processors alias PREFETCHT1 and PREFETCHT2 instructions to the PREFETCHT0 instructions, so they all have the PREFETCHT0 functionality.

The 3DNow! PREFETCH instruction loads a cache line into the L1 data cache from the mSrc8.

The 3DNow! PREFETCHW instruction loads a cache line into the L1 data cache from the mSrc8 but sets a hint indicating that it is for write operations.
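
From C, a prefetch hint is usually issued through a compiler facility rather than inline assembly. The sketch below assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin takes a locality of 3 for data that should stay in all cache levels and 0 for data with no temporal locality; which prefetch instruction the compiler actually emits is up to the toolchain and target. The function name and the look-ahead distance of 16 elements are arbitrary choices for illustration, not tuned values.

```c
#include <stddef.h>

/* Sums an array while hinting that data a little further ahead
   will be needed soon. The hint never changes the result. */
long sum_with_prefetch(const int *data, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++) {
        if (i + 16 < count)
            __builtin_prefetch(&data[i + 16], 0, 3);  /* read, temporal */
        total += data[i];
    }
    return total;
}
```

Because a prefetch is only a hint, the code must produce the same answer whether or not the processor acts on it; only the timing differs.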

LFENCE — Load Fence

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LFENCE

lfence

This instruction is similar to the MFENCE instruction, but it acts as a barrier only for memory loads: every load issued before the LFENCE completes before any load issued after it.

SFENCE — Store Fence

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SFENCE

sfence

This instruction is similar to the MFENCE instruction, but it acts as a barrier only for memory stores: every store issued before the SFENCE becomes globally visible before any store issued after it.

MFENCE — Memory Fence

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
MFENCE

mfence

This instruction is a barrier (fence) for both loads and stores: every memory load and store issued before the MFENCE becomes globally visible before any load or store issued after it.

CLFLUSH — Flush Cache Line

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
CLFLUSH

clflush mSrc8

This instruction invalidates the cache line (code or data) containing the linear address specified by mSrc8. If the line is dirty (that is, it has been modified and so differs from system memory), it is written back to system memory first. This instruction is ordered by the MFENCE instruction. Check bit #19 (CLFSH) of the feature flags returned in EDX by CPUID with EAX set to 1 to see if this instruction is available.

INVD — Invalidate Cache (WO/Writeback)

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
INVD

invd

This instruction invalidates the internal caches without waiting for write back of modified cache lines and initiates bus cycles for external caches to flush. This is similar to WBINVD but without a write back.

WBINVD — Write Back and Invalidate Cache

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
WBINVD

wbinvd

This instruction writes back all the modified cache lines, then invalidates the internal caches, and initiates bus cycles for external caches to flush. This is similar to INVD but with a write back.

System Instructions

The full scope of the system instructions is not covered in this book. Refer to the Intel- and AMD-specific documentation for full specifications. They are considered OS/System instructions and as such are not discussed in depth here. Some are accessible by the application layer at the low privilege level but are not part of the general application development process. They are only referenced here for informational purposes and to ensure this book lists all instructions available at the time of its publication.

ARPL — Adjust Requested Privilege Level

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
ARPL								32		32

arpl rmDst16, rSrc16

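This system instruction adjusts the RPL (Requested Privilege Level), which is the low two bits of a segment selector, by comparing the RPL of the selector in rmDst with the RPL of the selector in rSrc. If the RPL of rSrc is greater than that of rmDst, the RPL of rmDst is raised to match and the Zero flag is set; otherwise the Zero flag is cleared (reset). This instruction can be accessed by an application.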
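
The comparison involves only the low two bits of each selector. The following is a small C model of the instruction's effect, offered as a sketch rather than a definitive description; the function name `arpl` and its return convention (the resulting Zero flag) are choices made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models ARPL: updates *dst and returns the resulting Zero flag. */
bool arpl(uint16_t *dst, uint16_t src)
{
    uint16_t dst_rpl = *dst & 3u;   /* RPL = low two bits of a selector */
    uint16_t src_rpl = src & 3u;

    if (dst_rpl < src_rpl) {
        /* Raise the destination RPL to match the source. */
        *dst = (uint16_t)((*dst & ~3u) | src_rpl);
        return true;                /* ZF = 1 */
    }
    return false;                   /* ZF = 0, destination unchanged */
}
```

A selector of 0008h adjusted against 0023h becomes 000Bh with the Zero flag set, since a larger RPL value means less privilege.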

BOUND — Check Array Index For Bounding Error

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
BOUND								32		32

bound	rSrcA16, mSrcB16 ^ 16
bound	rSrcA32, mSrcB32 ^ 32

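This system instruction checks if the array index rSrcA is within the bounds of the array specified by mSrcB, which holds a lower bound followed by an upper bound. A #BR (Bound Range) exception is triggered if the index is less than the lower bound or greater than the upper bound; both bounds are inclusive.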
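
In C terms the check is a pair of signed comparisons against a two-element record in memory. This sketch models the 32-bit form only; the `Bounds32` type and `bound_faults` function are hypothetical names, and returning a flag stands in for raising the exception.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the 32-bit memory operand: lower bound, then upper bound. */
typedef struct {
    int32_t lower;
    int32_t upper;
} Bounds32;

/* Returns true when BOUND would raise #BR for this index. */
bool bound_faults(int32_t index, const Bounds32 *bounds)
{
    return index < bounds->lower || index > bounds->upper;
}
```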

CLTS — Clear Task Switch Flag

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
CLTS

clts

This system instruction clears the task switch flag, TS (bit #3 of CR0, CR0_TS). The flag is set every time a task switch occurs, and this instruction is used to clear it. It is used in conjunction with the synchronization of the task switch with the FPU.

HLT — Halt Processor

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
HLT

hlt

This is a system instruction that stops the processor and puts it into a halt state.

UD2 — Undefined Instruction

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
UD2

ud2

UD2 is an undefined instruction and guaranteed to throw an opcode exception in all modes.

INVLPG — Invalidate TLB

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
INVLPG

invlpg mSrc

This instruction invalidates the TLB (Translation Lookaside Buffer) entry for the page containing the address mSrc.

LAR — Load Access Rights

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LAR

lar	rDst16, rmSrc16
lar	rDst32, rmSrc32
lar	rDst64, rmSrc64

This system instruction copies the access rights from the segment descriptor referenced by the source rmSrc, stores them in the destination rDst, and sets the zero flag. This instruction can only be called from Protected Mode.

LOCK — Assert Lock # Signal Prefix

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LOCK

lock

This system instruction is a code prefix to turn the trailing instruction into an atomic instruction. In a multiprocessor environment it ensures that the processor using the lock has exclusive access to memory shared with the other processor.

This prefix can only be used with the following instructions, and only when they are performing a write operation to memory: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XADD, XCHG, XOR.

This prefix works best with a read-modify-write operation such as the BTS instruction.
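
Application code normally reaches locked read-modify-write operations through the language's atomic facilities rather than by writing the prefix directly. The sketch below uses the standard C11 `<stdatomic.h>` header to perform an atomic test-and-set of one bit, the same kind of operation a locked BTS performs. Whether the compiler emits a LOCK-prefixed instruction for it is an assumption about the toolchain and target, and the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Atomically sets one bit of *word and reports whether it was
   already set beforehand (an atomic test-and-set). */
bool atomic_test_and_set_bit(_Atomic uint32_t *word, unsigned bit)
{
    uint32_t mask = UINT32_C(1) << (bit & 31u);
    uint32_t old  = atomic_fetch_or(word, mask);  /* read-modify-write */
    return (old & mask) != 0;
}
```

The first caller to set a given bit sees false and owns the resource the bit guards; every later caller sees true.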

LSL — Load Segment Limit

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LSL

lsl	rDst16, rmSrc16
lsl	rDst32, rmSrc32
lsl	rDst64, rmSrc64

This system instruction loads the segment limit from the segment descriptor referenced by the selector in the source rmSrc into the destination rDst.

MOV — Move To/From Control Registers

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
MOV CR

mov	cr{0...4}, r32	32
mov	r32, cr{0...4}

This system instruction copies the contents of a control register to a general-purpose register, or the contents of a general-purpose register to a control register.

MOV — Move To/From Debug Registers

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
MOV DR

mov	r32, dr{0...7}	32
mov	dr{0...7}, r32

This system instruction copies the contents of a debug register to a general-purpose register, or the contents of a general-purpose register to a debug register.

STMXCSR — Save MXCSR Register State

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
STMXCSR

stmxcsr mDst32

This system instruction saves the MXCSR control and status register to the destination mDst32. The complement to this instruction is LDMXCSR.

LDMXCSR — Load MXCSR Register State

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LDMXCSR

ldmxcsr mSrc32

This system instruction loads the MXCSR control and status register from the source mSrc32. The complement of this instruction is STMXCSR.

The default value is 00001F80h.
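
That default is easier to read once split into fields: bits 7...12 are the six exception masks, bits 13...14 select the rounding mode, and bit 15 is flush-to-zero. The sketch below decodes a value that is assumed to have been stored with STMXCSR; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MXCSR_DEFAULT    0x00001F80u
#define MXCSR_EXC_MASKS  0x00001F80u   /* bits 7...12: exception masks  */
#define MXCSR_ROUND_MASK 0x00006000u   /* bits 13...14: rounding mode   */
#define MXCSR_FLUSH_ZERO 0x00008000u   /* bit 15: flush to zero         */

/* True when all six SIMD floating-point exceptions are masked. */
bool mxcsr_all_exceptions_masked(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_EXC_MASKS) == MXCSR_EXC_MASKS;
}

/* 0 = nearest, 1 = down, 2 = up, 3 = toward zero (truncate). */
unsigned mxcsr_rounding_mode(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_ROUND_MASK) >> 13;
}
```

The default therefore means all exceptions masked, round to nearest, and flush-to-zero off.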

SGDT/SIDT — Save Global/Interrupt Descriptor Table

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SGDT
SIDT

sgdt	m
sidt	m

The SGDT system instruction copies the Global Descriptor Table Register (GDTR) to the destination. The complement of this instruction is LGDT.

The SIDT system instruction copies the Interrupt Descriptor Table Register (IDTR) to the destination. The complement of this instruction is LIDT.

LGDT/LIDT — Load Global/Interrupt Descriptor Table

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LGDT
LIDT

lgdt	mSrc16 ^ (32/64)
lidt	mSrc16 ^ (32/64)

The LGDT system instruction loads the source mSrc16 into the Global Descriptor Table Register (GDTR).

The LIDT system instruction loads the source mSrc16 into the Interrupt Descriptor Table Register (IDTR).

SLDT — Save Local Descriptor Table

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SLDT

sldt rmDst16

This system instruction copies the segment selector from the Local Descriptor Table Register (LDTR) to the destination rmDst16.

LLDT — Load Local Descriptor Table

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LLDT

lldt rmSrc16

This system instruction loads the source rmSrc16 into the segment selector element of the Local Descriptor Table Register (LDTR). This instruction is only available in Protected Mode.

SMSW — Save Machine Status Word

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SMSW

smsw rmDst16

This system instruction copies the lower 16 bits of control register CR0 into the destination rmDst16.

LMSW — Load Machine Status Word

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LMSW

lmsw rmSrc16

This system instruction loads the lower four bits of the source rmSrc16 and overwrites the lower four bits of the control register CR0.

STR — Save Task Register

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
STR

str rmDst16

This system instruction reads the task register and saves the segment selector value into the 16-bit destination rmDst16. When the destination is a 32-bit register, its upper 16 bits are cleared to zero.

str ax             ; actually stores 0000:AX into EAX

LTR — Load Task Register

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LTR

ltr rmSrc16

This system instruction sets the task register with the segment selector stored in the 16-bit source rmSrc16.

RDMSR — Read from Model Specific Register

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RDMSR

rdmsr

This is a system instruction that may only be run in Privilege Level 0. The Model Specific Register (MSR) indexed by ECX is loaded into the EDX:EAX register pair.

WRMSR — Write to Model Specific Register

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
WRMSR

wrmsr

This system instruction writes the 64-bit value in EDX:EAX to the Model Specific Register specified by the ECX register. In 64-bit mode the lower 32 bits of each 64-bit register, RDX[0...31]:RAX[0...31], form the 64-bit value that is written to the MSR specified by the RCX register.

MSR[ecx] = edx:eax
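
The EDX:EAX pairing is the same high:low split used by RDTSC and RDMSR. A short sketch of joining and splitting the two halves in C; the helper names are hypothetical and nothing here touches a real MSR:

```c
#include <stdint.h>

/* Value written by WRMSR: EDX supplies the high half, EAX the low. */
uint64_t msr_join(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Halves returned by RDMSR for a 64-bit MSR value. */
void msr_split(uint64_t value, uint32_t *edx, uint32_t *eax)
{
    *edx = (uint32_t)(value >> 32);
    *eax = (uint32_t)value;
}
```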

SWAPGS — Swap GS Base Register

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SWAPGS								64		64

swapgs

This system instruction swaps the GS base register value with the value in the MSR at address C0000102H.

SYSCALL — 64-Bit Fast System Call

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SYSCALL										64

syscall

This instruction is a fast 64-bit system call to privilege level 0. It allows code at the lower privilege levels to call code within Privilege Level 0.

SYSRET — Fast Return from 64-Bit Fast System Call

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SYSRET										64

sysret

This instruction is a return from a fast 64-bit system call. It is a complement to SYSCALL.

SYSENTER — Fast System Call

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SYSENTER

sysenter

This instruction is a fast system call to Privilege Level 0. It allows code at the lower privilege levels to call code within Privilege Level 0.

SYSEXIT — Fast Return from Fast System Call

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
SYSEXIT

sysexit

This instruction is a return from a fast system call. It is a complement to SYSENTER.

RSM — Resume from System Management Mode

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
RSM

rsm

This system instruction returns control from the System Management Mode (SMM) back to the operating system or the application that was interrupted by the SMM interrupt.

VERR/VERW — Verify Segment for Reading

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
VERR
VERW

verr	rm16
verw	rm16

mov  ax,cs
verr ax
verw ax

These instructions verify whether the segment referenced by the specified selector (CS, DS, ES, FS, or GS) is readable (VERR) or writeable (VERW). The zero flag is set to 1 if it is and cleared (reset) if it is not. Code segments are never verified as writeable. The stack segment selector (SS) is not an allowed register. These instructions are not available in Real Mode.

LDS/LES/LFS/LGS/LSS — Load Far Pointer

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
LDS
LES
LFS
LGS
LSS

lds	r32Dst, mSrc(16:32)	Protected Mode	48
lds	r16Dst, mSrc(16:16)	Real Mode	32
les	r32Dst, mSrc(16:32)	Protected Mode	48
les	r16Dst, mSrc(16:16)	Real Mode	32
lfs	r64Dst, mSrc(16:64)	64-bit Mode	80
lfs	r32Dst, mSrc(16:32)	64-bit, Protected Mode	48
lfs	r16Dst, mSrc(16:16)	64-bit, Real Mode	32
lgs	r64Dst, mSrc(16:64)	64-bit Mode	80
lgs	r32Dst, mSrc(16:32)	64-bit, Protected Mode	48
lgs	r16Dst, mSrc(16:16)	64-bit, Real Mode	32
lss	r64Dst, mSrc(16:64)	64-bit Mode	80
lss	r32Dst, mSrc(16:32)	64-bit, Protected Mode	48
lss	r16Dst, mSrc(16:16)	64-bit, Real Mode	32

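This is a special memory pointer instruction that loads a far pointer from memory into a register pair: the segment register named by the mnemonic (DS, ES, FS, GS, or SS) receives the selector, and the destination register receives the offset. The form you use is determined by the mode (64-bit, Protected, or Real) your code is written for.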

Flags	O.flow	Sign	Zero	Aux	Parity	Carry
	-	-	-	-	-	-

Flags: None are altered by this opcode.

Protected Mode Win95 programmers do not need to get at the VGA, but if you have an old monochrome adapter plugged into your system this will be handy using Microsoft's secret (unpublished) selector {013fh}, which gets you access to every linear address on your machine {013fh:00000000...0ffffffffh}. This became the data selector for Win 95B and is a Bounds Error for Win32 and Win64 developers.

monoadr dd      0b0000h
monosel dw      013fh


        mov     edi,monoadr
        mov     es,monosel

        les     edi,FWORD PTR monoadr   ; or load both in one instruction

Of course, that pointer is used in a function such as:

        mov     es:[edi],eax
        add     edi,4

Saving that pointer back to the address:

        mov     monoadr,edi
        mov     monosel,es

That was fine and dandy, but the following is a quicker method even though it takes a little organization and is very easy to make a mistake due to its length.

monobase FWORD 013f000b0000h

        les   edi,monobase

The declaration has too many zeros and is a lil' too darn long, don't you think! It almost looks like binary. Loading that address into the pointer is very quick, but trying to save the pointer back to the address isn't so slick and it seems a little murky to me.

        mov     DWORD PTR monobase,edi
        mov     WORD PTR monobase+4,es

An alternate method would be using a data structure such as follows:

;       Protected Mode address (Far)
PMADR   STRUC
        adr     dd       ?         ; PM Address
        sel     dw       ?         ; PM Segment (Selector)
PMADR   ends

monobase PMADR   {000b0000h,013fh} ; Monochrome Base Address

And to actually get the pointer:

        les     edi,FWORD PTR monobase

Save the pointer back:

        mov     monobase.adr,edi
        mov     monobase.sel,es

Now, doesn't that look much cleaner? Assembly coding can get convoluted enough without creating one's own confusion. Now for those Real Mode programmers, a touch of VGA nostalgia:

vgaseg  dw    0a000h

mov     di,0
mov     es,vgaseg

The following code snippet is similar to the previous 32-bit version but scaled down for 16-bit, using the same techniques:

;       Real Mode address (Far)
RMADR   STRUC
        off     dw       ?        ; Real Mode Offset
        rseg    dw       ?        ; Real Mode Segment
RMADR   ends

vgabase RMADR     {0,0a000h}      ; VGA Base Address

les     di,vgabase

And using that pointer:

        mov     es:[di],ax
        add     di,2

Hyperthreading Instructions

The scope of hyperthreading instructions is not covered in this book. Refer to the Intel-specific documentation for full specifications. They are considered OS/System instructions and as such will not be discussed in this book. They are accessible by the application layer at the low privilege level but are not part of the general application development process. They are only referenced here for informational purposes.

MONITOR — Monitor

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
MONITOR

monitor

This system instruction sets up a hardware monitor using an address stored in the EAX register and arms the monitor. Registers ECX and EDX contain information to be sent to the monitor. This is accessible at any privilege level unless the MONITOR flag in the CPUID is not set, indicating the processor does not support this instruction. This instruction is used in conjunction with the instruction MWAIT.

MWAIT — Wait

Mnemonic	P	PII	K6	3D!	3Mx+	SSE	SSE2	A64	SSE3	E64T
MWAIT

mwait

This system instruction is similar to a NOP but works in conjunction with the MONITOR instruction for signaling to a hardware monitor. It is a hint to the processor that it is okay to stop instruction execution until a monitor-related event occurs. MONITOR can be used by itself, but if MWAIT is used, only one MWAIT instruction should follow each MONITOR instruction (especially in a loop).
