There are other instructions available in your processor, but they have little to no relationship to your application code. As mentioned at the beginning of this book, there are basically three types of instructions: general-purpose, floating-point, and system instructions. (Note that I am oversimplifying here!) The latter exist for writing system-level code, that is, operating system code. They are not typically accessible to, or needed by, programmers writing non-operating system code. As this book is not targeted at that market, there is no need to make the standard application programmer wade through them. But as some of you may just cry foul, I have included a very light overview of these instructions. Besides, there are some tidbits in here for all of you!
Chapter 3, "Processor Differential Insight," as well as Chapter 16, "What CPUID?" gave some background on the processor. We shall now continue with that information. Some of what is included here is not just for system programmers, as some features of the 80x86 are system related but are accessible from the application level. Note the System "Lite" part? Keep in mind that this is a superficial overview. If you need an in-depth explanation, please refer to the documentation directly from the manufacturer.
rdpmc
This instruction loads the 40-bit performance monitoring counter indexed by ECX into the EDX:EAX register pair. For 64-bit mode, RDX[0...31]:RAX[0...31]=[RCX]. This instruction is accessible from any layer inclusive of the application layer only if the PCE flag in CR4 is set. When the flag is clear, this instruction can only be run from privilege level 0.
rdtsc
This system instruction reads the 64-bit time-stamp counter and loads the value into the EDX:EAX registers. The counter is incremented every clock cycle and is cleared (reset) to zero upon the processor being reset. This instruction is accessible from any layer inclusive of the application layer unless the TSD flag in CR4 is set. So far while running under Win32 the flag has been clear as a default, thus allowing an application to access this instruction.
    ; void CpuDelaySet(void)
            public  CpuDelaySet
    CpuDelaySet proc near
            rdtsc                   ; Read time-stamp counter
            mov     tclkl,eax       ; Save low 32 bits
            mov     tclkh,edx       ; Save high 32 bits
            ret
    CpuDelaySet endp
    ; long int CpuDelayCalc(void)
    ;
    ; This function is called after CpuDelaySet() to get the
    ; elapsed interval in clock cycles.
    ;
    ; Note: On a 400MHz computer, only reading the lower 32 bits
    ; gives a maximum of roughly 10 seconds of sampling before rollover.
            public  CpuDelayCalc
    CpuDelayCalc proc near
            rdtsc                   ; Read time-stamp counter
            sub     eax,tclkl
            sbb     edx,tclkh       ; edx:eax = total elapsed interval
            ret                     ; return edx:eax = 64 bits of info.
    CpuDelayCalc endp
These two functions can be used for time trials while optimizing code. Keep in mind that in a multithreaded environment another thread or an interrupt can steal your time slice while you are timing a bit of code, which inflates the measurement. You could divide the total delay by the number of loops to get an average delay per loop. What I like to do is run a benchmark that executes the same code a few thousand times, ignoring the effects the prefetch has on these times or the fact that the Nth time around the data is already sitting in memory. One time I took the governor off an MPEG decoder so it would run full speed, allowing code to be optimized so that it would run faster and faster.
The following code snippet can be included within your own code for determining computer speed. The computer quite often is not running at the speed you may think. I had a weird problem in an application running on my laptop and it did not make any sense until I wrote this code. Even then I thought it had a bug until I realized the laptop had a thermal problem and dropped its computer speed by 50% or more so as to run cooler. Clients running your application may have some weird problems or be misinformed of their machines' capabilities, and this code can give you or customer support representatives more debugging insight.
    typedef struct SpeedDataType
    {
        uint tSpeed;
        uint tSpeedState;
        uint nCnt;
        uint wTimerID;
    } SpeedData;

    // Win32 Timer - Calculate CPU Speed
    void CALLBACK SpeedCalcTimer(UINT wTimerID, UINT msg,
                    DWORD dwUser, DWORD dw1, DWORD dw2)
    {
        SpeedData *sp = (SpeedData *)dwUser;

        if (sp->wTimerID != wTimerID)   // Is this our timer ID?
        {
            return;
        }

        switch(sp->tSpeedState)
        {
        case 2:     // 2nd tick (avg of the two intervals)
            sp->tSpeed = (CpuDelayCalc() + sp->tSpeed) >> 1;
            sp->nCnt++;
            CpuDelaySet();
            break;

        case 1:     // 1st tick
            sp->tSpeed = CpuDelayCalc();
            sp->nCnt++;
                    // Allow flow through!
        case 0:     // Starting tick
            CpuDelaySet();
            sp->tSpeedState++;
            break;

        default:
            break;
        }
    }

    // Be VERY careful when this is called, as your OS may not like it!
    uint SpeedCalc(void)
    {
        TIMECAPS tc;
        uint wTimerRes, nCnt;
        SpeedData sd;

        wTimerRes = 1;
            // Set the timer resolution for the multimedia timer
        if (TIMERR_NOERROR == timeGetDevCaps(&tc, sizeof(TIMECAPS)))
        {
            wTimerRes = min(max(tc.wPeriodMin, 1), tc.wPeriodMax);
            timeBeginPeriod(wTimerRes);     // 1ms resolution
        }

        sd.nCnt = sd.tSpeed = sd.tSpeedState = 0;
        sd.wTimerID = timeSetEvent(1, wTimerRes, SpeedCalcTimer,
                        (DWORD)&sd, TIME_PERIODIC | TIME_KILL_SYNCHRONOUS);

        if (sd.wTimerID)        // If we were given a TimerId
        {                       // (Should not fail!)
            do {
                nCnt = sd.nCnt;
                Sleep(10);              // Sleep 10ms

                if (sd.nCnt > 100)      // Cycle 100 times
                {
                    timeKillEvent(sd.wTimerID);
                    return sd.tSpeed/1000;
                }
            } while (nCnt != sd.nCnt);  // If the same, the timer failed!

            timeKillEvent(sd.wTimerID);
        }

            // Didn't work? Try it the really not-so-accurate way!
        CpuDelaySet();
        Sleep(10);
        return CpuDelayCalc()/10000;
    }
The Intel and AMD processors have similar functional architecture. Different processors have different numbers of caches, on chip cache, off chip cache, different speeds, different instruction sets, different methods of pipelining instructions. All this book is interested in is helping you, the application programmer, make your code go fast by writing it in assembly. You have no control over what flavor of processor the user of your application chooses to run their applications on. (Of course you could program your application to check these parameters and refuse to run on a system you do not like! But that would be evil!)
    test ebx,ebx
    mov ecx,ebx
    mov esi,ebx
    mov edi,ebx
    test ebx,ebx
The use of full registers (such as in the above 32-bit code snippet in Protected Mode) allows instructions to be able to be executed on the same clock.
Partial stalls occur if a short version of a register is written to and then immediately followed by a larger version. For example:
    mov al,9
    add bx,ax       ; clock stall

    mov al,9
    add ebx,eax     ; clock stall

    mov ax,9
    add ebx,eax     ; clock stall
Writing to a short form of a register, such as AL, causes a partial stall in the next instruction if that instruction reads a larger form of the same register, such as AX, EAX, or RAX. This is like being at a red signal light in your car and when the light turns green you slam down on the accelerator; your car will sputter, spit a little, hesitate (stall), and then finally accelerate.
EFLAG | Code | Bit | Flag Descriptions |
---|---|---|---|
EFLAGS_CF | 000000001h | 0 | Carry |
(reserved) | 000000002h | 1 | Always 1 |
EFLAGS_PF | 000000004h | 2 | Parity |
(reserved) | 000000008h | 3 | Always 0 |
EFLAGS_AF | 000000010h | 4 | Auxiliary Carry |
(reserved) | 000000020h | 5 | Always 0 |
EFLAGS_ZF | 000000040h | 6 | Zero |
EFLAGS_SF | 000000080h | 7 | Sign |
EFLAGS_TF | 000000100h | 8 | Trap |
EFLAGS_IF | 000000200h | 9 | Interrupt Enable |
EFLAGS_DF | 000000400h | 10 | Direction |
EFLAGS_OF | 000000800h | 11 | Overflow |
EFLAGS_IOPL | 000003000h | 12, 13 | I/O Privilege Level |
EFLAGS_NT | 000004000h | 14 | Nested Task |
(reserved) | 000008000h | 15 | Always 0 |
EFLAGS_RF | 000010000h | 16 | Resume |
EFLAGS_VM | 000020000h | 17 | Virtual-8086 Mode |
EFLAGS_AC | 000040000h | 18 | Alignment Check |
EFLAGS_VIF | 000080000h | 19 | Virtual Interrupt |
EFLAGS_VIP | 000100000h | 20 | Virtual Interrupt Pending |
EFLAGS_ID | 000200000h | 21 | CPUID |
(reserved) | | 22...31 | Always 0 |
And in 64-bit mode the upper 32 bits of the RFLAGS register (0:EFLAGS):
32...63 | RFLAG (extra) bits |
The 386 and above have layers of protection referred to as protection rings.
The inner ring #0 contains the operating system kernel. The two middle rings (#1 and #2) contain the operating system services (device drivers), and the outer ring #3 is where the application (user code) resides. The ring numbers are also referred to as privilege levels with 0 being the highest and 3 being the lowest.
An application can access functions in the other rings by means of a gate. The SYSCALL and SYSENTER instructions are two such methods. This is a protection system to protect the inner rings from the outer. You know, to keep the riffraff out! Any attempt to access an inner ring without going through a gate will cause a general protection fault.
There are four control registers {CR0, CR2, CR3, CR4} that control system level operations. Note that CR1 is reserved.
Table 18-1. Control register 0 (CR0) extensions
CR0 | Code | Bit | Flag Descriptions |
---|---|---|---|
CR0_PE | 000000001h | 0 | Protection Enable |
CR0_MP | 000000002h | 1 | Monitor Coprocessor |
CR0_EM | 000000004h | 2 | Emulation |
CR0_TS | 000000008h | 3 | Task Switched |
CR0_ET | 000000010h | 4 | Extension Type |
CR0_NE | 000000020h | 5 | Numeric Error |
(reserved) | | 6...15 | |
CR0_WP | 000010000h | 16 | Write Protect |
(reserved) | | 17 | |
CR0_AM | 000040000h | 18 | Alignment Mask |
(reserved) | | 19...28 | |
CR0_NW | 020000000h | 29 | Not Write-Through |
CR0_CD | 040000000h | 30 | Cache Disable |
CR0_PG | 080000000h | 31 | Paging |
And in 64-bit mode the upper 32 bits of the CR0 register (0:CR0):
32...63 |
Control register 2 (CR2) is a 32/64-bit page fault linear address.
Table 18-2. Control register 3 (CR3) extensions
CR3 | Code | Bit | Flag Descriptions |
---|---|---|---|
(reserved) | | 0...2 | |
CR3_PWT | 000000008h | 3 | Page-level Write-Through |
CR3_PCD | 000000010h | 4 | Page-level Cache Disable |
Page Dir. Base | | 12...31 | |
And in 64-bit mode the upper 32 bits of the CR3 register (0:CR3):
32...63 | CR3 (extra) bits |
Table 18-3. Control register 4 (CR4) extensions
CR4 | Code | Bit | Flag Descriptions |
---|---|---|---|
CR4_VME | 000000001h | 0 | Virtual-8086 Mode Ext. |
CR4_PVI | 000000002h | 1 | Protected Virtual Int. |
CR4_TSD | 000000004h | 2 | Time Stamp Disable |
CR4_DE | 000000008h | 3 | Debugging Extensions |
CR4_PSE | 000000010h | 4 | Page Size Extension |
CR4_PAE | 000000020h | 5 | Physical Address Ext. |
CR4_MCE | 000000040h | 6 | Machine Check Enable |
CR4_PGE | 000000080h | 7 | Global Page Enable |
CR4_PCE | 000000100h | 8 | RDPMC Enabled |
CR4_OSFXSR | 000000200h | 9 | FXSAVE, FXRSTOR |
CR4_OSXMMEXCPT | 000000400h | 10 | Unmasked SIMD FP Exception |
And in 64-bit mode the upper 32 bits of the CR4 register (0:CR4):
32...63 | CR4 (extra) bits |
There are eight debug registers: {DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7}. Knowing them is unimportant as you are most likely using a debugger to develop your application, not building a debugger. These are privileged resources and only accessible at the system level to set up and monitor the breakpoints {0...3}.
Several mechanisms have been put into place to squeeze optimal throughput from the processors. One method of cache manipulation discussed in Chapter 10, "Branching," is Intel's hint as to the prediction of logic flow through branches counter to the static prediction logic. Another mechanism is a hint to the processor about cache behavior so as to give the processor insight into how a particular piece of code is utilizing memory access. Here is a brief review of some terms that have already been discussed:
Temporal data — Memory that requires multiple accesses and therefore needs to be loaded into a cache for better throughput.
Non-temporal hint — A hint (an indicator) to the processor that memory only requires a single access (one shot). This would be similar to copying a block of memory or performing a calculation, but the result is not going to be needed for a while so there is no need to write it into the cache. Thus, the memory access has no need to read and load cache, and therefore the code can be faster.
For speed and efficiency, when memory is accessed for read or write a cache line containing that data (whose length is dependent upon manufacturer and version) is copied from system memory to high-speed cache memory. The processor performs read/write operations on the cache memory. When a cache line is invalidated, the write back of that cache line to system memory occurs. In a multiprocessor system, this occurs frequently due to non-sharing of internal caches. The second stage of writing the cache line back to system memory is called a "write back."
Different processors have different cache sizes for data and for code. These are dependent upon processor model, manufacturer, etc., as shown below:
CPU | L1 Cache (Data / Code) | L2 Cache |
---|---|---|
Celeron | 16KB / 16KB | 256KB |
Pentium 4 | 8KB / 12Kμops | 512KB |
Athlon XP | 64KB / 64KB | 256KB |
Duron | 64KB / 64KB | 64KB |
Pentium M | 32KB / 32KB | 1024KB |
Xeon | | 512KB |
Depending on your code and level of optimization, the size of the cache may be of importance. For the purposes of this book, however, it is being ignored, as that topic is more suitable for a book very specifically targeting heavy-duty optimization. This book, however, is interested in the cache line size as that is more along the lightweight optimization that has been touched on from time to time. It should be noted that AMD uses a minimum size of 32 bytes.
The (code/data) cache line size determines how many instruction/data bytes can be preloaded.
Intel | Cache Line Size |
---|---|
PIII | 32 |
Pentium M | 64 |
P4 | 64 |
Xeon | 64 |
AMD | Cache Line Size |
---|---|
Athlon | 64 |
Opteron | 64 |
The cache line size can be obtained by using the CPUID instruction with EAX set to 1. The following calculation will give you the actual cache line size.
    mov eax,1
    cpuid
    and ebx,00000FF00h
    shr ebx,8-3     ; ebx = size of cache line
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
PREFETCH | ||||||||||
PREFETCHNTA | ||||||||||
PREFETCHT0 | ||||||||||
PREFETCHT1 | ||||||||||
PREFETCHT2 | ||||||||||
PREFETCHW |
3DNow! | prefetch | mSrc8 |
3DNow! | prefetchw | mSrc8 |
SSE | prefetcht0 | mSrc8 |
SSE | prefetcht1 | mSrc8 |
SSE | prefetcht2 | mSrc8 |
SSE | prefetchnta | mSrc8 |
The PREFETCHNTA instruction performs a non-temporal hint to the processor with respect to all the caches, to load from system memory mSrc8 into the first-level cache for a PIII or a second-level cache for a P4 or Xeon processor.
The PREFETCHT0 instruction performs a temporal hint to the processor to load from system memory mSrc8 into the first- or second-level cache for a PIII, or a second-level cache for a P4 or Xeon processor.
The PREFETCHT1 instruction performs a temporal hint to the processor with respect to the first-level cache to load from system memory mSrc8 into the second-level cache for PIII, P4, or Xeon processor.
The PREFETCHT2 instruction performs a temporal hint to the processor with respect to the second-level cache to load from system memory mSrc8 into the second-level cache for a PIII, P4, or Xeon processor.
If data is already loaded at the same or higher cache, then no operation is performed.
AMD processors alias PREFETCHT1 and PREFETCHT2 instructions to the PREFETCHT0 instructions, so they all have the PREFETCHT0 functionality.
The 3DNow! PREFETCH instruction loads a cache line into the L1 data cache from the mSrc8.
The 3DNow! PREFETCHW instruction loads a cache line into the L1 data cache from the mSrc8 but sets a hint indicating that it is for write operations.
lfence
This instruction is similar to the MFENCE instruction, but it acts as a barrier only between memory load instructions issued before and after the LFENCE or MFENCE instruction.
sfence
This instruction is similar to the MFENCE instruction, but it acts as a barrier only between memory store instructions issued before and after the SFENCE or MFENCE instruction.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
MFENCE |
mfence
This instruction is a barrier (fence) to isolate system memory to and from cache memory operations that occur before and after this instruction.
clflush mSrc8
This instruction invalidates the cache line (code or data) containing the linear address specified by mSrc8. If the line is dirty, that is, it differs from system memory because it has been written to, it is written back to system memory. This instruction is ordered by the MFENCE instruction. Check bit #19 (CLFSH) of the EDX feature flags returned by CPUID with EAX set to 1 to see if this instruction is available.
System instructions are beyond the scope of this book. Refer to the Intel- and AMD-specific documentation for full specifications. They are considered OS/System instructions and as such are not discussed in detail here. Some are accessible by the application layer at the low privilege level but are not part of the general application development process. They are only referenced here for informational purposes and to ensure this book lists all instructions available at the time of its publication.
arpl rmDst16, rSrc16
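This system instruction adjusts the RPL (Requested Privilege Level) field, bits 0 and 1, of the segment selector in rmDst by comparing it with the RPL field of the selector in rSrc. If the RPL of rmDst is less than the RPL of rSrc, the RPL of rmDst is raised to match and the Zero flag is set; otherwise the Zero flag is cleared (reset). This instruction can be accessed by an application.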
bound | rSrcA16, mSrcB16 ^ 16 |
bound | rSrcA32, mSrcB32 ^ 32 |
This system instruction checks if the array index rSrcA is within the bounds of the array specified by mSrcB. A #BR (Bounds Range) exception is triggered if the index falls outside the inclusive lower and upper bounds.
clts
This system instruction clears the task switched flag TS, bit #3 of CR0 (CR0_TS). The operating system sets this flag every time a task switch occurs and this instruction is used to clear it. It is used to synchronize the saving of the FPU state with the task switch.
hlt
This is a system instruction that stops the processor and puts it into a halt state.
ud2
UD2 is an undefined instruction and guaranteed to throw an opcode exception in all modes.
invlpg mSrc
This instruction invalidates the TLB (Translation Lookaside Buffer) page referenced by mSrc.
lar | rDst16, rmSrc16 |
lar | rDst32, rmSrc32 |
lar | rDst64, rmSrc64 |
This system instruction copies the access rights from the segment descriptor referenced by the source rmSrc, stores them in the destination rDst, and sets the zero flag. This instruction can only be called from Protected Mode.
lock
This system instruction is a code prefix to turn the trailing instruction into an atomic instruction. In a multiprocessor environment it ensures that the processor using the lock has exclusive access to memory shared with the other processor.
This prefix can only be used with the following instructions and only when they are performing a write operation to memory: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, XCHG.
This prefix works best with a read-modify-write operation such as the BTS instruction.
lsl | rDst16, rmSrc16 |
lsl | rDst32, rmSrc32 |
lsl | rDst64, rmSrc64 |
This system instruction loads the segment limit from the segment descriptor referenced by the source rmSrc into the destination rDst and sets the zero flag.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
MOV CR |
mov | cr{0...4}, r32 | 32 |
mov | r32, cr{0...4} |
This system instruction copies the value of a control register to a general-purpose register or of a general-purpose register to a control register.
mov | r32, dr{0...7} | 32 |
mov | dr{0...7}, r32 |
This system instruction copies the value of a debug register to a general-purpose register or of a general-purpose register to a debug register.
stmxcsr mDst32
This system instruction saves the MXCSR control and status register to the destination mDst32. The complement to this instruction is LDMXCSR.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
LDMXCSR |
ldmxcsr mSrc32
This system instruction loads the MXCSR control and status register from the source mSrc32. The complement of this instruction is STMXCSR.
The default value is 00001F80h.
sgdt | m |
sidt | m |
The SGDT system instruction copies the Global Descriptor Table Register (GDTR) to the destination. The complement of this instruction is LGDT.
The SIDT system instruction copies the Interrupt Descriptor Table Register (IDTR) to the destination. The complement of this instruction is LIDT.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
LGDT | ||||||||||
LIDT |
lgdt | mSrc16 ^ (32/64) |
lidt | mSrc16 ^ (32/64) |
The LGDT system instruction loads the source mSrc16 into the Global Descriptor Table Register (GDTR).
The LIDT system instruction loads the source mSrc16 into the Interrupt Descriptor Table Register (IDTR).
sldt | rmDst16 |
This system instruction copies the segment selector from the Local Descriptor Table Register (LDTR) to the destination rmDst16.
lldt | rmSrc16 |
This system instruction loads the source rmSrc16 into the segment selector element of the Local Descriptor Table Register (LDTR). This instruction is only available in Protected Mode.
smsw rmDst16
This system instruction copies the lower 16 bits of control register CR0 into the destination rmDst16.
lmsw rmSrc16
This system instruction loads the lower four bits of the source rmSrc16 and overwrites the lower four bits of the control register CR0.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
STR |
str rmDst16
This system instruction reads the task register and saves the segment selector value into the 16-bit destination rmDst16. In the 32-bit form, the upper 16 bits of the destination register are cleared to zero.
str ax ; actually stores 0000:AX into EAX
ltr rmSrc16
This system instruction sets the task register with the segment selector stored in the 16-bit source rmSrc16.
rdmsr
This is a system instruction that may only be run in Privilege Level 0. The Model Specific Register (MSR) indexed by ECX is loaded into the EDX:EAX register pair.
wrmsr
This system instruction writes the 64-bit value in EDX:EAX to the Model Specific Register specified by the ECX register. In 64-bit mode the lower 32 bits of each 64-bit register, RDX[0...31]:RAX[0...31], form the 64-bit value that is written to the MSR specified by the RCX register.
MSR[ecx] = edx:eax
swapgs
This system instruction swaps the GS register value with the value in the MSR address C0000102H.
syscall
This instruction is a fast 64-bit system call to privilege level 0. It allows code at the lower privilege levels to call code within Privilege Level 0.
sysret
This instruction is a return from a fast 64-bit system call. It is a complement to SYSCALL.
sysenter
This instruction is a fast system call to Privilege Level 0. It allows code at the lower privilege levels to call code within Privilege Level 0.
sysexit
This instruction is a return from a fast system call. It is a complement to SYSENTER.
rsm
This system instruction returns control from the System Management Mode (SMM) back to the operating system or the application that was interrupted by the SMM interrupt.
verr | rm16 |
verw | rm16 |
    mov ax,cs
    verr ax
    verw ax
These instructions verify whether the segment referenced by the specified selector (CS, DS, ES, FS, or GS) is readable (VERR) or writeable (VERW). The zero flag is set to 1 if it is and cleared (reset) if it is not. Code segments are never verified as writeable. The stack segment selector (SS) is not an allowed register. These instructions are not available in Real Mode.
lds | r32Dst, mSrc(16:32) | Protected Mode | 48 |
lds | r16Dst, mSrc(16:16) | Real Mode | 32 |
les | r32Dst, mSrc(16:32) | Protected Mode | 48 |
les | r16Dst, mSrc(16:16) | Real Mode | 32 |
lfs | r64Dst, mSrc(16:64) | 64-bit Mode | 80 |
lfs | r32Dst, mSrc(16:32) | 64-bit, Protected Mode | 48 |
lfs | r16Dst, mSrc(16:16) | 64-bit, Real Mode | 32 |
lgs | r64Dst, mSrc(16:64) | 64-bit Mode | 80 |
lgs | r32Dst, mSrc(16:32) | 64-bit, Protected Mode | 48 |
lgs | r16Dst, mSrc(16:16) | 64-bit, Real Mode | 32 |
lss | r64Dst, mSrc(16:64) | 64-bit Mode | 80 |
lss | r32Dst, mSrc(16:32) | 64-bit, Protected Mode | 48 |
lss | r16Dst, mSrc(16:16) | 64-bit, Real Mode | 32 |
This is a special memory pointer instruction that loads a far pointer (a segment selector plus an offset) from memory into a segment register and general-purpose register pair. The form you use is determined by the mode (64-bit, Protected, or Real) your code is written for.
Flags | O.flow | Sign | Zero | Aux | Parity | Carry |
---|---|---|---|---|---|---|
- | - | - | - | - | - |
Flags: None are altered by this opcode.
Protected Mode Win95 programmers do not need to get at the VGA, but if you have an old monochrome adapter plugged into your system this will be handy using Microsoft's secret (unpublished) selector {013fh}, which gets you access to every linear address on your machine {013fh:00000000...0ffffffffh}. This became the data selector for Win 95B and is a Bounds Error for Win32 and Win64 developers.
    monoadr dd 0b0000h
    monosel dw 013fh

    mov edi,monoadr
    mov es,monosel
or
    les edi,FWORD PTR monoadr
Of course, that pointer is used in a function such as:
    mov es:[edi],eax
    add edi,4
Saving that pointer back to the address:
    mov monoadr,edi
    mov monosel,es
That was fine and dandy, but the following is a quicker method even though it takes a little organization and is very easy to make a mistake due to its length.
    monobase FWORD 013f000b0000h

    les edi,monobase
The declaration has too many zeros and is a lil' too darn long, don't you think! It almost looks like binary. Loading that address into the pointer is very quick, but trying to save the pointer back to the address isn't so slick and it seems a little murky to me.
    mov DWORD PTR monobase,edi
    mov WORD PTR monobase+4,es
An alternate method would be using a data structure such as follows:
    ; Protected Mode address (Far)
    PMADR STRUC
    adr dd ?        ; PM Address
    sel dw ?        ; PM Segment (Selector)
    PMADR ends

    monobase PMADR {000b0000h,013fh}    ; Monochrome Base Address
And to actually get the pointer:
les edi,FWORD PTR monobase
Save the pointer back:
    mov monobase.adr,edi
    mov monobase.sel,es
Now, doesn't that look much cleaner? Assembly coding can get convoluted enough without creating one's own confusion. Now for those Real Mode programmers, a touch of VGA nostalgia:
    vgaseg dw 0a000h

    mov di,0
    mov es,vgaseg
The following code snippet is similar to the previous 32-bit version but scaled down for 16-bit. Using the same techniques:
    ; Real Mode address (Far)
    RMADR STRUC
    off dw ?        ; Real Mode Offset
    rseg dw ?       ; Real Mode Segment
    RMADR ends

    vgabase RMADR {0,0a000h}    ; VGA Base Address

    les di,vgabase
And using that pointer:
    mov es:[di],ax
    add di,2
Hyperthreading instructions are beyond the scope of this book. Refer to the Intel-specific documentation for full specifications. They are considered OS/System instructions and as such are not discussed in detail here. They are accessible by the application layer at the low privilege level but are not part of the general application development process. They are only referenced here for informational purposes.
monitor
This system instruction sets up a hardware monitor using an address stored in the EAX register and arms the monitor. Registers ECX and EDX contain information to be sent to the monitor. This is accessible at any privilege level unless the MONITOR flag in the CPUID is not set, indicating the processor does not support this instruction. This instruction is used in conjunction with the instruction MWAIT.
Mnemonic | P | PII | K6 | 3D! | 3Mx+ | SSE | SSE2 | A64 | SSE3 | E64T |
---|---|---|---|---|---|---|---|---|---|---|
MWAIT |
mwait
This system instruction is similar to a NOP but works in conjunction with the MONITOR instruction for signaling to a hardware monitor. It is a hint to the processor that it is okay to stop instruction execution until a monitor related event. MONITOR can be used by itself, but if MWAIT is used, only one MWAIT instruction follows a MONITOR instruction (especially in a loop).