Most modern personal computers contain a processor supporting either the Intel or AMD version of the x86 32-bit and x64 64-bit architectures. In contrast, almost all smartphones, smartwatches, tablets, and many embedded systems contain ARM 32-bit or 64-bit processors. This chapter takes a detailed look at the registers and instruction sets of these processor families.
After completing this chapter, you will understand the high-level architectures and unique attributes of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, assembly languages, and key aspects of legacy features supported in these architectures.
This chapter covers the following topics:
The files for this chapter, including the answers to the exercises, are available at https://github.com/PacktPublishing/Modern-Computer-Architecture-and-Organization-Second-Edition.
For the purposes of this discussion, the term x86 refers to the 16-bit and 32-bit instruction set architecture of the series of processors that began with the Intel 8086, introduced in 1978. The 8088, released in 1979, is functionally very similar to the 8086, except it has an 8-bit data bus instead of the 16-bit bus of the 8086. The 8088 was the central processor in the original IBM PC.
Subsequent generations of this processor series were named 80186, 80286, 80386, and 80486, leading to the term “x86” as shorthand for members of the family. Later generations dropped the numeric naming convention and received names such as Pentium, Core, i Series, Celeron, and Xeon.
Advanced Micro Devices (AMD), a semiconductor manufacturing company that competes with Intel, has been producing x86-compatible processors since 1982. Some recent AMD x86 processor generations have been named Ryzen, Opteron, Athlon, Turion, Phenom, and Sempron.
Intel and AMD processors are largely compatible at the level of code execution. There are, however, some key differences between processors from the two vendors, including the chip pin configuration and chipset compatibility.
In general, Intel processors only work in motherboards and with chipsets designed for Intel chips, and AMD processors only work in motherboards and with chipsets designed for AMD chips. We will highlight some other differences between Intel and AMD processors later in this section.
The 8086 and 8088 are 16-bit processors, despite the 8-bit data bus of the 8088. Internal registers in these processors are 16 bits wide and the instruction set operates on 16-bit data values. The 8088 transparently executes two bus cycles to transfer each 16-bit value between the processor and memory.
The 8086 and 8088 do not support the more sophisticated features of modern processors such as paged virtual memory and protection rings. These early processors also have only 20 address lines, limiting the addressable memory to 1 MB. A 20-bit address cannot fit in a 16-bit register, so it is necessary to use a somewhat complicated system of segment registers and offsets to access the full 1 MB address space.
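The segment-and-offset scheme can be sketched in a few lines of Python. This is an illustration rather than code from the chapter, and the function name real_mode_address is hypothetical. It relies on the way the 8086 forms a physical address: the 16-bit segment value is shifted left by 4 bits and added to the 16-bit offset, and only 20 bits of the sum reach the address lines.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a 16-bit segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Shift the segment left 4 bits (multiply by 16), add the offset,
    # and keep only the 20 bits carried by the 8086 address lines.
    return ((segment << 4) + offset) & 0xFFFFF


# Different segment:offset pairs can refer to the same physical byte.
print(hex(real_mode_address(0x1234, 0x0010)))
print(hex(real_mode_address(0x1235, 0x0000)))
```

Both calls produce the same physical address, which shows why a given byte of the 1 MB space has many segment:offset names.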
In 1985, Intel released the 80386 with enhancements that mitigate many of these limitations. The 80386 introduced these features:
Modern x86 processors boot into the 16-bit operating mode of the original 8086, which is now called real mode. This mode retains compatibility with software written for the 8086/8088 environment, such as the MS-DOS operating system.
In most modern systems running on x86 processors, a transition to protected mode occurs during system startup. Once in protected mode, the operating system remains in protected mode until the computer shuts down.
MS-DOS ON A MODERN PC
Although the x86 processor in a modern PC is compatible at the instruction level with the original 8088, running an old copy of MS-DOS on a modern computer system is unlikely to be a straightforward process. The peripheral devices and their interfaces in modern PCs are not compatible with the corresponding interfaces in PCs from the 1980s. MS-DOS would need a driver that understands how to interact with the USB-connected keyboard of a modern motherboard, for example.
These days, the primary use for 16-bit mode in x86 processors is to serve as a bootloader for a protected mode operating system. Because most developers of computerized devices and the software that runs on them are unlikely to be involved in implementing such a capability, the remainder of our x86 discussion in this chapter will address protected mode and the associated 32-bit flat memory model.
The x86 architecture supports unsigned and signed two’s complement integer data types with widths of 8, 16, 32, 64, and 128 bits. The names assigned to these data types are as follows:
In most cases, the x86 architecture does not mandate the storage of these data types on natural boundaries. The natural boundary of a data type is any address evenly divisible by the size of the data type in bytes.
Storing any of the multi-byte types at unaligned boundaries is allowed but is discouraged because it causes a negative performance impact: instructions operating on unaligned data consume additional clock cycles. A few instructions that operate on double quadwords require naturally aligned storage and will generate a general protection fault if unaligned access is attempted.
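The definition of a natural boundary translates directly into a divisibility check. The following Python sketch is only an illustration; the function name is_naturally_aligned is an assumption made for this example.

```python
def is_naturally_aligned(address: int, size_in_bytes: int) -> bool:
    """Return True if address is evenly divisible by the data type size."""
    if size_in_bytes <= 0:
        raise ValueError("size_in_bytes must be positive")
    return address % size_in_bytes == 0


# A doubleword is 4 bytes, a word is 2 bytes, a double quadword is 16 bytes.
for address, size in [(0x1000, 4), (0x1002, 4), (0x1002, 2), (0x1008, 16)]:
    state = "aligned" if is_naturally_aligned(address, size) else "unaligned"
    print(f"{size:2d}-byte value at {address:#06x}: {state}")
```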
x86 natively supports floating-point data types in widths of 16, 32, 64, and 80 bits. The 32-bit, 64-bit, and 80-bit formats are those presented in Chapter 9, Specialized Processor Extensions. The 16-bit format is called half-precision floating-point and has a sign bit, a 5-bit exponent, and a 10-bit stored mantissa that, together with an implied leading 1 bit, provides 11 bits of precision. The half-precision floating-point format is used extensively in GPU processing.
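To make the half-precision layout concrete, the following Python sketch splits a 16-bit pattern into its sign, exponent, and stored mantissa fields and rebuilds the value. The function name decode_half is hypothetical, and the sketch handles only normal numbers (not zero, subnormals, infinities, or NaNs). It assumes the standard IEEE 754 exponent bias of 15 and cross-checks each result against the "e" format of Python's struct module.

```python
import struct


def decode_half(bits: int) -> float:
    """Decode a normal half-precision value from its 16-bit pattern."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5-bit exponent, bias of 15
    fraction = bits & 0x3FF          # 10 stored mantissa bits
    if exponent in (0, 0x1F):
        raise ValueError("only normal numbers are handled in this sketch")
    # The implied leading 1 bit supplies the 11th bit of precision.
    magnitude = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -magnitude if sign else magnitude


for bits in (0x3C00, 0xC000, 0x4248):
    reference = struct.unpack("<e", struct.pack("<H", bits))[0]
    print(f"{bits:#06x}: {decode_half(bits)} (struct module: {reference})")
```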
In the next section, we will look at the x86 register set in detail.
The x86 architecture protected mode has eight 32-bit wide general-purpose registers, a flags register, and an instruction pointer. There are also six segment registers and additional processor model-specific configuration registers. The segment registers and model-specific registers are configured by system software during startup and are, in general, not relevant to the developers of applications and device drivers. For these reasons, we will not discuss the segment registers and model-specific registers further.
The 16-bit general-purpose registers in the original 8086 architecture are named AX, CX, DX, BX, SP, BP, SI, and DI. The first four registers are listed in this non-alphabetic order because this is the sequence in which the eight registers are pushed onto the stack by a pushad (push all registers) instruction.
In the transition to the 32-bit architecture of the 80386, each register grew to 32 bits. The 32-bit version of a register’s name is prefixed with the letter “E” to indicate this extension.
It is possible to access portions of 32-bit registers in smaller bit widths. For example, the lower 16 bits of the 32-bit EAX register are referenced as AX. The AX register can be further accessed as individual bytes using the names AH (high-order byte) and AL (low-order byte). The following diagram shows the register names and the subsets of each:
Figure 10.1: Register names and subsets
Writing to a portion of a 32-bit register, for example, the AL register, affects only the bits in that portion. In the case of AL, loading an 8-bit value modifies the lowest 8 bits of EAX, leaving the other 24 bits unaffected.
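The effect of writing to a register subset can be modeled with bit masks. The Python sketch below is an illustration only; the helper names write_al, write_ah, and write_ax are hypothetical and simply return the new 32-bit EAX value.

```python
def write_al(eax: int, value: int) -> int:
    """Replace bits 0-7 of EAX, leaving bits 8-31 unchanged."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


def write_ah(eax: int, value: int) -> int:
    """Replace bits 8-15 of EAX, leaving all other bits unchanged."""
    return (eax & 0xFFFF00FF) | ((value & 0xFF) << 8)


def write_ax(eax: int, value: int) -> int:
    """Replace bits 0-15 of EAX, leaving bits 16-31 unchanged."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)


eax = 0x12345678
eax = write_al(eax, 0xFF)
print(hex(eax))
eax = write_ah(eax, 0x00)
print(hex(eax))
eax = write_ax(eax, 0xABCD)
print(hex(eax))
```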
In keeping with the x86’s CISC architecture, several functions associated with various instructions are tied to specific registers. Table 10.1 provides a description of the functions associated with each of the x86 general-purpose registers:
Register | Name | Function
EAX | Accumulator | Arithmetic operations
ECX | Counter | Loop counter and shift/rotate counter
EDX | Data | Arithmetic and I/O operations
EBX | Base | Pointer to data
ESP | Stack pointer | Pointer to the top of the stack
EBP | Base pointer | Pointer to the stack base within a function
ESI | Source index | Pointer to the source location in array operations
EDI | Destination index | Pointer to the destination location in array operations
Table 10.1: x86 general-purpose registers and associated functions
These register-specific functions contrast with the architectures of many RISC processors, which tend to provide a greater number of general-purpose registers. Registers within a RISC processor are, for the most part, functionally equivalent to one another.
The x86 flags register, EFLAGS, contains the processor status bits described in Table 10.2:
Bit | Name | Function
0 | CF | Carry flag: Indicates if addition produced a carry or subtraction produced a borrow. Used as input by addition and subtraction instructions.
2 | PF | Parity flag: Set if the low 8 bits of the result contain an even number of 1 bits.
4 | AF | Adjust flag: Indicates if addition produced a carry or subtraction produced a borrow from the lower 4 bits. Used in BCD arithmetic.
6 | ZF | Zero flag: Set if the result of an operation is zero.
7 | SF | Sign flag: Set if the result of an operation is negative.
8 | TF | Trap flag: Used in single-step debugging.
9 | IF | Interrupt enable flag: Setting this bit enables hardware interrupts.
10 | DF | Direction flag: Controls the direction of string processing. When clear, the order is lowest to highest addresses. When set, the order is highest to lowest addresses.
11 | OF | Overflow flag: Set if an operation resulted in a signed overflow.
12-13 | IOPL | I/O privilege level: The privilege level of the currently executing thread. IOPL 0 is kernel mode, and 3 is user mode.
14 | NT | Nested task flag: Controls the chaining of interrupts.
16 | RF | Resume flag: Used for processing exceptions during debugging.
17 | VM | Virtual 8086 mode flag: If set, 8086 compatibility mode is active. This mode allows some MS-DOS applications to be run in the context of a protected mode operating system.
18 | AC | Alignment check flag: If set, memory alignment checking is active. For example, if the AC flag is set, storing a 16-bit value to an odd address triggers an Alignment Check exception. x86 processors can perform unaligned memory accesses when this flag is not set, but the number of instruction cycles required may increase.
19 | VIF | Virtual interrupt flag: Virtual version of the IF flag.
20 | VIP | Virtual interrupt pending flag: Set when an interrupt is pending in virtual 8086 mode.
21 | ID | ID flag: If this bit can be set, the cpuid instruction is supported.
Table 10.2: x86 flags’ register bits
All bits in the EFLAGS register that are not listed in Table 10.2 are reserved and are unused.
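As a rough illustration of how the arithmetic status flags in Table 10.2 behave, the following Python sketch models an 8-bit addition and derives the CF, PF, AF, ZF, SF, and OF flags from the operands and the result. The names add8, pack_flags, and FLAG_BITS are assumptions made for this example, and pack_flags leaves every bit it does not model at 0, which does not match the reserved bits of a real EFLAGS register.

```python
# Bit positions taken from Table 10.2.
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "OF": 11}


def add8(a: int, b: int):
    """Add two 8-bit values and return (result, flags)."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    flags = {
        "CF": total > 0xFF,                              # carry out of bit 7
        "PF": bin(result).count("1") % 2 == 0,           # even number of 1 bits
        "AF": (a & 0x0F) + (b & 0x0F) > 0x0F,            # carry out of bit 3
        "ZF": result == 0,                               # result is zero
        "SF": bool(result & 0x80),                       # copy of the sign bit
        "OF": bool((a ^ result) & (b ^ result) & 0x80),  # signed overflow
    }
    return result, flags


def pack_flags(flags) -> int:
    """Place each set flag at its EFLAGS bit position."""
    return sum(1 << FLAG_BITS[name] for name, is_set in flags.items() if is_set)


for a, b in [(0x7F, 0x01), (0xFF, 0x01)]:
    result, flags = add8(a, b)
    set_flags = [name for name, is_set in flags.items() if is_set]
    print(f"{a:#04x} + {b:#04x} = {result:#04x}, flags set: {set_flags}")
```

Adding 1 to 0x7F overflows the signed range, so OF is set without CF. Adding 1 to 0xFF wraps the unsigned range, so CF is set without OF.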
The 32-bit instruction pointer, EIP, contains the address of the next instruction to execute, unless a branch is taken. When a branch is taken, the address of the branch destination is loaded into EIP and execution continues from there.
The x86 architecture is little-endian, meaning multi-byte values are stored in memory with the least significant byte at the lowest address and the most significant byte at the highest address.
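A short Python sketch can show the resulting byte order. It uses only the built-in int.to_bytes and int.from_bytes methods; the value 0x12345678 is an arbitrary example.

```python
value = 0x12345678

# Lay out the 32-bit value as it would appear in x86 memory:
# index 0 is the lowest address.
stored = value.to_bytes(4, byteorder="little")
print(" ".join(f"{byte:02x}" for byte in stored))

# Reading the bytes back in little-endian order recovers the value.
restored = int.from_bytes(stored, byteorder="little")
print(hex(restored))
```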
As one would expect for a CISC architecture, x86 supports a variety of addressing modes. There are several rules associated with addressing source and destination operands that must be followed to create valid instructions. For instance, the sizes of the source and destination operands of a mov instruction must be equal. The assembler will attempt to select a suitable size for an operand that has an ambiguous size (for example, an immediate value of 7) to match the width of a destination location (such as the 32-bit register EAX). In cases where the size of an operand cannot be inferred, size keywords such as byte ptr must be provided.
The assembly language in these examples uses Intel syntax, which places the operands in destination-source order. Intel syntax is used primarily in the Windows and MS-DOS contexts. An alternative notation, known as AT&T syntax, places operands in source-destination order. AT&T syntax is used in Unix-based operating systems. All examples in this book will use the Intel syntax.
The x86 architecture supports a variety of addressing modes, which we will look at next. Comments in assembly code begin with a semicolon and continue to the end of the line.
In this addressing mode, the register is implied by the instruction opcode. For example:
clc ; Clear the carry flag (CF in the EFLAGS register)
One or both source and destination registers are encoded in the instruction:
mov eax, ecx ; Copy the contents of register ECX to EAX
Registers may be used as the first operand, the second operand, or both operands.
An immediate value is provided as an instruction operand:
mov eax, 7 ; Move the 32-bit value 7 into EAX
mov ax, 7 ; Move the 16-bit value 7 into AX (the lower 16 bits of EAX)
When using Intel syntax, it is not necessary to prefix immediate values with a special character, such as the $ prefix that AT&T syntax requires.
The address of the value is provided as an instruction operand:
mov eax, [078bch] ; Copy the 32-bit value at hex address 78BC to EAX
In x86 assembly code, square brackets surrounding an expression indicate the expression is an address. When moves or other operations are performed on square-bracketed operands, the value being operated upon is the data at the specified address. The exception to this rule is the LEA (load effective address) instruction, which we’ll examine later.
The operand is a register containing the address of the data value:
mov eax, [esi] ; Copy the 32-bit value at the address contained in ESI to
; EAX
This mode is equivalent to using a pointer to reference a variable in C or C++.
The operand indicates a register plus offset that combine to provide the address of the data value:
mov eax, [esi + 0bh] ; Copy the 32-bit value at the address (ESI + 0bh) to
; EAX
This mode is useful for accessing the elements of a data structure. In this scenario, the ESI register contains the address of the structure, and the added constant is the byte offset of the element from the beginning of the structure.
The operand indicates a base register, an index register, and an offset that sum together to calculate the address of the data value:
mov eax, [ebx + esi + 10] ; Copy the 32-bit value starting at the address
; (EBX + ESI + 10) to EAX
This mode is useful for accessing individual data elements within an array of data structures. In this example, the EBX register contains the address of the beginning of the structure array, ESI contains the offset of the referenced structure within the array, and the constant value (10) is the offset of the desired element from the beginning of the selected structure.
The operand is composed of a base register, an index register multiplied by a scale factor, and an offset that sum together to calculate the address of the data value:
mov eax, [ebx + esi*4 + 10] ; Copy the 32-bit value starting at the
; address (EBX + ESI*4 + 10) to EAX
In this addressing mode, the value in the index register can be multiplied by 1 (the default), 2, 4, or 8 before being summed with the other components of the operand address. There is no performance penalty associated with using the scaling multiplier. This feature is helpful when iterating over arrays containing elements with sizes of 2, 4, or 8 bytes.
Most of the general-purpose registers can be used as the base or index register in the based addressing modes.
The following diagram shows the possible combinations of register usage and scaling in the based addressing modes:
Figure 10.2: Based addressing mode
All eight general-purpose registers are available for use as the base register. Of those eight, only ESP is unavailable for use as the index register.
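The address arithmetic performed by the based addressing modes can be summarized as base + index × scale + offset. The Python sketch below is an illustration only; the function name effective_address and its parameter names are assumptions, and the register contents are made-up example values.

```python
def effective_address(base: int = 0, index: int = 0, scale: int = 1,
                      displacement: int = 0) -> int:
    """Compute base + index*scale + displacement as a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    # Addresses wrap at 32 bits in the flat memory model.
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Models the operand [ebx + esi*4 + 10] with EBX = 0x1000 and ESI = 3.
ebx, esi = 0x1000, 3
print(hex(effective_address(base=ebx, index=esi, scale=4, displacement=10)))

# Models the operand [esi + 0bh] with ESI = 0x2000 (no index register).
print(hex(effective_address(base=0x2000, displacement=0x0B)))
```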
The x86 instruction set was introduced with the Intel 8086 and has been extended several times over the years. Some of the most significant changes relate to the extension of the architecture from 16 to 32 bits, which added protected mode and paged virtual memory. In almost all cases, the new capabilities have been added while retaining full backward compatibility.
The full x86 instruction set contains several hundred instructions. We will not discuss all of them in this chapter. This section will provide brief summaries of the more important and commonly encountered instructions applicable to user-mode applications and device drivers.
This subset of x86 instructions can be divided into a few general categories: data movement; stack manipulation; arithmetic and logic; conversions; control flow; string and flag manipulation; input/output; and protected mode. We will also cover some miscellaneous instructions that do not fall into any specific category.
Data movement instructions do not affect the processor flags. The following instructions perform data movement:
mov: Copies the data value referenced by the second operand to the location provided as the first operand.
cmovcc: Conditionally moves the second operand’s data to the register provided as the first operand if the cc condition is true. The condition is determined from one or more of the following processor flags: CF, ZF, SF, OF, and PF. The condition codes are e (equal), ne (not equal), g (greater), ge (greater or equal), a (above), ae (above or equal), l (less), le (less or equal), b (below), be (below or equal), o (overflow), no (no overflow), z (zero), nz (not zero), s (SF=1), and ns (SF=0). Two additional codes, cxz (register CX is zero) and ecxz (the ECX register is zero), apply only to the conditional jump instructions and are not available with cmovcc.
movsx, movzx: These are variants of the mov instruction performing sign extension and zero extension, respectively. The source operand must be a smaller size than the destination.
lea: Computes the address provided by the second operand and stores it at the location given in the first operand. The second operand is surrounded by square brackets. Unlike the other data movement instructions, lea stores the computed address in the destination rather than the data value located at that address.
With the exception of popfd, which loads EFLAGS from the stack, stack manipulation instructions do not affect the processor flags. These instructions are:
push: Decrements ESP by 4, and then places the 32-bit operand into the stack location pointed to by ESP.
pop: Copies the 32-bit data value pointed to by ESP to the operand location (a register or memory address), and then increments ESP by 4.
pushfd, popfd: Pushes or pops the EFLAGS register.
pushad, popad: pushad pushes the EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI registers, in that order. popad pops the same registers in the reverse order.
The arithmetic and logic instructions modify the processor flags. The following instructions perform arithmetic and logic operations:
add, sub: Performs integer addition or subtraction. When subtracting, the second operand is subtracted from the first. Both operands can be registers, or one operand can be a memory location and the other a register. One operand can be a constant.
adc, sbb: Performs integer addition or subtraction using the CF flag as a carry input (for addition) or as a borrow input (for subtraction).
cmp: Subtracts the second operand from the first and discards the result while updating the OF, SF, ZF, AF, PF, and CF flags based on the result.
neg: Negates the operand.
inc, dec: Increments or decrements the operand by one.
mul: Performs unsigned integer multiplication. The size of the product depends on the size of the operand. A byte operand is multiplied by AL and the result is placed in AX. A word operand is multiplied by AX and the result is placed in DX:AX, with the upper 16 bits in DX. A doubleword is multiplied by EAX and the result is placed in EDX:EAX, with the upper 32 bits in EDX.
imul: Performs signed integer multiplication. The first operand must be a register and receives the result of the operation. There may be a total of two or three operands. In the two-operand form, the first operand multiplies the second operand, and the result is stored in the first operand (a register). In the three-operand form, the second operand multiplies the third operand, and the result is stored in the first operand register. In the three-operand form, the third operand must be an immediate value.
div, idiv: Performs unsigned (div) or signed (idiv) division. The size of the result depends on the size of the operand. With a byte operand, AX is divided by the operand, the quotient is placed in AL, and the remainder is placed in AH. With a word operand, DX:AX is divided by the operand, the quotient is placed in AX, and the remainder is placed in DX. With a doubleword operand, EDX:EAX is divided by the operand, the quotient is placed in EAX, and the remainder is placed in EDX.
and, or, xor: Performs the corresponding logical operation on the two operands and stores the result in the destination operand location.
not: Performs a logical NOT (bit inversion) operation on a single operand.
sal, shl, sar, shr: Performs a logical (shl and shr) or arithmetic (sal and sar) shift of the byte, word, or doubleword argument left or right by 1 to 31 bit positions. sal and shl place the last bit shifted out into the carry flag and insert zeros into the vacated least significant bits. shr places the last bit shifted out into the carry flag and inserts zeros into the vacated most significant bits. sar differs from shr by propagating the sign bit into the vacated most significant bits.
rol, rcl, ror, rcr: Performs a left or right rotation by 0 to 31 bits, optionally through the carry flag. rcl and rcr rotate through the carry flag, while rol and ror do not.
bts, btr, btc: Reads a specified bit number (provided as the second operand) within the bits of the first operand into the carry flag, then either sets (bts), resets (btr), or complements (btc) that bit. These instructions may be preceded by the lock keyword to make the operation atomic.
test: Performs a logical AND operation of two operands, discards the result, and updates the SF, ZF, and PF flags based on it.
Conversion instructions extend a smaller data size to a larger size. These instructions are:
cbw: Converts a byte (AL register) into a word (AX).
cwd: Converts a word (AX register) into a doubleword (DX:AX).
cwde: Converts a word (AX register) into a doubleword (EAX).
cdq: Converts a doubleword (EAX register) into a quadword (EDX:EAX).
Control flow instructions conditionally or unconditionally transfer execution to an address. The control flow instructions are:
jmp: Transfers control to the instruction at the address provided as the operand.
jcc: Transfers control to the instruction at the address provided as the operand if the condition cc is true. The condition codes were described previously in the cmovcc instruction description. The condition is determined from one or more of the following processor flags: CF, ZF, SF, OF, and PF.
call: Pushes the current value of EIP, which is the address of the instruction following the call, onto the stack and transfers control to the instruction at the address provided as the operand.
ret: Pops the top-of-stack value and stores it in EIP. If an operand is provided, it then removes the given number of additional bytes from the stack to clear parameters.
loop: Decrements the loop counter in ECX and, if not zero, transfers control to the instruction at the address provided as the operand.
String manipulation instructions may be prefixed by the rep keyword to repeat the operation the number of times given by the ECX register, incrementing or decrementing the source and destination location on each iteration, depending on the state of the DF flag. The operand size processed on each iteration can be a byte, word, or doubleword. The source address of each string element is given by the ESI register and the destination by the EDI register. These instructions are:
movs: Moves a string element
cmps: Compares elements at corresponding locations in two strings
scas: Compares a string element to the value in EAX, AX, or AL, depending on the operand size
lods: Loads a string element into EAX, AX, or AL, depending on the operand size
stos: Stores EAX, AX, or AL, depending on the operand size, to the address in EDI
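As a rough illustration of how a rep-prefixed string instruction steps through memory, the following Python sketch models rep movsb on a byte array. The function name rep_movsb and the use of a bytearray to stand in for memory are assumptions made for this example; a real processor operates on its own address space and wraps register values at 32 bits.

```python
def rep_movsb(memory: bytearray, esi: int, edi: int, ecx: int,
              df: bool = False):
    """Model rep movsb: copy ECX bytes from [ESI] to [EDI].

    Returns the final (ESI, EDI, ECX) values. With DF clear the copy
    steps upward through memory; with DF set it steps downward.
    """
    step = -1 if df else 1
    while ecx != 0:
        memory[edi] = memory[esi]  # movsb copies one byte
        esi += step                # both pointers move in the DF direction
        edi += step
        ecx -= 1                   # rep counts ECX down to zero
    return esi, edi, ecx


# Forward copy (DF clear): start at the lowest source and destination addresses.
forward = bytearray(b".HELLO.....")
print(rep_movsb(forward, esi=1, edi=6, ecx=5), forward)

# Backward copy (DF set): start at the highest source and destination addresses.
backward = bytearray(b".HELLO.....")
print(rep_movsb(backward, esi=5, edi=10, ecx=5, df=True), backward)
```

Both calls leave the same bytes in memory. The direction matters mainly when the source and destination regions overlap.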
Flag manipulation instructions modify bits in the EFLAGS register. The flag manipulation instructions are:
stc, clc, cmc: Sets, clears, or complements the carry flag, CF
std, cld: Sets or clears the direction flag, DF
sti, cli: Sets or clears the interrupt flag, IF
Input/output instructions read data from or write data to peripheral devices. The input/output instructions are:
in, out: Moves 1, 2, or 4 bytes between AL, AX, or EAX and an I/O port, depending on the operand size
ins, outs: Moves a data element between memory and an I/O port in the same manner as the string instructions
rep ins, rep outs: Moves blocks of data between memory and an I/O port in the same manner as the string instructions
The following instructions access the features of protected mode:
sysenter, sysexit: Transfers control from ring 3 to ring 0 (sysenter) or from ring 0 to ring 3 (sysexit) in Intel processors.
syscall, sysret: Transfers control from ring 3 to ring 0 (syscall) or from ring 0 to ring 3 (sysret) in AMD processors. In x86 (32-bit) mode, AMD processors also support sysenter and sysexit.
These instructions do not fit into the categories previously listed:
int: Initiates a software interrupt. The operand is the interrupt vector number.
nop: No operation.
cpuid: Provides information about the processor model and its capabilities.
The instructions listed in this section are some of the more common instructions you will come across in x86 applications and device drivers. Beyond those listed in the preceding sections, the x86 architecture contains a wide variety of instruction categories, including the following:
Additional processor registers are provided for use by the floating-point and SIMD instructions.
There are even more categories of x86 instructions beyond those listed here, a few of which have been retired in later generations of the architecture.
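The register-pair conventions described for the doubleword forms of mul and div earlier in this section can be sketched in Python. The function names mul32 and div32 are hypothetical, and Python exceptions stand in for the divide error the processor raises when the divisor is zero or the quotient does not fit in EAX.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax: int, operand: int):
    """Unsigned EAX * operand; returns (EDX, EAX) holding the 64-bit product."""
    product = (eax & MASK32) * (operand & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def div32(edx: int, eax: int, operand: int):
    """Unsigned EDX:EAX / operand; returns (EAX, EDX) = (quotient, remainder)."""
    operand &= MASK32
    if operand == 0:
        raise ZeroDivisionError("divide error: divisor is zero")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, operand)
    if quotient > MASK32:
        raise OverflowError("divide error: quotient does not fit in EAX")
    return quotient, remainder


edx, eax = mul32(0xFFFFFFFF, 2)
print(f"EDX:EAX = {edx:08X}:{eax:08X}")

quotient, remainder = div32(edx, eax, 2)
print(f"quotient = {quotient:08X}, remainder = {remainder:08X}")
```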
Listed below are some examples of instruction usage patterns you will come across frequently in compiled code. The techniques used in these examples produce the desired result while minimizing code size and the number of clock cycles required:
xor reg, reg ; Set reg to zero
test reg, reg ; Test if reg contains zero
add reg, reg ; Shift reg left by one bit
Individual x86 instructions are of variable length and can range in size from 1 to 15 bytes. The components of a single instruction, including any optional bytes, are laid out in memory in the following sequence:
The lock prefix performs bus locking in a multiprocessor system to enable atomic test-and-set type operations. The rep and related prefixes enable string instructions to perform repeated operations on string elements in a single instruction. Other prefixes are available to provide hints for conditional branch instructions or to override the default size of an address or operand.
The variable-length nature of x86 instructions makes the process of instruction decoding quite complex. It is also challenging for debugging tools to disassemble a sequence of instructions in reverse order, perhaps to display the code leading up to a breakpoint.
This difficulty arises because it is possible for a trailing subset of bytes within a lengthy instruction to form a complete, valid instruction. This complexity is a notable difference from the more regular instruction formats used in RISC architectures.
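The following Python sketch illustrates this ambiguity with a deliberately tiny decoder. It is a toy, not a disassembler: the table covers only four single-byte opcodes (68 for push with a 32-bit immediate, 6A for push with an 8-bit immediate, E8 for call with a 32-bit relative offset, and 90 for nop), and the names INSTRUCTIONS and decode are assumptions made for this example.

```python
# Opcode byte -> (description, total instruction length in bytes).
INSTRUCTIONS = {
    0x68: ("push imm32", 5),
    0x6A: ("push imm8", 2),
    0xE8: ("call rel32", 5),
    0x90: ("nop", 1),
}


def decode(code: bytes, start: int = 0):
    """Return (offset, description) pairs decoded sequentially from start."""
    decoded = []
    position = start
    while position < len(code):
        opcode = code[position]
        if opcode not in INSTRUCTIONS:
            raise ValueError(f"opcode {opcode:#04x} is not in this toy table")
        name, length = INSTRUCTIONS[opcode]
        if position + length > len(code):
            raise ValueError("instruction runs past the end of the buffer")
        decoded.append((position, name))
        position += length
    return decoded


code = bytes([0x68, 0x6A, 0x00, 0x90, 0x90])
print(decode(code, 0))  # one 5-byte instruction
print(decode(code, 1))  # three shorter instructions from the same bytes
```

Starting one byte later yields a different but equally valid instruction stream, which is why a tool scanning backward from a breakpoint cannot be certain where the preceding instruction begins.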
It is possible to develop programs of any level of complexity in assembly language.
Most modern applications, however, are largely or entirely developed in high-level languages. Assembly language tends to be used in cases where the use of specialized instructions is required, or a level of extreme optimization is necessary that is unachievable with an optimizing compiler.
Regardless of the language used in application development, all code must ultimately execute as processor instructions. To fully understand how code executes on a computer system, there is no substitute for examining the state of the system following the execution of each individual instruction. A good way to learn to understand and operate in this environment is to write some assembly code.
The x86 assembly language example in the following listing is a complete x86 application that runs in a Windows command console, printing a text string and then exiting:
.386
.model FLAT,C
.stack 400h
.code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
main proc
; Print the message
push offset message
call printf
; Exit the program with status 0
push 0
call exit
main endp
.data
message db "Hello, Computer Architect!",0
end
A description of the contents of this assembly language file follows:
The .386 directive indicates the instructions in this file should be interpreted as applying to 80386 and later-generation processors.
The .model FLAT,C directive specifies a 32-bit flat memory model and the use of C language function calling conventions.
The .stack 400h directive specifies a stack size of 400h (1,024) bytes.
The .code directive indicates the start of executable code.
The includelib and extern directives reference system-provided libraries and the functions within them to be used by the program.
The public directive indicates that the function name, main, is an externally visible symbol.
The lines between main proc and main endp are the assembly language instructions making up the main function.
The .data directive indicates the start of data memory. The message db statement defines the message string as a sequence of bytes, followed by a zero byte.
The end directive marks the end of the program.
This file, named hello_x86.asm, can be assembled and linked to form the executable hello_x86.exe program with the following command, which runs the Microsoft Macro Assembler:
ml /Fl /Zi /Zd hello_x86.asm
The components of this command are:
ml runs the assembler (ml.exe)
/Fl creates a listing file
/Zi includes symbolic debugging information in the executable file
/Zd includes line number debugging information in the executable file
hello_x86.asm is the name of the assembly language source file
This is a portion of the hello_x86.lst listing file generated by the assembler:
.386
.model FLAT,C
.stack 400h
00000000 .code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
00000000 main proc
; Print the message
00000000 68 00000000 R push offset message
00000005 E8 00000000 E call printf
; Exit the program with status 0
0000000A 6A 00 push 0
0000000C E8 00000000 E call exit
00000011 main endp
00000000 .data
00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0
2C 20 43 6F 6D
70 75 74 65 72
20 41 72 63 68
69 74 65 63 74
21 00
This listing displays the address offsets from the beginning of the main function in the left column. On lines containing instructions, the opcode follows the address offset. Address references in the code (for example, offset message) are displayed as 00000000 in the listing because these values are determined during linking, and not during assembly, which is when this listing is generated.
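The data bytes in the listing can be checked against the message text with a few lines of Python. This sketch copies the hexadecimal bytes shown for the message db statement; the variable names are arbitrary.

```python
# Hexadecimal bytes from the .data section of the listing.
listing_bytes = (
    "48 65 6C 6C 6F 2C 20 43 6F 6D 70 75 74 65 72 "
    "20 41 72 63 68 69 74 65 63 74 21 00"
)

data = bytes.fromhex(listing_bytes)

# The final byte is the zero terminator that printf relies on.
text = data[:-1].decode("ascii")
print(repr(text), "followed by terminator byte", data[-1])
```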
This is the output displayed when running this program:
C:\>hello_x86.exe
Hello, Computer Architect!
Next, we will look at the extension of the 32-bit x86 architecture to the 64-bit x64 architecture.
The original specification for a processor architecture extending the x86 processor and instruction set to 64 bits, named AMD64, was introduced by AMD in 2000. The first AMD64 processor, the Opteron, was released in 2003. Intel found itself following AMD’s lead and developed an AMD64-compatible architecture, eventually given the name Intel 64. The first Intel processor that implemented the 64-bit architecture was the Xeon, introduced in 2004. The name of the architecture shared by AMD and Intel came to be called x86-64, reflecting the evolution of x86 to 64 bits, and, in popular usage, this term has been shortened to x64.
The first Linux version supporting the x64 architecture was released in 2001, well before the first x64 processors were even available. Windows began supporting the x64 architecture in 2005.
Processors implementing the AMD64 and Intel 64 architectures are largely compatible at the instruction set level of user-mode programs. There are a few differences between the architectures, the most significant of which is the difference in support of the sysenter/sysexit Intel instructions and the syscall/sysret AMD instructions we saw earlier.
In general, operating systems and programming language compilers manage these differences, making them rarely an issue of concern to software and system developers. Developers of kernel software, drivers, and assembly code must take these differences into account.
The principal features of the x64 architecture are:
The register-name prefix R indicates 64-bit registers. For example, in x64, the extended x86 EAX register is called RAX. The x86 register subcomponents EAX, AX, AH, and AL continue to be available in x64.
The instruction pointer, RIP, is now 64 bits. The flags register, RFLAGS, also extends to 64 bits, though the upper 32 bits are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS in the x86 architecture.
Eight 64-bit general-purpose registers have been added, named R8 through R15.
Virtual addresses in the x64 architecture are 64 bits wide, supporting an address space of 16 exabytes (EB), equivalent to 2^64 bytes. Current processors from AMD and Intel, however, support only 48 bits of virtual address space. This restriction reduces processor hardware complexity while still supporting up to 256 terabytes (TB) of virtual address space. Current-generation processors also support a maximum of 48 bits of physical address space. This permits a processor to address 256 TB of physical RAM, though modern motherboards do not support the number of DRAM devices such a system would require.
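The address-space sizes quoted above follow directly from the address widths. As a quick check, this Python sketch computes them in binary units, which is the sense in which EB and TB are used here. The helper name address_space_bytes is illustrative only:

```python
def address_space_bytes(bits: int) -> int:
    """Return the number of distinct byte addresses for an address `bits` wide."""
    return 1 << bits

TB = 1 << 40  # one terabyte in binary units (2**40 bytes)
EB = 1 << 60  # one exabyte in binary units (2**60 bytes)

print(address_space_bytes(64) // EB, "EB")  # 16 EB
print(address_space_bytes(48) // TB, "TB")  # 256 TB
```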
In the x64 architecture, the extension of x86 register lengths to 64 bits and the addition of registers R8 through R15 results in the register map shown in Figure 10.3:
Figure 10.3: x64 registers
In Figure 10.3, the x86 registers described in the preceding section (and present in x64) are shaded. The x86 registers have the same names and are the same sizes when operating in 64-bit mode.
The 64-bit extended versions of the x86 registers have names starting with the letter R. The new 64-bit registers (R8 through R15) can be accessed in smaller widths using the appropriate suffix letter:
R11D accesses the lower 32 bits of R11
R11W accesses the lower 16 bits of R11
R11B accesses the lower 8 bits of R11
Unlike the x86 registers, the new registers in the x64 architecture are truly general purpose and do not perform any special functions at the processor instruction level.
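To make the suffix rules concrete, here is a minimal Python sketch that models the narrower register names as bit masks applied to a 64-bit value. The function name and the sample value are hypothetical. The masks show only which bits each name selects when read; they do not model what the processor does to the remaining bits when one of these names is written:

```python
def register_views(value64: int) -> dict:
    """Return the values read through each width-specific name of R11."""
    value64 &= (1 << 64) - 1  # keep only 64 bits
    return {
        "R11": value64,                # full 64 bits
        "R11D": value64 & 0xFFFFFFFF,  # lower 32 bits
        "R11W": value64 & 0xFFFF,      # lower 16 bits
        "R11B": value64 & 0xFF,        # lower 8 bits
    }

views = register_views(0x1122334455667788)
for name, value in views.items():
    print(f"{name:5s} = {value:#x}")
```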
The x64 architecture implements essentially the same instruction set as x86, with 64-bit extensions. When operating in 64-bit mode, the x64 architecture uses a default address size of 64 bits and a default operand size of 32 bits. A new opcode prefix byte, rex, specifies the use of 64-bit operands.
The format of x64 instructions in memory matches that of the x86 architecture, with some exceptions that, for our purposes, are minor. The addition of support for the rex prefix byte is the most significant variation from the x86 instruction format. Address displacements and immediate values within some instructions can be 64 bits wide, in addition to all the bit widths supported in x86.
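As an illustration of the prefix byte, the sketch below decodes a byte using the rex layout given in the Intel and AMD manuals: the upper four bits are 0100, followed by the W, R, X, and B bits. A set W bit selects a 64-bit operand size, while R, X, and B each supply the extra register-number bit needed to reach R8 through R15. The helper name decode_rex is hypothetical:

```python
def decode_rex(byte: int):
    """Decode a rex prefix byte (binary 0100WRXB); return None for any other byte."""
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": bool(byte & 0x08),  # 1 = 64-bit operand size
        "R": bool(byte & 0x04),  # extends the ModRM reg field
        "X": bool(byte & 0x02),  # extends the SIB index field
        "B": bool(byte & 0x01),  # extends the ModRM r/m or SIB base field
    }

print(decode_rex(0x48))  # W set, others clear: a 64-bit operation on legacy registers
print(decode_rex(0x90))  # None: 90h is not a rex prefix
```

The value 48h is the prefix an assembler typically emits ahead of a 64-bit operation that uses only the original eight registers.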
Although it is possible to define instructions that are longer than 15 bytes, the processor instruction decoder will raise a general protection fault if an attempt is made to decode an instruction longer than 15 bytes.
The x64 assembly language source file for the hello program is like the x86 version of this code, with some notable differences:
The Windows x64 calling convention passes the first four arguments to a function in the RCX, RDX, R8, and R9 registers, in that order. This differs from the default x86 calling convention, which pushes parameters onto the stack. Both library functions called by this program (printf and exit) are called with a single argument, passed in RCX.
The calling convention requires the caller to reserve stack space for the called function. The sub rsp, 40 instruction performs this stack allocation. Normally, after the called function returns, it would be necessary to adjust the stack pointer to remove this allocation. Our program calls the exit function, terminating program execution, which makes this step unnecessary.
The code for the 64-bit version of the hello program is as follows:
.code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
main proc
; Reserve stack space
sub rsp, 40
; Print the message
lea rcx, message
call printf
; Exit the program with status 0
xor rcx, rcx
call exit
main endp
.data
message db "Hello, Computer Architect!",0
end
This file, named hello_x64.asm, is assembled and linked to form the executable hello_x64.exe program with the following call to the Microsoft Macro Assembler (x64 version):
ml64 /Fl /Zi /Zd hello_x64.asm
The components of this command are:
ml64 runs the 64-bit assembler
/Fl creates a listing file
/Zi includes symbolic debugging information in the executable file
/Zd includes line number debugging information in the executable file
hello_x64.asm is the name of the assembly language source file
This is a portion of the hello_x64.lst listing file generated by the assembler command:
00000000 .code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
00000000 main proc
; Reserve stack space
00000000 48/ 83 EC 28 sub rsp, 40
; Print the message
00000004 48/ 8D 0D lea rcx, message
00000000 R
0000000B E8 00000000 E call printf
; Exit the program with status 0
00000010 48/ 33 C9 xor rcx, rcx
00000013 E8 00000000 E call exit
00000018 main endp
00000000 .data
00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0
2C 20 43 6F 6D
70 75 74 65 72
20 41 72 63 68
69 74 65 63 74
21 00
The output of running this program is as follows:
C:\>hello_x64.exe
Hello, Computer Architect!
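The value 40 in the sub rsp, 40 instruction of the x64 program can be reconstructed with a little arithmetic. The sketch below assumes the Windows x64 convention as documented by Microsoft: the caller reserves 32 bytes (room for four 8-byte register arguments), and the stack pointer must be 16-byte aligned at each call instruction. On entry to main, the return address pushed by the caller leaves RSP 8 bytes away from a 16-byte boundary, so 8 more bytes are needed. The function and constant names are illustrative:

```python
SHADOW_SPACE = 4 * 8  # space for four 8-byte register arguments
RETURN_ADDRESS = 8    # pushed by the call instruction that entered the function
ALIGNMENT = 16        # required RSP alignment at each call instruction

def stack_reservation(local_bytes: int = 0) -> int:
    """Smallest reservation covering shadow space and locals that realigns RSP."""
    needed = SHADOW_SPACE + local_bytes
    # Pad so that (return address + reservation) is a multiple of 16
    padding = -(needed + RETURN_ADDRESS) % ALIGNMENT
    return needed + padding

print(stack_reservation())  # 40
```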
This completes our brief introduction to the x86 and x64 architectures. There is a great deal more to be learned, and indeed the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volumes 1 through 4, contains nearly 5,000 pages of detailed documentation on these architectures. We have just scratched the surface in this chapter.
Next, we will take a similar top-level tour of the ARM 32-bit and 64-bit architectures.
The ARM architectures define a family of RISC processors suitable for use in a wide variety of applications. Processors based on ARM architectures are preferred in designs where a combination of high performance, low power consumption, and small physical size is needed.
ARM Holdings, a British semiconductor and software company, developed the ARM architectures and licenses them to other companies who implement processors in silicon. Many applications of the ARM architectures are system-on-chip (SoC) designs combining a processor with specialized hardware to support functions such as cellular radio communications in smartphones.
ARM processors are employed in a broad spectrum of applications, from tiny battery-powered devices to supercomputers. ARM processors serve as embedded processors in safety-critical systems such as automotive anti-lock brakes and as general-purpose processors in smartwatches, portable phones, tablets, laptop computers, desktop computers, and servers. As of 2021, over 180 billion ARM processors have been manufactured.
ARM processors are true RISC systems with a large set of general-purpose registers and single-cycle execution of most instructions. Standard ARM instructions have a fixed width of 32 bits, though a separate variable-length instruction set named T32 (formerly called Thumb) is available for applications where memory is at a premium. The T32 instruction set uses a mixture of 16- and 32-bit instructions.
Current-generation ARM processors support both the ARM and T32 instruction sets and can switch between the two sets on the fly. Most operating systems and applications prefer to use the T32 instruction set over the ARM set because code density is improved.
ARM is a load/store architecture, requiring data to be loaded from memory to a register before any processing such as an ALU operation can take place with it. A subsequent instruction stores the result back to memory. While this might seem like a step back from the x86 and x64 architectures, which operate directly on operands in memory in a single instruction, in practice, the load/store approach permits several sequential operations to be performed at high speed on an operand once it has been loaded into one of the many processor registers.
ARM processors are bi-endian. A configuration setting is available to select the little-endian or big-endian byte order for multi-byte values. The default setting is little-endian, which is the configuration commonly used by operating systems.
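The difference between the two byte orders is easy to see by laying out the same 32-bit value both ways. This short Python sketch uses the standard struct module; the sample value is arbitrary:

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)  # little-endian: least significant byte first
big = struct.pack(">I", value)     # big-endian: most significant byte first

print(little.hex())  # 78563412
print(big.hex())     # 12345678
```

In the little-endian layout, the byte at the lowest address holds the least significant 8 bits of the value.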
The ARM architecture natively supports these data types:
WHAT’S IN A WORD?
There is a potentially confusing difference between the data type names of the ARM architecture and those of the x86 and x64 architectures: in x86 and x64, a word is 16 bits and a doubleword is 32 bits. In ARM, a word is 32 bits and a doubleword is 64 bits.
ARM processors support eight distinct execution privilege levels. These levels, and their abbreviations, are as follows:
For the purposes of operating systems and user applications, the most important privilege levels are USR and SVC. The two interrupt request modes, FIQ and IRQ, are used by device drivers for processing interrupts.
In most operating systems running on ARM, including Windows and Linux, the kernel mode runs in ARM SVC mode, equivalent to ring 0 on x86/x64. ARM USR mode is equivalent to ring 3 on x86/x64. Applications running under Linux on ARM processors use software interrupts to request kernel services, which involves a transition from USR mode to SVC mode.
The ARM architecture provides system capabilities beyond those of the main processor via the concept of coprocessors. Each coprocessor implements a specialized category of functionality in support of the main processor. Up to 16 coprocessors can be implemented in a system, with predefined functions assigned to four of them.
Coprocessor 15 implements the MMU and other system functions. If present, coprocessor 15 must support the instruction opcodes, register set, and behaviors specified for the MMU. Coprocessors 10 and 11 combine to provide floating-point functionality in processors equipped with that feature. Coprocessor 14 provides debugging functions.
The ARM architectures have evolved through several versions over the years. The architectural variant currently in wide use is ARMv8-A. ARMv8-A supports 32-bit and 64-bit operating systems and applications. 32-bit applications can run under a 64-bit ARMv8-A operating system.
Virtually all high-end smartphones and portable electronic devices produced since 2016 are designed around processors or SoCs based on the ARMv8-A architecture. The description that follows will focus on ARMv8-A 32-bit mode. We will look at the differences in ARMv8-A 64-bit mode in a later section in this chapter.
In USR mode, the ARM architecture has 16 general-purpose 32-bit registers named R0 through R15. The first 13 registers are truly general-purpose, while the last three have the following defined functions:
R13 is the stack pointer, also named SP in assembly code. This register points to the top of the stack.
R14 is the link register, also named LR. This register holds the return address while in a called function. The use of a link register differs from x86/x64, which pushes the return address onto the stack. A register holds the return address because it is significantly faster to resume execution at the address in LR at the end of a function than it is to pop the return address from the stack and resume execution at that address.
R15 is the program counter, also named PC. Due to pipelining, the value contained in PC is usually two instructions ahead of the currently executing instruction. Unlike x86/x64, it is possible for user code to directly read and write the PC register. Writing an address to PC causes execution to immediately jump to the newly written address.
The current program status register (CPSR) contains status and mode control bits, similar to EFLAGS/RFLAGS in the x86/x64 architectures.
Bit | Name | Function |
0-3 | M | Mode: The current execution privilege level. |
5 | T | Thumb: Set if the T32 instruction set is active. If clear, the ARM instruction set is active. |
9 | E | Endianness: Setting this bit enables big-endian mode. If clear, little-endian mode is active. Most code uses little-endian mode. |
27 | Q | Cumulative saturation flag: Set if, at some point in a series of operations, an overflow or saturation occurred. |
28 | V | Overflow flag: Set if the operation resulted in a signed overflow. |
29 | C | Carry flag: Indicates whether addition produced a carry, or subtraction produced a borrow. |
30 | Z | Zero flag: Set if the result of an operation is zero. |
31 | N | Negative flag: Set if the result of an operation is negative. |
Table 10.3: Selected CPSR bits
CPSR bits not listed in Table 10.3 are either reserved or represent functions not discussed in this chapter.
By default, most instructions do not affect the flags. The S suffix must be used with, for example, an addition instruction (adds) to cause the result to affect the flags. Comparison instructions are the exception to this rule; they update the flags automatically.
In true RISC fashion, the only ARM instructions that can access system memory are those that perform register loads and stores.
The ldr instruction loads a register from memory, while str stores a register to memory. A separate instruction, mov, transfers the contents of one register to another or moves an immediate value into a register.
When computing the target address for a load or store operation, ARM starts with a base address provided in a register and adds an increment to arrive at the target memory address. There are three methods for determining the increment that will be added to the base register in register load and store instructions:
Immediate offset: The increment is a constant encoded in the instruction. For example, ldr r0, [r1, #10] loads r0 with the word at the address r1+10. As shown in the following addressing mode examples, pre- or post-indexing can optionally update the base register to the target address before or after the memory location is accessed.
Register offset: The increment is the value held in a second register. For example, ldr r0, [r1, r2] loads r0 with the word at the address r1+r2. Either of the registers can be thought of as the base register.
Scaled register offset: The increment is the value held in a second register, shifted by a constant number of bits. For example, ldr r0, [r1, r2, lsl #3] loads r0 with the word at the address r1+(r2×8). The shift can be a logical left or right shift, lsl or lsr, inserting zero bits in the vacated bit positions, or an arithmetic right shift, asr, that replicates the sign bit in the vacated positions.
The addressing modes available for specifying source and destination operands in ARM instructions are presented in the following sections.
An immediate value is provided as part of the instruction. The possible immediate values consist of an 8-bit value, coded in the instruction, rotated through an even number of bit positions. A full 32-bit value cannot be specified because the instruction itself is, at most, 32 bits wide. To load an arbitrary 32-bit value into a register, the ldr instruction must be used instead to load the value from memory:
mov r0, #10 // Load the 32-bit value 10 decimal into r0
mov r0, #0xFF000000 // Load the 32-bit value FF000000h into r0
The second example contains the 8-bit value FFh in the instruction opcode. During execution, it is rotated left by 24 bit positions into the most significant 8 bits of the word.
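The rule that an immediate is an 8-bit value rotated through an even number of bit positions can be tested mechanically. The Python sketch below searches for such an encoding, expressing the rotation as a right rotation (a right rotation by 8 is the same as the left rotation by 24 described above). The function names are hypothetical, and the sketch models only the classic ARM-state encoding:

```python
MASK32 = 0xFFFFFFFF

def rotate_left32(value: int, amount: int) -> int:
    """Rotate a 32-bit value left by `amount` bit positions."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def encode_immediate(value: int):
    """Return (imm8, right_rotation) that produces `value`, or None if impossible."""
    value &= MASK32
    for rotation in range(0, 32, 2):  # only even rotations are allowed
        imm8 = rotate_left32(value, rotation)  # undo a right rotation
        if imm8 <= 0xFF:
            return imm8, rotation
    return None

print(encode_immediate(10))          # (10, 0)
print(encode_immediate(0xFF000000))  # (255, 8)
print(encode_immediate(0x12345678))  # None: needs an ldr from memory instead
```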
This mode copies one register to another:
mov r0, r1 // Copy r1 to r0
mvn r0, r1 // Copy NOT(r1) to r0
The address of the operand is provided in a register. The register containing the address is surrounded by square brackets:
ldr r0, [r1] // Load the 32-bit value at the address given in r1 to r0
str r0, [r3] // Store r0 to the address in r3
Unlike most instructions, str uses the first operand as the source and the second as the destination.
The address of the operand is computed by adding an offset to the base register:
ldr r0, [r1, #32] // Load r0 with the value at the address [r1+32]
str r0, [r1, #4] // Store r0 to the address [r1+4]
The address of the value is determined by adding an offset to the base register. The base register is updated to the computed address and this address is used to load the destination register:
ldr r0, [r1, #32]! // Load r0 with [r1+32] and update r1 to (r1+32)
str r0, [r1, #4]! // Store r0 to [r1+4] and update r1 to (r1+4)
The base address is first used to access the memory location. The base register is then updated to the computed address:
ldr r0, [r1], #32 // Load [r1] to r0, then update r1 to (r1+32)
str r0, [r1], #4 // Store r0 to [r1], then update r1 to (r1+4)
The address of the operand is the sum of a base register and an increment register. The register names are surrounded by square brackets:
ldr r0, [r1, r2] // Load r0 with [r1+r2]
str r0, [r1, r2] // Store r0 to [r1+r2]
The address of the operand is the sum of a base register and an increment register shifted left or right by the given number of bits. The register names and the shift information are surrounded by square brackets:
ldr r0, [r1, r2, lsl #5] // Load r0 with [r1+(r2*32)]
str r0, [r1, r2, lsr #2] // Store r0 to [r1+(r2/4)]
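The difference between the plain offset, pre-indexed, and post-indexed forms lies in which address is accessed and whether the base register is updated. The following Python sketch models a load under each form, using a dictionary as a stand-in for memory. The function, its mode parameter, and the sample addresses are all hypothetical:

```python
def ldr(regs, memory, rd, rn, offset=0, mode="offset"):
    """Model ldr with an immediate offset.

    mode "offset": ldr rd, [rn, #offset]   (base register unchanged)
    mode "pre":    ldr rd, [rn, #offset]!  (base updated, new address used)
    mode "post":   ldr rd, [rn], #offset   (old address used, then base updated)
    """
    base = regs[rn]
    address = base if mode == "post" else base + offset
    regs[rd] = memory[address]
    if mode in ("pre", "post"):
        regs[rn] = base + offset  # write the computed address back to the base
    return address

memory = {0x1000: 111, 0x1020: 222}

for mode in ("offset", "pre", "post"):
    regs = {"r0": 0, "r1": 0x1000}
    ldr(regs, memory, "r0", "r1", 32, mode)
    print(mode, regs["r0"], hex(regs["r1"]))
```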
The next section introduces the general categories of ARM instructions.
The instructions described in this section are from the T32 instruction set.
These instructions move data between registers and memory:
ldr, str: Copies an 8-bit (suffix b for byte), 16-bit (suffix h for halfword), or 32-bit value between a register and a memory location. ldr copies the value from memory to a register, while str copies a register to memory. ldrb copies 1 byte into the lower 8 bits of a register.
ldm, stm: Loads or stores multiple registers. Copies 1 to 16 registers to or from memory. For example, the instruction ldm r1, {r0, r2, r4-r11} loads registers r0, r2, and r4 through r11 from contiguous memory beginning at the address provided in r1. Any subset of registers can be loaded from, or stored to, memory using these instructions.
These instructions store data to, and retrieve data from, the stack:
push, pop: Pushes or pops any subset of the registers to or from the stack, for example, push {r0, r2, r4-r11}. These instructions are variants of the ldm and stm instructions.
These instructions transfer data between registers:
mov, mvn: Moves a register (mov), or its bit-inversion (mvn), to the destination register.
These instructions mostly have one destination register and two source operands. The first source operand is a register, while the second can be a register, a shifted register, or an immediate value.
Including the s suffix causes these instructions to set the condition flags. For example, adds performs addition and sets the condition flags:
add, sub: Adds or subtracts two numbers. For example, add r0, r1, r2, lsl #3 is equivalent to the expression r0 = r1 + (r2 × 2^3). The lsl operator performs a logical shift left of the second source operand, r2.
adc, sbc: Adds or subtracts two numbers with carry or borrow.
neg: Negates a number.
and, orr, eor: Performs logical AND, OR, or XOR operations.
orn, eon: Performs logical OR or XOR operations between the first operand and the bitwise-inverted second operand.
bic: Clears selected bits in a register.
mul: Multiplies two numbers.
mla: Multiplies two numbers and accumulates the result. This instruction has an additional operand to specify the accumulator register.
sdiv, udiv: Signed and unsigned division, respectively.
These instructions compare two values and set the condition flags based on the result of the comparison. The s suffix is not needed with these instructions to set the condition codes:
cmp: Subtracts two numbers, discards the result, and sets the condition flags. This is equivalent to a subs instruction, except the result is discarded.
cmn: Adds two numbers, discards the result, and sets the condition flags. This is equivalent to an adds instruction, except the result is discarded.
tst: Performs a bitwise AND, discards the result, and sets the condition flags. This is equivalent to an ands instruction, except the result is discarded.
These instructions transfer control conditionally or unconditionally to a target address:
b: Performs an unconditional branch to the target address.
bcc: Branches based on one of these condition codes as cc: eq (equal), ne (not equal), gt (greater than), lt (less than), ge (greater or equal), le (less or equal), cs (carry set), cc (carry clear), mi (minus: N flag = 1), pl (plus: N flag = 0), vs (V flag set), vc (V flag clear), hi (higher: C flag set and Z flag clear), or ls (lower or same: C flag clear or Z flag set).
bl: Branches to the specified address and stores the address of the next instruction in the link register (r14, also called lr). The called function returns to the calling code with the mov pc, lr instruction.
bx: Branches and selects the instruction set. If bit 0 of the target address is 1, T32 mode is entered. If bit 0 is clear, ARM mode is entered. Bit 0 of instruction addresses must always be zero due to ARM’s address alignment requirements. This frees bit 0 to select the instruction set.
blx: Branches with a link and selects the instruction set. This instruction combines the functions of the bl and bx instructions.
This instruction allows user-mode code to initiate a call to supervisor mode:
svc (supervisor call): Initiates a software interrupt that causes the supervisor mode exception handler to process a system service request.
This instruction is used by debuggers during software development:
bkpt (trigger a breakpoint): This instruction takes a 16-bit operand for use by debugging software to identify the breakpoint.
Many ARM instructions support conditional execution, which uses the same condition codes as the branch instructions to determine whether individual instructions are executed. If an instruction’s condition evaluates false, the instruction is processed as a no-op. The condition code is appended to the instruction mnemonic. This conditional execution mechanism is formally known as predication.
For example, this function converts a nibble (the lower 4 bits of a byte) into an ASCII character version of the nibble:
// Convert the low 4 bits of r0 to an ascii character in r0
nibble2ascii:
and r0, #0xF
cmp r0, #10
addpl r0, r0, #('A' - 10)
addmi r0, r0, #'0'
mov pc, lr
The cmp instruction subtracts 10 from the nibble in r0 and sets the N flag if r0 is less than 10. If r0 is greater than or equal to 10, the N flag is clear.
If N is clear, the addpl instruction executes (pl means “plus,” as in “not negative”), and the addmi instruction does not execute. If N is set, the addpl instruction does not execute and the addmi instruction executes. After this sequence completes, r0 contains a character in the range 0-9 or A-F.
The use of conditional instruction execution helps keep the instruction pipeline flowing efficiently by avoiding branches.
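The behavior of the nibble2ascii routine can be traced step by step in a higher-level language. This Python sketch mirrors each assembly instruction with one statement and keeps the N flag as an explicit variable, so that both predicated additions test the same flag value, just as they do in the assembly code. It is a model of the logic only, not of the processor:

```python
def nibble2ascii(r0: int) -> str:
    """Model the nibble2ascii routine one instruction at a time."""
    r0 &= 0xF                  # and r0, #0xF
    n_flag = (r0 - 10) < 0     # cmp r0, #10  (N set when r0 - 10 is negative)
    if not n_flag:             # addpl executes only when N is clear
        r0 += ord("A") - 10
    if n_flag:                 # addmi executes only when N is set
        r0 += ord("0")
    return chr(r0)             # mov pc, lr  (return with the result in r0)

print("".join(nibble2ascii(n) for n in range(16)))  # 0123456789ABCDEF
```

Because neither add instruction carries the s suffix, the flags set by cmp remain unchanged for the second test.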
ARM processors optionally support a range of SIMD and floating-point instructions. Additional instructions are provided that are generally only used during system configuration.
The ARM assembly example in this section uses the syntax of the GNU Assembler, provided with the Android Studio integrated development environment (IDE). Other assemblers may use a different syntax. As with the Intel syntax for the x86 and x64 assembly languages, the operand order for most instructions is the destination followed by the source.
The ARM assembly language source file for the hello program is as follows:
.text
.global _start
_start:
mov r0, #1 // int fd 1 (stdout)
ldr r1, =message // const void *buf
mov r2, #count // size_t count
mov r7, #4 // syscall 4 (sys_write)
svc 0
mov r0, #0 // int status (0=OK)
mov r7, #1 // syscall 1 (sys_exit)
svc 0
.data
message:
.ascii "Hello, Computer Architect!"
count = . - message
This file, named hello_arm.s, is assembled and linked to form the executable program hello_arm with the following commands. These commands use the development tools provided with the Android Studio Native Development Kit (NDK). The commands assume the Windows PATH environment variable has been set to include the NDK tools directory:
arm-linux-androideabi-as -al=hello_arm.lst -o hello_arm.o hello_arm.s
arm-linux-androideabi-ld -o hello_arm hello_arm.o
The components of these commands are:
arm-linux-androideabi-as runs the assembler
-al=hello_arm.lst creates a listing file named hello_arm.lst
-o hello_arm.o creates an object file named hello_arm.o
hello_arm.s is the name of the assembly language source file
arm-linux-androideabi-ld runs the linker
-o hello_arm creates an executable file named hello_arm
hello_arm.o is the name of the object file provided as input to the linker
This is a portion of the hello_arm.lst listing file generated by the assembler command:
1 .text
2 .global _start
3
4 _start:
5 0000 0100A0E3 mov r0, #1 // int fd 1 (stdout)
6 0004 14109FE5 ldr r1, =message // const void *buf
7 0008 1A20A0E3 mov r2, #count // size_t count
8 000c 0470A0E3 mov r7, #4 // syscall 4 (sys_write)
9 0010 000000EF svc 0
10
11 0014 0000A0E3 mov r0, #0 // int status (0=OK)
12 0018 0170A0E3 mov r7, #1 // syscall 1 (sys_exit)
13 001c 000000EF svc 0
14
15 .data
16 message:
17 0000 48656C6C .ascii "Hello, Computer Architect!"
17 6F2C2043
17 6F6D7075
17 74657220
17 41726368
18 count = . - message
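The opcode column of this listing shows each instruction's bytes in memory order, which is little-endian. As a rough illustration, the Python sketch below reassembles three of the listed mov instructions into 32-bit words and extracts the fields of the ARM data-processing immediate encoding: the condition code in bits 28-31, the destination register in bits 12-15, and the rotated 8-bit immediate in bits 0-11. The function name is hypothetical, and the decoder handles only this one instruction form:

```python
MASK32 = 0xFFFFFFFF

def decode_mov_immediate(listing_hex: str):
    """Decode an ARM-state 'mov rd, #imm' from its little-endian listing bytes."""
    word = int.from_bytes(bytes.fromhex(listing_hex), "little")
    # Bits 27-21 are 0011101 for a mov with an immediate operand
    if ((word >> 21) & 0x7F) != 0b0011101:
        return None
    imm8 = word & 0xFF
    rotation = 2 * ((word >> 8) & 0xF)  # right rotation, always even
    value = ((imm8 >> rotation) | (imm8 << (32 - rotation))) & MASK32
    return {"cond": word >> 28, "rd": (word >> 12) & 0xF, "value": value}

for opcode in ("0100A0E3", "1A20A0E3", "0470A0E3"):
    print(opcode, decode_mov_immediate(opcode))
```

The second instruction decodes to the value 26, which is the length in bytes of the message string. The condition field of 14 (Eh) in each instruction means "always execute".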
You can run this program on an Android device with Developer options enabled. We won’t go into the procedure for enabling those options here, but you can learn more about that topic with an internet search.
This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:
C:\>adb push hello_arm /data/local/tmp/hello_arm
C:\>adb shell chmod +x /data/local/tmp/hello_arm
C:\>adb shell /data/local/tmp/hello_arm
Hello, Computer Architect!
These commands use the Android Debug Bridge (adb) tool included with Android Studio. Although the hello_arm program runs on the Android device, output from the program is sent back to the PC and appears in the command window.
The next section introduces the 64-bit ARM architecture, an extension of the 32-bit ARM architecture.
The 64-bit version of the ARM architecture, named AArch64, was announced in 2011. This architecture has 31 general-purpose 64-bit registers, 64-bit addressing, a 48-bit virtual address space, and a new instruction set named A64.
A64 is a distinct instruction set rather than an extension of the 32-bit instruction encodings. However, ARMv8-A processors that also implement the 32-bit execution state can run existing 32-bit code unmodified.
Instructions are 32 bits wide, and most operands are 32 or 64 bits. The A64 register functions differ in some respects from 32-bit mode: the program counter is no longer directly accessible as a register and an additional register is provided that always returns an operand value of zero.
At the user privilege level, most A64 instructions have the same mnemonics as the corresponding 32-bit instructions. The assembler determines whether an instruction operates on 64-bit or 32-bit data based on the operands provided. The following rules determine the operand length and register size used by an instruction:
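64-bit register names begin with the letter X; for example, x0.
32-bit register names begin with the letter W; for example, w1.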
When working with 32-bit registers, the following rules apply:
A64 is a load/store architecture with the same instruction mnemonics for memory operations (ldr and str) as 32-bit mode. There are some differences and limitations in comparison to the 32-bit load and store instructions:
There are no ldm or stm instructions for loading or storing multiple registers in a single instruction. Instead, A64 adds the ldp and stp instructions for loading or storing a pair of registers in a single instruction.
Stack operations are significantly different in A64. Perhaps the biggest difference in this area is that the stack pointer must maintain 16-byte alignment when accessing data.
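One practical consequence of the 16-byte alignment rule is that stack allocations are rounded up to a multiple of 16 bytes. The following Python sketch shows the usual rounding arithmetic; the function name is illustrative, and the alignment parameter is assumed to be a power of two:

```python
def align_up(size: int, alignment: int = 16) -> int:
    """Round `size` up to the next multiple of `alignment` (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

for size in (0, 8, 16, 24):
    print(size, "->", align_up(size))
```

For example, a function that needs 24 bytes of local storage would adjust the stack pointer by 32 bytes to keep it aligned.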
This is the 64-bit ARM assembly language source file for the hello program:
.text
.global _start
_start:
// Print the message to file 1 (stdout) with syscall 64
mov x0, #1
ldr x1, =msg
mov x2, #msg_len
mov x8, #64
svc 0
// Exit the program with syscall 93, returning status 0
mov x0, #0
mov x8, #93
svc 0
.data
msg:
.ascii "Hello, Computer Architect!"
msg_len = . - msg
This file, named hello_arm64.s, is assembled and linked to form the executable hello_arm64 program with the following commands. These commands use the 64-bit development tools provided with the Android Studio NDK. The use of these commands assumes the Windows PATH environment variable has been set to include the tools directory:
aarch64-linux-android-as -al=hello_arm64.lst -o hello_arm64.o hello_arm64.s
aarch64-linux-android-ld -o hello_arm64 hello_arm64.o
The components of these commands are:
aarch64-linux-android-as runs the assembler
-al=hello_arm64.lst creates a listing file named hello_arm64.lst
-o hello_arm64.o creates an object file named hello_arm64.o
hello_arm64.s is the name of the assembly language source file
aarch64-linux-android-ld runs the linker
-o hello_arm64 creates an executable file named hello_arm64
hello_arm64.o is the name of the object file provided as input to the linker
This is a portion of the hello_arm64.lst listing file generated by the assembler:
1 .text
2 .global _start
3
4 _start:
5 // Print the message to file 1 (stdout) with syscall 64
6 0000 200080D2 mov x0, #1
7 0004 E1000058 ldr x1, =msg
8 0008 420380D2 mov x2, #msg_len
9 000c 080880D2 mov x8, #64
10 0010 010000D4 svc 0
11
12 // Exit the program with syscall 93, returning status 0
13 0014 000080D2 mov x0, #0
14 0018 A80B80D2 mov x8, #93
15 001c 010000D4 svc 0
16
17 .data
18 msg:
19 0000 48656C6C .ascii "Hello, Computer Architect!"
19 6F2C2043
19 6F6D7075
19 74657220
19 41726368
20 msg_len = . - msg
You can run this program on an Android device with Developer options enabled, as described earlier. This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:
C:\>adb push hello_arm64 /data/local/tmp/hello_arm64
C:\>adb shell chmod +x /data/local/tmp/hello_arm64
C:\>adb shell /data/local/tmp/hello_arm64
Hello, Computer Architect!
This completes our introduction to the 32-bit and 64-bit ARM architectures.
Having completed this chapter, you should have a good understanding of the high-level architectures and features of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, and assembly languages.
The x86 and x64 architectures represent a mostly CISC approach to processor design, using variable-length instructions that can take many cycles to execute, a lengthy pipeline, and (in x86) a limited number of processor registers.
The ARM architectures, on the other hand, implement RISC processors with mostly single-cycle instruction execution, a large register set, and (somewhat) fixed-length instructions. Early versions of ARM had pipelines as short as three stages, though later generations have considerably more stages.
Is one of these architectures better than the other, in a general sense? It may be that each is better in some ways, and system designers must make their selection of processor architecture based on the specific needs of the system under development. Of course, there is a great deal of inertia behind the use of x86/x64 processors in personal computing, business computing, and server applications. Similarly, there is much history behind the dominance of ARM processors in smart personal devices and embedded systems. Many factors beyond raw performance must be considered in the processor selection process when designing a new computer or smart device.
In the next chapter, we’ll look at the RISC-V architecture. RISC-V was developed from a clean sheet, incorporating lessons learned from the history of processor development and without any of the baggage required to maintain support for decades-old legacy designs.
In the Windows search box in the Task bar, begin typing Developer Command Prompt for VS 2022. When the app appears in the search menu, select it to open Command Prompt.
Create a file named hello_x86.asm with the content shown in the source listing in the x86 assembly language section of this chapter.
Build the program using the command shown in the x86 assembly language section of this chapter and run it. Verify that the output Hello, Computer Architect! appears on the screen.
In the Windows search box in the Task bar, begin typing x64 Native Tools Command Prompt for VS 2022. When the app appears in the search menu, select it to open Command Prompt.
Create a file named hello_x64.asm with the content shown in the source listing in the x64 assembly language section of this chapter.
Build the program using the command shown in the x64 assembly language section of this chapter and run it. Verify that the output Hello, Computer Architect! appears on the screen.
Locate the following files under the SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: arm-linux-androideabi-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
set PATH=%PATH%;%LOCALAPPDATA%\Android\Sdk\ndk\23.0.7599858\toolchains\llvm\prebuilt\windows-x86_64\bin
Create a file named hello_arm.s with the content shown in the source listing in the 32-bit ARM assembly language section of this chapter.
Build the program using the commands shown in the 32-bit ARM assembly language section of this chapter.
Enable Developer Options on an Android phone or tablet. Search the internet for instructions on how to do this.
Connect your Android device to the computer with a USB cable.
Copy the program executable image to the phone using the commands shown in the 32-bit ARM assembly language section of this chapter and run the program. Verify that the output Hello, Computer Architect! appears on the host computer screen.
Disable Developer Options on your Android phone or tablet.
Locate the following files under the SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: aarch64-linux-android-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
set PATH=%PATH%;%LOCALAPPDATA%\Android\Sdk\ndk\23.0.7599858\toolchains\llvm\prebuilt\windows-x86_64\bin;%LOCALAPPDATA%\Android\Sdk\platform-tools
Create a file named hello_arm64.s with the content shown in the source listing in the 64-bit ARM assembly language section of this chapter.
Build the program using the commands shown in the 64-bit ARM assembly language section of this chapter.
Enable Developer Options on an Android phone or tablet.
Connect your Android device to the computer with a USB cable.
Copy the program executable image to the phone using the commands shown in the 64-bit ARM assembly language section of this chapter and run the program. Verify that the output Hello, Computer Architect! appears on the host computer screen.
Disable Developer Options on your Android phone or tablet.
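A common stumbling block in the two ARM exercises is a PATH setting that does not actually expose the assembler or adb. As a minimal sketch, assuming Python 3 is installed on the host computer, the script below checks whether each tool named in the exercises can be found through PATH. The names `REQUIRED_TOOLS`, `find_tools`, and `missing_tools` are hypothetical helpers introduced for this illustration and are not part of the Android SDK.

```python
import shutil

# Tool names taken from the exercise text. On Windows, shutil.which() consults
# PATHEXT, so the .exe suffix does not need to be spelled out here.
REQUIRED_TOOLS = [
    "arm-linux-androideabi-as",   # 32-bit ARM assembler
    "aarch64-linux-android-as",   # 64-bit ARM assembler
    "adb",                        # Android Debug Bridge
]


def find_tools(names, path=None):
    """Map each tool name to its resolved location, or None if it is not found.

    When path is None, the PATH environment variable is searched.
    """
    return {name: shutil.which(name, path=path) for name in names}


def missing_tools(names, path=None):
    """Return the tool names that could not be found."""
    return [name for name, location in find_tools(names, path).items() if location is None]


if __name__ == "__main__":
    for name, location in find_tools(REQUIRED_TOOLS).items():
        print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

Run the script from the same Command Prompt window in which the set PATH command was entered, because that command changes PATH only for the current session.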
Join the book’s Discord workspace for a monthly Ask me Anything session with the author: https://discord.gg/7h8aNRhRuY