M
Machine language programmer, L-17 to L-18
Machine memory, Virtual Machines,
110
Macro-op fusion, Intel Core i7,
237–238
Magnetic storage
cost vs. access time,
D-3
historical background, L-77 to L-79
Mail servers, benchmarking, D-20
Main Memory
address translation,
B-46
GPUs and coprocessors,
330
memory hierarchy basics,
76
memory hierarchy design,
72
Multimedia SIMD vs. GPUs,
312
multiprocessor cache coherence,
352
paging vs. segmentation,
B-43
processor performance calculations,
218–219
server energy efficiency,
462
symmetric shared-memory multiprocessors,
363
Manufacturing cost
chip fabrication case study,
61–62
MapReduce
WSC cost-performance,
474
Mask Registers
NVIDIA GPU computational structures,
291
Massively parallel processors (MPPs)
cluster history, L-62, L-72 to L-73
system area network history, F-100 to F-101
Matrix multiplication
multidimensional arrays in vector architectures,
278
Mauchly, John, L-2 to L-3, L-5, L-19
Maximum transfer unit, network interfaces, F-7 to F-8
Maximum vector length (MVL)
Multimedia SIMD extensions,
282
MCF
compiler optimizations,
A-29
MCP operating system, L-16
Mean time between failures (MTBF)
Mean time to failure (MTTF)
computer system power consumption case study,
63–64
dependability benchmarks, D-21
example calculations,
34–35
I/O subsystem design, D-59 to D-61
RAID reconstruction, D-55 to D-57
TB-80 cluster, D-40 to D-41
Mean time to repair (MTTR)
dependability benchmarks, D-21
RAID reconstruction, D-56
Mean time until data loss (MTDL), RAID reconstruction, D-55 to D-57
Media, interconnection networks, F-9 to F-12
Media extensions, DSPs, E-10 to E-11
Memory access
ARM Cortex-A8 example,
117
basic MIPS pipeline,
C-36
data hazard stall minimization,
C-17,
C-19
distributed-memory multiprocessor,
I-32
exception stopping/restarting,
C-46
instruction set complications,
C-49
integrated instruction fetch units,
208
MIPS data transfers,
A-34
multimedia instruction compiler support,
A-31
shared-memory workloads,
372
simple RISC implementation,
C-6
vector architectures,
G-10
Memory addressing
ALU immediate operands,
A-12
compiler-based speculation, H-32
displacement values,
A-12
immediate value distribution,
A-13
vector architectures,
G-10
Memory banks
See also Banked memory
multiprocessor architecture,
347
shared-memory multiprocessors,
363
vector load/store unit bandwidth,
276–277
vector systems, G-9 to G-11
Memory bus (M-bus)
interconnection networks, F-88
Memory consistency
compiler optimization,
396
development of models, L-64
directory-based cache coherence protocol basics,
382
multiprocessor cache coherency,
353
relaxed consistency models,
394–395
single-chip multicore processor case study,
412–418
speculation to hide latency,
396–397
Memory-constrained scaling, scientific applications on parallel processors, I-33
Memory hierarchy
block placement issues,
B-7
cache optimization
basic optimizations,
B-40
pipelined cache access,
82
interconnection network protection, F-87 to F-88
Pentium vs. Opteron protection,
B-57
virtual memory
fast address translation,
B-46
Memory hierarchy design
Alpha 21264 floorplan,
143
cache optimization
compiler-controlled prefetching,
92–95
compiler optimizations,
87–90
critical word first,
86–87
hardware instruction prefetching,
91–92,
92
pipelined cache access,
82
write buffer merging,
87,
88
cache performance prediction,
125–126
cache size and misses per instruction,
126
DDR2 SDRAM timing diagram,
139
highly parallel memory systems,
133–136
high memory bandwidth,
126
instruction miss benchmarks,
127
instruction simulation,
126
Intel Core i7 three-level cache hierarchy,
118
Intel Core i7 TLB structure,
118
Intel 80x86 virtualization issues,
128
system call virtualization/paravirtualization performance,
141
Virtual Machines ISA support,
109–110
Virtual Machines protection,
107–108
Virtual Machines and virtual memory and I/O,
110–111
VMM on nonvirtualizable ISA,
128–129
Memory Interface Unit
vector processor example,
310
Memoryless, definition, D-28
Memory mapping
segmented virtual memory,
B-52
virtual memory definition,
B-42
Memory-memory instruction set architecture, ISA classification,
A-3,
A-5
Memory protection
Pentium vs. Opteron,
B-57
Memory stall cycles
average memory access time,
B-17
miss rate calculation,
B-6
performance equations,
B-22
Memory system
computer architecture,
15
distributed shared-memory,
379,
418
ILP,
245
hardware vs. software speculation,
221–222
multiprocessor architecture,
347
multiprocessor cache coherence,
352
price/performance/power considerations,
53
shared-memory multiprocessors,
363
T1 multithreading unicore performance,
227
vector architectures, G-9 to G-11
Memory technology basics
DRAM and DIMM characteristics,
101
SDRAM power consumption,
102,
103
Mesh interface unit (MIU), Intel SCCC, F-70
Mesh network
dimension-order routing, F-47 to F-48
Mesh topology
NEWS communication, F-42 to F-43
Message ID, packet header, F-8, F-16
Message-passing communication
historical background, L-60 to L-61
large-scale multiprocessors, I-5 to I-6
Message Passing Interface (MPI)
lack in shared-memory multiprocessors, I-5
Messages
adaptive routing, F-93 to F-94
coherence maintenance,
381
interconnection networks, F-6 to F-9
zero-copy protocols, F-91
Microarchitecture
as architecture component,
15–16
out-of-order example,
253
Microbenchmarks
disk array deconstruction, D-51 to D-55
disk deconstruction, D-48 to D-51
Microfusion, Intel Core i7 micro-op buffer,
238
Micro-ops
processor clock rates,
244
Microprocessor overview
integrated circuit improvements,
power and energy system trends,
21–23
recent advances, L-33 to L-34
Microsoft DirectX, L-51 to L-52
Microsoft Windows
RAID benchmarks,
D-22, D-22 to D-23
time/volume/commoditization impact,
28
Microsoft Windows 2008 Server
real-world considerations,
52–55
Migration, cache coherent multiprocessors,
354
Millions of floating-point operations per second (MFLOPS)
early performance measures, L-7
parallel processing debates, L-57 to L-58
SIMD computer history, L-55
SIMD supercomputer development, L-43
vector performance measures, G-15 to G-16
MIMD (Multiple Instruction Streams, Multiple Data Streams)
first vector computers, L-46, L-48
GPUs vs. vector architectures,
310
with Multimedia SIMD, vs. GPU,
324–330
multiprocessor architecture,
346–348
speedup via parallelism,
263
Minicomputers, replacement by microprocessors,
3–4
MIPS (Microprocessor without Interlocked Pipeline Stages)
branch predictor correlation,
163
conditional branches, K-11
conditional instructions, H-27
control flow instructions,
14
dynamic scheduling with Tomasulo’s algorithm,
171,
173
early pipelined CPUs, L-26
FP unit with Tomasulo’s algorithm,
173
instruction execution issues, K-81
instruction formats, core instructions, K-6
ISA example
addressing modes for data transfer,
A-34
arithmetic/logical instructions,
A-37
load-store instructions,
A-36
Livermore Fortran kernel performance,
331
multiple-issue processor history, L-29
performance measurement history, L-6 to L-7
processor performance calculations,
218–219
registers and usage conventions,
12
RISC instruction set lineage,
K-43
scoreboard components,
C-76
Sony PlayStation 2 Emotion Engine, E-17
unaligned word read instructions,
K-26
unpipelined functional units,
C-52
vs. VAX, K-65 to K-66,
K-75,
K-82
MIPS16
arithmetic/logical instructions,
K-24
data transfer instructions,
K-23
embedded instruction format,
K-8
instructions, K-14 to K-16
multiply-accumulate,
K-20
unique instructions, K-40 to K-42
MIPS32, vs. VAX sort,
K-80
MIPS64
arithmetic/logical instructions,
K-11
conditional branches,
K-17
data transfer instructions,
K-10
instruction list, K-26 to K-27
instruction set architecture formats,
14
nonaligned data transfers, K-24 to K-26
RISC instruction set,
C-4
MIPS2000, instruction benchmarks,
K-82
MIPS 3010, chip layout,
J-59
MIPS core
compare and conditional branch, K-9 to K-16
equivalent RISC instructions
arithmetic/logical instructions,
K-15
common extensions, K-19 to K-24
control instructions,
K-12,
K-16
embedded RISC data transfers,
K-14
MIPS R3010
arithmetic functions, J-58 to J-61
floating-point exceptions, J-35
MIPS R4000
early pipelined CPUs, L-27
MIPS R8000, precise exceptions,
C-59
Misprediction rate
branch-prediction buffers,
C-29
predictors on SPEC89,
166
profile-based predictor,
C-27
SPECCPU2006 benchmarks,
167
Mispredictions
branch-target buffers,
205
hardware-based speculation,
190
hardware vs. software speculation,
221
integer vs. FP programs,
212
Misses per instruction
application/OS statistics,
B-59
memory hierarchy basics,
75
performance impact calculations,
B-18
shared-memory workloads,
372
strided access-TLB interactions,
323
Miss penalty
compiler-controlled prefetching,
92–95
critical word first,
86–87
hardware prefetching,
91–92
ILP speculative execution,
223
memory hierarchy basics,
75–76
processor performance calculations,
218–219
Miss rate
AMD Opteron data cache,
B-15
compiler-controlled prefetching,
92–95
compiler optimizations,
87–90
early IBM computers, L-10 to L-11
hardware prefetching,
91–92
memory hierarchy basics,
75–76
processor performance calculations,
218–219
scientific workloads
distributed-memory multiprocessors,
I-28 to I-30
symmetric shared-memory multiprocessors, I-22,
I-23 to I-25
shared-memory multiprogramming workload,
376,
376–377
single vs. multiple thread executions,
228
Sun T1 multithreading unicore performance,
228
vs. virtual addressed cache size,
B-37
MIT Raw, characteristics,
F-73
Mitsubishi M32R
arithmetic/logical instructions,
K-24
data transfer instructions,
K-23
embedded instruction format,
K-8
multiply-accumulate,
K-20
unique instructions, K-39 to K-40
Mixed cache
AMD Opteron example,
B-15
Mixer, radio receiver,
E-23
M/M/1 model
example,
D-32, D-32 to D-33
RAID performance prediction, D-57
sample calculations, D-33
M/M/2 model, RAID performance prediction, D-57
Modified-Exclusive-Shared-Invalid (MESI) protocol, characteristics,
362
Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol, characteristics,
362
Modified state
directory-based cache coherence protocol basics,
380
large-scale multiprocessor cache coherence, I-35
snooping coherence protocol,
358–359
Modula-3, integer division/remainder,
J-12
Module availability, definition,
34
Module reliability, definition,
34
Moore’s law
flawed architectures,
A-45
interconnection networks, F-70
and microprocessor dominance,
3–4
point-to-point links and switches, D-34
Mortar shot graphs, multiprocessor performance measurement,
405–406
Motion JPEG encoder, Sanyo VPC-SX500 digital camera, E-19
Motorola 68882, floating-point precisions, J-33
MPEG
Multimedia SIMD Extensions history, L-49
Sanyo VPC-SX500 digital camera, E-19
Sony PlayStation 2 Emotion Engine, E-17
Multibanked caches
cache optimization,
85–86
Multichip modules, OCNs, F-3
Multicomputers
historical background, L-64 to L-65
Multicore processors
architecture goals/requirements,
15
centralized shared-memory multiprocessor structure,
347
directory-based cache coherence,
380
directory-based coherence,
381,
419
multichip
cache and memory states,
419
performance milestones,
20
snooping cache coherence implementation,
365
uniform memory access,
364
write invalidate protocol implementation,
356–357
Multics protection software, L-9
Multicycle operations, MIPS pipeline
Multiflow processor, L-30, L-32
Multigrid methods, Ocean application, I-9 to I-10
Multilevel caches
cache optimizations,
B-22
centralized shared-memory architectures,
351
memory hierarchy basics,
76
memory hierarchy history, L-11
miss rate vs. cache size,
B-33
Multimedia SIMD vs. GPU,
312
performance equations,
B-22
Multilevel exclusion, definition,
B-35
Multilevel inclusion
memory hierarchy history, L-11
Multimedia applications
desktop processor support,
E-11
vector architectures,
267
Multimedia Extensions (MMX)
desktop/server RISCs, K-16 to K-19
Multimedia SIMD Extensions
historical background, L-49 to L-50
256-bit-wide operations,
282
Multimedia user interfaces, PMDs,
Multimode fiber, interconnection networks, F-9
Multipass array multiplier, example,
J-51
Multiple Instruction Streams, Single Data Stream (MISD), definition,
10
Multiple-issue processors
with dynamic scheduling and speculation,
197–202
early development, L-28 to L-30
instruction fetch bandwidth,
202–203
integrated instruction fetch units,
207
microarchitectural techniques case study,
247–254
Tomasulo’s algorithm,
183
Multiple lanes technique
vector performance, G-7 to G-9
vector performance calculations, G-8
Multiple paths, ILP limitation studies,
220
Multiple-precision addition, J-13
Multiply-accumulate (MAC)
Multiply operations
floating point
multiplication, J-17 to J-20
integer arithmetic
with many adders, J-50 to J-54
multipass array multiplier,
J-51
n-bit unsigned integers,
J-4
signed-digit addition table,
J-54
with single adder, J-47 to J-49,
J-48
integer shifting over zeros, J-45 to J-47
PA-RISC instructions, K-34 to K-35
unfinished instructions,
179
Multiprocessor basics
architectural issues and approaches,
346–348
architecture goals/requirements,
15
architecture and software development,
407–409
communication calculations,
350
embedded systems, E-14 to E-15
parallel processing challenges,
349–351
point-to-point example,
413
Multiprocessor history
bus-based coherent multiprocessors, L-59 to L-60
large-scale multiprocessors, L-60 to L-61
parallel processing debates, L-56 to L-58
recent advances and developments, L-58 to L-60
SIMD computers, L-55 to L-56
synchronization and consistency models, L-64
Multiprogramming
software optimization,
408
workload execution time,
375
Multistage interconnection networks (MINs)
bidirectional, F-33 to F-34
crossbar switch calculations, F-31 to F-32
vs. direct network costs, F-92
system area network history, F-100 to F-101
topology, F-30 to F-31, F-38 to F-39
Multistage switch fabrics, topology, F-30
Multi-Streaming Processor (MSP)
Cray X1, G-21 to G-23,
G-22, G-23 to G-24
first vector computers, L-46
Multithreaded SIMD Processor
Fermi GPU architectural innovations,
305–308
Fermi GPU block diagram,
307
NVIDIA GPU computational structures,
291
Multithreaded vector processor
Fermi GPU comparison,
305
Multithreading
historical background, L-34 to L-35
memory hierarchy basics,
75–76
MXP processor, components, E-14
Myrinet SAN, F-67
cluster history, L-62 to L-63, L-73
system area network history, F-100