M

Machine language programmer, L-17 to L-18
Machine memory, Virtual Machines, 110
Macro-op fusion, Intel Core i7, 237–238
Magnetic storage
access time, D-3
cost vs. access time, D-3
historical background, L-77 to L-79
Mail servers, benchmarking, D-20
Main Memory
addressing modes, A-10
address translation, B-46
arithmetic intensity example, 286, 286–288
block placement, B-44
cache function, B-2
cache optimization, B-30, B-36
coherence protocol, 362
definition, 292, 309
DRAM, 17
gather-scatter, 329
GPU vs. MIMD, 327
GPUs and coprocessors, 330
GPU threads, 332
ILP considerations, 245
interlane wiring, 273
linear speedups, 407
memory hierarchy basics, 76
memory hierarchy design, 72
memory mapping, B-42
MIPS operations, A-36
Multimedia SIMD vs. GPUs, 312
multiprocessor cache coherence, 352
paging vs. segmentation, B-43
partitioning, B-50
processor performance calculations, 218–219
RISC code size, A-23
server energy efficiency, 462
symmetric shared-memory multiprocessors, 363
vector processor, G-25
vs. virtual memory, B-3, B-41
virtual memory block identification, B-44 to B-45
virtual memory writes, B-45 to B-46
VLIW, 196
write-back, B-11
write process, B-45
Manufacturing cost
chip fabrication case study, 61–62
cost trends, 27
modern processors, 62
vs. operation cost, 33
MapReduce
cloud computing, 455
cost calculations, 458–460, 459
Google usage, 437
reductions, 321
WSC batch processing, 437–438
WSC cost-performance, 474
Mark-I, L-3 to L-4, L-6
Mark-II, L-4
Mark-III, L-4
Mark-IV, L-4
Mask Registers
basic operation, 275–276
definition, 309
Multimedia SIMD, 283
NVIDIA GPU computational structures, 291
vector compilers, 303
vector vs. GPU, 311
VMIPS, 267
MasPar, L-44
Massively parallel processors (MPPs)
characteristics, I-45
cluster history, L-62, L-72 to L-73
system area network history, F-100 to F-101
Matrix300 kernel
definition, 56
prediction buffer, C-29
Matrix multiplication
benchmarks, 56
LU kernel, I-8
multidimensional arrays in vector architectures, 278
Mauchly, John, L-2 to L-3, L-5, L-19
Maximum transfer unit, network interfaces, F-7 to F-8
Maximum vector length (MVL)
Multimedia SIMD extensions, 282
vector vs. GPU, 311
VLRs, 274–275
McCreight, Ed, F-99
MCF
compiler optimizations, A-29
data cache misses, B-10
Intel Core i7, 240–241
MCP operating system, L-16
Mean time between failures (MTBF)
fallacies, 56–57
RAID, L-79
SLA states, 34
Mean time to failure (MTTF)
computer system power consumption case study, 63–64
dependability benchmarks, D-21
disk arrays, D-6
example calculations, 34–35
I/O subsystem design, D-59 to D-61
RAID reconstruction, D-55 to D-57
SLA states, 34
TB-80 cluster, D-40 to D-41
WSCs vs. servers, 434
Mean time to repair (MTTR)
dependability benchmarks, D-21
disk arrays, D-6
RAID 6, D-8 to D-9
RAID reconstruction, D-56
Mean time until data loss (MTDL), RAID reconstruction, D-55 to D-57
Media, interconnection networks, F-9 to F-12
Media extensions, DSPs, E-10 to E-11
Mellanox MHEA28-XT, F-76
Memory access
ARM Cortex-A8 example, 117
basic MIPS pipeline, C-36
vs. block size, B-28
cache hit calculation, B-5 to B-6
Cray Research T3D, F-87
data hazards requiring stalls, C-19 to C-21
data hazard stall minimization, C-17, C-19
distributed-memory multiprocessor, I-32
exception stopping/restarting, C-46
hazards and forwarding, C-56 to C-57
instruction set complications, C-49
integrated instruction fetch units, 208
MIPS data transfers, A-34
MIPS exceptions, C-48 to C-49
MIPS pipeline control, C-37 to C-39
MIPS R4000, C-65
multimedia instruction compiler support, A-31
pipeline branch issues, C-40, C-42
RISC classic pipeline, C-7, C-10
shared-memory workloads, 372
simple MIPS implementation, C-32 to C-33
simple RISC implementation, C-6
structural hazards, C-13 to C-14
vector architectures, G-10
Memory addressing
ALU immediate operands, A-12
basic considerations, A-11 to A-13
compiler-based speculation, H-32
displacement values, A-12
immediate value distribution, A-13
interpretation, A-7 to A-8
ISA, 11
vector architectures, G-10
Memory banks See also Banked memory
gather-scatter, 280
multiprocessor architecture, 347
parallelism, 45
shared-memory multiprocessors, 363
strides, 279
vector load/store unit bandwidth, 276–277
vector systems, G-9 to G-11
Memory bus (M-bus)
definition, 351
Google WSC servers, 469
interconnection networks, F-88
Memory consistency
basic considerations, 392–393
cache coherence, 352
compiler optimization, 396
development of models, L-64
directory-based cache coherence protocol basics, 382
multiprocessor cache coherency, 353
relaxed consistency models, 394–395
single-chip multicore processor case study, 412–418
speculation to hide latency, 396–397
Memory-constrained scaling, scientific applications on parallel processors, I-33
Memory hierarchy
address space, B-57 to B-58
basic questions, B-6 to B-12
block identification, B-7 to B-9
block placement issues, B-7
block replacement, B-9 to B-10
cache optimization
basic categories, B-22
basic optimizations, B-40
hit time reduction, B-36 to B-40
miss categories, B-23 to B-26
miss penalty reduction
via multilevel caches, B-30 to B-35
read misses vs. writes, B-35 to B-36
miss rate reduction
via associativity, B-28 to B-30
via block size, B-26 to B-28
via cache size, B-28
pipelined cache access, 82
cache performance, B-3 to B-6
average memory access time, B-17 to B-20
basic considerations, B-16
basic equations, B-22
example calculation, B-16
out-of-order processors, B-20 to B-22
case studies, B-60 to B-67
development, L-9 to L-12
inclusion, 397–398
interconnection network protection, F-87 to F-88
levels in slow down, B-3
Opteron data cache example, B-12 to B-15, B-13
Opteron L1/L2, B-57
OS and page size, B-58
overview, B-39
Pentium vs. Opteron protection, B-57
processor examples, B-3
process protection, B-50
terminology, B-2 to B-3
virtual memory
basic considerations, B-40 to B-44, B-48 to B-49
basic questions, B-44 to B-46
fast address translation, B-46
overview, B-48
paged example, B-54 to B-57
page size selection, B-46 to B-47
segmented example, B-51 to B-54
write strategy, B-10 to B-12
WSCs, 443, 443–446, 444
Memory hierarchy design
access times, 77
Alpha 21264 floorplan, 143
ARM Cortex-A8 example, 114–117, 115–117
cache coherency, 112–113
cache optimization
case study, 131–133
compiler-controlled prefetching, 92–95
compiler optimizations, 87–90
critical word first, 86–87
energy consumption, 81
hardware instruction prefetching, 91–92, 92
multibanked caches, 85–86, 86
nonblocking caches, 83–85, 84
overview, 78–79
pipelined cache access, 82
techniques overview, 96
way prediction, 81–82
write buffer merging, 87, 88
cache performance prediction, 125–126
cache size and misses per instruction, 126
DDR2 SDRAM timing diagram, 139
highly parallel memory systems, 133–136
high memory bandwidth, 126
instruction miss benchmarks, 127
instruction simulation, 126
Intel Core i7, 117–124, 119, 123–125
Intel Core i7 three-level cache hierarchy, 118
Intel Core i7 TLB structure, 118
Intel 80x86 virtualization issues, 128
memory basics, 74–78
overview, 72–74
protection and ISA, 112
server vs. PMD, 72
system call virtualization/paravirtualization performance, 141
virtual machine monitor, 108–109
Virtual Machines ISA support, 109–110
Virtual Machines protection, 107–108
Virtual Machines and virtual memory and I/O, 110–111
virtual memory protection, 105–107
VMM on nonvirtualizable ISA, 128–129
Xen VM example, 111
Memory Interface Unit
NVIDIA GPU ISA, 300
vector processor example, 310
Memoryless, definition, D-28
Memory mapping
memory hierarchy, B-48 to B-49
segmented virtual memory, B-52
TLBs, 323
virtual memory definition, B-42
Memory-memory instruction set architecture, ISA classification, A-3, A-5
Memory protection
control dependence, 155
Pentium vs. Opteron, B-57
processes, B-50
safe calls, B-54
segmented virtual memory example, B-51 to B-54
virtual memory, B-41
Memory stall cycles
average memory access time, B-17
definition, B-4 to B-5
miss rate calculation, B-6
out-of-order processors, B-20 to B-21
performance equations, B-22
Memory system
cache optimization, B-36
coherency, 352–353
commercial workloads, 367, 369–371
computer architecture, 15
C program evaluation, 134–135
dependability enhancement, 104–105
distributed shared-memory, 379, 418
gather-scatter, 280
GDRAMs, 323
GPUs, 332
ILP, 245
hardware vs. software speculation, 221–222
speculative execution, 222–223
Intel Core i7, 237, 242
latency, B-21
MIPS, C-33
multiprocessor architecture, 347
multiprocessor cache coherence, 352
multiprogramming workload, 377–378
page size changes, B-58
price/performance/power considerations, 53
RISC, C-7
Roofline model, 286
shared-memory multiprocessors, 363
SMT, 399–400
stride handling, 279
T1 multithreading unicore performance, 227
vector architectures, G-9 to G-11
vector chaining, G-11
vector processors, 271, 277
virtual, B-43, B-46
Memory technology basics
DRAM, 98, 98–100, 99
DRAM and DIMM characteristics, 101
DRAM performance, 100–102
Flash memory, 102–104
overview, 96–97
performance trends, 20
SDRAM power consumption, 102, 103
SRAM, 97–98
Mesh interface unit (MIU), Intel SCCC, F-70
Mesh network
characteristics, F-73
deadlock, F-47
dimension-order routing, F-47 to F-48
OCN history, F-104
routing example, F-46
Mesh topology
characteristics, F-36
direct networks, F-37
NEWS communication, F-42 to F-43
Message ID, packet header, F-8, F-16
Message-passing communication
historical background, L-60 to L-61
large-scale multiprocessors, I-5 to I-6
Message Passing Interface (MPI)
function, F-8
InfiniBand, F-77
lack in shared-memory multiprocessors, I-5
Messages
adaptive routing, F-93 to F-94
coherence maintenance, 381
InfiniBand, F-76
interconnection networks, F-6 to F-9
zero-copy protocols, F-91
Microarchitecture
as architecture component, 15–16
ARM Cortex-A8, 241
Cray X1, G-21 to G-22
data hazards, 168
ILP exploitation, 197
Intel Core i7, 236–237
Nehalem, 411
OCNs, F-3
out-of-order example, 253
PTX vs. x86, 298
techniques case study, 247–254
Microbenchmarks
disk array deconstruction, D-51 to D-55
disk deconstruction, D-48 to D-51
Microfusion, Intel Core i7 micro-op buffer, 238
Microinstructions
complications, C-50 to C-51
x86, 298
Micro-ops
Intel Core i7, 237, 238–240, 239
processor clock rates, 244
Microprocessor overview
clock rate trends, 24
cost trends, 27–28
desktop computers, 6
embedded computers, 8–9
energy and power, 23–26
inside disks, D-4
integrated circuit improvements, 2
and Moore’s law, 3–4
performance trends, 19–20, 20
power and energy system trends, 21–23
recent advances, L-33 to L-34
technology trends, 18
Microprocessor without Interlocked Pipeline Stages See MIPS (Microprocessor without Interlocked Pipeline Stages)
Microsoft
cloud computing, 455
containers, L-74
Intel support, 245
WSCs, 464–465
Microsoft Azure, 456, L-74
Microsoft DirectX, L-51 to L-52
Microsoft Windows
benchmarks, 38
multithreading, 223
RAID benchmarks, D-22, D-22 to D-23
time/volume/commoditization impact, 28
WSC workloads, 441
Microsoft Windows 2008 Server
real-world considerations, 52–55
SPECpower benchmark, 463
Microsoft XBox, L-51
Migration, cache coherent multiprocessors, 354
Millions of floating-point operations per second (MFLOPS)
early performance measures, L-7
parallel processing debates, L-57 to L-58
SIMD computer history, L-55
SIMD supercomputer development, L-43
vector performance measures, G-15 to G-16
MIMD (Multiple Instruction Streams, Multiple Data Streams)
and Amdahl’s law, 406–407
definition, 10
early computers, L-56
first vector computers, L-46, L-48
GPU programming, 289
GPUs vs. vector architectures, 310
with Multimedia SIMD, vs. GPU, 324–330
multiprocessor architecture, 346–348
speedup via parallelism, 263
TLP, basic considerations, 344–345
Minicomputers, replacement by microprocessors, 3–4
Minnespec benchmarks
ARM Cortex-A8, 116, 235
ARM Cortex-A8 memory, 115–116
MIPS (Microprocessor without Interlocked Pipeline Stages)
addressing modes, 11–12
basic pipeline, C-34 to C-36
branch predictor correlation, 163
cache performance, B-6
conditional branches, K-11
conditional instructions, H-27
control flow instructions, 14
data dependences, 151
data hazards, 169
dynamic scheduling with Tomasulo’s algorithm, 171, 173
early pipelined CPUs, L-26
embedded systems, E-15
encoding, 14
exceptions, C-48, C-48 to C-49
exception stopping/restarting, C-46 to C-47
features, K-44
FP pipeline performance, C-60 to C-61, C-62
FP unit with Tomasulo’s algorithm, 173
hazard checks, C-71
ILP, 149
ILP exposure, 157–158
ILP hardware model, 215
instruction execution issues, K-81
instruction formats, core instructions, K-6
instruction set complications, C-49 to C-51
ISA class, 11
ISA example
addressing modes for data transfer, A-34
arithmetic/logical instructions, A-37
basic considerations, A-32 to A-33
control flow instructions, A-37 to A-38, A-38
data types, A-34
dynamic instruction mix, A-41, A-41 to A-42, A-42
FP operations, A-38 to A-39
instruction format, A-35
load-store instructions, A-36
MIPS operations, A-35 to A-37
registers, A-34
usage, A-39
Livermore Fortran kernel performance, 331
memory addressing, 11
multicycle operations
basic considerations, C-51 to C-54
hazards and forwarding, C-54 to C-58
precise exceptions, C-58 to C-60
multimedia support, K-19
multiple-issue processor history, L-29
operands, 12
performance measurement history, L-6 to L-7
pipeline branch issues, C-39 to C-42
pipeline control, C-36 to C-39
pipe stage, C-37
processor performance calculations, 218–219
registers and usage conventions, 12
RISC code size, A-23
RISC history, L-19
RISC instruction set lineage, K-43
as RISC systems, K-4
scoreboard components, C-76
scoreboarding, C-72
scoreboarding steps, C-73, C-73 to C-74
simple implementation, C-31 to C-34, C-34
Sony PlayStation 2 Emotion Engine, E-17
unaligned word read instructions, K-26
unpipelined functional units, C-52
vs. VAX, K-65 to K-66, K-75, K-82
write strategy, B-10
MIPS16
addressing modes, K-6
arithmetic/logical instructions, K-24
characteristics, K-4
constant extension, K-9
data transfer instructions, K-23
embedded instruction format, K-8
instructions, K-14 to K-16
multiply-accumulate, K-20
RISC code size, A-23
unique instructions, K-40 to K-42
MIPS32, vs. VAX sort, K-80
MIPS64
addressing modes, K-5
arithmetic/logical instructions, K-11
conditional branches, K-17
constant extension, K-9
conventions, K-13
data transfer instructions, K-10
FP instructions, K-23
instruction list, K-26 to K-27
instruction set architecture formats, 14
instruction subset, 13, A-40
in MIPS R4000, C-61
nonaligned data transfers, K-24 to K-26
RISC instruction set, C-4
MIPS2000, instruction benchmarks, K-82
MIPS 3010, chip layout, J-59
MIPS core
compare and conditional branch, K-9 to K-16
equivalent RISC instructions
arithmetic/logical, K-11
arithmetic/logical instructions, K-15
common extensions, K-19 to K-24
control instructions, K-12, K-16
conventions, K-16
data transfers, K-10
embedded RISC data transfers, K-14
FP instructions, K-13
instruction formats, K-9
MIPS M2000, L-21, L-21
MIPS MDMX
characteristics, K-18
multimedia support, K-18
MIPS R2000, L-20
MIPS R3000
integer arithmetic, J-12
integer overflow, J-11
MIPS R3010
arithmetic functions, J-58 to J-61
chip comparison, J-58
floating-point exceptions, J-35
MIPS R4000
early pipelined CPUs, L-27
FP pipeline, C-65 to C-67, C-66
integer pipeline, C-63
pipeline overview, C-61 to C-65
pipeline performance, C-67 to C-70
pipeline structure, C-62 to C-63
MIPS R8000, precise exceptions, C-59
MIPS R10000, 81
latency hiding, 397
precise exceptions, C-59
Misalignment, memory address interpretation, A-7 to A-8, A-8
Misprediction rate
branch-prediction buffers, C-29
predictors on SPEC89, 166
profile-based predictor, C-27
SPECCPU2006 benchmarks, 167
Mispredictions
ARM Cortex-A8, 232, 235
branch predictors, 164–167, 240, C-28
branch-target buffers, 205
hardware-based speculation, 190
hardware vs. software speculation, 221
integer vs. FP programs, 212
Intel Core i7, 237
prediction buffers, C-29
static branch prediction, C-26 to C-27
Misses per instruction
application/OS statistics, B-59
cache performance, B-5 to B-6
cache protocols, 359
cache size effect, 126
L3 cache block size, 371
memory hierarchy basics, 75
performance impact calculations, B-18
shared-memory workloads, 372
SPEC benchmarks, 127
strided access-TLB interactions, 323
Miss penalty
average memory access time, B-16 to B-17
cache optimization, 79, B-35 to B-36
cache performance, B-4, B-21
compiler-controlled prefetching, 92–95
critical word first, 86–87
hardware prefetching, 91–92
ILP speculative execution, 223
memory hierarchy basics, 75–76
nonblocking cache, 83
out-of-order processors, B-20 to B-22
processor performance calculations, 218–219
reduction via multilevel caches, B-30 to B-35
write buffer merging, 87
Miss rate
AMD Opteron data cache, B-15
ARM Cortex-A8, 116
average memory access time, B-16 to B-17, B-29
basic categories, B-23
vs. block size, B-27
cache optimization, 79
and associativity, B-28 to B-30
and block size, B-26 to B-28
and cache size, B-28
cache performance, B-4
and cache size, B-24 to B-25
compiler-controlled prefetching, 92–95
compiler optimizations, 87–90
early IBM computers, L-10 to L-11
example calculations, B-6, B-31 to B-32
hardware prefetching, 91–92
Intel Core i7, 123, 125, 241
memory hierarchy basics, 75–76
multilevel caches, B-33
processor performance calculations, 218–219
scientific workloads
distributed-memory multiprocessors, I-28 to I-30
symmetric shared-memory multiprocessors, I-22, I-23 to I-25
shared-memory multiprogramming workload, 376, 376–377
shared-memory workload, 370–373
single vs. multiple thread executions, 228
Sun T1 multithreading unicore performance, 228
vs. virtual addressed cache size, B-37
MIT Raw, characteristics, F-73
Mitsubishi M32R
addressing modes, K-6
arithmetic/logical instructions, K-24
characteristics, K-4
condition codes, K-14
constant extension, K-9
data transfer instructions, K-23
embedded instruction format, K-8
multiply-accumulate, K-20
unique instructions, K-39 to K-40
Mixed cache
AMD Opteron example, B-15
commercial workload, 373
Mixer, radio receiver, E-23
Miya, Eugene, L-65
M/M/1 model
example, D-32, D-32 to D-33
overview, D-30
RAID performance prediction, D-57
sample calculations, D-33
M/M/2 model, RAID performance prediction, D-57
Mobile clients
data usage, 3
GPU features, 324
vs. server GPUs, 323–330
Modified-Exclusive-Shared-Invalid (MESI) protocol, characteristics, 362
Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol, characteristics, 362
Modified state
coherence protocol, 362
directory-based cache coherence protocol basics, 380
large-scale multiprocessor cache coherence, I-35
snooping coherence protocol, 358–359
Modula-3, integer division/remainder, J-12
Module availability, definition, 34
Module reliability, definition, 34
Moore’s law
DRAM, 100
flawed architectures, A-45
interconnection networks, F-70
and microprocessor dominance, 3–4
point-to-point links and switches, D-34
RISC, A-3
RISC history, L-22
software importance, 55
switch size, F-29
technology trends, 17
Mortar shot graphs, multiprocessor performance measurement, 405–406
Motion JPEG encoder, Sanyo VPC-SX500 digital camera, E-19
Motorola 68000
characteristics, K-42
memory protection, L-10
Motorola 68882, floating-point precisions, J-33
Move address, VAX, K-70
MPEG
Multimedia SIMD Extensions history, L-49
multimedia support, K-17
Sanyo VPC-SX500 digital camera, E-19
Sony PlayStation 2 Emotion Engine, E-17
Multibanked caches
cache optimization, 85–86
example, 86
Multichip modules, OCNs, F-3
Multicomputers
cluster history, L-63
definition, 345, L-59
historical background, L-64 to L-65
Multicore processors
architecture goals/requirements, 15
cache coherence, 361–362
centralized shared-memory multiprocessor structure, 347
Cray X1E, G-24
directory-based cache coherence, 380
directory-based coherence, 381, 419
DSM architecture, 348, 379
multichip
cache and memory states, 419
with DSM, 419
multiprocessors, 345
OCN history, F-104
performance, 400–401, 401
performance gains, 398–400
performance milestones, 20
single-chip case study, 412–418
and SMT, 404–405
snooping cache coherence implementation, 365
SPEC benchmarks, 402
uniform memory access, 364
write invalidate protocol implementation, 356–357
Multics protection software, L-9
Multicycle operations, MIPS pipeline
basic considerations, C-51 to C-54
hazards and forwarding, C-54 to C-58
precise exceptions, C-58 to C-60
Multidimensional arrays
dependences, 318
in vector architectures, 278–279
Multiflow processor, L-30, L-32
Multigrid methods, Ocean application, I-9 to I-10
Multilevel caches
cache optimizations, B-22
centralized shared-memory architectures, 351
memory hierarchy basics, 76
memory hierarchy history, L-11
miss penalty reduction, B-30 to B-35
miss rate vs. cache size, B-33
Multimedia SIMD vs. GPU, 312
performance equations, B-22
purpose, 397
write process, B-11
Multilevel exclusion, definition, B-35
Multilevel inclusion
definition, 397, B-34
implementation, 397
memory hierarchy history, L-11
Multimedia applications
desktop processor support, E-11
GPUs, 288
ISA support, A-46
MIPS FP operations, A-39
vector architectures, 267
Multimedia Extensions (MMX)
compiler support, A-31
desktop RISCs, K-18
desktop/server RISCs, K-16 to K-19
SIMD history, 262, L-50
vs. vector architectures, 282–283
Multimedia instructions
ARM Cortex-A8, 236
compiler support, A-31 to A-32
Multimedia SIMD Extensions
basic considerations, 262, 282–284
compiler support, A-31
DLP, 322
DSPs, E-11
vs. GPUs, 312
historical background, L-49 to L-50
MIMD, vs. GPU, 324–330
parallelism classes, 10
programming, 285
Roofline visual performance model, 285–288, 287
256-bit-wide operations, 282
vs. vector, 263–264
Multimedia user interfaces, PMDs, 6
Multimode fiber, interconnection networks, F-9
Multipass array multiplier, example, J-51
Multiple Instruction Streams, Multiple Data Streams See MIMD (Multiple Instruction Streams, Multiple Data Streams)
Multiple Instruction Streams, Single Data Stream (MISD), definition, 10
Multiple-issue processors
basic VLIW approach, 193–196
with dynamic scheduling and speculation, 197–202
early development, L-28 to L-30
instruction fetch bandwidth, 202–203
integrated instruction fetch units, 207
loop unrolling, 162
microarchitectural techniques case study, 247–254
primary approaches, 194
SMT, 224, 226
with speculation, 198
Tomasulo’s algorithm, 183
Multiple lanes technique
vector instruction set, 271–273
vector performance, G-7 to G-9
vector performance calculations, G-8
Multiple paths, ILP limitation studies, 220
Multiple-precision addition, J-13
Multiply-accumulate (MAC)
DSP, E-5
embedded RISCs, K-20
TI TMS320C55 DSP, E-8
Multiply operations
chip comparison, J-61
floating point
denormals, J-20 to J-21
examples, J-19
multiplication, J-17 to J-20
precision, J-21
rounding, J-18, J-19
integer arithmetic
array multiplier, J-50
Booth recoding, J-49
even/odd array, J-52
issues, J-11
with many adders, J-50 to J-54
multipass array multiplier, J-51
n-bit unsigned integers, J-4
Radix-2, J-4 to J-7
signed-digit addition table, J-54
with single adder, J-47 to J-49, J-48
Wallace tree, J-53
integer shifting over zeros, J-45 to J-47
PA-RISC instructions, K-34 to K-35
unfinished instructions, 179
Multiprocessor basics
architectural issues and approaches, 346–348
architecture goals/requirements, 15
architecture and software development, 407–409
basic hardware primitives, 387–389
cache coherence, 352–353
coining of term, L-59
communication calculations, 350
computer categories, 10
consistency models, 395
definition, 345
early machines, L-56
embedded systems, E-14 to E-15
fallacies, 55
locks via coherence, 389–391
low-to-high-end roles, 344–345
parallel processing challenges, 349–351
for performance gains, 398–400
performance trends, 21
point-to-point example, 413
streaming Multiprocessor, 292, 307, 313–314
Multiprocessor history
bus-based coherent multiprocessors, L-59 to L-60
clusters, L-62 to L-64
early computers, L-56
large-scale multiprocessors, L-60 to L-61
parallel processing debates, L-56 to L-58
recent advances and developments, L-58 to L-60
SIMD computers, L-55 to L-56
synchronization and consistency models, L-64
virtual memory, L-64
Multiprogramming
definition, 345
multithreading, 224
performance, 36
shared-memory workload performance, 375–378, 377
shared-memory workloads, 374–375
software optimization, 408
virtual memory-based protection, 105–106, B-49
workload execution time, 375
Multistage interconnection networks (MINs)
bidirectional, F-33 to F-34
crossbar switch calculations, F-31 to F-32
vs. direct network costs, F-92
example, F-31
self-routing, F-48
system area network history, F-100 to F-101
topology, F-30 to F-31, F-38 to F-39
Multistage switch fabrics, topology, F-30
Multi-Streaming Processor (MSP)
Cray X1, G-21 to G-23, G-22, G-23 to G-24
Cray X1E, G-24
first vector computers, L-46
Multithreaded SIMD Processor
block diagram, 294
definition, 292, 309, 313–314
Fermi GPU architectural innovations, 305–308
Fermi GPU block diagram, 307
Fermi GTX 480 GPU floorplan, 295, 295–296
GPU programming, 289–290
GPUs vs. vector architectures, 310, 310–311
Grid mapping, 293
NVIDIA GPU computational structures, 291
NVIDIA GPU Memory structures, 304, 304–305
Roofline model, 326
Multithreaded vector processor
definition, 292
Fermi GPU comparison, 305
Multithreading
coarse-grained, 224–226
definition and types, 223–225
fine-grained, 224–226
GPU programming, 289
historical background, L-34 to L-35
ILP, 223–232
memory hierarchy basics, 75–76
parallel benchmarks, 231, 231–232
for performance gains, 398–400
Sun T1 effectiveness, 226–229
MVAPICH, F-77
MXP processor, components, E-14
Myrinet SAN, F-67
characteristics, F-76
cluster history, L-62 to L-63, L-73
routing algorithms, F-48
switch vs. NIC, F-86
system area network history, F-100