D
DASH multiprocessor, L-61
Database program speculation, via multiple branches,
211
Data cache
page level write-through,
B-56
Data cache miss
applications vs. OS,
B-59
sizes and associativities,
B-10
Data cache size, multiprogramming,
376–377
Datacenters
layer 3 network example,
445
tier classifications,
491
WSC efficiency measurement,
450–452
Data dependences
conditional instructions, H-24
dynamically scheduling with scoreboard,
C-71
example calculations, H-3 to H-4
ILP limitation studies,
220
vector execution time,
269
Data fetching
directory-based cache coherence protocol example,
382–383
ILP, instruction bandwidth
snooping coherence protocols,
355–356
Data flow
global code scheduling, H-17
ILP limitation studies,
220
Data flow execution, hardware-based speculation,
184
Data hazards
basic considerations,
C-16
dynamic scheduling,
167–176
Tomasulo’s algorithm loop-based example,
179–181
ILP limitation studies,
220
microarchitectural techniques case study,
247–254
Data-level parallelism (DLP)
GPUs
basic considerations,
288
basic PTX thread instructions,
299
Fermi GPU architecture innovations,
305–308
Fermi GTX 480 floorplan,
295
Multimedia SIMD comparison,
312
multithreaded SIMD Processor block diagram,
294
NVIDIA computational structures,
291–297
NVIDIA/CUDA and AMD terminology,
313–315
SIMD thread scheduling,
297
Multimedia SIMD Extensions
vector architecture
basic considerations,
264
peak performance vs. start-up overhead,
331
vector load-store unit bandwidth,
276–277
vector kernel implementation,
334–336
vector performance and memory bandwidth,
332
vector vs. scalar performance,
331–332
Data link layer
interconnection networks, F-10
Data parallelism, SIMD computer history, L-55
Data-race-free, synchronized programs,
394
Data races, synchronized programs,
394
Data transfers
cache miss rate calculations,
B-16
computer architecture,
15
desktop RISC instructions,
K-10,
K-21
embedded RISCs,
K-14,
K-23
instruction operators,
A-15
Intel 80x86,
K-49,
K-53 to K-54
MIPS, addressing modes,
A-34
MIPS64 instruction subset,
A-40
MIPS core extensions, K-20
multimedia instruction compiler support,
A-31
Data trunks, MIPS scoreboarding,
C-75
Data types
architect-compiler writer relationship,
A-30
dependence analysis, H-10
MIPS64 architecture,
A-34
multimedia compiler support,
A-31
SIMD Multimedia Extensions,
282–283
DAXPY loop
on enhanced VMIPS, G-19 to G-21
peak performance vs. start-up overhead,
331
vector performance measures, G-16
VMIPS peak performance, G-17
D-caches
case study examples,
B-63
Deadlock
dimension-order routing, F-47 to F-48
large-scale multiprocessor cache coherence, I-34 to I-35, I-38 to I-40
mesh network routing,
F-46
system area network history, F-101
Deadlock avoidance
meshes and hypercubes, F-47
Deadlock recovery, routing, F-45
Decimal operands, formats,
A-14
Decimal operations, PA-RISC instructions, K-35
Decoder, radio receiver,
E-23
Decode stage, TI 320C55 DSP, E-7
DECstation 5000, reboot measurements,
F-69
DEC VAX
architect-compiler writer relationship,
A-30
branches,
A-18
jumps, procedure calls, K-71 to K-72
cluster history, L-62, L-72
compiler writing-architecture relationship,
A-30
control flow instruction branches,
A-18
early computer arithmetic, J-63 to J-64
early pipelined CPUs, L-26
extensive pipelining,
C-81
high-level language computer architecture, L-18 to L-19
immediate value distribution,
A-13
instruction classes,
B-73
instruction encoding, K-68 to K-70,
K-69
instruction execution issues, K-81
instruction operator categories,
A-15
miss rate vs. virtual addressing,
B-37
operand types/sizes,
A-14
operation count, K-70 to K-71
RISC history, L-20 to L-21
RISC instruction set lineage,
K-43
sort register allocation, K-76
swap full procedure, K-75 to K-76
unique instructions, K-28
DEC VAX-11/780, L-6 to L-7, L-11, L-18
DEC VAX 8700 vs. MIPS M2000,
K-82,
L-21
Dedicated link network
black box network, F-5 to F-6
effective bandwidth, F-17
Defect tolerance, chip fabrication cost case study,
61–62
Deferred addressing, VAX, K-67
Dell Poweredge servers, prices,
53
Dell Poweredge Thunderbird, SAN characteristics,
F-76
Dell servers
real-world considerations,
52–55
Demodulator, radio receiver,
E-23
Denormals, J-14 to J-16, J-20 to J-21
floating-point additions, J-26 to J-27
floating-point underflow, J-36
Dense matrix multiplication, LU kernel, I-8
Density-optimized processors vs. SPEC-optimized, F-85
Dependability
benchmark examples, D-21 to D-23,
D-22
disk operators, D-13 to D-15
integrated circuits,
33–36
Internet Archive Cluster, D-38 to D-40
WSC goals/requirements,
433
Dependence analysis
example calculations, H-7
Dependence distance, loop-carried dependences, H-6
Dependences
dynamically scheduling with scoreboard,
C-71
dynamic scheduling with Tomasulo’s algorithm,
172
hardware-based speculation,
183
ILP limitation studies,
220
loop-level parallelism,
318–322, H-3
dependence analysis, H-6 to H-10
as program properties,
152
and Tomasulo’s algorithm,
170
vector execution time,
269
Dependent computations, elimination, H-10 to H-12
Descriptor privilege level (DPL), segmented virtual memory,
B-53
Descriptor table, IA-32,
B-52
Design faults, storage systems, D-11
Desktop computers
interconnection networks, F-85
memory hierarchy basics,
78
multiprocessor importance,
344
performance benchmarks,
38–40
processor comparison,
242
RISC systems
addressing modes and instruction formats, K-5 to K-6
arithmetic/logical instructions,
K-22
conditional branches,
K-17
control instructions,
K-12
data transfer instructions,
K-10,
K-21
FP instructions,
K-13,
K-23
multimedia extensions, K-16 to K-19,
K-18
system characteristics,
E-4
Destination offset, IA-32 segment,
B-53
Deterministic routing algorithm vs. adaptive routing, F-52 to F-55,
F-54
Die yield, basic equation,
30–31
Digital Alpha
conditional instructions, H-27
early pipelined CPUs, L-27
RISC instruction set lineage,
K-43
synchronization history, L-64
Digital Alpha 21064, L-48
Digital Alpha processors
arithmetic/logical instructions,
K-11
conditional branches, K-12,
K-17
control flow instruction branches,
A-18
data transfer instructions,
K-10
displacement addressing mode,
A-12
exception stopping/restarting,
C-47
immediate value distribution,
A-13
MAX, multimedia support,
E-11
MIPS precise exceptions,
C-59
unique instructions, K-27 to K-29
Digital Linear Tape, L-77
Digital signal processor (DSP)
cell phones, E-23,
E-23, E-23 to E-24
desktop multimedia support,
E-11
embedded RISC extensions, K-19
examples and characteristics,
E-6
media extensions, E-10 to E-11
saturating operations, K-18 to K-19
TI TMS320C6x, E-8 to E-10
TI TMS320C6x instruction packet,
E-10
TI TMS320C55, E-6 to E-7,
E-7 to E-8
Dimension-order routing (DOR), definition, F-46
Direct attached disks, definition, D-35
Direct-mapped cache
address translation,
B-38
memory hierarchy basics,
74
Direct memory access (DMA)
historical background, L-81
network interface functions, F-7
Sanyo VPC-SX500 digital camera, E-19
Sony PlayStation 2 Emotion Engine, E-18
zero-copy protocols, F-91
Direct networks
commercial system topologies,
F-37
vs. high-dimensional networks, F-92
Directory-based cache coherence
advanced directory protocol case study,
420–426
distributed-memory multiprocessor,
380
large-scale multiprocessor history, L-61
state transition diagram,
383
Directory-based multiprocessor
scientific workloads, I-29
synchronization, I-16, I-19 to I-20
Directory controller, cache coherence, I-40 to I-41
Dirty bit
virtual memory fast address translation,
B-46
Discrete cosine transform, DSP, E-5
Disk arrays
deconstruction case study, D-51 to D-55,
D-52 to D-55
RAID levels, D-6 to D-8,
D-7
Disk layout, RAID performance prediction, D-57 to D-59
Disk power, basic considerations, D-5
Disk storage
areal density, D-2 to D-5
deconstruction case study, D-48 to D-51,
D-50
DRAM/magnetic disk cost vs. access time,
D-3
intelligent interfaces, D-4
internal microprocessors, D-4
real faults and failures, D-10 to D-11
throughput vs. command queue depth,
D-4
Disk technology
failure rate calculation,
48
Dispatch stage
microarchitectural techniques case study,
247–254
Displacement addressing mode
basic considerations,
A-10
MIPS data transfers,
A-34
MIPS instruction format,
A-35
value distributions,
A-12
Display lists, Sony PlayStation 2 Emotion Engine, E-17
Distributed routing, basic concept, F-48
Distributed shared memory (DSM)
multichip multicore multiprocessor,
419
snooping coherence protocols,
355
Distributed shared-memory multiprocessors
cache coherence implementation, I-36 to I-37
scientific application performance, I-26 to I-32,
I-28 to I-32
Distributed switched networks, topology, F-34 to F-40
Divide operations
chip comparison, J-60 to J-61
floating-point, stall,
C-68
floating-point iterative, J-27 to J-31
integers, speedup
radix-4 SRT division,
J-57
with single adder, J-54 to J-58
integer shifting over zeros, J-45 to J-47
language comparison,
J-12
n-bit unsigned integers,
J-4
PA-RISC instructions, K-34 to K-35
restoring/nonrestoring,
J-6
SRT division, J-45 to J-47,
J-46
unfinished instructions,
179
DLX vs. Intel 80x86 operations, K-62,
K-63 to K-64
Double data rate 2 (DDR2), SDRAM timing diagram,
139
Double data rate 3 (DDR3)
DRAM internal organization,
98
SDRAM power consumption,
102,
103
Double data rate 4 (DDR4), DRAM,
99
Double data rate 5 (DDR5), GDRAM,
102
Double-extended floating-point arithmetic, J-33 to J-34
Double failures, RAID reconstruction, D-55 to D-57
Double-precision floating point
data access benchmarks,
A-15
DSP media extensions, E-10 to E-11
Fermi GPU architecture,
306
floating-point pipeline,
C-65
MIPS data transfers,
A-34
Multimedia SIMD vs. GPUs,
312
Double words
aligned/misaligned addresses,
A-8
data access benchmarks,
A-15
DRDRAM, Sony PlayStation 2, E-16 to E-17
Driver domains, Xen VM,
111
Dual inline memory modules (DIMMs)
clock rates, bandwidth, names,
101
Dual SIMD Thread Scheduler, example,
305–306
Dynamically allocatable multi-queues (DAMQs), switch microarchitecture, F-56 to F-57
Dynamically scheduled pipelines
Dynamically shared libraries, control flow instruction addressing modes,
A-18
Dynamic energy, definition,
23
Dynamic network reconfiguration, fault tolerance, F-67 to F-68
Dynamic random-access memory (DRAM)
clock rates, bandwidth, names,
101
cost vs. access time,
D-3
embedded benchmarks, E-13
first vector computers, L-45, L-47
GPU SIMD instructions,
296
IBM Blue Gene/L, I-43 to I-44
improvement over time,
17
integrated circuit costs,
28
internal organization,
98
magnetic storage history, L-78
memory hierarchy design,
73,
73
NVIDIA GPU Memory structures,
305
performance milestones,
20
real-world server considerations,
52–55
server energy savings,
25
Sony PlayStation 2,
E-16, E-17
vector memory systems, G-9
WSC efficiency measurement,
450
Dynamic scheduling
ILP
with multiple issue and speculation,
197–202
SMT on superscalar processors,
230
and unoptimized code,
C-81
Dynamic voltage-frequency scaling (DVFS)
processor performance equation,
52