D

DaCapo benchmarks
ISA, 242
SMT, 230–231, 231
DASH multiprocessor, L-61
Database program speculation, via multiple branches, 211
Data cache
ARM Cortex-A8, 236
cache optimization, B-33, B-38
cache performance, B-16
GPU Memory, 306
ISA, 241
locality principle, B-60
MIPS R4000 pipeline, C-62 to C-63
multiprogramming, 374
page level write-through, B-56
RISC processor, C-7
structural hazards, C-15
TLB, B-46
Data cache miss
applications vs. OS, B-59
cache optimization, B-25
Intel Core i7, 240
Opteron, B-12 to B-15
sizes and associativities, B-10
writes, B-10
Data cache size, multiprogramming, 376–377
Datacenters
CDF, 487
containers, L-74
cooling systems, 449
layer 3 network example, 445
PUE statistics, 451
tier classifications, 491
vs. WSC costs, 455–456
WSC efficiency measurement, 450–452
vs. WSCs, 436
Data dependences
conditional instructions, H-24
data hazards, 167–168
dynamically scheduling with scoreboard, C-71
example calculations, H-3 to H-4
hazards, 153–154
ILP, 150–152
ILP hardware model, 214–215
ILP limitation studies, 220
vector execution time, 269
Data fetching
ARM Cortex-A8, 234
directory-based cache coherence protocol example, 382–383
dynamically scheduled pipelines, C-70 to C-71
ILP, instruction bandwidth
basic considerations, 202–203
branch-target buffers, 203–206
return address predictors, 206–207
MIPS R4000, C-63
snooping coherence protocols, 355–356
Data flow
control dependence, 154–156
dynamic scheduling, 168
global code scheduling, H-17
ILP limitation studies, 220
limit, L-33
Data flow execution, hardware-based speculation, 184
Datagrams, see Packets
Data hazards
ARM Cortex-A8, 235
basic considerations, C-16
definition, C-11
dependences, 152–154
dynamic scheduling, 167–176
basic concept, 168–170
examples, 176–178
Tomasulo’s algorithm, 170–176, 178–179
Tomasulo’s algorithm loop-based example, 179–181
ILP limitation studies, 220
instruction set complications, C-50 to C-51
microarchitectural techniques case study, 247–254
MIPS pipeline, C-71
stall minimization by forwarding, C-16 to C-19, C-18
stall requirements, C-19 to C-21
VMIPS, 264
Data-level parallelism (DLP)
definition, 9
GPUs
basic considerations, 288
basic PTX thread instructions, 299
conditional branching, 300–303
coprocessor relationship, 330–331
Fermi GPU architecture innovations, 305–308
Fermi GTX 480 floorplan, 295
mapping examples, 293
Multimedia SIMD comparison, 312
multithreaded SIMD Processor block diagram, 294
NVIDIA computational structures, 291–297
NVIDIA/CUDA and AMD terminology, 313–315
NVIDIA GPU ISA, 298–300
NVIDIA GPU Memory structures, 304, 304–305
programming, 288–291
SIMD thread scheduling, 297
terminology, 292
vs. vector architectures, 308–312, 310
from ILP, 4–5
Multimedia SIMD Extensions
basic considerations, 282–285
programming, 285
roofline visual performance model, 285–288, 287
and power, 322
vector architecture
basic considerations, 264
gather/scatter operations, 279–280
multidimensional arrays, 278–279
multiple lanes, 271–273
peak performance vs. start-up overhead, 331
programming, 280–282
vector execution time, 268–271
vector-length registers, 274–275
vector load-store unit bandwidth, 276–277
vector-mask registers, 275–276
vector processor example, 267–268
VMIPS, 264–267
vector kernel implementation, 334–336
vector performance and memory bandwidth, 332
vector vs. scalar performance, 331–332
WSCs vs. servers, 433–434
Data link layer
definition, F-82
interconnection networks, F-10
Data parallelism, SIMD computer history, L-55
Data-race-free, synchronized programs, 394
Data races, synchronized programs, 394
Data transfers
cache miss rate calculations, B-16
computer architecture, 15
desktop RISC instructions, K-10, K-21
embedded RISCs, K-14, K-23
gather-scatter, 281, 291
instruction operators, A-15
Intel 80x86, K-49, K-53 to K-54
ISA, 12–13
MIPS, addressing modes, A-34
MIPS64, K-24 to K-26
MIPS64 instruction subset, A-40
MIPS64 ISA formats, 14
MIPS core extensions, K-20
MIPS operations, A-36 to A-37
MMX, 283
multimedia instruction compiler support, A-31
operands, A-12
PTX, 305
SIMD extensions, 284
“typical” programs, A-43
VAX, B-73
vector vs. GPU, 300
Data trunks, MIPS scoreboarding, C-75
Data types
architect-compiler writer relationship, A-30
dependence analysis, H-10
desktop computing, A-2
Intel 80x86, K-50
MIPS, A-34, A-36
MIPS64 architecture, A-34
multimedia compiler support, A-31
operand types/sizes, A-14 to A-15
SIMD Multimedia Extensions, 282–283
SPARC, K-31
VAX, K-66, K-70
Dauber, Phil, L-28
DAXPY loop
chained convoys, G-16
on enhanced VMIPS, G-19 to G-21
memory bandwidth, 332
MIPS/VMIPS calculations, 267–268
peak performance vs. start-up overhead, 331
vector performance measures, G-16
VLRs, 274–275
on VMIPS, G-19 to G-20
VMIPS calculations, G-18
VMIPS on Linpack, G-18
VMIPS peak performance, G-17
D-caches
case study examples, B-63
way prediction, 81–82
Deadlock
cache coherence, 361
dimension-order routing, F-47 to F-48
directory protocols, 386
Intel SCCC, F-70
large-scale multiprocessor cache coherence, I-34 to I-35, I-38 to I-40
mesh network routing, F-46
network routing, F-44
routing comparison, F-54
synchronization, 388
system area network history, F-101
Deadlock avoidance
meshes and hypercubes, F-47
routing, F-44 to F-45
Deadlock recovery, routing, F-45
Dead time
vector pipeline, G-8
vector processor, G-8
Decimal operands, formats, A-14
Decimal operations, PA-RISC instructions, K-35
Decision support system (DSS), shared-memory workloads, 368–369, 369, 369–370
Decoder, radio receiver, E-23
Decode stage, TI 320C55 DSP, E-7
DEC PDP-11, address space, B-57 to B-58
DECstation 5000, reboot measurements, F-69
DEC VAX
addressing modes, A-10 to A-11, A-11, K-66 to K-68
address space, B-58
architect-compiler writer relationship, A-30
branch conditions, A-19
branches, A-18
jumps, procedure calls, K-71 to K-72
bubble sort, K-76
characteristics, K-42
cluster history, L-62, L-72
compiler writing-architecture relationship, A-30
control flow instruction branches, A-18
data types, K-66
early computer arithmetic, J-63 to J-64
early pipelined CPUs, L-26
exceptions, C-44
extensive pipelining, C-81
failures, D-15
flawless architecture design, A-45, K-81
high-level instruction set, A-41 to A-43
high-level language computer architecture, L-18 to L-19
history, 2–3
immediate value distribution, A-13
instruction classes, B-73
instruction encoding, K-68 to K-70, K-69
instruction execution issues, K-81
instruction operator categories, A-15
instruction set complications, C-49 to C-50
integer overflow, J-11
vs. MIPS, K-82
vs. MIPS32 sort, K-80
vs. MIPS code, K-75
miss rate vs. virtual addressing, B-37
operands, K-66 to K-68
operand specifiers, K-68
operands per ALU, A-6, A-8
operand types/sizes, A-14
operation count, K-70 to K-71
operations, K-70 to K-72
operators, A-15
overview, K-65 to K-66
precise exceptions, C-59
replacement by RISC, 2
RISC history, L-20 to L-21
RISC instruction set lineage, K-43
sort, K-76 to K-79
sort code, K-77 to K-79
sort register allocation, K-76
swap, K-72 to K-76
swap code, B-74, K-72, K-74
swap full procedure, K-75 to K-76
swap and register preservation, B-74 to B-75
unique instructions, K-28
DEC VAX-11/780, L-6 to L-7, L-11, L-18
DEC VAX 8700
vs. MIPS M2000, K-82, L-21
RISC history, L-21
Dedicated link network
black box network, F-5 to F-6
effective bandwidth, F-17
example, F-6
Defect tolerance, chip fabrication cost case study, 61–62
Deferred addressing, VAX, K-67
Delayed branch
basic scheme, C-23
compiler history, L-31
instructions, K-25
stalls, C-65
Dell PowerEdge servers, prices, 53
Dell PowerEdge Thunderbird, SAN characteristics, F-76
Dell servers
economies of scale, 456
real-world considerations, 52–55
WSC services, 441
Demodulator, radio receiver, E-23
Denormals, J-14 to J-16, J-20 to J-21
floating-point additions, J-26 to J-27
floating-point underflow, J-36
Dense matrix multiplication, LU kernel, I-8
Density-optimized processors, vs. SPEC-optimized, F-85
Dependability
benchmark examples, D-21 to D-23, D-22
definition, D-10 to D-11
disk operators, D-13 to D-15
integrated circuits, 33–36
Internet Archive Cluster, D-38 to D-40
memory systems, 104–105
WSC goals/requirements, 433
WSC memory, 473–474
WSC storage, 442–443
Dependence analysis
basic approach, H-5
example calculations, H-7
limitations, H-8 to H-9
Dependence distance, loop-carried dependences, H-6
Dependences
antidependences, 152, 320, C-72, C-79
CUDA, 290
as data dependence, 150
data hazards, 167–168
definition, 152–153, 315–316
dynamically scheduled pipelines, C-70 to C-71
dynamically scheduling with scoreboard, C-71
dynamic scheduling with Tomasulo’s algorithm, 172
hardware-based speculation, 183
hazards, 153–154
ILP, 150–156
ILP hardware model, 214–215
ILP limitation studies, 220
loop-level parallelism, 318–322, H-3
dependence analysis, H-6 to H-10
MIPS scoreboarding, C-79
as program properties, 152
sparse matrices, G-13
and Tomasulo’s algorithm, 170
types, 150
vector execution time, 269
vector mask registers, 275–276
VMIPS, 268
Dependent computations, elimination, H-10 to H-12
Descriptor privilege level (DPL), segmented virtual memory, B-53
Descriptor table, IA-32, B-52
Design faults, storage systems, D-11
Desktop computers
characteristics, 6
compiler structure, A-24
as computer class, 5
interconnection networks, F-85
memory hierarchy basics, 78
multimedia support, E-11
multiprocessor importance, 344
performance benchmarks, 38–40
processor comparison, 242
RAID history, L-80
RISC systems
addressing modes, K-5
addressing modes and instruction formats, K-5 to K-6
arithmetic/logical instructions, K-22
conditional branches, K-17
constant extension, K-9
control instructions, K-12
conventions, K-13
data transfer instructions, K-10, K-21
examples, K-3, K-4
features, K-44
FP instructions, K-13, K-23
instruction formats, K-7
multimedia extensions, K-16 to K-19, K-18
system characteristics, E-4
Destination offset, IA-32 segment, B-53
Deterministic routing algorithm
vs. adaptive routing, F-52 to F-55, F-54
DOR, F-46
Dies
embedded systems, E-15
integrated circuits, 28–30, 29
Nehalem floorplan, 30
wafer example, 31, 31–32
Die yield, basic equation, 30–31
Digital Alpha
branches, A-18
conditional instructions, H-27
early pipelined CPUs, L-27
RISC history, L-21
RISC instruction set lineage, K-43
synchronization history, L-64
Digital Alpha 21064, L-48
Digital Alpha 21264
cache hierarchy, 368
floorplan, 143
Digital Alpha MAX
characteristics, K-18
multimedia support, K-18
Digital Alpha processors
addressing modes, K-5
arithmetic/logical instructions, K-11
branches, K-21
conditional branches, K-12, K-17
constant extension, K-9
control flow instruction branches, A-18
conventions, K-13
data transfer instructions, K-10
displacement addressing mode, A-12
exception stopping/restarting, C-47
FP instructions, K-23
immediate value distribution, A-13
MAX, multimedia support, E-11
MIPS precise exceptions, C-59
multimedia support, K-19
recent advances, L-33
as RISC systems, K-4
shared-memory workload, 367–369
unique instructions, K-27 to K-29
Digital Linear Tape, L-77
Digital signal processor (DSP)
cell phones, E-23, E-23, E-23 to E-24
definition, E-3
desktop multimedia support, E-11
embedded RISC extensions, K-19
examples and characteristics, E-6
media extensions, E-10 to E-11
overview, E-5 to E-7
saturating operations, K-18 to K-19
TI TMS320C6x, E-8 to E-10
TI TMS320C6x instruction packet, E-10
TI TMS320C55, E-6 to E-7, E-7 to E-8
TI TMS320C64x, E-9
Dimension-order routing (DOR), definition, F-46
Direct attached disks, definition, D-35
Direct-mapped cache
address parts, B-9
address translation, B-38
block placement, B-7
early work, L-10
memory hierarchy basics, 74
memory hierarchy, B-48
optimization, 79–80
Direct memory access (DMA)
historical background, L-81
InfiniBand, F-76
network interface functions, F-7
Sanyo VPC-SX500 digital camera, E-19
Sony PlayStation 2 Emotion Engine, E-18
TI TMS320C55 DSP, E-8
zero-copy protocols, F-91
Direct networks
commercial system topologies, F-37
vs. high-dimensional networks, F-92
vs. MIN costs, F-92
topology, F-34 to F-40
Directory-based cache coherence
advanced directory protocol case study, 420–426
basic considerations, 378–380
case study, 418–420
definition, 354
distributed-memory multiprocessor, 380
large-scale multiprocessor history, L-61
latencies, 425
protocol basics, 380–382
protocol example, 382–386
state transition diagram, 383
Directory-based multiprocessor
characteristics, I-31
performance, I-26
scientific workloads, I-29
synchronization, I-16, I-19 to I-20
Directory controller, cache coherence, I-40 to I-41
Dirty bit
case study, D-61 to D-64
definition, B-11
virtual memory fast address translation, B-46
Dirty block
definition, B-11
read misses, B-36
Discrete cosine transform, DSP, E-5
Disk arrays
deconstruction case study, D-51 to D-55, D-52 to D-55
RAID 6, D-8 to D-9
RAID 10, D-8
RAID levels, D-6 to D-8, D-7
Disk layout, RAID performance prediction, D-57 to D-59
Disk power, basic considerations, D-5
Disk storage
access time gap, D-3
areal density, D-2 to D-5
cylinders, D-5
deconstruction case study, D-48 to D-51, D-50
DRAM/magnetic disk cost vs. access time, D-3
intelligent interfaces, D-4
internal microprocessors, D-4
real faults and failures, D-10 to D-11
throughput vs. command queue depth, D-4
Disk technology
failure rate calculation, 48
Google WSC servers, 469
performance trends, 19–20, 20
WSC Flash memory, 474–475
Dispatch stage
instruction steps, 174
microarchitectural techniques case study, 247–254
Displacement addressing mode
basic considerations, A-10
MIPS, 12
MIPS data transfers, A-34
MIPS instruction format, A-35
value distributions, A-12
VAX, K-67
Display lists, Sony PlayStation 2 Emotion Engine, E-17
Distributed routing, basic concept, F-48
Distributed shared memory (DSM)
basic considerations, 378–380
basic structure, 347–348, 348
characteristics, I-45
directory-based cache coherence, 354, 380, 418–420
multichip multicore multiprocessor, 419
snooping coherence protocols, 355
Distributed shared-memory multiprocessors
cache coherence implementation, I-36 to I-37
scientific application performance, I-26 to I-32, I-28 to I-32
Distributed switched networks, topology, F-34 to F-40
Divide operations
chip comparison, J-60 to J-61
floating-point, stall, C-68
floating-point iterative, J-27 to J-31
integers, speedup
radix-2 division, J-55
radix-4 division, J-56
radix-4 SRT division, J-57
with single adder, J-54 to J-58
integer shifting over zeros, J-45 to J-47
language comparison, J-12
n-bit unsigned integers, J-4
PA-RISC instructions, K-34 to K-35
Radix-2, J-4 to J-7
restoring/nonrestoring, J-6
SRT division, J-45 to J-47, J-46
unfinished instructions, 179
DLX
integer arithmetic, J-12
vs. Intel 80x86 operations, K-62, K-63 to K-64
Double data rate (DDR)
ARM Cortex-A8, 117
DRAM performance, 100
DRAMs and DIMMs, 101
Google WSC servers, 468–469
IBM Blue Gene/L, I-43
InfiniBand, F-77
Intel Core i7, 121
SDRAMs, 101
Double data rate 2 (DDR2), SDRAM timing diagram, 139
Double data rate 3 (DDR3)
DRAM internal organization, 98
GDRAM, 102
Intel Core i7, 118
SDRAM power consumption, 102, 103
Double data rate 4 (DDR4), DRAM, 99
Double data rate 5 (DDR5), GDRAM, 102
Double-extended floating-point arithmetic, J-33 to J-34
Double failures, RAID reconstruction, D-55 to D-57
Double-precision floating point
add-divide, C-68
AVX for x86, 284
chip comparison, J-58
data access benchmarks, A-15
DSP media extensions, E-10 to E-11
Fermi GPU architecture, 306
floating-point pipeline, C-65
GTX 280, 325, 328–330
IBM 360, 171
MIPS data transfers, A-34
MIPS registers, 12, A-34
Multimedia SIMD vs. GPUs, 312
operand sizes/types, 12
as operand type, A-13 to A-14
operand usage, 297
pipeline timing, C-54
Roofline model, 287, 326
SIMD Extensions, 283
VMIPS, 266, 266–267
Double rounding
FP precisions, J-34
FP underflow, J-37
Double words
aligned/misaligned addresses, A-8
data access benchmarks, A-15
Intel 80x86, K-50
memory address interpretation, A-7 to A-8
MIPS data types, A-34
operand types/sizes, 12, A-14
stride, 278
DRDRAM, Sony PlayStation 2, E-16 to E-17
Driver domains, Xen VM, 111
Dual inline memory modules (DIMMs)
clock rates, bandwidth, names, 101
DRAM basics, 99
Google WSC server, 467
Google WSC servers, 468–469
graphics memory, 322–323
Intel Core i7, 118, 121
Intel SCCC, F-70
SDRAMs, 101
WSC memory, 473–474
Dual SIMD Thread Scheduler, example, 305–306
Dynamically allocatable multi-queues (DAMQs), switch microarchitecture, F-56 to F-57
Dynamically scheduled pipelines
basic considerations, C-70 to C-71
with scoreboard, C-71 to C-80
Dynamically shared libraries, control flow instruction addressing modes, A-18
Dynamic energy, definition, 23
Dynamic network reconfiguration, fault tolerance, F-67 to F-68
Dynamic power
energy efficiency, 211
microprocessors, 23
vs. static power, 26
Dynamic random-access memory (DRAM)
bandwidth issues, 322–323
characteristics, 98–100
clock rates, bandwidth, names, 101
cost vs. access time, D-3
cost trends, 27
Cray X1, G-22
CUDA, 290
dependability, 104
disk storage, D-3 to D-4
embedded benchmarks, E-13
errors and faults, D-11
first vector computers, L-45, L-47
Flash memory, 103–104
Google WSC servers, 468–469
GPU SIMD instructions, 296
IBM Blue Gene/L, I-43 to I-44
improvement over time, 17
integrated circuit costs, 28
Intel Core i7, 121
internal organization, 98
magnetic storage history, L-78
memory hierarchy design, 73, 73
memory performance, 100–102
multibanked caches, 86
NVIDIA GPU Memory structures, 305
performance milestones, 20
power consumption, 63
real-world server considerations, 52–55
Roofline model, 286
server energy savings, 25
Sony PlayStation 2, E-16, E-17
speed trends, 99
technology trends, 17
vector memory systems, G-9
vector processor, G-25
WSC efficiency measurement, 450
WSC memory costs, 473–474
WSC memory hierarchy, 444–445
WSC power modes, 472
yield, 32
Dynamic scheduling
first use, L-27
ILP
basic concept, 168–169
definition, 168
example and algorithms, 176–178
with multiple issue and speculation, 197–202
overcoming data hazards, 167–176
Tomasulo’s algorithm, 170–176, 178–179, 181–183
MIPS scoreboarding, C-79
SMT on superscalar processors, 230
and unoptimized code, C-81
Dynamic voltage-frequency scaling (DVFS)
energy efficiency, 25
Google WSC, 467
processor performance equation, 52
Dynamo (Amazon), 438, 452