D

DaCapo benchmarks

ISA, 242

SMT, 230–231, 231

DAMQs See Dynamically allocatable multi-queues (DAMQs)

DASH multiprocessor, L-61

Database program speculation, via multiple branches, 211

Data cache

ARM Cortex-A8, 236

cache optimization, B-33, B-38

cache performance, B-16

GPU Memory, 306

ISA, 241

locality principle, B-60

MIPS R4000 pipeline, C-62 to C-63

multiprogramming, 374

page level write-through, B-56

RISC processor, C-7

structural hazards, C-15

TLB, B-46

Data cache miss

applications vs. OS, B-59

cache optimization, B-25

Intel Core i7, 240

Opteron, B-12 to B-15

sizes and associativities, B-10

writes, B-10

Data cache size, multiprogramming, 376–377

Datacenters

CDF, 487

containers, L-74

cooling systems, 449

layer 3 network example, 445

PUE statistics, 451

tier classifications, 491

vs. WSC costs, 455–456

WSC efficiency measurement, 450–452

vs. WSCs, 436

Data dependences

conditional instructions, H-24

data hazards, 167–168

dynamically scheduling with scoreboard, C-71

example calculations, H-3 to H-4

hazards, 153–154

ILP, 150–152

ILP hardware model, 214–215

ILP limitation studies, 220

vector execution time, 269

Data fetching

ARM Cortex-A8, 234

directory-based cache coherence protocol example, 382–383

dynamically scheduled pipelines, C-70 to C-71

ILP, instruction bandwidth

basic considerations, 202–203

branch-target buffers, 203–206

return address predictors, 206–207

MIPS R4000, C-63

snooping coherence protocols, 355–356

Data flow

control dependence, 154–156

dynamic scheduling, 168

global code scheduling, H-17

ILP limitation studies, 220

limit, L-33

Data flow execution, hardware-based speculation, 184

Datagrams See Packets

Data hazards

ARM Cortex-A8, 235

basic considerations, C-16

definition, C-11

dependences, 152–154

dynamic scheduling, 167–176

basic concept, 168–170

examples, 176–178

Tomasulo’s algorithm, 170–176, 178–179

Tomasulo’s algorithm loop-based example, 179–181

ILP limitation studies, 220

instruction set complications, C-50 to C-51

microarchitectural techniques case study, 247–254

MIPS pipeline, C-71

RAW, C-57 to C-58

stall minimization by forwarding, C-16 to C-19, C-18

stall requirements, C-19 to C-21

VMIPS, 264

Data-level parallelism (DLP)

definition, 9

GPUs

basic considerations, 288

basic PTX thread instructions, 299

conditional branching, 300–303

coprocessor relationship, 330–331

Fermi GPU architecture innovations, 305–308

Fermi GTX 480 floorplan, 295

mapping examples, 293

Multimedia SIMD comparison, 312

multithreaded SIMD Processor block diagram, 294

NVIDIA computational structures, 291–297

NVIDIA/CUDA and AMD terminology, 313–315

NVIDIA GPU ISA, 298–300

NVIDIA GPU Memory structures, 304, 304–305

programming, 288–291

SIMD thread scheduling, 297

terminology, 292

vs. vector architectures, 308–312, 310

from ILP, 4–5

Multimedia SIMD Extensions

basic considerations, 282–285

programming, 285

roofline visual performance model, 285–288, 287

and power, 322

vector architecture

basic considerations, 264

gather/scatter operations, 279–280

multidimensional arrays, 278–279

multiple lanes, 271–273

peak performance vs. start-up overhead, 331

programming, 280–282

vector execution time, 268–271

vector-length registers, 274–275

vector load-store unit bandwidth, 276–277

vector-mask registers, 275–276

vector processor example, 267–268

VMIPS, 264–267

vector kernel implementation, 334–336

vector performance and memory bandwidth, 332

vector vs. scalar performance, 331–332

WSCs vs. servers, 433–434

Data link layer

definition, F-82

interconnection networks, F-10

Data parallelism, SIMD computer history, L-55

Data-race-free, synchronized programs, 394

Data races, synchronized programs, 394

Data transfers

cache miss rate calculations, B-16

computer architecture, 15

desktop RISC instructions, K-10, K-21

embedded RISCs, K-14, K-23

gather-scatter, 281, 291

instruction operators, A-15

Intel 80x86, K-49, K-53 to K-54

ISA, 12–13

MIPS, addressing modes, A-34

MIPS64, K-24 to K-26

MIPS64 instruction subset, A-40

MIPS64 ISA formats, 14

MIPS core extensions, K-20

MIPS operations, A-36 to A-37

MMX, 283

multimedia instruction compiler support, A-31

operands, A-12

PTX, 305

SIMD extensions, 284

“typical” programs, A-43

VAX, B-73

vector vs. GPU, 300

Data trunks, MIPS scoreboarding, C-75

Data types

architect-compiler writer relationship, A-30

dependence analysis, H-10

desktop computing, A-2

Intel 80x86, K-50

MIPS, A-34, A-36

MIPS64 architecture, A-34

multimedia compiler support, A-31

operand types/sizes, A-14 to A-15

SIMD Multimedia Extensions, 282–283

SPARC, K-31

VAX, K-66, K-70

Dauber, Phil, L-28

DAXPY loop

chained convoys, G-16

on enhanced VMIPS, G-19 to G-21

memory bandwidth, 332

MIPS/VMIPS calculations, 267–268

peak performance vs. start-up overhead, 331

vector performance measures, G-16

VLRs, 274–275

on VMIPS, G-19 to G-20

VMIPS calculations, G-18

VMIPS on Linpack, G-18

VMIPS peak performance, G-17

D-caches

case study examples, B-63

way prediction, 81–82

DDR See Double data rate (DDR)

Deadlock

cache coherence, 361

dimension-order routing, F-47 to F-48

directory protocols, 386

Intel SCCC, F-70

large-scale multiprocessor cache coherence, I-34 to I-35, I-38 to I-40

mesh network routing, F-46

network routing, F-44

routing comparison, F-54

synchronization, 388

system area network history, F-101

Deadlock avoidance

meshes and hypercubes, F-47

routing, F-44 to F-45

Deadlock recovery, routing, F-45

Dead time

vector pipeline, G-8

vector processor, G-8

Decimal operands, formats, A-14

Decimal operations, PA-RISC instructions, K-35

Decision support system (DSS), shared-memory workloads, 368–369, 369, 369–370

Decoder, radio receiver, E-23

Decode stage, TI 320C55 DSP, E-7

DEC PDP-11, address space, B-57 to B-58

DECstation 5000, reboot measurements, F-69

DEC VAX

addressing modes, A-10 to A-11, A-11, K-66 to K-68

address space, B-58

architect-compiler writer relationship, A-30

branch conditions, A-19

branches, A-18

jumps, procedure calls, K-71 to K-72

bubble sort, K-76

characteristics, K-42

cluster history, L-62, L-72

compiler writing-architecture relationship, A-30

control flow instruction branches, A-18

data types, K-66

early computer arithmetic, J-63 to J-64

early pipelined CPUs, L-26

exceptions, C-44

extensive pipelining, C-81

failures, D-15

flawless architecture design, A-45, K-81

high-level instruction set, A-41 to A-43

high-level language computer architecture, L-18 to L-19

history, 2–3

immediate value distribution, A-13

instruction classes, B-73

instruction encoding, K-68 to K-70, K-69

instruction execution issues, K-81

instruction operator categories, A-15

instruction set complications, C-49 to C-50

integer overflow, J-11

vs. MIPS, K-82

vs. MIPS32 sort, K-80

vs. MIPS code, K-75

miss rate vs. virtual addressing, B-37

operands, K-66 to K-68

operand specifiers, K-68

operands per ALU, A-6, A-8

operand types/sizes, A-14

operation count, K-70 to K-71

operations, K-70 to K-72

operators, A-15

overview, K-65 to K-66

precise exceptions, C-59

replacement by RISC, 2

RISC history, L-20 to L-21

RISC instruction set lineage, K-43

sort, K-76 to K-79

sort code, K-77 to K-79

sort register allocation, K-76

swap, K-72 to K-76

swap code, B-74, K-72, K-74

swap full procedure, K-75 to K-76

swap and register preservation, B-74 to B-75

unique instructions, K-28

DEC VAX-11/780, L-6 to L-7, L-11, L-18

DEC VAX 8700

vs. MIPS M2000, K-82, L-21

RISC history, L-21

Dedicated link network

black box network, F-5 to F-6

effective bandwidth, F-17

example, F-6

Defect tolerance, chip fabrication cost case study, 61–62

Deferred addressing, VAX, K-67

Delayed branch

basic scheme, C-23

compiler history, L-31

instructions, K-25

stalls, C-65

Dell Poweredge servers, prices, 53

Dell Poweredge Thunderbird, SAN characteristics, F-76

Dell servers

economies of scale, 456

real-world considerations, 52–55

WSC services, 441

Demodulator, radio receiver, E-23

Denormals, J-14 to J-16, J-20 to J-21

floating-point additions, J-26 to J-27

floating-point underflow, J-36

Dense matrix multiplication, LU kernel, I-8

Density-optimized processors, vs. SPEC-optimized, F-85

Dependability

benchmark examples, D-21 to D-23, D-22

definition, D-10 to D-11

disk operators, D-13 to D-15

integrated circuits, 33–36

Internet Archive Cluster, D-38 to D-40

memory systems, 104–105

WSC goals/requirements, 433

WSC memory, 473–474

WSC storage, 442–443

Dependence analysis

basic approach, H-5

example calculations, H-7

limitations, H-8 to H-9

Dependence distance, loop-carried dependences, H-6

Dependences

antidependences, 152, 320, C-72, C-79

CUDA, 290

as data dependence, 150

data hazards, 167–168

definition, 152–153, 315–316

dynamically scheduled pipelines, C-70 to C-71

dynamically scheduling with scoreboard, C-71

dynamic scheduling with Tomasulo’s algorithm, 172

hardware-based speculation, 183

hazards, 153–154

ILP, 150–156

ILP hardware model, 214–215

ILP limitation studies, 220

loop-level parallelism, 318–322, H-3

dependence analysis, H-6 to H-10

MIPS scoreboarding, C-79

as program properties, 152

sparse matrices, G-13

and Tomasulo’s algorithm, 170

types, 150

vector execution time, 269

vector mask registers, 275–276

VMIPS, 268

Dependent computations, elimination, H-10 to H-12

Descriptor privilege level (DPL), segmented virtual memory, B-53

Descriptor table, IA-32, B-52

Design faults, storage systems, D-11

Desktop computers

characteristics, 6

compiler structure, A-24

as computer class, 5

interconnection networks, F-85

memory hierarchy basics, 78

multimedia support, E-11

multiprocessor importance, 344

performance benchmarks, 38–40

processor comparison, 242

RAID history, L-80

RISC systems

addressing modes, K-5

addressing modes and instruction formats, K-5 to K-6

arithmetic/logical instructions, K-22

conditional branches, K-17

constant extension, K-9

control instructions, K-12

conventions, K-13

data transfer instructions, K-10, K-21

examples, K-3, K-4

features, K-44

FP instructions, K-13, K-23

instruction formats, K-7

multimedia extensions, K-16 to K-19, K-18

system characteristics, E-4

Destination offset, IA-32 segment, B-53

Deterministic routing algorithm

vs. adaptive routing, F-52 to F-55, F-54

DOR, F-46

Dies

embedded systems, E-15

integrated circuits, 28–30, 29

Nehalem floorplan, 30

wafer example, 31, 31–32

Die yield, basic equation, 30–31

Digital Alpha

branches, A-18

conditional instructions, H-27

early pipelined CPUs, L-27

RISC history, L-21

RISC instruction set lineage, K-43

synchronization history, L-64

Digital Alpha 21064, L-48

Digital Alpha 21264

cache hierarchy, 368

floorplan, 143

Digital Alpha MAX

characteristics, K-18

multimedia support, K-18

Digital Alpha processors

addressing modes, K-5

arithmetic/logical instructions, K-11

branches, K-21

conditional branches, K-12, K-17

constant extension, K-9

control flow instruction branches, A-18

conventions, K-13

data transfer instructions, K-10

displacement addressing mode, A-12

exception stopping/restarting, C-47

FP instructions, K-23

immediate value distribution, A-13

MAX, multimedia support, E-11

MIPS precise exceptions, C-59

multimedia support, K-19

recent advances, L-33

as RISC systems, K-4

shared-memory workload, 367–369

unique instructions, K-27 to K-29

Digital Linear Tape, L-77

Digital signal processor (DSP)

cell phones, E-23, E-23, E-23 to E-24

definition, E-3

desktop multimedia support, E-11

embedded RISC extensions, K-19

examples and characteristics, E-6

media extensions, E-10 to E-11

overview, E-5 to E-7

saturating operations, K-18 to K-19

TI TMS320C6x, E-8 to E-10

TI TMS320C6x instruction packet, E-10

TI TMS320C55, E-6 to E-7, E-7 to E-8

TI TMS320C64x, E-9

Dimension-order routing (DOR), definition, F-46

DIMMs See Dual inline memory modules (DIMMs)

Direct attached disks, definition, D-35

Direct-mapped cache

address parts, B-9

address translation, B-38

block placement, B-7

early work, L-10

memory hierarchy basics, 74

memory hierarchy, B-48

optimization, 79–80

Direct memory access (DMA)

historical background, L-81

InfiniBand, F-76

network interface functions, F-7

Sanyo VPC-SX500 digital camera, E-19

Sony PlayStation 2 Emotion Engine, E-18

TI TMS320C55 DSP, E-8

zero-copy protocols, F-91

Direct networks

commercial system topologies, F-37

vs. high-dimensional networks, F-92

vs. MIN costs, F-92

topology, F-34 to F-40

Directory-based cache coherence

advanced directory protocol case study, 420–426

basic considerations, 378–380

case study, 418–420

definition, 354

distributed-memory multiprocessor, 380

large-scale multiprocessor history, L-61

latencies, 425

protocol basics, 380–382

protocol example, 382–386

state transition diagram, 383

Directory-based multiprocessor

characteristics, I-31

performance, I-26

scientific workloads, I-29

synchronization, I-16, I-19 to I-20

Directory controller, cache coherence, I-40 to I-41

Dirty bit

case study, D-61 to D-64

definition, B-11

virtual memory fast address translation, B-46

Dirty block

definition, B-11

read misses, B-36

Discrete cosine transform, DSP, E-5

Disk arrays

deconstruction case study, D-51 to D-55, D-52 to D-55

RAID 6, D-8 to D-9

RAID 10, D-8

RAID levels, D-6 to D-8, D-7

Disk layout, RAID performance prediction, D-57 to D-59

Disk power, basic considerations, D-5

Disk storage

access time gap, D-3

areal density, D-2 to D-5

cylinders, D-5

deconstruction case study, D-48 to D-51, D-50

DRAM/magnetic disk cost vs. access time, D-3

intelligent interfaces, D-4

internal microprocessors, D-4

real faults and failures, D-10 to D-11

throughput vs. command queue depth, D-4

Disk technology

failure rate calculation, 48

Google WSC servers, 469

performance trends, 19–20, 20

WSC Flash memory, 474–475

Dispatch stage

instruction steps, 174

microarchitectural techniques case study, 247–254

Displacement addressing mode

basic considerations, A-10

MIPS, 12

MIPS data transfers, A-34

MIPS instruction format, A-35

value distributions, A-12

VAX, K-67

Display lists, Sony PlayStation 2 Emotion Engine, E-17

Distributed routing, basic concept, F-48

Distributed shared memory (DSM)

basic considerations, 378–380

basic structure, 347–348, 348

characteristics, I-45

directory-based cache coherence, 354, 380, 418–420

multichip multicore multiprocessor, 419

snooping coherence protocols, 355

Distributed shared-memory multiprocessors

cache coherence implementation, I-36 to I-37

scientific application performance, I-26 to I-32, I-28 to I-32

Distributed switched networks, topology, F-34 to F-40

Divide operations

chip comparison, J-60 to J-61

floating-point, stall, C-68

floating-point iterative, J-27 to J-31

integers, speedup

radix-2 division, J-55

radix-4 division, J-56

radix-4 SRT division, J-57

with single adder, J-54 to J-58

integer shifting over zeros, J-45 to J-47

language comparison, J-12

n-bit unsigned integers, J-4

PA-RISC instructions, K-34 to K-35

radix-2, J-4 to J-7

restoring/nonrestoring, J-6

SRT division, J-45 to J-47, J-46

unfinished instructions, 179

DLP See Data-level parallelism (DLP)

DLX

integer arithmetic, J-12

vs. Intel 80x86 operations, K-62, K-63 to K-64

DMA See Direct memory access (DMA)

DOR See Dimension-order routing (DOR)

Double data rate (DDR)

ARM Cortex-A8, 117

DRAM performance, 100

DRAMs and DIMMs, 101

Google WSC servers, 468–469

IBM Blue Gene/L, I-43

InfiniBand, F-77

Intel Core i7, 121

SDRAMs, 101

Double data rate 2 (DDR2), SDRAM timing diagram, 139

Double data rate 3 (DDR3)

DRAM internal organization, 98

GDRAM, 102

Intel Core i7, 118

SDRAM power consumption, 102, 103

Double data rate 4 (DDR4), DRAM, 99

Double data rate 5 (DDR5), GDRAM, 102

Double-extended floating-point arithmetic, J-33 to J-34

Double failures, RAID reconstruction, D-55 to D-57

Double-precision floating point

add-divide, C-68

AVX for x86, 284

chip comparison, J-58

data access benchmarks, A-15

DSP media extensions, E-10 to E-11

Fermi GPU architecture, 306

floating-point pipeline, C-65

GTX 280, 325, 328–330

IBM 360, 171

MIPS, 285, A-38 to A-39

MIPS data transfers, A-34

MIPS registers, 12, A-34

Multimedia SIMD vs. GPUs, 312

operand sizes/types, 12

as operand type, A-13 to A-14

operand usage, 297

pipeline timing, C-54

Roofline model, 287, 326

SIMD Extensions, 283

VMIPS, 266, 266–267

Double rounding

FP precisions, J-34

FP underflow, J-37

Double words

aligned/misaligned addresses, A-8

data access benchmarks, A-15

Intel 80x86, K-50

memory address interpretation, A-7 to A-8

MIPS data types, A-34

operand types/sizes, 12, A-14

stride, 278

DPL See Descriptor privilege level (DPL)

DRAM See Dynamic random-access memory (DRAM)

DRDRAM, Sony PlayStation 2, E-16 to E-17

Driver domains, Xen VM, 111

DSM See Distributed shared memory (DSM)

DSP See Digital signal processor (DSP)

DSS See Decision support system (DSS)

Dual inline memory modules (DIMMs)

clock rates, bandwidth, names, 101

DRAM basics, 99

Google WSC server, 467

Google WSC servers, 468–469

graphics memory, 322–323

Intel Core i7, 118, 121

Intel SCCC, F-70

SDRAMs, 101

WSC memory, 473–474

Dual SIMD Thread Scheduler, example, 305–306

DVFS See Dynamic voltage-frequency scaling (DVFS)

Dynamically allocatable multi-queues (DAMQs), switch microarchitecture, F-56 to F-57

Dynamically scheduled pipelines

basic considerations, C-70 to C-71

with scoreboard, C-71 to C-80

Dynamically shared libraries, control flow instruction addressing modes, A-18

Dynamic energy, definition, 23

Dynamic network reconfiguration, fault tolerance, F-67 to F-68

Dynamic power

energy efficiency, 211

microprocessors, 23

vs. static power, 26

Dynamic random-access memory (DRAM)

bandwidth issues, 322–323

characteristics, 98–100

clock rates, bandwidth, names, 101

cost vs. access time, D-3

cost trends, 27

Cray X1, G-22

CUDA, 290

dependability, 104

disk storage, D-3 to D-4

embedded benchmarks, E-13

errors and faults, D-11

first vector computers, L-45, L-47

Flash memory, 103–104

Google WSC servers, 468–469

GPU SIMD instructions, 296

IBM Blue Gene/L, I-43 to I-44

improvement over time, 17

integrated circuit costs, 28

Intel Core i7, 121

internal organization, 98

magnetic storage history, L-78

memory hierarchy design, 73, 73

memory performance, 100–102

multibanked caches, 86

NVIDIA GPU Memory structures, 305

performance milestones, 20

power consumption, 63

real-world server considerations, 52–55

Roofline model, 286

server energy savings, 25

Sony PlayStation 2, E-16, E-17

speed trends, 99

technology trends, 17

vector memory systems, G-9

vector processor, G-25

WSC efficiency measurement, 450

WSC memory costs, 473–474

WSC memory hierarchy, 444–445

WSC power modes, 472

yield, 32

Dynamic scheduling

first use, L-27

ILP

basic concept, 168–169

definition, 168

example and algorithms, 176–178

with multiple issue and speculation, 197–202

overcoming data hazards, 167–176

Tomasulo’s algorithm, 170–176, 178–179, 181–183

MIPS scoreboarding, C-79

SMT on superscalar processors, 230

and unoptimized code, C-81

Dynamic voltage-frequency scaling (DVFS)

energy efficiency, 25

Google WSC, 467

processor performance equation, 52

Dynamo (Amazon), 438, 452

Table of Contents for Computer Architecture: A Quantitative Approach
