C
Cache bandwidth
multibanked caches,
85–86
nonblocking caches,
83–85
pipelined cache access,
82
Cache block
compiler optimizations,
89–90
critical word first,
86–87
directory-based cache coherence protocol,
382–386,
383
scientific workloads on symmetric shared-memory multiprocessors, I-22, I-25,
I-25
shared-memory multiprogramming workload,
375–377,
376
write invalidate protocol implementation,
356–357
Cache coherence
advanced directory protocol case study,
420–426
large-scale multiprocessor history, L-61
large-scale multiprocessors
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM implementation, I-36 to I-37
latency hiding with speculation,
396
memory hierarchy basics,
75
multiprocessor-optimized software,
409
single-chip multicore processor case study,
412–418
single memory location example,
352
steps and bus traffic examples,
391
Cache hit
AMD Opteron example,
B-14
Cache latency, nonblocking cache,
83–84
Cache miss
distributed-memory multiprocessors,
I-32
example calculations,
83–84
interconnection network, F-87
large-scale multiprocessors, I-34 to I-35
single
vs. multiple thread executions,
228
Cache-only memory architecture (COMA), L-61
Cache optimizations
basic optimizations,
B-40
compiler-controlled prefetching,
92–95
compiler optimizations,
87–90
critical word first,
86–87
hardware instruction prefetching,
91–92,
92
pipelined cache access,
82
simple first-level caches,
79–80
write buffer merging,
87,
88
Cache performance
basic optimizations,
B-40
Cache prefetch, cache optimization,
92
Caches
See also Memory hierarchy
access time
vs. block size,
B-28
embedded systems, E-4 to E-5
Fermi GPU architecture,
306
ILP for realizable processors,
216–218
multichip multicore multiprocessor,
419
Sony PlayStation 2 Emotion Engine, E-18
Cache size
highly parallel memory systems,
133
memory hierarchy basics,
76
misses per instruction,
126,
371
miss rate reduction,
B-28
and relative execution time,
B-34
scientific workloads
distributed-memory multiprocessors,
I-29 to I-31
symmetric shared-memory multiprocessors, I-22 to I-23,
I-24
shared-memory multiprogramming workload,
376
virtually addressed,
B-37
Call gate
IA-32 segment descriptors,
B-53
segmented virtual memory,
B-54
Calls
Intel 80x86 integer operations, K-51
MIPS control flow instructions,
A-38
multiprogrammed workload,
378
NVIDIA GPU Memory structures,
304–305
return address predictors,
206
shared-memory multiprocessor workload,
369
Canonical form, AMD64 paged virtual memory,
B-55
Capabilities, protection schemes, L-9 to L-10
Capacity misses
memory hierarchy basics,
75
scientific workloads on symmetric shared-memory multiprocessors, I-22,
I-23, I-24
shared-memory workload,
373
Capital expenditures (CAPEX)
Carrier sensing, shared-media networks, F-23
Carrier signal, wireless networks, E-21
Carry condition code, MIPS core, K-9 to K-16
Carry-in, carry-skip adder, J-42
Carry-lookahead adder (CLA)
early computer arithmetic, J-63
integer addition speedup, J-37 to J-41
with ripple-carry adder,
J-42
Carry-out
carry-lookahead circuit,
J-38
floating-point addition speedup, J-25
Carry-propagate adder (CPA)
integer multiplication, J-48, J-51
multipass array multiplier,
J-51
Carry-save adder (CSA)
integer division, J-54 to J-55
integer multiplication, J-47 to J-48,
J-48
Carry-select adder
characteristics, J-43 to J-44
Carry-skip adder (CSA)
characteristics, J-41 to J-43
Case statements
control flow instruction addressing modes,
A-18
return address predictors,
206
Case studies
advanced directory protocol,
420–426
cell phones
Nokia circuit board,
E-24
standards and evolution, E-25
wireless communication challenges,
E-21
wireless networks, E-21 to E-22
chip fabrication cost,
61–62
computer system power consumption,
63–64
disk array deconstruction, D-51 to D-55,
D-52 to D-55
disk deconstruction, D-48 to D-51,
D-50
highly parallel memory systems,
133–136
I/O subsystem design, D-59 to D-61
microarchitectural techniques,
247–254
RAID performance prediction, D-57 to D-59
RAID reconstruction, D-55 to D-57
Sanyo VPC-SX500 digital camera, E-19
single-chip multicore processor,
412–418
Sony PlayStation 2 Emotion Engine, E-15 to E-18
vector kernel on vector processor and GPU,
334–336
C/C++ language
GPU computing history, L-52
hardware impact on software development,
integer division/remainder,
J-12
NVIDIA GPU programming,
289
return address predictors,
206
Cell, Barnes-Hut
n-body algorithm, I-9
Cell phones
embedded system case study
characteristics, E-22 to E-24
standards and evolution, E-25
wireless network overview, E-21 to E-22
Nokia circuit board,
E-24
wireless communication challenges,
E-21
Centralized shared-memory multiprocessors
cache coherence enforcement,
354–355
cache coherence extensions,
362–363
invalidate protocol implementation,
356–357
SMP and snooping limitations,
363–364
snooping coherence implementation,
365–366
snooping coherence protocols,
355–356
Centralized switched networks
topology, F-30 to F-34,
F-31
Centrally buffered switch, microarchitecture, F-57
Central processing unit (CPU)
average memory access time,
B-17
coarse-grained multithreading,
224
early pipelined versions, L-26 to L-27
exception stopping/restarting,
C-47
extensive pipelining,
C-81
GPU computing history, L-52
instruction set complications,
C-50
performance measurement history, L-6
pipeline branch issues,
C-41
pipelining performance,
C-10
Sony PlayStation 2 Emotion Engine, E-17
SPEC server benchmarks,
40
vector memory systems,
G-10
Central processing unit (CPU) time
processor performance equation,
49–51
processor performance time,
49
Chaining
convoys, DAXPY code,
G-16
vector processor performance, G-11 to G-12,
G-12
Channels, cell phones, E-24
Character
floating-point performance,
A-2
Charge-coupled device (CCD), Sanyo VPC-SX500 digital camera, E-19
Chime
GPUs
vs. vector architectures,
308
NVIDIA GPU computational structures,
296
vector sequence calculations,
270
Chip-crossing wire delay, F-70
Choke packets, congestion management, F-65
Chunk
disk array deconstruction, D-51
Circuit switching
congestion management, F-64 to F-65
interconnected networks, F-50
Circulating water system (CWS)
cooling system design,
448
Clean block, definition,
B-11
Climate Savers Computing Initiative, power supply efficiencies,
462
Clock cycles
and branch penalties,
205
and full associativity,
B-23
GPU conditional branching,
303
instruction fetch bandwidth,
202–203
Intel Core i7 branch predictor,
166
pipelining performance,
C-10
processor performance equation,
49
RISC classic pipeline,
C-7
switch microarchitecture pipelining, F-61
vector architectures, G-4
vector execution time,
269
Clock cycles per instruction (CPI)
cache hit calculation,
B-5
data hazards requiring stalls,
C-20
extensive pipelining,
C-81
floating-point calculations,
50–52
microprocessor advances, L-33
MIPS R4000 performance,
C-69
miss penalty reduction,
B-32
multiprocessing/multithreading-based performance,
398–400
multiprocessor communication calculations,
350
pipeline branch issues,
C-41
processor performance calculations,
218–219
processor performance time,
49–51
Sun T1 multithreading unicore performance,
229
Tomasulo’s algorithm,
181
VAX 8700
vs. MIPS M2000,
K-82
Clock cycle time
MIPS implementation,
C-34
shared-
vs. switched-media networks, F-25
Clock periods, processor performance equation,
48–49
Clock rate
ILP for realizable processors,
218
microprocessor advances, L-33
MIPS pipeline FP operations,
C-53
multicore processor performance,
400
Clocks, processor performance equation,
48–49
Clock skew, pipelining performance,
C-10
Clock ticks
processor performance equation,
48–49
Cloud computing
utility computing history, L-73 to L-74
Clusters
historical background, L-62 to L-64
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interconnection network domains, F-3 to F-4
large-scale multiprocessors, I-6
large-scale multiprocessor trends, L-62 to L-63
outage/anomaly statistics,
435
utility computing, L-73 to L-74
CMOS
first vector computers, L-46, L-48
vector processors, G-25 to G-27
Coarse-grained multithreading, definition,
224–226
Code division multiple access (CDMA), cell phones, E-25
Code generation
general-purpose register computers,
A-6
ILP limitation studies,
220
loop unrolling/scheduling,
162
Code scheduling
parallelism, H-15 to H-23
superblock scheduling, H-21 to H-23,
H-22
trace scheduling, H-19 to H-21,
H-20
Code size
architect-compiler considerations,
A-30
benchmark information,
A-2
flawless architecture design,
A-45
Coefficient of variance, D-27
Coherence misses
scientific workloads on symmetric shared-memory multiprocessors, I-22
Cold-start misses, definition,
B-23
Collision, shared-media networks, F-23
Collision detection, shared-media networks, F-23
Collision misses, definition,
B-23
Collocation sites, interconnection networks, F-85
Column access strobe (CAS), DRAM,
98–99
Combining tree, large-scale multiprocessor synchronization, I-18
Command queue depth,
vs. disk throughput,
D-4
Commercial interconnection networks
congestion management, F-64 to F-66
connectivity, F-62 to F-63
cross-company interoperability, F-63 to F-64
DECstation 5000 reboots,
F-69
fault tolerance, F-66 to F-69
Commercial workloads
execution time distribution,
369
symmetric shared-memory multiprocessors,
367–374
Commodities
Ethernet rack switch,
442
shared-memory multiprocessor,
441
Commodity cluster, characteristics, I-45
Common data bus (CDB)
dynamic scheduling with Tomasulo’s algorithm,
172,
175
FP unit with Tomasulo’s algorithm,
185
reservation stations/register tags,
177
Tomasulo’s algorithm,
180,
182
Common Internet File System (CIFS), D-35
NetApp FAS6000 filer, D-41 to D-42
Communication bandwidth, basic considerations, I-3
Communication latency, basic considerations, I-3 to I-4
Communication latency hiding, basic considerations, I-4
Communication mechanism
adaptive routing, F-93 to F-94
internetworking, F-81 to F-82
large-scale multiprocessors
multiprocessor communication calculations,
350
network interfaces, F-7 to F-8
NEWS communication, F-42 to F-43
Communication protocol, definition, F-8
Compare instruction, VAX, K-71
Compares, MIPS core, K-9 to K-16
Compare-select-store unit (CSSU), TI TMS320C55 DSP, E-8
Compiler-controlled prefetching, miss penalty/rate reduction,
92–95
Compiler optimizations
and consistency model,
396
miss rate reduction,
87–90
Compiler scheduling
hardware support, L-30 to L-31
IBM 360 architecture,
171
Compiler speculation, hardware support
preserving exception behavior, H-28 to H-32
Compiler techniques
global code scheduling, H-17 to H-18
vector sparse matrices, G-12
Complex Instruction Set Computer (CISC)
Compulsory misses
memory hierarchy basics,
75
shared-memory workload,
373
Computation-to-communication ratios
parallel programs, I-10 to I-12
Compute-optimized processors, interconnection networks, F-88
Computer aided design (CAD) tools, cache optimization,
79–80
Computer architecture
See also Architecture
coining of term, K-83 to K-84
computer design innovations,
floating-point addition, rules,
J-24
high-level language, L-18 to L-19
instruction execution issues, K-81
multiprocessor software development,
407–409
Computer arithmetic
chip comparison,
J-58, J-58 to J-61,
J-59 to J-60
floating point
fused multiply-add, J-32 to J-33
iterative division, J-27 to J-31
and memory bandwidth, J-62
special values and denormals, J-14 to J-15
underflow, J-36 to J-37, J-62
floating-point multiplication
integer addition speedup
carry-lookahead, J-37 to J-41
carry-lookahead circuit,
J-38
carry-lookahead tree,
J-40
carry-lookahead tree adder,
J-41
carry-select adder,
J-43, J-43 to J-44,
J-44
carry-skip adder, J-41 to J-43,
J-42
integer arithmetic
language comparison,
J-12
Radix-2 multiplication/division,
J-4, J-4 to J-7
restoring/nonrestoring division,
J-6
ripple-carry addition, J-2 to J-3,
J-3
signed numbers, J-7 to J-10
systems issues, J-10 to J-13
integer division
radix-4 SRT division,
J-57
with single adder, J-54 to J-58
SRT division, J-45 to J-47, J-46
integer-FP conversions, J-62
integer multiplication
with many adders, J-50 to J-54
multipass array multiplier,
J-51
signed-digit addition table,
J-54
with single adder, J-47 to J-49,
J-48
integer multiplication/division, shifting over zeros, J-45 to J-47
Computer chip fabrication
Computer classes
parallelism and parallel architectures,
9–10
and system characteristics,
E-4
warehouse-scale computers,
Computer design principles
principle of locality,
45
processor performance equation,
48–52
Computer history, technology and architecture,
2–5
Computer room air-conditioning (CRAC), WSC infrastructure,
448–449
Conditional branches
compare frequencies,
A-20
global code scheduling, H-16,
H-16
MIPS control flow instructions,
A-38,
A-40
PA-RISC instructions, K-34,
K-34
predictor misprediction rates,
166
static branch prediction,
C-26
vector-GPU comparison,
311
Conditional instructions
exposing parallelism, H-23 to H-27
limitations, H-26 to H-27
Condition codes
control flow instructions,
14
high-level instruction set,
A-43
instruction set complications,
C-50
pipeline branch penalties,
C-23
Conflict misses
cache coherence mechanism,
358
memory hierarchy basics,
75
shared-memory workload,
373
Congestion control
commercial interconnection networks, F-64
system area network history, F-101
Congestion management, commercial interconnection networks, F-64 to F-66
Connectedness
dimension-order routing, F-47 to F-48
interconnection network topology, F-29
Connection delay, multi-device interconnection networks, F-25
Connection Machine CM-5, F-91, F-100
Connection Multiprocessor 2, L-44, L-57
Constellation, characteristics,
I-45
Containers
cluster history, L-74 to L-75
Control bits, messages, F-6
Control Data Corporation (CDC), first vector computers, L-44 to L-45
Control Data Corporation (CDC) 6600
computer architecture definition, L-18
early computer arithmetic, J-64
first dynamic scheduling, L-27
multiple-issue processor development, L-28
multithreading history, L-34
Control Data Corporation (CDC) STAR-100
first vector computers, L-44
peak performance
vs. start-up overhead,
331
Control Data Corporation (CDC) STAR processor, G-26
Control dependences
conditional instructions, H-24
global code scheduling, H-16
hardware-based speculation,
183
and Tomasulo’s algorithm,
170
Control flow instructions
conditional branch options,
A-19
conditional instructions, H-27
hardware
vs. software speculation,
221
Intel 80x86 integer operations, K-51
Control instructions
RISCs
desktop systems,
K-12,
K-22
Controllers, historical background, L-80 to L-81
Control Processor
Thread Block Scheduler,
294
vector unit structure,
273
Conventional datacenters,
vs. WSCs,
436
Convex processors, vector processor history, G-26
Convoy
chained, DAXPY code,
G-16
vector starting times,
G-4
Copper wiring
interconnection networks, F-9
“Coprocessor operations,” MIPS core extensions, K-21
Copy propagation, definition, H-10 to H-11
Core plus ASIC, embedded systems, E-3
Correlating branch predictors, branch costs,
162–163
Cost
bisection bandwidth, F-89
chip fabrication case study,
61–62
interconnecting node calculations, F-31 to F-32, F-35
Internet Archive Cluster, D-38 to D-40
I/O system design/evaluation, D-36
magnetic storage history, L-78
memory hierarchy design,
72
MINs
vs. direct networks, F-92
multiprocessor cost relationship,
409
multiprocessor linear speedup,
407
SIMD supercomputer development, L-43
torus topology interconnections, F-36 to F-38
WSC network bottleneck,
461
Cost associativity, cloud computing,
460–461
Cost-performance
commercial interconnection networks, F-63
IBM eServer p5 processor,
409
sorting case study, D-64 to D-67
WSC goals/requirements,
433
WSC hardware inactivity,
474
Cost trends
integrated circuits,
28–32
manufacturing
vs. operation,
33
time, volume, commoditization,
27–28
Count register, PowerPC instructions, K-32 to K-33
Cray, Seymour, G-25, G-27, L-44, L-47
Cray-1
first vector computers, L-44 to L-45
peak performance
vs. start-up overhead,
331
vector performance measures, G-16
Cray-2
first vector computers, L-47
Cray C90
first vector computers, L-46, L-48
vector performance calculations, G-8
Cray Research T3D, F-86 to F-87,
F-87
Cray supercomputers, early computer arithmetic, J-63 to J-64
Cray T3E, F-67, F-94, F-100, L-48, L-60
Cray T90, memory bank calculations,
276
Cray X1
first vector computers, L-46, L-48
MSP module,
G-22, G-23 to G-24
Cray X2, L-46 to L-47
first vector computers, L-48 to L-49
Cray X-MP, L-45
first vector computers, L-47
Cray Y-MP
first vector computers, L-45 to L-47
parallel processing debates, L-57
Create vector index instruction (CVI), sparse matrices, G-13
Credit-based flow control
interconnection networks, F-10, F-17
Critical path
global code scheduling, H-16
trace scheduling, H-19 to H-21,
H-20
Critical word first, cache optimization,
86–87
Crossbars
centralized switched networks, F-30,
F-31
switch microarchitecture, F-62
switch microarchitecture pipelining, F-60 to F-61,
F-61
Crossbar switch
centralized switched networks, F-30
interconnecting node calculations, F-31 to F-32
Cross-company interoperability, commercial interconnection networks, F-63 to F-64
C# language, hardware impact on software development,
CUDA (Compute Unified Device Architecture)
GPU computing history, L-52
GPU conditional branching,
303
GPUs
vs. vector architectures,
310
NVIDIA GPU programming,
289
CUDA Thread
CUDA programming model,
300,
315
definitions and terms,
314
GPU Memory structures,
304
Current frame pointer (CFM), IA-64 register model, H-33 to H-34
Custom cluster
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
Cut-through packet switching, F-51
CYBER 180/990, precise exceptions,
C-59
CYBER 205
peak performance
vs. start-up overhead,
331
vector processor history, G-26 to G-27
Cycles, processor performance equation,
49
Cyclic redundancy check (CRC)
IBM Blue Gene/L 3D torus network, F-73
Cydrome Cydra 6, L-30, L-32