C
Cache bandwidth
multibanked caches,
85–86
nonblocking caches,
83–85
pipelined cache access,
82
Cache block
compiler optimizations,
89–90
critical word first,
86–87
directory-based cache coherence protocol,
382–386,
383
scientific workloads on symmetric shared-memory multiprocessors, I-22, I-25,
I-25
shared-memory multiprogramming workload,
375–377,
376
write invalidate protocol implementation,
356–357
Cache coherence
advanced directory protocol case study,
420–426
large-scale multiprocessor history, L-61
large-scale multiprocessors
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM implementation, I-36 to I-37
latency hiding with speculation,
396
memory hierarchy basics,
75
multiprocessor-optimized software,
409
single-chip multicore processor case study,
412–418
single memory location example,
352
steps and bus traffic examples,
391
Cache hit
AMD Opteron example,
B-14
Cache latency, nonblocking cache,
83–84
Cache miss
distributed-memory multiprocessors,
I-32
example calculations,
83–84
interconnection network, F-87
large-scale multiprocessors, I-34 to I-35
single
vs. multiple thread executions,
228
Cache-only memory architecture (COMA), L-61
Cache optimizations
basic optimizations,
B-40
compiler-controlled prefetching,
92–95
compiler optimizations,
87–90
critical word first,
86–87
hardware instruction prefetching,
91–92,
92
pipelined cache access,
82
simple first-level caches,
79–80
write buffer merging,
87,
88
Cache performance
basic optimizations,
B-40
Cache prefetch, cache optimization,
92
Caches
See also Memory hierarchy
access time
vs. block size,
B-28
embedded systems, E-4 to E-5
Fermi GPU architecture,
306
ILP for realizable processors,
216–218
multichip multicore multiprocessor,
419
Sony PlayStation 2 Emotion Engine, E-18
Cache size
highly parallel memory systems,
133
memory hierarchy basics,
76
misses per instruction,
126,
371
miss rate reduction,
B-28
and relative execution time,
B-34
scientific workloads
distributed-memory multiprocessors,
I-29 to I-31
symmetric shared-memory multiprocessors, I-22 to I-23,
I-24
shared-memory multiprogramming workload,
376
virtually addressed,
B-37
Call gate
IA-32 segment descriptors,
B-53
segmented virtual memory,
B-54
Calls
Intel 80x86 integer operations, K-51
MIPS control flow instructions,
A-38
multiprogrammed workload,
378
NVIDIA GPU Memory structures,
304–305
return address predictors,
206
shared-memory multiprocessor workload,
369
Canonical form, AMD64 paged virtual memory,
B-55
Capabilities, protection schemes, L-9 to L-10
Capacity misses
memory hierarchy basics,
75
scientific workloads on symmetric shared-memory multiprocessors, I-22,
I-23, I-24
shared-memory workload,
373
Capital expenditures (CAPEX)
Carrier sensing, shared-media networks, F-23
Carrier signal, wireless networks, E-21
Carry condition code, MIPS core, K-9 to K-16
Carry-in, carry-skip adder, J-42
Carry-lookahead adder (CLA)
early computer arithmetic, J-63
integer addition speedup, J-37 to J-41
with ripple-carry adder,
J-42
Carry-out
carry-lookahead circuit,
J-38
floating-point addition speedup, J-25
Carry-propagate adder (CPA)
integer multiplication, J-48, J-51
multipass array multiplier,
J-51
Carry-save adder (CSA)
integer division, J-54 to J-55
integer multiplication, J-47 to J-48,
J-48
Carry-select adder
characteristics, J-43 to J-44
Carry-skip adder (CSA)
characteristics, J-41 to J-43
Case statements
control flow instruction addressing modes,
A-18
return address predictors,
206
Case studies
advanced directory protocol,
420–426
cell phones
Nokia circuit board,
E-24
standards and evolution, E-25
wireless communication challenges,
E-21
wireless networks, E-21 to E-22
chip fabrication cost,
61–62
computer system power consumption,
63–64
disk array deconstruction, D-51 to D-55,
D-52 to D-55
disk deconstruction, D-48 to D-51,
D-50
highly parallel memory systems,
133–136
I/O subsystem design, D-59 to D-61
microarchitectural techniques,
247–254
RAID performance prediction, D-57 to D-59
RAID reconstruction, D-55 to D-57
Sanyo VPC-SX500 digital camera, E-19
single-chip multicore processor,
412–418
Sony PlayStation 2 Emotion Engine, E-15 to E-18
vector kernel on vector processor and GPU,
334–336
C/C++ language
GPU computing history, L-52
hardware impact on software development,
integer division/remainder,
J-12
NVIDIA GPU programming,
289
return address predictors,
206
Cell, Barnes-Hut
n-body algorithm, I-9
Cell phones
embedded system case study
characteristics, E-22 to E-24
standards and evolution, E-25
wireless network overview, E-21 to E-22
Nokia circuit board,
E-24
wireless communication challenges,
E-21
Centralized shared-memory multiprocessors
cache coherence enforcement,
354–355
cache coherence extensions,
362–363
invalidate protocol implementation,
356–357
SMP and snooping limitations,
363–364
snooping coherence implementation,
365–366
snooping coherence protocols,
355–356
Centralized switched networks
topology, F-30 to F-34,
F-31
Centrally buffered switch, microarchitecture, F-57
Central processing unit (CPU)
average memory access time,
B-17
coarse-grained multithreading,
224
early pipelined versions, L-26 to L-27
exception stopping/restarting,
C-47
extensive pipelining,
C-81
GPU computing history, L-52
instruction set complications,
C-50
performance measurement history, L-6
pipeline branch issues,
C-41
pipelining performance,
C-10
Sony PlayStation 2 Emotion Engine, E-17
SPEC server benchmarks,
40
vector memory systems,
G-10
Central processing unit (CPU) time
processor performance equation,
49–51
processor performance time,
49
Chaining
convoys, DAXPY code,
G-16
vector processor performance, G-11 to G-12,
G-12
Channels, cell phones, E-24
Character
floating-point performance,
A-2
Charge-coupled device (CCD), Sanyo VPC-SX500 digital camera, E-19
Chime
GPUs
vs. vector architectures,
308
NVIDIA GPU computational structures,
296
vector sequence calculations,
270
Chip-crossing wire delay, F-70
Choke packets, congestion management, F-65
Chunk
disk array deconstruction, D-51
Circuit switching
congestion management, F-64 to F-65
interconnected networks, F-50
Circulating water system (CWS)
cooling system design,
448
Clean block, definition,
B-11
Climate Savers Computing Initiative, power supply efficiencies,
462
Clock cycles
and branch penalties,
205
and full associativity,
B-23
GPU conditional branching,
303
instruction fetch bandwidth,
202–203
Intel Core i7 branch predictor,
166
pipelining performance,
C-10
processor performance equation,
49
RISC classic pipeline,
C-7
switch microarchitecture pipelining, F-61
vector architectures, G-4
vector execution time,
269
Clock cycles per instruction (CPI)
cache hit calculation,
B-5
data hazards requiring stalls,
C-20
extensive pipelining,
C-81
floating-point calculations,
50–52
microprocessor advances, L-33
MIPS R4000 performance,
C-69
miss penalty reduction,
B-32
multiprocessing/multithreading-based performance,
398–400
multiprocessor communication calculations,
350
pipeline branch issues,
C-41
processor performance calculations,
218–219
processor performance time,
49–51
Sun T1 multithreading unicore performance,
229
Tomasulo’s algorithm,
181
VAX 8700
vs. MIPS M2000,
K-82
Clock cycle time
MIPS implementation,
C-34
shared-
vs. switched-media networks, F-25
Clock periods, processor performance equation,
48–49
Clock rate
ILP for realizable processors,
218
microprocessor advances, L-33
MIPS pipeline FP operations,
C-53
multicore processor performance,
400
Clocks, processor performance equation,
48–49
Clock skew, pipelining performance,
C-10
Clock ticks
processor performance equation,
48–49
Cloud computing
utility computing history, L-73 to L-74
Clusters
historical background, L-62 to L-64
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interconnection network domains, F-3 to F-4
large-scale multiprocessors, I-6
large-scale multiprocessor trends, L-62 to L-63
outage/anomaly statistics,
435
utility computing, L-73 to L-74
CMOS
first vector computers, L-46, L-48
vector processors, G-25 to G-27
Coarse-grained multithreading, definition,
224–226
Code division multiple access (CDMA), cell phones, E-25
Code generation
general-purpose register computers,
A-6
ILP limitation studies,
220
loop unrolling/scheduling,
162
Code scheduling
parallelism, H-15 to H-23
superblock scheduling, H-21 to H-23,
H-22
trace scheduling, H-19 to H-21,
H-20
Code size
architect-compiler considerations,
A-30
benchmark information,
A-2
flawless architecture design,
A-45
Coefficient of variance, D-27
Coherence misses
scientific workloads on symmetric shared-memory multiprocessors, I-22
Cold-start misses, definition,
B-23
Collision, shared-media networks, F-23
Collision detection, shared-media networks, F-23
Collision misses, definition,
B-23
Collocation sites, interconnection networks, F-85
Column access strobe (CAS), DRAM,
98–99
Combining tree, large-scale multiprocessor synchronization, I-18
Command queue depth,
vs. disk throughput,
D-4
Commercial interconnection networks
congestion management, F-64 to F-66
connectivity, F-62 to F-63
cross-company interoperability, F-63 to F-64
DECstation 5000 reboots,
F-69
fault tolerance, F-66 to F-69
Commercial workloads
execution time distribution,
369
symmetric shared-memory multiprocessors,
367–374
Commodities
Ethernet rack switch,
442
shared-memory multiprocessor,
441
Commodity cluster, characteristics, I-45
Common data bus (CDB)
dynamic scheduling with Tomasulo’s algorithm,
172,
175
FP unit with Tomasulo’s algorithm,
185
reservation stations/register tags,
177
Tomasulo’s algorithm,
180,
182
Common Internet File System (CIFS), D-35
NetApp FAS6000 filer, D-41 to D-42
Communication bandwidth, basic considerations, I-3
Communication latency, basic considerations, I-3 to I-4
Communication latency hiding, basic considerations, I-4
Communication mechanism
adaptive routing, F-93 to F-94
internetworking, F-81 to F-82
large-scale multiprocessors
multiprocessor communication calculations,
350
network interfaces, F-7 to F-8
NEWS communication, F-42 to F-43
Communication protocol, definition, F-8
Compare instruction, VAX, K-71
Compares, MIPS core, K-9 to K-16
Compare-select-store unit (CSSU), TI TMS320C55 DSP, E-8
Compiler-controlled prefetching, miss penalty/rate reduction,
92–95
Compiler optimizations
and consistency model,
396
miss rate reduction,
87–90
Compiler scheduling
hardware support, L-30 to L-31
IBM 360 architecture,
171
Compiler speculation, hardware support
preserving exception behavior, H-28 to H-32
Compiler techniques
global code scheduling, H-17 to H-18
vector sparse matrices, G-12
Complex Instruction Set Computer (CISC)
Compulsory misses
memory hierarchy basics,
75
shared-memory workload,
373
Computation-to-communication ratios
parallel programs, I-10 to I-12
Compute-optimized processors, interconnection networks, F-88
Computer aided design (CAD) tools, cache optimization,
79–80
Computer architecture
See also Architecture
coining of term, K-83 to K-84
computer design innovations,
floating-point addition, rules,
J-24
high-level language, L-18 to L-19
instruction execution issues, K-81
multiprocessor software development,
407–409
Computer arithmetic
chip comparison,
J-58, J-58 to J-61,
J-59 to J-60
floating point
fused multiply-add, J-32 to J-33
iterative division, J-27 to J-31
and memory bandwidth, J-62
special values and denormals, J-14 to J-15
underflow, J-36 to J-37, J-62
floating-point multiplication
integer addition speedup
carry-lookahead, J-37 to J-41
carry-lookahead circuit,
J-38
carry-lookahead tree,
J-40
carry-lookahead tree adder,
J-41
carry-select adder,
J-43, J-43 to J-44,
J-44
carry-skip adder, J-41 to J-43,
J-42
integer arithmetic
language comparison,
J-12
Radix-2 multiplication/division,
J-4, J-4 to J-7
restoring/nonrestoring division,
J-6
ripple-carry addition, J-2 to J-3,
J-3
signed numbers, J-7 to J-10
systems issues, J-10 to J-13
integer division
radix-4 SRT division,
J-57
with single adder, J-54 to J-58
SRT division, J-45 to J-47, J-46
integer-FP conversions, J-62
integer multiplication
with many adders, J-50 to J-54
multipass array multiplier,
J-51
signed-digit addition table,
J-54
with single adder, J-47 to J-49,
J-48
integer multiplication/division, shifting over zeros, J-45 to J-47
Computer chip fabrication
Computer classes
parallelism and parallel architectures,
9–10
and system characteristics,
E-4
warehouse-scale computers,
Computer design principles
principle of locality,
45
processor performance equation,
48–52
Computer history, technology and architecture,
2–5
Computer room air-conditioning (CRAC), WSC infrastructure,
448–449
Conditional branches
compare frequencies,
A-20
global code scheduling, H-16,
H-16
MIPS control flow instructions,
A-38,
A-40
PA-RISC instructions, K-34,
K-34
predictor misprediction rates,
166
static branch prediction,
C-26
vector-GPU comparison,
311
Conditional instructions
exposing parallelism, H-23 to H-27
limitations, H-26 to H-27
Condition codes
control flow instructions,
14
high-level instruction set,
A-43
instruction set complications,
C-50
pipeline branch penalties,
C-23
Conflict misses
cache coherence mechanism,
358
memory hierarchy basics,
75
shared-memory workload,
373
Congestion control
commercial interconnection networks, F-64
system area network history, F-101
Congestion management, commercial interconnection networks, F-64 to F-66
Connectedness
dimension-order routing, F-47 to F-48
interconnection network topology, F-29
Connection delay, multi-device interconnection networks, F-25
Connection Machine CM-5, F-91, F-100
Connection Multiprocessor 2, L-44, L-57
Constellation, characteristics,
I-45
Containers
cluster history, L-74 to L-75
Control bits, messages, F-6
Control Data Corporation (CDC), first vector computers, L-44 to L-45
Control Data Corporation (CDC) 6600
computer architecture definition, L-18
early computer arithmetic, J-64
first dynamic scheduling, L-27
multiple-issue processor development, L-28
multithreading history, L-34
Control Data Corporation (CDC) STAR-100
first vector computers, L-44
peak performance
vs. start-up overhead,
331
Control Data Corporation (CDC) STAR processor, G-26
Control dependences
conditional instructions, H-24
global code scheduling, H-16
hardware-based speculation,
183
and Tomasulo’s algorithm,
170
Control flow instructions
conditional branch options,
A-19
conditional instructions, H-27
hardware
vs. software speculation,
221
Intel 80x86 integer operations, K-51
Control instructions
RISCs
desktop systems,
K-12,
K-22
Controllers, historical background, L-80 to L-81
Control Processor
Thread Block Scheduler,
294
vector unit structure,
273
Conventional datacenters,
vs. WSCs,
436
Convex processors, vector processor history, G-26
Convoy
chained, DAXPY code,
G-16
vector starting times,
G-4
Copper wiring
interconnection networks, F-9
“Coprocessor operations,” MIPS core extensions, K-21
Copy propagation, definition, H-10 to H-11
Core plus ASIC, embedded systems, E-3
Correlating branch predictors, branch costs,
162–163
Cost
bisection bandwidth, F-89
chip fabrication case study,
61–62
interconnecting node calculations, F-31 to F-32, F-35
Internet Archive Cluster, D-38 to D-40
I/O system design/evaluation, D-36
magnetic storage history, L-78
memory hierarchy design,
72
MINs
vs. direct networks, F-92
multiprocessor cost relationship,
409
multiprocessor linear speedup,
407
SIMD supercomputer development, L-43
torus topology interconnections, F-36 to F-38
WSC network bottleneck,
461
Cost associativity, cloud computing,
460–461
Cost-performance
commercial interconnection networks, F-63
IBM eServer p5 processor,
409
sorting case study, D-64 to D-67
WSC goals/requirements,
433
WSC hardware inactivity,
474
Cost trends
integrated circuits,
28–32
manufacturing
vs. operation,
33
time, volume, commoditization,
27–28
Count register, PowerPC instructions, K-32 to K-33
Cray, Seymour, G-25, G-27, L-44, L-47
Cray-1
first vector computers, L-44 to L-45
peak performance
vs. start-up overhead,
331
vector performance measures, G-16
Cray-2
first vector computers, L-47
Cray C90
first vector computers, L-46, L-48
vector performance calculations, G-8
Cray Research T3D, F-86 to F-87,
F-87
Cray supercomputers, early computer arithmetic, J-63 to J-64
Cray T3E, F-67, F-94, F-100, L-48, L-60
Cray T90, memory bank calculations,
276
Cray X1
first vector computers, L-46, L-48
MSP module,
G-22, G-23 to G-24
Cray X2, L-46 to L-47
first vector computers, L-48 to L-49
Cray X-MP, L-45
first vector computers, L-47
Cray Y-MP
first vector computers, L-45 to L-47
parallel processing debates, L-57
Create vector index instruction (CVI), sparse matrices, G-13
Credit-based flow control
interconnection networks, F-10, F-17
Critical path
global code scheduling, H-16
trace scheduling, H-19 to H-21,
H-20
Critical word first, cache optimization,
86–87
Crossbars
centralized switched networks, F-30,
F-31
switch microarchitecture, F-62
switch microarchitecture pipelining, F-60 to F-61,
F-61
Crossbar switch
centralized switched networks, F-30
interconnecting node calculations, F-31 to F-32
Cross-company interoperability, commercial interconnection networks, F-63 to F-64
C# language, hardware impact on software development,
CUDA (Compute Unified Device Architecture)
GPU computing history, L-52
GPU conditional branching,
303
GPUs
vs. vector architectures,
310
NVIDIA GPU programming,
289
CUDA Thread
CUDA programming model,
300,
315
definitions and terms,
314
GPU Memory structures,
304
Current frame pointer (CFM), IA-64 register model, H-33 to H-34
Custom cluster
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
Cut-through packet switching, F-51
CYBER 180/990, precise exceptions,
C-59
CYBER 205
peak performance
vs. start-up overhead,
331
vector processor history, G-26 to G-27
Cycles, processor performance equation,
49
Cyclic redundancy check (CRC)
IBM Blue Gene/L 3D torus network, F-73
Cydrome Cydra 6, L-30, L-32