C

Cache bandwidth
caches, 78
multibanked caches, 85–86
nonblocking caches, 83–85
pipelined cache access, 82
Cache block
AMD Opteron data cache, B-13, B-13 to B-14
cache coherence protocol, 357–358
compiler optimizations, 89–90
critical word first, 86–87
definition, B-2
directory-based cache coherence protocol, 382–386, 383
false sharing, 366
GPU comparisons, 329
inclusion, 397–398
memory block, B-61
miss categories, B-26
miss rate reduction, B-26 to B-28
scientific workloads on symmetric shared-memory multiprocessors, I-22, I-25, I-25
shared-memory multiprogramming workload, 375–377, 376
way prediction, 81
write invalidate protocol implementation, 356–357
write strategy, B-10
Cache coherence
advanced directory protocol case study, 420–426
basic considerations, 112–113
Cray X1, G-22
directory-based See Directory-based cache coherence
enforcement, 354–355
extensions, 362–363
hardware primitives, 388
Intel SCC, F-70
large-scale multiprocessor history, L-61
large-scale multiprocessors
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM implementation, I-36 to I-37
overview, I-34 to I-36
latency hiding with speculation, 396
lock implementation, 389–391
mechanism, 358
memory hierarchy basics, 75
multiprocessor-optimized software, 409
multiprocessors, 352–353
protocol definitions, 354–355
single-chip multicore processor case study, 412–418
single memory location example, 352
state diagram, 361
steps and bus traffic examples, 391
write-back cache, 360
Cache definition, B-2
Cache hit
AMD Opteron example, B-14
definition, B-2
example calculation, B-5
Cache latency, nonblocking cache, 83–84
Cache miss
and average memory access time, B-17 to B-20
block replacement, B-10
definition, B-2
distributed-memory multiprocessors, I-32
example calculations, 83–84
Intel Core i7, 122
interconnection network, F-87
large-scale multiprocessors, I-34 to I-35
nonblocking cache, 84
single vs. multiple thread executions, 228
WCET, E-4
Cache-only memory architecture (COMA), L-61
Cache optimizations
basic categories, B-22
basic optimizations, B-40
case studies, 131–133
compiler-controlled prefetching, 92–95
compiler optimizations, 87–90
critical word first, 86–87
energy consumption, 81
hardware instruction prefetching, 91–92, 92
hit time reduction, B-36 to B-40
miss categories, B-23 to B-26
miss penalty reduction
via multilevel caches, B-30 to B-35
read misses vs. writes, B-35 to B-36
miss rate reduction
via associativity, B-28 to B-30
via block size, B-26 to B-28
via cache size, B-28
multibanked caches, 85–86, 86
nonblocking caches, 83–85, 84
overview, 78–79
pipelined cache access, 82
simple first-level caches, 79–80
techniques overview, 96
way prediction, 81–82
write buffer merging, 87, 88
Cache organization
blocks, B-7, B-8
Opteron data cache, B-12 to B-13, B-13
optimization, B-19
performance impact, B-19
Cache performance
average memory access time, B-16 to B-20
basic considerations, B-3 to B-6, B-16
basic equations, B-22
basic optimizations, B-40
cache optimization, 96
case study, 131–133
example calculation, B-16 to B-17
out-of-order processors, B-20 to B-22
prediction, 125–126
Cache prefetch, cache optimization, 92
Caches See also Memory hierarchy
access time vs. block size, B-28
AMD Opteron example, B-12 to B-15, B-13, B-15
basic considerations, B-48 to B-49
coining of term, L-11
definition, B-2
early work, L-10
embedded systems, E-4 to E-5
Fermi GPU architecture, 306
ideal processor, 214
ILP for realizable processors, 216–218
Itanium 2, H-42
multichip multicore multiprocessor, 419
parameter ranges, B-42
Sony PlayStation 2 Emotion Engine, E-18
vector processors, G-25
vs. virtual memory, B-42 to B-43
Cache size
and access time, 77
AMD Opteron example, B-13 to B-14
energy consumption, 81
highly parallel memory systems, 133
memory hierarchy basics, 76
misses per instruction, 126, 371
miss rate, B-24 to B-25
vs. miss rate, B-27
miss rate reduction, B-28
multilevel caches, B-33
and relative execution time, B-34
scientific workloads
distributed-memory multiprocessors, I-29 to I-31
symmetric shared-memory multiprocessors, I-22 to I-23, I-24
shared-memory multiprogramming workload, 376
virtually addressed, B-37
CACTI
cache optimization, 79–80, 81
memory access times, 77
Caller saving, control flow instructions, A-19 to A-20
Call gate
IA-32 segment descriptors, B-53
segmented virtual memory, B-54
Calls
compiler structure, A-25 to A-26
control flow instructions, A-17, A-19 to A-21
CUDA Thread, 297
dependence analysis, 321
high-level instruction set, A-42 to A-43
Intel 80x86 integer operations, K-51
invocation options, A-19
ISAs, 14
MIPS control flow instructions, A-38
MIPS registers, 12
multiprogrammed workload, 378
NVIDIA GPU Memory structures, 304–305
return address predictors, 206
shared-memory multiprocessor workload, 369
user-to-OS gates, B-54
VAX, K-71 to K-72
Canceling branch, branch delay slots, C-24 to C-25
Canonical form, AMD64 paged virtual memory, B-55
Capabilities, protection schemes, L-9 to L-10
Capacity misses
blocking, 89–90
and cache size, B-24
definition, B-23
memory hierarchy basics, 75
scientific workloads on symmetric shared-memory multiprocessors, I-22, I-23, I-24
shared-memory workload, 373
Capital expenditures (CAPEX)
WSC costs, 452–455, 453
WSC Flash memory, 475
WSC TCO case study, 476–478
Carrier sensing, shared-media networks, F-23
Carrier signal, wireless networks, E-21
Carry condition code, MIPS core, K-9 to K-16
Carry-in, carry-skip adder, J-42
Carry-lookahead adder (CLA)
chip comparison, J-60
early computer arithmetic, J-63
example, J-38
integer addition speedup, J-37 to J-41
with ripple-carry adder, J-42
tree, J-40 to J-41
Carry-out
carry-lookahead circuit, J-38
floating-point addition speedup, J-25
Carry-propagate adder (CPA)
integer multiplication, J-48, J-51
multipass array multiplier, J-51
Carry-save adder (CSA)
integer division, J-54 to J-55
integer multiplication, J-47 to J-48, J-48
Carry-select adder
characteristics, J-43 to J-44
chip comparison, J-60
example, J-43
Carry-skip adder (CSA)
characteristics, J-41 to J-43
example, J-42, J-44
Case statements
control flow instruction addressing modes, A-18
return address predictors, 206
Case studies
advanced directory protocol, 420–426
cache optimization, 131–133
cell phones
block diagram, E-23
Nokia circuit board, E-24
overview, E-20
radio receiver, E-23
standards and evolution, E-25
wireless communication challenges, E-21
wireless networks, E-21 to E-22
chip fabrication cost, 61–62
computer system power consumption, 63–64
directory-based coherence, 418–420
dirty bits, D-61 to D-64
disk array deconstruction, D-51 to D-55, D-52 to D-55
disk deconstruction, D-48 to D-51, D-50
highly parallel memory systems, 133–136
instruction set principles, A-47 to A-54
I/O subsystem design, D-59 to D-61
memory hierarchy, B-60 to B-67
microarchitectural techniques, 247–254
pipelining example, C-82 to C-88
RAID performance prediction, D-57 to D-59
RAID reconstruction, D-55 to D-57
Sanyo VPC-SX500 digital camera, E-19
single-chip multicore processor, 412–418
Sony PlayStation 2 Emotion Engine, E-15 to E-18
sorting, D-64 to D-67
vector kernel on vector processor and GPU, 334–336
WSC resource allocation, 478–479
WSC TCO, 476–478
C/C++ language
dependence analysis, H-6
GPU computing history, L-52
hardware impact on software development, 4
integer division/remainder, J-12
loop-level parallelism dependences, 318, 320–321
NVIDIA GPU programming, 289
return address predictors, 206
CDF, datacenter, 487
Cedar project, L-60
Cell, Barnes-Hut n-body algorithm, I-9
Cell phones
block diagram, E-23
embedded system case study
characteristics, E-22 to E-24
overview, E-20
radio receiver, E-23
standards and evolution, E-25
wireless network overview, E-21 to E-22
Flash memory, D-3
GPU features, 324
Nokia circuit board, E-24
wireless communication challenges, E-21
wireless networks, E-22
Centralized shared-memory multiprocessors
basic considerations, 351–352
basic structure, 346–347, 347
cache coherence, 352–353
cache coherence enforcement, 354–355
cache coherence example, 357–362
cache coherence extensions, 362–363
invalidate protocol implementation, 356–357
SMP and snooping limitations, 363–364
snooping coherence implementation, 365–366
snooping coherence protocols, 355–356
Centralized switched networks
example, F-31
routing algorithms, F-48
topology, F-30 to F-34, F-31
Centrally buffered switch, microarchitecture, F-57
Central processing unit (CPU)
Amdahl’s law, 48
average memory access time, B-17
cache performance, B-4
coarse-grained multithreading, 224
early pipelined versions, L-26 to L-27
exception stopping/restarting, C-47
extensive pipelining, C-81
Google server usage, 440
GPU computing history, L-52
vs. GPUs, 288
instruction set complications, C-50
MIPS implementation, C-33 to C-34
MIPS precise exceptions, C-59 to C-60
MIPS scoreboarding, C-77
performance measurement history, L-6
pipeline branch issues, C-41
pipelining exceptions, C-43 to C-46
pipelining performance, C-10
Sony PlayStation 2 Emotion Engine, E-17
SPEC server benchmarks, 40
TI TMS320C55 DSP, E-8
vector memory systems, G-10
Central processing unit (CPU) time
execution time, 36
modeling, B-18
processor performance calculations, B-19 to B-21
processor performance equation, 49–51
processor performance time, 49
Cerf, Vint, F-97
Chaining
convoys, DAXPY code, G-16
vector processor performance, G-11 to G-12, G-12
VMIPS, 268–269
Channel adapter See Network interface
Channels, cell phones, E-24
Character
floating-point performance, A-2
as operand type, A-13 to A-14
operand types/sizes, 12
Charge-coupled device (CCD), Sanyo VPC-SX500 digital camera, E-19
Checksum
dirty bits, D-61 to D-64
packet format, F-7
Chillers
Google WSC, 466, 468
WSC containers, 464
WSC cooling systems, 448–449
Chime
definition, 309
GPUs vs. vector architectures, 308
multiple lanes, 272
NVIDIA GPU computational structures, 296
vector chaining, G-12
vector execution time, 269, G-4
vector performance, G-2
vector sequence calculations, 270
Chip-crossing wire delay, F-70
OCN history, F-103
Chipkill
memory dependability, 104–105
WSCs, 473
Choke packets, congestion management, F-65
Chunk
disk array deconstruction, D-51
Shear algorithm, D-53
Circuit switching
congestion management, F-64 to F-65
interconnected networks, F-50
Circulating water system (CWS)
cooling system design, 448
WSCs, 448
Clean block, definition, B-11
Climate Savers Computing Initiative, power supply efficiencies, 462
Clock cycles
basic MIPS pipeline, C-34 to C-35
and branch penalties, 205
cache performance, B-4
FP pipeline, C-66
and full associativity, B-23
GPU conditional branching, 303
ILP exploitation, 197, 200
ILP exposure, 157
instruction fetch bandwidth, 202–203
instruction steps, 173–175
Intel Core i7 branch predictor, 166
MIPS exceptions, C-48
MIPS pipeline, C-52
MIPS pipeline FP operations, C-52 to C-53
MIPS scoreboarding, C-77
miss rate calculations, B-31 to B-32
multithreading approaches, 225–226
pipelining performance, C-10
processor performance equation, 49
RISC classic pipeline, C-7
Sun T1 multithreading, 226–227
switch microarchitecture pipelining, F-61
vector architectures, G-4
vector execution time, 269
vector multiple lanes, 271–273
VLIW processors, 195
Clock cycles per instruction (CPI)
addressing modes, A-10
ARM Cortex-A8, 235
branch schemes, C-25 to C-26, C-26
cache behavior impact, B-18 to B-19
cache hit calculation, B-5
data hazards requiring stalls, C-20
extensive pipelining, C-81
floating-point calculations, 50–52
ILP concepts, 148–149, 149
ILP exploitation, 192
Intel Core i7, 124, 240, 240–241
microprocessor advances, L-33
MIPS R4000 performance, C-69
miss penalty reduction, B-32
multiprocessing/multithreading-based performance, 398–400
multiprocessor communication calculations, 350
pipeline branch issues, C-41
pipeline with stalls, C-12 to C-13
pipeline structural hazards, C-15 to C-16
pipelining concept, C-3
processor performance calculations, 218–219
processor performance time, 49–51
and processor speed, 244
RISC history, L-21
shared-memory workloads, 369–370
simple MIPS implementation, C-33 to C-34
structural hazards, C-13
Sun T1 multithreading unicore performance, 229
Sun T1 processor, 399
Tomasulo’s algorithm, 181
VAX 8700 vs. MIPS M2000, K-82
Clock cycle time
and associativity, B-29
average memory access time, B-21 to B-22
cache optimization, B-19 to B-20, B-30
cache performance, B-4
CPU time equation, 49–50, B-18
MIPS implementation, C-34
miss penalties, 219
pipeline performance, C-12, C-14 to C-15
pipelining, C-3
shared- vs. switched-media networks, F-25
Clock periods, processor performance equation, 48–49
Clock rate
DDR DRAMS and DIMMS, 101
ILP for realizable processors, 218
Intel Core i7, 236–237
microprocessor advances, L-33
microprocessors, 24
MIPS pipeline FP operations, C-53
multicore processor performance, 400
and processor speed, 244
Clocks, processor performance equation, 48–49
Clock skew, pipelining performance, C-10
Clock ticks
cache coherence, 391
processor performance equation, 48–49
Clos network
Beneš topology, F-33
as nonblocking, F-33
Cloud computing
basic considerations, 455–461
clusters, 345
provider issues, 471–472
utility computing history, L-73 to L-74
Clusters
characteristics, 8, I-45
cloud computing, 345
as computer class, 5
containers, L-74 to L-75
Cray X1, G-22
Google WSC servers, 469
historical background, L-62 to L-64
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interconnection network domains, F-3 to F-4
Internet Archive Cluster See Internet Archive Cluster
large-scale multiprocessors, I-6
large-scale multiprocessor trends, L-62 to L-63
outage/anomaly statistics, 435
power consumption, F-85
utility computing, L-73 to L-74
as WSC forerunners, 435–436, L-72 to L-73
WSC storage, 442–443
Cm*, L-56
C.mmp, L-56
CMOS
DRAM, 99
first vector computers, L-46, L-48
ripple-carry adder, J-3
vector processors, G-25 to G-27
Coarse-grained multithreading, definition, 224–226
Cocke, John, L-19, L-28
Code division multiple access (CDMA), cell phones, E-25
Code generation
compiler structure, A-25 to A-26, A-30
dependences, 220
general-purpose register computers, A-6
ILP limitation studies, 220
loop unrolling/scheduling, 162
Code scheduling
example, H-16
parallelism, H-15 to H-23
superblock scheduling, H-21 to H-23, H-22
trace scheduling, H-19 to H-21, H-20
Code size
architect-compiler considerations, A-30
benchmark information, A-2
comparisons, A-44
flawless architecture design, A-45
instruction set encoding, A-22 to A-23
ISA and compiler technology, A-43 to A-44
loop unrolling, 160–161
multiprogramming, 375–376
PMDs, 6
RISCs, A-23 to A-24
VAX design, A-45
VLIW model, 195–196
Coefficient of variance, D-27
Coerced exceptions
definition, C-45
exception types, C-46
Coherence See Cache coherence
Coherence misses
definition, 366
multiprogramming, 376–377
role, 367
scientific workloads on symmetric shared-memory multiprocessors, I-22
snooping protocols, 355–356
Cold-start misses, definition, B-23
Collision, shared-media networks, F-23
Collision detection, shared-media networks, F-23
Collision misses, definition, B-23
Collocation sites, interconnection networks, F-85
COLOSSUS, L-4
Column access strobe (CAS), DRAM, 98–99
Column major order
blocking, 89
stride, 278
Combining tree, large-scale multiprocessor synchronization, I-18
Command queue depth, vs. disk throughput, D-4
Commercial interconnection networks
congestion management, F-64 to F-66
connectivity, F-62 to F-63
cross-company interoperability, F-63 to F-64
DECstation 5000 reboots, F-69
fault tolerance, F-66 to F-69
Commercial workloads
execution time distribution, 369
symmetric shared-memory multiprocessors, 367–374
Commit stage, ROB instruction, 186–187, 188
Commodities
Amazon Web Services, 456–457
array switch, 443
cloud computing, 455
cost vs. price, 32–33
cost trends, 27–28, 32
Ethernet rack switch, 442
HPC hardware, 436
shared-memory multiprocessor, 441
WSCs, 441
Commodity cluster, characteristics, I-45
Common data bus (CDB)
dynamic scheduling with Tomasulo’s algorithm, 172, 175
FP unit with Tomasulo’s algorithm, 185
reservation stations/register tags, 177
Tomasulo’s algorithm, 180, 182
Common Internet File System (CIFS), D-35
NetApp FAS6000 filer, D-41 to D-42
Communication bandwidth, basic considerations, I-3
Communication latency, basic considerations, I-3 to I-4
Communication latency hiding, basic considerations, I-4
Communication mechanism
adaptive routing, F-93 to F-94
internetworking, F-81 to F-82
large-scale multiprocessors
advantages, I-4 to I-6
metrics, I-3 to I-4
multiprocessor communication calculations, 350
network interfaces, F-7 to F-8
NEWS communication, F-42 to F-43
SMP limitations, 363
Communication protocol, definition, F-8
Communication subnets See Interconnection networks
Communication subsystems See Interconnection networks
Compare instruction, VAX, K-71
Compares, MIPS core, K-9 to K-16
Compare-select-store unit (CSSU), TI TMS320C55 DSP, E-8
Compiler-controlled prefetching, miss penalty/rate reduction, 92–95
Compiler optimizations
blocking, 89–90
cache optimization, 131–133
compiler assumptions, A-25 to A-26
and consistency model, 396
loop interchange, 88–89
miss rate reduction, 87–90
passes, A-25
performance impact, A-27
types and classes, A-28
Compiler scheduling
data dependences, 151
definition, C-71
hardware support, L-30 to L-31
IBM 360 architecture, 171
Compiler speculation, hardware support
memory references, H-32
overview, H-27
preserving exception behavior, H-28 to H-32
Compiler techniques
dependence analysis, H-7
global code scheduling, H-17 to H-18
ILP exposure, 156–162
vectorization, G-14
vector sparse matrices, G-12
Compiler technology
and architecture decisions, A-27 to A-29
Cray X1, G-21 to G-22
ISA and code size, A-43 to A-44
multimedia instruction support, A-31 to A-32
register allocation, A-26 to A-27
structure, A-24 to A-26, A-25
Compiler writer-architect relationship, A-29 to A-30
Complex Instruction Set Computer (CISC)
RISC history, L-22
VAX as, K-65
Compulsory misses
and cache size, B-24
definition, B-23
memory hierarchy basics, 75
shared-memory workload, 373
Computation-to-communication ratios
parallel programs, I-10 to I-12
scaling, I-11
Compute-optimized processors, interconnection networks, F-88
Computer aided design (CAD) tools, cache optimization, 79–80
Computer architecture See also Architecture
coining of term, K-83 to K-84
computer design innovations, 4
defining, 11
definition, L-17 to L-18
exceptions, C-44
factors in improvement, 2
flawless design, K-81
flaws and success, K-81
floating-point addition, rules, J-24
goals/functions requirements, 15, 15–16, 16
high-level language, L-18 to L-19
instruction execution issues, K-81
ISA, 11–15
multiprocessor software development, 407–409
parallel, 9–10
WSC basics, 432, 441–442
array switch, 443
memory hierarchy, 443–446
storage, 442–443
Computer arithmetic
chip comparison, J-58, J-58 to J-61, J-59 to J-60
floating point
exceptions, J-34 to J-35
fused multiply-add, J-32 to J-33
IEEE 754, J-16
iterative division, J-27 to J-31
and memory bandwidth, J-62
overview, J-13 to J-14
precisions, J-33 to J-34
remainder, J-31 to J-32
special values, J-16
special values and denormals, J-14 to J-15
underflow, J-36 to J-37, J-62
floating-point addition
denormals, J-26 to J-27
overview, J-21 to J-25
speedup, J-25 to J-26
floating-point multiplication
denormals, J-20 to J-21
examples, J-19
overview, J-17 to J-20
rounding, J-18
integer addition speedup
carry-lookahead, J-37 to J-41
carry-lookahead circuit, J-38
carry-lookahead tree, J-40
carry-lookahead tree adder, J-41
carry-select adder, J-43, J-43 to J-44, J-44
carry-skip adder, J-41 to J-43, J-42
overview, J-37
integer arithmetic
language comparison, J-12
overflow, J-11
Radix-2 multiplication/division, J-4, J-4 to J-7
restoring/nonrestoring division, J-6
ripple-carry addition, J-2 to J-3, J-3
signed numbers, J-7 to J-10
systems issues, J-10 to J-13
integer division
radix-2 division, J-55
radix-4 division, J-56
radix-4 SRT division, J-57
with single adder, J-54 to J-58
SRT division, J-45 to J-47, J-46
integer-FP conversions, J-62
integer multiplication
array multiplier, J-50
Booth recoding, J-49
even/odd array, J-52
with many adders, J-50 to J-54
multipass array multiplier, J-51
signed-digit addition table, J-54
with single adder, J-47 to J-49, J-48
Wallace tree, J-53
integer multiplication/division, shifting over zeros, J-45 to J-47
overview, J-2
rounding modes, J-20
Computer chip fabrication
cost case study, 61–62
Cray X1E, G-24
Computer classes
desktops, 6
embedded computers, 8–9
example, 5
overview, 5
parallelism and parallel architectures, 9–10
PMDs, 6
servers, 7
and system characteristics, E-4
warehouse-scale computers, 8
Computer design principles
Amdahl’s law, 46–48
common case, 45–46
parallelism, 44–45
principle of locality, 45
processor performance equation, 48–52
Computer history, technology and architecture, 2–5
Computer room air-conditioning (CRAC), WSC infrastructure, 448–449
Compute tiles, OCNs, F-3
Compute Unified Device Architecture See CUDA (Compute Unified Device Architecture)
Conditional branches
branch folding, 206
compare frequencies, A-20
compiler performance, C-24 to C-25
control flow instructions, 14, A-16, A-17, A-19, A-21
desktop RISCs, K-17
embedded RISCs, K-17
evaluation, A-19
global code scheduling, H-16, H-16
GPUs, 300–303
ideal processor, 214
ISAs, A-46
MIPS control flow instructions, A-38, A-40
MIPS core, K-9 to K-16
PA-RISC instructions, K-34, K-34
predictor misprediction rates, 166
PTX instruction set, 298–299
static branch prediction, C-26
types, A-20
vector-GPU comparison, 311
Conditional instructions
exposing parallelism, H-23 to H-27
limitations, H-26 to H-27
Condition codes
branch conditions, A-19
control flow instructions, 14
definition, C-5
high-level instruction set, A-43
instruction set complications, C-50
MIPS core, K-9 to K-16
pipeline branch penalties, C-23
VAX, K-71
Conflict misses
and block size, B-28
cache coherence mechanism, 358
and cache size, B-24, B-26
definition, B-23
as kernel miss, 376
L3 caches, 371
memory hierarchy basics, 75
OLTP workload, 370
PIDs, B-37
shared-memory workload, 373
Congestion control
commercial interconnection networks, F-64
system area network history, F-101
Congestion management, commercial interconnection networks, F-64 to F-66
Connectedness
dimension-order routing, F-47 to F-48
interconnection network topology, F-29
Connection delay, multi-device interconnection networks, F-25
Connection Machine CM-5, F-91, F-100
Connection Multiprocessor 2, L-44, L-57
Consistency See Memory consistency
Constant extension
desktop RISCs, K-9
embedded RISCs, K-9
Constellation, characteristics, I-45
Containers
airflow, 466
cluster history, L-74 to L-75
Google WSCs, 464–465, 465
Context Switching
definition, 106, B-49
Fermi GPU, 307
Control bits, messages, F-6
Control Data Corporation (CDC), first vector computers, L-44 to L-45
Control Data Corporation (CDC) 6600
computer architecture definition, L-18
dynamically scheduling with scoreboard, C-71 to C-72
early computer arithmetic, J-64
first dynamic scheduling, L-27
MIPS scoreboarding, C-75, C-77
multiple-issue processor development, L-28
multithreading history, L-34
RISC history, L-19
Control Data Corporation (CDC) STAR-100
first vector computers, L-44
peak performance vs. start-up overhead, 331
Control Data Corporation (CDC) STAR processor, G-26
Control dependences
conditional instructions, H-24
as data dependence, 150
global code scheduling, H-16
hardware-based speculation, 183
ILP, 154–156
ILP hardware model, 214
and Tomasulo’s algorithm, 170
vector mask registers, 275–276
Control flow instructions
addressing modes, A-17 to A-18
basic considerations, A-16 to A-17, A-20 to A-21
classes, A-17
conditional branch options, A-19
conditional instructions, H-27
hardware vs. software speculation, 221
Intel 80x86 integer operations, K-51
ISAs, 14
procedure invocation options, A-19 to A-20
Control hazards
ARM Cortex-A8, 235
definition, C-11
Control instructions
Intel 80x86, K-53
RISCs
desktop systems, K-12, K-22
embedded systems, K-16
VAX, B-73
Controllers, historical background, L-80 to L-81
Controller transitions
directory-based, 422
snooping cache, 421
Control Processor
definition, 309
GPUs, 333
SIMD, 10
Thread Block Scheduler, 294
vector processor, 310, 310–311
vector unit structure, 273
Conventional datacenters, vs. WSCs, 436
Convex Exemplar, L-61
Convex processors, vector processor history, G-26
Convolution, DSP, E-5
Convoy
chained, DAXPY code, G-16
DAXPY on VMIPS, G-20
strip-mined loop, G-5
vector execution time, 269–270
vector starting times, G-4
Conway, Lynn, L-28
Cooling systems
Google WSC, 465–468
mechanical design, 448
WSC infrastructure, 448–449
Copper wiring
Ethernet, F-78
interconnection networks, F-9
“Coprocessor operations,” MIPS core extensions, K-21
Copy propagation, definition, H-10 to H-11
Core definition, 15
Core plus ASIC, embedded systems, E-3
Correlating branch predictors, branch costs, 162–163
Cosmic Cube, F-100, L-60
Cost
Amazon EC2, 458
Amazon Web Services, 457
bisection bandwidth, F-89
branch predictors, 162–167, C-26
chip fabrication case study, 61–62
cloud computing providers, 471–472
disk storage, D-2
DRAM/magnetic disk, D-3
interconnecting node calculations, F-31 to F-32, F-35
Internet Archive Cluster, D-38 to D-40
internetworking, F-80
I/O system design/evaluation, D-36
magnetic storage history, L-78
MapReduce calculations, 458–459, 459
memory hierarchy design, 72
MINs vs. direct networks, F-92
multiprocessor cost relationship, 409
multiprocessor linear speedup, 407
network topology, F-40
PMDs, 6
server calculations, 454, 454–455
server usage, 7
SIMD supercomputer development, L-43
speculation, 210
torus topology interconnections, F-36 to F-38
tournament predictors, 164–166
WSC array switch, 443
WSC vs. datacenters, 455–456
WSC efficiency, 450–452
WSC facilities, 472
WSC network bottleneck, 461
WSCs vs. servers, 434
WSC TCO case study, 476–478
Cost associativity, cloud computing, 460–461
Cost-performance
commercial interconnection networks, F-63
computer trends, 3
extensive pipelining, C-80 to C-81
IBM eServer p5 processor, 409
sorting case study, D-64 to D-67
WSC Flash memory, 474–475
WSC goals/requirements, 433
WSC hardware inactivity, 474
WSC processors, 472–473
Cost trends
integrated circuits, 28–32
manufacturing vs. operation, 33
overview, 27
vs. price, 32–33
time, volume, commoditization, 27–28
Count register, PowerPC instructions, K-32 to K-33
CP-67 program, L-10
Cray, Seymour, G-25, G-27, L-44, L-47
Cray-1
first vector computers, L-44 to L-45
peak performance vs. start-up overhead, 331
pipeline depths, G-4
RISC history, L-19
vector performance, 332
vector performance measures, G-16
as VMIPS basis, 264, 270–271, 276–277
Cray-2
DRAM, G-25
first vector computers, L-47
tailgating, G-20
Cray-3, G-27
Cray-4, G-27
Cray C90
first vector computers, L-46, L-48
vector performance calculations, G-8
Cray J90, L-48
Cray Research T3D, F-86 to F-87, F-87
Cray supercomputers, early computer arithmetic, J-63 to J-64
Cray T3D, F-100, L-60
Cray T3E, F-67, F-94, F-100, L-48, L-60
Cray T90, memory bank calculations, 276
Cray X1
cluster history, L-63
first vector computers, L-46, L-48
MSP module, G-22, G-23 to G-24
overview, G-21 to G-23
peak performance, 58
Cray X1E, F-86, F-91
characteristics, G-24
Cray X2, L-46 to L-47
first vector computers, L-48 to L-49
Cray X-MP, L-45
first vector computers, L-47
Cray XT3, L-58, L-63
Cray XT3 SeaStar, F-63
Cray Y-MP
first vector computers, L-45 to L-47
parallel processing debates, L-57
vector architecture programming, 281, 281–282
Create vector index instruction (CVI), sparse matrices, G-13
Credit-based control flow
InfiniBand, F-74
interconnection networks, F-10, F-17
CRISP, L-27
Critical path
global code scheduling, H-16
trace scheduling, H-19 to H-21, H-20
Critical word first, cache optimization, 86–87
Crossbars
centralized switched networks, F-30, F-31
characteristics, F-73
Convex Exemplar, L-61
HOL blocking, F-59
OCN history, F-104
switch microarchitecture, F-62
switch microarchitecture pipelining, F-60 to F-61, F-61
VMIPS, 265
Crossbar switch
centralized switched networks, F-30
interconnecting node calculations, F-31 to F-32
Cross-company interoperability, commercial interconnection networks, F-63 to F-64
Crusoe, L-31
Cryptanalysis, L-4
C# language, hardware impact on software development, 4
CUDA (Compute Unified Device Architecture)
GPU computing history, L-52
GPU conditional branching, 303
GPUs vs. vector architectures, 310
NVIDIA GPU programming, 289
PTX, 298, 300
sample program, 289–290
SIMD instructions, 297
terminology, 313–315
CUDA Thread
CUDA programming model, 300, 315
definition, 292, 313
definitions and terms, 314
GPU data addresses, 310
GPU Memory structures, 304
NVIDIA parallelism, 289–290
vs. POSIX Threads, 297
PTX Instructions, 298
SIMD Instructions, 303
Thread Block, 313
Current frame pointer (CFM), IA-64 register model, H-33 to H-34
Custom cluster
characteristics, I-45
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
Cut-through packet switching, F-51
routing comparison, F-54
CYBER 180/990, precise exceptions, C-59
CYBER 205
peak performance vs. start-up overhead, 331
vector processor history, G-26 to G-27
CYBER 250, L-45
Cycles, processor performance equation, 49
Cycle time See also Clock cycle time
CPI calculations, 350
pipelining, C-81
scoreboarding, C-79
vector processors, 277
Cyclic redundancy check (CRC)
IBM Blue Gene/L 3D torus network, F-73
network interface, F-8
Cydrome Cydra 6, L-30, L-32