S

Sandy Bridge dies, wafer example, 31
Sanyo digital cameras, SOC, E-20
Sanyo VPC-SX500 digital camera, embedded system case study, E-19
SASI, L-81
SATA (Serial Advanced Technology Attachment) disks
Google WSC servers, 469
NetApp FAS6000 filer, D-42
power consumption, D-5
RAID 6, D-8
vs. SAS drives, D-5
storage area network history, F-103
Saturating arithmetic, DSP media extensions, E-11
Saturating operations, definition, K-18 to K-19
SAXPY, GPU raw/relative performance, 328
Scalability
cloud computing, 460
coherence issues, 378–379
Fermi GPU, 295
Java benchmarks, 402
multicore processors, 400
multiprocessing, 344, 395
parallelism, 44
as server characteristic, 7
transistor performance and wires, 19–21
WSCs, 8, 438
WSCs vs. servers, 434
Scalable GPUs, historical background, L-50 to L-51
Scalar expansion, loop-level parallelism dependences, 321
Scalar Processors See also Superscalar processors
definition, 292, 309
early pipelined CPUs, L-26 to L-27
lane considerations, 273
Multimedia SIMD/GPU comparisons, 312
NVIDIA GPU, 291
prefetch units, 277
vs. vector, 311, G-19
vector performance, 331–332
Scalar registers
Cray X1, G-21 to G-22
GPUs vs. vector architectures, 311
loop-level parallelism dependences, 321–322
Multimedia SIMD vs. GPUs, 312
sample renaming code, 251
vector vs. GPU, 311
vs. vector performance, 331–332
VMIPS, 265–266
Scaled addressing, VAX, K-67
Scaled speedup, Amdahl’s law and parallel computers, 406–407
Scaling
Amdahl’s law and parallel computers, 406–407
cloud computing, 456
computation-to-communication ratios, I-11
DVFS, 25, 52, 467
dynamic voltage-frequency, 25, 52, 467
Intel Core i7, 404
interconnection network speed, F-88
multicore vs. single-core, 402
processor performance trends, 3
scientific applications on parallel processing, I-34
shared- vs. switched-media networks, F-25
transistor performance and wires, 19–21
VMIPS, 267
Scan Line Interleave (SLI), scalable GPUs, L-51
Schorr, Herb, L-28
Scientific applications
Barnes, I-8 to I-9
basic characteristics, I-6 to I-7
cluster history, L-62
distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
FFT kernel, I-7
LU kernel, I-8
Ocean, I-9 to I-10
parallel processors, I-33 to I-34
parallel program computation/communication, I-10 to I-12, I-11
parallel programming, I-2
symmetric shared-memory multiprocessors, I-21 to I-26, I-23 to I-25
Scoreboarding
ARM Cortex-A8, 233, 234
components, C-76
definition, 170
dynamic scheduling, 171, 175
and dynamic scheduling, C-71 to C-80
example calculations, C-77
MIPS structure, C-73
NVIDIA GPU, 296
results tables, C-78 to C-79
SIMD thread scheduler, 296
Scripting languages, software development impact, 4
SCSI (Small Computer System Interface)
Berkeley’s Tertiary Disk project, D-12
dependability benchmarks, D-21
disk storage, D-4
historical background, L-80 to L-81
I/O subsystem design, D-59
RAID reconstruction, D-56
storage area network history, F-102
SDRWAVE, J-62
Second-level caches See also L2 caches
ARM Cortex-A8, 114
ILP, 245
Intel Core i7, 121
interconnection network, F-87
Itanium 2, H-41
memory hierarchy, B-48 to B-49
miss penalty calculations, B-33 to B-34
miss penalty reduction, B-30 to B-35
miss rate calculations, B-31 to B-35
and relative execution time, B-34
speculation, 210
SRAM, 99
Secure Virtual Machine (SVM), 129
Seek distance
storage disks, D-46
system comparison, D-47
Seek time, storage disks, D-46
Segment basics
Intel 80x86, K-50
vs. page, B-43
virtual memory definition, B-42 to B-43
Segment descriptor, IA-32 processor, B-52, B-53
Segmented virtual memory
bounds checking, B-52
Intel Pentium protection, B-51 to B-54
memory mapping, B-52
vs. paged, B-43
safe calls, B-54
sharing and protection, B-52 to B-53
Self-correction, Newton’s algorithm, J-28 to J-29
Self-draining pipelines, L-29
Self-routing, MINs, F-48
Semantic clash, high-level instruction set, A-41
Semantic gap, high-level instruction set, A-39
Semiconductors
DRAM technology, 17
Flash memory, 18
GPU vs. MIMD, 325
manufacturing, 3–4
Sending overhead
communication latency, I-3 to I-4
OCNs vs. SANs, F-27
time of flight, F-14
Sense-reversing barrier
code example, I-15, I-21
large-scale multiprocessor synchronization, I-14
Sequence of SIMD Lane Operations, definition, 292, 313
Sequence number, packet header, F-8
Sequential consistency
latency hiding with speculation, 396–397
programmer’s viewpoint, 394
relaxed consistency models, 394–395
requirements and implementation, 392–393
Sequential interleaving, multibanked caches, 86, 86
Sequent Symmetry, L-59
Serial Advanced Technology Attachment disks See SATA (Serial Advanced Technology Attachment) disks
Serial Attached SCSI (SAS) drive
historical background, L-81
power consumption, D-5
vs. SATA drives, D-5
Serialization
barrier synchronization, I-16
coherence enforcement, 354
directory-based cache coherence, 382
DSM multiprocessor cache coherence, I-37
hardware primitives, 387
multiprocessor cache coherency, 353
page tables, 408
snooping coherence protocols, 356
write invalidate protocol implementation, 356
Serpentine recording, L-77
Serve-longest-queue (SLQ) scheme, arbitration, F-49
ServerNet interconnection network, fault tolerance, F-66 to F-67
Servers See also Warehouse-scale computers (WSCs)
as computer class, 5
cost calculations, 454, 454–455
definition, D-24
energy savings, 25
Google WSC, 440, 467, 468–469
GPU features, 324
memory hierarchy design, 72
vs. mobile GPUs, 323–330
multiprocessor importance, 344
outage/anomaly statistics, 435
performance benchmarks, 40–41
power calculations, 463
power distribution example, 490
power-performance benchmarks, 54, 439–441
power-performance modes, 477
real-world examples, 52–55
RISC systems
addressing modes and instruction formats, K-5 to K-6
examples, K-3, K-4
instruction formats, K-7
multimedia extensions, K-16 to K-19
single-server model, D-25
system characteristics, E-4
workload demands, 439
WSC vs. datacenters, 455–456
WSC data transfer, 446
WSC energy efficiency, 462–464
vs. WSC facility costs, 472
WSC memory hierarchy, 444
WSC resource allocation case study, 478–479
vs. WSCs, 432–434
WSC TCO case study, 476–478
Server side Java operations per second (ssj_ops)
example calculations, 439
power-performance, 54
real-world considerations, 52–55
Server utilization
calculation, D-28 to D-29
queuing theory, D-25
Service accomplishment, SLAs, 34
Service Health Dashboard, AWS, 457
Service interruption, SLAs, 34
Service level agreements (SLAs)
Amazon Web Services, 457
and dependability, 33
WSC efficiency, 452
Service level objectives (SLOs)
and dependability, 33
WSC efficiency, 452
Session layer, definition, F-82
Set associativity
and access time, 77
address parts, B-9
AMD Opteron data cache, B-12 to B-14
ARM Cortex-A8, 114
block placement, B-7 to B-8
cache block, B-7
cache misses, 83–84, B-10
cache optimization, 79–80, B-33 to B-35, B-38 to B-40
commercial workload, 371
energy consumption, 81
memory access times, 77
memory hierarchy basics, 74, 76
nonblocking cache, 84
performance equations, B-22
pipelined cache access, 82
way prediction, 81
Set basics
block replacement, B-9 to B-10
definition, B-7
Set-on-less-than instructions (SLT)
MIPS16, K-14 to K-15
MIPS conditional branches, K-11 to K-12
Settle time, D-46
SFS benchmark, NFS, D-20
Shadow page table, Virtual Machines, 110
Sharding, WSC memory hierarchy, 445
Shared-media networks
effective bandwidth vs. nodes, F-28
example, F-22
latency and effective bandwidth, F-26 to F-28
multiple device connections, F-22 to F-24
vs. switched-media networks, F-24 to F-25
Shared Memory
definition, 292, 314
directory-based cache coherence, 418–420
invalidate protocols, 356–357
SMP/DSM definition, 348
terminology comparison, 315
Shared-memory communication, large-scale multiprocessors, I-5
Shared-memory multiprocessors
basic considerations, 351–352
basic structure, 346–347
cache coherence, 352–353
cache coherence enforcement, 354–355
cache coherence example, 357–362
cache coherence extensions, 362–363
data caching, 351–352
definition, L-63
historical background, L-60 to L-61
invalidate protocol implementation, 356–357
limitations, 363–364
performance, 366–378
single-chip multicore case study, 412–418
SMP and snooping limitations, 363–364
snooping coherence implementation, 365–366
snooping coherence protocols, 355–356
WSCs, 435, 441
Shared-memory synchronization, MIPS core extensions, K-21
Shared state
cache block, 357, 359
cache coherence, 360
cache miss calculations, 366–367
coherence extensions, 362
directory-based cache coherence protocol basics, 380, 385
private cache, 358
Sharing addition, segmented virtual memory, B-52 to B-53
Shear algorithms, disk array deconstruction, D-51 to D-52, D-52 to D-54
Shifting over zeros, integer multiplication/division, J-45 to J-47
Short-circuiting See Forwarding
SI format instructions, IBM 360, K-87
Signals, definition, E-2
Signal-to-noise ratio (SNR), wireless networks, E-21
Signed-digit representation
example, J-54
integer multiplication, J-53
Signed number arithmetic, J-7 to J-10
Sign-extended offset, RISC, C-4 to C-5
Significand, J-15
Sign magnitude, J-7
Silicon Graphics 4D/240, L-59
Silicon Graphics Altix, F-76, L-63
Silicon Graphics Challenge, L-60
Silicon Graphics Origin, L-61, L-63
Silicon Graphics systems (SGI)
economies of scale, 456
miss statistics, B-59
multiprocessor software development, 407–409
vector processor history, G-27
SIMD (Single Instruction Stream, Multiple Data Stream)
definition, 10
Fermi GPU architectural innovations, 305–308
GPU conditional branching, 301
GPU examples, 325
GPU programming, 289–290
GPUs vs. vector architectures, 308–309
historical overview, L-55 to L-56
loop-level parallelism, 150
MapReduce, 438
memory bandwidth, 332
multimedia extensions See Multimedia SIMD Extensions
multiprocessor architecture, 346
multithreaded See Multithreaded SIMD Processor
NVIDIA GPU computational structures, 291
NVIDIA GPU ISA, 300
power/DLP issues, 322
speedup via parallelism, 263
supercomputer development, L-43 to L-44
system area network history, F-100
Thread Block mapping, 293
TI 320C6x DSP, E-9
SIMD Instruction
CUDA Thread, 303
definition, 292, 313
DSP media extensions, E-10
function, 150, 291
GPU Memory structures, 304
GPUs, 300, 305
Grid mapping, 293
IBM Blue Gene/L, I-42
Intel AVX, 438
multimedia architecture programming, 285
multimedia extensions, 282–285, 312
multimedia instruction compilers, A-31 to A-32
Multithreaded SIMD Processor block diagram, 294
PTX, 301
Sony PlayStation 2, E-16
Thread of SIMD Instructions, 295–296
thread scheduling, 296–297, 297, 305
vector architectures as superset, 263–264
vector/GPU comparison, 308
Vector Registers, 309
SIMD Lane Registers, definition, 309, 314
SIMD Lanes
definition, 292, 296, 309
DLP, 322
Fermi GPU, 305, 307
GPU, 296–297, 300, 324
GPU conditional branching, 302–303
GPUs vs. vector architectures, 308, 310, 311
instruction scheduling, 297
multimedia extensions, 285
Multimedia SIMD vs. GPUs, 312, 315
multithreaded processor, 294
NVIDIA GPU Memory, 304
synchronization marker, 301
vector vs. GPU, 308, 311
SIMD Processors See also Multithreaded SIMD Processor
block diagram, 294
definition, 292, 309, 313–314
dependent computation elimination, 321
design, 333
Fermi GPU, 296, 305–308
Fermi GTX 480 GPU floorplan, 295, 295–296
GPU conditional branching, 302
GPU vs. MIMD, 329
GPU programming, 289–290
GPUs vs. vector architectures, 310, 310–311
Grid mapping, 293
Multimedia SIMD vs. GPU, 312
multiprocessor architecture, 346
NVIDIA GPU computational structures, 291
NVIDIA GPU Memory structures, 304–305
processor comparisons, 324
Roofline model, 287, 326
system area network history, F-100
SIMD Thread
GPU conditional branching, 301–302
Grid mapping, 293
Multithreaded SIMD processor, 294
NVIDIA GPU, 296
NVIDIA GPU ISA, 298
NVIDIA GPU Memory structures, 305
scheduling example, 297
vector vs. GPU, 308
vector processor, 310
SIMD Thread Scheduler
definition, 292, 314
example, 297
Fermi GPU, 295, 305–307, 306
GPU, 296
SIMT (Single Instruction, Multiple Thread)
GPU programming, 289
vs. SIMD, 314
Warp, 313
Simultaneous multithreading (SMT)
characteristics, 226
definition, 224–225
historical background, L-34 to L-35
IBM eServer p5 575, 399
ideal processors, 215
Intel Core i7, 117–118, 239–241
Java and PARSEC workloads, 403–404
multicore performance/energy efficiency, 402–405
multiprocessing/multithreading-based performance, 398–400
multithreading history, L-35
superscalar processors, 230–232
Single-extended precision floating-point arithmetic, J-33 to J-34
Single Instruction, Multiple Thread See SIMT (Single Instruction, Multiple Thread)
Single Instruction Stream, Multiple Data Stream See SIMD (Single Instruction Stream, Multiple Data Stream)
Single Instruction Stream, Single Data Stream See SISD (Single Instruction Stream, Single Data Stream)
Single-level cache hierarchy, miss rates vs. cache size, B-33
Single-precision floating point
arithmetic, J-33 to J-34
GPU examples, 325
GPU vs. MIMD, 328
MIPS data types, A-34
MIPS operations, A-36
Multimedia SIMD Extensions, 283
operand sizes/types, 12, A-13
as operand type, A-13 to A-14
representation, J-15 to J-16
Single-Streaming Processor (SSP)
Cray X1, G-21 to G-24
Cray X1E, G-24
Single-thread (ST) performance
IBM eServer p5 575, 399, 399
Intel Core i7, 239
ISA, 242
processor comparison, 243
SISD (Single Instruction Stream, Single Data Stream), 10
SIMD computer history, L-55
Skippy algorithm
disk deconstruction, D-49
sample results, D-50
Small Computer System Interface See SCSI (Small Computer System Interface)
Small form factor (SFF) disk, L-79
Smalltalk, SPARC instructions, K-30
Smart interface cards, vs. smart switches, F-85 to F-86
Smartphones
ARM Cortex-A8, 114
mobile vs. server GPUs, 323–324
Smart switches, vs. smart interface cards, F-85 to F-86
Snooping cache coherence
basic considerations, 355–356
controller transitions, 421
definition, 354–355
directory-based, 381, 386, 420–421
example, 357–362
implementation, 365–366
large-scale multiprocessor history, L-61
large-scale multiprocessors, I-34 to I-35
latencies, 414
limitations, 363–364
sample types, L-59
single-chip multicore processor case study, 412–418
symmetric shared-memory machines, 366
Soft errors, definition, 104
Soft real-time
definition, E-3
PMDs, 6
Software as a Service (SaaS)
clusters/WSCs, 8
software development, 4
WSCs, 438
WSCs vs. servers, 433–434
Software development
multiprocessor architecture issues, 407–409
performance vs. productivity, 4
WSC efficiency, 450–452
Software pipelining
example calculations, H-13 to H-14
loops, execution pattern, H-15
technique, H-12 to H-15, H-13
Software prefetching, cache optimization, 131–133
Software speculation
definition, 156
vs. hardware speculation, 221–222
VLIW, 196
Software technology
ILP approaches, 148
large-scale multiprocessors, I-6
large-scale multiprocessor synchronization, I-17 to I-18
network interfaces, F-7
vs. TCP/IP reliance, F-95
Virtual Machines protection, 108
WSC running service, 434–435
Solaris, RAID benchmarks, D-22, D-22 to D-23
Solid-state disks (SSDs)
processor performance/price/power, 52
server energy efficiency, 462
WSC cost-performance, 474–475
Sonic Smart Interconnect, OCNs, F-3
Sony PlayStation 2
block diagram, E-16
embedded multiprocessors, E-14
Emotion Engine case study, E-15 to E-18
Emotion Engine organization, E-18
Sorting, case study, D-64 to D-67
Sort primitive, GPU vs. MIMD, 329
Sort procedure, VAX
bubble sort, K-76
example code, K-77 to K-79
vs. MIPS32, K-80
register allocation, K-76
Source routing, basic concept, F-48
SPARCLE processor, L-34
Sparse matrices
loop-level parallelism dependences, 318–319
vector architectures, 279–280, G-12 to G-14
vector execution time, 271
vector mask registers, 275
Spatial locality
coining of term, L-11
definition, 45, B-2
memory hierarchy design, 72
SPEC benchmarks
branch predictor correlation, 162–164
desktop performance, 38–40
early performance measures, L-7
evolution, 39
fallacies, 56
operands, A-14
performance, 38
performance results reporting, 41
processor performance growth, 3
static branch prediction, C-26 to C-27
storage systems, D-20 to D-21
tournament predictors, 164
two-bit predictors, 165
vector processor history, G-28
SPEC89 benchmarks
branch-prediction buffers, C-28 to C-30, C-30
MIPS FP pipeline performance, C-61 to C-62
misprediction rates, 166
tournament predictors, 165–166
VAX 8700 vs. MIPS M2000, K-82
SPEC92 benchmarks
hardware vs. software speculation, 221
ILP hardware model, 215
MIPS R4000 performance, C-68 to C-69, C-69
misprediction rate, C-27
SPEC95 benchmarks
return address predictors, 206–207, 207
way prediction, 82
SPEC2000 benchmarks
ARM Cortex-A8 memory, 115–116
cache performance prediction, 125–126
cache size and misses per instruction, 126
compiler optimizations, A-29
compulsory miss rate, B-23
data reference sizes, A-44
hardware prefetching, 91
instruction misses, 127
SPEC2006 benchmarks, evolution, 39
SPECCPU2000 benchmarks
displacement addressing mode, A-12
Intel Core i7, 122
server benchmarks, 40
SPECCPU2006 benchmarks
branch predictors, 167
Intel Core i7, 123–124, 240, 240–241
ISA performance and efficiency prediction, 241
Virtual Machines protection, 108
SPECfp benchmarks
hardware prefetching, 91
interconnection network, F-87
ISA performance and efficiency prediction, 241–242
Itanium 2, H-43
MIPS FP pipeline performance, C-60 to C-61
nonblocking caches, 84
tournament predictors, 164
SPECfp92 benchmarks
Intel 80x86 vs. DLX, K-63
Intel 80x86 instruction lengths, K-60
Intel 80x86 instruction mix, K-61
Intel 80x86 operand type distribution, K-59
nonblocking cache, 83
SPECfp2000 benchmarks
hardware prefetching, 92
MIPS dynamic instruction mix, A-42
Sun Ultra 5 execution times, 43
SPECfp2006 benchmarks
Intel processor clock rates, 244
nonblocking cache, 83
SPECfpRate benchmarks
multicore processor performance, 400
multiprocessor cost effectiveness, 407
SMT, 398–400
SMT on superscalar processors, 230
SPEChpc96 benchmark, vector processor history, G-28
Special-purpose machines
historical background, L-4 to L-5
SIMD computer history, L-56
Special-purpose register
compiler writing-architecture relationship, A-30
ISA classification, A-3
VMIPS, 267
Special values
floating point, J-14 to J-15
representation, J-16
SPECint benchmarks
hardware prefetching, 92
interconnection network, F-87
ISA performance and efficiency prediction, 241–242
Itanium 2, H-43
nonblocking caches, 84
SPECint92 benchmarks
Intel 80x86 vs. DLX, K-63
Intel 80x86 instruction lengths, K-60
Intel 80x86 instruction mix, K-62
Intel 80x86 operand type distribution, K-59
nonblocking cache, 83
SPECint95 benchmarks, interconnection networks, F-88
SPECint2000 benchmarks, MIPS dynamic instruction mix, A-41
SPECint2006 benchmarks
Intel processor clock rates, 244
nonblocking cache, 83
SPECintRate benchmark
multicore processor performance, 400
multiprocessor cost effectiveness, 407
SMT, 398–400
SMT on superscalar processors, 230
SPEC Java Business Benchmark (JBB)
multicore processor performance, 400
multicore processors, 402
multiprocessing/multithreading-based performance, 398
server, 40
Sun T1 multithreading unicore performance, 227–229, 229
SPECJVM98 benchmarks, ISA performance and efficiency prediction, 241
SPECMail benchmark, characteristics, D-20
SPEC-optimized processors, vs. density-optimized, F-85
SPECPower benchmarks
Google server benchmarks, 439–440, 440
multicore processor performance, 400
real-world server considerations, 52–55
WSCs, 463
WSC server energy efficiency, 462–463
SPECRate benchmarks
Intel Core i7, 402
multicore processor performance, 400
multiprocessor cost effectiveness, 407
server benchmarks, 40
SPECRate2000 benchmarks, SMT, 398–400
SPECRatios
execution time examples, 43
geometric means calculations, 43–44
SPECSFS benchmarks
example, D-20
servers, 40
Speculation See also Hardware-based speculation; Software speculation
advantages/disadvantages, 210–211
compilers See Compiler speculation
concept origins, L-29 to L-30
and energy efficiency, 211–212
FP unit with Tomasulo’s algorithm, 185
hardware vs. software, 221–222
IA-64, H-38 to H-40
ILP studies, L-32 to L-33
Intel Core i7, 123–124
latency hiding in consistency models, 396–397
memory reference, hardware support, H-32
and memory system, 222–223
microarchitectural techniques case study, 247–254
multiple branches, 211
register renaming vs. ROB, 208–210
SPECvirt_Sc2010 benchmarks, server, 40
SPECWeb benchmarks
characteristics, D-20
dependability, D-21
parallelism, 44
server benchmarks, 40
SPECWeb99 benchmarks
multiprocessing/multithreading-based performance, 398
Sun T1 multithreading unicore performance, 227, 229
Speedup
Amdahl’s law, 46–47
floating-point addition, J-25 to J-26
integer addition
carry-lookahead, J-37 to J-41
carry-lookahead circuit, J-38
carry-lookahead tree, J-40 to J-41
carry-lookahead tree adder, J-41
carry-select adder, J-43, J-43 to J-44, J-44
carry-skip adder, J-41 to J-43, J-42
overview, J-37
integer division
radix-2 division, J-55
radix-4 division, J-56
radix-4 SRT division, J-57
with single adder, J-54 to J-58
integer multiplication
array multiplier, J-50
Booth recoding, J-49
even/odd array, J-52
with many adders, J-50 to J-54
multipass array multiplier, J-51
signed-digit addition table, J-54
with single adder, J-47 to J-49, J-48
Wallace tree, J-53
integer multiplication/division, shifting over zeros, J-45 to J-47
integer SRT division, J-45 to J-46, J-46
linear, 405–407
via parallelism, 263
pipeline with stalls, C-12 to C-13
relative, 406
scaled, 406–407
switch buffer organizations, F-58 to F-59
true, 406
Sperry-Rand, L-4 to L-5
Spin locks
via coherence, 389–390
large-scale multiprocessor synchronization
barrier synchronization, I-16
exponential back-off, I-17
SPLASH parallel benchmarks, SMT on superscalar processors, 230
Split, GPU vs. MIMD, 329
SPRAM, Sony PlayStation 2 Emotion Engine organization, E-18
Sprowl, Bob, F-99
Squared coefficient of variance, D-27
SRT division
chip comparison, J-60 to J-61
complications, J-45 to J-46
early computer arithmetic, J-65
example, J-46
historical background, J-63
integers, with adder, J-55 to J-57
radix-4, J-56, J-57
SS format instructions, IBM 360, K-85 to K-88
Stack architecture
and compiler technology, A-27
flaws vs. success, A-44 to A-45
historical background, L-16 to L-17
Intel 80x86, K-48, K-52, K-54
operands, A-3 to A-4
Stack frame, VAX, K-71
Stack pointer, VAX, K-71
Stack or Thread Local Storage, definition, 292
Stale copy, cache coherency, 112
Stall cycles
advanced directory protocol case study, 424
average memory access time, B-17
branch hazards, C-21
branch scheme performance, C-25
definition, B-4 to B-5
example calculation, B-31
loop unrolling, 161
MIPS FP pipeline performance, C-60
miss rate calculation, B-6
out-of-order processors, B-20 to B-21
performance equations, B-22
pipeline performance, C-12 to C-13
single-chip multicore multiprocessor case study, 414–418
structural hazards, C-15
Stalls
AMD Opteron data cache, B-15
ARM Cortex-A8, 235, 235–236
branch hazards, C-42
data hazard minimization, C-16 to C-19, C-18
data hazards requiring, C-19 to C-21
delayed branch, C-65
Intel Core i7, 239–241
microarchitectural techniques case study, 252
MIPS FP pipeline performance, C-60 to C-61, C-61 to C-62
MIPS pipeline multicycle operations, C-51
MIPS R4000, C-64, C-67, C-67 to C-69, C-69
miss rate calculations, B-31 to B-32
necessity, C-21
nonblocking cache, 84
pipeline performance, C-12 to C-13
from RAW hazards, FP code, C-55
structural hazard, C-15
VLIW sample code, 252
VMIPS, 268
Standardization, commercial interconnection networks, F-63 to F-64
Stardent-1500, Livermore Fortran kernels, 331
Start-up overhead, vs. peak performance, 331
Start-up time
DAXPY on VMIPS, G-20
memory banks, 276
page size selection, B-47
peak performance, 331
vector architectures, 331, G-4, G-4, G-8
vector convoys, G-4
vector execution time, 270–271
vector performance, G-2
vector performance measures, G-16
vector processor, G-7 to G-9, G-25
VMIPS, G-5
State transition diagram
directory vs. cache, 385
directory-based cache coherence, 383
Statically based exploitation, ILP, H-2
Static power
basic equation, 26
SMT, 231
Static random-access memory (SRAM)
characteristics, 97–98
dependability, 104
fault detection pitfalls, 58
power, 26
vector memory systems, G-9
vector processor, G-25
yield, 32
Static scheduling
definition, C-71
ILP, 192–196
and unoptimized code, C-81
Sticky bit, J-18
Stop & Go See Xon/Xoff
Storage area networks
dependability benchmarks, D-21 to D-23, D-22
historical overview, F-102 to F-103
I/O system as black box, D-23
Storage systems
asynchronous I/O and OSes, D-35
Berkeley’s Tertiary Disk project, D-12
block servers vs. filers, D-34 to D-35
bus replacement, D-34
component failure, D-43
computer system availability, D-43 to D-44, D-44
dependability benchmarks, D-21 to D-23
dirty bits, D-61 to D-64
disk array deconstruction case study, D-51 to D-55, D-52 to D-55
disk arrays, D-6 to D-10
disk deconstruction case study, D-48 to D-51, D-50
disk power, D-5
disk seeks, D-45 to D-47
disk storage, D-2 to D-5
file system benchmarking, D-20, D-20 to D-21
Internet Archive Cluster See Internet Archive Cluster
I/O performance, D-15 to D-16
I/O subsystem design, D-59 to D-61
I/O system design/evaluation, D-36 to D-37
mail server benchmarking, D-20 to D-21
NetApp FAS6000 filer, D-41 to D-42
operator dependability, D-13 to D-15
OS-scheduled disk access, D-44 to D-45, D-45
point-to-point links, D-34, D-34
queue I/O request calculations, D-29
queuing theory, D-23 to D-34
RAID performance prediction, D-57 to D-59
RAID reconstruction case study, D-55 to D-57
real faults and failures, D-6 to D-10
reliability, D-44
response time restrictions for benchmarks, D-18
seek distance comparison, D-47
seek time vs. distance, D-46
server utilization calculation, D-28 to D-29
sorting case study, D-64 to D-67
Tandem Computers, D-12 to D-13
throughput vs. response time, D-16, D-16 to D-18, D-17
TP benchmarks, D-18 to D-19
transactions components, D-17
web server benchmarking, D-20 to D-21
WSC vs. datacenter costs, 455
WSCs, 442–443
Store conditional
locks via coherence, 391
synchronization, 388–389
Store-and-forward packet switching, F-51
Store instructions See also Load-store instruction set architecture
definition, C-4
instruction execution, 186
ISA, 11, A-3
MIPS, A-33, A-36
NVIDIA GPU ISA, 298
Opteron data cache, B-15
RISC instruction set, C-4 to C-6, C-10
vector architectures, 310
Streaming Multiprocessor
definition, 292, 313–314
Fermi GPU, 307
Strecker, William, K-65
Strided accesses
Multimedia SIMD Extensions, 283
Roofline model, 287
TLB interaction, 323
Strided addressing See also Unit stride addressing
multimedia instruction compiler support, A-31 to A-32
Strides
gather-scatter, 280
highly parallel memory systems, 133
multidimensional arrays in vector architectures, 278–279
NVIDIA GPU ISA, 300
vector memory systems, G-10 to G-11
VMIPS, 266
String operations, Intel 80x86, K-51, K-53
Stripe, disk array deconstruction, D-51
Striping
disk arrays, D-6
RAID, D-9
Strip-Mined Vector Loop
convoys, G-5
DAXPY on VMIPS, G-20
definition, 292
multidimensional arrays, 278
Thread Block comparison, 294
vector-length registers, 274
Strip mining
DAXPY on VMIPS, G-20
GPU conditional branching, 303
GPUs vs. vector architectures, 311
NVIDIA GPU, 291
vector, 275
VLRs, 274–275
Strong scaling, Amdahl’s law and parallel computers, 407
Structural hazards
basic considerations, C-13 to C-16
definition, C-11
MIPS pipeline, C-71
MIPS scoreboarding, C-78 to C-79
pipeline stall, C-15
vector execution time, 268–269
Structural stalls, MIPS R4000 pipeline, C-68 to C-69
Subset property, and inclusion, 397
Summary overflow condition code, PowerPC, K-10 to K-11
Sun Microsystems
cache optimization, B-38
fault detection pitfalls, 58
memory dependability, 104
Sun Microsystems Enterprise, L-60
Sun Microsystems Niagara (T1/T2) processors
characteristics, 227
CPI and IPC, 399
fine-grained multithreading, 224, 225, 226–229
manufacturing cost, 62
multicore processor performance, 400–401
multiprocessing/multithreading-based performance, 398–400
multithreading history, L-34
T1 multithreading unicore performance, 227–229
Sun Microsystems SPARC
addressing modes, K-5
ALU operands, A-6
arithmetic/logical instructions, K-11, K-31
branch conditions, A-19
conditional branches, K-10, K-17
conditional instructions, H-27
constant extension, K-9
conventions, K-13
data transfer instructions, K-10
fast traps, K-30
features, K-44
FP instructions, K-23
instruction list, K-31 to K-32
integer arithmetic, J-12
integer overflow, J-11
ISA, A-2
LISP, K-30
MIPS core extensions, K-22 to K-23
overlapped integer/FP operations, K-31
precise exceptions, C-60
register windows, K-29 to K-30
RISC history, L-20
as RISC system, K-4
Smalltalk, K-30
synchronization history, L-64
unique instructions, K-29 to K-32
Sun Microsystems SPARCCenter, L-60
Sun Microsystems SPARCstation-2, F-88
Sun Microsystems SPARCstation-20, F-88
Sun Microsystems SPARC V8, floating-point precisions, J-33
Sun Microsystems SPARC VIS
characteristics, K-18
multimedia support, E-11, K-18
Sun Microsystems Ultra 5, SPECfp2000 execution times, 43
Sun Microsystems UltraSPARC, L-62, L-73
Sun Microsystems UltraSPARC T1 processor, characteristics, F-73
Sun Modular Datacenter, L-74 to L-75
Superblock scheduling
basic process, H-21 to H-23
compiler history, L-31
example, H-22
Supercomputers
commercial interconnection networks, F-63
direct network topology, F-37
low-dimensional topologies, F-100
SAN characteristics, F-76
SIMD, development, L-43 to L-44
vs. WSCs, 8
Superlinear performance, multiprocessors, 406
Superpipelining
definition, C-61
performance histories, 20
Superscalar processors
coining of term, L-29
ideal processors, 214–215
ILP, 192–197, 246
studies, L-32
microarchitectural techniques case study, 250–251
multithreading support, 225
recent advances, L-33 to L-34
register renaming code, 251
rename table and register substitution logic, 251
SMT, 230–232
VMIPS, 267
Superscalar registers, sample renaming code, 251
Supervisor process, virtual memory protection, 106
Sussenguth, Ed, L-28
Sutherland, Ivan, L-34
Swap procedure, VAX
code example, K-72, K-74
full procedure, K-75 to K-76
overview, K-72 to K-76
register allocation, K-72
register preservation, K-74 to K-75
Swim, data cache misses, B-10
Switched-media networks
basic characteristics, F-24
vs. buses, F-2
effective bandwidth vs. nodes, F-28
example, F-22
latency and effective bandwidth, F-26 to F-28
vs. shared-media networks, F-24 to F-25
Switched networks
centralized, F-30 to F-34
DOR, F-46
OCN history, F-104
topology, F-40
Switches
array, WSCs, 443–444
Beneš networks, F-33
context, 307, B-49
early LANs and WANs, F-29
Ethernet switches, 16, 20, 53, 441–444, 464–465, 469
interconnecting node calculations, F-35
vs. NIC, F-85 to F-86, F-86
process switch, 224, B-37, B-49 to B-50
storage systems, D-34
switched-media networks, F-24
WSC hierarchy, 441–442, 442
WSC infrastructure, 446
WSC network bottleneck, 461
Switch fabric, switched-media networks, F-24
Switching
commercial interconnection networks, F-56
interconnection networks, F-22, F-27, F-50 to F-52
network impact, F-52 to F-55
performance considerations, F-92 to F-93
SAN characteristics, F-76
switched-media networks, F-24
system area network history, F-100
Switch microarchitecture
basic microarchitecture, F-55 to F-58
buffer organizations, F-58 to F-60
enhancements, F-62
HOL blocking, F-59
input-output-buffered switch, F-57
pipelining, F-60 to F-61, F-61
Switch ports
centralized switched networks, F-30
interconnection network topology, F-29
Switch statements
control flow instruction addressing modes, A-18
GPU, 301
Syllable, IA-64, H-35
Symbolic loop unrolling, software pipelining, H-12 to H-15, H-13
Symmetric multiprocessors (SMP)
characteristics, I-45
communication calculations, 350
directory-based cache coherence, 354
first vector computers, L-47, L-49
limitations, 363–364
snooping coherence protocols, 354–355
system area network history, F-101
TLP, 345
Symmetric shared-memory multiprocessors See also Centralized shared-memory multiprocessors
data caching, 351–352
limitations, 363–364
performance
commercial workload, 367–369
commercial workload measurement, 369–374
multiprogramming and OS workload, 374–378
overview, 366–367
scientific workloads, I-21 to I-26, I-23 to I-25
Synapse N + 1, L-59
Synchronization
AltaVista search, 369
basic considerations, 386–387
basic hardware primitives, 387–389
consistency models, 395–396
cost, 403
Cray X1, G-23
definition, 375
GPU comparisons, 329
GPU conditional branching, 300–303
historical background, L-64
large-scale multiprocessors
barrier synchronization, I-13 to I-16, I-14, I-16
challenges, I-12 to I-16
hardware primitives, I-18 to I-21
sense-reversing barrier, I-21
software implementations, I-17 to I-18
tree-based barriers, I-19
locks via coherence, 389–391
message-passing communication, I-5
MIMD, 10
MIPS core extensions, K-21
programmer’s viewpoint, 393–394
PTX instruction set, 298–299
relaxed consistency models, 394–395
single-chip multicore processor case study, 412–418
vector vs. GPU, 311
VLIW, 196
WSCs, 434
Synchronous dynamic random-access memory (SDRAM)
ARM Cortex-A8, 117
DRAM, 99
vs. Flash memory, 103
IBM Blue Gene/L, I-42
Intel Core i7, 121
performance, 100
power consumption, 102, 103
SDRAM timing diagram, 139
Synchronous event, exception requirements, C-44 to C-45
Synchronous I/O, definition, D-35
Synonyms
address translation, B-38
dependability, 34
Synthetic benchmarks
definition, 37
typical program fallacy, A-43
System area networks, historical overview, F-100 to F-102
System calls
CUDA Thread, 297
multiprogrammed workload, 378
virtualization/paravirtualization performance, 141
virtual memory protection, 106
System interface controller (SIF), Intel SCCC, F-70
System-on-chip (SoC)
cell phone, E-24
cross-company interoperability, F-64
embedded systems, E-3
Sanyo digital cameras, E-20
Sanyo VPC-SX500 digital camera, E-19
shared-media networks, F-23
System Performance and Evaluation Cooperative (SPEC) See SPEC benchmarks
System Processor
definition, 309
DLP, 262, 322
Fermi GPU, 306
GPU issues, 330
GPU programming, 288–289
NVIDIA GPU ISA, 298
NVIDIA GPU Memory, 305
processor comparisons, 323–324
synchronization, 329
vector vs. GPU, 311–312
System response time, transactions, D-16, D-17
Systems on a chip (SOC), cost trends, 28
System/storage area networks (SANs)
characteristics, F-3 to F-4
communication protocols, F-8
congestion management, F-65
cross-company interoperability, F-64
effective bandwidth, F-18
example system, F-72 to F-74
fat trees, F-34
fault tolerance, F-67
InfiniBand example, F-74 to F-77
interconnection network domain relationship, F-4
LAN history, F-99
latency and effective bandwidth, F-26 to F-28
latency vs. nodes, F-27
packet latency, F-13, F-14 to F-16
routing algorithms, F-48
software overhead, F-91
TCP/IP reliance, F-95
time of flight, F-13
topology, F-30
System Virtual Machines, definition, 107