S
Sandy Bridge dies, wafer example,
31
Sanyo digital cameras, SOC,
E-20
Sanyo VPC-SX500 digital camera, embedded system case study, E-19
SATA (Serial Advanced Technology Attachment) disks
NetApp FAS6000 filer, D-42
storage area network history, F-103
Saturating arithmetic, DSP media extensions, E-11
Saturating operations, definition, K-18 to K-19
SAXPY, GPU raw/relative performance,
328
Scalability
multicore processors,
400
as server characteristic,
transistor performance and wires,
19–21
Scalable GPUs, historical background, L-50 to L-51
Scalar expansion, loop-level parallelism dependences,
321
Scalar registers
GPUs
vs. vector architectures,
311
loop-level parallelism dependences,
321–322
Multimedia SIMD
vs. GPUs,
312
sample renaming code,
251
Scaled addressing, VAX, K-67
Scaled speedup, Amdahl’s law and parallel computers,
406–407
Scaling
Amdahl’s law and parallel computers,
406–407
computation-to-communication ratios,
I-11
dynamic voltage-frequency,
25,
52,
467
interconnection network speed, F-88
multicore
vs. single-core,
402
processor performance trends,
scientific applications on parallel processing, I-34
shared-
vs. switched-media networks, F-25
transistor performance and wires,
19–21
Scan Line Interleave (SLI), scalable GPUs, L-51
Scientific applications
basic characteristics, I-6 to I-7
distributed-memory multiprocessors, I-26 to I-32,
I-28 to I-32
parallel processors, I-33 to I-34
parallel program computation/communication, I-10 to I-12,
I-11
parallel programming, I-2
symmetric shared-memory multiprocessors, I-21 to I-26,
I-23 to I-25
Scoreboarding
example calculations,
C-77
SIMD thread scheduler,
296
Scripting languages, software development impact,
SCSI (Small Computer System Interface)
Berkeley’s Tertiary Disk project, D-12
dependability benchmarks, D-21
historical background, L-80 to L-81
I/O subsystem design, D-59
RAID reconstruction, D-56
storage area network history, F-102
Second-level caches
See also L2 caches
interconnection network, F-87
and relative execution time,
B-34
Secure Virtual Machine (SVM),
129
Seek time, storage disks,
D-46
Segment descriptor, IA-32 processor,
B-52,
B-53
Self-correction, Newton’s algorithm, J-28 to J-29
Self-draining pipelines, L-29
Semantic clash, high-level instruction set,
A-41
Semantic gap, high-level instruction set,
A-39
Sending overhead
communication latency, I-3 to I-4
Sense-reversing barrier
large-scale multiprocessor synchronization, I-14
Sequence of SIMD Lane Operations, definition,
292,
313
Sequence number, packet header, F-8
Sequential consistency
latency hiding with speculation,
396–397
programmer’s viewpoint,
394
relaxed consistency models,
394–395
requirements and implementation,
392–393
Sequential interleaving, multibanked caches,
86,
86
Serial Attached SCSI (SAS) drive
historical background, L-81
Serialization
barrier synchronization, I-16
coherence enforcement,
354
directory-based cache coherence,
382
DSM multiprocessor cache coherence, I-37
multiprocessor cache coherency,
353
snooping coherence protocols,
356
write invalidate protocol implementation,
356
Serpentine recording, L-77
Serve-longest-queue (SLQ) scheme, arbitration, F-49
ServerNet interconnection network, fault tolerance, F-66 to F-67
Servers
See also Warehouse-scale computers (WSCs)
memory hierarchy design,
72
multiprocessor importance,
344
outage/anomaly statistics,
435
performance benchmarks,
40–41
power distribution example,
490
power-performance modes,
477
real-world examples,
52–55
RISC systems
addressing modes and instruction formats, K-5 to K-6
multimedia extensions, K-16 to K-19
single-server model,
D-25
system characteristics,
E-4
vs. WSC facility costs,
472
WSC memory hierarchy,
444
WSC resource allocation case study,
478–479
Server side Java operations per second (ssj_ops)
example calculations,
439
real-world considerations,
52–55
Server utilization
calculation, D-28 to D-29
Service accomplishment, SLAs,
34
Service Health Dashboard, AWS,
457
Service interruption, SLAs,
34
Service level agreements (SLAs)
Service level objectives (SLOs)
Session layer, definition,
F-82
Set associativity
memory hierarchy basics,
74,
76
performance equations,
B-22
pipelined cache access,
82
Set-on-less-than instructions (SLT)
MIPS conditional branches, K-11 to K-12
Shadow page table, Virtual Machines,
110
Sharding, WSC memory hierarchy,
445
Shared-media networks
effective bandwidth
vs. nodes,
F-28
latency and effective bandwidth, F-26 to F-28
multiple device connections, F-22 to F-24
vs. switched-media networks, F-24 to F-25
Shared Memory
directory-based cache coherence,
418–420
terminology comparison,
315
Shared-memory communication, large-scale multiprocessors, I-5
Shared-memory multiprocessors
cache coherence enforcement,
354–355
cache coherence extensions,
362–363
historical background, L-60 to L-61
invalidate protocol implementation,
356–357
single-chip multicore case study,
412–418
SMP and snooping limitations,
363–364
snooping coherence implementation,
365–366
snooping coherence protocols,
355–356
Shared-memory synchronization, MIPS core extensions, K-21
Shared state
coherence extensions,
362
directory-based cache coherence protocol basics,
380,
385
Shear algorithms, disk array deconstruction, D-51 to D-52,
D-52 to D-54
Shifting over zeros, integer multiplication/division, J-45 to J-47
SI format instructions, IBM 360, K-87
Signal-to-noise ratio (SNR), wireless networks, E-21
Signed-digit representation
integer multiplication, J-53
Signed number arithmetic, J-7 to J-10
Silicon Graphics 4D/240, L-59
Silicon Graphics Altix,
F-76, L-63
Silicon Graphics Challenge, L-60
Silicon Graphics Origin, L-61, L-63
Silicon Graphics systems (SGI)
multiprocessor software development,
407–409
vector processor history, G-27
SIMD (Single Instruction Stream, Multiple Data Stream)
Fermi GPU architectural innovations,
305–308
GPU conditional branching,
301
GPUs
vs. vector architectures,
308–309
historical overview, L-55 to L-56
loop-level parallelism,
150
multiprocessor architecture,
346
NVIDIA GPU computational structures,
291
speedup via parallelism,
263
supercomputer development, L-43 to L-44
system area network history, F-100
Thread Block mapping,
293
SIMD Instruction
DSP media extensions, E-10
GPU Memory structures,
304
multimedia architecture programming,
285
Multithreaded SIMD Processor block diagram,
294
Thread of SIMD Instructions,
295–296
vector architectures as superset,
263–264
vector/GPU comparison,
308
SIMD Lane Registers, definition,
309,
314
SIMD Lanes
instruction scheduling,
297
multimedia extensions,
285
Multimedia SIMD
vs. GPUs,
312,
315
multithreaded processor,
294
synchronization marker,
301
SIMD Processors
See also Multithreaded SIMD Processor
dependent computation elimination,
321
GPU conditional branching,
302
Multimedia SIMD
vs. GPU,
312
multiprocessor architecture,
346
NVIDIA GPU computational structures,
291
NVIDIA GPU Memory structures,
304–305
processor comparisons,
324
system area network history, F-100
SIMD Thread
Multithreaded SIMD processor,
294
NVIDIA GPU Memory structures,
305
SIMT (Single Instruction, Multiple Thread)
Simultaneous multithreading (SMT)
historical background, L-34 to L-35
multicore performance/energy efficiency,
402–405
multiprocessing/multithreading-based performance,
398–400
multithreading history, L-35
Single-extended precision floating-point arithmetic, J-33 to J-34
Single-level cache hierarchy, miss rates
vs. cache size,
B-33
Single-precision floating point
Multimedia SIMD Extensions,
283
representation, J-15 to J-16
Single-Streaming Processor (SSP)
Single-thread (ST) performance
processor comparison,
243
SISD (Single Instruction Stream, Single Data Stream),
10
SIMD computer history, L-55
Skippy algorithm
disk deconstruction, D-49
Small form factor (SFF) disk, L-79
Smalltalk, SPARC instructions, K-30
Smart interface cards,
vs. smart switches, F-85 to F-86
Smart switches,
vs. smart interface cards, F-85 to F-86
Snooping cache coherence
controller transitions,
421
large-scale multiprocessor history, L-61
large-scale multiprocessors, I-34 to I-35
single-chip multicore processor case study,
412–418
symmetric shared-memory machines,
366
Soft errors, definition,
104
Software as a Service (SaaS)
Software development
multiprocessor architecture issues,
407–409
performance
vs. productivity,
Software pipelining
example calculations, H-13 to H-14
loops, execution pattern,
H-15
technique, H-12 to H-15,
H-13
Software prefetching, cache optimization,
131–133
Software technology
large-scale multiprocessors, I-6
large-scale multiprocessor synchronization, I-17 to I-18
vs. TCP/IP reliance, F-95
Virtual Machines protection,
108
Solaris, RAID benchmarks,
D-22, D-22 to D-23
Solid-state disks (SSDs)
processor performance/price/power,
52
server energy efficiency,
462
Sonics Smart Interconnect, OCNs, F-3
Sony PlayStation 2
embedded multiprocessors, E-14
Emotion Engine case study, E-15 to E-18
Emotion Engine organization,
E-18
Sorting, case study, D-64 to D-67
Sort primitive, GPU
vs. MIMD,
329
Sort procedure, VAX
example code, K-77 to K-79
register allocation, K-76
Source routing, basic concept, F-48
Sparse matrices
loop-level parallelism dependences,
318–319
vector execution time,
271
vector mask registers,
275
Spatial locality
memory hierarchy design,
72
SPEC benchmarks
branch predictor correlation,
162–164
desktop performance,
38–40
early performance measures, L-7
performance results reporting,
41
processor performance growth,
storage systems, D-20 to D-21
tournament predictors,
164
vector processor history, G-28
SPEC89 benchmarks
VAX 8700
vs. MIPS M2000,
K-82
SPEC92 benchmarks
hardware
vs. software speculation,
221
SPEC2000 benchmarks
cache performance prediction,
125–126
cache size and misses per instruction,
126
compiler optimizations,
A-29
compulsory miss rate,
B-23
data reference sizes,
A-44
SPEC2006 benchmarks, evolution,
39
SPECCPU2000 benchmarks
displacement addressing mode,
A-12
SPECCPU2006 benchmarks
ISA performance and efficiency prediction,
241
Virtual Machines protection,
108
SPECfp benchmarks
interconnection network, F-87
ISA performance and efficiency prediction,
241–242
tournament predictors,
164
SPECfp92 benchmarks
Intel 80x86
vs. DLX,
K-63
Intel 80x86 instruction lengths,
K-60
Intel 80x86 instruction mix,
K-61
Intel 80x86 operand type distribution,
K-59
SPECfp2000 benchmarks
MIPS dynamic instruction mix,
A-42
Sun Ultra 5 execution times,
43
SPECfp2006 benchmarks
Intel processor clock rates,
244
SPECfpRate benchmarks
multicore processor performance,
400
multiprocessor cost effectiveness,
407
SMT on superscalar processors,
230
SPEChpc96 benchmark, vector processor history, G-28
Special-purpose machines
historical background, L-4 to L-5
SIMD computer history, L-56
Special-purpose register
compiler writing-architecture relationship,
A-30
Special values
floating point, J-14 to J-15
SPECINT benchmarks
interconnection network, F-87
ISA performance and efficiency prediction,
241–242
SPECInt92 benchmarks
Intel 80x86
vs. DLX,
K-63
Intel 80x86 instruction lengths,
K-60
Intel 80x86 instruction mix,
K-62
Intel 80x86 operand type distribution,
K-59
SPECint95 benchmarks, interconnection networks, F-88
SPECINT2000 benchmarks, MIPS dynamic instruction mix,
A-41
SPECINT2006 benchmarks
Intel processor clock rates,
244
SPECintRate benchmark
multicore processor performance,
400
multiprocessor cost effectiveness,
407
SMT on superscalar processors,
230
SPEC Java Business Benchmark (JBB)
multicore processor performance,
400
multicore processors,
402
multiprocessing/multithreading-based performance,
398
Sun T1 multithreading unicore performance,
227–229,
229
SPECJVM98 benchmarks, ISA performance and efficiency prediction,
241
SPECMail benchmark, characteristics, D-20
SPEC-optimized processors,
vs. density-optimized, F-85
SPECPower benchmarks
multicore processor performance,
400
real-world server considerations,
52–55
WSC server energy efficiency,
462–463
SPECRate benchmarks
multicore processor performance,
400
multiprocessor cost effectiveness,
407
SPECRate2000 benchmarks, SMT,
398–400
SPECRatios
execution time examples,
43
geometric means calculations,
43–44
SPECvirt_sc2010 benchmarks, server,
40
SPECWeb99 benchmarks
multiprocessing/multithreading-based performance,
398
Sun T1 multithreading unicore performance,
227,
229
Speedup
floating-point addition, J-25 to J-26
integer addition
carry-lookahead, J-37 to J-41
carry-lookahead circuit,
J-38
carry-lookahead tree,
J-40 to J-41
carry-lookahead tree adder,
J-41
carry-select adder,
J-43, J-43 to J-44,
J-44
carry-skip adder, J-41 to J-43,
J-42
integer division
radix-4 SRT division,
J-57
with single adder, J-54 to J-58
integer multiplication
with many adders, J-50 to J-54
multipass array multiplier,
J-51
signed-digit addition table,
J-54
with single adder, J-47 to J-49,
J-48
integer multiplication/division, shifting over zeros, J-45 to J-47
integer SRT division, J-45 to J-46,
J-46
switch buffer organizations, F-58 to F-59
Spin locks
large-scale multiprocessor synchronization
barrier synchronization, I-16
exponential back-off,
I-17
SPLASH parallel benchmarks, SMT on superscalar processors,
230
SPRAM, Sony PlayStation 2 Emotion Engine organization,
E-18
Squared coefficient of variance, D-27
SRT division
chip comparison, J-60 to J-61
complications, J-45 to J-46
early computer arithmetic, J-65
historical background, J-63
integers, with adder, J-55 to J-57
SS format instructions, IBM 360, K-85 to K-88
Stack architecture
and compiler technology,
A-27
historical background, L-16 to L-17
Intel 80x86,
K-48,
K-52,
K-54
Stack or Thread Local Storage, definition,
292
Stale copy, cache coherency,
112
Stall cycles
advanced directory protocol case study,
424
average memory access time,
B-17
branch scheme performance,
C-25
example calculation,
B-31
MIPS FP pipeline performance,
C-60
miss rate calculation,
B-6
performance equations,
B-22
single-chip multicore multiprocessor case study,
414–418
Stalls
AMD Opteron data cache,
B-15
microarchitectural techniques case study,
252
MIPS pipeline multicycle operations,
C-51
from RAW hazards, FP code,
C-55
Standardization, commercial interconnection networks, F-63 to F-64
Stardent-1500, Livermore Fortran kernels,
331
Start-up overhead,
vs. peak performance,
331
Start-up time
page size selection,
B-47
vector performance measures, G-16
vector processor, G-7 to G-9, G-25
State transition diagram
directory-based cache coherence,
383
Statically based exploitation, ILP, H-2
Static random-access memory (SRAM)
fault detection pitfalls,
58
vector memory systems, G-9
Static scheduling
and unoptimized code,
C-81
Storage area networks
dependability benchmarks, D-21 to D-23,
D-22
historical overview, F-102 to F-103
I/O system as black box,
D-23
Storage systems
asynchronous I/O and OSes, D-35
Berkeley’s Tertiary Disk project, D-12
block servers
vs. filers, D-34 to D-35
computer system availability, D-43 to D-44,
D-44
dependability benchmarks, D-21 to D-23
disk array deconstruction case study, D-51 to D-55,
D-52 to D-55
disk deconstruction case study, D-48 to D-51,
D-50
file system benchmarking,
D-20, D-20 to D-21
I/O performance, D-15 to D-16
I/O subsystem design, D-59 to D-61
I/O system design/evaluation, D-36 to D-37
mail server benchmarking, D-20 to D-21
NetApp FAS6000 filer, D-41 to D-42
operator dependability, D-13 to D-15
OS-scheduled disk access, D-44 to D-45,
D-45
point-to-point links, D-34,
D-34
queue I/O request calculations, D-29
queuing theory, D-23 to D-34
RAID performance prediction, D-57 to D-59
RAID reconstruction case study, D-55 to D-57
real faults and failures, D-6 to D-10
response time restrictions for benchmarks,
D-18
seek distance comparison,
D-47
seek time
vs. distance,
D-46
server utilization calculation, D-28 to D-29
sorting case study, D-64 to D-67
Tandem Computers, D-12 to D-13
throughput
vs. response time,
D-16, D-16 to D-18,
D-17
TP benchmarks, D-18 to D-19
transactions components,
D-17
web server benchmarking, D-20 to D-21
WSC
vs. datacenter costs,
455
Store-and-forward packet switching, F-51
Strided accesses
Multimedia SIMD Extensions,
283
Strides
highly parallel memory systems,
133
multidimensional arrays in vector architectures,
278–279
vector memory systems, G-10 to G-11
String operations, Intel 80x86, K-51,
K-53
Stripe, disk array deconstruction, D-51
Strip-Mined Vector Loop
multidimensional arrays,
278
Thread Block comparison,
294
vector-length registers,
274
Strip mining
GPU conditional branching,
303
GPUs
vs. vector architectures,
311
Strong scaling, Amdahl’s law and parallel computers,
407
Subset property, and inclusion,
397
Summary overflow condition code, PowerPC, K-10 to K-11
Sun Microsystems
fault detection pitfalls,
58
memory dependability,
104
Sun Microsystems Enterprise, L-60
Sun Microsystems Niagara (T1/T2) processors
multicore processor performance,
400–401
multiprocessing/multithreading-based performance,
398–400
multithreading history, L-34
T1 multithreading unicore performance,
227–229
Sun Microsystems SPARC
arithmetic/logical instructions,
K-11,
K-31
conditional branches, K-10,
K-17
conditional instructions, H-27
data transfer instructions,
K-10
instruction list, K-31 to K-32
MIPS core extensions, K-22 to K-23
overlapped integer/FP operations, K-31
register windows, K-29 to K-30
synchronization history, L-64
unique instructions, K-29 to K-32
Sun Microsystems SPARCCenter, L-60
Sun Microsystems SPARCstation-2, F-88
Sun Microsystems SPARCstation-20, F-88
Sun Microsystems SPARC V8, floating-point precisions, J-33
Sun Microsystems SPARC VIS
multimedia support,
E-11, K-18
Sun Microsystems Ultra 5, SPECfp2000 execution times,
43
Sun Microsystems UltraSPARC, L-62, L-73
Sun Microsystems UltraSPARC T1 processor, characteristics,
F-73
Sun Modular Datacenter, L-74 to L-75
Superblock scheduling
basic process, H-21 to H-23
Supercomputers
commercial interconnection networks, F-63
direct network topology,
F-37
low-dimensional topologies, F-100
SAN characteristics,
F-76
SIMD, development, L-43 to L-44
Superlinear performance, multiprocessors,
406
Superpipelining
performance histories,
20
Superscalar processors
microarchitectural techniques case study,
250–251
multithreading support,
225
recent advances, L-33 to L-34
register renaming code,
251
rename table and register substitution logic,
251
Superscalar registers, sample renaming code,
251
Supervisor process, virtual memory protection,
106
Swap procedure, VAX
full procedure, K-75 to K-76
register allocation, K-72
Swim, data cache misses,
B-10
Switched-media networks
basic characteristics, F-24
effective bandwidth
vs. nodes,
F-28
latency and effective bandwidth, F-26 to F-28
vs. shared-media networks, F-24 to F-25
Switched networks
centralized, F-30 to F-34
Switches
early LANs and WANs, F-29
interconnecting node calculations, F-35
vs. NIC, F-85 to F-86,
F-86
switched-media networks, F-24
WSC network bottleneck,
461
Switch fabric, switched-media networks, F-24
Switching
commercial interconnection networks,
F-56
interconnection networks, F-22,
F-27, F-50 to F-52
network impact, F-52 to F-55
performance considerations, F-92 to F-93
SAN characteristics,
F-76
switched-media networks, F-24
system area network history, F-100
Switch microarchitecture
basic microarchitecture, F-55 to F-58
buffer organizations, F-58 to F-60
input-output-buffered switch,
F-57
pipelining, F-60 to F-61,
F-61
Switch ports
centralized switched networks, F-30
interconnection network topology, F-29
Switch statements
control flow instruction addressing modes,
A-18
Symbolic loop unrolling, software pipelining, H-12 to H-15,
H-13
Symmetric multiprocessors (SMP)
communication calculations,
350
directory-based cache coherence,
354
first vector computers, L-47, L-49
snooping coherence protocols,
354–355
system area network history, F-101
Synchronization
historical background, L-64
large-scale multiprocessors
barrier synchronization, I-13 to I-16,
I-14,
I-16
hardware primitives, I-18 to I-21
sense-reversing barrier,
I-21
software implementations, I-17 to I-18
tree-based barriers,
I-19
message-passing communication, I-5
MIPS core extensions, K-21
relaxed consistency models,
394–395
single-chip multicore processor case study,
412–418
Synchronous dynamic random-access memory (SDRAM)
SDRAM timing diagram,
139
Synchronous I/O, definition, D-35
Synonyms
address translation,
B-38
Synthetic benchmarks
typical program fallacy,
A-43
System area networks, historical overview, F-100 to F-102
System calls
multiprogrammed workload,
378
virtualization/paravirtualization performance,
141
virtual memory protection,
106
System interface controller (SIF), Intel SCCC, F-70
System-on-chip (SoC)
cross-company interoperability, F-64
Sanyo digital cameras,
E-20
Sanyo VPC-SX500 digital camera, E-19
shared-media networks, F-23
System response time, transactions, D-16,
D-17
Systems on a chip (SOC), cost trends,
28
System/storage area networks (SANs)
characteristics, F-3 to F-4
communication protocols, F-8
congestion management, F-65
cross-company interoperability, F-64
effective bandwidth, F-18
example system, F-72 to F-74
InfiniBand example, F-74 to F-77
interconnection network domain relationship,
F-4
latency and effective bandwidth, F-26 to F-28
packet latency,
F-13, F-14 to F-16
System Virtual Machines, definition,
107