S

S3 See Amazon Simple Storage Service (S3)

SaaS See Software as a Service (SaaS)

Sandy Bridge dies, wafer example, 31

SANs See System/storage area networks (SANs)

Sanyo digital cameras, SOC, E-20

Sanyo VPC-SX500 digital camera, embedded system case study, E-19

SAS See Serial Attach SCSI (SAS) drive

SASI, L-81

SATA (Serial Advanced Technology Attachment) disks

Google WSC servers, 469

NetApp FAS6000 filer, D-42

power consumption, D-5

RAID 6, D-8

vs. SAS drives, D-5

storage area network history, F-103

Saturating arithmetic, DSP media extensions, E-11

Saturating operations, definition, K-18 to K-19

SAXPY, GPU raw/relative performance, 328

Scalability

cloud computing, 460

coherence issues, 378–379

Fermi GPU, 295

Java benchmarks, 402

multicore processors, 400

multiprocessing, 344, 395

parallelism, 44

as server characteristic, 7

transistor performance and wires, 19–21

WSCs, 8, 438

WSCs vs. servers, 434

Scalable GPUs, historical background, L-50 to L-51

Scalar expansion, loop-level parallelism dependences, 321

Scalar Processors See also Superscalar processors

definition, 292, 309

early pipelined CPUs, L-26 to L-27

lane considerations, 273

Multimedia SIMD/GPU comparisons, 312

NVIDIA GPU, 291

prefetch units, 277

vs. vector, 311, G-19

vector performance, 331–332

Scalar registers

Cray X1, G-21 to G-22

GPUs vs. vector architectures, 311

loop-level parallelism dependences, 321–322

Multimedia SIMD vs. GPUs, 312

sample renaming code, 251

vector vs. GPU, 311

vs. vector performance, 331–332

VMIPS, 265–266

Scaled addressing, VAX, K-67

Scaled speedup, Amdahl’s law and parallel computers, 406–407

Scaling

Amdahl’s law and parallel computers, 406–407

cloud computing, 456

computation-to-communication ratios, I-11

DVFS, 25, 52, 467

dynamic voltage-frequency, 25, 52, 467

Intel Core i7, 404

interconnection network speed, F-88

multicore vs. single-core, 402

processor performance trends, 3

scientific applications on parallel processing, I-34

shared- vs. switched-media networks, F-25

transistor performance and wires, 19–21

VMIPS, 267

Scan Line Interleave (SLI), scalable GPUs, L-51

SCCC See Intel Single-Chip Cloud Computing (SCCC)

Schorr, Herb, L-28

Scientific applications

Barnes, I-8 to I-9

basic characteristics, I-6 to I-7

cluster history, L-62

distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32

FFT kernel, I-7

LU kernel, I-8

Ocean, I-9 to I-10

parallel processors, I-33 to I-34

parallel program computation/communication, I-10 to I-12, I-11

parallel programming, I-2

symmetric shared-memory multiprocessors, I-21 to I-26, I-23 to I-25

Scoreboarding

ARM Cortex-A8, 233, 234

components, C-76

definition, 170

dynamic scheduling, 171, 175

and dynamic scheduling, C-71 to C-80

example calculations, C-77

MIPS structure, C-73

NVIDIA GPU, 296

results tables, C-78 to C-79

SIMD thread scheduler, 296

Scripting languages, software development impact, 4

SCSI (Small Computer System Interface)

Berkeley’s Tertiary Disk project, D-12

dependability benchmarks, D-21

disk storage, D-4

historical background, L-80 to L-81

I/O subsystem design, D-59

RAID reconstruction, D-56

storage area network history, F-102

SDRAM See Synchronous dynamic random-access memory (SDRAM)

SDRWAVE, J-62

Second-level caches See also L2 caches

ARM Cortex-A8, 114

ILP, 245

Intel Core i7, 121

interconnection network, F-87

Itanium 2, H-41

memory hierarchy, B-48 to B-49

miss penalty calculations, B-33 to B-34

miss penalty reduction, B-30 to B-35

miss rate calculations, B-31 to B-35

and relative execution time, B-34

speculation, 210

SRAM, 99

Secure Virtual Machine (SVM), 129

Seek distance

storage disks, D-46

system comparison, D-47

Seek time, storage disks, D-46

Segment basics

Intel 80x86, K-50

vs. page, B-43

virtual memory definition, B-42 to B-43

Segment descriptor, IA-32 processor, B-52, B-53

Segmented virtual memory

bounds checking, B-52

Intel Pentium protection, B-51 to B-54

memory mapping, B-52

vs. paged, B-43

safe calls, B-54

sharing and protection, B-52 to B-53

Self-correction, Newton’s algorithm, J-28 to J-29

Self-draining pipelines, L-29

Self-routing, MINs, F-48

Semantic clash, high-level instruction set, A-41

Semantic gap, high-level instruction set, A-39

Semiconductors

DRAM technology, 17

Flash memory, 18

GPU vs. MIMD, 325

manufacturing, 3–4

Sending overhead

communication latency, I-3 to I-4

OCNs vs. SANs, F-27

time of flight, F-14

Sense-reversing barrier

code example, I-15, I-21

large-scale multiprocessor synchronization, I-14

Sequence of SIMD Lane Operations, definition, 292, 313

Sequence number, packet header, F-8

Sequential consistency

latency hiding with speculation, 396–397

programmer’s viewpoint, 394

relaxed consistency models, 394–395

requirements and implementation, 392–393

Sequential interleaving, multibanked caches, 86, 86

Sequent Symmetry, L-59

Serial Advanced Technology Attachment disks See SATA (Serial Advanced Technology Attachment) disks

Serial Attach SCSI (SAS) drive

historical background, L-81

power consumption, D-5

vs. SATA drives, D-5

Serialization

barrier synchronization, I-16

coherence enforcement, 354

directory-based cache coherence, 382

DSM multiprocessor cache coherence, I-37

hardware primitives, 387

multiprocessor cache coherency, 353

page tables, 408

snooping coherence protocols, 356

write invalidate protocol implementation, 356

Serpentine recording, L-77

Serve-longest-queue (SLQ) scheme, arbitration, F-49

ServerNet interconnection network, fault tolerance, F-66 to F-67

Servers See also Warehouse-scale computers (WSCs)

as computer class, 5

cost calculations, 454, 454–455

definition, D-24

energy savings, 25

Google WSC, 440, 467, 468–469

GPU features, 324

memory hierarchy design, 72

vs. mobile GPUs, 323–330

multiprocessor importance, 344

outage/anomaly statistics, 435

performance benchmarks, 40–41

power calculations, 463

power distribution example, 490

power-performance benchmarks, 54, 439–441

power-performance modes, 477

real-world examples, 52–55

RISC systems

addressing modes and instruction formats, K-5 to K-6

examples, K-3, K-4

instruction formats, K-7

multimedia extensions, K-16 to K-19

single-server model, D-25

system characteristics, E-4

workload demands, 439

WSC vs. datacenters, 455–456

WSC data transfer, 446

WSC energy efficiency, 462–464

vs. WSC facility costs, 472

WSC memory hierarchy, 444

WSC resource allocation case study, 478–479

vs. WSCs, 432–434

WSC TCO case study, 476–478

Server side Java operations per second (ssj_ops)

example calculations, 439

power-performance, 54

real-world considerations, 52–55

Server utilization

calculation, D-28 to D-29

queuing theory, D-25

Service accomplishment, SLAs, 34

Service Health Dashboard, AWS, 457

Service interruption, SLAs, 34

Service level agreements (SLAs)

Amazon Web Services, 457

and dependability, 33

WSC efficiency, 452

Service level objectives (SLOs)

and dependability, 33

WSC efficiency, 452

Session layer, definition, F-82

Set associativity

and access time, 77

address parts, B-9

AMD Opteron data cache, B-12 to B-14

ARM Cortex-A8, 114

block placement, B-7 to B-8

cache block, B-7

cache misses, 83–84, B-10

cache optimization, 79–80, B-33 to B-35, B-38 to B-40

commercial workload, 371

energy consumption, 81

memory access times, 77

memory hierarchy basics, 74, 76

nonblocking cache, 84

performance equations, B-22

pipelined cache access, 82

way prediction, 81

Set basics

block replacement, B-9 to B-10

definition, B-7

Set-on-less-than instructions (SLT)

MIPS16, K-14 to K-15

MIPS conditional branches, K-11 to K-12

Settle time, D-46

SFF See Small form factor (SFF) disk

SFS benchmark, NFS, D-20

SGI See Silicon Graphics systems (SGI)

Shadow page table, Virtual Machines, 110

Sharding, WSC memory hierarchy, 445

Shared-media networks

effective bandwidth vs. nodes, F-28

example, F-22

latency and effective bandwidth, F-26 to F-28

multiple device connections, F-22 to F-24

vs. switched-media networks, F-24 to F-25

Shared Memory

definition, 292, 314

directory-based cache coherence, 418–420

DSM, 347–348, 348, 354–355, 378–380

invalidate protocols, 356–357

SMP/DSM definition, 348

terminology comparison, 315

Shared-memory communication, large-scale multiprocessors, I-5

Shared-memory multiprocessors

basic considerations, 351–352

basic structure, 346–347

cache coherence, 352–353

cache coherence enforcement, 354–355

cache coherence example, 357–362

cache coherence extensions, 362–363

data caching, 351–352

definition, L-63

historical background, L-60 to L-61

invalidate protocol implementation, 356–357

limitations, 363–364

performance, 366–378

single-chip multicore case study, 412–418

SMP and snooping limitations, 363–364

snooping coherence implementation, 365–366

snooping coherence protocols, 355–356

WSCs, 435, 441

Shared-memory synchronization, MIPS core extensions, K-21

Shared state

cache block, 357, 359

cache coherence, 360

cache miss calculations, 366–367

coherence extensions, 362

directory-based cache coherence protocol basics, 380, 385

private cache, 358

Sharing addition, segmented virtual memory, B-52 to B-53

Shear algorithms, disk array deconstruction, D-51 to D-52, D-52 to D-54

Shifting over zeros, integer multiplication/division, J-45 to J-47

Short-circuiting See Forwarding

SI format instructions, IBM 360, K-87

Signals, definition, E-2

Signal-to-noise ratio (SNR), wireless networks, E-21

Signed-digit representation

example, J-54

integer multiplication, J-53

Signed number arithmetic, J-7 to J-10

Sign-extended offset, RISC, C-4 to C-5

Significand, J-15

Sign magnitude, J-7

Silicon Graphics 4D/240, L-59

Silicon Graphics Altix, F-76, L-63

Silicon Graphics Challenge, L-60

Silicon Graphics Origin, L-61, L-63

Silicon Graphics systems (SGI)

economies of scale, 456

miss statistics, B-59

multiprocessor software development, 407–409

vector processor history, G-27

SIMD (Single Instruction Stream, Multiple Data Stream)

definition, 10

Fermi GPU architectural innovations, 305–308

GPU conditional branching, 301

GPU examples, 325

GPU programming, 289–290

GPUs vs. vector architectures, 308–309

historical overview, L-55 to L-56

loop-level parallelism, 150

MapReduce, 438

memory bandwidth, 332

multimedia extensions See Multimedia SIMD Extensions

multiprocessor architecture, 346

multithreaded See Multithreaded SIMD Processor

NVIDIA GPU computational structures, 291

NVIDIA GPU ISA, 300

power/DLP issues, 322

speedup via parallelism, 263

supercomputer development, L-43 to L-44

system area network history, F-100

Thread Block mapping, 293

TI 320C6x DSP, E-9

SIMD Instruction

CUDA Thread, 303

definition, 292, 313

DSP media extensions, E-10

function, 150, 291

GPU Memory structures, 304

GPUs, 300, 305

Grid mapping, 293

IBM Blue Gene/L, I-42

Intel AVX, 438

multimedia architecture programming, 285

multimedia extensions, 282–285, 312

multimedia instruction compilers, A-31 to A-32

Multithreaded SIMD Processor block diagram, 294

PTX, 301

Sony PlayStation 2, E-16

Thread of SIMD Instructions, 295–296

thread scheduling, 296–297, 297, 305

vector architectures as superset, 263–264

vector/GPU comparison, 308

Vector Registers, 309

SIMD Lane Registers, definition, 309, 314

SIMD Lanes

definition, 292, 296, 309

DLP, 322

Fermi GPU, 305, 307

GPU, 296–297, 300, 324

GPU conditional branching, 302–303

GPUs vs. vector architectures, 308, 310, 311

instruction scheduling, 297

multimedia extensions, 285

Multimedia SIMD vs. GPUs, 312, 315

multithreaded processor, 294

NVIDIA GPU Memory, 304

synchronization marker, 301

vector vs. GPU, 308, 311

SIMD Processors See also Multithreaded SIMD Processor

block diagram, 294

definition, 292, 309, 313–314

dependent computation elimination, 321

design, 333

Fermi GPU, 296, 305–308

Fermi GTX 480 GPU floorplan, 295, 295–296

GPU conditional branching, 302

GPU vs. MIMD, 329

GPU programming, 289–290

GPUs vs. vector architectures, 310, 310–311

Grid mapping, 293

Multimedia SIMD vs. GPU, 312

multiprocessor architecture, 346

NVIDIA GPU computational structures, 291

NVIDIA GPU Memory structures, 304–305

processor comparisons, 324

Roofline model, 287, 326

system area network history, F-100

SIMD Thread

GPU conditional branching, 301–302

Grid mapping, 293

Multithreaded SIMD processor, 294

NVIDIA GPU, 296

NVIDIA GPU ISA, 298

NVIDIA GPU Memory structures, 305

scheduling example, 297

vector vs. GPU, 308

vector processor, 310

SIMD Thread Scheduler

definition, 292, 314

example, 297

Fermi GPU, 295, 305–307, 306

GPU, 296

SIMT (Single Instruction, Multiple Thread)

GPU programming, 289

vs. SIMD, 314

Warp, 313

Simultaneous multithreading (SMT)

characteristics, 226

definition, 224–225

historical background, L-34 to L-35

IBM eServer p5 575, 399

ideal processors, 215

Intel Core i7, 117–118, 239–241

Java and PARSEC workloads, 403–404

multicore performance/energy efficiency, 402–405

multiprocessing/multithreading-based performance, 398–400

multithreading history, L-35

superscalar processors, 230–232

Single-extended precision floating-point arithmetic, J-33 to J-34

Single Instruction, Multiple Thread See SIMT (Single Instruction, Multiple Thread)

Single Instruction Stream, Multiple Data Stream See SIMD (Single Instruction Stream, Multiple Data Stream)

Single Instruction Stream, Single Data Stream See SISD (Single Instruction Stream, Single Data Stream)

Single-level cache hierarchy, miss rates vs. cache size, B-33

Single-precision floating point

arithmetic, J-33 to J-34

GPU examples, 325

GPU vs. MIMD, 328

MIPS data types, A-34

MIPS operations, A-36

Multimedia SIMD Extensions, 283

operand sizes/types, 12, A-13

as operand type, A-13 to A-14

representation, J-15 to J-16

Single-Streaming Processor (SSP)

Cray X1, G-21 to G-24

Cray X1E, G-24

Single-thread (ST) performance

IBM eServer p5 575, 399, 399

Intel Core i7, 239

ISA, 242

processor comparison, 243

SISD (Single Instruction Stream, Single Data Stream), 10

SIMD computer history, L-55

Skippy algorithm

disk deconstruction, D-49

sample results, D-50

SLAs See Service level agreements (SLAs)

SLI See Scan Line Interleave (SLI)

SLOs See Service level objectives (SLOs)

SLQ See Serve-longest-queue (SLQ) scheme

SLT See Set-on-less-than instructions (SLT)

SM See Distributed shared memory (DSM)

Small Computer System Interface See SCSI (Small Computer System Interface)

Small form factor (SFF) disk, L-79

Smalltalk, SPARC instructions, K-30

Smart interface cards, vs. smart switches, F-85 to F-86

Smartphones

ARM Cortex-A8, 114

mobile vs. server GPUs, 323–324

Smart switches, vs. smart interface cards, F-85 to F-86

SMP See Symmetric multiprocessors (SMP)

SMT See Simultaneous multithreading (SMT)

Snooping cache coherence

basic considerations, 355–356

controller transitions, 421

definition, 354–355

directory-based, 381, 386, 420–421

example, 357–362

implementation, 365–366

large-scale multiprocessor history, L-61

large-scale multiprocessors, I-34 to I-35

latencies, 414

limitations, 363–364

sample types, L-59

single-chip multicore processor case study, 412–418

symmetric shared-memory machines, 366

SNR See Signal-to-noise ratio (SNR)

SoC See System-on-chip (SoC)

Soft errors, definition, 104

Soft real-time

definition, E-3

PMDs, 6

Software as a Service (SaaS)

clusters/WSCs, 8

software development, 4

WSCs, 438

WSCs vs. servers, 433–434

Software development

multiprocessor architecture issues, 407–409

performance vs. productivity, 4

WSC efficiency, 450–452

Software pipelining

example calculations, H-13 to H-14

loops, execution pattern, H-15

technique, H-12 to H-15, H-13

Software prefetching, cache optimization, 131–133

Software speculation

definition, 156

vs. hardware speculation, 221–222

VLIW, 196

Software technology

ILP approaches, 148

large-scale multiprocessors, I-6

large-scale multiprocessor synchronization, I-17 to I-18

network interfaces, F-7

vs. TCP/IP reliance, F-95

Virtual Machines protection, 108

WSC running service, 434–435

Solaris, RAID benchmarks, D-22, D-22 to D-23

Solid-state disks (SSDs)

processor performance/price/power, 52

server energy efficiency, 462

WSC cost-performance, 474–475

Sonic Smart Interconnect, OCNs, F-3

Sony PlayStation 2

block diagram, E-16

embedded multiprocessors, E-14

Emotion Engine case study, E-15 to E-18

Emotion Engine organization, E-18

Sorting, case study, D-64 to D-67

Sort primitive, GPU vs. MIMD, 329

Sort procedure, VAX

bubble sort, K-76

example code, K-77 to K-79

vs. MIPS32, K-80

Source routing, basic concept, F-48

SPARCLE processor, L-34

Sparse matrices

loop-level parallelism dependences, 318–319

vector architectures, 279–280, G-12 to G-14

vector execution time, 271

vector mask registers, 275

Spatial locality

coining of term, L-11

definition, 45, B-2

memory hierarchy design, 72

SPEC benchmarks

branch predictor correlation, 162–164

desktop performance, 38–40

early performance measures, L-7

evolution, 39

fallacies, 56

operands, A-14

performance, 38

performance results reporting, 41

processor performance growth, 3

static branch prediction, C-26 to C-27

storage systems, D-20 to D-21

tournament predictors, 164

two-bit predictors, 165

vector processor history, G-28

SPEC89 benchmarks

branch-prediction buffers, C-28 to C-30, C-30

MIPS FP pipeline performance, C-61 to C-62

misprediction rates, 166

tournament predictors, 165–166

VAX 8700 vs. MIPS M2000, K-82

SPEC92 benchmarks

hardware vs. software speculation, 221

ILP hardware model, 215

MIPS R4000 performance, C-68 to C-69, C-69

misprediction rate, C-27

SPEC95 benchmarks

return address predictors, 206–207, 207

way prediction, 82

SPEC2000 benchmarks

ARM Cortex-A8 memory, 115–116

cache performance prediction, 125–126

cache size and misses per instruction, 126

compiler optimizations, A-29

compulsory miss rate, B-23

data reference sizes, A-44

hardware prefetching, 91

instruction misses, 127

SPEC2006 benchmarks, evolution, 39

SPECCPU2000 benchmarks

displacement addressing mode, A-12

Intel Core i7, 122

server benchmarks, 40

SPECCPU2006 benchmarks

branch predictors, 167

Intel Core i7, 123–124, 240, 240–241

ISA performance and efficiency prediction, 241

Virtual Machines protection, 108

SPECfp benchmarks

hardware prefetching, 91

interconnection network, F-87

ISA performance and efficiency prediction, 241–242

Itanium 2, H-43

MIPS FP pipeline performance, C-60 to C-61

nonblocking caches, 84

tournament predictors, 164

SPECfp92 benchmarks

Intel 80x86 vs. DLX, K-63

Intel 80x86 instruction lengths, K-60

Intel 80x86 instruction mix, K-61

Intel 80x86 operand type distribution, K-59

nonblocking cache, 83

SPECfp2000 benchmarks

hardware prefetching, 92

MIPS dynamic instruction mix, A-42

Sun Ultra 5 execution times, 43

SPECfp2006 benchmarks

Intel processor clock rates, 244

nonblocking cache, 83

SPECfpRate benchmarks

multicore processor performance, 400

multiprocessor cost effectiveness, 407

SMT, 398–400

SMT on superscalar processors, 230

SPEChpc96 benchmark, vector processor history, G-28

Special-purpose machines

historical background, L-4 to L-5

SIMD computer history, L-56

Special-purpose register

compiler writing-architecture relationship, A-30

ISA classification, A-3

VMIPS, 267

Special values

floating point, J-14 to J-15

representation, J-16

SPECINT benchmarks

hardware prefetching, 92

interconnection network, F-87

ISA performance and efficiency prediction, 241–242

Itanium 2, H-43

nonblocking caches, 84

SPECInt92 benchmarks

Intel 80x86 vs. DLX, K-63

Intel 80x86 instruction lengths, K-60

Intel 80x86 instruction mix, K-62

Intel 80x86 operand type distribution, K-59

nonblocking cache, 83

SPECint95 benchmarks, interconnection networks, F-88

SPECINT2000 benchmarks, MIPS dynamic instruction mix, A-41

SPECINT2006 benchmarks

Intel processor clock rates, 244

nonblocking cache, 83

SPECintRate benchmark

multicore processor performance, 400

multiprocessor cost effectiveness, 407

SMT, 398–400

SMT on superscalar processors, 230

SPEC Java Business Benchmark (JBB)

multicore processor performance, 400

multicore processors, 402

multiprocessing/multithreading-based performance, 398

server, 40

Sun T1 multithreading unicore performance, 227–229, 229

SPECJVM98 benchmarks, ISA performance and efficiency prediction, 241

SPECMail benchmark, characteristics, D-20

SPEC-optimized processors, vs. density-optimized, F-85

SPECPower benchmarks

Google server benchmarks, 439–440, 440

multicore processor performance, 400

real-world server considerations, 52–55

WSCs, 463

WSC server energy efficiency, 462–463

SPECRate benchmarks

Intel Core i7, 402

multicore processor performance, 400

multiprocessor cost effectiveness, 407

server benchmarks, 40

SPECRate2000 benchmarks, SMT, 398–400

SPECRatios

execution time examples, 43

geometric means calculations, 43–44

SPECSFS benchmarks

example, D-20

servers, 40

Speculation See also Hardware-based speculation See also Software speculation

advantages/disadvantages, 210–211

compilers See Compiler speculation

concept origins, L-29 to L-30

and energy efficiency, 211–212

FP unit with Tomasulo’s algorithm, 185

hardware vs. software, 221–222

IA-64, H-38 to H-40

ILP studies, L-32 to L-33

Intel Core i7, 123–124

latency hiding in consistency models, 396–397

memory reference, hardware support, H-32

and memory system, 222–223

microarchitectural techniques case study, 247–254

multiple branches, 211

SPECvirt_Sc2010 benchmarks, server, 40

SPECWeb benchmarks

characteristics, D-20

dependability, D-21

parallelism, 44

server benchmarks, 40

SPECWeb99 benchmarks

multiprocessing/multithreading-based performance, 398

Sun T1 multithreading unicore performance, 227, 229

Speedup

Amdahl’s law, 46–47

floating-point addition, J-25 to J-26

integer addition

carry-lookahead, J-37 to J-41

carry-lookahead circuit, J-38

carry-lookahead tree, J-40 to J-41

carry-lookahead tree adder, J-41

carry-select adder, J-43, J-43 to J-44, J-44

carry-skip adder, J-41 to J-43, J-42

overview, J-37

integer division

radix-2 division, J-55

radix-4 division, J-56

radix-4 SRT division, J-57

with single adder, J-54 to J-58

integer multiplication

array multiplier, J-50

Booth recoding, J-49

even/odd array, J-52

with many adders, J-50 to J-54

multipass array multiplier, J-51

signed-digit addition table, J-54

with single adder, J-47 to J-49, J-48

Wallace tree, J-53

integer multiplication/division, shifting over zeros, J-45 to J-47

integer SRT division, J-45 to J-46, J-46

linear, 405–407

via parallelism, 263

pipeline with stalls, C-12 to C-13

relative, 406

scaled, 406–407

switch buffer organizations, F-58 to F-59

true, 406

Sperry-Rand, L-4 to L-5

Spin locks

via coherence, 389–390

large-scale multiprocessor synchronization

barrier synchronization, I-16

exponential back-off, I-17

SPLASH parallel benchmarks, SMT on superscalar processors, 230

Split, GPU vs. MIMD, 329

SPRAM, Sony PlayStation 2 Emotion Engine organization, E-18

Sprowl, Bob, F-99

Squared coefficient of variance, D-27

SRAM See Static random-access memory (SRAM)

SRT division

chip comparison, J-60 to J-61

complications, J-45 to J-46

early computer arithmetic, J-65

example, J-46

historical background, J-63

integers, with adder, J-55 to J-57

radix-4, J-56, J-57

SSDs See Solid-state disks (SSDs)

SSE See Intel Streaming SIMD Extension (SSE)

SS format instructions, IBM 360, K-85 to K-88

ssj_ops See Server side Java operations per second (ssj_ops)

SSP See Single-Streaming Processor (SSP)

Stack architecture

and compiler technology, A-27

flaws vs. success, A-44 to A-45

historical background, L-16 to L-17

Intel 80x86, K-48, K-52, K-54

operands, A-3 to A-4

Stack frame, VAX, K-71

Stack pointer, VAX, K-71

Stack or Thread Local Storage, definition, 292

Stale copy, cache coherency, 112

Stall cycles

advanced directory protocol case study, 424

average memory access time, B-17

branch hazards, C-21

branch scheme performance, C-25

definition, B-4 to B-5

example calculation, B-31

loop unrolling, 161

MIPS FP pipeline performance, C-60

miss rate calculation, B-6

out-of-order processors, B-20 to B-21

performance equations, B-22

pipeline performance, C-12 to C-13

single-chip multicore multiprocessor case study, 414–418

structural hazards, C-15

Stalls

AMD Opteron data cache, B-15

ARM Cortex-A8, 235, 235–236

branch hazards, C-42

data hazard minimization, C-16 to C-19, C-18

data hazards requiring, C-19 to C-21

delayed branch, C-65

Intel Core i7, 239–241

microarchitectural techniques case study, 252

MIPS FP pipeline performance, C-60 to C-61, C-61 to C-62

MIPS pipeline multicycle operations, C-51

MIPS R4000, C-64, C-67, C-67 to C-69, C-69

miss rate calculations, B-31 to B-32

necessity, C-21

nonblocking cache, 84

pipeline performance, C-12 to C-13

from RAW hazards, FP code, C-55

structural hazard, C-15

VLIW sample code, 252

VMIPS, 268

Standardization, commercial interconnection networks, F-63 to F-64

Stardent-1500, Livermore Fortran kernels, 331

Start-up overhead, vs. peak performance, 331

Start-up time

DAXPY on VMIPS, G-20

memory banks, 276

page size selection, B-47

peak performance, 331

vector architectures, 331, G-4, G-4, G-8

vector convoys, G-4

vector execution time, 270–271

vector performance, G-2

vector performance measures, G-16

vector processor, G-7 to G-9, G-25

VMIPS, G-5

State transition diagram

directory vs. cache, 385

directory-based cache coherence, 383

Statically based exploitation, ILP, H-2

Static power

basic equation, 26

SMT, 231

Static random-access memory (SRAM)

characteristics, 97–98

dependability, 104

fault detection pitfalls, 58

power, 26

vector memory systems, G-9

vector processor, G-25

yield, 32

Static scheduling

definition, C-71

ILP, 192–196

and unoptimized code, C-81

Sticky bit, J-18

Stop & Go See Xon/Xoff

Storage area networks

dependability benchmarks, D-21 to D-23, D-22

historical overview, F-102 to F-103

I/O system as black box, D-23

Storage systems

asynchronous I/O and OSes, D-35

Berkeley’s Tertiary Disk project, D-12

block servers vs. filers, D-34 to D-35

bus replacement, D-34

component failure, D-43

computer system availability, D-43 to D-44, D-44

dependability benchmarks, D-21 to D-23

dirty bits, D-61 to D-64

disk array deconstruction case study, D-51 to D-55, D-52 to D-55

disk arrays, D-6 to D-10

disk deconstruction case study, D-48 to D-51, D-50

disk power, D-5

disk seeks, D-45 to D-47

disk storage, D-2 to D-5

file system benchmarking, D-20, D-20 to D-21

Internet Archive Cluster See Internet Archive Cluster

I/O performance, D-15 to D-16

I/O subsystem design, D-59 to D-61

I/O system design/evaluation, D-36 to D-37

mail server benchmarking, D-20 to D-21

NetApp FAS6000 filer, D-41 to D-42

operator dependability, D-13 to D-15

OS-scheduled disk access, D-44 to D-45, D-45

point-to-point links, D-34, D-34

queue I/O request calculations, D-29

queuing theory, D-23 to D-34

RAID performance prediction, D-57 to D-59

RAID reconstruction case study, D-55 to D-57

real faults and failures, D-6 to D-10

reliability, D-44

response time restrictions for benchmarks, D-18

seek distance comparison, D-47

seek time vs. distance, D-46

server utilization calculation, D-28 to D-29

sorting case study, D-64 to D-67

Tandem Computers, D-12 to D-13

throughput vs. response time, D-16, D-16 to D-18, D-17

TP benchmarks, D-18 to D-19

transactions components, D-17

web server benchmarking, D-20 to D-21

WSC vs. datacenter costs, 455

WSCs, 442–443

Store conditional

locks via coherence, 391

synchronization, 388–389

Store-and-forward packet switching, F-51

Store instructions See also Load-store instruction set architecture

definition, C-4

instruction execution, 186

ISA, 11, A-3

MIPS, A-33, A-36

NVIDIA GPU ISA, 298

Opteron data cache, B-15

RISC instruction set, C-4 to C-6, C-10

vector architectures, 310

Streaming Multiprocessor

definition, 292, 313–314

Fermi GPU, 307

Strecker, William, K-65

Strided accesses

Multimedia SIMD Extensions, 283

Roofline model, 287

TLB interaction, 323

Strided addressing See also Unit stride addressing

multimedia instruction compiler support, A-31 to A-32

Strides

gather-scatter, 280

highly parallel memory systems, 133

multidimensional arrays in vector architectures, 278–279

NVIDIA GPU ISA, 300

vector memory systems, G-10 to G-11

VMIPS, 266

String operations, Intel 80x86, K-51, K-53

Stripe, disk array deconstruction, D-51

Striping

disk arrays, D-6

RAID, D-9

Strip-Mined Vector Loop

convoys, G-5

DAXPY on VMIPS, G-20

definition, 292

multidimensional arrays, 278

Thread Block comparison, 294

vector-length registers, 274

Strip mining

DAXPY on VMIPS, G-20

GPU conditional branching, 303

GPUs vs. vector architectures, 311

NVIDIA GPU, 291

vector, 275

VLRs, 274–275

Strong scaling, Amdahl’s law and parallel computers, 407

Structural hazards

basic considerations, C-13 to C-16

definition, C-11

MIPS pipeline, C-71

MIPS scoreboarding, C-78 to C-79

pipeline stall, C-15

vector execution time, 268–269

Structural stalls, MIPS R4000 pipeline, C-68 to C-69

Subset property, and inclusion, 397

Summary overflow condition code, PowerPC, K-10 to K-11

Sun Microsystems

cache optimization, B-38

fault detection pitfalls, 58

memory dependability, 104

Sun Microsystems Enterprise, L-60

Sun Microsystems Niagara (T1/T2) processors

characteristics, 227

CPI and IPC, 399

fine-grained multithreading, 224, 225, 226–229

manufacturing cost, 62

multicore processor performance, 400–401

multiprocessing/multithreading-based performance, 398–400

multithreading history, L-34

T1 multithreading unicore performance, 227–229

Sun Microsystems SPARC

addressing modes, K-5

ALU operands, A-6

arithmetic/logical instructions, K-11, K-31

branch conditions, A-19

conditional branches, K-10, K-17

conditional instructions, H-27

constant extension, K-9

conventions, K-13

data transfer instructions, K-10

fast traps, K-30

features, K-44

FP instructions, K-23

instruction list, K-31 to K-32

integer arithmetic, J-12

integer overflow, J-11

ISA, A-2

LISP, K-30

MIPS core extensions, K-22 to K-23

overlapped integer/FP operations, K-31

precise exceptions, C-60

RISC history, L-20

as RISC system, K-4

Smalltalk, K-30

synchronization history, L-64

unique instructions, K-29 to K-32

Sun Microsystems SPARCCenter, L-60

Sun Microsystems SPARCstation-2, F-88

Sun Microsystems SPARCstation-20, F-88

Sun Microsystems SPARC V8, floating-point precisions, J-33

Sun Microsystems SPARC VIS

characteristics, K-18

multimedia support, E-11, K-18

Sun Microsystems Ultra 5, SPECfp2000 execution times, 43

Sun Microsystems UltraSPARC, L-62, L-73

Sun Microsystems UltraSPARC T1 processor, characteristics, F-73

Sun Modular Datacenter, L-74 to L-75

Superblock scheduling

basic process, H-21 to H-23

compiler history, L-31

example, H-22

Supercomputers

commercial interconnection networks, F-63

direct network topology, F-37

low-dimensional topologies, F-100

SAN characteristics, F-76

SIMD, development, L-43 to L-44

vs. WSCs, 8

Superlinear performance, multiprocessors, 406

Superpipelining

definition, C-61

performance histories, 20

Superscalar processors

coining of term, L-29

ideal processors, 214–215

ILP, 192–197, 246

studies, L-32

microarchitectural techniques case study, 250–251

multithreading support, 225

recent advances, L-33 to L-34

rename table and register substitution logic, 251

SMT, 230–232

VMIPS, 267

Superscalar registers, sample renaming code, 251

Supervisor process, virtual memory protection, 106

Sussenguth, Ed, L-28

Sutherland, Ivan, L-34

SVM See Secure Virtual Machine (SVM)

Swap procedure, VAX

code example, K-72, K-74

full procedure, K-75 to K-76

overview, K-72 to K-76

Swim, data cache misses, B-10

Switched-media networks

basic characteristics, F-24

vs. buses, F-2

effective bandwidth vs. nodes, F-28

example, F-22

latency and effective bandwidth, F-26 to F-28

vs. shared-media networks, F-24 to F-25

Switched networks

centralized, F-30 to F-34

DOR, F-46

OCN history, F-104

topology, F-40

Switches

array, WSCs, 443–444

Beneš networks, F-33

context, 307, B-49

early LANs and WANs, F-29

Ethernet switches, 16, 20, 53, 441–444, 464–465, 469

interconnecting node calculations, F-35

vs. NIC, F-85 to F-86, F-86

process switch, 224, B-37, B-49 to B-50

storage systems, D-34

switched-media networks, F-24

WSC hierarchy, 441–442, 442

WSC infrastructure, 446

WSC network bottleneck, 461

Switch fabric, switched-media networks, F-24

Switching

commercial interconnection networks, F-56

interconnection networks, F-22, F-27, F-50 to F-52

network impact, F-52 to F-55

performance considerations, F-92 to F-93

SAN characteristics, F-76

switched-media networks, F-24

system area network history, F-100

Switch microarchitecture

basic microarchitecture, F-55 to F-58

buffer organizations, F-58 to F-60

enhancements, F-62

HOL blocking, F-59

input-output-buffered switch, F-57

pipelining, F-60 to F-61, F-61

Switch ports

centralized switched networks, F-30

interconnection network topology, F-29

Switch statements

control flow instruction addressing modes, A-18

GPU, 301

Syllable, IA-64, H-35

Symbolic loop unrolling, software pipelining, H-12 to H-15, H-13

Symmetric multiprocessors (SMP)

characteristics, I-45

communication calculations, 350

directory-based cache coherence, 354

first vector computers, L-47, L-49

limitations, 363–364

snooping coherence protocols, 354–355

system area network history, F-101

TLP, 345

Symmetric shared-memory multiprocessors See also Centralized shared-memory multiprocessors

data caching, 351–352

limitations, 363–364

performance

commercial workload, 367–369

commercial workload measurement, 369–374

multiprogramming and OS workload, 374–378

overview, 366–367

scientific workloads, I-21 to I-26, I-23 to I-25

Synapse N + 1, L-59

Synchronization

AltaVista search, 369

basic considerations, 386–387

basic hardware primitives, 387–389

consistency models, 395–396

cost, 403

Cray X1, G-23

definition, 375

GPU comparisons, 329

GPU conditional branching, 300–303

historical background, L-64

large-scale multiprocessors

barrier synchronization, I-13 to I-16, I-14, I-16

challenges, I-12 to I-16

hardware primitives, I-18 to I-21

sense-reversing barrier, I-21

software implementations, I-17 to I-18

tree-based barriers, I-19

locks via coherence, 389–391

message-passing communication, I-5

MIMD, 10

MIPS core extensions, K-21

programmer’s viewpoint, 393–394

PTX instruction set, 298–299

relaxed consistency models, 394–395

single-chip multicore processor case study, 412–418

vector vs. GPU, 311

VLIW, 196

WSCs, 434

Synchronous dynamic random-access memory (SDRAM)

ARM Cortex-A8, 117

DRAM, 99

vs. Flash memory, 103

IBM Blue Gene/L, I-42

Intel Core i7, 121

performance, 100

power consumption, 102, 103

SDRAM timing diagram, 139

Synchronous event, exception requirements, C-44 to C-45

Synchronous I/O, definition, D-35

Synonyms

address translation, B-38

dependability, 34

Synthetic benchmarks

definition, 37

typical program fallacy, A-43

System area networks, historical overview, F-100 to F-102

System calls

CUDA Thread, 297

multiprogrammed workload, 378

virtualization/paravirtualization performance, 141

virtual memory protection, 106

System interface controller (SIF), Intel SCCC, F-70

System-on-chip (SoC)

cell phone, E-24

cross-company interoperability, F-64

embedded systems, E-3

Sanyo digital cameras, E-20

Sanyo VPC-SX500 digital camera, E-19

shared-media networks, F-23

System Performance and Evaluation Cooperative (SPEC) See SPEC benchmarks

System Processor

definition, 309

DLP, 262, 322

Fermi GPU, 306

GPU issues, 330

GPU programming, 288–289

NVIDIA GPU ISA, 298

NVIDIA GPU Memory, 305

processor comparisons, 323–324

synchronization, 329

vector vs. GPU, 311–312

System response time, transactions, D-16, D-17

Systems on a chip (SOC), cost trends, 28

System/storage area networks (SANs)

characteristics, F-3 to F-4

communication protocols, F-8

congestion management, F-65

cross-company interoperability, F-64

effective bandwidth, F-18

example system, F-72 to F-74

fat trees, F-34

fault tolerance, F-67

InfiniBand example, F-74 to F-77

interconnection network domain relationship, F-4

LAN history, F-99

latency and effective bandwidth, F-26 to F-28

latency vs. nodes, F-27

packet latency, F-13, F-14 to F-16

routing algorithms, F-48

software overhead, F-91

TCP/IP reliance, F-95

time of flight, F-13

topology, F-30

System Virtual Machines, definition, 107


Table of Contents for Computer Architecture: A Quantitative Approach
