B

Back-off time, shared-media networks, F-23

Backpressure, congestion management, F-65

Backside bus, centralized shared-memory multiprocessors, 351

Balanced systems, sorting case study, D-64 to D-67

Balanced tree, MINs with nonblocking, F-34

Bandwidth See also Throughput

arbitration, F-49

and cache miss, B-2 to B-3

centralized shared-memory multiprocessors, 351–352

communication mechanism, I-3

congestion management, F-64 to F-65

Cray Research T3D, F-87

DDR DRAMS and DIMMS, 101

definition, F-13

DSM architecture, 379

Ethernet and bridges, F-78

FP arithmetic, J-62

GDRAM, 322–323

GPU computation, 327–328

GPU Memory, 327

ILP instruction fetch

basic considerations, 202–203

branch-target buffers, 203–206

integrated units, 207–208

return address predictors, 206–207

interconnection networks, F-28

multi-device networks, F-25 to F-29

performance considerations, F-89

two-device networks, F-12 to F-20

vs. latency, 18–19, 19

memory, and vector performance, 332

memory hierarchy, 126

network performance and topology, F-41

OCN history, F-103

performance milestones, 20

point-to-point links and switches, D-34

routing, F-50 to F-52

routing/arbitration/switching impact, F-52

shared- vs. switched-media networks, F-22

SMP limitations, 363

switched-media networks, F-24

system area network history, F-101

vs. TCP/IP reliance, F-95

and topology, F-39

vector load/store units, 276–277

WSC memory hierarchy, 443–444, 444

Bandwidth gap, disk storage, D-3

Banerjee, Utpal, L-30 to L-31

Bank busy time, vector memory systems, G-9

Banked memory See also Memory banks

and graphics memory, 322–323

vector architectures, G-10

Banks, Fermi GPUs, 297

Barcelona Supercomputer Center, F-76

Barnes

characteristics, I-8 to I-9

distributed-memory multiprocessor, I-32

symmetric shared-memory multiprocessors, I-22, I-23, I-25

Barnes-Hut n-body algorithm, basic concept, I-8 to I-9

Barriers

commercial workloads, 370

Cray X1, G-23

fetch-and-increment, I-20 to I-21

hardware primitives, 387

large-scale multiprocessor synchronization, I-13 to I-16, I-14, I-16, I-19, I-20

synchronization, 298, 313, 329

BARRNet See Bay Area Research Network (BARRNet)

Based indexed addressing mode, Intel 80x86, K-49, K-58

Base field, IA-32 descriptor table, B-52 to B-53

Base station

cell phones, E-23

wireless networks, E-22

Basic block, ILP, 149

Batch processing workloads

WSC goals/requirements, 433

WSC MapReduce and Hadoop, 437–438

Bay Area Research Network (BARRNet), F-80

BBN Butterfly, L-60

BBN Monarch, L-60

Before rounding rule, J-36

Benchmarking See also specific benchmark suites

desktop, 38–40

EEMBC, E-12

embedded applications

basic considerations, E-12

power consumption and efficiency, E-13

fallacies, 56

instruction set operations, A-15

as performance measurement, 37–41

real-world server considerations, 52–55

response time restrictions, D-18

server performance, 40–41

sorting case study, D-64 to D-67

Beneš topology

centralized switched networks, F-33

example, F-33

BER See Bit error rate (BER)

Berkeley’s Tertiary Disk project

failure statistics, D-13

overview, D-12

system log, D-43

Berners-Lee, Tim, F-98

Bertram, Jack, L-28

Best-case lower bounds, multi-device interconnection networks, F-25

Best-case upper bounds

multi-device interconnection networks, F-26

network performance and topology, F-41

Between instruction exceptions, definition, C-45

Biased exponent, J-15

Bidirectional multistage interconnection networks

Beneš topology, F-33

characteristics, F-33 to F-34

SAN characteristics, F-76

Bidirectional rings, topology, F-35 to F-36

Big Endian

interconnection networks, F-12

memory address interpretation, A-7

MIPS core extensions, K-20 to K-21

MIPS data transfers, A-34

Bigtable (Google), 438, 441

BINAC, L-5

Binary code compatibility

embedded systems, E-15

VLIW processors, 196

Binary-coded decimal, definition, A-14

Binary-to-decimal conversion, FP precisions, J-34

Bing search

delays and user behavior, 451

latency effects, 450–452

WSC processor cost-performance, 473

Bisection bandwidth

as network cost constraint, F-89

network performance and topology, F-41

NEWS communication, F-42

topology, F-39

Bisection bandwidth, WSC array switch, 443

Bisection traffic fraction, network performance and topology, F-41

Bit error rate (BER), wireless networks, E-21

Bit rot, case study, D-61 to D-64

Bit selection, block placement, B-7

Black box network

basic concept, F-5 to F-6

effective bandwidth, F-17

performance, F-12

switched-media networks, F-24

switched network topologies, F-40

Block addressing

block identification, B-7 to B-8

interleaved cache banks, 86

memory hierarchy basics, 74

Blocked floating point arithmetic, DSP, E-6

Block identification

memory hierarchy considerations, B-7 to B-9

virtual memory, B-44 to B-45

Blocking

benchmark fallacies, 56

centralized switched networks, F-32

direct networks, F-38

HOL See Head-of-line (HOL) blocking

network performance and topology, F-41

Blocking calls, shared-memory multiprocessor workload, 369

Blocking factor, definition, 90

Block multithreading, definition, L-34

Block offset

block identification, B-7 to B-8

cache optimization, B-38

definition, B-7 to B-8

direct-mapped cache, B-9

example, B-9

main memory, B-44

Opteron data cache, B-13, B-13 to B-14

Block placement

memory hierarchy considerations, B-7

virtual memory, B-44

Block replacement

memory hierarchy considerations, B-9 to B-10

virtual memory, B-45

Blocks See also Cache block; Thread Block

ARM Cortex-A8, 115

vs. bytes per reference, 378

compiler optimizations, 89–90

definition, B-2

disk array deconstruction, D-51, D-55

disk deconstruction case study, D-48 to D-51

global code scheduling, H-15 to H-16

L3 cache size, misses per instruction, 371

LU kernel, I-8

memory hierarchy basics, 74

memory in cache, B-61

placement in main memory, B-44

RAID performance prediction, D-57 to D-58

TI TMS320C55 DSP, E-8

uncached state, 384

Block servers, vs. filers, D-34 to D-35

Block size

vs. access time, B-28

memory hierarchy basics, 76

vs. miss rate, B-27

Block transfer engine (BLT)

Cray Research T3D, F-87

interconnection network protection, F-87

BLT See Block transfer engine (BLT)

Body of Vectorized Loop

definition, 292, 313

GPU hardware, 295–296, 311

GPU Memory structure, 304

NVIDIA GPU, 296

SIMD Lane Registers, 314

Thread Block Scheduler, 314

Boggs, David, F-99

BOMB, L-4

Booth recoding, J-8 to J-9, J-9, J-10 to J-11

chip comparison, J-60 to J-61

integer multiplication, J-49

Bose-Einstein formula, definition, 30

Bounds checking, segmented virtual memory, B-52

Branch byte, VAX, K-71

Branch delay slot

characteristics, C-23 to C-25

control hazards, C-41

MIPS R4000, C-64

scheduling, C-24

Branches

canceling, C-24 to C-25

conditional branches, 300–303, A-17, A-19 to A-20, A-21

control flow instructions, A-16, A-18

delayed, C-23

delay slot, C-65

IBM 360, K-86 to K-87

instructions, K-25

MIPS control flow instructions, A-38

MIPS operations, A-35

nullifying, C-24 to C-25

RISC instruction set, C-5

VAX, K-71 to K-72

WCET, E-4

Branch folding, definition, 206

Branch hazards

basic considerations, C-21

penalty reduction, C-22 to C-25

pipeline issues, C-39 to C-42

scheme performance, C-25 to C-26

stall reduction, C-42

Branch history table, basic scheme, C-27 to C-30

Branch offsets, control flow instructions, A-18

Branch penalty

examples, 205

instruction fetch bandwidth, 203–206

reduction, C-22 to C-25

simple scheme examples, C-25

Branch prediction

accuracy, C-30

branch cost reduction, 162–167

correlation, 162–164

cost reduction, C-26

dynamic, C-27 to C-30

early schemes, L-27 to L-28

ideal processor, 214

ILP exploitation, 201

instruction fetch bandwidth, 205

integrated instruction fetch units, 207

Intel Core i7, 166–167, 239–241

misprediction rates on SPEC89, 166

static, C-26 to C-27

trace scheduling, H-19

two-bit predictor comparison, 165

Branch-prediction buffers, basic considerations, C-27 to C-30, C-29

Branch registers

IA-64, H-34

PowerPC instructions, K-32 to K-33

Branch stalls, MIPS R4000 pipeline, C-67

Branch-target address

branch hazards, C-42

MIPS control flow instructions, A-38

MIPS pipeline, C-36, C-37

MIPS R4000, C-25

pipeline branches, C-39

RISC instruction set, C-5

Branch-target buffers

ARM Cortex-A8, 233

branch hazard stalls, C-42

example, 203

instruction fetch bandwidth, 203–206

instruction handling, 204

MIPS control flow instructions, A-38

Branch-target cache See Branch-target buffers

Brewer, Eric, L-73

Bridges

and bandwidth, F-78

definition, F-78

Bubbles

and deadlock, F-47

routing comparison, F-54

stall as, C-13

Bubble sort, code example, K-76

Buckets, D-26

Buffered crossbar switch, switch microarchitecture, F-62

Buffered wormhole switching, F-51

Buffers

branch-prediction, C-27 to C-30, C-29

branch-target, 203–206, 204, 233, A-38, C-42

DSM multiprocessor cache coherence, I-38 to I-40

Intel SCCC, F-70

interconnection networks, F-10 to F-11

memory, 208

MIPS scoreboarding, C-74

network interface functions, F-7

ROB, 184–192, 188–189, 199, 208–210, 238

switch microarchitecture, F-58 to F-60

TLB See Translation lookaside buffer (TLB)

translation buffer, B-45 to B-46

write buffer, B-11, B-14, B-32, B-35 to B-36

Bundles

IA-64, H-34 to H-35, H-37

Itanium 2, H-41

Burks, Arthur, L-3

Burroughs B5000, L-16

Bus-based coherent multiprocessors, L-59 to L-60

Buses

barrier synchronization, I-16

cache coherence, 391

centralized shared-memory multiprocessors, 351

definition, 351

dynamic scheduling with Tomasulo’s algorithm, 172, 175

Google WSC servers, 469

I/O bus replacements, D-34, D-34

large-scale multiprocessor synchronization, I-12 to I-13

NEWS communication, F-42

scientific workloads on symmetric shared-memory multiprocessors, I-25

Sony PlayStation 2 Emotion Engine, E-18

vs. switched networks, F-2

switch microarchitecture, F-55 to F-56

Tomasulo’s algorithm, 180, 182

Bypassing See also Forwarding

data hazards requiring stalls, C-19 to C-20

dynamically scheduled pipelines, C-70 to C-71

MIPS R4000, C-65

SAN example, F-74

Byte displacement addressing, VAX, K-67

Byte offset

misaligned addresses, A-8

PTX instructions, 300

Bytes

aligned/misaligned addresses, A-8

arithmetic intensity example, 286

Intel 80x86 integer operations, K-51

memory address interpretation, A-7 to A-8

MIPS data transfers, A-34

MIPS data types, A-34

operand types/sizes, A-14

per reference, vs. block size, 378

Byte/word/long displacement deferred addressing, VAX, K-67

Table of Contents for Computer Architecture: A Quantitative Approach
