B

Back-off time, shared-media networks, F-23
Backpressure, congestion management, F-65
Backside bus, centralized shared-memory multiprocessors, 351
Balanced systems, sorting case study, D-64 to D-67
Balanced tree, MINs with nonblocking, F-34
Bandwidth See also Throughput
arbitration, F-49
and cache miss, B-2 to B-3
centralized shared-memory multiprocessors, 351–352
communication mechanism, I-3
congestion management, F-64 to F-65
Cray Research T3D, F-87
DDR DRAMs and DIMMs, 101
definition, F-13
DSM architecture, 379
Ethernet and bridges, F-78
FP arithmetic, J-62
GDRAM, 322–323
GPU computation, 327–328
GPU Memory, 327
ILP instruction fetch
basic considerations, 202–203
branch-target buffers, 203–206
integrated units, 207–208
return address predictors, 206–207
interconnection networks, F-28
multi-device networks, F-25 to F-29
performance considerations, F-89
two-device networks, F-12 to F-20
vs. latency, 18–19, 19
memory, and vector performance, 332
memory hierarchy, 126
network performance and topology, F-41
OCN history, F-103
performance milestones, 20
point-to-point links and switches, D-34
routing, F-50 to F-52
routing/arbitration/switching impact, F-52
shared- vs. switched-media networks, F-22
SMP limitations, 363
switched-media networks, F-24
system area network history, F-101
vs. TCP/IP reliance, F-95
and topology, F-39
vector load/store units, 276–277
WSC memory hierarchy, 443–444, 444
Bandwidth gap, disk storage, D-3
Banerjee, Utpal, L-30 to L-31
Bank busy time, vector memory systems, G-9
Banked memory See also Memory banks
and graphics memory, 322–323
vector architectures, G-10
Banks, Fermi GPUs, 297
Barcelona Supercomputer Center, F-76
Barnes
characteristics, I-8 to I-9
distributed-memory multiprocessor, I-32
symmetric shared-memory multiprocessors, I-22, I-23, I-25
Barnes-Hut n-body algorithm, basic concept, I-8 to I-9
Barriers
commercial workloads, 370
Cray X1, G-23
fetch-and-increment, I-20 to I-21
hardware primitives, 387
large-scale multiprocessor synchronization, I-13 to I-16, I-14, I-16, I-19, I-20
synchronization, 298, 313, 329
Based indexed addressing mode, Intel 80x86, K-49, K-58
Base field, IA-32 descriptor table, B-52 to B-53
Base station
cell phones, E-23
wireless networks, E-22
Basic block, ILP, 149
Batch processing workloads
WSC goals/requirements, 433
WSC MapReduce and Hadoop, 437–438
Bay Area Research Network (BARRNet), F-80
BBN Butterfly, L-60
BBN Monarch, L-60
Before rounding rule, J-36
Benchmarking See also specific benchmark suites
desktop, 38–40
EEMBC, E-12
embedded applications
basic considerations, E-12
power consumption and efficiency, E-13
fallacies, 56
instruction set operations, A-15
as performance measurement, 37–41
real-world server considerations, 52–55
response time restrictions, D-18
server performance, 40–41
sorting case study, D-64 to D-67
Beneš topology
centralized switched networks, F-33
example, F-33
Berkeley’s Tertiary Disk project
failure statistics, D-13
overview, D-12
system log, D-43
Berners-Lee, Tim, F-98
Bertram, Jack, L-28
Best-case lower bounds, multi-device interconnection networks, F-25
Best-case upper bounds
multi-device interconnection networks, F-26
network performance and topology, F-41
Between instruction exceptions, definition, C-45
Biased exponent, J-15
Bidirectional multistage interconnection networks
Beneš topology, F-33
characteristics, F-33 to F-34
SAN characteristics, F-76
Bidirectional rings, topology, F-35 to F-36
Big Endian
interconnection networks, F-12
memory address interpretation, A-7
MIPS core extensions, K-20 to K-21
MIPS data transfers, A-34
Bigtable (Google), 438, 441
BINAC, L-5
Binary code compatibility
embedded systems, E-15
VLIW processors, 196
Binary-coded decimal, definition, A-14
Binary-to-decimal conversion, FP precisions, J-34
Bing search
delays and user behavior, 451
latency effects, 450–452
WSC processor cost-performance, 473
Bisection bandwidth
as network cost constraint, F-89
network performance and topology, F-41
NEWS communication, F-42
topology, F-39
Bisection bandwidth, WSC array switch, 443
Bisection traffic fraction, network performance and topology, F-41
Bit error rate (BER), wireless networks, E-21
Bit rot, case study, D-61 to D-64
Bit selection, block placement, B-7
Black box network
basic concept, F-5 to F-6
effective bandwidth, F-17
performance, F-12
switched-media networks, F-24
switched network topologies, F-40
Block addressing
block identification, B-7 to B-8
interleaved cache banks, 86
memory hierarchy basics, 74
Blocked floating point arithmetic, DSP, E-6
Block identification
memory hierarchy considerations, B-7 to B-9
virtual memory, B-44 to B-45
Blocking
benchmark fallacies, 56
centralized switched networks, F-32
direct networks, F-38
network performance and topology, F-41
Blocking calls, shared-memory multiprocessor workload, 369
Blocking factor, definition, 90
Block multithreading, definition, L-34
Block offset
block identification, B-7 to B-8
cache optimization, B-38
definition, B-7 to B-8
direct-mapped cache, B-9
example, B-9
main memory, B-44
Opteron data cache, B-13, B-13 to B-14
Block placement
memory hierarchy considerations, B-7
virtual memory, B-44
Block replacement
memory hierarchy considerations, B-9 to B-10
virtual memory, B-45
Blocks See also Cache block; Thread Block
ARM Cortex-A8, 115
vs. bytes per reference, 378
compiler optimizations, 89–90
definition, B-2
disk array deconstruction, D-51, D-55
disk deconstruction case study, D-48 to D-51
global code scheduling, H-15 to H-16
L3 cache size, misses per instruction, 371
LU kernel, I-8
memory hierarchy basics, 74
memory in cache, B-61
placement in main memory, B-44
RAID performance prediction, D-57 to D-58
TI TMS320C55 DSP, E-8
uncached state, 384
Block servers, vs. filers, D-34 to D-35
Block size
vs. access time, B-28
memory hierarchy basics, 76
vs. miss rate, B-27
Block transfer engine (BLT)
Cray Research T3D, F-87
interconnection network protection, F-87
Body of Vectorized Loop
definition, 292, 313
GPU hardware, 295–296, 311
GPU Memory structure, 304
NVIDIA GPU, 296
SIMD Lane Registers, 314
Thread Block Scheduler, 314
Boggs, David, F-99
BOMB, L-4
Booth recoding, J-8 to J-9, J-9, J-10 to J-11
chip comparison, J-60 to J-61
integer multiplication, J-49
Bose-Einstein formula, definition, 30
Bounds checking, segmented virtual memory, B-52
Branch byte, VAX, K-71
Branch delay slot
characteristics, C-23 to C-25
control hazards, C-41
MIPS R4000, C-64
scheduling, C-24
Branches
canceling, C-24 to C-25
conditional branches, 300–303, A-17, A-19 to A-20, A-21
control flow instructions, A-16, A-18
delayed, C-23
delay slot, C-65
IBM 360, K-86 to K-87
instructions, K-25
MIPS control flow instructions, A-38
MIPS operations, A-35
nullifying, C-24 to C-25
RISC instruction set, C-5
VAX, K-71 to K-72
WCET, E-4
Branch folding, definition, 206
Branch hazards
basic considerations, C-21
penalty reduction, C-22 to C-25
pipeline issues, C-39 to C-42
scheme performance, C-25 to C-26
stall reduction, C-42
Branch history table, basic scheme, C-27 to C-30
Branch offsets, control flow instructions, A-18
Branch penalty
examples, 205
instruction fetch bandwidth, 203–206
reduction, C-22 to C-25
simple scheme examples, C-25
Branch prediction
accuracy, C-30
branch cost reduction, 162–167
correlation, 162–164
cost reduction, C-26
dynamic, C-27 to C-30
early schemes, L-27 to L-28
ideal processor, 214
ILP exploitation, 201
instruction fetch bandwidth, 205
integrated instruction fetch units, 207
Intel Core i7, 166–167, 239–241
misprediction rates on SPEC89, 166
static, C-26 to C-27
trace scheduling, H-19
two-bit predictor comparison, 165
Branch-prediction buffers, basic considerations, C-27 to C-30, C-29
Branch registers
IA-64, H-34
PowerPC instructions, K-32 to K-33
Branch stalls, MIPS R4000 pipeline, C-67
Branch-target address
branch hazards, C-42
MIPS control flow instructions, A-38
MIPS pipeline, C-36, C-37
MIPS R4000, C-25
pipeline branches, C-39
RISC instruction set, C-5
Branch-target buffers
ARM Cortex-A8, 233
branch hazard stalls, C-42
example, 203
instruction fetch bandwidth, 203–206
instruction handling, 204
MIPS control flow instructions, A-38
Branch-target cache See Branch-target buffers
Brewer, Eric, L-73
Bridges
and bandwidth, F-78
definition, F-78
Bubbles
and deadlock, F-47
routing comparison, F-54
stall as, C-13
Bubble sort, code example, K-76
Buckets, D-26
Buffered crossbar switch, switch microarchitecture, F-62
Buffered wormhole switching, F-51
Buffers
branch-prediction, C-27 to C-30, C-29
branch-target, 203–206, 204, 233, A-38, C-42
DSM multiprocessor cache coherence, I-38 to I-40
Intel SCCC, F-70
interconnection networks, F-10 to F-11
memory, 208
MIPS scoreboarding, C-74
network interface functions, F-7
switch microarchitecture, F-58 to F-60
translation buffer, B-45 to B-46
write buffer, B-11, B-14, B-32, B-35 to B-36
Bundles
IA-64, H-34 to H-35, H-37
Itanium 2, H-41
Burks, Arthur, L-3
Burroughs B5000, L-16
Bus-based coherent multiprocessors, L-59 to L-60
Buses
barrier synchronization, I-16
cache coherence, 391
centralized shared-memory multiprocessors, 351
definition, 351
dynamic scheduling with Tomasulo’s algorithm, 172, 175
Google WSC servers, 469
I/O bus replacements, D-34, D-34
large-scale multiprocessor synchronization, I-12 to I-13
NEWS communication, F-42
scientific workloads on symmetric shared-memory multiprocessors, I-25
Sony PlayStation 2 Emotion Engine, E-18
vs. switched networks, F-2
switch microarchitecture, F-55 to F-56
Tomasulo’s algorithm, 180, 182
Bypassing See also Forwarding
data hazards requiring stalls, C-19 to C-20
dynamically scheduled pipelines, C-70 to C-71
MIPS R4000, C-65
SAN example, F-74
Byte displacement addressing, VAX, K-67
Byte offset
misaligned addresses, A-8
PTX instructions, 300
Bytes
aligned/misaligned addresses, A-8
arithmetic intensity example, 286
Intel 80x86 integer operations, K-51
memory address interpretation, A-7 to A-8
MIPS data transfers, A-34
MIPS data types, A-34
operand types/sizes, A-14
per reference, vs. block size, 378
Byte/word/long displacement deferred addressing, VAX, K-67