T

Tag
AMD Opteron data cache, B-12 to B-14
ARM Cortex-A8, 115
cache optimization, 79–80
dynamic scheduling, 177
invalidate protocols, 357
memory hierarchy basics, 74, 77–78
virtual memory fast address translation, B-46
write strategy, B-10
Tag check (TC)
MIPS R4000, C-63
R4000 pipeline, B-62 to B-63
R4000 pipeline structure, C-63
write process, B-10
Tag fields
block identification, B-8
dynamic scheduling, 173, 175
Tail duplication, superblock scheduling, H-21
Tailgating, definition, G-20
Tandem Computers
cluster history, L-62, L-72
faults, D-14
overview, D-12 to D-13
Target address
branch hazards, C-21, C-42
branch penalty reduction, C-22 to C-23
branch-target buffer, 206
control flow instructions, A-17 to A-18
GPU conditional branching, 301
Intel Core i7 branch predictor, 166
MIPS control flow instructions, A-38
MIPS implementation, C-32
MIPS pipeline, C-36, C-37
MIPS R4000, C-25
pipeline branches, C-39
RISC instruction set, C-5
Target channel adapters (TCAs), switch vs. NIC, F-86
Target instructions
branch delay slot scheduling, C-24
as branch-target buffer variation, 206
GPU conditional branching, 301
Task-level parallelism (TLP), definition, 9
TB-80 VME rack
example, D-38
MTTF calculation, D-40 to D-41
Technology trends
basic considerations, 17–18
performance, 18–19
Teleconferencing, multimedia support, K-17
Temporal locality
blocking, 89–90
cache optimization, B-26
coining of term, L-11
definition, 45, B-2
memory hierarchy design, 72
TERA processor, L-34
Terminate events
exceptions, C-45 to C-46
hardware-based speculation, 188
loop unrolling, 161
Tertiary Disk project
failure statistics, D-13
overview, D-12
system log, D-43
Test-and-set operation, synchronization, 388
Texas Instruments 8847
arithmetic functions, J-58 to J-61
chip comparison, J-58
chip layout, J-59
Texas Instruments ASC
first vector computers, L-44
peak performance vs. start-up overhead, 331
TFLOPS, parallel processing debates, L-57 to L-58
Thacker, Chuck, F-99
Thermal design power (TDP), power trends, 22
Thin-film transistor (TFT), Sanyo VPC-SX500 digital camera, E-19
Thinking Machines, L-44, L-56
Thinking Machines CM-5, L-60
Think time, transactions, D-16, D-17
Third-level caches See also L3 caches
ILP, 245
interconnection network, F-87
SRAM, 98–99
Thrash, memory hierarchy, B-25
Thread Block
CUDA Threads, 297, 300, 303
definition, 292, 313
Fermi GTX 480 GPU floorplan, 295
function, 294
GPU hardware levels, 296
GPU Memory performance, 332
GPU programming, 289–290
Grid mapping, 293
mapping example, 293
multithreaded SIMD Processor, 294
NVIDIA GPU computational structures, 291
NVIDIA GPU Memory structures, 304
PTX Instructions, 298
Thread Block Scheduler
definition, 292, 309, 313–314
Fermi GTX 480 GPU floorplan, 295
function, 294, 311
GPU, 296
Grid mapping, 293
multithreaded SIMD Processor, 294
Thread-level parallelism (TLP)
advanced directory protocol case study, 420–426
Amdahl’s law and parallel computers, 406–407
centralized shared-memory multiprocessors
basic considerations, 351–352
cache coherence, 352–353
cache coherence enforcement, 354–355
cache coherence example, 357–362
cache coherence extensions, 362–363
invalidate protocol implementation, 356–357
SMP and snooping limitations, 363–364
snooping coherence implementation, 365–366
snooping coherence protocols, 355–356
definition, 9
directory-based cache coherence
case study, 418–420
protocol basics, 380–382
protocol example, 382–386
DSM and directory-based coherence, 378–380
embedded systems, E-15
IBM Power7, 215
from ILP, 4–5
inclusion, 397–398
Intel Core i7 performance/energy efficiency, 401–405
memory consistency models
basic considerations, 392–393
compiler optimization, 396
programming viewpoint, 393–394
relaxed consistency models, 394–395
speculation to hide latency, 396–397
MIMDs, 344–345
multicore processor performance, 400–401
multicore processors and SMT, 404–405
multiprocessing/multithreading-based performance, 398–400
multiprocessor architecture, 346–348
multiprocessor cost effectiveness, 407
multiprocessor performance, 405–406
multiprocessor software development, 407–409
vs. multithreading, 223–224
multithreading history, L-34 to L-35
parallel processing challenges, 349–351
single-chip multicore processor case study, 412–418
Sun T1 multithreading, 226–229
symmetric shared-memory multiprocessor performance
commercial workload, 367–369
commercial workload measurement, 369–374
multiprogramming and OS workload, 374–378
overview, 366–367
synchronization
basic considerations, 386–387
basic hardware primitives, 387–389
locks via coherence, 389–391
Thread Processor
definition, 292, 314
GPU, 315
Thread Processor Registers, definition, 292
Thread Scheduler in a Multithreaded CPU, definition, 292
Thread of SIMD Instructions
characteristics, 295–296
CUDA Thread, 303
definition, 292, 313
Grid mapping, 293
lane recognition, 300
scheduling example, 297
terminology comparison, 314
vector/GPU comparison, 308–309
Thread of Vector Instructions, definition, 292
Three-dimensional space, direct networks, F-38
Three-level cache hierarchy
commercial workloads, 368
ILP, 245
Intel Core i7, 118, 118
Throttling, packets, F-10
Throughput See also Bandwidth
definition, C-3, F-13
disk storage, D-4
Google WSC, 470
ILP, 245
instruction fetch bandwidth, 202
Intel Core i7, 236–237
kernel characteristics, 327
memory banks, 276
multiple lanes, 271
parallelism, 44
performance considerations, 36
performance trends, 18–19
pipelining basics, C-10
precise exceptions, C-60
producer-server model, D-16
vs. response time, D-17
routing comparison, F-54
server benchmarks, 40–41
servers, 7
storage systems, D-16 to D-18
uniprocessors, TLP
basic considerations, 223–226
fine-grained multithreading on Sun T1, 226–229
superscalar SMT, 230–232
and virtual channels, F-93
WSCs, 434
Ticks
cache coherence, 391
processor performance equation, 48–49
Tilera TILE-Gx processors, OCNs, F-3
Time-cost relationship, components, 27–28
Time division multiple access (TDMA), cell phones, E-25
Time of flight
communication latency, I-3 to I-4
interconnection networks, F-13
Timing independent, L-17 to L-18
TI TMS320C6x DSP
architecture, E-9
characteristics, E-8 to E-10
instruction packet, E-10
TI TMS320C55 DSP
architecture, E-7
characteristics, E-7 to E-8
data operands, E-6
Tomasulo’s algorithm
advantages, 177–178
dynamic scheduling, 170–176
FP unit, 185
loop-based example, 179, 181–183
MIPS FP unit, 173
register renaming vs. ROB, 209
step details, 178, 180
TOP500, L-58
Top Of Stack (TOS) register, ISA operands, A-4
Topology
Beneš networks, F-33
centralized switched networks, F-30 to F-34, F-31
definition, F-29
direct networks, F-37
distributed switched networks, F-34 to F-40
interconnection networks, F-21 to F-22, F-44
basic considerations, F-29 to F-30
fault tolerance, F-67
network performance and cost, F-40
network performance effects, F-40 to F-44
rings, F-36
routing/arbitration/switching impact, F-52
system area network history, F-100 to F-101
Torus networks
characteristics, F-36
commercial interconnection networks, F-63
direct networks, F-37
fault tolerance, F-67
IBM Blue Gene/L, F-72 to F-74
NEWS communication, F-43
routing comparison, F-54
system area network history, F-102
Total Cost of Ownership (TCO), WSC case study, 476–479
Total store ordering, relaxed consistency models, 395
Tournament predictors
early schemes, L-27 to L-28
ILP for realizable processors, 216
local/global predictor combinations, 164–166
Toy programs, performance benchmarks, 37
Trace compaction, basic process, H-19
Trace scheduling
basic approach, H-19 to H-21
overview, H-20
Trace selection, definition, H-19
Tradebeans benchmark, SMT on superscalar processors, 230
Traffic intensity, queuing theory, D-25
Trailer
messages, F-6
packet format, F-7
Transaction components, D-16, D-17, I-38 to I-39
Transaction-processing (TP)
server benchmarks, 41
storage system benchmarks, D-18 to D-19
Transaction Processing Council (TPC)
benchmarks overview, D-18 to D-19, D-19
parallelism, 44
performance results reporting, 41
server benchmarks, 41
TPC-B, shared-memory workloads, 368
TPC-C
file system benchmarking, D-20
IBM eServer p5 processor, 409
multiprocessing/multithreading-based performance, 398
multiprocessor cost effectiveness, 407
single vs. multiple thread executions, 228
Sun T1 multithreading unicore performance, 227–229, 229
WSC services, 441
TPC-D, shared-memory workloads, 368–369
TPC-E, shared-memory workloads, 368–369
Transfers See also Data transfers
as early control flow instruction definition, A-16
Transforms, DSP, E-5
Transient failure, commercial interconnection networks, F-66
Transient faults, storage systems, D-11
Transistors
clock rate considerations, 244
dependability, 33–36
energy and power, 23–26
ILP, 245
performance scaling, 19–21
processor comparisons, 324
processor trends, 2
RISC instructions, A-3
shrinking, 55
static power, 26
technology trends, 17–18
Translation buffer (TB)
virtual memory block identification, B-45
virtual memory fast address translation, B-46
Translation lookaside buffer (TLB)
address translation, B-39
AMD64 paged virtual memory, B-56 to B-57
ARM Cortex-A8, 114–115
cache optimization, 80, B-37
coining of term, L-9
Intel Core i7, 118, 120–121
interconnection network protection, F-86
memory hierarchy, B-48 to B-49
memory hierarchy basics, 78
MIPS64 instructions, K-27
Opteron, B-47
Opteron memory hierarchy, B-57
RISC code size, A-23
shared-memory workloads, 369–370
speculation advantages/disadvantages, 210–211
strided access interactions, 323
Virtual Machines, 110
virtual memory block identification, B-45
virtual memory fast address translation, B-46
virtual memory page size selection, B-47
virtual memory protection, 106–107
Transmission Control Protocol (TCP), congestion management, F-65
Transmission Control Protocol/Internet Protocol (TCP/IP)
ATM, F-79
headers, F-84
internetworking, F-81, F-83 to F-84, F-89
reliance on, F-95
WAN history, F-98
Transmission speed, interconnection network performance, F-13
Transmission time
communication latency, I-3 to I-4
time of flight, F-13 to F-14
Transport latency
time of flight, F-14
topology, F-35 to F-36
Transport layer, definition, F-82
Transputer, F-100
Tree-based barrier, large-scale multiprocessor synchronization, I-19
Tree height reduction, definition, H-11
Trees, MINs with nonblocking, F-34
Trellis codes, definition, E-7
TRIPS Edge processor, F-63
characteristics, F-73
Trojan horses
definition, B-51
segmented virtual memory, B-53
True dependence
finding, H-7 to H-8
loop-level parallelism calculations, 320
vs. name dependence, 153
True sharing misses
commercial workloads, 371, 373
definition, 366–367
multiprogramming workloads, 377
True speedup, multiprocessor performance, 406
TSMC, Stratton, F-3
TSS operating system, L-9
Turbo mode
hardware enhancements, 56
microprocessors, 26
Turing, Alan, L-4, L-19
Turn Model routing algorithm, example calculations, F-47 to F-48
Two-level branch predictors
branch costs, 163
Intel Core i7, 166
tournament predictors, 165
Two-level cache hierarchy
cache optimization, B-31
ILP, 245
Two’s complement, J-7 to J-8
Two-way conflict misses, definition, B-23
Two-way set associativity
ARM Cortex-A8, 233
cache block placement, B-7, B-8
cache miss rates, B-24
cache miss rates vs. size, B-33
cache optimization, B-38
cache organization calculations, B-19 to B-20
commercial workload, 370–373, 371
multiprogramming workload, 374–375
nonblocking cache, 84
Opteron data cache, B-13 to B-14
2:1 cache rule of thumb, B-29
virtual to cache access scenario, B-39
TX-2, L-34, L-49
“Typical” program, instruction set considerations, A-43