T
Tag
cache optimization, 79–80
invalidate protocols, 357
memory hierarchy basics, 74
memory hierarchy basics, 77–78
virtual memory fast address translation, B-46
Tag check (TC)
R4000 pipeline structure, C-63
Tag fields
block identification, B-8
Tail duplication, superblock scheduling, H-21
Tailgating, definition, G-20
Tandem Computers
cluster history, L-62, L-72
Target address
branch-target buffer, 206
GPU conditional branching, 301
Intel Core i7 branch predictor, 166
MIPS control flow instructions, A-38
MIPS implementation, C-32
RISC instruction set, C-5
Target channel adapters (TCAs), switch vs. NIC, F-86
Target instructions
branch delay slot scheduling, C-24
as branch-target buffer variation, 206
GPU conditional branching, 301
Task-level parallelism (TLP), definition,
TB-80 VME rack
MTTF calculation, D-40 to D-41
Technology trends
basic considerations, 17–18
Teleconferencing, multimedia support, K-17
Temporal locality
memory hierarchy design, 72
Terminate events
hardware-based speculation, 188
Test-and-set operation, synchronization, 388
Texas Instruments 8847
arithmetic functions, J-58 to J-61
Texas Instruments ASC
first vector computers, L-44
peak performance
vs. start-up overhead, 331
TFLOPS, parallel processing debates, L-57 to L-58
Thermal design power (TDP), power trends, 22
Thin-film transistor (TFT), Sanyo VPC-SX500 digital camera, E-19
Thinking Machines, L-44, L-56
Thinking Machines CM-5, L-60
Think time, transactions, D-16, D-17
Third-level caches
See also L3 caches
interconnection network, F-87
Thrash, memory hierarchy, B-25
Thread Block
Fermi GTX 480 GPU floorplan, 295
GPU Memory performance, 332
multithreaded SIMD Processor, 294
NVIDIA GPU computational structures, 291
NVIDIA GPU Memory structures, 304
Thread Block Scheduler
Fermi GTX 480 GPU floorplan, 295
multithreaded SIMD Processor, 294
Thread-level parallelism (TLP)
advanced directory protocol case study, 420–426
Amdahl’s law and parallel computers, 406–407
centralized shared-memory multiprocessors
cache coherence enforcement, 354–355
cache coherence extensions, 362–363
invalidate protocol implementation, 356–357
SMP and snooping limitations, 363–364
snooping coherence implementation, 365–366
snooping coherence protocols, 355–356
directory-based cache coherence
DSM and directory-based coherence, 378–380
Intel Core i7 performance/energy efficiency, 401–405
memory consistency models
compiler optimization, 396
relaxed consistency models, 394–395
speculation to hide latency, 396–397
multicore processor performance, 400–401
multicore processors and SMT, 404–405
multiprocessing/multithreading-based performance, 398–400
multiprocessor architecture, 346–348
multiprocessor cost effectiveness, 407
multiprocessor performance, 405–406
multiprocessor software development, 407–409
multithreading history, L-34 to L-35
parallel processing challenges, 349–351
single-chip multicore processor case study, 412–418
symmetric shared-memory multiprocessor performance
commercial workload measurement, 369–374
multiprogramming and OS workload, 374–378
Thread Processor Registers, definition, 292
Thread Scheduler in a Multithreaded CPU, definition, 292
Thread of SIMD Instructions
terminology comparison, 314
Thread of Vector Instructions, definition, 292
Three-dimensional space, direct networks, F-38
Three-level cache hierarchy
commercial workloads, 368
Throttling, packets, F-10
Throughput
See also Bandwidth
instruction fetch bandwidth, 202
kernel characteristics, 327
performance considerations, 36
performance trends, 18–19
producer-server model, D-16
storage systems, D-16 to D-18
uniprocessors, TLP
fine-grained multithreading on Sun T1, 226–229
and virtual channels, F-93
Ticks
processor performance equation, 48–49
Tilera TILE-Gx processors, OCNs, F-3
Time-cost relationship, components, 27–28
Time division multiple access (TDMA), cell phones, E-25
Time of flight
communication latency, I-3 to I-4
interconnection networks, F-13
Timing independent, L-17 to L-18
TI TMS320C6x DSP
characteristics, E-8 to E-10
TI TMS320C55 DSP
characteristics, E-7 to E-8
Tomasulo’s algorithm
register renaming
vs. ROB, 209
Top Of Stack (TOS) register, ISA operands, A-4
Topology
centralized switched networks, F-30 to F-34, F-31
distributed switched networks, F-34 to F-40
interconnection networks, F-21 to F-22, F-44
basic considerations, F-29 to F-30
network performance and cost, F-40
network performance effects, F-40 to F-44
routing/arbitration/switching impact, F-52
system area network history, F-100 to F-101
Torus networks
commercial interconnection networks, F-63
IBM Blue Gene/L, F-72 to F-74
system area network history, F-102
Total Cost of Ownership (TCO), WSC case study, 476–479
Total store ordering, relaxed consistency models, 395
Tournament predictors
early schemes, L-27 to L-28
ILP for realizable processors, 216
local/global predictor combinations, 164–166
Toy programs, performance benchmarks, 37
Trace compaction, basic process, H-19
Trace scheduling
basic approach, H-19 to H-21
Trace selection, definition, H-19
Tradebeans benchmark, SMT on superscalar processors, 230
Traffic intensity, queuing theory, D-25
Transaction components, D-16, D-17, I-38 to I-39
Transaction-processing (TP)
storage system benchmarks, D-18 to D-19
Transaction Processing Council (TPC)
benchmarks overview, D-18 to D-19, D-19
performance results reporting, 41
TPC-B, shared-memory workloads, 368
TPC-C
file system benchmarking, D-20
IBM eServer p5 processor, 409
multiprocessing/multithreading-based performance, 398
multiprocessor cost effectiveness, 407
single vs. multiple thread executions, 228
Sun T1 multithreading unicore performance, 227–229, 229
TPC-D, shared-memory workloads, 368–369
TPC-E, shared-memory workloads, 368–369
Transient failure, commercial interconnection networks, F-66
Transient faults, storage systems, D-11
Transistors
clock rate considerations, 244
performance scaling, 19–21
processor comparisons, 324
Translation buffer (TB)
virtual memory block identification, B-45
virtual memory fast address translation, B-46
Translation lookaside buffer (TLB)
address translation, B-39
interconnection network protection, F-86
memory hierarchy basics, 78
MIPS64 instructions, K-27
Opteron memory hierarchy, B-57
speculation advantages/disadvantages, 210–211
strided access interactions, 323
virtual memory block identification, B-45
virtual memory fast address translation, B-46
virtual memory page size selection, B-47
Transmission Control Protocol (TCP), congestion management, F-65
Transmission Control Protocol/Internet Protocol (TCP/IP)
internetworking, F-81, F-83 to F-84, F-89
Transmission speed, interconnection network performance, F-13
Transmission time
communication latency, I-3 to I-4
time of flight, F-13 to F-14
Transport layer, definition, F-82
Tree-based barrier, large-scale multiprocessor synchronization, I-19
Tree height reduction, definition, H-11
Trees, MINs with nonblocking, F-34
Trellis codes, definition, E-7
TRIPS Edge processor, F-63
Trojan horses
segmented virtual memory, B-53
True dependence
loop-level parallelism calculations, 320
True sharing misses
commercial workloads, 371, 373
multiprogramming workloads, 377
True speedup, multiprocessor performance, 406
TSS operating system, L-9
Turbo mode
hardware enhancements, 56
Turn Model routing algorithm, example calculations, F-47 to F-48
Two-level branch predictors
tournament predictors, 165
Two-level cache hierarchy
Two’s complement, J-7 to J-8
Two-way conflict misses, definition, B-23
Two-way set associativity
cache block placement, B-7, B-8
cache miss rates
vs. size, B-33
2:1 cache rule of thumb, B-29
virtual to cache access scenario, B-39
“Typical” program, instruction set considerations, A-43