L

L1 caches See also First-level caches
address translation, B-46
Alpha 21164 hierarchy, 368
ARM Cortex-A8, 116, 116, 235
ARM Cortex-A8 vs. A9, 236
ARM Cortex-A8 example, 117
cache optimization, B-31 to B-33
case study examples, B-60, B-63 to B-64
directory-based coherence, 418
Fermi GPU, 306
hardware prefetching, 91
hit time/power reduction, 79–80
inclusion, 397–398, B-34 to B-35
Intel Core i7, 118–119, 121–122, 123, 124, 124, 239, 241
invalidate protocol, 355, 356–357
memory consistency, 392
memory hierarchy, B-39
miss rates, 376–377
multiprocessor cache coherence, 352
multiprogramming workload, 374
nonblocking cache, 85
NVIDIA GPU Memory, 304
Opteron memory, B-57
processor comparison, 242
speculative execution, 223
T1 multithreading unicore performance, 228
virtual memory, B-48 to B-49
L2 caches See also Second-level caches
ARM Cortex-A8, 114, 115–116, 235–236
ARM Cortex-A8 example, 117
cache optimization, B-31 to B-33, B-34
case study example, B-63 to B-64
coherency, 352
commercial workloads, 373
directory-based coherence, 379, 418–420, 422, 424
fault detection, 58
Fermi GPU, 296, 306, 308
hardware prefetching, 91
IBM Blue Gene/L, I-42
inclusion, 397–398, B-35
Intel Core i7, 118, 120–122, 124, 124–125, 239, 241
invalidation protocol, 355, 356–357
and ISA, 241
memory consistency, 392
memory hierarchy, B-39, B-48, B-57
multithreading, 225, 228
nonblocking cache, 85
NVIDIA GPU Memory, 304
processor comparison, 242
snooping coherence, 359–361
speculation, 223
L3 caches See also Third-level caches
Alpha 21164 hierarchy, 368
coherence, 352
commercial workloads, 370, 371, 374
directory-based coherence, 379, 384
IBM Blue Gene/L, I-42
IBM Power processors, 2, 47
inclusion, 398
Intel Core i7, 118, 121, 124, 124–125, 239, 241, 403–404
invalidation protocol, 355, 356–357, 360
memory access cycle shift, 372
miss rates, 373
multicore processors, 400–401
multithreading, 225
nonblocking cache, 83
performance/price/power considerations, 52
snooping coherence, 359, 361, 363
LabVIEW, embedded benchmarks, E-13
Lampson, Butler, F-99
Lanes
GPUs vs. vector architectures, 310
Sequence of SIMD Lane Operations, 292, 313
SIMD Lane Registers, 309, 314
vector execution time, 269
vector instruction set, 271–273
Vector Lane Registers, 292
Vector Lanes, 292, 296–297, 309, 311
Large-scale multiprocessors
cache coherence implementation
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM multiprocessor, I-36 to I-37
overview, I-34 to I-36
classification, I-45
cluster history, L-62 to L-63
historical background, L-60 to L-61
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interprocessor communication, I-3 to I-6
for parallel programming, I-2
scientific application performance
distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
parallel processors, I-33 to I-34
symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
scientific applications, I-6 to I-12
space and relation of classes, I-46
synchronization mechanisms, I-17 to I-21
synchronization performance, I-12 to I-16
Latency See also Response time
advanced directory protocol case study, 425
vs. bandwidth, 18–19, 19
barrier synchronization, I-16
and cache miss, B-2 to B-3
cluster history, L-73
communication mechanism, I-3 to I-4
definition, D-15
deterministic vs. adaptive routing, F-52 to F-55
directory coherence, 425
distributed-memory multiprocessors, I-30, I-32
dynamically scheduled pipelines, C-70 to C-71
Flash memory, D-3
FP operations, 157
FP pipeline, C-66
functional units, C-53
GPU SIMD instructions, 296
GPUs vs. vector architectures, 311
hazards and forwarding, C-54 to C-58
hiding with speculation, 396–397
ILP exposure, 157
ILP without multithreading, 225
ILP for realizable processors, 216–218
Intel SCC, F-70
interconnection networks, F-12 to F-20
multi-device networks, F-25 to F-29
Itanium 2 instructions, H-41
microarchitectural techniques case study, 247–254
MIPS pipeline FP operations, C-52 to C-53
misses, single vs. multiple thread executions, 228
multimedia instruction compiler support, A-31
NVIDIA GPU Memory structures, 305
OCNs vs. SANs, F-27
out-of-order processors, B-20 to B-21
packets, F-13, F-14
parallel processing, 350
performance milestones, 20
pipeline, C-87
ROB commit, 187
routing, F-50
routing/arbitration/switching impact, F-52
routing comparison, F-54
SAN example, F-73
shared-memory workloads, 368
snooping coherence, 414
Sony PlayStation 2 Emotion Engine, E-17
Sun T1 multithreading, 226–229
switched network topology, F-40 to F-41
system area network history, F-101
vs. TCP/IP reliance, F-95
throughput vs. response time, D-17
utility computing, L-74
vector memory systems, G-9
vector start-up, G-8
WSC efficiency, 450–452
WSC memory hierarchy, 443, 443–444, 444, 445
WSC processor cost-performance, 472–473
WSCs vs. datacenters, 456
Layer 3 network
array and Internet linkage, 445
WSC memory hierarchy, 445
Learning curve, cost trends, 27
Least common ancestor (LCA), routing algorithms, F-48
Least recently used (LRU)
AMD Opteron data cache, B-12, B-14
block replacement, B-9
memory hierarchy history, L-11
virtual memory block replacement, B-45
Less than condition code, PowerPC, K-10 to K-11
Level 3, as Content Delivery Network, 460
Limit field, IA-32 descriptor table, B-52
Line, memory hierarchy basics, 74
Linear speedup
cost effectiveness, 407
IBM eServer p5 multiprocessor, 408
multicore processors, 400, 402
performance, 405–406
Line locking, embedded systems, E-4 to E-5
Link injection bandwidth
calculation, F-17
interconnection networks, F-89
Link pipelining, definition, F-16
Link reception bandwidth, calculation, F-17
Link register
MIPS control flow instructions, A-37 to A-38
PowerPC instructions, K-32 to K-33
procedure invocation options, A-19
synchronization, 389
Linpack benchmark
cluster history, L-63
parallel processing debates, L-58
vector processor example, 267–268
VMIPS performance, G-17 to G-19
Linux operating systems
Amazon Web Services, 456–457
architecture costs, 2
protection and ISA, 112
RAID benchmarks, D-22, D-22 to D-23
WSC services, 441
Liquid crystal display (LCD), Sanyo VPC-SX500 digital camera, E-19
Lisp
ILP, 215
as MapReduce inspiration, 437
RISC history, L-20
SPARC instructions, K-30
Literal addressing mode, basic considerations, A-10 to A-11
Little Endian
Intel 80x86, K-49
interconnection networks, F-12
memory address interpretation, A-7
MIPS core extensions, K-20 to K-21
MIPS data transfers, A-34
Little’s law
definition, D-24 to D-25
server utilization calculation, D-29
Livelock, network routing, F-44
Liveness, control dependence, 156
Livermore Fortran kernels, performance, 331, L-6
Load instructions
control dependences, 155
data hazards requiring stalls, C-20
dynamic scheduling, 177
ILP, 199, 201
loop-level parallelism, 318
memory port conflict, C-14
pipelined cache access, 82
RISC instruction set, C-4 to C-5
Tomasulo’s algorithm, 182
VLIW sample code, 252
Load interlocks
definition, C-37 to C-39
detection logic, C-39
Load linked
locks via coherence, 391
synchronization, 388–389
Load locked, synchronization, 388–389
Load memory data (LMD), simple MIPS implementation, C-32 to C-33
Load stalls, MIPS R4000 pipeline, C-67
Load-store instruction set architecture
basic concept, C-4 to C-5
IBM 360, K-87
Intel Core i7, 124
Intel 80x86 operations, K-62
as ISA, 11
ISA classification, A-5
MIPS nonaligned data transfers, K-24, K-26
MIPS operations, A-35 to A-36, A-36
PowerPC, K-33
RISC history, L-19
simple MIPS implementation, C-32
VMIPS, 265
Load/store unit
Fermi GPU, 305
ILP hardware model, 215
multiple lanes, 273
Tomasulo’s algorithm, 171–173, 182, 197
vector units, 265, 276–277
Load upper immediate (LUI), MIPS operations, A-37
Local address space, segmented virtual memory, B-52
Local area networks (LANs)
characteristics, F-4
cross-company interoperability, F-64
effective bandwidth, F-18
Ethernet as, F-77 to F-79
fault tolerance calculations, F-68
historical overview, F-99 to F-100
InfiniBand, F-74
interconnection network domain relationship, F-4
latency and effective bandwidth, F-26 to F-28
offload engines, F-8
packet latency, F-13, F-14 to F-16
routers/gateways, F-79
shared-media networks, F-23
storage area network history, F-102 to F-103
switches, F-29
TCP/IP reliance, F-95
time of flight, F-13
topology, F-30
Locality See Principle of locality
Local Memory
centralized shared-memory architectures, 351
definition, 292, 314
distributed shared-memory, 379
Fermi GPU, 306
Grid mapping, 293
multiprocessor architecture, 348
NVIDIA GPU Memory structures, 304, 304–305
SIMD, 315
symmetric shared-memory multiprocessors, 363–364
Local miss rate, definition, B-31
Local node, directory-based cache coherence protocol basics, 382
Local optimizations, compilers, A-26
Local predictors, tournament predictors, 164–166
Local scheduling, ILP, VLIW processor, 194–195
Locks
via coherence, 389–391
hardware primitives, 387
large-scale multiprocessor synchronization, I-18 to I-21
multiprocessor software development, 409
Lock-up free cache, 83
Logical units, D-34
storage systems, D-34 to D-35
Logical volumes, D-34
Long displacement addressing, VAX, K-67
Long-haul networks See Wide area networks (WANs)
Long Instruction Word (LIW)
EPIC, L-32
multiple-issue processors, L-28, L-30
Long integer
operand sizes/types, 12
SPEC benchmarks, A-14
Loop-carried dependences
CUDA, 290
definition, 315–316
dependence distance, H-6
dependent computation elimination, 321
example calculations, H-4 to H-5
GCD, 319
loop-level parallelism, H-3
as recurrence, 318
recurrence form, H-5
VMIPS, 268
Loop exit predictor, Intel Core i7, 166
Loop interchange, compiler optimizations, 88–89
Loop-level parallelism
definition, 149–150
detection and enhancement
basic approach, 315–318
dependence analysis, H-6 to H-10
dependence computation elimination, 321–322
dependences, locating, 318–321
dependent computation elimination, H-10 to H-12
overview, H-2 to H-6
history, L-30 to L-31
ILP in perfect processor, 215
ILP for realizable processors, 217–218
Loop stream detection, Intel Core i7 micro-op buffer, 238
Loop unrolling
basic considerations, 161–162
ILP exposure, 157–161
ILP limitation studies, 220
recurrences, H-12
software pipelining, H-12 to H-15, H-13, H-15
Tomasulo’s algorithm, 179, 181–183
VLIW processors, 195
Lossless networks
definition, F-11 to F-12
switch buffer organizations, F-59
Lossy networks, definition, F-11 to F-12
Lucas
compiler optimizations, A-29
data cache misses, B-10
LU kernel
characteristics, I-8
distributed-memory multiprocessor, I-32
symmetric shared-memory multiprocessors, I-22, I-23, I-25