L
L1 caches
See also First-level caches
address translation, B-46
Alpha 21164 hierarchy, 368
ARM Cortex-A8 vs. A9, 236
ARM Cortex-A8 example, 117
directory-based coherence, 418
hit time/power reduction, 79–80
multiprocessor cache coherence, 352
multiprogramming workload, 374
processor comparison, 242
speculative execution, 223
T1 multithreading unicore performance, 228
L3 caches
See also Third-level caches
Alpha 21164 hierarchy, 368
directory-based coherence, 379, 384
IBM Power processors, 47
memory access cycle shift, 372
performance/price/power considerations, 52
LabVIEW, embedded benchmarks, E-13
Lanes
GPUs vs. vector architectures, 310
Sequence of SIMD Lane Operations, 292, 313
vector execution time, 269
Vector Lane Registers, 292
Large-scale multiprocessors
cache coherence implementation
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM multiprocessor, I-36 to I-37
cluster history, L-62 to L-63
historical background, L-60 to L-61
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interprocessor communication, I-3 to I-6
for parallel programming, I-2
scientific application performance
distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
parallel processors, I-33 to I-34
symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
scientific applications, I-6 to I-12
space and relation of classes, I-46
synchronization mechanisms, I-17 to I-21
synchronization performance, I-12 to I-16
Latency
See also Response time
advanced directory protocol case study, 425
barrier synchronization, I-16
communication mechanism, I-3 to I-4
deterministic vs. adaptive routing, F-52 to F-55
distributed-memory multiprocessors, I-30, I-32
GPU SIMD instructions, 296
GPUs vs. vector architectures, 311
ILP without multithreading, 225
ILP for realizable processors, 216–218
interconnection networks, F-12 to F-20
multi-device networks, F-25 to F-29
Itanium 2 instructions, H-41
microarchitectural techniques case study, 247–254
misses, single vs. multiple thread executions, 228
multimedia instruction compiler support, A-31
NVIDIA GPU Memory structures, 305
performance milestones, 20
routing/arbitration/switching impact, F-52
shared-memory workloads, 368
Sony PlayStation 2 Emotion Engine, E-17
switched network topology, F-40 to F-41
system area network history, F-101
vs. TCP/IP reliance, F-95
throughput vs. response time, D-17
vector memory systems, G-9
WSC processor cost-performance, 472–473
WSCs vs. datacenters, 456
Layer 3 network, array and Internet linkage, 445
Layer 3 network, WSC memory hierarchy, 445
Learning curve, cost trends, 27
Least common ancestor (LCA), routing algorithms, F-48
Least recently used (LRU)
memory hierarchy history, L-11
virtual memory block replacement, B-45
Less than condition code, PowerPC, K-10 to K-11
Level 3, as Content Delivery Network, 460
Limit field, IA-32 descriptor table, B-52
Line, memory hierarchy basics, 74
Linear speedup
IBM eServer p5 multiprocessor, 408
multicore processors, 400, 402
Line locking, embedded systems, E-4 to E-5
Link injection bandwidth
interconnection networks, F-89
Link pipelining, definition, F-16
Link reception bandwidth, calculation, F-17
Link register
PowerPC instructions, K-32 to K-33
procedure invocation options, A-19
Linpack benchmark
parallel processing debates, L-58
VMIPS performance, G-17 to G-19
Linux operating systems
RAID benchmarks, D-22, D-22 to D-23
Liquid crystal display (LCD), Sanyo VPC-SX500 digital camera, E-19
Lisp
as MapReduce inspiration, 437
Little Endian
interconnection networks, F-12
memory address interpretation, A-7
MIPS core extensions, K-20 to K-21
MIPS data transfers, A-34
Little’s law
server utilization calculation, D-29
Livelock, network routing, F-44
Liveness, control dependence, 156
Livermore Fortran kernels, performance, 331, L-6
Load instructions
data hazards requiring stalls, C-20
loop-level parallelism, 318
memory port conflict, C-14
pipelined cache access, 82
Tomasulo’s algorithm, 182
Load locked, synchronization, 388–389
Load memory data (LMD), simple MIPS implementation, C-32 to C-33
Load stalls, MIPS R4000 pipeline, C-67
Load-store instruction set architecture
Intel 80x86 operations, K-62
MIPS nonaligned data transfers, K-24, K-26
simple MIPS implementation, C-32
Load upper immediate (LUI), MIPS operations, A-37
Local address space, segmented virtual memory, B-52
Local area networks (LANs)
cross-company interoperability, F-64
effective bandwidth, F-18
Ethernet as, F-77 to F-79
fault tolerance calculations, F-68
historical overview, F-99 to F-100
interconnection network domain relationship, F-4
latency and effective bandwidth, F-26 to F-28
packet latency, F-13, F-14 to F-16
shared-media networks, F-23
storage area network history, F-102 to F-103
Local Memory
centralized shared-memory architectures, 351
distributed shared-memory, 379
multiprocessor architecture, 348
symmetric shared-memory multiprocessors, 363–364
Local miss rate, definition, B-31
Local node, directory-based cache coherence protocol basics, 382
Local optimizations, compilers, A-26
Local predictors, tournament predictors, 164–166
Local scheduling, ILP, VLIW processor, 194–195
Locks
large-scale multiprocessor synchronization, I-18 to I-21
multiprocessor software development, 409
Logical units, D-34
storage systems, D-34 to D-35
Long displacement addressing, VAX, K-67
Long Instruction Word (LIW)
multiple-issue processors, L-28, L-30
Loop-carried dependences
dependent computation elimination, 321
example calculations, H-4 to H-5
loop-level parallelism, H-3
Loop exit predictor, Intel Core i7, 166
Loop interchange, compiler optimizations, 88–89
Loop-level parallelism
detection and enhancement
dependence analysis, H-6 to H-10
dependence computation elimination, 321–322
dependent computation elimination, H-10 to H-12
ILP in perfect processor, 215
ILP for realizable processors, 217–218
Loop stream detection, Intel Core i7 micro-op buffer, 238
Loop unrolling
ILP limitation studies, 220
software pipelining, H-12 to H-15, H-13, H-15
Lossless networks
switch buffer organizations, F-59
Lossy networks, definition, F-11 to F-12
Lucas
compiler optimizations, A-29
LU kernel
distributed-memory multiprocessor, I-32
symmetric shared-memory multiprocessors, I-22, I-23, I-25