L

L1 caches See also First-level caches
address translation, B-46
Alpha 21164 hierarchy, 368
ARM Cortex-A8, 116, 116, 235
ARM Cortex-A8 vs. A9, 236
ARM Cortex-A8 example, 117
cache optimization, B-31 to B-33
case study examples, B-60, B-63 to B-64
directory-based coherence, 418
Fermi GPU, 306
hardware prefetching, 91
hit time/power reduction, 79–80
inclusion, 397–398, B-34 to B-35
Intel Core i7, 118–119, 121–122, 123, 124, 124, 239, 241
invalidate protocol, 355, 356–357
memory consistency, 392
memory hierarchy, B-39
miss rates, 376–377
multiprocessor cache coherence, 352
multiprogramming workload, 374
nonblocking cache, 85
NVIDIA GPU Memory, 304
Opteron memory, B-57
processor comparison, 242
speculative execution, 223
T1 multithreading unicore performance, 228
virtual memory, B-48 to B-49
L2 caches See also Second-level caches
ARM Cortex-A8, 114, 115–116, 235–236
ARM Cortex-A8 example, 117
cache optimization, B-31 to B-33, B-34
case study example, B-63 to B-64
coherency, 352
commercial workloads, 373
directory-based coherence, 379, 418–420, 422, 424
fault detection, 58
Fermi GPU, 296, 306, 308
hardware prefetching, 91
IBM Blue Gene/L, I-42
inclusion, 397–398, B-35
Intel Core i7, 118, 120–122, 124, 124–125, 239, 241
invalidation protocol, 355, 356–357
and ISA, 241
memory consistency, 392
memory hierarchy, B-39, B-48, B-57
multithreading, 225, 228
nonblocking cache, 85
NVIDIA GPU Memory, 304
processor comparison, 242
snooping coherence, 359–361
speculation, 223
L3 caches See also Third-level caches
Alpha 21164 hierarchy, 368
coherence, 352
commercial workloads, 370, 371, 374
directory-based coherence, 379, 384
IBM Blue Gene/L, I-42
IBM Power processors, 2, 47
inclusion, 398
Intel Core i7, 118, 121, 124, 124–125, 239, 241, 403–404
invalidation protocol, 355, 356–357, 360
memory access cycle shift, 372
miss rates, 373
multicore processors, 400–401
multithreading, 225
nonblocking cache, 83
performance/price/power considerations, 52
snooping coherence, 359, 361, 363
LabVIEW, embedded benchmarks, E-13
Lampson, Butler, F-99
Lanes
GPUs vs. vector architectures, 310
Sequence of SIMD Lane Operations, 292, 313
SIMD Lane Registers, 309, 314
vector execution time, 269
vector instruction set, 271–273
Vector Lane Registers, 292
Vector Lanes, 292, 296–297, 309, 311
Large-scale multiprocessors
cache coherence implementation
deadlock and buffering, I-38 to I-40
directory controller, I-40 to I-41
DSM multiprocessor, I-36 to I-37
overview, I-34 to I-36
classification, I-45
cluster history, L-62 to L-63
historical background, L-60 to L-61
IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
interprocessor communication, I-3 to I-6
for parallel programming, I-2
scientific application performance
distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
parallel processors, I-33 to I-34
symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
scientific applications, I-6 to I-12
space and relation of classes, I-46
synchronization mechanisms, I-17 to I-21
synchronization performance, I-12 to I-16
Latency See also Response time
advanced directory protocol case study, 425
vs. bandwidth, 18–19, 19
barrier synchronization, I-16
and cache miss, B-2 to B-3
cluster history, L-73
communication mechanism, I-3 to I-4
definition, D-15
deterministic vs. adaptive routing, F-52 to F-55
directory coherence, 425
distributed-memory multiprocessors, I-30, I-32
dynamically scheduled pipelines, C-70 to C-71
Flash memory, D-3
FP operations, 157
FP pipeline, C-66
functional units, C-53
GPU SIMD instructions, 296
GPUs vs. vector architectures, 311
hazards and forwarding, C-54 to C-58
hiding with speculation, 396–397
ILP exposure, 157
ILP without multithreading, 225
ILP for realizable processors, 216–218
Intel SCC, F-70
interconnection networks, F-12 to F-20
multi-device networks, F-25 to F-29
Itanium 2 instructions, H-41
microarchitectural techniques case study, 247–254
MIPS pipeline FP operations, C-52 to C-53
misses, single vs. multiple thread executions, 228
multimedia instruction compiler support, A-31
NVIDIA GPU Memory structures, 305
OCNs vs. SANs, F-27
out-of-order processors, B-20 to B-21
packets, F-13, F-14
parallel processing, 350
performance milestones, 20
pipeline, C-87
ROB commit, 187
routing, F-50
routing/arbitration/switching impact, F-52
routing comparison, F-54
SAN example, F-73
shared-memory workloads, 368
snooping coherence, 414
Sony PlayStation 2 Emotion Engine, E-17
Sun T1 multithreading, 226–229
switched network topology, F-40 to F-41
system area network history, F-101
vs. TCP/IP reliance, F-95
throughput vs. response time, D-17
utility computing, L-74
vector memory systems, G-9
vector start-up, G-8
WSC efficiency, 450–452
WSC memory hierarchy, 443, 443–444, 444, 445
WSC processor cost-performance, 472–473
WSCs vs. datacenters, 456
Layer 3 network
array and Internet linkage, 445
WSC memory hierarchy, 445
Learning curve, cost trends, 27
Least common ancestor (LCA), routing algorithms, F-48
Least recently used (LRU)
AMD Opteron data cache, B-12, B-14
block replacement, B-9
memory hierarchy history, L-11
virtual memory block replacement, B-45
Less than condition code, PowerPC, K-10 to K-11
Level 3, as Content Delivery Network, 460
Limit field, IA-32 descriptor table, B-52
Line, memory hierarchy basics, 74
Linear speedup
cost effectiveness, 407
IBM eServer p5 multiprocessor, 408
multicore processors, 400, 402
performance, 405–406
Line locking, embedded systems, E-4 to E-5
Link injection bandwidth
calculation, F-17
interconnection networks, F-89
Link pipelining, definition, F-16
Link reception bandwidth, calculation, F-17
Link register
MIPS control flow instructions, A-37 to A-38
PowerPC instructions, K-32 to K-33
procedure invocation options, A-19
synchronization, 389
Linpack benchmark
cluster history, L-63
parallel processing debates, L-58
vector processor example, 267–268
VMIPS performance, G-17 to G-19
Linux operating systems
Amazon Web Services, 456–457
architecture costs, 2
protection and ISA, 112
RAID benchmarks, D-22, D-22 to D-23
WSC services, 441
Liquid crystal display (LCD), Sanyo VPC-SX500 digital camera, E-19
Lisp
ILP, 215
as MapReduce inspiration, 437
RISC history, L-20
SPARC instructions, K-30
Literal addressing mode, basic considerations, A-10 to A-11
Little Endian
Intel 80x86, K-49
interconnection networks, F-12
memory address interpretation, A-7
MIPS core extensions, K-20 to K-21
MIPS data transfers, A-34
Little’s law
definition, D-24 to D-25
server utilization calculation, D-29
Livelock, network routing, F-44
Liveness, control dependence, 156
Livermore Fortran kernels, performance, 331, L-6
Load instructions
control dependences, 155
data hazards requiring stalls, C-20
dynamic scheduling, 177
ILP, 199, 201
loop-level parallelism, 318
memory port conflict, C-14
pipelined cache access, 82
RISC instruction set, C-4 to C-5
Tomasulo’s algorithm, 182
VLIW sample code, 252
Load interlocks
definition, C-37 to C-39
detection logic, C-39
Load linked
locks via coherence, 391
synchronization, 388–389
Load locked, synchronization, 388–389
Load memory data (LMD), simple MIPS implementation, C-32 to C-33
Load stalls, MIPS R4000 pipeline, C-67
Load-store instruction set architecture
basic concept, C-4 to C-5
IBM 360, K-87
Intel Core i7, 124
Intel 80x86 operations, K-62
as ISA, 11
ISA classification, A-5
MIPS nonaligned data transfers, K-24, K-26
MIPS operations, A-35 to A-36, A-36
PowerPC, K-33
RISC history, L-19
simple MIPS implementation, C-32
VMIPS, 265
Load/store unit
Fermi GPU, 305
ILP hardware model, 215
multiple lanes, 273
Tomasulo’s algorithm, 171–173, 182, 197
vector units, 265, 276–277
Load upper immediate (LUI), MIPS operations, A-37
Local address space, segmented virtual memory, B-52
Local area networks (LANs)
characteristics, F-4
cross-company interoperability, F-64
effective bandwidth, F-18
Ethernet as, F-77 to F-79
fault tolerance calculations, F-68
historical overview, F-99 to F-100
InfiniBand, F-74
interconnection network domain relationship, F-4
latency and effective bandwidth, F-26 to F-28
offload engines, F-8
packet latency, F-13, F-14 to F-16
routers/gateways, F-79
shared-media networks, F-23
storage area network history, F-102 to F-103
switches, F-29
TCP/IP reliance, F-95
time of flight, F-13
topology, F-30
Locality See Principle of locality
Local Memory
centralized shared-memory architectures, 351
definition, 292, 314
distributed shared-memory, 379
Fermi GPU, 306
Grid mapping, 293
multiprocessor architecture, 348
NVIDIA GPU Memory structures, 304, 304–305
SIMD, 315
symmetric shared-memory multiprocessors, 363–364
Local miss rate, definition, B-31
Local node, directory-based cache coherence protocol basics, 382
Local optimizations, compilers, A-26
Local predictors, tournament predictors, 164–166
Local scheduling, ILP, VLIW processor, 194–195
Locks
via coherence, 389–391
hardware primitives, 387
large-scale multiprocessor synchronization, I-18 to I-21
multiprocessor software development, 409
Lock-up free cache, 83
Logical units, D-34
storage systems, D-34 to D-35
Logical volumes, D-34
Long displacement addressing, VAX, K-67
Long-haul networks See Wide area networks (WANs)
Long Instruction Word (LIW)
EPIC, L-32
multiple-issue processors, L-28, L-30
Long integer
operand sizes/types, 12
SPEC benchmarks, A-14
Loop-carried dependences
CUDA, 290
definition, 315–316
dependence distance, H-6
dependent computation elimination, 321
example calculations, H-4 to H-5
GCD, 319
loop-level parallelism, H-3
as recurrence, 318
recurrence form, H-5
VMIPS, 268
Loop exit predictor, Intel Core i7, 166
Loop interchange, compiler optimizations, 88–89
Loop-level parallelism
definition, 149–150
detection and enhancement
basic approach, 315–318
dependence analysis, H-6 to H-10
dependence computation elimination, 321–322
dependences, locating, 318–321
dependent computation elimination, H-10 to H-12
overview, H-2 to H-6
history, L-30 to L-31
ILP in perfect processor, 215
ILP for realizable processors, 217–218
Loop stream detection, Intel Core i7 micro-op buffer, 238
Loop unrolling
basic considerations, 161–162
ILP exposure, 157–161
ILP limitation studies, 220
recurrences, H-12
software pipelining, H-12 to H-15, H-13, H-15
Tomasulo’s algorithm, 179, 181–183
VLIW processors, 195
Lossless networks
definition, F-11 to F-12
switch buffer organizations, F-59
Lossy networks, definition, F-11 to F-12
Lucas
compiler optimizations, A-29
data cache misses, B-10
LU kernel
characteristics, I-8
distributed-memory multiprocessor, I-32
symmetric shared-memory multiprocessors, I-22, I-23, I-25