V

Valid bit

address translation, B-46

block identification, B-7

Opteron data cache, B-14

paged virtual memory, B-56

segmented virtual memory, B-52

snooping, 357

symmetric shared-memory multiprocessors, 366

Value prediction

definition, 202

hardware-based speculation, 192

ILP, 212–213, 220

speculation, 208

VAPI, InfiniBand, F-77

Variable length encoding

control flow instruction branches, A-18

instruction sets, A-22

ISAs, 14

Variables

and compiler technology, A-27 to A-29

CUDA, 289

Fermi GPU, 306

ISA, A-5, A-12

locks via coherence, 389

loop-level parallelism, 316

memory consistency, 392

NVIDIA GPU Memory, 304–305

procedure invocation options, A-19

random, distribution, D-26 to D-34

in registers, A-5

synchronization, 375

TLP programmer’s viewpoint, 394

VCs See Virtual channels (VCs)

Vector architectures

computer development, L-44 to L-49

definition, 9

DLP

basic considerations, 264

definition terms, 309

gather/scatter operations, 279–280

multidimensional arrays, 278–279

multiple lanes, 271–273

programming, 280–282

vector execution time, 268–271

vector-length registers, 274–275

vector load/store unit bandwidth, 276–277

vector-mask registers, 275–276

vector processor example, 267–268

VMIPS, 264–267

GPU conditional branching, 303

vs. GPUs, 308–312

mapping examples, 293

memory systems, G-9 to G-11

multimedia instruction compiler support, A-31

vs. Multimedia SIMD Extensions, 282

peak performance vs. start-up overhead, 331

power/DLP issues, 322

vs. scalar performance, 331–332

start-up latency and dead time, G-8

strided access-TLB interactions, 323

vector-register characteristics, G-3

Vector Functional Unit

vector add instruction, 272–273

vector execution time, 269

vector sequence chimes, 270

VMIPS, 264

Vector Instruction

definition, 292, 309

DLP, 322

Fermi GPU, 305

gather-scatter, 280

instruction-level parallelism, 150

mask registers, 275–276

Multimedia SIMD Extensions, 282

multiple lanes, 271–273

Thread of Vector Instructions, 292

vector execution time, 269

vector vs. GPU, 308, 311

vector processor example, 268

VMIPS, 265–267, 266

Vectorizable Loop

characteristics, 268

definition, 268, 292, 313

Grid mapping, 293

Livermore Fortran kernel performance, 331

mapping example, 293

NVIDIA GPU computational structures, 291

Vectorized code

multimedia compiler support, A-31

vector architecture programming, 280–282

vector execution time, 271

VMIPS, 268

Vectorized Loop See also Body of Vectorized Loop

definition, 309

GPU Memory structure, 304

vs. Grid, 291, 308

mask registers, 275

NVIDIA GPU, 295

vector vs. GPU, 308

Vectorizing compilers

effectiveness, G-14 to G-15

FORTRAN test kernels, G-15

sparse matrices, G-12 to G-13

Vector Lane Registers, definition, 292

Vector Lanes

control processor, 311

definition, 292, 309

SIMD Processor, 296–297, 297

Vector-length register (VLR)

basic operation, 274–275

performance, G-5

VMIPS, 267

Vector load/store unit

memory banks, 276–277

VMIPS, 265

Vector loops

NVIDIA GPU, 294

processor example, 267

strip-mining, 303

vector vs. GPU, 311

vector-length registers, 274–275

vector-mask registers, 275–276

Vector-mask control, characteristics, 275–276

Vector-mask registers

basic operation, 275–276

Cray X1, G-21 to G-22

VMIPS, 267

Vector Processor

caches, 305

compiler vectorization, 281

Cray X1

MSP modules, G-22

overview, G-21 to G-23

Cray X1E, G-24

definition, 292, 309

DLP processors, 322

DSP media extensions, E-10

example, 267–268

execution time, G-7

functional units, 272

gather-scatter, 280

vs. GPUs, 276

historical background, G-26

loop-level parallelism, 150

loop unrolling, 196

measures, G-15 to G-16

memory banks, 277

and multiple lanes, 273, 310

multiprocessor architecture, 346

NVIDIA GPU computational structures, 291

overview, G-25 to G-26

peak performance focus, 331

performance, G-2 to G-7

start-up and multiple lanes, G-7 to G-9

performance comparison, 58

performance enhancement

chaining, G-11 to G-12

DAXPY on VMIPS, G-19 to G-21

sparse matrices, G-12 to G-14

PTX, 301

Roofline model, 286–287, 287

vs. scalar processor, 311, 331, 333, G-19

vs. SIMD Processor, 294–296

Sony PlayStation 2 Emotion Engine, E-17 to E-18

start-up overhead, G-4

stride, 278

strip mining, 275

vector execution time, 269–271

vector/GPU comparison, 308

vector kernel implementation, 334–336

VMIPS, 264–265

VMIPS on DAXPY, G-17

VMIPS on Linpack, G-17 to G-19

Vector Registers

definition, 309

execution time, 269, 271

gather-scatter, 280

multimedia compiler support, A-31

Multimedia SIMD Extensions, 282

multiple lanes, 271–273

NVIDIA GPU, 297

NVIDIA GPU ISA, 298

performance/bandwidth trade-offs, 332

processor example, 267

strides, 278–279

vector vs. GPU, 308, 311

VMIPS, 264–267, 266

Very-large-scale integration (VLSI)

early computer arithmetic, J-63

interconnection network topology, F-29

RISC history, L-20

Wallace tree, J-53

Very Long Instruction Word (VLIW)

clock rates, 244

compiler scheduling, L-31

EPIC, L-32

IA-64, H-33 to H-34

ILP, 193–196

loop-level parallelism, 315

M32R, K-39 to K-40

multiple-issue processors, 194, L-28 to L-30

multithreading history, L-34

sample code, 252

TI 320C6x DSP, E-8 to E-10

VGA controller, L-51

Video

Amazon Web Services, 460

application trends, 4

PMDs, 6

WSCs, 8, 432, 437, 439

Video games, multimedia support, K-17

VI interface, L-73

Virtual address

address translation, B-46

AMD64 paged virtual memory, B-55

AMD Opteron data cache, B-12 to B-13

ARM Cortex-A8, 115

cache optimization, B-36 to B-39

GPU conditional branching, 303

Intel Core i7, 120

mapping to physical, B-45

memory hierarchy, B-39, B-48, B-48 to B-49

memory hierarchy basics, 77–78

miss rate vs. cache size, B-37

Opteron mapping, B-55

Opteron memory management, B-55 to B-56

and page size, B-58

page table-based mapping, B-45

translation, B-36 to B-39

virtual memory, B-42, B-49

Virtual address space

example, B-41

main memory block, B-44

Virtual caches

definition, B-36 to B-37

issues with, B-38

Virtual channels (VCs), F-47

HOL blocking, F-59

Intel SCCC, F-70

routing comparison, F-54

switching, F-51 to F-52

switch microarchitecture pipelining, F-61

system area network history, F-101

and throughput, F-93

Virtual cut-through switching, F-51

Virtual functions, control flow instructions, A-18

Virtualizable architecture

Intel 80x86 issues, 128

system call performance, 141

Virtual Machines support, 109

VMM implementation, 128–129

Virtualizable GPUs, future technology, 333

Virtual machine monitor (VMM)

characteristics, 108

nonvirtualizable ISA, 126, 128–129

requirements, 108–109

Virtual Machines ISA support, 109–110

Xen VM, 111

Virtual Machines (VMs)

Amazon Web Services, 456–457

cloud computing costs, 471

early IBM work, L-10

ISA support, 109–110

protection, 107–108

protection and ISA, 112

server benchmarks, 40

and virtual memory and I/O, 110–111

WSCs, 436

Xen VM, 111

Virtual memory

basic considerations, B-40 to B-44, B-48 to B-49

basic questions, B-44 to B-46

block identification, B-44 to B-45

block placement, B-44

block replacement, B-45

vs. caches, B-42 to B-43

classes, B-43

definition, B-3

fast address translation, B-46

Multimedia SIMD Extensions, 284

multithreading, 224

paged example, B-54 to B-57

page size selection, B-46 to B-47

parameter ranges, B-42

Pentium vs. Opteron protection, B-57

protection, 105–107

segmented example, B-51 to B-54

strided access-TLB interactions, 323

terminology, B-42

Virtual Machines impact, 110–111

writes, B-45 to B-46

Virtual methods, control flow instructions, A-18

Virtual output queues (VOQs), switch microarchitecture, F-60

VLIW See Very Long Instruction Word (VLIW)

VLR See Vector-length register (VLR)

VLSI See Very-large-scale integration (VLSI)

VMCS See Virtual Machine Control State (VMCS)

VME rack

example, D-38

Internet Archive Cluster, D-37

VMIPS

basic structure, 265

DAXPY, G-18 to G-20

DLP, 265–267

double-precision FP operations, 266

enhanced, DAXPY performance, G-19 to G-21

gather/scatter operations, 280

ISA components, 264–265

multidimensional arrays, 278–279

Multimedia SIMD Extensions, 282

multiple lanes, 271–272

peak performance on DAXPY, G-17

performance, G-4

performance on Linpack, G-17 to G-19

sparse matrices, G-13

start-up penalties, G-5

vector execution time, 269–270, G-6 to G-7

vector vs. GPU, 308

vector-length registers, 274

vector load/store unit bandwidth, 276

vector performance measures, G-16

vector processor example, 267–268

VLR, 274

VMM See Virtual machine monitor (VMM)

VMs See Virtual Machines (VMs)

Voltage regulator controller (VRC), Intel SCCC, F-70

Voltage regulator modules (VRMs), WSC server energy efficiency, 462

Volume-cost relationship, components, 27–28

Von Neumann, John, L-2 to L-6

Von Neumann computer, L-3

Voodoo2, L-51

VOQs See Virtual output queues (VOQs)

VRC See Voltage regulator controller (VRC)

VRMs See Voltage regulator modules (VRMs)


Table of Contents for Computer Architecture: A Quantitative Approach
