V

Valid bit
address translation, B-46
block identification, B-7
Opteron data cache, B-14
paged virtual memory, B-56
segmented virtual memory, B-52
snooping, 357
symmetric shared-memory multiprocessors, 366
Value prediction
definition, 202
hardware-based speculation, 192
ILP, 212–213, 220
speculation, 208
VAPI, InfiniBand, F-77
Variable length encoding
control flow instruction branches, A-18
instruction sets, A-22
ISAs, 14
Variables
and compiler technology, A-27 to A-29
CUDA, 289
Fermi GPU, 306
ISA, A-5, A-12
locks via coherence, 389
loop-level parallelism, 316
memory consistency, 392
NVIDIA GPU Memory, 304–305
procedure invocation options, A-19
random, distribution, D-26 to D-34
register allocation, A-26 to A-27
in registers, A-5
synchronization, 375
TLP programmer’s viewpoint, 394
Vector architectures
computer development, L-44 to L-49
definition, 9
DLP
basic considerations, 264
definition terms, 309
gather/scatter operations, 279–280
multidimensional arrays, 278–279
multiple lanes, 271–273
programming, 280–282
vector execution time, 268–271
vector-length registers, 274–275
vector load/store unit bandwidth, 276–277
vector-mask registers, 275–276
vector processor example, 267–268
VMIPS, 264–267
GPU conditional branching, 303
vs. GPUs, 308–312
mapping examples, 293
memory systems, G-9 to G-11
multimedia instruction compiler support, A-31
vs. Multimedia SIMD Extensions, 282
peak performance vs. start-up overhead, 331
power/DLP issues, 322
vs. scalar performance, 331–332
start-up latency and dead time, G-8
strided access-TLB interactions, 323
vector-register characteristics, G-3
Vector Functional Unit
vector add instruction, 272–273
vector execution time, 269
vector sequence chimes, 270
VMIPS, 264
Vector Instruction
definition, 292, 309
DLP, 322
Fermi GPU, 305
gather-scatter, 280
instruction-level parallelism, 150
mask registers, 275–276
Multimedia SIMD Extensions, 282
multiple lanes, 271–273
Thread of Vector Instructions, 292
vector execution time, 269
vector vs. GPU, 308, 311
vector processor example, 268
VMIPS, 265–267, 266
Vectorizable Loop
characteristics, 268
definition, 268, 292, 313
Grid mapping, 293
Livermore Fortran kernel performance, 331
mapping example, 293
NVIDIA GPU computational structures, 291
Vectorized code
multimedia compiler support, A-31
vector architecture programming, 280–282
vector execution time, 271
VMIPS, 268
Vectorized Loop. See also Body of Vectorized Loop
definition, 309
GPU Memory structure, 304
vs. Grid, 291, 308
mask registers, 275
NVIDIA GPU, 295
vector vs. GPU, 308
Vectorizing compilers
effectiveness, G-14 to G-15
FORTRAN test kernels, G-15
sparse matrices, G-12 to G-13
Vector Lane Registers, definition, 292
Vector Lanes
control processor, 311
definition, 292, 309
SIMD Processor, 296–297, 297
Vector-length register (VLR)
basic operation, 274–275
performance, G-5
VMIPS, 267
Vector load/store unit
memory banks, 276–277
VMIPS, 265
Vector loops
NVIDIA GPU, 294
processor example, 267
strip-mining, 303
vector vs. GPU, 311
vector-length registers, 274–275
vector-mask registers, 275–276
Vector-mask control, characteristics, 275–276
Vector-mask registers
basic operation, 275–276
Cray X1, G-21 to G-22
VMIPS, 267
Vector Processor
caches, 305
compiler vectorization, 281
Cray X1
MSP modules, G-22
overview, G-21 to G-23
Cray X1E, G-24
definition, 292, 309
DLP processors, 322
DSP media extensions, E-10
example, 267–268
execution time, G-7
functional units, 272
gather-scatter, 280
vs. GPUs, 276
historical background, G-26
loop-level parallelism, 150
loop unrolling, 196
measures, G-15 to G-16
memory banks, 277
and multiple lanes, 273, 310
multiprocessor architecture, 346
NVIDIA GPU computational structures, 291
overview, G-25 to G-26
peak performance focus, 331
performance, G-2 to G-7
start-up and multiple lanes, G-7 to G-9
performance comparison, 58
performance enhancement
chaining, G-11 to G-12
DAXPY on VMIPS, G-19 to G-21
sparse matrices, G-12 to G-14
PTX, 301
Roofline model, 286–287, 287
vs. scalar processor, 311, 331, 333, G-19
vs. SIMD Processor, 294–296
Sony PlayStation 2 Emotion Engine, E-17 to E-18
start-up overhead, G-4
stride, 278
strip mining, 275
vector execution time, 269–271
vector/GPU comparison, 308
vector kernel implementation, 334–336
VMIPS, 264–265
VMIPS on DAXPY, G-17
VMIPS on Linpack, G-17 to G-19
Vector Registers
definition, 309
execution time, 269, 271
gather-scatter, 280
multimedia compiler support, A-31
Multimedia SIMD Extensions, 282
multiple lanes, 271–273
NVIDIA GPU, 297
NVIDIA GPU ISA, 298
performance/bandwidth trade-offs, 332
processor example, 267
strides, 278–279
vector vs. GPU, 308, 311
VMIPS, 264–267, 266
Very-large-scale integration (VLSI)
early computer arithmetic, J-63
interconnection network topology, F-29
RISC history, L-20
Wallace tree, J-53
Very Long Instruction Word (VLIW)
clock rates, 244
compiler scheduling, L-31
EPIC, L-32
IA-64, H-33 to H-34
ILP, 193–196
loop-level parallelism, 315
M32R, K-39 to K-40
multiple-issue processors, 194, L-28 to L-30
multithreading history, L-34
sample code, 252
TI 320C6x DSP, E-8 to E-10
VGA controller, L-51
Video
Amazon Web Services, 460
application trends, 4
PMDs, 6
WSCs, 8, 432, 437, 439
Video games, multimedia support, K-17
VI interface, L-73
Virtual address
address translation, B-46
AMD64 paged virtual memory, B-55
AMD Opteron data cache, B-12 to B-13
ARM Cortex-A8, 115
cache optimization, B-36 to B-39
GPU conditional branching, 303
Intel Core i7, 120
mapping to physical, B-45
memory hierarchy, B-39, B-48, B-48 to B-49
memory hierarchy basics, 77–78
miss rate vs. cache size, B-37
Opteron mapping, B-55
Opteron memory management, B-55 to B-56
and page size, B-58
page table-based mapping, B-45
translation, B-36 to B-39
virtual memory, B-42, B-49
Virtual address space
example, B-41
main memory block, B-44
Virtual caches
definition, B-36 to B-37
issues with, B-38
Virtual channels (VCs), F-47
HOL blocking, F-59
Intel SCCC, F-70
routing comparison, F-54
switching, F-51 to F-52
switch microarchitecture pipelining, F-61
system area network history, F-101
and throughput, F-93
Virtual cut-through switching, F-51
Virtual functions, control flow instructions, A-18
Virtualizable architecture
Intel 80x86 issues, 128
system call performance, 141
Virtual Machines support, 109
VMM implementation, 128–129
Virtualizable GPUs, future technology, 333
Virtual machine monitor (VMM)
characteristics, 108
nonvirtualizable ISA, 126, 128–129
requirements, 108–109
Virtual Machines ISA support, 109–110
Xen VM, 111
Virtual Machines (VMs)
Amazon Web Services, 456–457
cloud computing costs, 471
early IBM work, L-10
ISA support, 109–110
protection, 107–108
protection and ISA, 112
server benchmarks, 40
and virtual memory and I/O, 110–111
WSCs, 436
Xen VM, 111
Virtual memory
basic considerations, B-40 to B-44, B-48 to B-49
basic questions, B-44 to B-46
block identification, B-44 to B-45
block placement, B-44
block replacement, B-45
vs. caches, B-42 to B-43
classes, B-43
definition, B-3
fast address translation, B-46
Multimedia SIMD Extensions, 284
multithreading, 224
paged example, B-54 to B-57
page size selection, B-46 to B-47
parameter ranges, B-42
Pentium vs. Opteron protection, B-57
protection, 105–107
segmented example, B-51 to B-54
strided access-TLB interactions, 323
terminology, B-42
Virtual Machines impact, 110–111
writes, B-45 to B-46
Virtual methods, control flow instructions, A-18
Virtual output queues (VOQs), switch microarchitecture, F-60
VME rack
example, D-38
Internet Archive Cluster, D-37
VMIPS
basic structure, 265
DAXPY, G-18 to G-20
DLP, 265–267
double-precision FP operations, 266
enhanced, DAXPY performance, G-19 to G-21
gather/scatter operations, 280
ISA components, 264–265
multidimensional arrays, 278–279
Multimedia SIMD Extensions, 282
multiple lanes, 271–272
peak performance on DAXPY, G-17
performance, G-4
performance on Linpack, G-17 to G-19
sparse matrices, G-13
start-up penalties, G-5
vector execution time, 269–270, G-6 to G-7
vector vs. GPU, 308
vector-length registers, 274
vector load/store unit bandwidth, 276
vector performance measures, G-16
vector processor example, 267–268
VLR, 274
Voltage regulator controller (VRC), Intel SCCC, F-70
Voltage regulator modules (VRMs), WSC server energy efficiency, 462
Volume-cost relationship, components, 27–28
Von Neumann, John, L-2 to L-6
Von Neumann computer, L-3
Voodoo2, L-51