Index

Page references in bold represent figures and tables.

Numbers

2:1 cache rule of thumb, definition, B-29

A

ABC (Atanasoff Berry Computer), L-5
Absolute addressing mode, Intel 80x86, K-47
Accelerated Strategic Computing Initiative (ASCI)
ASCI Red, F-100
ASCI White, F-67, F-100
system area network history, F-101
Access 1/Access 2 stages, TI 320C55 DSP, E-7
Access bit
IA-32 descriptor table, B-52
Access time See also Average Memory Access Time (AMAT)
vs. block size, B-28
distributed-memory multiprocessor, 348
DRAM/magnetic disk, D-3
memory hierarchy basics, 77
miss penalties, 218, B-42
NUMA, 348
paging, B-43
shared-memory multiprocessor, 347, 363
slowdown causes, B-3
TLP workloads, 369–370
during write, B-45
WSC memory hierarchy, 444
Access time gap, disk storage, D-3
Acknowledgment, packets, F-16
ACS project, L-28 to L-29
Active low power modes, WSCs, 472
Ada language, integer division/remainder, J-12
Adaptive routing
definition, F-47
vs. deterministic routing, F-52 to F-55, F-54
network fault tolerance, F-94
and overhead, F-93 to F-94
Adders
carry-lookahead, J-37 to J-41
chip comparison, J-60
full, J-2, J-3
half, J-2
integer division speedup, J-54 to J-58
integer multiplication speedup
even/odd array, J-52
many adders, J-50, J-50 to J-54
multipass array multiplier, J-51
signed-digit addition table, J-54
single adder, J-47 to J-49, J-48 to J-49
Wallace tree, J-53
radix-2 division, J-55
radix-4 division, J-56
radix-4 SRT division, J-57
ripple-carry, J-3, J-3
time/space requirements, J-44
Addition operations
chip comparison, J-61
floating point
denormals, J-26 to J-27
overview, J-21 to J-25
rules, J-24
speedup, J-25 to J-26
integer, speedup
carry-lookahead, J-37 to J-41
carry-lookahead circuit, J-38
carry-lookahead tree, J-40
carry-lookahead tree adder, J-41
carry-select adder, J-43, J-43 to J-44, J-44
carry-skip adder, J-41 to J-43, J-42
overview, J-37
ripple-carry addition, J-3
Address aliasing prediction
definition, 213
ideal processor, 214
ILP for realizable processors, 216
Address Coalescing Unit
function, 310
gather-scatter, 329
GPUs, 300
Multithreaded SIMD Processor block diagram, 294
vector processor, 310
Address fault, virtual memory definition, B-42
Addressing modes
comparison, A-11
compiler writing-architecture relationship, A-30
control flow instructions, A-17 to A-18
desktop architectures, K-5
displacement mode, A-10
embedded architectures, K-6
instruction set encoding, A-21
Intel 80x86, K-47 to K-49, K-58 to K-59, K-59 to K-60
Intel 80x86 operands, K-59
MIPS data transfers, A-34
RISC architectures, K-5 to K-6
selection, A-9
VAX, K-66 to K-68, K-71
VAX instruction encoding, K-68 to K-69
Address offset, virtual memory, B-56
Address space
Fermi GPU architecture, 306–307
memory hierarchy, B-48 to B-49, B-57 to B-58
Multimedia SIMD vs. GPUs, 312
SMP/DSM shared memory, 348
virtual memory, B-40 to B-41
Address specifier
instruction set encoding, A-21
VAX instruction encoding, K-68 to K-69
Address stage, TI 320C55 DSP, E-7
Address trace, cache performance, B-4
Address translation
AMD64 paged virtual memory, B-55 to B-56
during indexing, B-36 to B-40
memory hierarchy basics, 77–78
Opteron data TLB, B-47
virtual memory, B-46
virtual memory definition, B-42
virtual memory protection, 106
Administrative costs, WSC vs. datacenters, 455
Adobe Photoshop, multimedia support, K-17
Advanced directory protocol
basic function, 283
case studies, 420–426
Advanced load address table (ALAT)
IA-64 ISA, H-40
vector sparse matrices, G-13
Advanced loads, IA-64 ISA, H-40
Advanced mobile phone service (AMPS), cell phones, E-25
Advanced Research Projects Agency See ARPA (Advanced Research Projects Agency)
Advanced RISC Machine See ARM (Advanced RISC Machine)
Advanced Simulation and Computing (ASC) program, system area network history, F-101
Advanced Switching Interconnect (ASI), storage area network history, F-103
Advanced Switching SAN, F-67
Advanced Technology Attachment disks See ATA (Advanced Technology Attachment) disks
Advanced Vector Extensions (AVX)
double-precision FP programs, 284
vs. vector architectures, 282
Affine, loop-level parallelism dependences, 318–320, H-6
After rounding rule, J-36
Aggregate bandwidth
definition, F-13
effective bandwidth calculations, F-18 to F-19
interconnection networks, F-89
routing, F-47
shared- vs. switched-media networks, F-22, F-24 to F-25
switched-media networks, F-24
switch microarchitecture, F-56
Aiken, Howard, L-3 to L-4
Airflow
containers, 466
Google WSC server, 467
Airside economization, WSC cooling systems, 449
Akamai, as Content Delivery Network, 460
Alewife machine, L-61
ALGOL, L-16
Aliased variables, and compiler technology, A-27 to A-28
Aliases, address translation, B-38
Alignment, memory address interpretation, A-7 to A-8, A-8
Allen, Fran, L-28
Alliant processors, vector processor history, G-26
AltaVista search
cluster history, L-62, L-73
shared-memory workloads, 369, 370
Amazon
cloud computing, 455
Dynamo, 438, 452
Amazon Elastic Compute Cloud (EC2), 456–457
MapReduce cost calculations, 458–459
price and characteristics, 458
utility computing, L-74
Amazon Simple Storage Service (S3), 456–457
Amazon Web Services (AWS)
cloud computing providers, 471–472
MapReduce cost calculations, 458–460, 459
as utility computing, 456–461
WSC cost-performance, 474
Xen VM, 111
Amdahl, Gene, L-28
Amdahl’s law
computer design principles, 46–48
computer system power consumption case study, 63–64
DRAM, 99
and parallel computers, 406–407
parallel processing calculations, 349–350
pitfalls, 55–56
vs. processor performance equation, 51
scalar performance, 331
software overhead, F-91
VMIPS on Linpack, G-18
WSC processor cost-performance, 472–473
AMD Athlon 64, Itanium 2 comparison, H-43
AMD Barcelona microprocessor, Google WSC server, 467
AMD Fusion, L-52
AMD K-5, L-30
AMD Opteron
address translation, B-38
Amazon Web Services, 457
architecture, 15
cache coherence, 361
data cache example, B-12 to B-15, B-13
Google WSC servers, 468–469
inclusion, 398
manufacturing cost, 62
misses per instruction, B-15
MOESI protocol, 362
multicore processor performance, 400–401
multilevel exclusion, B-35
NetApp FAS6000 filer, D-42
paged virtual memory example, B-54 to B-57
vs. Pentium protection, B-57
real-world server considerations, 52–55
server energy savings, 25
snooping limitations, 363–364
SPEC benchmarks, 43
TLB during address translation, B-47
AMD processors
architecture flaws vs. success, A-45
GPU computing history, L-52
power consumption, F-85
recent advances, L-33
RISC history, L-22
shared-memory multiprogramming workload, 378
terminology, 313–315
tournament predictors, 164
Virtual Machines, 110
VMMs, 129
Amortization of overhead, sorting case study, D-64 to D-67
Andreessen, Marc, F-98
Android OS, 324
Annulling delayed branch, instructions, K-25
Antenna, radio receiver, E-23
Antialiasing, address translation, B-38
Antidependences
compiler history, L-30 to L-31
definition, 152
finding, H-7 to H-8
loop-level parallelism calculations, 320
MIPS scoreboarding, C-72, C-79
Apogee Software, A-44
Apollo DN 10000, L-30
Apple iPad
ARM Cortex-A8, 114
memory hierarchy basics, 78
Application binary interface (ABI), control flow instructions, A-20
Application layer, definition, F-82
Applied Minds, L-74
Arbitration algorithm
collision detection, F-23
commercial interconnection networks, F-56
examples, F-49
Intel SCCC, F-70
interconnection networks, F-21 to F-22, F-27, F-49 to F-50
network impact, F-52 to F-55
SAN characteristics, F-76
switched-media networks, F-24
switch microarchitecture, F-57 to F-58
switch microarchitecture pipelining, F-60
system area network history, F-100
Architect-compiler writer relationship, A-29 to A-30
Architecturally visible registers, register renaming vs. ROB, 208–209
Architectural Support for Compilers and Operating Systems (ASPLOS), L-11
Architecture See also Computer architecture; CUDA (Compute Unified Device Architecture); Instruction set architecture (ISA); Vector architectures
compiler writer-architect relationship, A-29 to A-30
definition, 15
heterogeneous, 262
microarchitecture, 15–16, 247–254
Areal density, disk storage, D-2
Argument pointer, VAX, K-71
Arithmetic intensity
as FP operation, 286, 286–288
Roofline model, 326, 326–327
Arithmetic/logical instructions
desktop RISCs, K-11, K-22
embedded RISCs, K-15, K-24
Intel 80x86, K-49, K-53
SPARC, K-31
VAX, B-73
Arithmetic-logical units (ALUs)
ARM Cortex-A8, 234, 236
basic MIPS pipeline, C-36
branch condition evaluation, A-19
data forwarding, C-40 to C-41
data hazards requiring stalls, C-19 to C-20
data hazard stall minimization, C-17 to C-19
DSP media extensions, E-10
effective address cycle, C-6
hardware-based execution, 185
hardware-based speculation, 200–201, 201
IA-64 instructions, H-35
immediate operands, A-12
integer division, J-54
integer multiplication, J-48
integer shifting over zeros, J-45 to J-46
Intel Core i7, 238
ISA operands, A-4 to A-5
ISA performance and efficiency prediction, 241
load interlocks, C-39
microarchitectural techniques case study, 253
MIPS operations, A-35, A-37
MIPS pipeline control, C-38 to C-39
MIPS pipeline FP operations, C-52 to C-53
MIPS R4000, C-65
operand forwarding, C-19
operands per instruction example, A-6
parallelism, 45
pipeline branch issues, C-39 to C-41
pipeline execution rate, C-10 to C-11
power/DLP issues, 322
RISC architectures, K-5
RISC classic pipeline, C-7
RISC instruction set, C-4
simple MIPS implementation, C-31 to C-33
TX-2, L-49
ARM (Advanced RISC Machine)
addressing modes, K-5, K-6
arithmetic/logical instructions, K-15, K-24
characteristics, K-4
condition codes, K-12 to K-13
constant extension, K-9
control flow instructions, 14
data transfer instructions, K-23
embedded instruction format, K-8
GPU computing history, L-52
ISA class, 11
memory addressing, 11
multiply-accumulate, K-20
operands, 12
RISC instruction set lineage, K-43
unique instructions, K-36 to K-37
ARM AMBA, OCNs, F-3
ARM Cortex-A8
dynamic scheduling, 170
ILP concepts, 148
instruction decode, 234
ISA performance and efficiency prediction, 241–243
memory access penalty, 117
memory hierarchy design, 78, 114–117, 115
memory performance, 115–117
multibanked caches, 86
overview, 233
pipeline performance, 233–236, 235
pipeline structure, 232
processor comparison, 242
way prediction, 81
ARM Cortex-A9
vs. A8 performance, 236
Tegra 2, mobile vs. server GPUs, 323–324, 324
ARM Thumb
addressing modes, K-6
arithmetic/logical instructions, K-24
characteristics, K-4
condition codes, K-14
constant extension, K-9
data transfer instructions, K-23
embedded instruction format, K-8
ISAs, 14
multiply-accumulate, K-20
RISC code size, A-23
unique instructions, K-37 to K-38
ARPA (Advanced Research Projects Agency)
LAN history, F-99 to F-100
WAN history, F-97
ARPANET, WAN history, F-97 to F-98
Array multiplier
example, J-50
integers, J-50
multipass system, J-51
Arrays
access age, 91
blocking, 89–90
bubble sort procedure, K-76
cluster server outage/anomaly statistics, 435
examples, 90
FFT kernel, I-7
Google WSC servers, 469
Layer 3 network linkage, 445
loop interchange, 88–89
loop-level parallelism dependences, 318–319
ocean application, I-9 to I-10
recurrences, H-12
WSC memory hierarchy, 445
WSCs, 443
Array switch, WSCs, 443–444
ASCII character format, 12, A-14
ASC Purple, F-67, F-100
Assembly language, 2
Association for Computing Machinery (ACM), L-3
Associativity See also Set associativity
cache block, B-9 to B-10, B-10
cache optimization, B-22 to B-24, B-26, B-28 to B-30
cloud computing, 460–461
loop-level parallelism, 322
multilevel inclusion, 398
Opteron data cache, B-14
shared-memory multiprocessors, 368
Astronautics ZS-1, L-29
Asynchronous events, exception requirements, C-44 to C-45
Asynchronous I/O, storage systems, D-35
Asynchronous Transfer Mode (ATM)
interconnection networks, F-89
LAN history, F-99
packet format, F-75
total time statistics, F-90
VOQs, F-60
as WAN, F-79
WAN history, F-98
WANs, F-4
ATA (Advanced Technology Attachment) disks
Berkeley’s Tertiary Disk project, D-12
disk storage, D-4
historical background, L-81
power, D-5
RAID 6, D-9
server energy savings, 25
Atanasoff, John, L-5
Atanasoff Berry Computer (ABC), L-5
ATI Radeon 9700, L-51
Atlas computer, L-9
ATM systems
server benchmarks, 41
TP benchmarks, D-18
Atomic exchange
lock implementation, 389–390
synchronization, 387–388
Atomic instructions
barrier synchronization, I-14
Core i7, 329
Fermi GPU, 308
T1 multithreading unicore performance, 229
Atomicity-consistency-isolation-durability (ACID), vs. WSC storage, 439
Atomic operations
cache coherence, 360–361
snooping cache coherence implementation, 365
“Atomic swap,” definition, K-20
Attributes field, IA-32 descriptor table, B-52
Autoincrement deferred addressing, VAX, K-67
Autonet, F-48
Availability
commercial interconnection networks, F-66
computer architecture, 11, 15
computer systems, D-43 to D-44, D-44
data on Internet, 344
fault detection, 57–58
I/O system design/evaluation, D-36
loop-level parallelism, 217–218
mainstream computing classes, 5
modules, 34
open-source software, 457
RAID systems, 60
as server characteristic, 7
servers, 16
source operands, C-74
Average instruction execution time, L-6
Average Memory Access Time (AMAT)
block size calculations, B-26 to B-28
cache optimizations, B-22, B-26 to B-32, B-36
cache performance, B-16 to B-21
calculation, B-16 to B-17
centralized shared-memory architectures, 351–352
definition, B-30 to B-31
memory hierarchy basics, 75–76
miss penalty reduction, B-32
via miss rates, B-29, B-29 to B-30
as processor performance predictor, B-17 to B-20
Average reception factor
centralized switched networks, F-32
multi-device interconnection networks, F-26