M

MAC See Multiply-accumulate (MAC)

Machine language programmer, L-17 to L-18

Machine memory, Virtual Machines, 110

Macro-op fusion, Intel Core i7, 237–238

Magnetic storage

access time, D-3

cost vs. access time, D-3

historical background, L-77 to L-79

Mail servers, benchmarking, D-20

Main Memory

addressing modes, A-10

address translation, B-46

arithmetic intensity example, 286, 286–288

block placement, B-44

cache function, B-2

cache optimization, B-30, B-36

coherence protocol, 362

definition, 292, 309

DRAM, 17

gather-scatter, 329

GPU vs. MIMD, 327

GPUs and coprocessors, 330

GPU threads, 332

ILP considerations, 245

interlane wiring, 273

linear speedups, 407

memory hierarchy basics, 76

memory hierarchy design, 72

memory mapping, B-42

MIPS operations, A-36

Multimedia SIMD vs. GPUs, 312

multiprocessor cache coherence, 352

paging vs. segmentation, B-43

partitioning, B-50

processor performance calculations, 218–219

RISC code size, A-23

server energy efficiency, 462

symmetric shared-memory multiprocessors, 363

vector processor, G-25

vs. virtual memory, B-3, B-41

virtual memory block identification, B-44 to B-45

virtual memory writes, B-45 to B-46

VLIW, 196

write-back, B-11

write process, B-45

Manufacturing cost

chip fabrication case study, 61–62

cost trends, 27

modern processors, 62

vs. operation cost, 33

MapReduce

cloud computing, 455

cost calculations, 458–460, 459

Google usage, 437

reductions, 321

WSC batch processing, 437–438

WSC cost-performance, 474

Mark-I, L-3 to L-4, L-6

Mark-II, L-4

Mark-III, L-4

Mark-IV, L-4

Mask Registers

basic operation, 275–276

definition, 309

Multimedia SIMD, 283

NVIDIA GPU computational structures, 291

vector compilers, 303

vector vs. GPU, 311

VMIPS, 267

MasPar, L-44

Massively parallel processors (MPPs)

characteristics, I-45

cluster history, L-62, L-72 to L-73

system area network history, F-100 to F-101

Matrix300 kernel

definition, 56

prediction buffer, C-29

Matrix multiplication

benchmarks, 56

LU kernel, I-8

multidimensional arrays in vector architectures, 278

Mauchly, John, L-2 to L-3, L-5, L-19

Maximum transfer unit, network interfaces, F-7 to F-8

Maximum vector length (MVL)

Multimedia SIMD extensions, 282

vector vs. GPU, 311

VLRs, 274–275

M-bus See Memory bus (M-bus)

McCreight, Ed, F-99

MCF

compiler optimizations, A-29

data cache misses, B-10

Intel Core i7, 240–241

MCP operating system, L-16

Mean time between failures (MTBF)

fallacies, 56–57

RAID, L-79

SLA states, 34

Mean time to failure (MTTF)

computer system power consumption case study, 63–64

dependability benchmarks, D-21

disk arrays, D-6

example calculations, 34–35

I/O subsystem design, D-59 to D-61

RAID reconstruction, D-55 to D-57

SLA states, 34

TB-80 cluster, D-40 to D-41

WSCs vs. servers, 434

Mean time to repair (MTTR)

dependability benchmarks, D-21

disk arrays, D-6

RAID 6, D-8 to D-9

RAID reconstruction, D-56

Mean time until data loss (MTDL), RAID reconstruction, D-55 to D-57

Media, interconnection networks, F-9 to F-12

Media extensions, DSPs, E-10 to E-11

Mellanox MHEA28-XT, F-76

Memory access

ARM Cortex-A8 example, 117

basic MIPS pipeline, C-36

vs. block size, B-28

cache hit calculation, B-5 to B-6

Cray Research T3D, F-87

data hazards requiring stalls, C-19 to C-21

data hazard stall minimization, C-17, C-19

distributed-memory multiprocessor, I-32

exception stopping/restarting, C-46

hazards and forwarding, C-56 to C-57

instruction set complications, C-49

integrated instruction fetch units, 208

MIPS data transfers, A-34

MIPS exceptions, C-48 to C-49

MIPS pipeline control, C-37 to C-39

MIPS R4000, C-65

multimedia instruction compiler support, A-31

pipeline branch issues, C-40, C-42

RISC classic pipeline, C-7, C-10

shared-memory workloads, 372

simple MIPS implementation, C-32 to C-33

simple RISC implementation, C-6

structural hazards, C-13 to C-14

vector architectures, G-10

Memory addressing

ALU immediate operands, A-12

basic considerations, A-11 to A-13

compiler-based speculation, H-32

displacement values, A-12

immediate value distribution, A-13

interpretation, A-7 to A-8

ISA, 11

vector architectures, G-10

Memory banks See also Banked memory

gather-scatter, 280

multiprocessor architecture, 347

parallelism, 45

shared-memory multiprocessors, 363

strides, 279

vector load/store unit bandwidth, 276–277

vector systems, G-9 to G-11

Memory bus (M-bus)

definition, 351

Google WSC servers, 469

interconnection networks, F-88

Memory consistency

basic considerations, 392–393

cache coherence, 352

compiler optimization, 396

development of models, L-64

directory-based cache coherence protocol basics, 382

multiprocessor cache coherency, 353

relaxed consistency models, 394–395

single-chip multicore processor case study, 412–418

speculation to hide latency, 396–397

Memory-constrained scaling, scientific applications on parallel processors, I-33

Memory hierarchy

address space, B-57 to B-58

basic questions, B-6 to B-12

block identification, B-7 to B-9

block placement issues, B-7

block replacement, B-9 to B-10

cache optimization

basic categories, B-22

basic optimizations, B-40

hit time reduction, B-36 to B-40

miss categories, B-23 to B-26

miss penalty reduction

via multilevel caches, B-30 to B-35

read misses vs. writes, B-35 to B-36

miss rate reduction

via associativity, B-28 to B-30

via block size, B-26 to B-28

via cache size, B-28

pipelined cache access, 82

cache performance, B-3 to B-6

average memory access time, B-17 to B-20

basic considerations, B-16

basic equations, B-22

example calculation, B-16

out-of-order processors, B-20 to B-22

case studies, B-60 to B-67

development, L-9 to L-12

inclusion, 397–398

interconnection network protection, F-87 to F-88

levels in slow down, B-3

Opteron data cache example, B-12 to B-15, B-13

Opteron L1/L2, B-57

OS and page size, B-58

overview, B-39

Pentium vs. Opteron protection, B-57

processor examples, B-3

process protection, B-50

terminology, B-2 to B-3

virtual memory

basic considerations, B-40 to B-44, B-48 to B-49

basic questions, B-44 to B-46

fast address translation, B-46

overview, B-48

paged example, B-54 to B-57

page size selection, B-46 to B-47

segmented example, B-51 to B-54

write strategy, B-10 to B-12

WSCs, 443, 443–446, 444

Memory hierarchy design

access times, 77

Alpha 21264 floorplan, 143

ARM Cortex-A8 example, 114–117, 115–117

cache coherency, 112–113

cache optimization

case study, 131–133

compiler-controlled prefetching, 92–95

compiler optimizations, 87–90

critical word first, 86–87

energy consumption, 81

hardware instruction prefetching, 91–92, 92

multibanked caches, 85–86, 86

nonblocking caches, 83–85, 84

overview, 78–79

pipelined cache access, 82

techniques overview, 96

way prediction, 81–82

write buffer merging, 87, 88

cache performance prediction, 125–126

cache size and misses per instruction, 126

DDR2 SDRAM timing diagram, 139

highly parallel memory systems, 133–136

high memory bandwidth, 126

instruction miss benchmarks, 127

instruction simulation, 126

Intel Core i7, 117–124, 119, 123–125

Intel Core i7 three-level cache hierarchy, 118

Intel Core i7 TLB structure, 118

Intel 80x86 virtualization issues, 128

memory basics, 74–78

overview, 72–74

protection and ISA, 112

server vs. PMD, 72

system call virtualization/paravirtualization performance, 141

virtual machine monitor, 108–109

Virtual Machines ISA support, 109–110

Virtual Machines protection, 107–108

Virtual Machines and virtual memory and I/O, 110–111

virtual memory protection, 105–107

VMM on nonvirtualizable ISA, 128–129

Xen VM example, 111

Memory Interface Unit

NVIDIA GPU ISA, 300

vector processor example, 310

Memoryless, definition, D-28

Memory mapping

memory hierarchy, B-48 to B-49

segmented virtual memory, B-52

TLBs, 323

virtual memory definition, B-42

Memory-memory instruction set architecture, ISA classification, A-3, A-5

Memory protection

control dependence, 155

Pentium vs. Opteron, B-57

processes, B-50

safe calls, B-54

segmented virtual memory example, B-51 to B-54

virtual memory, B-41

Memory stall cycles

average memory access time, B-17

definition, B-4 to B-5

miss rate calculation, B-6

out-of-order processors, B-20 to B-21

performance equations, B-22

Memory system

cache optimization, B-36

coherency, 352–353

commercial workloads, 367, 369–371

computer architecture, 15

C program evaluation, 134–135

dependability enhancement, 104–105

distributed shared-memory, 379, 418

gather-scatter, 280

GDRAMs, 323

GPUs, 332

ILP, 245

hardware vs. software speculation, 221–222

speculative execution, 222–223

Intel Core i7, 237, 242

latency, B-21

MIPS, C-33

multiprocessor architecture, 347

multiprocessor cache coherence, 352

multiprogramming workload, 377–378

page size changes, B-58

price/performance/power considerations, 53

RISC, C-7

Roofline model, 286

shared-memory multiprocessors, 363

SMT, 399–400

stride handling, 279

T1 multithreading unicore performance, 227

vector architectures, G-9 to G-11

vector chaining, G-11

vector processors, 271, 277

virtual, B-43, B-46

Memory technology basics

DRAM, 98, 98–100, 99

DRAM and DIMM characteristics, 101

DRAM performance, 100–102

Flash memory, 102–104

overview, 96–97

performance trends, 20

SDRAM power consumption, 102, 103

SRAM, 97–98

Mesh interface unit (MIU), Intel SCCC, F-70

Mesh network

characteristics, F-73

deadlock, F-47

dimension-order routing, F-47 to F-48

OCN history, F-104

routing example, F-46

Mesh topology

characteristics, F-36

direct networks, F-37

NEWS communication, F-42 to F-43

MESI See Modified-Exclusive-Shared-Invalid (MESI) protocol

Message ID, packet header, F-8, F-16

Message-passing communication

historical background, L-60 to L-61

large-scale multiprocessors, I-5 to I-6

Message Passing Interface (MPI)

function, F-8

InfiniBand, F-77

lack in shared-memory multiprocessors, I-5

Messages

adaptive routing, F-93 to F-94

coherence maintenance, 381

InfiniBand, F-76

interconnection networks, F-6 to F-9

zero-copy protocols, F-91

MFLOPS See Millions of floating-point operations per second (MFLOPS)

Microarchitecture

as architecture component, 15–16

ARM Cortex-A8, 241

Cray X1, G-21 to G-22

data hazards, 168

ILP exploitation, 197

Intel Core i7, 236–237

Nehalem, 411

OCNs, F-3

out-of-order example, 253

PTX vs. x86, 298

switches See Switch microarchitecture

techniques case study, 247–254

Microbenchmarks

disk array deconstruction, D-51 to D-55

disk deconstruction, D-48 to D-51

Microfusion, Intel Core i7 micro-op buffer, 238

Microinstructions

complications, C-50 to C-51

x86, 298

Micro-ops

Intel Core i7, 237, 238–240, 239

processor clock rates, 244

Microprocessor overview

clock rate trends, 24

cost trends, 27–28

desktop computers, 6

embedded computers, 8–9

energy and power, 23–26

inside disks, D-4

integrated circuit improvements, 2

and Moore’s law, 3–4

performance trends, 19–20, 20

power and energy system trends, 21–23

recent advances, L-33 to L-34

technology trends, 18

Microprocessor without Interlocked Pipeline Stages See MIPS (Microprocessor without Interlocked Pipeline Stages)

Microsoft

cloud computing, 455

containers, L-74

Intel support, 245

WSCs, 464–465

Microsoft Azure, 456, L-74

Microsoft DirectX, L-51 to L-52

Microsoft Windows

benchmarks, 38

multithreading, 223

RAID benchmarks, D-22, D-22 to D-23

time/volume/commoditization impact, 28

WSC workloads, 441

Microsoft Windows 2008 Server

real-world considerations, 52–55

SPECpower benchmark, 463

Microsoft XBox, L-51

Migration, cache coherent multiprocessors, 354

Millions of floating-point operations per second (MFLOPS)

early performance measures, L-7

parallel processing debates, L-57 to L-58

SIMD computer history, L-55

SIMD supercomputer development, L-43

vector performance measures, G-15 to G-16

MIMD (Multiple Instruction Streams, Multiple Data Streams)

and Amdahl’s law, 406–407

definition, 10

early computers, L-56

first vector computers, L-46, L-48

GPU programming, 289

GPUs vs. vector architectures, 310

with Multimedia SIMD, vs. GPU, 324–330

multiprocessor architecture, 346–348

speedup via parallelism, 263

TLP, basic considerations, 344–345

Minicomputers, replacement by microprocessors, 3–4

Minnespec benchmarks

ARM Cortex-A8, 116, 235

ARM Cortex-A8 memory, 115–116

MINs See Multistage interconnection networks (MINs)

MIPS (Microprocessor without Interlocked Pipeline Stages)

addressing modes, 11–12

basic pipeline, C-34 to C-36

branch predictor correlation, 163

cache performance, B-6

conditional branches, K-11

conditional instructions, H-27

control flow instructions, 14

data dependences, 151

data hazards, 169

dynamic scheduling with Tomasulo’s algorithm, 171, 173

early pipelined CPUs, L-26

embedded systems, E-15

encoding, 14

exceptions, C-48, C-48 to C-49

exception stopping/restarting, C-46 to C-47

features, K-44

FP pipeline performance, C-60 to C-61, C-62

FP unit with Tomasulo’s algorithm, 173

hazard checks, C-71

ILP, 149

ILP exposure, 157–158

ILP hardware model, 215

instruction execution issues, K-81

instruction formats, core instructions, K-6

instruction set complications, C-49 to C-51

ISA class, 11

ISA example

addressing modes for data transfer, A-34

arithmetic/logical instructions, A-37

basic considerations, A-32 to A-33

control flow instructions, A-37 to A-38, A-38

data types, A-34

dynamic instruction mix, A-41, A-41 to A-42, A-42

FP operations, A-38 to A-39

instruction format, A-35

load-store instructions, A-36

MIPS operations, A-35 to A-37

registers, A-34

usage, A-39

Livermore Fortran kernel performance, 331

memory addressing, 11

multicycle operations

basic considerations, C-51 to C-54

hazards and forwarding, C-54 to C-58

precise exceptions, C-58 to C-60

multimedia support, K-19

multiple-issue processor history, L-29

operands, 12

performance measurement history, L-6 to L-7

pipeline branch issues, C-39 to C-42

pipeline control, C-36 to C-39

pipe stage, C-37

processor performance calculations, 218–219

registers and usage conventions, 12

RISC code size, A-23

RISC history, L-19

RISC instruction set lineage, K-43

as RISC systems, K-4

scoreboard components, C-76

scoreboarding, C-72

scoreboarding steps, C-73, C-73 to C-74

simple implementation, C-31 to C-34, C-34

Sony PlayStation 2 Emotion Engine, E-17

unaligned word read instructions, K-26

unpipelined functional units, C-52

vs. VAX, K-65 to K-66, K-75, K-82

write strategy, B-10

MIPS16

addressing modes, K-6

arithmetic/logical instructions, K-24

characteristics, K-4

constant extension, K-9

data transfer instructions, K-23

embedded instruction format, K-8

instructions, K-14 to K-16

multiply-accumulate, K-20

RISC code size, A-23

unique instructions, K-40 to K-42

MIPS32, vs. VAX sort, K-80

MIPS64

addressing modes, K-5

arithmetic/logical instructions, K-11

conditional branches, K-17

constant extension, K-9

conventions, K-13

data transfer instructions, K-10

FP instructions, K-23

instruction list, K-26 to K-27

instruction set architecture formats, 14

instruction subset, 13, A-40

in MIPS R4000, C-61

nonaligned data transfers, K-24 to K-26

RISC instruction set, C-4

MIPS2000, instruction benchmarks, K-82

MIPS 3010, chip layout, J-59

MIPS core

compare and conditional branch, K-9 to K-16

equivalent RISC instructions

arithmetic/logical, K-11

arithmetic/logical instructions, K-15

common extensions, K-19 to K-24

control instructions, K-12, K-16

conventions, K-16

data transfers, K-10

embedded RISC data transfers, K-14

FP instructions, K-13

instruction formats, K-9

MIPS M2000, L-21, L-21

MIPS MDMX

characteristics, K-18

multimedia support, K-18

MIPS R2000, L-20

MIPS R3000

integer arithmetic, J-12

integer overflow, J-11

MIPS R3010

arithmetic functions, J-58 to J-61

chip comparison, J-58

floating-point exceptions, J-35

MIPS R4000

early pipelined CPUs, L-27

FP pipeline, C-65 to C-67, C-66

integer pipeline, C-63

pipeline overview, C-61 to C-65

pipeline performance, C-67 to C-70

pipeline structure, C-62 to C-63

MIPS R8000, precise exceptions, C-59

MIPS R10000, 81

latency hiding, 397

precise exceptions, C-59

Misalignment, memory address interpretation, A-7 to A-8, A-8

MISD See Multiple Instruction Streams, Single Data Stream

Misprediction rate

branch-prediction buffers, C-29

predictors on SPEC89, 166

profile-based predictor, C-27

SPECCPU2006 benchmarks, 167

Mispredictions

ARM Cortex-A8, 232, 235

branch predictors, 164–167, 240, C-28

branch-target buffers, 205

hardware-based speculation, 190

hardware vs. software speculation, 221

integer vs. FP programs, 212

Intel Core i7, 237

prediction buffers, C-29

static branch prediction, C-26 to C-27

Misses per instruction

application/OS statistics, B-59

cache performance, B-5 to B-6

cache protocols, 359

cache size effect, 126

L3 cache block size, 371

memory hierarchy basics, 75

performance impact calculations, B-18

shared-memory workloads, 372

SPEC benchmarks, 127

strided access-TLB interactions, 323

Miss penalty

average memory access time, B-16 to B-17

cache optimization, 79, B-35 to B-36

cache performance, B-4, B-21

compiler-controlled prefetching, 92–95

critical word first, 86–87

hardware prefetching, 91–92

ILP speculative execution, 223

memory hierarchy basics, 75–76

nonblocking cache, 83

out-of-order processors, B-20 to B-22

processor performance calculations, 218–219

reduction via multilevel caches, B-30 to B-35

write buffer merging, 87

Miss rate

AMD Opteron data cache, B-15

ARM Cortex-A8, 116

average memory access time, B-16 to B-17, B-29

basic categories, B-23

vs. block size, B-27

cache optimization, 79

and associativity, B-28 to B-30

and block size, B-26 to B-28

and cache size, B-28

cache performance, B-4

and cache size, B-24 to B-25

compiler-controlled prefetching, 92–95

compiler optimizations, 87–90

early IBM computers, L-10 to L-11

example calculations, B-6, B-31 to B-32

hardware prefetching, 91–92

Intel Core i7, 123, 125, 241

memory hierarchy basics, 75–76

multilevel caches, B-33

processor performance calculations, 218–219

scientific workloads

distributed-memory multiprocessors, I-28 to I-30

symmetric shared-memory multiprocessors, I-22, I-23 to I-25

shared-memory multiprogramming workload, 376, 376–377

shared-memory workload, 370–373

single vs. multiple thread executions, 228

Sun T1 multithreading unicore performance, 228

vs. virtual addressed cache size, B-37

MIT Raw, characteristics, F-73

Mitsubishi M32R

addressing modes, K-6

arithmetic/logical instructions, K-24

characteristics, K-4

condition codes, K-14

constant extension, K-9

data transfer instructions, K-23

embedded instruction format, K-8

multiply-accumulate, K-20

unique instructions, K-39 to K-40

MIU See Mesh interface unit (MIU)

Mixed cache

AMD Opteron example, B-15

commercial workload, 373

Mixer, radio receiver, E-23

Miya, Eugene, L-65

M/M/1 model

example, D-32, D-32 to D-33

overview, D-30

RAID performance prediction, D-57

sample calculations, D-33

M/M/2 model, RAID performance prediction, D-57

MMX See Multimedia Extensions (MMX)

Mobile clients

data usage, 3

GPU features, 324

vs. server GPUs, 323–330

Modified-Exclusive-Shared-Invalid (MESI) protocol, characteristics, 362

Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol, characteristics, 362

Modified state

coherence protocol, 362

directory-based cache coherence protocol basics, 380

large-scale multiprocessor cache coherence, I-35

snooping coherence protocol, 358–359

Modula-3, integer division/remainder, J-12

Module availability, definition, 34

Module reliability, definition, 34

MOESI See Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol

Moore’s law

DRAM, 100

flawed architectures, A-45

interconnection networks, F-70

and microprocessor dominance, 3–4

point-to-point links and switches, D-34

RISC, A-3

RISC history, L-22

software importance, 55

switch size, F-29

technology trends, 17

Mortar shot graphs, multiprocessor performance measurement, 405–406

Motion JPEG encoder, Sanyo VPC-SX500 digital camera, E-19

Motorola 68000

characteristics, K-42

memory protection, L-10

Motorola 68882, floating-point precisions, J-33

Move address, VAX, K-70

MPEG

Multimedia SIMD Extensions history, L-49

multimedia support, K-17

Sanyo VPC-SX500 digital camera, E-19

Sony PlayStation 2 Emotion Engine, E-17

MPI See Message Passing Interface (MPI)

MPPs See Massively parallel processors (MPPs)

MSP See Multi-Streaming Processor (MSP)

MTBF See Mean time between failures (MTBF)

MTDL See Mean time until data loss (MTDL)

MTTF See Mean time to failure (MTTF)

MTTR See Mean time to repair (MTTR)

Multibanked caches

cache optimization, 85–86

example, 86

Multichip modules, OCNs, F-3

Multicomputers

cluster history, L-63

definition, 345, L-59

historical background, L-64 to L-65

Multicore processors

architecture goals/requirements, 15

cache coherence, 361–362

centralized shared-memory multiprocessor structure, 347

Cray X1E, G-24

directory-based cache coherence, 380

directory-based coherence, 381, 419

DSM architecture, 348, 379

multichip

cache and memory states, 419

with DSM, 419

multiprocessors, 345

OCN history, F-104

performance, 400–401, 401

performance gains, 398–400

performance milestones, 20

single-chip case study, 412–418

and SMT, 404–405

snooping cache coherence implementation, 365

SPEC benchmarks, 402

uniform memory access, 364

write invalidate protocol implementation, 356–357

Multics protection software, L-9

Multicycle operations, MIPS pipeline

basic considerations, C-51 to C-54

hazards and forwarding, C-54 to C-58

precise exceptions, C-58 to C-60

Multidimensional arrays

dependences, 318

in vector architectures, 278–279

Multiflow processor, L-30, L-32

Multigrid methods, Ocean application, I-9 to I-10

Multilevel caches

cache optimizations, B-22

centralized shared-memory architectures, 351

memory hierarchy basics, 76

memory hierarchy history, L-11

miss penalty reduction, B-30 to B-35

miss rate vs. cache size, B-33

Multimedia SIMD vs. GPU, 312

performance equations, B-22

purpose, 397

write process, B-11

Multilevel exclusion, definition, B-35

Multilevel inclusion

definition, 397, B-34

implementation, 397

memory hierarchy history, L-11

Multimedia applications

desktop processor support, E-11

GPUs, 288

ISA support, A-46

MIPS FP operations, A-39

vector architectures, 267

Multimedia Extensions (MMX)

compiler support, A-31

desktop RISCs, K-18

desktop/server RISCs, K-16 to K-19

SIMD history, 262, L-50

vs. vector architectures, 282–283

Multimedia instructions

ARM Cortex-A8, 236

compiler support, A-31 to A-32

Multimedia SIMD Extensions

basic considerations, 262, 282–284

compiler support, A-31

DLP, 322

DSPs, E-11

vs. GPUs, 312

historical background, L-49 to L-50

MIMD, vs. GPU, 324–330

parallelism classes, 10

programming, 285

Roofline visual performance model, 285–288, 287

256-bit-wide operations, 282

vs. vector, 263–264

Multimedia user interfaces, PMDs, 6

Multimode fiber, interconnection networks, F-9

Multipass array multiplier, example, J-51

Multiple Instruction Streams, Multiple Data Streams See MIMD (Multiple Instruction Streams, Multiple Data Streams)

Multiple Instruction Streams, Single Data Stream (MISD), definition, 10

Multiple-issue processors

basic VLIW approach, 193–196

with dynamic scheduling and speculation, 197–202

early development, L-28 to L-30

instruction fetch bandwidth, 202–203

integrated instruction fetch units, 207

loop unrolling, 162

microarchitectural techniques case study, 247–254

primary approaches, 194

SMT, 224, 226

with speculation, 198

Tomasulo’s algorithm, 183

Multiple lanes technique

vector instruction set, 271–273

vector performance, G-7 to G-9

vector performance calculations, G-8

Multiple paths, ILP limitation studies, 220

Multiple-precision addition, J-13

Multiply-accumulate (MAC)

DSP, E-5

embedded RISCs, K-20

TI TMS320C55 DSP, E-8

Multiply operations

chip comparison, J-61

floating point

denormals, J-20 to J-21

examples, J-19

multiplication, J-17 to J-20

precision, J-21

rounding, J-18, J-19

integer arithmetic

array multiplier, J-50

Booth recoding, J-49

even/odd array, J-52

issues, J-11

with many adders, J-50 to J-54

multipass array multiplier, J-51

n-bit unsigned integers, J-4

Radix-2, J-4 to J-7

signed-digit addition table, J-54

with single adder, J-47 to J-49, J-48

Wallace tree, J-53

integer shifting over zeros, J-45 to J-47

PA-RISC instructions, K-34 to K-35

unfinished instructions, 179

Multiprocessor basics

architectural issues and approaches, 346–348

architecture goals/requirements, 15

architecture and software development, 407–409

basic hardware primitives, 387–389

cache coherence, 352–353

coining of term, L-59

communication calculations, 350

computer categories, 10

consistency models, 395

definition, 345

early machines, L-56

embedded systems, E-14 to E-15

fallacies, 55

locks via coherence, 389–391

low-to-high-end roles, 344–345

parallel processing challenges, 349–351

for performance gains, 398–400

performance trends, 21

point-to-point example, 413

shared-memory See Shared-memory multiprocessors

SMP, 345, 350, 354–355, 363–364

Streaming Multiprocessor, 292, 307, 313–314

Multiprocessor history

bus-based coherent multiprocessors, L-59 to L-60

clusters, L-62 to L-64

early computers, L-56

large-scale multiprocessors, L-60 to L-61

parallel processing debates, L-56 to L-58

recent advances and developments, L-58 to L-60

SIMD computers, L-55 to L-56

synchronization and consistency models, L-64

virtual memory, L-64

Multiprogramming

definition, 345

multithreading, 224

performance, 36

shared-memory workload performance, 375–378, 377

shared-memory workloads, 374–375

software optimization, 408

virtual memory-based protection, 105–106, B-49

workload execution time, 375

Multistage interconnection networks (MINs)

bidirectional, F-33 to F-34

crossbar switch calculations, F-31 to F-32

vs. direct network costs, F-92

example, F-31

self-routing, F-48

system area network history, F-100 to F-101

topology, F-30 to F-31, F-38 to F-39

Multistage switch fabrics, topology, F-30

Multi-Streaming Processor (MSP)

Cray X1, G-21 to G-23, G-22, G-23 to G-24

Cray X1E, G-24

first vector computers, L-46

Multithreaded SIMD Processor

block diagram, 294

definition, 292, 309, 313–314

Fermi GPU architectural innovations, 305–308

Fermi GPU block diagram, 307

Fermi GTX 480 GPU floorplan, 295, 295–296

GPU programming, 289–290

GPUs vs. vector architectures, 310, 310–311

Grid mapping, 293

NVIDIA GPU computational structures, 291

NVIDIA GPU Memory structures, 304, 304–305

Roofline model, 326

Multithreaded vector processor

definition, 292

Fermi GPU comparison, 305

Multithreading

coarse-grained, 224–226

definition and types, 223–225

fine-grained, 224–226

GPU programming, 289

historical background, L-34 to L-35

ILP, 223–232

memory hierarchy basics, 75–76

parallel benchmarks, 231, 231–232

for performance gains, 398–400

SMT See Simultaneous multithreading (SMT)

Sun T1 effectiveness, 226–229

MVAPICH, F-77

MVL See Maximum vector length (MVL)

MXP processor, components, E-14

Myrinet SAN, F-67

characteristics, F-76

cluster history, L-62 to L-63, L-73

routing algorithms, F-48

switch vs. NIC, F-86

system area network history, F-100

Table of Contents for Computer Architecture: A Quantitative Approach
