Index

Note: Page numbers followed by f indicate figures and t indicate tables.

A

Adders 
carry-look-ahead adder 103, 103f
prefix computation 103–104
ripple-carry adder 102, 102f
Advanced vector extensions (AVX) 304
Alice programming language 13
Amdahl’s law 199–201
Anonymous processes, Python 34–36, 35f
append method 248
Array processors 304
Arrays 173, 193–194
Asynchronous communication 139
Asynchronous computing 314

B

Barriers 132, 263–264
Bit serial adder 113
Blocking command 20–21, 21f, 22f
Blocking communication 313
Bounded buffer 251–252
Broadcast 90
Broadcast commands 20–21, 22f
Bulk synchronous parallel (BSP) model 
definition 136
direction and velocity 136
grid points 137, 137f
MPI 119
superstep 136, 137f
Bus arbiter 113
Buses 95–96

C

Call stack 165
Carry-look-ahead adder 103, 103f
catch statement 221
Center for Curriculum Development and Educational Resources (CDER) 1–3
Cluster computing 
high-performance networks 140–148
outcomes 117
Clusters 310
Coarse-grained data parallelism 
definition 309
distributed memory 310–314, 311f, 312f
shared memory parallelism 309–310, 310f
task scheduling 314–315
Collective communication 
barrier synchronization 132
broadcast collective 133
gather collective 133, 135
message passing programs 134f
reduce collective 135, 136
Communication 
blocking 127–132
collective 132–136
nonblocking 138–140, 141f, 313–314
point-to-point 127–132
two-sided 127–132
Communicators 128
Compute unified device architecture (CUDA) language 307–308
Concurrency control 161–164
Condition variables 
bounded buffer 251–252
bugs 254, 257
full/empty buffer 255–256
inexcusably inefficient 252
newCondition method 257
notify 253
notifyAll 253, 256
ReentrantLock class 257
spinning 252, 257
wait 253
Context switch 30–31
Convergecast 90, 92–93
Conway’s Game of Life 6
cells 300–301
coarse-grained data parallelism 309–315
data parallelism 303–308
data partitioning 316–318, 316f, 317f
load balancing 320
loop-based parallelism 308–309
outcome 300
programming 301–302
rules of 301
work/minimizing communication 318–319, 319f
Copy-scans 
n ≤ p 275
n > p 275
Counters 
AND gates 108, 108f
flip-flop propagation delay 109f
internet 110
prefix computations 107
ripple 109, 109f
synchronous circuit 109
Cray SeaStar interconnection 100

D

Dataflow 315
Data parallelism 171
GPUs 307–308
sequential reference code 303–304
SIMD 304
vector instructions 304–306, 305f
vector pipelining 306–307
Data partitioning 
distributed case 316
latency 316
matrix-vector product algorithm 317
one-dimensional partitioning 316, 317–318, 317f
strong scaling 317
two-dimensional partitioning 316, 317–318, 317f
weak scaling 317
Deadlock 5, 245–248, 313
Digital logic 
adders 102–104
counters 107–110
field-programmable gate arrays 113–114
finite state machines 113
flip-flops 112–113
hardware 84
Karnaugh maps 99–102, 100f, 101f
latches 112–113
logic gates 90–96
multiplexers 104–106, 105f, 106f, 107f
number representation 85–89
outcomes 83
pedagogical levels, PDC 84t
programmable logic devices 113–114
recursive thinking 88–89
shift registers 110–112, 110f, 111f
timing analysis 97–99, 97f, 98f
tree representation 86–87, 86f
trees and interconnection networks 88–89
unary number vs. binary number 88–89
Verilog 113
Directed acyclic graphs (DAGs) 
applications 98–99
data hazards 98–99
faster circuit/algorithm 98
finite impulse response filter 98
multiplexer circuit 97, 97f, 98f
speedup and parallelism 196–198
steady state 97
synchronization 98–99
threads 99
work and span 194–196, 195f, 197f
Distributed memory parallelism 310–311, 311f
Distributed memory programming 
blocking communication 313
deadlock 313
MPI library 311, 312
nonblocking communication 313–314
pseudocode 314
two-sided communication 312
Divide-and-conquer parallelism 
array elements 181
helper thread 185
implementations of 184
insertion sort 185
java.lang.InterruptedException 183
recursion tree 181, 183f
strategies 5
doubleBalance method 225
Dynamic scheduling 80–81

E

Early adopter program 2
EduPar workshop series 2
Ethernet 140–142, 148, 310
Exclusive scan 269–270

F

Fan-out and fan-in 
associative operations 90
broadcast 90
convergecast 90, 92–93
efficiency 94
feasible schedule 93–94, 94f
Monopoly 92–93, 93f
multicast 90, 92f
OR gate 90
recursive decomposition 94–95
speedup 94
trees 90
finally statement 221
Fine-grained parallelism  See Data parallelism
Flip-flops 112–113
FORALL loop 174
Fork-join parallelism 5, 177
Amdahl’s law 199–201
arrays 173, 193–194
vs. concurrency control 161–164
cores 161
dataflow 171
data parallelism 171
divide-and-conquer parallelism 181–186
error-prone 170
expected time bound 198–199
FORALL loop 174
Java framework 187–190
java.lang.Thread method 176
join method 177
maps 190–193
material 158–160
mergesort 209–211
message passing 171
Moore’s law 199–201
multithreaded programming 160–161
outcomes 157
pack algorithm 206–207
prefix-sum algorithm 202–206
processors 173, 178–181
quicksort 207–209, 209f
reductions 190–193
sequential programming 160
shared memory 164–170
speedup and parallelism, DAG 196–198
sum method 175, 176, 179
SumThread constructor 175
synchronization 176
threads 164–170
time-slicing 160
work and span, DAG 194–196, 195f, 197f

G

Graphics processing unit (GPU) 
data parallelism 307–308
supercomputers 150, 151

H

Halo region 318–319
Heap data structure 165
Hiding 216
High-performance computing (HPC) community 119, 125–127, 126f
High-performance networks 
Cray supercomputing networks 140–142
Ethernet 140–142
features of 142
HPC vs. traditional networking 144
InfiniBand 140–142
operating system 142, 143f
performance of 142
RDMA 144–146

I

IBM Blue Gene/L 100
Immutable memory 238–239
Inclusive scan 269–270, 274
Independent iterations 308
InfiniBand 140–142, 310
Integrated development environment (IDE) 27
International Parallel and Distributed Processing Symposium (IPDPS) 2
isEmpty method 232

J

Java 
condition variables 225–226
doubleBalance method 225
private field 223
public field 224
synchronized statement 222–223
unusual style 223
Java ForkJoin framework 
classes and methods 188
ForkJoinPool 188
java.util.concurrent package 187
RecursiveAction class 188–189, 190
RecursiveTask 189
Jitter 79
Joining 64–65
join method 42–43, 43f, 49, 50f, 177, 190

K

Karnaugh maps 
adjacency graph 99–100, 100f, 101f
definition 99–100
hypercube 102
mesh and torus 100
ring-based topologies 100
synchronization 101
topology of 4
wraparound connections 101
Kogge-Stone circuit 108f

L

Ladner-Fischer circuit 108f
Latches 112–113
Latency 316, 318
Lightweight kernels 151
Load imbalance 320
Lock 
catch statement 221
concurrency control and 220
finally statement 221
Java 222–226
pseudocode 220, 221
reentrant locks 222
throw statement 221
Logic gates 
buses 95–96
fan-in 90–95
fan-out 90–95
tristate gates 95–96
Loop-based parallelism 308–309
Loop unrolling 306

M

Mandelbrot 4
Mandelbrot set 
explicit threading approach 81
fractal image 72–73, 73f
OpenMP 75–81, 76f, 78f, 80f
sequential program 74, 75f
uses 73–74
Map-reduce strategies 5
Masking 216
Master-worker model 315
Medium access control (MAC) protocols 110
Meld operation 277
Memory 
distributed 310–311, 311f
shared 309–310, 310f
Mergesort 
fork-join parallelism 209–211
Pool/map paradigm 54, 56f, 57f
queue objects 46–47, 46f
spawning process 46–47, 46f
Message passing  See Message passing interface (MPI)
Message passing interface (MPI) 4–5, 311
BSP model 119, 136–138
cluster computing 118–127
collective communication 132–136
definition 118
HPC community 119, 125–127, 126f
internet 118
LAM 119
nodes 119
nonblocking communication 138–140, 141f
nonuniform memory access 120–121
OpenMP 148–150, 150f
Parallel Virtual Machine middleware 119
point-to-point communication 127–132
scalability 122–123
shared memory systems 119–120
SPMD 123–125
supercomputers 150–152, 152t
threads and 150f
Monitors 258
Moore’s law 201
MPI_Recv() 131–132
MPI_Send() 131–132
Multicast 90, 92f
Multiple instruction, multiple data (MIMD) 310
Multiplexers 104–106, 105f, 106f, 107f
multiprocessing module 27, 28–29
Multithreaded prime counter 
C 64f
C++ 64f
Java 65f
nextCand loop 63
Mutex 66, 67
Mutual exclusion 66

N

Networks  See also High-performance networks
cluster computing 117, 140–148
interconnection 88–89, 95–96
and MPI 
Next costume command 15
Node 118, 119, 310
Nonblocking command 20–21, 21f, 22f
Nonblocking communication 138–140, 141f, 313–314
notify() method 253
notifyAll() method 253, 256

O

One-dimensional partitioning 316, 317–318, 317f
OpenMP 308
compared to explicit threading 81
dynamic scheduling 80–81
race condition 77
OR gate 90
Outline 264
Overhead 79

P

Parallel and distributed computing (PDC) 1
CDER 2–3
circuit fan-in and fan-out 4
Conway’s Game of Life 6
course material 6–7
courseware repository 2
early adopter program 2
EduPar workshop series 2
fork-join parallelism 5
MPI 4–5
organization 3
prime counting 4
Python-based computer science course 6
Python programming language 4
Scratch 3–4
shared-memory concurrency control 5
topology 4
Parallelizable loops 262–263
Parallel partitioning 279–281
analysis 281
meld operation 277
permute operation 278–279
Parallel performance 49–52, 50f, 51f
Parallel reduction 
global variables 264–265
n ≤ p 265–267, 266f
n > p 267–269
operator 264
Parallel running times 293–294, 295f, 296f
Pattern recognizer 113
Permute operations 278–279
Pipeline 214, 306–307
Play sound command 20
Point-to-point communication 
and message size 130–131, 131f
sequential processes 127
telephone, classic game of 128–129, 129f
unique addresses 127–128
using buffers for storage 130
Pool/map paradigm, Python 
abstract procedure 52
mergesort 54, 56f, 57f
π Monte Carlo estimation 52, 53f
Riemann sum 54, 55f
Pragmas 60
Prefix-sum algorithm 
definition 202
down pass 204–206, 205f
sum-an-array problem 206
up pass 202–203, 202f, 203f
Prime counting module 4
assignment 71–72
C program 69f
C++ program 69f
definition 61
Java program 69f
load balance 70–71, 70f
multiple thread creation 63, 64f, 65f
privatizing variables 68
race condition 66–68, 66f
sequential code 62–63, 62f
thread joining 64–65
uses 62
Privatizing variables 68
Program counter 165
Public method 167
Python 
anonymous processes 34–36, 35f
child process 43–44, 44f
communicating via queue 40–41, 40f
communication 27
context switches 30–31
course materials 30
data structures 29, 47–48, 47f, 48f
extended communication via queue 41, 42f
goals 26
graphics courses 29
harnessing parallel hardware 27
hole diggers 38–40, 39f
introductory courses 29
join method 42–43, 43f
limited programming ability 28
lock to control printing 36–38, 38f
mergesort 46–47, 46f
multiple child process 45, 45f
multiprocessing module 27, 28–29
outcomes 25
Pool/map paradigm 52–54, 53f, 55f, 56f, 57f
process names 36, 37f
Python 2.x 26–27
requirements 26
security/resource management 28
spawning process 31–34, 31f, 33f, 34f
speedup 48–52
tools 26–27
traditional content 29–30
Python-based computer science course 6
Python 2.x 26–27

Q

Queue 
communication 40–41, 40f
extended communication 41, 42f
Quicksort 
analysis 284–285
fork-join parallelism 207–209, 209f
partitioning in 282, 283f, 284f
segment bits 282
segmented scan 282

R

Race condition, shared-memory concurrency control 
bad interleavings 226, 227–232, 227f
data race 226–227, 227f, 232–237
definition 226
stacks 227–232
Random numbers 48–49, 49f
Rank, MPI 128
Reader/writer locks 249–251
Recursion tree 181, 183f
Reentrant locks 222
Remote direct memory access (RDMA) 
Ethernet 148
main memory and 146–148
queue-pair model 144–146, 147f
send-recv model 146
uses 144
write/read 146
Riemann sum 54, 55f
Ripple-carry adder 102, 102f
Ripple counters 109, 109f

S

Scaling 
strong 317
weak 317
Scan operation 
exclusive scan 269–270
inclusive scan 269–270, 274
n ≤ p 270–272, 270f
n > p 272–274
op function 270
Scratch 
advantage of 3–4
blocking command 20–21, 21f, 22f
cartoonish characters and photographs 13
communication, clean solutions 14–17, 15f, 16f, 17f
computer literacy course 12
context of 12–13
CS1 13
dragging blocks 13
nonblocking command 20–21, 21f, 22f
outcomes 11
PDC 11–12
private variable 21–23, 22f
race conditions 17–20, 19f
shared variable 21–23, 22f
toy language 13–14
Segmented copy-scans 289–290
Segmented inclusive scans 289
Segmented reductions 290–292
Segmented scan 282, 285–289
Sequential implementation 302
Sequential vs. parallel performance 49–52, 50f, 51f
Sequential running times 293–294, 295f, 296f
Shared-memory concurrency control 5
BankAccount class 215
cooks 214–215
deadlock 245–248
debug 215
failure/performance isolation 216
immutable memory 238–239
mutual exclusion (See Lock)
outcomes 213
parallelism 216
pipeline 214
processor utilization 216
race condition 226–237
responsiveness 216
synchronization 215–226, 240–245, 248–259
thread 215
thread-local memory 237–238
Shared memory parallelism 309, 310f
Shared memory systems 119–120
Shift registers 110–112, 110f, 111f
Single program, multiple data (SPMD) 310
advantage of 125
applications 125
data parallelism 304
definition 123
extreme scales 124
failures 125
STAT 124
synchronization 123–124
Spawning process, Python 
mergesort 46–47, 46f
multiple 32–34, 33f
Pool and Pool.map mechanism 34, 34f
single process 31–32
Sprite 14–15
Stacks 
array 229
index 229
intermediate states 227–228
myPeekHelperWrong causes 230, 231f
pop operation 228, 231
push operation 228, 230
StackEmptyException 231
synchronized keyword 229
Stack trace analysis tool (STAT) 124
start method 32, 35–36, 167, 185
Static scheduling 80
Streaming SIMD Extensions (SSE) instructions 304
StringBuffer class 248
Strong scaling 317
Subdomain 316–317
sum method 175, 176, 179
Supercomputers 150–152, 152t
Synchronization 215–216
A-B-A problem 243–244
assignment statements 242
atomic operations 244
BankAccount class 217
barriers 258
coarse-grained locking 241–242
ConcurrentHashMap class 244, 245
condition variables 251–258
data races 240
fine-grained locking 241
Hashtable class 244
locking 240
monitors 258
mutual exclusion 218–219
possible interleavings 216–217
primitives 220
race conditions 242
reader/writer locks 249–251
semaphores 258
stale information 218
thread scheduler 217
vertical timeline 217–218
while loop 219
System.out.println method 168

T

Think command 20
Threads 
DAGs 99
MPI 150f
OpenMP 60
shared-memory concurrency control 215
Threads and shared memory 170f
aliasing 170
call stack 165
constructor 168
definition 164
heap data structure 165
interference 166
multithreaded program 166f
new thread creation 167, 169f
nondeterministic 167
program counter 165
scheduling 164
sequential program 164–165, 165f
throw statement 221
Tightly coupled systems 122–123
Time-slicing 160
Toy language 13–14
transferTo method 245, 246, 247, 248
Tristate gates 95–96

U

Unary number vs. binary number 88–89

V

Vector pipelining 306–307
Verilog 113
volatile keyword 235

W

Weak scaling 317
Whack-a-mole game 22, 22f
While loops 219, 254