Index
Note: Page numbers followed by f indicate figures and t indicate tables.
A
Accelerated processing unit (APU)
architecture
205
GPU and CPU cores
63–64
integration, SoCs
53–54
Intel's Sandy Bridge design
63, 65f
AMD APP Profiler
application trace
238
GPU kernel performance counters
247
performance analysis
238
AMD APP KernelAnalyzer
243–244
AMD “Bobcat” CPU core
AMD “Bulldozer”
AMD E350 “Zacate”
63, 64f
AMD “Istanbul”
AMD Phenom II
AMD Radeon HD6970
Applications programming interface (API)
defined
15
description
16
Win32 and POSIX
17–18
B
Buffers
array access syntax
110
creation flags
111
error codes
111–112
reading and writing
113
C
Caching data
global memory to local memory copying
155–156
Cell processor
55
Central processing units (CPUs)
Intel Itanium
2, 59
low-power
AMD “Bobcat” CPU core
56
ARM ISA
56
Intel Atom
56–58
power/performance
58
mainstream desktop
AMD “Bulldozer”
58–59
AMD “Istanbul” CPU
214, 217f
AMD Phenom II
58
Intel Sandy Bridge
58
multi-core
42
Command queues
communication
22–23
flush and finish commands
26
host–device interaction
23
multiple-queue scenarios
99
Concurrent runtime (ConcRT)
214–215
Microsoft concurrency runtime (ConcRT)
214
Threading building blocks (TBB)
214–215
Constant memory (
Contexts
22
Convolution (
CPU/GPU OpenCL implementation
AMD Phenom II X6
architecture, mapping
123, 125f
CPU core
123–124
design
123, 124f
memory mapping
127, 128f
worker thread processes
124, 125–127, 126f
128-bit vector
127
AMD Radeon HD6970
architecture, core
135–136, 135f
clause-based SIMD execution
ALU operations
135
ISA code and wavefronts
135
software control
132–135
texture clause
137
VLIW packets
136, 136f
hardware
synchronization support
128–129
workload schedule
129–130, 129f
performance
130
queuing mechanism
130
resource allocation
registers and LDS storage space
137
SIMD core, OpenCL workloads
137–138, 138f
threading and memory system
132
high-bandwidth GDDR5
132
memory hierarchy
130–132, 131f
PCI express bus
132
D
Data sharing and synchronization
barriers/locks, primitives
11
concurrent program
11
Debugging
AMD printf extension
252–253
developer's view
249
gDEBugger
API-level debugging
250–251
kernel debugging
251–252
parallel programs
249
E
Event chaining
OpenCL
command queue model
180–181
runtime
181
video frame buffer
181
Events
command sequence
96–99, 97f
defined
23
devices, command queue
99
error conditions
103
event information
103
execution models
100
multiple command queues
99–100, 99f
out-of-order queue
96
parallel multidevice
100, 101f, 102–103
read buffer operation
103–104
task graph creation
96–99
wait list orders execution
100–102
Example applications
Convolution (buffers)
code listings
host code
162–166
kernel code
166–171
performance
code size
161
loop unrolling
160–161
memory optimization
161
workgroup size selection
151–154
Convolution (images)
compilation and execution
82
description
blurring and vertical edge-detecting filters
77–78, 79f
blurring kernel
77
kernel
83–84
sampler creation
82
serial code
78
Histogram
description
185
global memory data access patterns
187–189
global reduction operation
186f
kernel code
193–196
local memory access
binning operations
190
hardware banks
191, 191f
reduction operation
192f
SIMD vectors
190
reduction
bank conflicts
192–193
global
193
using atomics
AMD Radeon HD6970 architecture
190
memory pressure
189
SIMD vector
189
workgroups
global memory transactions
186
local memory
185–186
pixels mapping
186
size, optimal
186–187
Image rotation
description
73
buffer declaration and data movement
76
C++ bindings
75, 76
coordinates and equations
74
input decomposition
74–75
program execution
77
runtime kernel compilation
76–77
Matrix multiplication
buffer declaration and data movement
70–71
implementation steps
69, 70f
loops
67–68
program execution
72
runtime kernel compilation
72
Mixed particle simulation
computation
collisions, particles
197–198
data and physical properties
199
GPU vs. CPU
198
screen shot
197–198
small–small collisions
199
CPU implementation
202–203
data-parallel devices
197
description
197
GPU implementation
acceleration structure
201
buffer creation
200–201
computing collisions
201–202
integration
202
kernels
205–209
load balancing
CPU's thread workload
203–204
pool imbalance
203
structure
204f
performance
204–205
uniform grid creation kernel
205–206
Vector addition
32–38
Web photo editor
editing image
265
painting mode
265, 266f
prototype
264, 265f
time spent, WebGL shaders and kernels
265
touch screen optimization
264
WebGL and WebCL, image processing
operators
264–265
Execution model
devices
88
kernels
17, 87
queuing and synchronization
callbacks, event
104–108
command barriers and markers
108–109
events
96–104
finish operation
95–96
memory consistency
96
thread-safe command queues
94
synchronization
90–95, 91f
work-items
17, 87
workgroups
87, 88
wavefront and warp
88
Extensions
types
211
EXT extension
211
F
Firefox extension
layers implementation
258, 258f
sources modification and NPAPI
257
XUL and XML user interface
257–258
G
Global memory
see also Memory model
access alignment
64 and 128-byte segments
156
data array padding
157–158
GT200 series architecture
158
global data access
AMD GPUs and CPUs
187
arithmetic operations
187
efficient access pattern, reading
188–189, 188f
nonparallel work items
188f
read converge
189
serial coalesced trade-off
189f
global performance
access memory
139
analyzing performance
139
bandwidth measurement
139
128-bit float4 accesses
140–141
efficiency loss
140–141, 141f
modern CPUs, vector instruction
139–140
multiple wavefronts
142
gDEBugger
API-level debugging and function
250–251
components
249
interaction
249–250, 250f
kernel debugging
251–252
OpenCL performance and memory consumption
249
Graphic processing units (GPUs)
described
60–61
handheld
61
high-end desktop
AMD Radeon HD6970
cores
62
lanewide SIMD model
61–62
NVIDIA GTX580
61, 62f
scratchpad memory buffer
62–63
SIMD arrays and threads
61
vs. CPU designs
63
integration
202
kernel performance counters
242–243
H
Hardware trade-offs
cache hierarchies and memory systems
access patterns
54
GPUs and cell processor
55
latency
54–55
cores
42
graphics APIs and pixel shaders
41
heterogeneity
42
multi-core architectures
AMD Istanbul CPU
214, 217f
AMD Phenom II
58
AMD Radeon HD6970 GPU architecture
52, 53f
AMD's “Bulldozer” and “Bobcat” designs
51–52, 52f
cloning, single core
51
Intel Sandy Bridge
58
Intel Itanium
2, 59
multi-core CPUs
42
multithreading
48–51
performance enhancement
clock frequency
44
CMOS dynamic power consumption
43
parallel coding
43
voltage and frequency
43
SIMD and vector processing
47–48
SoC and APU
53–54
superscalar execution
44, 45f
I
Images
channel order and type
114
data
113
multidimensional data structure
114
objects
109–110
runtime system and hardware
114
scalar and vector reads
114–115
transformations
114
vs. buffers
113
Z-order/Morton order memory layouts
114, 115f
Image rotation
Intel Sandy Bridge
58
Interoperability with OpenGL
J
JavaScript and OpenCL
characteristics
258–259
CL program
262
garbage collection
258
kernel arguments
262
memory objects
262, 263
resources and contexts
258
void pointer
259
WebCLDataObject use
263–264
K
Kernels
arguments
107
debugging
251–252
enqueue function
28
extraction
27
local memory allocations
31
mapping
31
performance counters
242–243
simulation
206–209
textual scope
31
uniform grid creation
205–206
KHR extension
211
L
Local data shares (LDS)
see also Local memory
allocation
137–138
availability
138f
memory latency
137–138
read/write access
132
Local memory
see also Memory model
data access
binning operations
190
hardware banks
191, 191f
reduction operation
192f
SIMD vectors
190
performance
access phase
145, 145f
balance efficiency
148
behavior
146f, 147
code loads
143
data caches
142
data structures
147–148, 147f
HD6970 memory system and SIMD cores
143f
images map data
142–143
prefix sum
146f, 147
read/write operations
144–145
trade-offs
143
VLIW packet
145–146
M
Matrix multiplication
Memory model
device-side memory model
constant memory
121–122
global memory
117–119
local memory
119–121
private memory
122
relaxed consistency
116–117
host-side memory model
buffers
110–113
images
113–115
Memory objects
buffers
24–25
definition
23–24
images
25–26
Message-passing communication
MPI
10
physical/arbitrary device
9–10
Message passing interface (MPI)
10
Mixed particle simulation
Multithreading
Cray/Tera MTA and XMT designs
51
extraction, instruction parallelism
48
SMT
49, 49f
time-sliced version
49, 50f
types
48–49
N
O
Open computing language (OpenCL)
compilation
AMD's implementation
84
dynamic libraries
85
Linux
85
device architectures
block-based parallelism model
41
design space
APU and APU-like designs
63–64
CPU designs
56–60
CPUs vs. GPUs
55
state storage and ALUs
55–56, 57f
hardware trade-offs
cache hierarchies and memory systems
54–55
cores
42
graphics APIs and pixel shaders
41
heterogeneity
42
multi-core architectures
51–52
multi-core CPUs
42
multithreading
48–51
performance increase, frequency
43–44
SIMD and vector processing
47–48
superscalar execution
44
device fission, extensions
class Parallel
214–217
command queues
218, 220f
creation, subdevices
218
exported extension
217
funcWrapper
221
implementation
221–225
subdevice partition properties
218, 219t
double precision, extensions
C++ Wrapper API
227–228
data types
225–226
floating point formats
225
matrix multiplication implementation
226–227
execution environment
command queues
22–23
contexts
22
events
23
flush and finish command
26
memory objects
23–26
execution model
CPU concurrency models
16–17
data-parallel execution
17–18
hierarchical concurrency model
18
memory structures
16
NDRange
18, 19f
framework, heterogeneous programming
1–2
interoperability
image2D memory object
183
memory objects
182
queue object
183–184
synchronization
183–184
texture
182–183
kernels
compiling kernels
33, 36
enqueue function
28
extraction
27
local memory allocations
31
mapping
31
textual scope
31
querying, platform and device
212
memory model
parallel computing
2
platform independence
41
platform vendors
2
profiling API
228–233
program object creation
scope and applicability
2
specification, models
defined
15–16
execution
15
memory
16
parallel execution
16
platform
16
programming
16
standard
15
writing kernels
P
Parallel programming
array multiplication
5f
computing, definition
4
and concurrency
assignments
7–8
data sharing and synchronization
11
message-passing communication
9–10
parallelism grains
10–11
receiving and processing input
7
subsets, program
8, 8f
threads and shared memory
9
control and data intensive
1
data and task-level parallelism
5–6
divide-and-conquer methods
2–3
elements multiplication
4
goals
2
GPUs
3–4
heterogeneity
1
image filtration, FFT
5f
multiple processors
3
OpenCL
1–2
parallelism and concurrency, classes
4
reduction concept
6
simple sorting and vector–scalar multiply
3f
structure
11–12
Parallelism grains
chunk size selection
10–11
coarse-grained
10
computation ratio
10–11
fine-grained
10
Pragma directive
213–214
Private memory
Profiling events
command's queues and status
236
enabling
236
information
237
kernel execution
237–238
valid values enumeration
236–237
Programs
binary representation
27
build process features
27
dynamic library interface
26–27
kernels
26
runtime compilation
26–27
Q
Queuing and global synchronization
callbacks, event
104–107
command barriers and markers
enqueueMarker
109
event list
108
finish operation
95–96
synchronization
109
task graphs
109
events
memory consistency
commands
96
data
96
runtime
96
primary points
94–95
thread-safe command queues
94
R
RGBA format
180
S
Simultaneous multithreading (SMT)
49, 49f
Single instruction multiple data (SIMD)
ALU operations
135
architecture
135–136, 135f
cores
52, 53f
ISA code and wavefronts
135
OpenCL workloads
137–138, 138f
software control
132–135
texture clause
137
threads
61
vector processing
CPUs and GPUs
48
execution advantage
48
parallelism
47
VLIW packets
136, 136f
STI Cell Processor
54
Sun Niagara design
59–60
Superscalar execution
44
Systems-on-chip (SoC)
APU
benefits
53–54
cell broadband engine processor
54
multi-core design
53
fused processors
63
T
Threading building blocks (TBB)
214–215
Threads and shared memory
consistency model, defined
9
definition
9
global view
9
OpenCL
9
shared bus
9
Throughput computing
50–51, 51f
V
Vector addition
Vendor extensions
211
Very long instruction word (VLIW)
ALU packing
160–161
architectures
42
description
44–45
designs
47
double precision and integer operations
136
DSP chips
47
efficiency losses
46
four-way
20
low-level shader compiler
61–62
method
44–45
out-of-order execution
45–46, 45f, 46f
packet stream
45–46
SIMD lane
53f
Video processing
CPU decoding, frame
callback methodology
174–175
framework and request
174
heterogeneous computing
173–174
multithreading
175
VLC project
174
display to screen
double-buffered texture
181
OpenCL/OpenGL interoperability
181–183
GPU decoding, frames
OpenVideo
176–179
power consumption
175–176
processing
176, 176f
multiple videos with multiple special effects
chain effects
180
event chaining
180–181
OpenCL
features
179
kernel, parameters
179–180
RGBA format
180
W
Web applications
client-side
255–256
JavaScript code
260
visual and interaction
255
WebCL
advantages, World Wide Web
255
browser environment
255
client-side web applications
255–256
framework designing
development
256
device capabilities
256
error management
256, 257t
goal
256
requirement, Web usage
256
garbage collector
267
HTML and JavaScript
267–268
OpenCL
color image, gray scale conversion
260
ctx object function
261–262
error reporting/handling
260, 261
input and the output buffers
261
JavaScript vs. C
260
kernel source
260
property array
260
syntax
260–264
performance
266–267, 267t
photo editor
264–265
pilot implementation
Firefox
257–258
JavaScript and OpenCL
258–259
portable devices
268
web environment
268
Web photo editor
Workgroups
definition
88
histogram
global memory transactions
186
local memory
185–186
pixels mapping
186
size, optimal
186–187
size selection
aligning data
158
caching data to local memory
152, 154–159
efficiency, vector reading
158–159
filter size
151–152
input-output efficiency
151–152
memory access aligning
156–158
optimization approach
153
out-of-bounds access
153–154, 154f
Work-items
17, 87