Index
Note: Page numbers followed by f indicate figures and t indicate tables.
A
Accelerated processing unit (APU)
Intel's Sandy Bridge design, 63, 65f
AMD APP Profiler
GPU kernel performance counters, 247
Applications programming interface (API)
C
Caching data
global memory to local memory copying, 155–156
Central processing units (CPUs).
Command queues
flush and finish commands, 26
host–device interaction, 23
multiple-queue scenarios, 99
Concurrent runtime (ConcRT), 214–215
Microsoft concurrency runtime (ConcRT), 214
Threading building blocks (TBB), 214–215
CPU/GPU OpenCL implementation
AMD Radeon HD6970
clause-based SIMD execution
ISA code and wavefronts, 135
resource allocation
registers and LDS storage space, 137
threading and memory system, 132
D
Data sharing and synchronization
barriers/locks, primitives, 11
E
Example applications
Convolution (images)
compilation and execution, 82
description
blurring and vertical edge-detecting filters, 77–78, 79f
Histogram
global memory data access patterns, 187–189
global reduction operation, 186f
using atomics
AMD Radeon HD6970 architecture, 190
workgroups
global memory transactions, 186
Image rotation
buffer declaration and data movement, 76
coordinates and equations, 74
runtime kernel compilation, 76–77
Matrix multiplication
buffer declaration and data movement, 70–71
implementation steps, 69, 70f
runtime kernel compilation, 72
Mixed particle simulation
computation
data and physical properties, 199
small–small collisions, 199
GPU implementation
acceleration structure, 201
uniform grid creation kernel, 205–206
Web photo editor
time spent, WebGL shaders and kernels, 265
touch screen optimization, 264
WebGL and WebCL, image processing
Execution model
queuing and synchronization
command barriers and markers, 108–109
thread-safe command queues, 94
F
Firefox extension
sources modification and NPAPI, 257
G
Global memory
access alignment
64- and 128-byte segments, 156
GT200 series architecture, 158
global data access
nonparallel work items, 188f
serial coalesced trade-off, 189f
global performance
modern CPUs, vector instruction, 139–140
gDEBugger
API-level debugging and function, 250–251
OpenCL performance and memory consumption, 249
Graphics processing units (GPUs)
high-end desktop
scratchpad memory buffer, 62–63
SIMD arrays and threads, 61
H
Hardware trade-offs
cache hierarchies and memory systems
GPUs and Cell processor, 55
graphics APIs and pixel shaders, 41
multi-core architectures
AMD Radeon HD6970 GPU architecture, 52, 53f
AMD's “Bulldozer” and “Bobcat” designs, 51–52, 52f
performance enhancement
CMOS dynamic power consumption, 43
SIMD and vector processing, 47–48
superscalar execution, 44, 45f
I
Images
channel order and type, 114
multidimensional data structure, 114
runtime system and hardware, 114
Z-order/Morton order memory layouts, 114, 115f
Interoperability with OpenGL
J
JavaScript and OpenCL
resources and contexts, 258
K
Kernels
local memory allocations, 31
L
Local memory
performance
HD6970 memory system and SIMD cores, 143f
M
Message-passing communication
physical/arbitrary device, 9–10
Message passing interface (MPI), 10
Mixed particle simulation
Multithreading
Cray/Tera MTA and XMT designs, 51
extraction, instruction parallelism, 48
time-sliced version, 49, 50f
O
Open computing language (OpenCL)
device architectures
block-based parallelism model, 41
design space
APU and APU-like designs, 63–64
hardware trade-offs
cache hierarchies and memory systems, 54–55
graphics APIs and pixel shaders, 41
multi-core architectures, 51–52
performance increase, frequency, 43–44
SIMD and vector processing, 47–48
device fission, extensions
subdevice partition properties, 218, 219t
double precision, extensions
floating point formats, 225
matrix multiplication implementation, 226–227
execution environment
flush and finish command, 26
execution model
CPU concurrency models, 16–17
data-parallel execution, 17–18
hierarchical concurrency model, 18
framework, heterogeneous programming, 1–2
local memory allocations, 31
querying, platform and device, 212
P
Parallel programming
and concurrency
data sharing and synchronization, 11
message-passing communication, 9–10
receiving and processing input
threads and shared memory
control and data intensive
data and task-level parallelism, 5–6
divide-and-conquer methods, 2–3
parallelism and concurrency, classes
simple sorting and vector–scalar multiply, 3f
Parallelism grains
chunk size selection, 10–11
Profiling events
command's queues and status, 236
Programs
dynamic library interface, 26–27
Q
Queuing and global synchronization
command barriers and markers
thread-safe command queues, 94
S
Simultaneous multithreading (SMT), 49, 49f
Single instruction multiple data (SIMD)
ISA code and wavefronts, 135
Systems-on-chip (SoC)
APU
Cell Broadband Engine processor, 54
T
Threading building blocks (TBB), 214–215
Threads and shared memory
consistency model, defined
V
Very long instruction word (VLIW)
double precision and integer operations, 136
low-level shader compiler, 61–62
Video processing
display to screen
double-buffered texture, 181
OpenCL/OpenGL interoperability, 181–183
multiple videos with multiple special effects
W
Web applications
visual and interaction, 255
WebCL
advantages, World Wide Web, 255
client-side web applications, 255–256
framework designing
requirement, Web usage, 256
OpenCL
color image, gray scale conversion, 260
error reporting/handling, 260, 261
input and the output buffers, 261
Workgroups
histogram
global memory transactions, 186