Table of Contents for Chapter 13. WebCL

AMD “Bulldozer”

AMD E350 “Zacate”

see Accelerated processing unit

AMD E350 “Zacate”

63, 64f

AMD “Istanbul”

AMD Phenom II

AMD Radeon HD6970

see Graphics processing units

Application programming interface (API)

defined

description

Win32 and POSIX

17–18

APU

see Accelerated processing unit

Buffers

array access syntax

creation flags

error codes

reading and writing

Caching data

global memory to local memory copying

155–156

Cell processor

see also Hardware trade-offs

Central processing units (CPUs)

see also CPU/GPU OpenCL implementation

Intel Itanium

2, 59

low-power

AMD “Bobcat” CPU core

ARM ISA

Intel Atom

56–58

power/performance

mainstream desktop

AMD “Bulldozer”

58–59

AMD “Istanbul” CPU

214, 217f

AMD Phenom II

Intel Sandy Bridge

multi-core

Command queues

communication

22–23

flush and finish commands

host–device interaction

multiple-queue scenarios

Concurrent runtime (ConcRT)

214–215

Microsoft concurrency runtime (ConcRT)

214

Threading building blocks (TBB)

214–215

Constant memory (see Memory model)

Contexts

Convolution (see Example applications)

CPU/GPU OpenCL implementation

AMD Phenom II X6

architecture, mapping

123, 125f

CPU core

123–124

design

123, 124f

memory mapping

127, 128f

worker thread processes

124, 125–127, 126f

128-bit vector

127

AMD Radeon HD6970

architecture, core

135–136, 135f

clause-based SIMD execution

ALU operations

135

ISA code and wavefronts

135

software control

132–135

texture clause

137

VLIW packets

136, 136f

hardware

synchronization support

workload schedule

performance

queuing mechanism

resource allocation

registers and LDS storage space

137

SIMD core, OpenCL workloads

137–138, 138f

threading and memory system

high-bandwidth GDDR5

memory hierarchy

PCI express bus

Data sharing and synchronization

barriers/locks, primitives

concurrent program

Debugging

AMD printf extension

252–253

developer's view

249

gDEBugger

API-level debugging

250–251

kernel debugging

251–252

parallel programs

249

Event chaining

OpenCL

command queue model

180–181

runtime

181

video frame buffer

181

Events

code

96–99

command sequence

96–99, 97f

defined

devices, command queue

error conditions

103

event information

103

execution models

100

multiple command queues

99–100, 99f

out-of-order queue

parallel multidevice

100, 101f, 102–103

read buffer operation

103–104

task graph creation

96–99

user

103–104

wait list orders execution

100–102

Example applications

Convolution (buffers)

code listings

host code

162–166

kernel code

166–171

performance

code size

161

loop unrolling

160–161

memory optimization

161

workgroup size selection

151–154

Convolution (images)

compilation and execution

description

blurring and vertical edge-detecting filters

77–78, 79f

blurring kernel

kernel

83–84

sampler creation

serial code

Histogram

description

185

global memory data access patterns

187–189

global reduction operation

186f

kernel code

193–196

local memory access

binning operations

hardware banks

191, 191f

reduction operation

192f

SIMD vectors

reduction

bank conflicts

192–193

bins

192–193

global

193

using atomics

AMD Radeon HD6970 architecture

memory pressure

189

SIMD vector

189

workgroups

global memory transactions

local memory

pixels mapping

size, optimal

Image rotation

description

buffer declaration and data movement

C++ bindings

75, 76

coordinates and equations

input decomposition

74–75

program execution

runtime kernel compilation

76–77

Matrix multiplication

buffer declaration and data movement

70–71

implementation steps

69, 70f

loops

67–68

program execution

runtime kernel compilation

Mixed particle simulation

computation

collisions, particles

197–198

data and physical properties

199

GPU vs. CPU

198

screen shot

197–198

small–small collisions

199

CPU implementation

202–203

data-parallel devices

197

description

197

GPU implementation

acceleration structure

buffer creation

computing collisions

integration

kernels

load balancing

CPU's thread workload

pool imbalance

structure

performance

uniform grid creation kernel

205–206

Vector addition

32–38

Web photo editor

editing image

265

painting mode

265, 266f

prototype

264, 265f

time spent, WebGL shaders and kernels

265

touch screen optimization

264

WebGL and WebCL, image processing

operators

264–265

Execution model

devices

kernels

17, 87

queuing and synchronization

callbacks, event

104–108

command barriers and markers

108–109

events

96–104

finish operation

95–96

memory consistency

thread-safe command queues

synchronization

90–95, 91f

work-items

17, 87

workgroups

87, 88

wavefront and warp

Extensions

see also Open computing language

types

EXT extension

Firefox extension

layers implementation

258, 258f

sources modification and NPAPI

257

XUL and XML user interface

257–258

Global memory

see also Memory model

access alignment

64 and 128-byte segments

156

data array padding

157–158

GT200 series architecture

158

global data access

AMD GPUs and CPUs

187

arithmetic operations

187

efficient access pattern, reading

188–189, 188f

nonparallel work items

188f

read converge

189

serial coalesced trade-off

189f

global performance

access memory

139

analyzing performance

139

bandwidth measurement

139

128-bit float4 accesses

140–141

bits

141–142, 142f

efficiency loss

140–141, 141f

modern CPUs, vector instruction

139–140

multiple wavefronts

142

gDEBugger

API-level debugging and function

components

interaction

kernel debugging

OpenCL performance and memory consumption

249

Graphics processing units (GPUs)

see also CPU/GPU OpenCL implementation

described

60–61

handheld

high-end desktop

AMD Radeon HD6970

cores

lanewide SIMD model

61–62

NVIDIA GTX580

61, 62f

scratchpad memory buffer

62–63

SIMD arrays and threads

vs. CPU designs

integration

202

kernel performance counters

242–243

Hardware trade-offs

cache hierarchies and memory systems

access patterns

GPUs and cell processor

latency

54–55

cores

graphics APIs and pixel shaders

heterogeneity

multi-core architectures

AMD Istanbul CPU

214, 217f

AMD Phenom II

AMD Radeon HD6970 GPU architecture

52, 53f

AMD's “Bulldozer” and “Bobcat” designs

51–52, 52f

cloning, single core

Intel Sandy Bridge

Intel Itanium

2, 59

multi-core CPUs

multithreading

48–51

performance enhancement

clock frequency

CMOS dynamic power consumption

parallel coding

voltage and frequency

SIMD and vector processing

47–48

SoC and APU

53–54

superscalar execution

44, 45f

VLIW

44–47

Histogram

Images

channel order and type

data

113

multidimensional data structure

objects

109–110

runtime system and hardware

scalar and vector reads

114–115

transformations

vs. buffers

113

Z-order/Morton order memory layouts

114, 115f

Image rotation

Intel Itanium

2, 59

Intel Sandy Bridge

58

Interoperability with OpenGL

see Video processing

JavaScript and OpenCL

characteristics

CL program

garbage collection

kernel arguments

memory objects

262, 263

resources and contexts

258

void pointer

259

WebCLDataObject use

263–264

Kernels

arguments

107

debugging

251–252

enqueue function

extraction

local memory allocations

mapping

performance counters

242–243

simulation

206–209

textual scope

uniform grid creation

205–206

KHR extension

Local data shares (LDS)

see also Local memory

allocation

availability

memory latency

read/write access

SIMD

Local memory

see also Memory model

data access

binning operations

hardware banks

191, 191f

reduction operation

192f

SIMD vectors

performance

access phase

145, 145f

balance efficiency

148

behavior

146f, 147

code loads

143

data caches

142

data structures

147–148, 147f

HD6970 memory system and SIMD cores

143f

images map data

142–143

prefix sum

146f, 147

read/write operations

144–145

trade-offs

143

VLIW packet

145–146

Matrix multiplication

Memory model

device-side memory model

constant memory

global memory

local memory

private memory

relaxed consistency

host-side memory model

buffers

110–113

images

113–115

Memory objects

buffers

24–25

definition

23–24

images

25–26

Message-passing communication

MPI

physical/arbitrary device

9–10

Message passing interface (MPI)

Mixed particle simulation

MPI

see Message passing interface

Multithreading

Cray/Tera MTA and XMT designs

extraction, instruction parallelism

SMT

49, 49f

time-sliced version

49, 50f

types

48–49

NVIDIA GTX580

see Graphics processing units

OpenCL

see Open computing language

Open computing language (OpenCL)

compilation

AMD's implementation

dynamic libraries

Linux

device architectures

block-based parallelism model

design space

APU and APU-like designs

63–64

CPU designs

56–60

CPUs vs. GPUs

GPU

60–63

state storage and ALUs

55–56, 57f

hardware trade-offs

cache hierarchies and memory systems

54–55

cores

graphics APIs and pixel shaders

heterogeneity

multi-core architectures

51–52

multi-core CPUs

multithreading

48–51

performance increase, frequency

43–44

SIMD and vector processing

47–48

superscalar execution

VLIW

44–47

device fission, extensions

class Parallel

214–217

command queues

218, 220f

creation, subdevices

exported extension

funcWrapper

implementation

subdevice partition properties

218, 219t

double precision, extensions

C++ Wrapper API

227–228

data types

225–226

floating point formats

225

matrix multiplication implementation

226–227

execution environment

command queues

22–23

contexts

events

flush and finish command

memory objects

23–26

execution model

CPU concurrency models

16–17

data-parallel execution

17–18

hierarchical concurrency model

memory structures

NDRange

18, 19f

framework, heterogeneous programming

1–2

interoperability

image2D memory object

memory objects

queue object

synchronization

texture

kernels

compiling kernels

33, 36

enqueue function

extraction

local memory allocations

mapping

textual scope

querying, platform and device

212

memory model

see Memory model

parallel computing

platform independence

platform vendors

profiling API

228–233

program object creation

see Programs

scope and applicability

specification, models

defined

15–16

execution

memory

parallel execution

platform

programming

standard

writing kernels

Parallel programming

array multiplication

computing, definition

and concurrency

assignments

7–8

data sharing and synchronization

message-passing communication

9–10

parallelism grains

10–11

receiving and processing input

subsets, program

8, 8f

threads and shared memory

control and data intensive

data and task-level parallelism

5–6

divide-and-conquer methods

2–3

elements multiplication

goals

GPUs

3–4

heterogeneity

image filtration, FFT

multiple processors

OpenCL

1–2

parallelism and concurrency, classes

reduction concept

simple sorting and vector–scalar multiply

structure

11–12

Parallelism grains

chunk size selection

10–11

coarse-grained

computation ratio

10–11

fine-grained

Pragma directive

213–214

Private memory

see Memory model

Profiling events

command's queues and status

enabling

information

kernel execution

valid values enumeration

236–237

Programs

binary representation

build process features

dynamic library interface

26–27

kernels

runtime compilation

26–27

Queuing and global synchronization

callbacks, event

104–107

command barriers and markers

enqueueMarker

event list

finish operation

synchronization

task graphs

events

see Events

memory consistency

commands

data

runtime

primary points

94–95

thread-safe command queues

RGBA format

180

see OpenCL Images

SIMD

see Single instruction multiple data

Simultaneous multithreading (SMT)

49, 49f

Single instruction multiple data (SIMD)

ALU operations

135

architecture

135–136, 135f

cores

52, 53f

ISA code and wavefronts

OpenCL workloads

software control

texture clause

threads

vector processing

CPUs and GPUs

execution advantage

parallelism

VLIW packets

136, 136f

SMT

see Simultaneous multithreading

SoC

see Systems-on-chip

STI Cell Processor

Sun Niagara design

59–60

Superscalar execution

Systems-on-chip (SoC)

APU

benefits

53–54

cell broadband engine processor

multi-core design

fused processors

Threading building blocks (TBB)

214–215

Threads and shared memory

consistency model, defined

definition

global view

OpenCL

shared bus

Throughput computing

50–51, 51f

see Multithreading

Vector addition

Vendor extensions

Very long instruction word (VLIW)

ALU packing

160–161

architectures

description

44–45

designs

double precision and integer operations

136

DSP chips

efficiency losses

four-way

low-level shader compiler

61–62

method

44–45

out-of-order execution

45–46, 45f, 46f

packet stream

45–46

SIMD lane

53f

Video processing

CPU decoding, frame

callback methodology

174–175

framework and request

174

heterogeneous computing

173–174

multithreading

175

VLC project

174

display to screen

double-buffered texture

181

OpenCL/OpenGL interoperability

181–183

GPU decoding, frames

OpenVideo

176–179

power consumption

175–176

processing

176, 176f

multiple videos with multiple special effects

chain effects

180

event chaining

180–181

OpenCL

features

179

kernel, parameters

179–180

RGBA format

180

VLIW

see Very long instruction word

Web applications

client-side

255–256

JavaScript code

260

visual and interaction

255

WebCL

advantages, World Wide Web

255

browser environment

255

client-side web applications

255–256

framework designing

development

device capabilities

error management

256, 257t

goal

requirement, Web usage

garbage collector

267

HTML and JavaScript

267–268

OpenCL

color image, gray scale conversion

260

ctx object function

261–262

error reporting/handling

260, 261

input and the output buffers

JavaScript vs. C

kernel source

property array

syntax

performance

photo editor

pilot implementation

Firefox

257–258

JavaScript and OpenCL

258–259

portable devices

268

web environment

268

Web photo editor