

add() function, CPU vector sums, 40–44

add_to_table() kernel, GPU hash table, 272

ALUs (arithmetic logic units)

CUDA Architecture, 7

using constant memory, 96

anim_and_exit() method, GPU ripples, 70

anim_gpu() routine, texture memory, 123, 129


GPU Julia Set example, 50–57

GPU ripple using threads, 69–74

heat transfer simulation, 121–125

animExit(), 149

asynchronous call

cudaMemcpyAsync() as, 197

using events with, 109

atomic locks

GPU hash table, 274–275

overview of, 251–254


atomic locks, 251–254

histogram kernel using global memory, 180

not supporting floating-point numbers, 251

atomicCAS(), GPU lock, 252–253

atomicExch(), GPU lock, 253–254

atomics, 163–184

advanced, 249–277

compute capability of NVIDIA GPUs, 164–167

dot product and, 248–251

hash tables. see hash tables

histogram computation, CPU, 171–173

histogram computation, GPU, 173–179

histogram computation, overview, 170

histogram kernel using global memory atomics, 179–181

histogram kernel using shared/global memory atomics, 181–183

for minimum compute capability, 167–168

locks, 251–254

operations, 168–170

overview of, 163–164, 249

summary review, 183–184, 277


bandwidth, constant memory saving, 106–107

Basic Linear Algebra Subprograms (BLAS), CUBLAS library, 239–240

bin counts, CPU histogram computation, 171–173

BLAS (Basic Linear Algebra Subprograms), CUBLAS library, 239–240


2D texture memory, 131–133

texture memory, 127–129

blockDim variable

2D texture memory, 132–133

dot product computation, 76–78, 85

dot product computation, incorrect optimization, 88

dot product computation with atomic locks, 255–256

dot product computation, zero-copy memory, 221–222

GPU hash table implementation, 272

GPU ripple using threads, 72–73

GPU sums of a longer vector, 63–65

GPU sums of arbitrarily long vectors, 66–67

graphics interoperability, 145

histogram kernel using global memory atomics, 179–180

histogram kernel using shared/global memory atomics, 182–183

multiple CUDA streams, 200

ray tracing on GPU, 102

shared memory bitmap, 91

temperature update computation, 119–120

blockIdx variable

2D texture memory, 132–133

defined, 57

dot product computation, 76–77, 85

dot product computation with atomic locks, 255–256

dot product computation, zero-copy memory, 221–222

GPU hash table implementation, 272

GPU Julia Set, 53

GPU ripple using threads, 72–73

GPU sums of a longer vector, 63–64

GPU vector sums, 44–45

graphics interoperability, 145

histogram kernel using global memory atomics, 179–180

histogram kernel using shared/global memory atomics, 182–183

multiple CUDA streams, 200

ray tracing on GPU, 102

shared memory bitmap, 91

temperature update computation, 119–121


blocks

defined, 57

GPU Julia Set, 51

GPU vector sums, 44–45

hardware-imposed limits on, 46

splitting into threads. see parallel blocks, splitting into threads

breast cancer, CUDA applications for, 8–9

bridges, connecting multiple GPUs, 224

buckets, hash table

concept of, 259–260

GPU hash table implementation, 269–275

multithreaded hash tables and, 267–268

bufferObj variable

creating GPUAnimBitmap, 149

registering with CUDA runtime, 143

registering with cudaGraphicsGLRegisterBuffer(), 151

setting up graphics interoperability, 141, 143–144

buffers, declaring shared memory, 76–77


cache[] shared memory variable

declaring buffer of shared memory named, 76–77

dot product computation, 79–80, 85–86

dot product computation with atomic locks, 255–256

cacheIndex, incorrect dot product optimization, 88

caches, texture, 116–117

callbacks, GPUAnimBitmap user registration for, 149

Cambridge University, CUDA applications, 9–10


ray tracing concepts, 97–98

ray tracing on GPU, 99–104

cellular phones, parallel processing in, 2

central processing units. see CPUs (central processing units)

cleaning agents, CUDA applications for, 10–11

clickDrag(), 149

clock speed, evolution of, 2–3

code, breaking assumptions, 45–46

code resources, CUDA, 246–248

collision resolution, hash tables, 260–261


CPU Julia Set, 48–49

early days of GPU computing, 5–6

ray tracing concepts, 98


for minimum compute capability, 167–168

standard C, for GPU code, 18–19

complex numbers

defining generic class to store, 49–50

storing with single-precision floating-point components, 54

computational fluid dynamics, CUDA applications for, 9–10

compute capability

compiling for minimum, 167–168

cudaChooseDevice() and, 141

defined, 164

of NVIDIA GPUs, 164–167

overview of, 141–142

computer games, 3D graphic development for, 4–5

constant memory

accelerating applications with, 95

measuring performance with events, 108–110

measuring ray tracer performance, 110–114

overview of, 96

performance with, 106–107

ray tracing introduction, 96–98

ray tracing on GPU, 98–104

ray tracing with, 104–106

summary review, 114


declaring memory as, 104–106

performance with constant memory, 106–107

copy_const_kernel() kernel

2D texture memory, 133

using texture memory, 129–130

copy_constant_kernel(), computing temperature updates, 119–121

CPUAnimBitmap class, creating GPU ripple, 69–70, 147–148

CPUs (central processing units)

evolution of clock speed, 2–3

evolution of core count, 3

freeing memory. see free(), C language

hash tables, 261–267

histogram computation on, 171–173

as host in this book, 23

thread management and scheduling in, 72

vector sums, 39–41

verifying GPU histogram using reverse CPU histogram, 175–176

CUBLAS library, 239–240

cuComplex structure, CPU Julia Set, 48–49

cuComplex structure, GPU Julia Set, 53–55

CUDA, Supercomputing for the Masses, 245–246

CUDA Architecture

computational fluid dynamic applications, 9–10

defined, 7

environmental science applications, 10–11

first application of, 7

medical imaging applications, 8–9

resource for understanding, 244–245

using, 7–8


computational fluid dynamic applications, 9–10

CUDA development toolkit, 16–18

CUDA-enabled graphics processor, 14–16

debugging, 241–242

development environment setup. see development environment setup

development of, 7

environmental science applications, 10–11

getting started, 13–20

medical imaging applications, 8–9

NVIDIA device driver, 16

on multiple GPUs. see GPUs (graphics processing units), multiple

overview of, 21–22

parallel programming in. see parallel programming, CUDA

passing parameters, 24–27

querying devices, 27–33

standard C compiler, 18–19

summary review, 19, 35

using device properties, 33–35

writing first program, 22–24

CUDA Data Parallel Primitives Library (CUDPP), 246

CUDA event API, and performance, 108–110

CUDA Memory Checker, 242

CUDA streams

GPU work scheduling with, 205–208

multiple, 198–205, 208–210

overview of, 192

single, 192–198

summary review, 211

CUDA Toolkit, 238–240

in development environment, 16–18

CUDA tools

CUBLAS library, 239–240

CUDA Toolkit, 238–239

CUFFT library, 239

debugging CUDA C, 241–242

GPU Computing SDK download, 240–241

NVIDIA Performance Primitives, 241

overview of, 238

Visual Profiler, 243–244

CUDA Zone, 167

cuda_malloc_test(), page-locked memory, 189

cudaBindTexture(), texture memory, 126–127

cudaBindTexture2D(), texture memory, 134

cudaChannelFormatDesc(), binding 2D textures, 134


defined, 34

GPUAnimBitmap initialization, 150

for valid ID, 141–142

cudaD3D9SetDirect3DDevice(), DirectX interoperability, 160–161

cudaDeviceMapHost(), zero-copy memory dot product, 221

cudaDeviceProp structure

cudaChooseDevice() working with, 141

multiple CUDA streams, 200

overview of, 28–31

single CUDA streams, 193–194

using device properties, 34

CUDA-enabled graphics processors, 14–16


2D texture memory, 134

CUDA streams, 192, 194, 201

GPU hash table implementation, 274–275

GPU histogram computation, 173, 177

measuring performance with events, 108–110, 112

page-locked host memory application, 188–189

performing animation with GPUAnimBitmap, 158

ray tracing on GPU, 100

standard host memory dot product, 215

texture memory, 124

zero-copy host memory, 215, 217


defined, 112

GPU hash table implementation, 275

GPU histogram computation, 176, 178

heat transfer simulation, 123, 131, 137

measuring performance with events, 111–113

page-locked host memory, 189–190

texture memory, 136

zero-copy host memory, 217, 220


2D texture memory, 130

CUDA streams, 198, 204

defined, 112

GPU hash table implementation, 275

GPU histogram computation, 175, 178

heat transfer simulation animation, 122

heat transfer using graphics interoperability, 157

page-locked host memory, 188, 190

standard host memory dot product, 216

zero-copy memory dot product, 219


CUDA streams, 194, 198, 201

CUDA streams and, 192

GPU hash table implementation, 274–275

GPU histogram computation, 173, 175, 177

heat transfer simulation animation, 122

heat transfer using graphics interoperability, 156–157

measuring performance with events, 108–109

measuring ray tracer performance, 110–113

page-locked host memory, 188–190

ray tracing on GPU, 100

standard host memory dot product, 216

using texture memory, 129–130


2D texture memory, 130

GPU hash table implementation, 275

GPU histogram computation, 175, 178

heat transfer simulation animation, 122

heat transfer using graphics interoperability, 157

measuring performance with events, 109, 111, 113

page-locked host memory, 188, 190

standard host memory dot product, 216


cudaFree()

allocating portable pinned memory, 235

CPU vector sums, 42

CUDA streams, 198, 205

defined, 26–27

dot product computation, 84, 87

dot product computation with atomic locks, 258

GPU hash table implementation, 269–270, 275

GPU ripple using threads, 69

GPU sums of arbitrarily long vectors, 69

multiple CPUs, 229

page-locked host memory, 189–190

ray tracing on GPU, 101

ray tracing with constant memory, 105

shared memory bitmap, 91

standard host memory dot product, 217


cudaFreeHost()

allocating portable pinned memory, 233

CUDA streams, 198, 204

defined, 190

freeing buffer allocated with cudaHostAlloc(), 190

zero-copy memory dot product, 220

CUDA-GDB debugging tool, 241–242


CUDA streams, 193, 200

device properties, 34

zero-copy memory dot product, 220


device properties, 34

getting count of CUDA devices, 28

multiple CPUs, 224–225


determining if GPU is integrated or discrete, 223

multiple CUDA streams, 200

querying devices, 33–35

zero-copy memory dot product, 220


graphics interoperation with OpenGL, 150

preparing CUDA to use OpenGL driver, 142

cudaGraphicsGLRegisterBuffer(), 143, 151

cudaGraphicsMapFlagsNone(), 143

cudaGraphicsMapFlagsReadOnly(), 143

cudaGraphicsMapFlagsWriteDiscard(), 143

cudaGraphicsUnmapResources(), 144


cudaHostAlloc()

CUDA streams, 195, 202

malloc() versus, 186–187

page-locked host memory application, 187–192

zero-copy memory dot product, 217–220


CUDA streams, 195, 202

default pinned memory, 214

page-locked host memory, 189–190


default pinned memory, 214

portable pinned memory, 231

zero-copy memory dot product, 217–218

cudaHostAllocPortable(), portable pinned memory, 230–235


portable pinned memory, 231

zero-copy memory dot product, 217–218


portable pinned memory, 234

zero-copy memory dot product, 218–219

cudaMalloc(), 124

2D texture memory, 133–135

allocating device memory using, 26

CPU vector sums application, 42

CUDA streams, 194, 201–202

dot product computation, 82, 86

dot product computation, standard host memory, 215

dot product computation with atomic locks, 256

GPU hash table implementation, 269, 274–275

GPU Julia Set, 51

GPU lock function, 253

GPU ripple using threads, 70

GPU sums of arbitrarily long vectors, 68

measuring ray tracer performance, 110, 112

portable pinned memory, 234

ray tracing on GPU, 100

ray tracing with constant memory, 105

shared memory bitmap, 90

using multiple CPUs, 228

using texture memory, 127

cuda-memcheck, 242


2D texture binding, 136

copying data between host and device, 27

CPU vector sums application, 42

dot product computation, 82–83, 86

dot product computation with atomic locks, 257

GPU hash table implementation, 270, 274–275

GPU histogram computation, 174–175

GPU Julia Set, 52

GPU lock function implementation, 253

GPU ripple using threads, 70

GPU sums of arbitrarily long vectors, 68

heat transfer simulation animation, 122–125

measuring ray tracer performance, 111

page-locked host memory and, 187, 189

ray tracing on GPU, 101

standard host memory dot product, 216

using multiple CPUs, 228–229


GPU work scheduling, 206–208

multiple CUDA streams, 203, 208–210

single CUDA streams, 196

timeline of intended application execution using multiple streams, 199


CPU vector sums application, 42

dot product computation, 82, 86–87

GPU hash table implementation, 270

GPU histogram computation, 174–175

GPU Julia Set, 52

GPU sums of arbitrarily long vectors, 68

multiple CUDA streams, 204

page-locked host memory, 190

ray tracing on GPU, 101

shared memory bitmap, 91

standard host memory dot product, 216

using multiple CPUs, 229


CPU vector sums application, 42

dot product computation, 86

GPU sums of arbitrarily long vectors, 68

implementing GPU lock function, 253

measuring ray tracer performance, 111

multiple CPUs, 228

multiple CUDA streams, 203

page-locked host memory, 189

standard host memory dot product, 216

cudaMemcpyToSymbol(), constant memory, 105–106


GPU hash table implementation, 269

GPU histogram computation, 174

CUDA.NET project, 247


allocating portable pinned memory, 231–232, 233–234

using device properties, 34

using multiple CPUs, 227–228


allocating portable pinned memory, 231, 234

zero-copy memory dot product, 221

cudaStreamCreate(), 194, 201

cudaStreamDestroy(), 198, 205

cudaStreamSynchronize(), 197–198, 204

cudaThreadSynchronize(), 219

cudaUnbindTexture(), 2D texture memory, 136–137

CUDPP (CUDA Data Parallel Primitives Library), 246

CUFFT library, 239

CULAtools, 246

current animation time, GPU ripple using threads, 72–74


debugging CUDA C, 241–242

detergents, CUDA applications, 10–11

dev_bitmap pointer, GPU Julia Set, 51

development environment setup

CUDA Toolkit, 16–18

CUDA-enabled graphics processor, 14–16

NVIDIA device driver, 16

standard C compiler, 18–19

summary review, 19

device drivers, 16

device overlap, GPU, 194, 198–199


GPU hash table implementation, 268–275

GPU Julia Set, 54


devices

getting count of CUDA, 28

GPU vector sums, 41–46

passing parameters, 25–27

querying, 27–33

use of term in this book, 23

using properties of, 33–35

devPtr, graphics interoperability, 144

dim3 variable grid, GPU Julia Set, 51–52

DIMxDIM bitmap image, GPU Julia Set, 49–51, 53

direct memory access (DMA), for page-locked memory, 186


DirectX

adding standard C to, 7

breakthrough in GPU technology, 5–6

GeForce 8800 GTX, 7

graphics interoperability, 160–161

discrete GPUs, 222–224

display accelerators, 2D, 4

DMA (direct memory access), for page-locked memory, 186

dot product computation

optimized incorrectly, 87–90

shared memory and, 76–87

standard host memory version of, 215–217

using atomics to keep entirely on GPU, 250–251, 254–258

dot product computation, multiple GPUs

allocating portable pinned memory, 230–235

using, 224–229

zero-copy, 217–222

zero-copy performance, 223

Dr. Dobb’s CUDA, 245–246

DRAMs, discrete GPUs with own dedicated, 222–223

draw_func, graphics interoperability, 144–146


end_thread(), multiple CPUs, 226

environmental science, CUDA applications for, 10–11

event timer. see timer, event


events

computing elapsed time between recorded. see cudaEventElapsedTime()

creating. see cudaEventCreate()

GPU histogram computation, 173

measuring performance with, 95

measuring ray tracer performance, 110–114

overview of, 108–110

recording. see cudaEventRecord()

stopping and starting. see cudaEventDestroy()

summary review, 114

EXIT_FAILURE(), passing parameters, 26


fAnim(), storing registered callbacks, 149

Fast Fourier Transform library, NVIDIA, 239

first program, writing, 22–24

flags, in graphics interoperability, 143

float_to_color() kernels, in graphics interoperability, 157

floating-point numbers

atomic arithmetic not supported for, 251

CUDA Architecture designed for, 7

early days of GPU computing not able to handle, 6

FORTRAN applications

CUBLAS compatibility with, 239–240

language wrapper for CUDA C, 246

forums, NVIDIA, 246

fractals. see Julia Set example

free(), C language

cudaFree() versus, 26–27

dot product computation with atomic locks, 258

GPU hash table implementation, 275

multiple CPUs, 227

standard host memory dot product, 217


GeForce 256, 5

GeForce 8800 GTX, 7

generate_frame(), GPU ripple, 70, 72–73, 154

generic classes, storing complex numbers with, 49–50

GL_PIXEL_UNPACK_BUFFER_ARB target, OpenGL interoperation, 151


creating pixel buffer object, 143

graphics interoperability, 146

glBufferData(), pixel buffer object, 143


graphics interoperability, 146

overview of, 154–155

glGenBuffers(), pixel buffer object, 143

global memory atomics

GPU compute capability requirements, 167

histogram kernel using, 179–181

histogram kernel using shared and, 181–183


add function, 43

kernel call, 23–24

running kernel() in GPU Julia Set application, 51–52

GLUT (GL Utility Toolkit)

graphics interoperability setup, 144

initialization of, 150

initializing OpenGL driver by calling, 142

glutIdleFunc(), 149

glutInit(), 150

glutMainLoop(), 144

GPU Computing SDK download, 18, 240–241

GPU ripple

with graphics interoperability, 147–154

using threads, 69–74

GPU vector sums

application, 41–46

of arbitrarily long vectors, using threads, 65–69

of longer vector, using threads, 63–65

using threads, 61–63

gpu_anim.h, 152–154

GPUAnimBitmap structure

creating, 148–152

GPU ripple performing animation, 152–154

heat transfer with graphics interoperability, 156–160

GPUs (graphics processing units)

called “devices” in this book, 23

developing code in CUDA C with CUDA-enabled, 14–16

development of CUDA for, 6–8

discrete versus integrated, 222–223

early days of, 5–6

freeing memory. see cudaFree()

hash tables, 268–275

histogram computation on, 173–179

histogram kernel using global memory atomics, 179–181

histogram kernel using shared/global memory atomics, 181–183

history of, 4–5

Julia Set example, 50–57

measuring performance with events, 108–110

ray tracing on, 98–104

work scheduling, 205–208

GPUs (graphics processing units), multiple, 213–236

overview of, 213–214

portable pinned memory, 230–235

summary review, 235–236

using, 224–229

zero-copy host memory, 214–222

zero-copy performance, 222–223

graphics accelerators, 3D graphics, 4–5

graphics interoperability, 139–161

DirectX, 160–161

generating image data with kernel, 139–142

GPU ripple with, 147–154

heat transfer with, 154–160

overview of, 139–140

passing image data to OpenGL for rendering, 142–147

summary review, 161

graphics processing units. see GPUs (graphics processing units)

grey(), GPU ripple, 74


grid

as collection of parallel blocks, 45

defined, 57

three-dimensional, 51

gridDim variable

2D texture memory, 132–133

defined, 57

dot product computation, 77–78

dot product computation with atomic locks, 255–256

GPU hash table implementation, 272

GPU Julia Set, 53

GPU ripple using threads, 72–73

GPU sums of arbitrarily long vectors, 66–67

graphics interoperability setup, 145

histogram kernel using global memory atomics, 179–180

histogram kernel using shared/global memory atomics, 182–183

ray tracing on GPU, 102

shared memory bitmap, 91

temperature update computation, 119–120

zero-copy memory dot product, 222


half-warps, reading constant memory, 107


2D texture memory, 133–136

CUDA streams, 194–198, 201–204, 209–210

dot product computation, 82–83, 86–87

dot product computation with atomic locks, 256–258

GPU hash table implementation, 270

GPU histogram computation completion, 175

GPU lock function implementation, 253

GPU ripple using threads, 70

GPU sums of arbitrarily long vectors, 68

heat transfer simulation animation, 122–125

measuring ray tracer performance, 110–114

page-locked host memory application, 188–189

passing parameters, 26

paying attention to, 46

portable pinned memory, 231–235

ray tracing on GPU, 100–101

ray tracing with constant memory, 104–105

shared memory bitmap, 90–91

standard host memory dot product, 215–217

texture memory, 127, 129

zero-copy memory dot product, 217–222


decoupling parallelization from method of executing, 66

performing atomic operations on memory, 167

hardware limitations

GPU sums of arbitrarily long vectors, 65–69

number of blocks in single launch, 46

number of threads per block in kernel launch, 63

hash function

CPU hash table implementation, 261–267

GPU hash table implementation, 268–275

overview of, 259–261

hash tables

concepts, 259–261

CPU, 261–267

GPU, 268–275

multithreaded, 267–268

performance, 276–277

summary review, 277

heat transfer simulation

2D texture memory, 131–137

animating, 121–125

computing temperature updates, 119–121

with graphics interoperability, 154–160

simple heating model, 117–118

using texture memory, 125–131

“Hello, World” example

kernel call, 23–24

passing parameters, 24–27

writing first program, 22–23

Highly Optimized Object-oriented Many-particle Dynamics (HOOMD), 10–11

histogram computation

on CPUs, 171–173

on GPUs, 173–179

overview, 170

histogram kernel

using global memory atomics, 179–181

using shared/global memory atomics, 181–183

hit() method, ray tracing on GPU, 99, 102

HOOMD (Highly Optimized Object-oriented Many-particle Dynamics), 10–11


host

allocating memory to. see malloc()

CPU vector sums, 39–41

CUDA C blurring device code and, 26

page-locked memory, 186–192

passing parameters, 25–27

use of term in this book, 23

zero-copy host memory, 214–222


idle_func() member, GPUAnimBitmap, 154

IEEE requirements, ALUs, 7

increment operator (x++), 168–170


CPU hash table implementation, 263, 266

CPU histogram computation, 171

GLUT, 142, 150, 173–174

GPUAnimBitmap, 149

inner products. see dot product computation

integrated GPUs, 222–224

interleaved operations, 169–170

interoperation. see graphics interoperability


julia() function, 48–49, 53

Julia Set example

CPU application of, 47–50

GPU application of, 50–57

overview of, 46–47



kernel

2D texture memory, 131–133

blockIdx.x variable, 44

call to, 23–24

defined, 23

GPU histogram computation, 176–178

GPU Julia Set, 49–52

GPU ripple performing animation, 154

GPU ripple using threads, 70–72

GPU sums of a longer vector, 63–65

graphics interoperability, 139–142, 144–146

“Hello, World” example of call to, 23–24

launching with number in angle brackets that is not 1, 43–44

passing parameters to, 24–27

ray tracing on GPU, 102–104

texture memory, 127–131

key_func, graphics interoperability, 144–146


CPU hash table implementation, 261–267

GPU hash table implementation, 269–275

hash table concepts, 259–260


language wrappers, 246–247

LAPACK (Linear Algebra Package), 246

light effects, ray tracing concepts, 97

Linux, standard C compiler for, 19

Lock structure, 254–258, 268–275

locks, atomic, 251–254


Macintosh OS X, standard C compiler, 19


2D texture memory, 133–136

CPU hash table implementation, 266–267

CPU histogram computation, 171

dot product computation, 81–84

dot product computation with atomic locks, 255–256

GPU hash table implementation, 273–275

GPU histogram computation, 173

GPU Julia Set, 47, 50–51

GPU ripple using threads, 69–70

GPU vector sums, 41–42

graphics interoperability, 144

page-locked host memory application, 190–192

ray tracing on GPU, 99–100

ray tracing with constant memory, 104–106

shared memory bitmap, 90

single CUDA streams, 193–194

zero-copy memory dot product, 220–222


malloc()

cudaHostAlloc() versus, 186

cudaHostAlloc() versus, 190

cudaMalloc() versus, 26

ray tracing on GPU, 100

mammograms, CUDA applications for medical imaging, 9

maxThreadsPerBlock field, device properties, 63

media and communications processors (MCPs), 223

medical imaging, CUDA applications for, 8–9

memcpy(), C language, 27


memory

allocating device. see cudaMalloc()

constant. see constant memory

CUDA Architecture creating access to, 7

early days of GPU computing, 6

executing device code that uses allocated, 70

freeing. see cudaFree(); free(), C language

GPU histogram computation, 173–174

page-locked host (pinned), 186–192

querying devices, 27–33

shared. see shared memory

texture. see texture memory

use of term in this book, 23

Memory Checker, CUDA, 242

memset(), C language, 174

Microsoft Windows, Visual Studio C compiler, 18–19

Microsoft.NET, 247

multicore revolution, evolution of CPUs, 3

multiplication, in vector dot products, 76

multithreaded hash tables, 267–268

mutex, GPU lock function, 252–254


nForce media and communications processors (MCPs), 222–223


NVIDIA

compute capability of various GPUs, 164–167

creating 3D graphics for consumers, 5

creating CUDA C for GPU, 7

creating first GPU built with CUDA Architecture, 7

CUBLAS library, 239–240

CUDA-enabled graphics processors, 14–16

CUDA-GDB debugging tool, 241–242

CUFFT library, 239

device driver, 16

GPU Computing SDK download, 18, 240–241

Parallel NSight debugging tool, 242

Performance Primitives, 241

products containing multiple GPUs, 224

Visual Profiler, 243–244

NVIDIA CUDA Programming Guide, 31


offset, 2D texture memory, 133

on-chip caching. see constant memory; texture memory

one-dimensional blocks

GPU sums of a longer vector, 63

two-dimensional blocks versus, 44

online resources. see resources, online


OpenGL

creating GPUAnimBitmap, 148–152

in early days of GPU computing, 5–6

generating image data with kernel, 139–142

interoperation, 142–147

writing 3D graphics, 4

operations, atomic, 168–170

optimization, incorrect dot product, 87–90


page-locked host memory

allocating as portable pinned memory, 230–235

overview of, 186–187

restricted use of, 187

single CUDA streams with, 195–197

parallel blocks

GPU Julia Set, 51

GPU vector sums, 45

parallel blocks, splitting into threads

GPU sums of arbitrarily long vectors, 65–69

GPU sums of longer vector, 63–65

GPU vector sums using threads, 61–63

overview of, 60

vector sums, 60–61

Parallel NSight debugging tool, 242

parallel processing

evolution of CPUs, 2–3

past perception of, 1

parallel programming, CUDA

CPU vector sums, 39–41

example, CPU Julia Set application, 47–50

example, GPU Julia Set application, 50–57

example, overview, 46–47

GPU vector sums, 41–46

overview of, 38

summary review, 56

summing vectors, 38–41

parameter passing, 24–27, 40, 72

PC gaming, 3D graphics for, 4–5

PCI Express slots, adding multiple GPUs to, 224


performance

constant memory and, 106–107

evolution of CPUs, 2–3

hash table, 276

launching kernel for GPU histogram computation, 176–177

measuring with events, 108–114

page-locked host memory and, 187

zero-copy memory and, 222–223

pinned memory

allocating as portable, 230–235

cudaHostAllocDefault() getting default, 214

as page-locked memory. see page-locked host memory

pixel buffer objects (PBO), OpenGL, 142–143

pixel shaders, early days of GPU computing, 5–6

pixels, number of threads per block, 70–74

portable computing devices, 2

Programming Massively Parallel Processors: A Hands-on Approach (Kirk, Hwu), 244


cudaDeviceProp structure. see cudaDeviceProp structure

maxThreadsPerBlock field for device, 63

reporting device, 31

using device, 33–35

PyCUDA project, 246–247

Python language wrappers for CUDA C, 246


querying, devices, 27–33


rasterization, 97

ray tracing

concepts behind, 96–98

with constant memory, 104–106

on GPU, 98–104

measuring performance, 110–114

read-modify-write operations

atomic operations as, 168–170, 251

using atomic locks, 251–254

read-only memory. see constant memory; texture memory


dot products as, 83

overview of, 250

shared memory and synchronization for, 79–81

references, texture memory, 126–127, 131–137


bufferObj with cudaGraphicsGLRegisterBuffer(), 151

callback, 149

rendering, GPUs performing complex, 139

resource variable

creating GPUAnimBitmap, 148–152

graphics interoperation, 141

resources, online

CUDA code, 246–248

CUDA Toolkit, 16

CUDA University, 245

CUDPP, 246

CULAtools, 246

Dr. Dobb’s CUDA, 246

GPU Computing SDK code samples, 18

language wrappers, 246–247

NVIDIA device driver, 16

NVIDIA forums, 246

standard C compiler for Mac OS X, 19

Visual Studio C compiler, 18

resources, written

CUDA U, 245–246

forums, 246

Programming Massively Parallel Processors, 244–245

ripple, GPU

with graphics interoperability, 147–154

producing, 69–74


allocating portable pinned memory, 232–234

using multiple CPUs, 226–228

Russian nesting doll hierarchy, 164


scalable link interface (SLI), adding multiple GPUs with, 224

scale factor, CPU Julia Set, 49

scientific computations, in early days, 6


animated heat transfer simulation, 126

GPU Julia Set example, 57

GPU ripple example, 74

graphics interoperation example, 147

ray tracing example, 103–104

rendered with proper synchronization, 93

rendered without proper synchronization, 92

shading languages, 6

shared data buffers, kernel/OpenGL rendering interoperation, 142

shared memory

atomics, 167, 181–183

bitmap, 90–93

CUDA Architecture creating access to, 7

dot product, 76–87

dot product optimized incorrectly, 87–90

and synchronization, 75

Silicon Graphics, OpenGL library, 4


animation of, 121–125

challenges of physical, 117

computing temperature updates, 119–121

simple heating model, 117–118

SLI (scalable link interface), adding multiple GPUs with, 224

spatial locality

designing texture caches for graphics with, 116

heat transfer simulation animation, 125–126

split parallel blocks. see parallel blocks, splitting into threads

standard C compiler

compiling for minimum compute capability, 167–168

development environment, 18–19

kernel call, 23–24

start event, 108–110

start_thread(), multiple CPUs, 226–227

stop event, 108–110


streams

CUDA, overview of, 192

CUDA, using multiple, 198–205, 208–210

CUDA, using single, 192–198

GPU work scheduling and, 205–208

overview of, 185–186

page-locked host memory and, 186–192

summary review, 211

supercomputers, performance gains in, 3

surfactants, environmental devastation of, 10


synchronization

of events. see cudaEventSynchronize()

of streams, 197–198, 204

of threads, 219

synchronization, and shared memory

dot product, 76–87

dot product optimized incorrectly, 87–90

overview of, 75

shared memory bitmap, 90–93


dot product computation, 78–80, 85

shared memory bitmap using, 90–93

unintended consequences of, 87–90


task parallelism, CPU versus GPU applications, 185

TechniScan Medical Systems, CUDA applications, 9


computing temperature updates, 119–121

heat transfer simulation, 117–118

heat transfer simulation animation, 121–125

Temple University research, CUDA applications, 10–11

tex1Dfetch() compiler intrinsic, texture memory, 127–128, 131–132

tex2D() compiler intrinsic, texture memory, 132–133

texture, early days of GPU computing, 5–6

texture memory

animation of simulation, 121–125

defined, 115

overview of, 115–117

simulating heat transfer, 117–121

summary review, 137

two-dimensional, 131–137

using, 125–131

threadIdx variable

2D texture memory, 132–133

dot product computation, 76–77, 85

dot product computation with atomic locks, 255–256

GPU hash table implementation, 272

GPU Julia Set, 52

GPU ripple using threads, 72–73

GPU sums of a longer vector, 63–64

GPU sums of arbitrarily long vectors, 66–67

GPU vector sums using threads, 61

histogram kernel using global memory atomics, 179–180

histogram kernel using shared/global memory atomics, 182–183

multiple CUDA streams, 200

ray tracing on GPU, 102

setting up graphics interoperability, 145

shared memory bitmap, 91

temperature update computation, 119–121

zero-copy memory dot product, 221


threads

coding with, 38–41

constant memory and, 106–107

GPU ripple using, 69–74

GPU sums of a longer vector, 63–65

GPU sums of arbitrarily long vectors, 65–69

GPU vector sums using, 61–63

hardware limit to number of, 63

histogram kernel using global memory atomics, 179–181

incorrect dot product optimization and divergence of, 89

multiple CPUs, 225–229

overview of, 59–60

ray tracing on GPU and, 102–104

read-modify-write operations, 168–170

shared memory and. see shared memory

summary review, 94

synchronizing, 219


allocating shared memory, 76–77

dot product computation, 79–87

three-dimensional blocks, GPU sums of a longer vector, 63

three-dimensional graphics, history of GPUs, 4–5

three-dimensional scenes, ray tracing producing 2-D image of, 97

tid variable

blockIdx.x variable assigning value of, 44

checking that it is less than N, 45–46

dot product computation, 77–78

parallelizing code on multiple CPUs, 40

time, GPU ripple using threads, 72–74

timer, event. see cudaEventElapsedTime()

Toolkit, CUDA, 16–18

two-dimensional blocks

arrangement of blocks and threads, 64

GPU Julia Set, 51

GPU ripple using threads, 70

gridDim variable as, 63

one-dimensional indexing versus, 44

two-dimensional display accelerators, development of GPUs, 4

two-dimensional texture memory

defined, 116

heat transfer simulation, 117–118

overview of, 131–137


ultrasound imaging, CUDA applications for, 9

unified shader pipeline, CUDA Architecture, 7

university, CUDA, 245



CPU hash table implementation, 261–267

GPU hash table implementation, 269–275

hash table concepts, 259–260

vector dot products. see dot product computation

vector sums

CPU, 39–41

GPU, 41–46

GPU sums of arbitrarily long vectors, 65–69

GPU sums of longer vector, 63–65

GPU sums using threads, 61–63

overview of, 38–39, 60–61

verify_table(), GPU hash table, 270

Visual Profiler, NVIDIA, 243–244

Visual Studio C compiler, 18–19


warps, reading constant memory with, 106–107

while() loop

CPU vector sums, 40

GPU lock function, 253

work scheduling, GPU, 205–208


zero-copy memory

allocating/using, 214–222

defined, 214

performance, 222–223

