add()
function, CPU vector sums, 40–44
add_to_table()
kernel, GPU hash table, 272
ALUs (arithmetic logic units)
CUDA Architecture, 7
using constant memory, 96
anim_and_exit()
method, GPU ripples, 70
anim_gpu()
routine, texture memory, 123, 129
animation
GPU Julia Set example, 50–57
GPU ripple using threads, 69–74
heat transfer simulation, 121–125
animExit()
, 149
asynchronous call
cudaMemcpyAsync()
as, 197
using events with, 109
atomic locks
GPU hash table, 274–275
overview of, 251–254
atomicAdd()
atomic locks, 251–254
histogram kernel using global memory, 180
not supporting floating-point numbers, 251
atomicCAS()
, GPU lock, 252–253
atomicExch()
, GPU lock, 253–254
atomics, 163–184
advanced, 249–277
compute capability of NVIDIA GPUs, 164–167
dot product and, 248–251
hash tables. see hash tables
histogram computation, CPU, 171–173
histogram computation, GPU, 173–179
histogram computation, overview, 170
histogram kernel using global memory atomics, 179–181
histogram kernel using shared/global memory atomics, 181–183
for minimum compute capability, 167–168
locks, 251–254
operations, 168–170
bandwidth, constant memory saving, 106–107
Basic Linear Algebra Subprograms (BLAS), CUBLAS library, 239–240
bin counts, CPU histogram computation, 171–173
BLAS (Basic Linear Algebra Subprograms), CUBLAS library, 239–240
blend_kernel()
kernel
2D texture memory, 131–133
texture memory, 127–129
blockDim
variable
2D texture memory, 132–133
dot product computation, 76–78, 85
dot product computation, incorrect optimization, 88
dot product computation with atomic locks, 255–256
dot product computation, zero-copy memory, 221–222
GPU hash table implementation, 272
GPU ripple using threads, 72–73
GPU sums of a longer vector, 63–65
GPU sums of arbitrarily long vectors, 66–67
graphics interoperability, 145
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
multiple CUDA streams, 200
ray tracing on GPU, 102
shared memory bitmap, 91
temperature update computation, 119–120
blockIdx
variable
2D texture memory, 132–133
defined, 57
dot product computation, 76–77, 85
dot product computation with atomic locks, 255–256
dot product computation, zero-copy memory, 221–222
GPU hash table implementation, 272
GPU Julia Set, 53
GPU ripple using threads, 72–73
GPU sums of a longer vector, 63–64
GPU vector sums, 44–45
graphics interoperability, 145
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
multiple CUDA streams, 200
ray tracing on GPU, 102
shared memory bitmap, 91
temperature update computation, 119–121
blocks
defined, 57
GPU Julia Set, 51
GPU vector sums, 44–45
hardware-imposed limits on, 46
splitting into threads. see parallel blocks, splitting into threads
breast cancer, CUDA applications for, 8–9
bridges, connecting multiple GPUs, 224
buckets, hash table
concept of, 259–260
GPU hash table implementation, 269–275
multithreaded hash tables and, 267–268
bufferObj
variable
creating GPUAnimBitmap
, 149
registering with CUDA runtime, 143
registering with cudaGraphicsGLRegisterBuffer()
, 151
setting up graphics interoperability, 141, 143–144
buffers, declaring shared memory, 76–77
cache[]
shared memory variable
declaring buffer of shared memory named, 76–77
dot product computation, 79–80, 85–86
dot product computation with atomic locks, 255–256
cacheIndex
, incorrect dot product optimization, 88
caches, texture, 116–117
callbacks, GPUAnimBitmap
user registration for, 149
Cambridge University, CUDA applications, 9–10
camera
ray tracing concepts, 97–98
ray tracing on GPU, 99–104
cellular phones, parallel processing in, 2
central processing units. see CPUs (central processing units)
cleaning agents, CUDA applications for, 10–11
clickDrag()
, 149
clock speed, evolution of, 2–3
code, breaking assumptions, 45–46
code resources, CUDA, 246–248
collision resolution, hash tables, 260–261
color
CPU Julia Set, 48–49
early days of GPU computing, 5–6
ray tracing concepts, 98
compiler
for minimum compute capability, 167–168
standard C, for GPU code, 18–19
complex numbers
defining generic class to store, 49–50
storing with single-precision floating-point components, 54
computational fluid dynamics, CUDA applications for, 9–10
compute capability
compiling for minimum, 167–168
cudaChooseDevice() and
, 141
defined, 164
of NVIDIA GPUs, 164–167
overview of, 141–142
computer games, 3D graphic development for, 4–5
constant memory
accelerating applications with, 95
measuring performance with events, 108–110
measuring ray tracer performance, 110–114
overview of, 96
performance with, 106–107
ray tracing introduction, 96–98
ray tracing on GPU, 98–104
ray tracing with, 104–106
summary review, 114
__constant__
function
declaring memory as, 104–106
performance with constant memory, 106–107
copy_const_kernel()
kernel
2D texture memory, 133
using texture memory, 129–130
copy_constant_kernel()
, computing temperature updates, 119–121
CPUAnimBitmap
class, creating GPU ripple, 69–70, 147–148
CPUs (central processing units)
evolution of clock speed, 2–3
evolution of core count, 3
freeing memory. see free()
, C language
hash tables, 261–267
histogram computation on, 171–173
as host in this book, 23
thread management and scheduling in, 72
vector sums, 39–41
verifying GPU histogram using reverse CPU histogram, 175–176
CUBLAS library, 239–240
cuComplex
structure
CPU Julia Set, 48–49
GPU Julia Set, 53–55
CUDA, Supercomputing for the Masses, 245–246
CUDA Architecture
computational fluid dynamic applications, 9–10
defined, 7
environmental science applications, 10–11
first application of, 7
medical imaging applications, 8–9
resource for understanding, 244–245
using, 7–8
CUDA C
computational fluid dynamic applications, 9–10
CUDA development toolkit, 16–18
CUDA-enabled graphics processor, 14–16
debugging, 241–242
development environment setup. see development environment setup
development of, 7
environmental science applications, 10–11
getting started, 13–20
medical imaging applications, 8–9
NVIDIA device driver, 16
on multiple GPUs. see GPUs (graphics processing units), multiple
overview of, 21–22
parallel programming in. see parallel programming, CUDA
passing parameters, 24–27
querying devices, 27–33
standard C compiler, 18–19
using device properties, 33–35
writing first program, 22–24
CUDA Data Parallel Primitives Library (CUDPP), 246
CUDA event API, and performance, 108–110
CUDA Memory Checker, 242
CUDA streams
GPU work scheduling with, 205–208
overview of, 192
single, 192–198
summary review, 211
CUDA Toolkit, 238–240
in development environment, 16–18
CUDA tools
CUBLAS library, 239–240
CUDA Toolkit, 238–239
CUFFT library, 239
debugging CUDA C, 241–242
GPU Computing SDK download, 240–241
NVIDIA Performance Primitives, 241
overview of, 238
Visual Profiler, 243–244
CUDA Zone, 167
cuda_malloc_test()
, page-locked memory, 189
cudaBindTexture()
, texture memory, 126–127
cudaBindTexture2D()
, texture memory, 134
cudaChannelFormatDesc()
, binding 2D textures, 134
cudaChooseDevice()
defined, 34
GPUAnimBitmap
initialization, 150
for valid ID, 141–142
cudaD3D9SetDirect3DDevice()
, DirectX interoperability, 160–161
cudaDeviceMapHost
flag, zero-copy memory dot product, 221
cudaDeviceProp
structure
cudaChooseDevice()
working with, 141
multiple CUDA streams, 200
overview of, 28–31
single CUDA streams, 193–194
using device properties, 34
CUDA-enabled graphics processors, 14–16
cudaEventCreate()
2D texture memory, 134
GPU hash table implementation, 274–275
GPU histogram computation, 173, 177
measuring performance with events, 108–110, 112
page-locked host memory application, 188–189
performing animation with GPUAnimBitmap
, 158
ray tracing on GPU, 100
standard host memory dot product, 215
texture memory, 124
zero-copy host memory, 215, 217
cudaEventDestroy()
defined, 112
GPU hash table implementation, 275
GPU histogram computation, 176, 178
heat transfer simulation, 123, 131, 137
measuring performance with events, 111–113
page-locked host memory, 189–190
texture memory, 136
zero-copy host memory, 217, 220
cudaEventElapsedTime()
2D texture memory, 130
defined, 112
GPU hash table implementation, 275
GPU histogram computation, 175, 178
heat transfer simulation animation, 122
heat transfer using graphics interoperability, 157
page-locked host memory, 188, 190
standard host memory dot product, 216
zero-copy memory dot product, 219
cudaEventRecord()
CUDA streams and, 192
GPU hash table implementation, 274–275
GPU histogram computation, 173, 175, 177
heat transfer simulation animation, 122
heat transfer using graphics interoperability, 156–157
measuring performance with events, 108–109
measuring ray tracer performance, 110–113
page-locked host memory, 188–190
ray tracing on GPU, 100
standard host memory dot product, 216
using texture memory, 129–130
cudaEventSynchronize()
2D texture memory, 130
GPU hash table implementation, 275
GPU histogram computation, 175, 178
heat transfer simulation animation, 122
heat transfer using graphics interoperability, 157
measuring performance with events, 109, 111, 113
page-locked host memory, 188, 190
standard host memory dot product, 216
cudaFree()
allocating portable pinned memory, 235
CPU vector sums, 42
defined, 26–27
dot product computation, 84, 87
dot product computation with atomic locks, 258
GPU hash table implementation, 269–270, 275
GPU ripple using threads, 69
GPU sums of arbitrarily long vectors, 69
multiple CPUs, 229
page-locked host memory, 189–190
ray tracing on GPU, 101
ray tracing with constant memory, 105
shared memory bitmap, 91
standard host memory dot product, 217
cudaFreeHost()
allocating portable pinned memory, 233
defined, 190
freeing buffer allocated with cudaHostAlloc()
, 190
zero-copy memory dot product, 220
CUDA-GDB debugging tool, 241–242
cudaGetDevice()
device properties, 34
zero-copy memory dot product, 220
cudaGetDeviceCount()
device properties, 34
getting count of CUDA devices, 28
multiple CPUs, 224–225
cudaGetDeviceProperties()
determining if GPU is integrated or discrete, 223
multiple CUDA streams, 200
querying devices, 33–35
zero-copy memory dot product, 220
cudaGLSetGLDevice()
graphics interoperation with OpenGL, 150
preparing CUDA to use OpenGL driver, 142
cudaGraphicsGLRegisterBuffer()
, 143, 151
cudaGraphicsMapFlagsNone
flag, 143
cudaGraphicsMapFlagsReadOnly
flag, 143
cudaGraphicsMapFlagsWriteDiscard
flag, 143
cudaGraphicsUnmapResources()
, 144
cudaHostAlloc()
malloc()
versus, 186–187
page-locked host memory application, 187–192
zero-copy memory dot product, 217–220
cudaHostAllocDefault
flag
default pinned memory, 214
page-locked host memory, 189–190
cudaHostAllocMapped
flag
default pinned memory, 214
portable pinned memory, 231
zero-copy memory dot product, 217–218
cudaHostAllocPortable
flag, portable pinned memory, 230–235
cudaHostAllocWriteCombined
flag
portable pinned memory, 231
zero-copy memory dot product, 217–218
cudaHostGetDevicePointer()
portable pinned memory, 234
zero-copy memory dot product, 218–219
cudaMalloc()
, 124
2D texture memory, 133–135
allocating device memory using, 26
CPU vector sums application, 42
dot product computation, 82, 86
dot product computation, standard host memory, 215
dot product computation with atomic locks, 256
GPU hash table implementation, 269, 274–275
GPU Julia Set, 51
GPU lock function, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
measuring ray tracer performance, 110, 112
portable pinned memory, 234
ray tracing on GPU, 100
ray tracing with constant memory, 105
shared memory bitmap, 90
using multiple CPUs, 228
using texture memory, 127
cuda-memcheck
, 242
cudaMemcpy()
2D texture binding, 136
copying data between host and device, 27
CPU vector sums application, 42
dot product computation, 82–83, 86
dot product computation with atomic locks, 257
GPU hash table implementation, 270, 274–275
GPU histogram computation, 174–175
GPU Julia Set, 52
GPU lock function implementation, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
heat transfer simulation animation, 122–125
measuring ray tracer performance, 111
page-locked host memory and, 187, 189
ray tracing on GPU, 101
standard host memory dot product, 216
using multiple CPUs, 228–229
cudaMemcpyAsync()
GPU work scheduling, 206–208
multiple CUDA streams, 203, 208–210
single CUDA streams, 196
timeline of intended application execution using multiple streams, 199
cudaMemcpyDeviceToHost
CPU vector sums application, 42
dot product computation, 82, 86–87
GPU hash table implementation, 270
GPU histogram computation, 174–175
GPU Julia Set, 52
GPU sums of arbitrarily long vectors, 68
multiple CUDA streams, 204
page-locked host memory, 190
ray tracing on GPU, 101
shared memory bitmap, 91
standard host memory dot product, 216
using multiple CPUs, 229
cudaMemcpyHostToDevice
CPU vector sums application, 42
dot product computation, 86
GPU sums of arbitrarily long vectors, 68
implementing GPU lock function, 253
measuring ray tracer performance, 111
multiple CPUs, 228
multiple CUDA streams, 203
page-locked host memory, 189
standard host memory dot product, 216
cudaMemcpyToSymbol()
, constant memory, 105–106
cudaMemset()
GPU hash table implementation, 269
GPU histogram computation, 174
CUDA.NET project, 247
cudaSetDevice()
allocating portable pinned memory, 231–232, 233–234
using device properties, 34
using multiple CPUs, 227–228
cudaSetDeviceFlags()
allocating portable pinned memory, 231, 234
zero-copy memory dot product, 221
cudaStreamSynchronize()
, 197–198, 204
cudaThreadSynchronize()
, 219
cudaUnbindTexture()
, 2D texture memory, 136–137
CUDPP (CUDA Data Parallel Primitives Library), 246
CUFFT library, 239
CULAtools, 246
current animation time, GPU ripple using threads, 72–74
debugging CUDA C, 241–242
detergents, CUDA applications, 10–11
dev_bitmap
pointer, GPU Julia Set, 51
development environment setup
CUDA Toolkit, 16–18
CUDA-enabled graphics processor, 14–16
NVIDIA device driver, 16
standard C compiler, 18–19
summary review, 19
device drivers, 16
device overlap, GPU, 194, 198–199
__device__
function
GPU hash table implementation, 268–275
GPU Julia Set, 54
devices
getting count of CUDA, 28
GPU vector sums, 41–46
passing parameters, 25–27
querying, 27–33
use of term in this book, 23
using properties of, 33–35
devPtr
, graphics interoperability, 144
dim3
variable grid, GPU Julia Set, 51–52
DIMxDIM
bitmap image, GPU Julia Set, 49–51, 53
direct memory access (DMA), for page-locked memory, 186
DirectX
adding standard C to, 7
breakthrough in GPU technology, 5–6
GeForce 8800 GTX, 7
graphics interoperability, 160–161
discrete GPUs, 222–224
display accelerators, 2D, 4
DMA (direct memory access), for page-locked memory, 186
dot product computation
optimized incorrectly, 87–90
shared memory and, 76–87
standard host memory version of, 215–217
using atomics to keep entirely on GPU, 250–251, 254–258
dot product computation, multiple GPUs
allocating portable pinned memory, 230–235
using, 224–229
zero-copy, 217–222
zero-copy performance, 223
Dr. Dobb’s CUDA, 245–246
DRAMs, discrete GPUs with own dedicated, 222–223
draw_func
, graphics interoperability, 144–146
end_thread()
, multiple CPUs, 226
environmental science, CUDA applications for, 10–11
event timer. see timer, event
events
computing elapsed time between recorded. see cudaEventElapsedTime()
creating. see cudaEventCreate()
GPU histogram computation, 173
measuring performance with, 95
measuring ray tracer performance, 110–114
overview of, 108–110
recording. see cudaEventRecord()
stopping and starting. see cudaEventDestroy()
summary review, 114
EXIT_FAILURE
, passing parameters, 26
fAnim()
, storing registered callbacks, 149
Fast Fourier Transform library, NVIDIA, 239
first program, writing, 22–24
flags, in graphics interoperability, 143
float_to_color()
kernels, in graphics interoperability, 157
floating-point numbers
atomic arithmetic not supported for, 251
CUDA Architecture designed for, 7
early days of GPU computing not able to handle, 6
FORTRAN applications
CUBLAS compatibility with, 239–240
language wrapper for CUDA C, 246
forums, NVIDIA, 246
fractals. see Julia Set example
free()
, C language
cudaFree()
versus, 26–27
dot product computation with atomic locks, 258
GPU hash table implementation, 275
multiple CPUs, 227
standard host memory dot product, 217
GeForce 256, 5
GeForce 8800 GTX, 7
generate_frame()
, GPU ripple, 70, 72–73, 154
generic classes, storing complex numbers with, 49–50
GL_PIXEL_UNPACK_BUFFER_ARB
target, OpenGL interoperation, 151
glBindBuffer()
creating pixel buffer object, 143
graphics interoperability, 146
glBufferData()
, pixel buffer object, 143
glDrawPixels()
graphics interoperability, 146
overview of, 154–155
glGenBuffers()
, pixel buffer object, 143
global memory atomics
GPU compute capability requirements, 167
histogram kernel using, 179–181
histogram kernel using shared and, 181–183
__global__
add
function, 43
kernel call, 23–24
running kernel()
in GPU Julia Set application, 51–52
GLUT (GL Utility Toolkit)
graphics interoperability setup, 144
initialization of, 150
initializing OpenGL driver by calling, 142
glutIdleFunc()
, 149
glutInit()
, 150
glutMainLoop()
, 144
GPU Computing SDK download, 18, 240–241
GPU ripple
with graphics interoperability, 147–154
using threads, 69–74
GPU vector sums
application, 41–46
of arbitrarily long vectors, using threads, 65–69
of longer vector, using threads, 63–65
using threads, 61–63
gpu_anim.h
, 152–154
GPUAnimBitmap
structure
creating, 148–152
GPU ripple performing animation, 152–154
heat transfer with graphics interoperability, 156–160
GPUs (graphics processing units)
called “devices” in this book, 23
developing code in CUDA C with CUDA-enabled, 14–16
development of CUDA for, 6–8
discrete versus integrated, 222–223
early days of, 5–6
freeing memory. see cudaFree()
hash tables, 268–275
histogram computation on, 173–179
histogram kernel using global memory atomics, 179–181
histogram kernel using shared/global memory atomics, 181–183
history of, 4–5
Julia Set example, 50–57
measuring performance with events, 108–110
ray tracing on, 98–104
work scheduling, 205–208
GPUs (graphics processing units), multiple, 213–236
overview of, 213–214
portable pinned memory, 230–235
summary review, 235–236
using, 224–229
zero-copy host memory, 214–222
zero-copy performance, 222–223
graphics accelerators, 3D graphics, 4–5
graphics interoperability, 139–161
DirectX, 160–161
generating image data with kernel, 139–142
GPU ripple with, 147–154
heat transfer with, 154–160
overview of, 139–140
passing image data to OpenGL for rendering, 142–147
summary review, 161
graphics processing units. see GPUs (graphics processing units)
grey()
, GPU ripple, 74
grid
as collection of parallel blocks, 45
defined, 57
three-dimensional, 51
gridDim
variable
2D texture memory, 132–133
defined, 57
dot product computation, 77–78
dot product computation with atomic locks, 255–256
GPU hash table implementation, 272
GPU Julia Set, 53
GPU ripple using threads, 72–73
GPU sums of arbitrarily long vectors, 66–67
graphics interoperability setup, 145
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
ray tracing on GPU, 102
shared memory bitmap, 91
temperature update computation, 119–120
zero-copy memory dot product, 222
half-warps, reading constant memory, 107
HANDLE_ERROR()
macro
2D texture memory, 133–136
CUDA streams, 194–198, 201–204, 209–210
dot product computation, 82–83, 86–87
dot product computation with atomic locks, 256–258
GPU hash table implementation, 270
GPU histogram computation completion, 175
GPU lock function implementation, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
heat transfer simulation animation, 122–125
measuring ray tracer performance, 110–114
page-locked host memory application, 188–189
passing parameters, 26
paying attention to, 46
portable pinned memory, 231–235
ray tracing on GPU, 100–101
ray tracing with constant memory, 104–105
shared memory bitmap, 90–91
standard host memory dot product, 215–217
zero-copy memory dot product, 217–222
hardware
decoupling parallelization from method of executing, 66
performing atomic operations on memory, 167
hardware limitations
GPU sums of arbitrarily long vectors, 65–69
number of blocks in single launch, 46
number of threads per block in kernel launch, 63
hash function
CPU hash table implementation, 261–267
GPU hash table implementation, 268–275
overview of, 259–261
hash tables
concepts, 259–261
CPU, 261–267
GPU, 268–275
multithreaded, 267–268
performance, 276–277
summary review, 277
heat transfer simulation
2D texture memory, 131–137
animating, 121–125
computing temperature updates, 119–121
with graphics interoperability, 154–160
simple heating model, 117–118
using texture memory, 125–131
“Hello, World” example
kernel call, 23–24
passing parameters, 24–27
writing first program, 22–23
Highly Optimized Object-oriented Many-particle Dynamics (HOOMD), 10–11
histogram computation
on CPUs, 171–173
on GPUs, 173–179
overview, 170
histogram kernel
using global memory atomics, 179–181
using shared/global memory atomics, 181–183
hit()
method, ray tracing on GPU, 99, 102
HOOMD (Highly Optimized Object-oriented Many-particle Dynamics), 10–11
hosts
allocating memory to. see malloc()
CPU vector sums, 39–41
CUDA C blurring device code and, 26
page-locked memory, 186–192
passing parameters, 25–27
use of term in this book, 23
zero-copy host memory, 214–222
idle_func()
member, GPUAnimBitmap
, 154
IEEE requirements, ALUs, 7
increment operator (x++), 168–170
initialization
CPU hash table implementation, 263, 266
CPU histogram computation, 171
GPUAnimBitmap
, 149
inner products. see dot product computation
integrated GPUs, 222–224
interleaved operations, 169–170
interoperation. see graphics interoperability
Julia Set example
CPU application of, 47–50
GPU application of, 50–57
overview of, 46–47
kernel
2D texture memory, 131–133
blockIdx.x
variable, 44
call to, 23–24
defined, 23
GPU histogram computation, 176–178
GPU Julia Set, 49–52
GPU ripple performing animation, 154
GPU ripple using threads, 70–72
GPU sums of a longer vector, 63–65
graphics interoperability, 139–142, 144–146
“Hello, World” example of call to, 23–24
launching with number in angle brackets that is not 1, 43–44
passing parameters to, 24–27
ray tracing on GPU, 102–104
texture memory, 127–131
key_func
, graphics interoperability, 144–146
keys
CPU hash table implementation, 261–267
GPU hash table implementation, 269–275
hash table concepts, 259–260
language wrappers, 246–247
LAPACK (Linear Algebra Package), 246
light effects, ray tracing concepts, 97
Linux, standard C compiler for, 19
Lock
structure, 254–258, 268–275
locks, atomic, 251–254
Macintosh OS X, standard C compiler, 19
main()
routine
2D texture memory, 133–136
CPU hash table implementation, 266–267
CPU histogram computation, 171
dot product computation, 81–84
dot product computation with atomic locks, 255–256
GPU hash table implementation, 273–275
GPU histogram computation, 173
GPU ripple using threads, 69–70
GPU vector sums, 41–42
graphics interoperability, 144
page-locked host memory application, 190–192
ray tracing on GPU, 99–100
ray tracing with constant memory, 104–106
shared memory bitmap, 90
single CUDA streams, 193–194
zero-copy memory dot product, 220–222
malloc()
cudaHostAlloc()
versus, 186, 190
cudaMalloc()
versus, 26
ray tracing on GPU, 100
mammograms, CUDA applications for medical imaging, 9
maxThreadsPerBlock
field, device properties, 63
media and communications processors (MCPs), 223
medical imaging, CUDA applications for, 8–9
memcpy()
, C language, 27
memory
allocating device. see cudaMalloc()
constant. see constant memory
CUDA Architecture creating access to, 7
early days of GPU computing, 6
executing device code that uses allocated, 70
freeing. see cudaFree()
; free()
, C language
GPU histogram computation, 173–174
page-locked host (pinned), 186–192
querying devices, 27–33
shared. see shared memory
texture. see texture memory
use of term in this book, 23
Memory Checker, CUDA, 242
memset()
, C language, 174
Microsoft Windows, Visual Studio C compiler, 18–19
Microsoft.NET, 247
multicore revolution, evolution of CPUs, 3
multiplication, in vector dot products, 76
multithreaded hash tables, 267–268
mutex
, GPU lock function, 252–254
nForce media and communications processors (MCPs), 222–223
NVIDIA
compute capability of various GPUs, 164–167
creating 3D graphics for consumers, 5
creating CUDA C for GPU, 7
creating first GPU built with CUDA Architecture, 7
CUBLAS library, 239–240
CUDA-enabled graphics processors, 14–16
CUDA-GDB debugging tool, 241–242
CUFFT library, 239
device driver, 16
GPU Computing SDK download, 18, 240–241
Parallel NSight debugging tool, 242
Performance Primitives, 241
products containing multiple GPUs, 224
Visual Profiler, 243–244
NVIDIA CUDA Programming Guide, 31
offset
, 2D texture memory, 133
on-chip caching. see constant memory; texture memory
one-dimensional blocks
GPU sums of a longer vector, 63
two-dimensional blocks versus, 44
online resources. see resources, online
OpenGL
creating GPUAnimBitmap
, 148–152
in early days of GPU computing, 5–6
generating image data with kernel, 139–142
interoperation, 142–147
writing 3D graphics, 4
operations, atomic, 168–170
optimization, incorrect dot product, 87–90
page-locked host memory
allocating as portable pinned memory, 230–235
overview of, 186–187
restricted use of, 187
single CUDA streams with, 195–197
parallel blocks
GPU Julia Set, 51
GPU vector sums, 45
parallel blocks, splitting into threads
GPU sums of arbitrarily long vectors, 65–69
GPU sums of longer vector, 63–65
GPU vector sums using threads, 61–63
overview of, 60
vector sums, 60–61
Parallel NSight debugging tool, 242
parallel processing
evolution of CPUs, 2–3
past perception of, 1
parallel programming, CUDA
CPU vector sums, 39–41
example, CPU Julia Set application, 47–50
example, GPU Julia Set application, 50–57
example, overview, 46–47
GPU vector sums, 41–46
overview of, 38
summary review, 56
summing vectors, 38–41
parameter passing, 24–27, 40, 72
PC gaming, 3D graphics for, 4–5
PCI Express slots, adding multiple GPUs to, 224
performance
constant memory and, 106–107
evolution of CPUs, 2–3
hash table, 276
launching kernel for GPU histogram computation, 176–177
measuring with events, 108–114
page-locked host memory and, 187
zero-copy memory and, 222–223
pinned memory
allocating as portable, 230–235
cudaHostAllocDefault
flag, getting default pinned memory, 214
as page-locked memory. see page-locked host memory
pixel buffer objects (PBO), OpenGL, 142–143
pixel shaders, early days of GPU computing, 5–6
pixels, number of threads per block, 70–74
portable computing devices, 2
Programming Massively Parallel Processors: A Hands-on Approach (Kirk, Hwu), 244
properties
cudaDeviceProp
structure. see cudaDeviceProp
structure
maxThreadsPerBlock
field for device, 63
reporting device, 31
using device, 33–35
PyCUDA project, 246–247
Python language wrappers for CUDA C, 246
querying, devices, 27–33
rasterization, 97
ray tracing
concepts behind, 96–98
with constant memory, 104–106
on GPU, 98–104
measuring performance, 110–114
read-modify-write operations
atomic operations as, 168–170, 251
using atomic locks, 251–254
read-only memory. see constant memory; texture memory
reductions
dot products as, 83
overview of, 250
shared memory and synchronization for, 79–81
references, texture memory, 126–127, 131–137
registration
bufferObj
with cudaGraphicsGLRegisterBuffer()
, 151
callback, 149
rendering, GPUs performing complex, 139
resource
variable
creating GPUAnimBitmap
, 148–152
graphics interoperation, 141
resources, online
CUDA code, 246–248
CUDA Toolkit, 16
CUDA University, 245
CUDPP, 246
CULAtools, 246
Dr. Dobb’s CUDA, 246
GPU Computing SDK code samples, 18
language wrappers, 246–247
NVIDIA device driver, 16
NVIDIA forums, 246
standard C compiler for Mac OS X, 19
Visual Studio C compiler, 18
resources, written
CUDA U, 245–246
forums, 246
Programming Massively Parallel Processors, 244–245
ripple, GPU
with graphics interoperability, 147–154
producing, 69–74
routine()
allocating portable pinned memory, 232–234
using multiple CPUs, 226–228
Russian nesting doll hierarchy, 164
scalable link interface (SLI), adding multiple GPUs with, 224
scale
factor, CPU Julia Set, 49
scientific computations, in early days, 6
screenshots
animated heat transfer simulation, 126
GPU Julia Set example, 57
GPU ripple example, 74
graphics interoperation example, 147
ray tracing example, 103–104
rendered with proper synchronization, 93
rendered without proper synchronization, 92
shading languages, 6
shared data buffers, kernel/OpenGL rendering interoperation, 142
shared memory
bitmap, 90–93
CUDA Architecture creating access to, 7
dot product, 76–87
dot product optimized incorrectly, 87–90
and synchronization, 75
Silicon Graphics, OpenGL library, 4
simulation
animation of, 121–125
challenges of physical, 117
computing temperature updates, 119–121
simple heating model, 117–118
SLI (scalable link interface), adding multiple GPUs with, 224
spatial locality
designing texture caches for graphics with, 116
heat transfer simulation animation, 125–126
split parallel blocks. see parallel blocks, splitting into threads
standard C compiler
compiling for minimum compute capability, 167–168
development environment, 18–19
kernel call, 23–24
start
event, 108–110
start_thread()
, multiple CPUs, 226–227
stop
event, 108–110
streams
CUDA, overview of, 192
CUDA, using multiple, 198–205, 208–210
CUDA, using single, 192–198
GPU work scheduling and, 205–208
overview of, 185–186
page-locked host memory and, 186–192
summary review, 211
supercomputers, performance gains in, 3
surfactants, environmental devastation of, 10
synchronization
of events. see cudaEventSynchronize()
of threads, 219
synchronization, and shared memory
dot product, 76–87
dot product optimized incorrectly, 87–90
overview of, 75
shared memory bitmap, 90–93
__syncthreads()
dot product computation, 78–80, 85
shared memory bitmap using, 90–93
unintended consequences of, 87–90
task parallelism, CPU versus GPU applications, 185
TechniScan Medical Systems, CUDA applications, 9
temperatures
computing temperature updates, 119–121
heat transfer simulation, 117–118
heat transfer simulation animation, 121–125
Temple University research, CUDA applications, 10–11
tex1Dfetch()
compiler intrinsic, texture memory, 127–128, 131–132
tex2D()
compiler intrinsic, texture memory, 132–133
texture, early days of GPU computing, 5–6
texture memory
animation of simulation, 121–125
defined, 115
overview of, 115–117
simulating heat transfer, 117–121
summary review, 137
two-dimensional, 131–137
using, 125–131
threadIdx
variable
2D texture memory, 132–133
dot product computation, 76–77, 85
dot product computation with atomic locks, 255–256
GPU hash table implementation, 272
GPU Julia Set, 52
GPU ripple using threads, 72–73
GPU sums of a longer vector, 63–64
GPU sums of arbitrarily long vectors, 66–67
GPU vector sums using threads, 61
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
multiple CUDA streams, 200
ray tracing on GPU, 102
setting up graphics interoperability, 145
shared memory bitmap, 91
temperature update computation, 119–121
zero-copy memory dot product, 221
threads
coding with, 38–41
constant memory and, 106–107
GPU ripple using, 69–74
GPU sums of a longer vector, 63–65
GPU sums of arbitrarily long vectors, 65–69
GPU vector sums using, 61–63
hardware limit to number of, 63
histogram kernel using global memory atomics, 179–181
incorrect dot product optimization and divergence of, 89
multiple CPUs, 225–229
overview of, 59–60
ray tracing on GPU and, 102–104
read-modify-write operations, 168–170
shared memory and. see shared memory
summary review, 94
synchronizing, 219
threadsPerBlock
allocating shared memory, 76–77
dot product computation, 79–87
three-dimensional blocks, GPU sums of a longer vector, 63
three-dimensional graphics, history of GPUs, 4–5
three-dimensional scenes, ray tracing producing 2-D image of, 97
tid
variable
blockIdx.x
variable assigning value of, 44
checking that it is less than N
, 45–46
dot product computation, 77–78
parallelizing code on multiple CPUs, 40
time, GPU ripple using threads, 72–74
timer, event. see cudaEventElapsedTime()
Toolkit, CUDA, 16–18
two-dimensional blocks
arrangement of blocks and threads, 64
GPU Julia Set, 51
GPU ripple using threads, 70
gridDim
variable as, 63
one-dimensional indexing versus, 44
two-dimensional display accelerators, development of GPUs, 4
two-dimensional texture memory
defined, 116
heat transfer simulation, 117–118
overview of, 131–137
ultrasound imaging, CUDA applications for, 9
unified shader pipeline, CUDA Architecture, 7
university, CUDA, 245
values
CPU hash table implementation, 261–267
GPU hash table implementation, 269–275
hash table concepts, 259–260
vector dot products. see dot product computation
vector sums
CPU, 39–41
GPU, 41–46
GPU sums of arbitrarily long vectors, 65–69
GPU sums of longer vector, 63–65
GPU sums using threads, 61–63
verify_table()
, GPU hash table, 270
Visual Profiler, NVIDIA, 243–244
Visual Studio C compiler, 18–19
warps, reading constant memory with, 106–107
while()
loop
CPU vector sums, 40
GPU lock function, 253
work scheduling, GPU, 205–208
zero-copy memory
allocating/using, 214–222
defined, 214
performance, 222–223