22.9 Results
The described method has been implemented and tested on two different machines:
- A desktop PC with an Nvidia GeForce GTS 250 (1 GB VRAM) and an Intel Core i5 processor.
- A laptop PC with an Nvidia Quadro FX 360M (128 MB VRAM) and an Intel Core 2 Duo processor.
We collected performance times for each GPU computing platform, varying the numbers of particles and springs from a grid resolution of 32 × 32 (1,024 particles and 11,412 springs) up to 256 × 256 (65,536 particles and approximately 700,000 springs). Numerical results are collected in the plots in Figures 22.5 and 22.6.
From the data plotted in Figures 22.5 and 22.6, the computing superiority of the GPU over the CPU is evident. This is mainly because this cloth simulation algorithm, like most particle-based approaches, is strongly parallelizable. While the computational cost on the CPU keeps growing linearly with the number of particles, the computation time on the GPU remains relatively low because the particle dynamics are computed in parallel. On the GTS 250 device, this leads to a performance gain ranging from 10 to 40 times, depending on the number of particles.
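As a rough illustration of this per-particle parallelism, the following CUDA sketch (not the chapter's actual kernel; the names and the unit-mass assumption are ours) advances every particle in its own thread with a Verlet step. Because each thread handles exactly one particle, the cost per time step stays nearly flat until the particle count exceeds the number of available cores.

#include <cuda_runtime.h>

// One thread integrates one particle (position-based Verlet, unit mass).
__global__ void verletStep(const float4 *posOld, const float4 *posCur,
                           float4 *posNew, const float4 *force,
                           float dt, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 pc = posCur[i];
    float4 po = posOld[i];
    float4 f  = force[i];

    posNew[i] = make_float4(2.0f * pc.x - po.x + f.x * dt * dt,
                            2.0f * pc.y - po.y + f.y * dt * dt,
                            2.0f * pc.z - po.z + f.z * dt * dt,
                            1.0f);
}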
It is interesting to note that, in this case, GLSL performs much better than CUDA. This can be explained by considering how memory is accessed by the GPU kernels. In the GLSL fragment program, images are employed to store particle data in texture memory, while in CUDA and OpenCL this data is stored in the global memory of the device. Texture memory has two main advantages [Nvidia 2010]. First, it is cached, and thus video memory is accessed only on a cache miss. Second, it is organized to optimize access to 2D-local data, which is exactly our access pattern: each particle corresponds to a pixel and must access the positions of its neighbors, which are stored in the immediately adjacent texture pixels. Furthermore, the GLSL results are stored in color render targets that are directly mapped to VBOs and drawn on the screen; the data resides in video memory and never needs to be copied between different memory areas. This makes the entire process extremely fast compared with the other approaches.
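To make the 2D-locality argument concrete, here is a small CUDA sketch of the same neighbor-access pattern going through a 2D texture. It is our illustration, not the chapter's code, and it uses the texture-object API of more recent CUDA versions (point sampling, unnormalized coordinates, and clamp addressing are assumed when the texture object is created). Adjacent threads read adjacent texels, so most fetches are served by the texture cache rather than by video memory.

#include <cuda_runtime.h>

__global__ void fetchNeighborsTextured(cudaTextureObject_t posTex,
                                       float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Each particle maps to one texel; its neighbors sit in the adjacent
    // texels, so these five reads are 2D-local and mostly hit the cache.
    float4 center = tex2D<float4>(posTex, x,     y);
    float4 left   = tex2D<float4>(posTex, x - 1, y);
    float4 right  = tex2D<float4>(posTex, x + 1, y);
    float4 up     = tex2D<float4>(posTex, x,     y - 1);
    float4 down   = tex2D<float4>(posTex, x,     y + 1);

    // Placeholder combination of the fetched positions.
    out[y * width + x] = make_float4(0.2f * (center.x + left.x + right.x + up.x + down.x),
                                     0.2f * (center.y + left.y + right.y + up.y + down.y),
                                     0.2f * (center.z + left.z + right.z + up.z + down.z),
                                     1.0f);
}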
The plots also highlight the lower performance of OpenCL compared with CUDA. This difference is caused by the fact that it has been rather difficult to tune the number of global and local work items, for reasons that require further investigation.
Figure 22.5. Computation times measured on different computation platforms using a
GeForce GTS 250 device (16 computing units, 128 CUDA cores).
Figure 22.6. Computation times measured on different computation platforms using a
Quadro FX 360M device (2 computing units, 16 CUDA cores).
[Figures 22.5 and 22.6 are bar charts of the measured computation time in milliseconds for the CPU, GLSL, OpenCL, and CUDA implementations at four grid resolutions: 1,024 particles / 11,412 springs; 4,096 particles / 47,380 springs; 16,384 particles / 193,044 springs; and 65,536 particles / 779,284 springs.]
OpenCL is a very young standard, and both the specification and
the driver implementation are likely to change in the near future in order to avoid
such instabilities.
The GLSL program works on relatively old hardware and, unlike CUDA, it does not require Nvidia hardware. CUDA, on the other hand, is a more flexible architecture that has been specifically devised for general computing tasks (not only graphics, like GLSL); it is easier to debug and provides access to hardware resources such as shared memory, allowing for a further performance boost. OpenCL offers the same features as CUDA, but its implementations are still rather immature and harder to debug. However, unlike CUDA, it has been devised to run on the widest possible range of hardware platforms (including consoles and mobile phones), not only Nvidia ones, and thus it is the main candidate for becoming the reference platform for GPGPU in the near future.
The main effort when dealing with GPGPU lies in the design of the algorithm. The challenging task that researchers and developers currently face is how to redesign algorithms originally conceived to run serially on the CPU so that they become parallel and thus suitable for the GPU. The main disadvantage of particle-based methods is that they require a very large number of particles to obtain realistic results. However, it is relatively easy to parallelize algorithms that handle particle systems, and the massively parallel computation capabilities of modern GPUs now make it possible to simulate large systems at interactive rates.
22.10 Future Work
Our algorithm for cloth simulation can be improved in many ways. In the CUDA and OpenCL implementations, it would be interesting to exploit shared memory, which should reduce the number of global-memory accesses and lead to improved performance.
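A minimal sketch of this idea follows; it is not the chapter's kernel, and the tile size, array layout, and boundary handling are illustrative only. Each block stages its particles, plus a one-element halo, in shared memory so that the neighbor positions needed during spring evaluation are read from global memory only once per block. The sketch assumes the grid dimensions are multiples of the tile size.

#include <cuda_runtime.h>

#define TILE 16   // block size; width and height are assumed to be multiples of TILE

__global__ void relaxWithSharedMemory(const float4 *pos, float4 *posOut,
                                      int width, int height)
{
    // One extra row/column on each side holds the halo of neighbor particles.
    __shared__ float4 tile[TILE + 2][TILE + 2];

    int x  = blockIdx.x * TILE + threadIdx.x;
    int y  = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1;      // local coordinates, shifted past the halo
    int ly = threadIdx.y + 1;

    // Every thread loads its own particle; border threads also load the halo,
    // clamped at the edges of the particle grid.
    tile[ly][lx] = pos[y * width + x];
    if (threadIdx.x == 0)        tile[ly][0]        = pos[y * width + max(x - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[ly][TILE + 1] = pos[y * width + min(x + 1, width - 1)];
    if (threadIdx.y == 0)        tile[0][lx]        = pos[max(y - 1, 0) * width + x];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][lx] = pos[min(y + 1, height - 1) * width + x];
    __syncthreads();

    // The four structural neighbors now come from shared memory,
    // not from repeated global-memory reads.
    float4 p     = tile[ly][lx];
    float4 left  = tile[ly][lx - 1];
    float4 right = tile[ly][lx + 1];
    float4 up    = tile[ly - 1][lx];
    float4 down  = tile[ly + 1][lx];

    // Placeholder update; the real kernel would evaluate springs and integrate here.
    posOut[y * width + x] = make_float4(0.25f * (left.x + right.x + up.x + down.x),
                                        0.25f * (left.y + right.y + up.y + down.y),
                                        0.25f * (left.z + right.z + up.z + down.z),
                                        p.w);
}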
For future research, we would like to investigate ways to generalize this algorithm by introducing connectivity information [Tejada 2005] that stores the indices of the neighbors of each particle. This data can be stored in constant memory to hide, as much as possible, the latency that its use would inevitably introduce. By using connectivity, it would be possible to simulate deformable, breakable objects with arbitrary shapes, not only rectangular pieces of cloth.
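A minimal sketch of how such connectivity tables might look in CUDA is given below. The sizes, names, and the simple Hooke-spring force are placeholders, and the mesh must be small enough for both tables to fit within the 64 KB of constant memory available on current devices. On the host side, the tables would be uploaded once, before the simulation loop, with cudaMemcpyToSymbol.

#include <cuda_runtime.h>
#include <math.h>

#define MAX_PARTICLES 1024   // illustrative upper bound; both tables must fit in 64 KB
#define MAX_NEIGHBORS 4

__constant__ int   d_neighbors[MAX_PARTICLES][MAX_NEIGHBORS];    // -1 marks an empty slot
__constant__ float d_restLengths[MAX_PARTICLES][MAX_NEIGHBORS];

__global__ void accumulateSpringForces(const float4 *pos, float4 *force,
                                       float stiffness, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 p = pos[i];
    float3 f = make_float3(0.0f, 0.0f, 0.0f);

    for (int n = 0; n < MAX_NEIGHBORS; ++n) {
        int j = d_neighbors[i][n];
        if (j < 0) continue;                 // unused neighbor slot

        // Hooke spring between particle i and its neighbor j.
        float4 q   = pos[j];
        float3 d   = make_float3(q.x - p.x, q.y - p.y, q.z - p.z);
        float  len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
        if (len < 1e-6f) continue;

        float s = stiffness * (len - d_restLengths[i][n]) / len;
        f.x += s * d.x;
        f.y += s * d.y;
        f.z += s * d.z;
    }
    force[i] = make_float4(f.x, f.y, f.z, 0.0f);
}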
22.11 Demo
An implementation of the GPU cloth simulation is provided on the website, and
it includes both the source code in C++ and the Windows binaries. The demo
allows you to switch among the computing platforms at run time, and it includes
a hierarchical profiler. Even though the source code has been developed for Windows using Visual Studio 2008, it has been written with cross-platform compatibility in mind, without using any Windows-specific commands, so it should compile and run on *nix platforms (Mac and Linux). The demo requires a machine capable of running Nvidia CUDA, and the CUDA Computing SDK 3.0 must have been compiled beforehand. A video is also included on the website.
Acknowledgements
The shader used for rendering the cloth is “fabric plaid” from RenderMonkey 1.82 by
AMD and 3DLabs. The author is grateful to Professor Ingemar Ragnemalm for having
introduced him to the fascinating world of GPGPU.
References
[Müller 2008] Matthias Müller, Jos Stam, Doug James, and Nils Thürey. "Real Time Physics." ACM SIGGRAPH 2008 Course Notes. Available at http://www.matthiasmueller.info/realtimephysics/index.html.
[Nvidia 2010] "NVIDIA CUDA Best Practices Guide," Version 3.0, 2010. Available at http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf.
[Tejada 2005] Eduardo Tejada and Thomas Ertl. "Large Steps in GPU-Based Deformable Bodies Simulation." Simulation Modelling Practice and Theory 13:8 (November 2005), pp. 703–715.