22.9 Results
The described method has been implemented and tested on two different machines:
■ A desktop PC with an Nvidia GeForce GTS 250 (1 GB VRAM) and an Intel Core i5 processor.
■ A laptop PC with an Nvidia Quadro FX 360M (128 MB VRAM) and an Intel Core 2 Duo processor.
We collected performance times for each GPU computing platform, varying the
numbers of particles and springs, from a grid resolution of 32 × 32 (1,024 particles
and 11,412 springs) to 256 × 256 (65,536 particles and approximately 700,000
springs). Numerical results are collected in the plots in Figures 22.5 and 22.6.
From the data plotted in Figures 22.5 and 22.6, the computing superiority of
the GPU over the CPU is evident. This is mainly because this cloth simulation
algorithm, like most particle-based approaches, is strongly parallelizable. While
the computational cost on the CPU grows linearly with the number of particles,
the computation time on the GPU remains relatively low because the particle
dynamics are computed in parallel. On the GTS 250 device, this leads to a
performance gain ranging from 10 to 40 times, depending on the number of
particles.
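The per-particle parallelism behind this scaling can be sketched as a CUDA kernel in which each thread integrates one particle independently. This is a minimal illustrative sketch assuming explicit Verlet integration with unit mass; the kernel and buffer names are hypothetical and not taken from the chapter's implementation:

```cuda
// One thread per particle: position update via Verlet integration.
// pos, prevPos, and force are device buffers of length numParticles.
__global__ void integrateParticles(float4 *pos, float4 *prevPos,
                                   const float4 *force,
                                   int numParticles, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 p  = pos[i];
    float4 pp = prevPos[i];

    // x(t+dt) = 2 x(t) - x(t-dt) + a dt^2  (unit mass assumed)
    float4 next;
    next.x = 2.0f * p.x - pp.x + force[i].x * dt * dt;
    next.y = 2.0f * p.y - pp.y + force[i].y * dt * dt;
    next.z = 2.0f * p.z - pp.z + force[i].z * dt * dt;
    next.w = 1.0f;

    prevPos[i] = p;   // current position becomes the previous one
    pos[i]     = next;
}

// Launch with one thread per particle, e.g.:
// integrateParticles<<<(n + 255) / 256, 256>>>(pos, prev, force, n, dt);
```

Because every particle is handled by its own thread, adding particles mostly adds more concurrent work rather than more sequential steps, which is why the GPU curves in the plots grow far more slowly than the CPU curve.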
It is interesting to note that, in this case, GLSL performs much better than
CUDA. This can be explained by considering how memory is accessed by the
GPU kernels. In the GLSL fragment program, images are employed to store
particle data in texture memory, while in CUDA and OpenCL this data is stored
in the global memory of the device. Texture memory has two
main advantages [Nvidia 2010]. First, it is cached, and thus, video memory is
accessed only if there is a cache miss. Second, it is built to optimize access to
2D local data, which is exactly the access pattern here: each particle corresponds
to a pixel and must read the positions of its neighbors, which are stored in the
immediately adjacent texture pixels. Furthermore, the results in
GLSL are stored in the color render targets that are then directly mapped to
VBOs and drawn on the screen. The data resides in video memory and does not
need to be copied between different memory areas. This makes the entire process
extremely fast compared with the other approaches.
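The difference between the two access paths can be illustrated with a short CUDA fragment that reads a grid neighbor either from global memory or through the texture cache. This is a hedged sketch: the function names are illustrative, and the texture-object API shown (`tex2D<float4>` on a `cudaTextureObject_t`) requires Kepler-class or newer hardware; on older devices such as the GTS 250, the equivalent mechanism was texture references, but the caching behavior is the same:

```cuda
// Reading a particle's right-hand neighbor in a W x H particle grid.

// Global-memory version: the fetch goes straight to device memory.
// Vertical neighbors are W elements apart, so a stencil of reads
// around each particle exploits no cache on this path.
__device__ float4 neighborGlobal(const float4 *pos, int x, int y, int W)
{
    return pos[y * W + (x + 1)];
}

// Texture version: tex2D goes through the texture cache, which is
// optimized for 2D-local access, matching the way each particle
// reads its immediately adjacent grid neighbors.
__device__ float4 neighborTexture(cudaTextureObject_t posTex, int x, int y)
{
    // +0.5f offsets sample texel centers with unnormalized coordinates.
    return tex2D<float4>(posTex, (x + 1) + 0.5f, y + 0.5f);
}
```

In the GLSL path this 2D-cached access comes for free, since particle data lives in textures and the fragment shader samples them directly, with the render-target output then mapped to VBOs without any extra copy.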
The plots also highlight the lower performance of OpenCL compared with
CUDA. This difference arises because it has been rather difficult to tune the
number of global and local work items due to causes requiring further