Chapter 10. OpenCL Case Study

Mixed Particle Simulation
Justin Hensley and Takahiro Harada
Advanced Micro Devices, Incorporated, Sunnyvale, California
This chapter discusses a mixed particle simulation.
Keywords: example program, OpenCL, particle simulation, zero copy (buffer sharing)

Introduction

This case study examines leveraging both the CPU and the GPU to implement a mixed-sized particle simulation. Without loss of generality, this example is restricted to two dimensions to keep the kernels easily understandable. In addition, this case study leverages OpenCL's ability to share data between the CPU and the GPU, which is beneficial on highly integrated devices such as AMD's APUs.
Implementing a simple simulation with uniform particle sizes on a data-parallel device such as a GPU is relatively straightforward in OpenCL. Unfortunately, the efficiency of the simulation decreases if that assumption breaks (i.e., the particle sizes vary). The inefficiency stems from the non-uniform granularity of the computation, especially in collision detection, which is the most expensive part of a particle simulation. If there is a single large particle and many small particles, the number of collisions detected on the large particle can be far greater than the number detected on any small particle. This difference causes inefficiency because most GPUs execute in a SIMD manner over wide vectors. If a GPU SIMD engine computes small particles on most lanes and a large particle on a single lane, the "large" lane runs for longer than the "small" lanes, so the lanes computing collisions for small particles have to wait for the large lane to finish. The result is load imbalance and poor utilization of the hardware, which worsens as the vector width increases. The CPU, in contrast, has no problem dealing with this variation in the granularity of computation. Accordingly, the approach presented here uses the CPU for collisions involving large particles, whereas the GPU is used only for collisions among the uniformly sized small particles. This chapter discusses how to realize this collaboration between the GPU and the CPU for a mixed-sized particle simulation using OpenCL.

Overview of the Computation

Figure 10.1 is a screen shot from the simulation. You can see a large number of small particles rendered in blue interacting with varying-sized particles rendered in green and red. The green and the red colors represent dynamic and static particles, respectively. The collisions between the particles are classified into three types: small–small, large–small, and large–large. The granularity of the small–small collision is uniform, so that work is dispatched to the GPU. However, as discussed previously, the granularity of large–small and large–large collisions varies, so the CPU is better suited to perform these computations. If there are no large–small interactions, the two simulations are completely disjoint: the GPU simulates the small particles and the CPU simulates the large particles. The mixed-sized particle simulation is implemented as an extension of these two disjoint computations. The challenge is to efficiently realize the interaction between the simulations, because the physical properties of small particles live in GPU memory whereas those of large particles live in CPU memory. If the memories are allocated without care, the data on the GPU has to be read back to the host every time the interaction is calculated. By using OpenCL buffers that are shared between and accessed by both the CPU and the GPU, the read-back is no longer necessary.
Figure 10.1
A screen shot from the mixed particle simulation. (Please see front cover of the book)
Figure 10.2 shows an overview of the algorithm and the interaction between the GPU manager thread and the CPU work thread, both of which run on the host CPU. There are three major phases of computation per iteration: (1) data structure construction, (2) computation of forces resulting from collisions, and (3) force integration.
Figure 10.2
Overview of the computation for a single iteration.
First, the data structure used to accelerate collision detection is built by a GPU kernel. Then the GPU and CPU are synchronized to guarantee that construction of the new data structure is complete and that the updated data and physical properties of the small particles can be shared with the CPU thread. After this first synchronization, both threads start processing collisions: the GPU performs the small–small collisions, and the CPU performs the large–small and large–large collisions. To accelerate the collision detection of small particles against large particles, the data structure built on the GPU is also used on the CPU. The output from the collisions is a set of forces on particles. Two force buffers are prepared for the small particles: one is filled with the forces resulting from the small–small collisions, and the other is shared with the CPU, which fills it with the forces resulting from the large–small collisions. After the collisions are complete, the GPU and CPU threads are synchronized again so that the CPU is known to have finished writing to the shared force buffer, which is used in the integration stage. Then, the GPU integrates forces on the small particles, and the CPU integrates forces on the large particles. Note that there are no write conflicts between the CPU and the GPU on the shared buffer: the CPU writes to it only during the large–small collision phase, and the GPU accesses it only during the integration after the synchronization. A minimal sketch of this two-thread structure follows.
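The chapter does not show the host threading code, so the following is a minimal sketch of the two-thread, two-barrier structure of Figure 10.2, assuming POSIX threads. All of the work functions are illustrative stubs standing in for the kernel launches and CPU routines described above; none of these names come from the chapter's actual source.

#include <pthread.h>

#define NUM_ITERATIONS 4   /* iterations per rendered frame, as in the text */

static pthread_barrier_t sync_point;  /* shared by both host threads */

/* Illustrative stubs; in the real code these enqueue OpenCL kernels
   or run the scalar CPU collision routines. */
static void buildUniformGrid( void )  {}
static void collideSmallSmall( void ) {}
static void integrateSmall( void )    {}
static void collideLargeSmall( void ) {}
static void collideLargeLarge( void ) {}
static void integrateLarge( void )    {}

static void *gpuManagerThread( void *arg )
{
    for( int it = 0; it < NUM_ITERATIONS; it++ ) {
        buildUniformGrid();                   /* kernels + clFinish()       */
        pthread_barrier_wait( &sync_point );  /* sync 1: grid is visible    */
        collideSmallSmall();
        pthread_barrier_wait( &sync_point );  /* sync 2: CPU forces written */
        integrateSmall();
    }
    return 0;
}

static void *cpuWorkThread( void *arg )
{
    for( int it = 0; it < NUM_ITERATIONS; it++ ) {
        pthread_barrier_wait( &sync_point );  /* sync 1 */
        collideLargeSmall();                  /* fills the shared force buffer */
        collideLargeLarge();
        pthread_barrier_wait( &sync_point );  /* sync 2 */
        integrateLarge();
    }
    return 0;
}

int main( void )
{
    pthread_t gpu, cpu;
    pthread_barrier_init( &sync_point, 0, 2 );
    pthread_create( &gpu, 0, gpuManagerThread, 0 );
    pthread_create( &cpu, 0, cpuWorkThread, 0 );
    pthread_join( gpu, 0 );
    pthread_join( cpu, 0 );
    pthread_barrier_destroy( &sync_point );
    return 0;
}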

GPU Implementation

Buffer Creation

First, the buffers have to be created so that the runtime has the option of sharing data between the GPU and the CPU. A shared buffer can be created with the CL_MEM_ALLOC_HOST_PTR flag, which indicates that the CPU will access the buffer. If the buffer is created with this flag, the runtime must return a location in host memory where the data can be accessed between map and unmap function calls. The runtime has the option of maintaining the data in host memory permanently (although this is not a requirement).¹ If the data is located permanently in host memory, the map and unmap operations applied to the buffer introduce little execution overhead, requiring only synchronization overhead. However, this benefit does not come for free on a discrete GPU. Because the memory is stored in CPU memory, whenever the GPU accesses the buffer the request must traverse the PCI Express bus, which can result in larger latency compared with memory allocated on the GPU. On AMD's Fusion APU architecture and similar heterogeneous multiprocessors-on-a-chip where the GPU and the CPU share memory, we avoid the overhead of the PCI Express bus, and removing this bottleneck leads to better performance. The following code shows an allocation of a shared buffer. Note that we are passing the CL_MEM_ALLOC_HOST_PTR flag as an argument:
¹Note that with CL_MEM_ALLOC_HOST_PTR the buffer is allocated from host-accessible memory, and the pointer returned by the map operation refers to that memory. The data is still only valid through that pointer between calls to map and unmap.
buffer = clCreateBuffer(
    context,
    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
    bufferSize,
    0,        /* no host pointer is passed with CL_MEM_ALLOC_HOST_PTR */
    &status );
Although this buffer is created to be shared between the CPU and the GPU, we cannot access the data from the buffer pointer because the memory space is not unified in OpenCL. To access the data from the CPU, the buffer has to be mapped:
ptr = clEnqueueMapBuffer(
    commandQueue,
    buffer,
    CL_TRUE,  /* blocking map: ptr is valid as soon as the call returns */
    CL_MAP_READ | CL_MAP_WRITE,
    0,
    bufferSize,
    0,
    0,
    0,
    &status );
The clEnqueueMapBuffer call returns a pointer to the buffer data. If the buffer was created without the CL_MEM_ALLOC_HOST_PTR flag, mapping it forces a copy from the device to the host, making the map operation expensive. However, when a buffer is created with the flag and combined with appropriate runtime and hardware support, the buffer can live in host memory and the map is almost free.
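When the CPU has finished accessing the data, the pointer is released with clEnqueueUnmapMemObject; buffer and ptr below are the objects from the two listings above:

clEnqueueUnmapMemObject(
    commandQueue,
    buffer,
    ptr,
    0,
    0,
    0 );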

Building the Acceleration Structure

To accelerate the collision detection of particles, we use a uniform grid (a grid with a spatially invariant cell size) with a fixed capacity for each cell. Fast build and query times make a uniform grid well suited for collision detection in a particle simulation, provided the simulation does not have to extend over unbounded space. The uniform grid has two buffers: a counter buffer storing per-cell counters and an index buffer storing particle indices (the cell data). A work-item assigned to a particle fetches the particle position from global memory to a register and converts the world-space position to a grid-space coordinate. The work-item reads from the counter buffer the number of particles already stored in the cell and adds the particle index to the index buffer. Care must be taken when accessing the counter because several particles can fall into the same cell, in which case several work-items try to read and modify the same counter. To guarantee correctness of this memory access, we use an atomic operation, as shown here:
for( all particles ) {
    idx = convertToGridIdx( particle[i] );
    count = atomic_inc( &grid_count[idx] );
    grid[ calcAddr( idx, count ) ] = i;
}
The full listing under Kernel for Uniform Grid Creation has some additional logic to discard particles that fall outside the grid and to prevent too many particles from being stored in a single cell.

Computing Collisions

This step calculates the force on each particle resulting from the colliding particles. The uniform grid is used to efficiently locate the particles colliding with a given particle. The particle position is converted to a grid coordinate, which is used to calculate the address of the grid cell in memory. Because the size of each cell is defined to be equal to the diameter of the particles, potential colliding particles can only be in the 3² = 9 grid cells (3³ = 27 in three dimensions) centered on the cell to which the particle belongs. Note that the granularity of the computation is uniform because the particles being processed have a uniform diameter, so all work-items check nine cells without exception. A work-item iterates over the nine cells and, for each, reads the number of stored particles and their indices to compute the force on its particle. This force is a function of the colliding particles' positions and velocities. After the interparticle collisions, each work-item checks for collisions between its particle and the boundary. These forces are accumulated and then written to a force buffer.
for( all particles ) {
    force = 0;
    // compute particle interaction
    for( neighborhood ) {
        for( particles in grid cell[n] ) {
            j = getStoredParticleIdx();
            force += calcForce( particle[i], particle[j] );
        }
    }
    // compute boundary interaction
    force += calcForce( particle[i], top );
    force += calcForce( particle[i], bottom );
    force += calcForce( particle[i], left );
    force += calcForce( particle[i], right );
    forceOut[i] = force;
}

Integration

After the forces are computed on the CPU and GPU, a kernel is used to update the particles’ positions and velocities via a process called integration. The velocity is computed as the sum of the forces resulting from the small–small particle interactions and the forces resulting from the large–small particle interactions. The particle position is updated based on the resulting particle velocity multiplied by the time step, dt. Finally, the buffer used to share force data with the CPU is zeroed for the next iteration. This kernel pseudo-code is fairly simple:
for( all particles ) {
    vel[i] += forceSmallSmall[i] + forceLargeSmall[i];
    pos[i] += vel[i]*dt;
    // zero the shared force buffer for the CPU to fill next iteration
    forceLargeSmall[i] = 0;
}

CPU Implementation

The CPU implementation does not differ much from the GPU version, except that it runs in a scalar rather than a vector fashion. The collisions of particles are serialized, so we do not have to worry about the granularity. The collision of a large particle with small particles is performed using the uniform grid that was built on the GPU and shared with the CPU. The routine first calculates the extent of the large particle in grid space and then looks up all of the cells within that extent; a host-side sketch of this lookup follows Figure 10.3. Figure 10.3 illustrates the overlapping cells for a large particle. As the reader can see from this figure, the large particle overlaps many cells, whereas a small particle does not. If this code were executed on the GPU, it would waste many cycles as SIMD lanes were masked out.
Figure 10.3
Small and large particles as mapped to the uniform grid.
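As a host-side sketch of that lookup, the CPU thread clamps the large particle's bounding box to the grid and walks every overlapped cell in scalar code. The structure mirrors the GPU kernels, but the types, the row-major cell indexing, and the accumulateForce helper are illustrative assumptions, not the chapter's actual source:

#define MAX_IDX_PER_GRID 8             /* must match the GPU-side capacity */
#define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : ((v) > (hi) ? (hi) : (v)))

typedef struct { float x, y; } Vec2;   /* minimal 2D types for the sketch  */

/* Hypothetical force helper: accumulates the large-small pair force into
   the force buffer shared with the GPU. */
void accumulateForce( Vec2 *forceOut, Vec2 xL, Vec2 vL, Vec2 xS, Vec2 vS );

void collideLargeSmallOne(
    const Vec2 *smallPos, const Vec2 *smallVel,
    const int *grid, const int *gridCounter,   /* mapped shared buffers    */
    int nCellsX, int nCellsY,
    float gridScale, Vec2 spaceMin,
    Vec2 xLarge, Vec2 vLarge, float radius,
    Vec2 *forceOut )                           /* shared force buffer      */
{
    /* Extent of the large particle in grid space, clamped to the grid. */
    int x0 = CLAMP( (int)((xLarge.x - radius - spaceMin.x)*gridScale), 0, nCellsX-1 );
    int x1 = CLAMP( (int)((xLarge.x + radius - spaceMin.x)*gridScale), 0, nCellsX-1 );
    int y0 = CLAMP( (int)((xLarge.y - radius - spaceMin.y)*gridScale), 0, nCellsY-1 );
    int y1 = CLAMP( (int)((xLarge.y + radius - spaceMin.y)*gridScale), 0, nCellsY-1 );

    for( int gy = y0; gy <= y1; gy++ )
    for( int gx = x0; gx <= x1; gx++ ) {
        int gridIdx = gy*nCellsX + gx;         /* assumed row-major layout */
        int numElem = gridCounter[gridIdx];
        if( numElem > MAX_IDX_PER_GRID ) numElem = MAX_IDX_PER_GRID;
        for( int ie = 0; ie < numElem; ie++ ) {
            int j = grid[ gridIdx*MAX_IDX_PER_GRID + ie ];
            /* The GPU reads these forces only after the second sync. */
            accumulateForce( &forceOut[j], xLarge, vLarge,
                             smallPos[j], smallVel[j] );
        }
    }
}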

Load Balancing

Although we have shown the flow of the simulation using two threads in Figure 10.2, our implementation suffers from poor load balance between the two workloads, which results in low utilization of CPU threads. Of course, the degree of imbalance depends on the numbers of small and large particles present. Looking at the CPU work thread, it is easy to see that the large–small and large–large collisions are more expensive than the integration, so the period between the first and second synchronizations takes the most time. On the other hand, the most expensive part of the GPU execution is the small–small collision computation, followed by construction of the uniform grid. Although the GPU is a powerful processor, the cost of building the uniform grid is not negligible. If the simulation is executed as shown in Figure 10.2, one thread has to wait before the first synchronization and the other before the second, which we can see more clearly in Figure 10.4A.
Figure 10.4
Load balancing.
The computations can be reordered to reduce this imbalance. Figure 10.2 shows the flow of a single iteration, but grid creation does not have to be the first step if we assume that particles do not move between the end of the previous iteration and the start of the current one. This is generally true unless a particle is moved directly, most likely through user interaction, which is relatively rare. From this observation, grid creation can be moved to the end of the previous iteration. This modification leaves two big blocks in the second half of the GPU thread's work, while the CPU thread's second half involves only the inexpensive integration of the large particles. If some computation can likewise be moved from the first half of the CPU thread's work to the second half, the imbalance eases further. Careful observation of the pipeline's data flow shows that the second synchronization exists only to protect the buffer of small-particle forces filled by the CPU. The large–large collision therefore does not have to be performed during the first half, because it has no data dependency with the next GPU computation, and it can be moved to the second half (as shown in Figure 10.4B). In this way, we improve the load balance on the CPU threads, which reduces the total simulation time. A sketch of the reordered thread loops follows.
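Expressed against the thread bodies sketched in the Overview section (the same illustrative names, not the chapter's actual source), the two loop bodies become the following; buildUniformGrid() must also run once before the first iteration to seed the first pass:

void *gpuManagerThread( void *arg )
{
    for( int it = 0; it < NUM_ITERATIONS; it++ ) {
        pthread_barrier_wait( &sync_point ); /* sync 1: previous grid ready */
        collideSmallSmall();
        pthread_barrier_wait( &sync_point ); /* sync 2: CPU forces written  */
        integrateSmall();
        buildUniformGrid();  /* moved here: positions are final for this pass */
    }
    return 0;
}

void *cpuWorkThread( void *arg )
{
    for( int it = 0; it < NUM_ITERATIONS; it++ ) {
        pthread_barrier_wait( &sync_point ); /* sync 1 */
        collideLargeSmall(); /* the only work that must precede sync 2 */
        pthread_barrier_wait( &sync_point ); /* sync 2 */
        collideLargeLarge(); /* moved here: no dependency on the next GPU step */
        integrateLarge();
    }
    return 0;
}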

Performance and Summary

Figure 10.1 is a screen shot from the mixed particle simulation with approximately 8200 small particles and 150 large particles running on an AMD A-series Fusion APU, on which the CPU and the GPU are integrated on a single die. The bar at the right of the figure shows the computation times of the two threads (one column per thread); the full height of a column equals 1/60 of a second. The blue and green blocks correspond to the time spent on the GPU and the CPU, respectively. Because there are four iterations per time step, one iteration takes approximately 4 ms in this simulation. The first four blocks correspond to the first iteration. As can be seen, the threads synchronize after the first GPU computation; the CPU work thread waits for the GPU thread to finish, producing the gap between the first green block and the synchronization.
If a buffer is created without the CL_MEM_ALLOC_HOST_PTR flag, the data has to be copied whenever the buffer is mapped and unmapped. To quantify the effect of the map operation, we compared a simulation using a normal buffer against one using a shared buffer; Figure 10.5 shows the breakdown of the times. Although the simulation itself is identical, there is a clear difference between the versions with and without data sharing, and the comparison shows a clear advantage for the shared buffer. However, notice the extra execution time of the GPU-side small–small collision phase when data sharing is used: this is due to the extra latency of accesses over the shared memory interface.
Figure 10.5
Breakdown of the simulation times.
In this chapter, we discussed an implementation of a particle simulation that handles variably sized particles by leveraging the best features of both the CPU and the GPU. In addition, by sharing memory between the CPU and the GPU on tightly integrated devices such as AMD's Fusion APU architecture, the efficiency of the computation was improved through the elimination of wasteful memory copies. Although this example was designed to be simple to understand, the same technique is applicable to more complicated applications.

Kernel for Uniform Grid Creation

typedef unsigned int u32;  /* assumed typedef; not shown in the original listing */

typedef struct
{
    float4 m_max;
    float4 m_min;
    int4 m_nCells;
    float m_gridScale;
    u32 m_maxParticles;
} ConstBuffer;
__kernel
void GridConstructionKernel(
    __global float4* gPosIn,
    __global int* gridG,
    __global int* gridCounterG,
    ConstBuffer cb )
{
    int gIdx = get_global_id(0);
    if( gIdx >= cb.m_maxParticles ) {
        return;
    }
    float4 iPos = gPosIn[gIdx];
    // convert the world-space position to a grid-space coordinate
    int4 gridCrd = ugConvertToGridCrd(
        iPos - cb.m_min, cb.m_gridScale );
    // discard particles that fall outside the grid
    if( gridCrd.x < 0 || gridCrd.x >= cb.m_nCells.x
     || gridCrd.y < 0 || gridCrd.y >= cb.m_nCells.y ) {
        return;
    }
    int gridIdx = ugGridCrdToGridIdx(
        gridCrd, cb.m_nCells.x, cb.m_nCells.y, cb.m_nCells.z );
    // atomically reserve a slot; drop the particle if the cell is already full
    int count = atom_add(&gridCounterG[gridIdx], 1);
    if( count < MAX_IDX_PER_GRID ) {
        gridG[ gridIdx*MAX_IDX_PER_GRID + count ] = gIdx;
    }
}
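The host-side dispatch of this kernel is not shown in the chapter. A minimal sketch follows, assuming kernel is a cl_kernel compiled from GridConstructionKernel and that the buffers were created as described earlier; clEnqueueFillBuffer requires OpenCL 1.2, and on older runtimes the counters can be cleared by a small kernel instead:

cl_int status;
cl_int zero = 0;
/* clear the per-cell counters before rebuilding the grid */
status = clEnqueueFillBuffer( commandQueue, gridCounterBuffer,
    &zero, sizeof(zero), 0, numCells*sizeof(cl_int), 0, 0, 0 );

clSetKernelArg( kernel, 0, sizeof(cl_mem), &posBuffer );
clSetKernelArg( kernel, 1, sizeof(cl_mem), &gridBuffer );
clSetKernelArg( kernel, 2, sizeof(cl_mem), &gridCounterBuffer );
clSetKernelArg( kernel, 3, sizeof(ConstBuffer), &cb ); /* passed by value */

/* round the global size up to a multiple of the work-group size */
size_t local = 64;
size_t global = ((cb.m_maxParticles + local - 1)/local)*local;
status = clEnqueueNDRangeKernel( commandQueue, kernel, 1, 0,
    &global, &local, 0, 0, 0 );
clFinish( commandQueue ); /* the grid must be complete before sync 1 */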

Kernels for Simulation

typedef struct
{
    float4 m_g;
    int m_numParticles;
    float m_dt;
    float m_scale;
    float m_e;
    int4 m_nCells;
    float4 m_spaceMin;
    float m_gridScale;
} ConstBuffer;
__kernel
void CollideGridKernel(
    __global float4* posIn,
    __global float4* velIn,
    __global int* grid,
    __global int* gridCounter,
    __global float4* forceOut,
    ConstBuffer cb)
{
    int gIdx = get_global_id(0);
    if( gIdx >= cb.m_numParticles ) {
        return;
    }
    int4 nCells = cb.m_nCells;
    float4 spaceMin = cb.m_spaceMin;
    float gridScale = cb.m_gridScale;
    float dt = cb.m_dt;
    float e = cb.m_e;
    float4 f = make_float4(0,0,0,0);
    float4 x_i = posIn[ gIdx ];
    float4 v_i = velIn[ gIdx ];
    int4 iGridCrd = ugConvertToGridCrd( x_i - spaceMin, gridScale );
    //1. collide particles: visit the 3x3 neighborhood of grid cells
    for(int i=-1;i<=1;i++) for(int j=-1;j<=1;j++) {
        int4 gridCrd = make_int4(
            iGridCrd.x+i,
            iGridCrd.y+j,
            iGridCrd.z, // the simulation is 2D, so z is not offset
            0);
        if( gridCrd.x < 0 || gridCrd.x >= nCells.x
         || gridCrd.y < 0 || gridCrd.y >= nCells.y ) {
            continue;
        }
        int gridIdx = ugGridCrdToGridIdx(
            gridCrd, nCells.x, nCells.y, nCells.z );
        int numElem = gridCounter[ gridIdx ];
        numElem = min(numElem, MAX_IDX_PER_GRID);
        for(int ie=0; ie<numElem; ie++) {
            int jIdx = grid[ MAX_IDX_PER_GRID*gridIdx + ie ];
            if( jIdx == gIdx ) {
                continue; // skip self-interaction
            }
            float4 x_j = posIn[jIdx];
            float4 v_j = velIn[jIdx];
            // coefficients depend on both masses, so compute them per pair
            float sCoeff, dCoeff;
            calcCoeffs(v_i.w, v_j.w, sCoeff, dCoeff);
            f += calcForce(
                x_i, x_j, v_i, v_j,
                x_i.w, x_j.w, v_i.w, v_j.w,
                dt, sCoeff, dCoeff );
        }
    }
    //2. collide with boundary
    {
        float sCoeff, dCoeff;
        calcCoeffs(v_i.w, sCoeff, dCoeff);
        // the four walls of the 2D domain expressed as plane equations
        float4 planes[4];
        planes[0] = make_float4(0,1,0,cb.m_scale);
        planes[1] = make_float4(-1,0,0,cb.m_scale);
        planes[2] = make_float4(1,0,0,cb.m_scale);
        planes[3] = make_float4(0,-1,0,cb.m_scale);
        for(int j=0; j<4; j++) {
            float4 eqn = planes[j];
            float dist = dot3w1( x_i, eqn );
            f += calcForceB(
                x_i, v_i, x_i.w,
                dist, eqn,
                dt, sCoeff, dCoeff );
        }
    }
    forceOut[ gIdx ] = f;
}
__kernel
void IntegrateKernel(
    __global float4* pos,
    __global float4* vel,
    __global float4* forceSS,
    __global float4* forceLS,
    ConstBuffer cb)
{
    int gIdx = get_global_id(0);
    if( gIdx >= cb.m_numParticles ) {
        return;
    }
    float4 x = pos[gIdx];
    float4 v = vel[gIdx];
    // v.w stores the particle mass; cb.m_g is added once per step as gravity
    v += (forceSS[gIdx] + forceLS[gIdx])*cb.m_dt/v.w + cb.m_g;
    x += v*cb.m_dt;
    // preserve the w components, which carry per-particle constants
    pos[gIdx] = make_float4(x.x, x.y, x.z, pos[gIdx].w);
    vel[gIdx] = make_float4(v.x, v.y, v.z, vel[gIdx].w);
    // zero the shared buffer so the CPU can accumulate into it next iteration
    forceLS[gIdx] = make_float4(0,0,0,0);
}