Chapter 4. Basic OpenCL Examples
This chapter discusses some basic OpenCL examples, which allow us to summarize our understanding of the specification discussed in Chapter 2. These examples demonstrate the programming steps needed to write complete OpenCL applications. We also include an example using the C++ Wrapper API for developers who have a preference toward C++. The examples discussed here can serve as baselines to compare the optimized versions, which can be written after studying later chapters.
Keywords C++, example program, matrix multiplication, OpenCL

Introduction

In Chapter 2, we discussed the OpenCL specification and how it can be used to implement programs for heterogeneous platforms. Chapter 3 covered the architecture of some possible OpenCL targets. In this chapter, we discuss a few more complex examples, which build on the simple examples such as vector addition discussed in Chapter 2. We cover the implementation of both the host and the device code in a methodical manner.
The aim of this chapter is to give the reader more intuition for how OpenCL can be used to write data-parallel programs. The implementations in this chapter are complete OpenCL examples. However, they have not been tuned to take advantage of any particular device architecture. The aim is to provide the user with implementation guidelines for OpenCL applications and to discuss implementations that can serve as a baseline for the architecture-specific optimization of applications in later chapters.

Example applications

In this section, we discuss the implementation of some example OpenCL applications. The examples covered here include image rotation, matrix multiplication, and image convolution.

Simple Matrix Multiplication Example

A simple serial C implementation of matrix multiplication is shown here (remember that OpenCL host programs can be written either in C or with the OpenCL C++ Wrapper API). The code iterates over three nested for loops, multiplying Matrix A by Matrix B and storing the result in Matrix C. The two outer loops iterate over each element of the output matrix. The innermost loop iterates over the individual elements of the input matrices to calculate the result of each output location.
// Iterate over the rows of Matrix A
for(int i = 0; i < heightA; i++) {
// Iterate over the columns of Matrix B
for(int j = 0; j < widthB; j++) {
C[i][j] = 0;
// Multiply and accumulate the values in the current row
// of A and column of B
for(int k = 0; k < widthA; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
It is straightforward to map the serial implementation to OpenCL, because the iterations of the two outer for loops are independent of each other. This means that a separate work-item can be created for each element of the output matrix. The two outer for loops thus map to the two-dimensional range of work-items for the kernel.
The independence of output values inherent in matrix multiplication is shown in Figure 4.1. Each work-item reads in its own row of Matrix A and its column of Matrix B. The data being read is multiplied and written at the appropriate location of the output Matrix C.
Figure 4.1
Each output value in a matrix multiplication is generated independently of all others.
// widthA = heightB for valid matrix multiplication
__kernel void simpleMultiply(
__global float* outputC,
int widthA,
int heightA,
int widthB,
int heightB,
__global float* inputA,
__global float* inputB) {
//Get global position in Y direction
int row = get_global_id(1);
//Get global position in X direction
int col = get_global_id(0);
float sum = 0.0f;
//Calculate result of one element of Matrix C
for (int i = 0; i < widthA; i++) {
sum += inputA[row*widthA+i] * inputB[i*widthB+col];
}
outputC[row*widthB+col] = sum;
}
Now that we have understood the implementation of the data-parallel kernel, we need to write the OpenCL API calls that move the data to the device. The implementation steps for the rest of the matrix multiplication application are summarized in Figure 4.2. We need to create a context for the device we wish to use. Using the context, we create the command queue, which is used to send commands to the device. Once the command queue is created, we can send the input data to the device, run the parallel kernel, and read the resultant output data back from the device.
Figure 4.2
Programming steps to writing a complete OpenCL application.

Step 1: Set Up Environment

In this step, we declare a context, choose a device type, and create the context and a command queue. Throughout this example, the ciErrNum variable should always be checked to see if an error code is returned by the implementation.
cl_int ciErrNum;
// Use the first platform
cl_platform_id platform;
ciErrNum = clGetPlatformIDs(1, &platform, NULL);
// Use the first device
cl_device_id device;
ciErrNum = clGetDeviceIDs(
platform,
CL_DEVICE_TYPE_ALL,
1,
&device,
NULL);
cl_context_properties cps[3] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
// Create the context
cl_context ctx = clCreateContext(
cps,
1,
&device,
NULL,
NULL,
&ciErrNum);
// Create the command queue
cl_command_queue myqueue = clCreateCommandQueue(
ctx,
device,
0,
&ciErrNum);

Step 2: Declare Buffers and Move Data

Declare buffers on the device and enqueue copies of input matrices to the device. Also declare the output buffer.
// We assume that A, B, C are float arrays which
// have been declared and initialized
// Allocate space for Matrix A on the device
cl_mem bufferA = clCreateBuffer(
ctx,
CL_MEM_READ_ONLY,
wA*hA*sizeof(float),
NULL,
&ciErrNum);
// Copy Matrix A to the device
ciErrNum = clEnqueueWriteBuffer(
myqueue,
bufferA,
CL_TRUE,
0,
wA*hA*sizeof(float),
(void *)A,
0,
NULL,
NULL);
// Allocate space for Matrix B on the device
cl_mem bufferB = clCreateBuffer(
ctx,
CL_MEM_READ_ONLY,
wB*hB*sizeof(float),
NULL,
&ciErrNum);
// Copy Matrix B to the device
ciErrNum = clEnqueueWriteBuffer(
myqueue,
bufferB,
CL_TRUE,
0,
wB*hB*sizeof(float),
(void *)B,
0,
NULL,
NULL);
// Allocate space for Matrix C on the device
cl_mem bufferC = clCreateBuffer(
ctx,
CL_MEM_WRITE_ONLY,
hA*wB*sizeof(float),
NULL,
&ciErrNum);

Step 3: Runtime Kernel Compilation

Compile the program from the kernel array, build the program, and define the kernel.
// We assume that the program source is stored in the variable
// 'source' and is NULL terminated
cl_program myprog = clCreateProgramWithSource (
ctx,
1,
(const char**)&source,
NULL,
&ciErrNum);
// Compile the program. Passing NULL for the 'device_list'
// argument targets all devices in the context
ciErrNum = clBuildProgram(myprog, 0, NULL, NULL, NULL, NULL);
// Create the kernel
cl_kernel mykernel = clCreateKernel(
myprog,
"simpleMultiply",
&ciErrNum);

Step 4: Run the Program

Set the kernel arguments and the workgroup size. We can then enqueue the kernel onto the command queue to execute on the device.
// Set the kernel arguments
clSetKernelArg(mykernel, 0, sizeof(cl_mem), (void *)&bufferC);
clSetKernelArg(mykernel, 1, sizeof(cl_int), (void *)&wA);
clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&hA);
clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&wB);
clSetKernelArg(mykernel, 4, sizeof(cl_int), (void *)&hB);
clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&bufferA);
clSetKernelArg(mykernel, 6, sizeof(cl_mem), (void *)&bufferB);
// Set local and global workgroup sizes
// We assume the matrix dimensions are divisible by 16
size_t localws[2] = {16, 16};
size_t globalws[2] = {wC, hC}; // Dimensions of C: wC = wB, hC = hA
// Execute the kernel
ciErrNum = clEnqueueNDRangeKernel(
myqueue,
mykernel,
2,
NULL,
globalws,
localws,
0,
NULL,
NULL);

Step 5: Read the Results Back to the Host

After the program has run, we enqueue a read back of the result matrix from the device buffer to host memory.
// Read the output data back to the host
ciErrNum = clEnqueueReadBuffer(
myqueue,
bufferC,
CL_TRUE,
0,
wC*hC*sizeof(float),
(void *)C,
0,
NULL,
NULL);
The steps outlined here show an OpenCL implementation of matrix multiplication that can be used as a baseline. In later chapters, we use our understanding of data-parallel architectures to improve the performance of particular data-parallel algorithms.

Image Rotation Example

Image rotation is a common image processing routine with applications in matching, alignment, and other image-based algorithms. The input to an image rotation routine is an image, the rotation angle θ, and a point about which rotation is done. The aim is to achieve the result shown in Figure 4.3. For the image rotation example, we use OpenCL's C++ Wrapper API.
Figure 4.3
An image rotated by 45°. The output is the same size as the input, and values that fall outside the image edges are dropped.
The coordinates of a point (x1, y1) when rotated by an angle θ around (x0, y0) become (x2, y2), as shown by the following equation:
x2 = cos(θ)(x1 - x0) + sin(θ)(y1 - y0) + x0
y2 = -sin(θ)(x1 - x0) + cos(θ)(y1 - y0) + y0
By rotating the image about the origin (0, 0), we get
x2 = x1 cos(θ) + y1 sin(θ)
y2 = -x1 sin(θ) + y1 cos(θ)
To implement image rotation with OpenCL, we see that the calculation of the new (x, y) coordinates of each pixel in the input can be done independently. Each work-item will calculate the new position of a single pixel. In a manner similar to matrix multiplication, a work-item can obtain the location of its respective pixel using its global ID (as shown in Figure 4.4).
Figure 4.4
Each element of the input image is handled by one work-item. Each work-item calculates the new coordinates of its pixel and writes it to the output image.
The image rotation example is a good example of an input decomposition, meaning that each work-item maps to an element of the input (in this case, a pixel of the input image). When an image is rotated, the new locations of some pixels may be outside the image if the input and output image sizes are the same (see Figure 4.3, in which the corners of the input would not have fit within the resultant image). For this reason, we need to check the bounds of the calculated output coordinates.
__kernel void img_rotate(
__global float* dest_data, __global float* src_data,
int W, int H,//Image Dimensions
float sinTheta, float cosTheta ) //Rotation Parameters
{
//Work-item gets its index within index space
const int ix = get_global_id(0);
const int iy = get_global_id(1);
//Calculate the destination location for the data at (ix,iy)
//Input decomposition as mentioned
float xpos = ((float)ix)*cosTheta + ((float)iy)*sinTheta;
float ypos = -1.0f*((float)ix)*sinTheta + ((float)iy)*cosTheta;
//Bound Checking
if(((int)xpos>=0) && ((int)xpos< W) &&
((int)ypos>=0) && ((int)ypos< H))
{
// Read (ix,iy) src_data and store at (xpos,ypos) in
// dest_data
// In this case, because we are rotating about the origin
// and there is no translation, we know that (xpos,ypos)
// will be unique for each input (ix,iy) and so each
// work-item can write its results independently
dest_data[(int)ypos*W+(int)xpos]= src_data[iy*W+ix];
}
}
As seen in the previous kernel code, image rotation is an embarrassingly parallel problem, in which each resulting pixel value is computed independently. The main steps for the host code are similar to those in Figure 4.2. For this example's host code, we can reuse a substantial amount of code from the previous matrix multiplication example, including the code that will create the context and the command queue.
To give the developer wider exposure to OpenCL, we write the host code for the image rotation example with the C++ bindings for OpenCL 1.1. The C++ bindings provide access to the low-level features of the original OpenCL C API. The C++ bindings are compatible with standard C++ compilers, and they are carefully designed to perform no memory allocation and offer full access to the features of OpenCL, without unnecessary masking of functionality.
More details about the OpenCL 1.1 specification's C++ Wrapper API can be found at www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf.
The C++ bindings are obtained by including the header cl.hpp. The steps are shown in a similar manner to the matrix multiplication example in order to illustrate the close correspondence between the C API and the more concise C++ bindings.

Step 1: Set Up Environment

// Discover platforms
cl::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Create a context with the first platform
cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms[0])(), 0};
// Create a context on this platform for all device types
cl::Context context(CL_DEVICE_TYPE_ALL, cps);
// Get device list from the context
cl::vector<cl::Device> devices =
context.getInfo<CL_CONTEXT_DEVICES>();
// Create a command queue on the first device
cl::CommandQueue queue = cl::CommandQueue(context,
devices[0], 0);

Step 2: Declare Buffers and Move Data

// Create buffers for the input and output data ("W" and "H"
// are the width and height of the image, respectively)
cl::Buffer d_ip = cl::Buffer(context, CL_MEM_READ_ONLY,
W*H* sizeof(float));
cl::Buffer d_op = cl::Buffer(context, CL_MEM_WRITE_ONLY,
W*H* sizeof(float));
// Copy the input data to the device (assume that the input
// image is the array “ip”)
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*
sizeof(float), ip);

Step 3: Runtime Kernel Compilation

// Read in the program source
std::ifstream sourceFileName("img_rotate_kernel.cl");
std::string sourceFile(
std::istreambuf_iterator<char>(sourceFileName),
(std::istreambuf_iterator<char>()));
cl::Program::Sources rotn_source(1,
std::make_pair(sourceFile.c_str(),
sourceFile.length()+1));
// Create the program
cl::Program rotn_program(context, rotn_source);
// Build the program
rotn_program.build(devices);
// Create the kernel
cl::Kernel rotn_kernel(rotn_program, "img_rotate");

Step 4: Run the Program

// The angle of rotation is theta
float cos_theta = cos(theta);
float sin_theta = sin(theta);
// Set the kernel arguments
rotn_kernel.setArg(0, d_op);
rotn_kernel.setArg(1, d_ip);
rotn_kernel.setArg(2, W);
rotn_kernel.setArg(3, H);
rotn_kernel.setArg(4, cos_theta);
rotn_kernel.setArg(5, sin_theta);
// Set the size of the NDRange and workgroups
cl::NDRange globalws(W,H);
cl::NDRange localws(16,16);
// Run the kernel
queue.enqueueNDRangeKernel(rotn_kernel, cl::NullRange,
globalws, localws);

Step 5: Read Result Back to Host

// Read the output buffer back to the host
queue.enqueueReadBuffer(d_op, CL_TRUE, 0, W*H*sizeof(float), op);
As seen from the previous code, the C++ bindings maintain a close correspondence to the C API.

Image Convolution Example

In image processing, convolution is a commonly used algorithm that modifies the value of each pixel in an image by using information from neighboring pixels. A convolution kernel, or filter, describes how each pixel will be influenced by its neighbors. For example, a blurring kernel will take the weighted average of neighboring pixels so that large differences between pixel values are reduced. By using the same source image and changing only the filter, effects such as sharpening, blurring, edge enhancing, and embossing can be produced.
A convolution kernel works by iterating over each pixel in the source image. For each source pixel, the filter is centered over the pixel and the values of the filter multiply the pixel values that they overlay. A sum of the products is then taken to produce a new pixel value. Figure 4.5 provides a visual for this algorithm. Figure 4.6B shows the effect of a blurring filter and Figure 4.6C shows the effect of an edge-detection filter on the same source image seen in Figure 4.6A.
Figure 4.5
Applying a convolution filter to a source image.
Figure 4.6
The effect of a blurring filter and a vertical edge-detecting filter applied to the same source image. (A) The original image. (B) Blurring filter. (C) Vertical edge-detecting filter.
The following code performs a convolution in C. The outer two loops iterate over the source image, selecting the next source pixel. At each source pixel, the filter is applied to the neighboring pixels.
// Iterate over the rows of the source image
for(int i = halfFilterWidth; i < rows - halfFilterWidth; i++) {
// Iterate over the columns of the source image
for(int j = halfFilterWidth; j < cols - halfFilterWidth; j++) {
sum = 0; // Reset sum for new source pixel
// Apply the filter to the neighborhood
for(int k = - halfFilterWidth; k <= halfFilterWidth; k++) {
for(int l = - halfFilterWidth; l <= halfFilterWidth; l++) {
sum += Image[i+k][j+l] *
Filter[k+ halfFilterWidth][l+ halfFilterWidth];
}
}
outputImage[i][j] = sum;
}
}

Step 1: Create Image and Buffer Objects

This example implements convolution using OpenCL images for the data type of the source and output images. Using images to represent the data has a number of advantages. For the convolution, work-items representing border pixels may read out-of-bounds. Images supply a mechanism to automatically handle these accesses and return meaningful data.
The code begins by assuming that a context (context) and command queue (queue) have already been created, and that the source image (sourceImage), output image (outputImage), and filter (filter) have already been initialized on the host. The images both have dimensions width by height.
The first task is to allocate space for the source and output images and the filter on the device. Images require a format descriptor, cl_image_format, to define the size and type of the data that they store and the channel layout used to store it. The channel layout is specified in the image_channel_order field of the descriptor. Recall from Chapter 2 that every element of an image stores data in up to four channels, denoted R, G, B, and A. An image that should hold four values in every element should use CL_RGBA for the channel order. However, if each work-item will only access a single value (e.g., a pixel from a grayscale image or an element of a matrix), the data can be specified to use only a single channel with CL_R. This example assumes grayscale data and so uses a single channel. The type of the data is specified in the image_channel_data_type field of the descriptor. Integers are specified by a combination of signedness and size. For example, CL_SIGNED_INT32 is a 32-bit signed integer, and CL_UNSIGNED_INT8 is the equivalent of an unsigned char in C. Floating point data is specified by CL_FLOAT, which is the type used in this example.
After creating the image format descriptor, memory objects are created to represent the images using clCreateImage2D(). A buffer is created for the filter and will eventually be used as constant memory.
// The convolution filter is 7x7
int filterWidth = 7;
int filterSize = filterWidth*filterWidth; // Assume a square kernel
// The image format describes how the data will be stored in memory
cl_image_format format;
format.image_channel_order = CL_R; // single channel
format.image_channel_data_type = CL_FLOAT; // float data type
// Create space for the source image on the device
cl_mem bufferSourceImage = clCreateImage2D(
context,
0,
&format,
width,
height,
0,
NULL,
NULL);
// Create space for the output image on the device
cl_mem bufferOutputImage = clCreateImage2D(
context,
0,
&format,
width,
height,
0,
NULL,
NULL);
// Create space for the 7x7 filter on the device
cl_mem bufferFilter = clCreateBuffer(
context,
0,
filterSize*sizeof(float),
NULL,
NULL);

Step 2: Write the Input Data

The call to clEnqueueWriteImage() copies an image to a device. Unlike buffers, copying an image requires supplying a three-dimensional offset and region, which define the coordinates where the copy should begin and how far it should span, respectively.
The filter is copied using clEnqueueWriteBuffer(), as seen in previous examples.
// Copy the source image to the device
size_t origin[3] = {0, 0, 0}; // Offset within the image to copy from
size_t region[3] = {width, height, 1}; // Elements to copy per dimension
clEnqueueWriteImage(
queue,
bufferSourceImage,
CL_FALSE,
origin,
region,
0,
0,
sourceImage,
0,
NULL,
NULL);
// Copy the 7x7 filter to the device
clEnqueueWriteBuffer(
queue,
bufferFilter,
CL_FALSE,
0,
filterSize*sizeof(float),
filter,
0,
NULL,
NULL);

Step 3: Create Sampler Object

In OpenCL, samplers are objects that describe how to access an image. Samplers specify the type of coordinate system, what to do when out-of-bounds accesses occur, and whether or not to interpolate if an access lies between multiple indices. The format of the clCreateSampler() API call is as follows:
cl_sampler clCreateSampler (
cl_context context,
cl_bool normalized_coords,
cl_addressing_mode addressing_mode,
cl_filter_mode filter_mode,
cl_int *errcode_ret)
The coordinate system can either be normalized (i.e., coordinates range from 0 to 1) or use standard indices. Setting the second argument to CL_TRUE enables normalized coordinates. Convolution does not use normalized coordinates, so the argument is set to CL_FALSE.
OpenCL also allows a number of addressing modes for handling out-of-bounds accesses. In the convolution example, we use CL_ADDRESS_CLAMP_TO_EDGE so that any out-of-bounds access returns the value on the border of the image. If CL_ADDRESS_CLAMP is used instead, an out-of-bounds access returns 0 for the R, G, and B channels, and either 0 or 1 for the A channel (based on the image format). Other options are available when normalized coordinates are used.
The filter mode can be set to either access the closest pixel to a coordinate or interpolate between multiple pixel values if the coordinate lies somewhere in between.
// Create the image sampler
cl_sampler sampler = clCreateSampler(
context,
CL_FALSE,
CL_ADDRESS_CLAMP_TO_EDGE,
CL_FILTER_NEAREST,
NULL);

Step 4: Compile and Execute the Kernel

The steps to create and build a program, create a kernel, set the kernel arguments, and enqueue the kernel for execution are identical to those in the previous example. Unlike the reference C version, the OpenCL code using images should create as many work-items as there are pixels in the image. Any out-of-bounds accesses due to the filter size will be handled automatically, based on the sampler object.

Step 5: Read the Result

Reading the result back to the host is very similar to writing the image, except that a pointer to the location to store the output data on the host is supplied.
// Read the output image back to the host
clEnqueueReadImage(
queue,
bufferOutputImage,
CL_TRUE,
origin,
region,
0,
0,
outputImage,
0,
NULL,
NULL);

The Convolution Kernel

The kernel is fairly straightforward if the reference C code is understood—each work-item executes the two innermost loops. Data reads from the source image must be performed using an OpenCL construct that is specific to the data type. For this example, read_imagef() is used, where f signifies that the data to be read is of type single precision floating point. Accesses to an image always return a four-element vector (one per channel), so pixel (the value returned by the image access) and sum (resultant data that gets copied to the output image) must both be declared as a float4. Writing to the output image uses a similar function, write_imagef(), and requires that the data be formatted correctly (as a float4). Writing does not support out-of-bounds accesses. If there is any chance that there are more work-items in either dimension of the NDRange than there are pixels, bounds checking should be done before writing the output data.
The filter is a perfect candidate for constant memory in this example because all work-items access the same element each iteration. Simply adding the keyword __constant in the signature of the function places the filter in constant memory.
__kernel
void convolution(
__read_only image2d_t sourceImage,
__write_only image2d_t outputImage,
int rows,
int cols,
__constant float* filter,
int filterWidth,
sampler_t sampler)
{
// Store each work-item's unique row and column
int column = get_global_id(0);
int row = get_global_id(1);
// Half the width of the filter is needed for indexing
// memory later
int halfWidth = (int)(filterWidth/2);
// All accesses to images return data as four-element vector
// (i.e., float4), although only the 'x' component will contain
// meaningful data in this code
float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
// Iterator for the filter
int filterIdx = 0;
// Each work-item iterates around its local area based on the
// size of the filter
int2 coords; // Coordinates for accessing the image
// Iterate the filter rows
for(int i = -halfWidth; i <= halfWidth; i++) {
coords.y = row + i;
// Iterate over the filter columns
for(int j = -halfWidth; j <= halfWidth; j++) {
coords.x = column + j;
float4 pixel;
// Read a pixel from the image. A single channel image
// stores the pixel in the 'x' coordinate of the returned
// vector.
pixel = read_imagef(sourceImage, sampler, coords);
sum.x += pixel.x * filter[filterIdx++];
}
}
// Copy the data to the output image if the
// work-item is in bounds
if(row < rows && column < cols) {
coords.x = column;
coords.y = row;
write_imagef(outputImage, coords, sum);
}
}

Compiling OpenCL Host Applications

To run an OpenCL program on a GPU, a graphics driver that supports OpenCL is required. With AMD's implementation, OpenCL programs can also be run on x86 CPUs without installing any hardware driver, although the OpenCL runtime is still required.
Compiling an OpenCL program is similar to compiling any application that uses dynamic libraries. Vendors distribute their own OpenCL library that must be used when compiling and linking an OpenCL executable. To compile an OpenCL program, an include path must be set to locate the OpenCL headers (cl.h or cl.hpp). The linker must know how to locate the OpenCL library (OpenCL.lib for Windows and libOpenCL.a on Linux). That's it!
Assuming that the OpenCL SDK is installed at $(AMDAPPSDKROOT), an example compilation on Linux might be as follows:
$ g++ -o prog -I$(AMDAPPSDKROOT)/include -L$(AMDAPPSDKROOT)/lib/x86_64 prog.cpp -lOpenCL
We see that most of the steps are similar across applications, allowing us to reuse a lot of "boilerplate" code. Applications using the C++ Wrapper API are compiled in the same manner. The C++ header file will usually be located in the same directory as the C headers.

Summary

In this chapter, we discussed implementations of some well-known data-parallel algorithms. We studied the use of OpenCL buffer and image objects. We also used the C++ Wrapper API for the image rotation example.
In each example, a work-item computes the result of a single output element for the problem, although the input data requirements vary. The image rotation example is a case in which only one input element is needed. In matrix multiplication, a whole row and a whole column of the input matrices are needed by each work-item to calculate the result of one element of the output matrix. Convolution requires a neighborhood of input pixels to compute a result.
Although the examples discussed in this chapter are correct data-parallel OpenCL programs, their performance can be drastically improved. Optimizing performance based on specific hardware platforms is the goal of subsequent chapters.