Chapter 11. OpenCL Extensions
This chapter discusses the OpenCL extension mechanism and presents case studies using two widely supported extensions: device fission and double precision.
Keywords: double precision, extensions, fission, OpenCL

Introduction

Similar to most programming languages and frameworks, the OpenCL specification provides support for optional extensions: features that are not part of the core specification but that an implementation may choose to expose. In this chapter, we discuss a small but important set of these extensions, which provide programmers an extended set of tools to implement OpenCL applications.

Overview of Extension Mechanism

OpenCL defines three types of extensions:
• KHR extension: Formally ratified by the OpenCL working group and comes with a set of conformance tests that any application claiming to support the extension must have passed. All KHR extensions are included as part of the OpenCL specification. In general, the goal is to keep the set of KHR extensions small to avoid differences between implementations. Unfortunately, in some cases, a feature must be an extension because it is not possible to support on all platforms. DirectX interoperability is a good example because it is only relevant to OpenCL implementations that support Microsoft Windows. A KHR extension is assigned a unique string name of the form cl_khr_<name>, where name is the name given to the particular extension.
• EXT extension: Developed by one or more of the OpenCL working group members, but it has not been formally approved. There is no requirement for an EXT feature to pass conformance tests. An EXT extension is often used to provide early access to features being worked on by a subgroup of the OpenCL working group, many of which will appear in the specification core or as a KHR at a later date. An EXT extension is assigned a unique string name of the form cl_ext_<name>, where name is the name given to the particular extension.
• Vendor extension: Developed by a single vendor, most likely to expose some feature that is only accessible on the vendor's hardware or platform (and thus would not be of general application). An example is AMD's Device Attribute Query, which allows the developer to query additional information specific to AMD devices. A Vendor extension is assigned a unique string name of the form cl_<vendor>_<name>, where vendor is a given vendor-defined string (e.g., amd is used by AMD), and name is the name given to the particular extension.
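The three naming conventions can be distinguished mechanically from an extension's string name. The following sketch (a hypothetical helper for illustration, not part of any OpenCL API) classifies a name by its prefix:

```cpp
#include <cassert>
#include <string>

// Classify an OpenCL extension name by its prefix:
// cl_khr_* -> "KHR", cl_ext_* -> "EXT", and any other
// cl_<vendor>_* name -> "vendor".
std::string classifyExtension(const std::string& name)
{
    if (name.compare(0, 7, "cl_khr_") == 0) return "KHR";
    if (name.compare(0, 7, "cl_ext_") == 0) return "EXT";
    if (name.compare(0, 3, "cl_") == 0)     return "vendor";
    return "unknown";
}
```

For example, cl_amd_device_attribute_query classifies as a vendor extension because its prefix is cl_ followed by a vendor string rather than khr or ext.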
Extensions can be associated with a given OpenCL platform and in those cases are likely to be always enabled. For example, AMD's Event Callback extension (cl_amd_event_callback) is always enabled. Other extensions can be specific to a particular set of devices (e.g., double precision support (cl_khr_fp64)).
In general, the application must query either the platform or a particular device to determine what extensions are supported. Using the C++ Wrapper API, the following code demonstrates how to query both the platform and a device for the extension information:
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::string platformExts = platforms[0].getInfo<CL_PLATFORM_EXTENSIONS>();
cl_context_properties cprops[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
cl::Context context(devType, cprops);
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
// Query the device's extension string
std::string deviceExts = devices[0].getInfo<CL_DEVICE_EXTENSIONS>();
Of course, because we now have a value of type std::string for platform and device extensions, we can simply use the method find(). For example, to check that a device supports 64-bit Atomics (defined with the name cl_khr_int64_base_atomics), you could use the following code:
bool has64Atomics =
deviceExts.find("cl_khr_int64_base_atomics") != std::string::npos;
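Note that find() performs a substring match, so a query for a short name can be satisfied by a longer name that merely contains it. A more robust check tokenizes the space-separated extension string and compares whole tokens. The helper below, hasExtension, is a hypothetical utility sketched for this purpose, not part of the OpenCL API:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Return true only if 'name' appears as a whole space-separated
// token in the extensions string returned by getInfo<...>().
bool hasExtension(const std::string& extensions, const std::string& name)
{
    std::istringstream tokens(extensions);
    std::string ext;
    while (tokens >> ext) {   // operator>> splits on whitespace
        if (ext == name) {
            return true;
        }
    }
    return false;
}
```

With this helper, the atomics check becomes hasExtension(deviceExts, "cl_khr_int64_base_atomics"), which cannot be fooled by a hypothetical longer name sharing the same prefix.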
Because extensions are optional, there is no requirement for an implementation to provide any externally defined symbols that the host application can call. Accordingly, a host program cannot be statically linked against any extension APIs. Instead, we can query the address of a particular extension function using the following API:
void * clGetExtensionFunctionAddress(const char *funcname);
This returns the address of the extension function named by funcname. The pointer returned can be cast to a function pointer of the expected type. Typedefs for each function pointer are declared for all extensions that add API entry points. For example, the EXT extension Atomic Counters, exposed in the OpenCL extension header cl_ext.h, provides low-latency atomic append/consume counters and defines the following:
typedef CL_API_ENTRY cl_counter_ext (CL_API_CALL * clCreateCounterEXT_fn)(
cl_counter_flags_amd /* flags */,
cl_context           /* context */,
cl_uint              /* value */,
cl_int *             /* error_ret */) CL_API_SUFFIX__VERSION_1_0;
Then, assuming that a particular OpenCL device supports the extension cl_ext_atomic_counters, the following call will set clCreateCounterEXT_pfn to a non-NULL value:
clCreateCounterEXT_fn clCreateCounterEXT_pfn = (clCreateCounterEXT_fn)
clGetExtensionFunctionAddress("clCreateCounterEXT");
In practice, it is often useful to use the following macro to enable quick allocation of these functions:
#define __INIT_CL_EXT_FCN_PTR(name) \
if (!name##_pfn) { \
name##_pfn = (name##_fn) \
clGetExtensionFunctionAddress(#name); \
}
To get the same behavior as the function pointer definition previously, we can write the following:
__INIT_CL_EXT_FCN_PTR(clCreateCounterEXT)
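The same pattern, looking up a symbol by name at run time and casting the result to a typed function pointer, can be exercised without an OpenCL runtime. In the sketch below, a simple table plays the role of the driver behind clGetExtensionFunctionAddress; the names addEXT and getExtensionFunctionAddress are purely illustrative:

```cpp
#include <cassert>
#include <map>
#include <string>

// A typed function pointer, analogous to clCreateCounterEXT_fn.
typedef int (*add_fn)(int, int);

static int addImpl(int a, int b) { return a + b; }

// Stand-in for the driver's name -> address table.
static std::map<std::string, void*> registry = {
    { "addEXT", (void*)&addImpl }
};

// Plays the role of clGetExtensionFunctionAddress: returns NULL
// when the named entry point is not present.
void* getExtensionFunctionAddress(const std::string& name)
{
    auto it = registry.find(name);
    return (it == registry.end()) ? 0 : it->second;
}
```

As with a real extension, the caller must cast the returned void pointer to the expected function pointer type and check it against NULL before calling through it.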
The examples presented later in this chapter go a step further and use the OpenCL C++ Wrapper API, which, if a particular extension is enabled, directly exposes a class interface and implicitly takes care of allocating and querying function pointers (as required).
Once an application has verified that a particular extension is supported by a device and has allocated any required API function addresses, it can then use any corresponding OpenCL C source extensions. However, these extensions must also be enabled, this time in the OpenCL C source itself. These features are controlled using a pragma directive. The pragma is of the form
#pragma OPENCL EXTENSION extension_name : behavior
where extension_name is the corresponding name of the extension being enabled or disabled, such as cl_khr_int64, and behavior can be either enable or disable. The impact of enabling or disabling an extension follows from the position in the source code of the last seen pragma for a given extension. For example, to enable 64-bit atomics in an OpenCL C source program, the following code would do the job:
#pragma OPENCL EXTENSION cl_khr_int64 : enable
Images are the one exception to the rules described previously. To check if a device supports this feature (which is optional), a separate device query is required:
bool hasImages = devices[0].getInfo<CL_DEVICE_IMAGE_SUPPORT>();
If the result of this call is true, then images are supported by the device in question and the OpenCL C source is not required to explicitly enable this feature.
There are many KHR, EXT, and vendor extensions—too many to discuss in detail in this chapter. The remainder of this chapter discusses two important extensions in wide use today:
• Device fission—the ability for some devices to be divided into smaller subdevices
• Double precision—the ability for some devices to support double as well as float data types

Device Fission

The EXT extension Device Fission (Bellows et al., 2010) provides an interface for subdividing an OpenCL device into multiple subdevices. As an example, consider an AMD six-core Istanbul x86 CPU shown symbolically in Figure 11.1A. We view all six cores as a single OpenCL device by default, as shown in Figure 11.1B. Using Device Fission, the six cores can be subdivided into six OpenCL devices as shown in Figure 11.1C, each capable of supporting one or more command queues. Because these command queues are asynchronous and run in their own threads, it is possible to use Device Fission to build a portable and powerful threading application based on task parallelism. To date, Device Fission is supported only on CPU-like devices, but in the future this functionality will spread across all devices in the platform, including GPUs. Next, we motivate and describe the basic use of Device Fission through an example.
Figure 11.1
CPU as seen by OpenCL with and without device fission. (A) Phenom II 6 core. (B) Phenom II 6 core represented as a single OpenCL device for all cores. (C) Phenom II 6 core represented as an OpenCL device for each core.
An interesting application of Device Fission is to use it to build and study concurrent runtimes, similar to Microsoft's Concurrent Runtime (ConcRT) (Microsoft, 2010) and Intel's Threading Building Blocks (TBB) (Intel, 2011). We present a case study of building a simple C++ class that supports a parallel for construct that distributes work across a set of devices created using Device Fission. Although this example has little in common with the kind of production functionality provided by runtimes such as ConcRT and TBB, it provides enough to show that with only a small amount of effort we can build portable, scalable, parallel applications that go beyond basic data-parallel computations. For this example, the goal is to implement a simple class Parallel that provides a parallelFor method to distribute work across x86 cores, each having a corresponding OpenCL device generated by Device Fission. The class Parallel has the following public interface:
class Parallel
{
public:
Parallel();
static unsigned int atomicAdd(
unsigned int inc,
volatile unsigned int *dest);
bool parallelFor(int range, std::function<void (int i)> f);
};
We demonstrate its use with a simple application that counts the number of prime numbers in an input array of randomly generated integers. The implementation uses the OpenCL C++ Wrapper API (Gaster, 2010). To enable Device Fission, we simply need to define USE_CL_DEVICE_FISSION and include cl.hpp:1
1We also enable C++ exceptions for OpenCL as shown in the full program source, given in Listing 11.1.
#define USE_CL_DEVICE_FISSION 1
#include <CL/cl.hpp>
As described previously, we encapsulate the parallel features of our simple runtime in the class Parallel. All OpenCL state for the program is held in the private part of this class definition. The default constructor initializes OpenCL and divides the CPU device into single core subdevices. As is standard with all OpenCL initialization, we need to first query a platform, then create a context (in this case, using CL_DEVICE_TYPE_CPU), and finally inspect the list of devices (as shown in Figure 11.1B, this will always be presented as a single element):
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms[0])(),
0
};
context_ = cl::Context(CL_DEVICE_TYPE_CPU, properties);
std::vector<cl::Device> devices =
context_.getInfo<CL_CONTEXT_DEVICES>();
Before we can use the Device Fission extension, the program must check that it is an exported extension for the device:
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
"cl_ext_device_fission") == std::string::npos) {
std::cout << "Required that device support “
<< “cl_ext_device_extension"
<< std::endl;
exit(−1);
}
Given an OpenCL cl::Device, in this case devices[0], the method createSubDevices creates subdevices,
cl_int createSubDevices(
const cl_device_partition_property_ext * properties,
VECTOR_CLASS<Device>* devices)
which, given a list of partition properties (as defined in Table 11.1), creates a set of subdevices, returned in the parameter devices.
Table 11.1 Subdevice Partition Properties

CL_DEVICE_PARTITION_EQUALLY_EXT: Split the aggregate device into as many smaller devices as can be created, each containing N compute units. The value N is passed as the value accompanying this property. If N does not divide evenly into CL_DEVICE_MAX_COMPUTE_UNITS, then the remaining compute units are not used.

CL_DEVICE_PARTITION_BY_COUNTS_EXT: This property is followed by a list of compute unit counts terminated by CL_PARTITION_BY_COUNTS_LIST_END_EXT. For each nonzero count M in the list, a subdevice is created with M compute units in it. CL_PARTITION_BY_COUNTS_LIST_END_EXT is defined to be 0.

CL_DEVICE_PARTITION_BY_NAMES_EXT: This property is followed by a list of compute unit names terminated by CL_PARTITION_BY_NAMES_LIST_END_EXT. Compute unit names are integers that count up from zero to the number of compute units minus 1. CL_PARTITION_BY_NAMES_LIST_END_EXT is defined to be -1. Only one subdevice may be created at a time with this selector. An individual compute unit name may not appear more than once in the subdevice description.

CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT: Split the device into smaller aggregate devices containing one or more compute units that all share part of a cache hierarchy. The value accompanying this property may be drawn from the following CL_AFFINITY_DOMAIN list:
• CL_AFFINITY_DOMAIN_NUMA_EXT: Split the device into subdevices composed of compute units that share a NUMA node.
• CL_AFFINITY_DOMAIN_L4_CACHE_EXT: Split the device into subdevices composed of compute units that share a level 4 data cache.
• CL_AFFINITY_DOMAIN_L3_CACHE_EXT: Split the device into subdevices composed of compute units that share a level 3 data cache.
• CL_AFFINITY_DOMAIN_L2_CACHE_EXT: Split the device into subdevices composed of compute units that share a level 2 data cache.
• CL_AFFINITY_DOMAIN_L1_CACHE_EXT: Split the device into subdevices composed of compute units that share a level 1 data cache.
• CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT: Split the device along the next fissionable CL_AFFINITY_DOMAIN. The implementation shall find the first level along which the device or subdevice may be further subdivided in the order NUMA, L4, L3, L2, L1, and fission the device into subdevices composed of compute units that share memory subsystems at this level. The user may determine what happened by calling clGetDeviceInfo(CL_DEVICE_PARTITION_STYLE_EXT) on the subdevices.
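The arithmetic behind CL_DEVICE_PARTITION_EQUALLY_EXT is simple integer division: any remainder compute units go unused. A small sketch (a hypothetical helper for illustration, not an OpenCL call) makes the rounding behavior concrete:

```cpp
#include <cassert>

// Number of subdevices produced by CL_DEVICE_PARTITION_EQUALLY_EXT
// when an aggregate device with 'computeUnits' units is split into
// subdevices of 'n' units each; remainder units are left unused.
int equalSubdeviceCount(int computeUnits, int n)
{
    if (n <= 0 || n > computeUnits) return 0;
    return computeUnits / n;   // integer division discards remainder
}
```

For the six-core CPU in Figure 11.1, a value of N = 1 yields six single-core subdevices, whereas N = 4 yields just one subdevice and leaves two cores unused.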
The following code then creates the subdevices, using the partition CL_DEVICE_PARTITION_EQUALLY_EXT to equally subdivide (in this case), producing a one-to-one mapping between subdevice and core:
cl_device_partition_property_ext subDeviceProperties[] =
{
CL_DEVICE_PARTITION_EQUALLY_EXT,
1,
CL_PROPERTIES_LIST_END_EXT,
0
};
devices[0].createSubDevices(subDeviceProperties, &subDevices_);
if (subDevices_.size() <= 0) {
std::cout << "Failed to allocate sub-devices" << std::endl;
exit(-1);
}
The following code concludes the default constructor definition by iterating through the list of subdevices, creating a command queue for each one:
for (auto i = subDevices_.begin(); i != subDevices_.end(); i++) {
queues_.push_back(cl::CommandQueue(context_, *i));
}
Figure 11.2 shows each CPU core paired with its corresponding command queue. It is important to note that commands submitted via each queue run asynchronously and concurrently with each other.
Figure 11.2
OpenCL subdevices with associated command queues.
The implementation for atomicAdd is straightforward and is provided later. Here, we focus on the definition of parallelFor, which takes two arguments. The first argument represents the bounds of the iteration space, and the second argument is a function object that will be applied to each index across this space. This second argument is a native C++ function object, 2 but OpenCL clEnqueueNDRangeKernel operations are valid only for OpenCL C kernels, and so we need another approach to support native C++ functions. Fortunately, OpenCL has just such a function, clEnqueueNativeKernel, which can enqueue a C function. With some careful marshaling, this can also be made to work for a C++ function. Native kernels were described in Chapter 5.
2The term native is used to imply that the function in question has been compiled by the host compiler and not the OpenCL C compiler.
Using clEnqueueNativeKernel, the definition of parallelFor is straightforward: 3
3Note that we use the OpenCL C++ Wrapper API (rather than the clEnqueueNativeKernel from the C API), as described previously.
bool parallelFor(int range, std::function<void (int i)> f)
{
std::vector<cl::Event> events;
size_t args[2];
args[0] = reinterpret_cast<size_t>(&f);
int index = 0;
for (int x = 0; x < range;) {
int numQueues =
range - x > (int)queues_.size() ? (int)queues_.size() : range - x;
cl::Event event;
while(numQueues > 0) {
args[1] = static_cast<size_t>(index++);
queues_[numQueues-1].enqueueNativeKernel(
funcWrapper,
std::make_pair(
static_cast<void *>(args),
sizeof(size_t)*2),
NULL,
NULL,
NULL,
&event);
events.push_back(event);
numQueues--;
x++;
}
cl::Event::waitForEvents(events);
events.clear();
}
return true;
}
The first part of the function sets up the argument, which in this case is the actual function we want to run. This is required because it is a C++ function object (often called a functor), and OpenCL is expecting a C function. Thus, we provide a wrapper function called funcWrapper that takes as an argument a pointer to the functor. The wrapper function "unboxes" it and calls the function object when executed. The main body of the function is the loop executing the functor over the range 0 … range, with the inner loop mapping some subset of this range to the actual subdevices. Note that these are all submitted asynchronously, and each call returns an event that we wait on at the end of each set of submissions. A more optimized version might wait until a large number of submissions have happened and exercise finer-grained control using queue.flush() and so on. We leave this as an exercise for the reader.
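The marshaling that funcWrapper performs can be exercised without an OpenCL runtime at all. The sketch below packs a pointer to a std::function and an index into a size_t array, exactly as parallelFor does, and unboxes both in a C-style callback; dispatchAll is an illustrative stand-in for the enqueue loop, calling the wrapper directly rather than through a command queue:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// C-style entry point of the shape clEnqueueNativeKernel expects:
// a single void* argument. args[0] holds a pointer to the functor,
// args[1] the index to apply it to.
static void funcWrapper(void *a)
{
    size_t *args = static_cast<size_t *>(a);
    std::function<void (int)> *f =
        reinterpret_cast<std::function<void (int)> *>(args[0]);
    (*f)(static_cast<int>(args[1]));
}

// Invoke 'f' once per index in [0, range) through the C wrapper.
void dispatchAll(int range, std::function<void (int)> f)
{
    for (int i = 0; i < range; ++i) {
        size_t args[2];
        args[0] = reinterpret_cast<size_t>(&f);
        args[1] = static_cast<size_t>(i);
        funcWrapper(args);   // in OpenCL this would be enqueued
    }
}
```

In the real implementation the same args buffer is reused across submissions; this is safe because clEnqueueNativeKernel copies the argument buffer at enqueue time.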
Finally, we put this all together in Listing 11.1, which shows the complete implementation of our primes checking example, including the main function. One interesting and important aspect of this example is that no OpenCL C device code is used, and it is simply using the OpenCL runtime as a portable threading library.
// Enable OpenCL C++ exceptions
#define __CL_ENABLE_EXCEPTIONS
// Enable Device Fission
#define USE_CL_DEVICE_FISSION 1
#include <CL/cl.hpp>
#include <cstring>
#include <cstdlib>
#include <cassert>
#include <cmath>
#include <iostream>
#include <fstream>
#include <vector>
#include <functional>
#include <memory>
class Parallel
{
private:
cl::Context context_;
std::vector<cl::Device> subDevices_;
cl::CommandQueue queue_;
std::vector<cl::CommandQueue> queues_;
static void CL_CALLBACK funcWrapper(void * a)
{
size_t * args = static_cast<size_t *>(a);
std::function<void (int i)> * f =
reinterpret_cast<std::function<void (int i)>*>(args[0]);
(*f)(static_cast<int>(args[1]));
}
public:
Parallel()
{
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms[0])(),
0
};
context_ = cl::Context(CL_DEVICE_TYPE_CPU, properties);
std::vector<cl::Device> devices =
context_.getInfo<CL_CONTEXT_DEVICES>();
// Check that device fission is supported
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
"cl_ext_device_fission") == std::string::npos) {
std::cout << "Required that device support “
<< “cl_ext_device_extension"
<< std::endl;
exit(−1);
}
cl_device_partition_property_ext subDeviceProperties[] =
{
CL_DEVICE_PARTITION_EQUALLY_EXT,
1,
CL_PROPERTIES_LIST_END_EXT,
0
};
devices[0].createSubDevices(subDeviceProperties, &subDevices_);
if (subDevices_.size() <= 0) {
std::cout << "Failed to allocate sub-devices" << std::endl;
exit(-1);
}
for (auto i = subDevices_.begin(); i != subDevices_.end(); i++) {
queues_.push_back(cl::CommandQueue(context_, *i));
}
std::cout << "Number of sub-devices "
<< subDevices_.size()
<< std::endl;
}
static unsigned int atomicAdd(
unsigned int inc,
volatile unsigned int *dest)
{
#if defined(_MSC_VER)
return (unsigned int)_InterlockedExchangeAdd(
(volatile long*)dest,
(long)inc);
#else
return __sync_fetch_and_add(dest, inc);
#endif
}
bool parallelFor(int range, std::function<void (int i)> f)
{
std::vector<cl::Event> events;
size_t args[2];
args[0] = reinterpret_cast<size_t>(&f);
int index = 0;
for (int x = 0; x < range;) {
int numQueues =
range - x > (int)queues_.size() ? (int)queues_.size() : range - x;
cl::Event event;
while(numQueues > 0) {
args[1] = static_cast<size_t>(index++);
queues_[numQueues-1].enqueueNativeKernel(
funcWrapper,
std::make_pair(
static_cast<void *>(args),
sizeof(size_t)*2),
NULL,
NULL,
NULL,
&event);
events.push_back(event);
numQueues--;
x++;
}
cl::Event::waitForEvents(events);
events.clear();
}
return true;
}
};
const unsigned int numNumbers = 1024;
int
main(int argc, char** argv)
{
volatile unsigned int numPrimes = 0;
int * numbers = new int[numNumbers];
::srand(2009);
// Random initialize
for (size_t i = 0; i < numNumbers; ++i) {
numbers[i] = ::rand();
}
try {
Parallel parallel;
parallel.parallelFor(numNumbers, [numbers, &numPrimes] (int x) {
auto isPrime = [] (unsigned int n) -> bool {
if (n < 2) {
return false;
}
if (n == 2) {
return true;
}
if (n % 2 == 0) {
return false;
}
for (unsigned int odd = 3;
odd <= static_cast<unsigned int>(
sqrtf(static_cast<float>(n)));
odd +=2) {
if (n % odd == 0) {
return false;
}
}
return true;
};
if (isPrime(numbers[x])) {
Parallel::atomicAdd(1, &numPrimes);
}
});
std::cout << "Number of primes found = " << numPrimes << std::endl;
}
catch (cl::Error err) {
std::cerr
<< "ERROR: "
<< err.what()
<< "("
<< err.err()
<< ")"
<< std::endl;
return EXIT_FAILURE;
}
delete[] numbers;
return EXIT_SUCCESS;
}

Double Precision

Floating point formats were created to allow programmers to work with very large and very small non-integral data values. For many applications, single precision floating point does not provide enough range for the targeted application. Many applications (particularly in science and engineering) require part or all of a particular computation to use double precision. OpenCL does not require that a particular compute device support double precision. For critical applications in which double precision is required, OpenCL provides the optional cl_khr_fp64 extension. This is enabled by including the following directive before any double precision use in an OpenCL C program:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Once enabled, the double precision support provides access to the following data types:
Type      Description
double    Double precision floating point number
double2   2-component double vector
double3   3-component double vector
double4   4-component double vector
double8   8-component double vector
double16  16-component double vector
The double type conforms to the IEEE 754 double precision storage format.
There is a one-to-one mapping with the corresponding single precision float types. On an AMD CPU, when the OpenCL device is CL_DEVICE_TYPE_CPU, vector types are mapped directly to SSE and AVX packed registers. In the case in which a vector type is larger than the underlying hardware vector, the resulting implementation is expanded into multiples of the hardware width. All of the conversion rules defined by the OpenCL specification for float types are also defined for doubles. The built-in math functions provided for float have been extended to include appropriate versions that work on double, double2, double3, double4, double8, and double16 as arguments and return values.
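The practical difference in precision is easy to demonstrate on the host, with no OpenCL involved. A 32-bit float carries a 24-bit significand, so once an accumulator reaches 2^24 (16,777,216), adding 1.0f no longer changes it, whereas a double (53-bit significand) keeps counting exactly:

```cpp
#include <cassert>

// Sum 'count' copies of 'value' at float precision. Beyond 2^24,
// acc + 1.0f rounds back to acc, so the sum silently saturates.
float sumFloat(float value, int count)
{
    float acc = 0.0f;
    for (int i = 0; i < count; ++i) acc += value;
    return acc;
}

// The same sum at double precision remains exact for integer
// values well past 2^24.
double sumDouble(double value, int count)
{
    double acc = 0.0;
    for (int i = 0; i < count; ++i) acc += value;
    return acc;
}
```

Summing 1.0 twenty million times gives exactly 20,000,000 in double precision, while the float accumulator stalls at 16,777,216; this is the kind of error that motivates cl_khr_fp64 in long reductions.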
As a simple example, Listing 11.2 shows OpenCL C code to implement a block matrix multiply using double precision. Note that the first line of the OpenCL source file enables the extension.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#include "matrixmul.h"
#define AS(i, j) As[(j) * BLOCK_SIZE + (i)]
#define BS(i, j) Bs[(j) * BLOCK_SIZE + (i)]
__kernel void
matrixMul(
__global double* C,
__global double* A,
__global double* B,
int wA,
int wB,
__local double* As,
__local double* Bs)
{
// Block index
int bx = get_group_id(0);
int by = get_group_id(1);
int tx = get_local_id(0);
int ty = get_local_id(1);
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
double Csub = 0;
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep) {
// Load the matrices from device memory
// to local memory; each work-item loads
// one element of each matrix
AS(ty, tx) = A[a + wA * ty + tx];
BS(ty, tx) = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
// Multiply the two matrices together;
// each work-item computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the block sub-matrix to device memory;
// each work-item writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}
The host OpenCL program, written using the OpenCL C++ Wrapper API, is given in Listing 11.3, with the shared header, matrixmul.h, given in Listing 11.4. The example is straightforward, but two points are worth discussing further. First, to avoid unexpected runtime errors, the application must check that the device supports the extension cl_khr_fp64, and this is achieved with the following code:
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp64") ==
std::string::npos) {
std::cout << "Required that device support cl_khr_fp64" << std::endl;
exit(-1);
}
In this case, if the device does not support the extension, then the application simply exits. A more robust solution might drop back to the host to perform the computation if necessary. Second, this example uses the profiling API for command queues to collect information on the time taken to execute the matrix multiply.
#include <CL/cl.hpp> // C++ Wrapper API (no need to include cl.h)
#include <cstring>
#include <cstdlib>
#include <cassert>
#include <cmath>
#include <iostream>
#include <fstream>
#include <vector>
#include "matrixmul.h"
template <size_t X, size_t Y>
class Matrix
{
private:
cl_double data_[X * Y];
public:
template <typename Functor>
Matrix(Functor init) { init(data_, X, Y); }
Matrix() { ::memset(data_, 0, X * Y * sizeof(cl_double)); }
Matrix<X,Y> operator -= (const Matrix<X,Y>& rhs)
{
for (size_t i = 0; i < (X * Y); ++i) {
data_[i] -= rhs.data_[i];
}
return *this;
}
Matrix<X,Y> operator - (const Matrix<X,Y>& rhs)
{
Matrix<X,Y> result = *this;
return result -= rhs;
}
bool operator == (const Matrix<X,Y>& rhs)
{
for (size_t i = 0; i < (X * Y); ++i) {
if (data_[i] != rhs.data_[i]) {
return false;
}
}
return true;
}
bool operator != (const Matrix<X,Y>& rhs)
{
return !(*this == rhs);
}
bool compareL2fe(const Matrix<X,Y>& reference, cl_double epsilon)
{
cl_double error = 0.0f;
cl_double ref = 0.0f;
for (size_t i = 0; i < (X * Y); ++i) {
cl_double diff = reference.data_[i] - data_[i];
error += diff * diff;
ref += reference.data_[i] * reference.data_[i];
}
cl_double normRef =::sqrt((double) ref);
if (::fabs((double) ref) < 1e-7f) {
return false;
}
cl_double normError = ::sqrt((double) error);
error = normError / normRef;
return error < epsilon;
}
void printOn(std::ostream& out)
{
for (size_t y = 0; y < Y; ++y) {
for (size_t x = 0; x < X; ++x) {
out << data_[y * X + x] << " ";
}
out << " ";
}
}
cl_double* data() { return data_; }
size_t size() const { return X * Y * sizeof(cl_double); }
};
static void
randomInit(cl_double* data, size_t width, size_t height)
{
for (size_t i = 0; i < width*height; ++i) {
data[i] = ::rand() / (double) RAND_MAX;
}
}
static void
nop(cl_double*, size_t, size_t)
{
}
static void
computeGold(
cl_double* C, const cl_double* A, const cl_double* B,
cl_uint hA, cl_uint wA, cl_uint wB)
{
for (cl_uint i = 0; i < hA; ++i) {
for (cl_uint j = 0; j < wB; ++j) {
cl_double sum = 0;
for (cl_uint k = 0; k < wA; ++k) {
cl_double a = A[i * wA + k];
cl_double b = B[k * wB + j];
sum += a * b;
}
C[i * wB + j] = (cl_double)sum;
}
}
}
int
main(int argc, char** argv)
{
cl_device_type devType = CL_DEVICE_TYPE_CPU;
cl_int err, err1, err2, err3;
::srand(2009);
Matrix<WA,HA> A(&randomInit);
Matrix<WB,HB> B(&randomInit);
// Initialize the OpenCL runtime.
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties cprops[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
cl::Context context(devType, cprops);
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
if (devices.size() == 0) {
std::cerr << "Device not available ";
return EXIT_FAILURE;
}
// Check that double is supported
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp64") ==
std::string::npos) {
std::cout << "Required that device support cl_khr_fp64" << std::endl;
exit(-1);
}
std::ifstream file("matrixmul_kernels.cl");
if (!file.is_open()) {
std::cerr << "We couldn't load CL source code ";
return EXIT_FAILURE;
}
std::string prog(
std::istreambuf_iterator<char>(file),
(std::istreambuf_iterator<char>()));
cl::Program::Sources source(
1,
std::make_pair(prog.c_str(),prog.length()+1));
cl::Program program(context, source);
err = program.build(devices, "-I.");
if (err != CL_SUCCESS) {
std::string str;
str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
std::cout << "Program Info: " << str;
exit(1);
}
cl::Kernel matrixMul(program, "matrixMul", &err);
if (err != CL_SUCCESS) {
std::cerr << "Could not create kernel "matmult" ";
return EXIT_FAILURE;
}
cl::Buffer in0(context, CL_MEM_USE_HOST_PTR, A.size(), A.data(), &err1);
cl::Buffer in1(context, CL_MEM_USE_HOST_PTR, B.size(), B.data(), &err2);
cl::Buffer out(context, CL_MEM_ALLOC_HOST_PTR,
WC * HC * sizeof(cl_double), NULL, &err3);
if (err1 != CL_SUCCESS || err2 != CL_SUCCESS || err3 != CL_SUCCESS) {
std::cerr << "Could not create memory objects ";
return EXIT_FAILURE;
}
err = matrixMul.setArg(0, out);
err |= matrixMul.setArg(1, in0);
err |= matrixMul.setArg(2, in1);
err |= matrixMul.setArg(3, WA);
err |= matrixMul.setArg(4, WB);
const size_t localSize = sizeof(double[BLOCK_SIZE][BLOCK_SIZE]);
err |= matrixMul.setArg(5, cl::__local(localSize));
err |= matrixMul.setArg(6, cl::__local(localSize));
if (err != CL_SUCCESS) {
std::cerr << "Could not set matrixMul's args ";
return EXIT_FAILURE;
}
cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
cl::CommandQueue queue(context, devices[0], properties, &err);
if (err != CL_SUCCESS) {
std::cerr << "Could not create the command queue ";
return EXIT_FAILURE;
}
cl::Event event;
err1 = queue.enqueueNDRangeKernel(
matrixMul, cl::NullRange,
cl::NDRange(WC, HC),
cl::NDRange(BLOCK_SIZE, BLOCK_SIZE),
NULL,
&event);
Matrix<WC,HC> C(&nop);
err2 = queue.enqueueReadBuffer(out, CL_TRUE, 0, C.size(), C.data());
if (err1 != CL_SUCCESS || err2 != CL_SUCCESS) {
std::cerr << "matrixMul execution failed ";
return EXIT_FAILURE;
}
Matrix<WC, HC> reference(&nop);
computeGold(reference.data(), A.data(), B.data(), HA, WA, WB);
if (!C.compareL2fe(reference, 1e-6f)) {
Matrix<WC, HC> difference = reference - C;
difference.printOn(std::cout);
std::cout << "FAILED ";
return EXIT_FAILURE;
}
if ((properties & CL_QUEUE_PROFILING_ENABLE) != 0) {
cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << "Elapsed time: "
<< (double)(end - start) / 1e6
<< " ms" << std::endl;
}
std::cout << "PASS! ";
return EXIT_SUCCESS;
}
// Thread block size
#define BLOCK_SIZE 16
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
#define WB (8 * BLOCK_SIZE) // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height
References
Bellows, G.; Gaster, B.R.; Munshi, A.; Ollmann, I.; Rosenberg, O.; Watt, B., Device Fission. (2010, June) Khronos OpenCL Working Group.
Gaster, B.R., The OpenCL C++ Wrapper API, Version 1.1. (2010, June) Khronos OpenCL Working Group.
Intel, Threading Building Blocks 3.0 for Open Source, http://threadingbuildingblocks.org (2011).
Microsoft, Concurrency Runtime: Visual Studio 2010, http://msdn.microsoft.com/en-us/library/dd504870.aspx (2010).