There's more...

In this section, we have seen that the PyOpenCL execution model, like that of PyCUDA, involves a host processor that manages one or more heterogeneous devices. In particular, the host submits work to the devices in the form of OpenCL C source code, defined through the kernel function.

This source code is loaded into a program object, the program is compiled for the target architecture, and a kernel object associated with the program is created.

A kernel object is executed over a variable number of workgroups arranged in an n-dimensional index space (with n equal to 1, 2, or 3), which makes it possible to subdivide the workload of a problem effectively across the workgroups. Each workgroup, in turn, is composed of a number of work items that execute in parallel, as the following sketch shows.
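The following is a minimal sketch of this compile-and-launch flow in PyOpenCL. The kernel name (double_values), the array size, and the work-group size are illustrative assumptions, not values taken from the recipe:

# Minimal sketch of the compile-and-launch flow described above.
# The kernel name, array size, and work-group size are illustrative assumptions.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()            # the host selects the device(s)
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void double_values(__global float *data) {
    int gid = get_global_id(0);           // index of this work item in the global space
    data[gid] = 2.0f * data[gid];
}
"""

program = cl.Program(ctx, kernel_src).build()    # compile for the target architecture
kernel = program.double_values                   # kernel object created from the program

host_data = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
dev_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host_data)

global_size = (host_data.size,)   # 1-D index space: one work item per element
local_size = (64,)                # 64 work items per workgroup -> 16 workgroups

kernel(queue, global_size, local_size, dev_buf)  # enqueue the kernel
cl.enqueue_copy(queue, host_data, dev_buf)       # copy the results back to the host
queue.finish()

Here, global_size defines the full index space of work items, while local_size defines the shape of each workgroup; the ratio between the two determines how many workgroups are created.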

Balancing the workload across workgroups according to the parallel computing capability of the device is one of the critical factors in achieving good application performance.

Poor workload balancing, combined with the specific characteristics of each device (such as transfer latency, throughput, and bandwidth), can cause a substantial loss of performance or compromise the portability of the code if it is executed without any mechanism for dynamically acquiring information about the device's compute capabilities.
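As a hedged sketch of such a mechanism, the device capabilities can be queried at runtime through PyOpenCL's device attributes and used to pick a work-group size; the baseline value of 64 below is an illustrative assumption:

# Query device capabilities at runtime to adapt the work-group size.
import pyopencl as cl

ctx = cl.create_some_context()
dev = ctx.devices[0]

print("Device name:         ", dev.name)
print("Max compute units:   ", dev.max_compute_units)
print("Max work-group size: ", dev.max_work_group_size)
print("Max work-item sizes: ", dev.max_work_item_sizes)
print("Global memory (MB):  ", dev.global_mem_size // (1024 * 1024))
print("Local memory (KB):   ", dev.local_mem_size // 1024)

# Never request a work-group larger than the device supports
local_size = (min(64, dev.max_work_group_size),)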

However, careful use of these technologies makes it possible to reach high levels of performance by combining the results computed by the different computational units.
