How it works...

Let's start with looking at which libraries are imported for this example:

import pycuda.driver as cuda 
import pycuda.autoinit 
from pycuda.compiler import SourceModule 

In particular, the pycuda.autoinit import automatically initializes CUDA and selects an available GPU on our system for execution, while SourceModule is the component that passes CUDA C code to NVIDIA's compiler (nvcc), allowing us to define the objects that must be compiled and uploaded to the device.
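
To see which device autoinit selected, the pycuda.autoinit module exposes the chosen device object; a minimal sketch, assuming a CUDA-capable GPU is present:

import pycuda.autoinit 

# pycuda.autoinit creates a context on the first available GPU; 
# the selected device is exposed as pycuda.autoinit.device 
print(pycuda.autoinit.device.name()) 
print(pycuda.autoinit.device.compute_capability()) 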

Then, we build the 5 × 5 input matrix by using the numpy library:

import numpy 
a = numpy.random.randn(5,5) 

In this case, the matrix elements are converted to single precision (the graphics card on which this example was executed only supports single-precision floating point):

a = a.astype(numpy.float32) 
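
Whether this conversion is actually required depends on the device's compute capability: double precision is supported from compute capability 1.3 onwards. A hedged sketch of such a check (the fallback logic is illustrative, not part of the original example):

# Devices below compute capability 1.3 lack double-precision support 
if pycuda.autoinit.device.compute_capability() < (1, 3): 
    a = a.astype(numpy.float32) 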

Then, we allocate memory on the device and copy the array from the host using the following two operations:

a_gpu = cuda.mem_alloc(a.nbytes) 
cuda.memcpy_htod(a_gpu, a) 

Note that the host and device memories never communicate during the execution of a kernel function. For this reason, in order to execute the kernel function in parallel on the device, all of its input data must already be present in device memory.

It should also be noted that the a_gpu buffer is linearized, that is, it is one-dimensional, and therefore we must manage it as such.
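
To make the linearization concrete: numpy stores the array in row-major (C) order, so element (row, col) of the 5 × 5 matrix lives at flat index row * 5 + col. A quick host-side sanity check (illustrative only):

# Row-major layout: (row, col) maps to flat index row * 5 + col 
row, col = 2, 3 
assert a.flatten()[row * 5 + col] == a[row, col] 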

Moreover, none of these operations requires a kernel invocation; they are performed directly by the host.
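
As an aside, PyCUDA's gpuarray module can perform the allocation and the copy in a single step; a minimal sketch, functionally equivalent to the two driver calls above (the rest of this recipe keeps the explicit calls):

import pycuda.gpuarray as gpuarray 

# to_gpu allocates device memory and copies the host array in one step 
a_gpu = gpuarray.to_gpu(a) 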

The SourceModule entity allows the definition of the doubleMatrix kernel function. __global__, which is an nvcc directive, indicates that the doubleMatrix function will be executed on the device:

mod = SourceModule(""" 
  __global__ void doubleMatrix(float *a) 

Let's consider the kernel's body. The idx parameter is the flat index into the linearized matrix, computed from the threadIdx.x and threadIdx.y thread coordinates; the row stride is blockDim.x (5 in this example), which matches the row-major layout described earlier:

    int idx = threadIdx.x + threadIdx.y * blockDim.x; 
    a[idx] *= 2; 
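
For reference, the fragments above assemble into the complete module (a minimal sketch restoring the braces and closing string that the step-by-step listing omits):

mod = SourceModule(""" 
  __global__ void doubleMatrix(float *a) 
  { 
    // one thread per element of the linearized 5 x 5 matrix 
    int idx = threadIdx.x + threadIdx.y * blockDim.x; 
    a[idx] *= 2; 
  } 
  """) 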

Then, mod.get_function("doubleMatrix") returns a reference to the compiled kernel function, which we store in func:

func = mod.get_function("doubleMatrix") 

In order to execute the kernel, we need to configure the execution context. This means setting the three-dimensional structure of the thread block by using the block parameter inside the func call:

func(a_gpu, block=(5, 5, 1)) 

block=(5, 5, 1) tells us that we are calling the kernel function on the a_gpu linearized input matrix with a single thread block of 5 threads in the x-direction, 5 threads in the y-direction, and 1 thread in the z-direction, which makes 25 threads in total. Note that each thread executes the same kernel code on its own matrix element.
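
A single block only scales up to the per-block thread limit of the device, so larger matrices require a grid of blocks via the grid parameter; a hedged sketch of such a launch (the kernel would then also need blockIdx and blockDim to compute idx, which the kernel above does not use):

# Hypothetical launch for a larger matrix: a 2 x 2 grid of 16 x 16 blocks 
func(a_gpu, block=(16, 16, 1), grid=(2, 2)) 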

After the computation on the GPU, we allocate a host array of the same shape to store the results and copy them back from the device:

a_doubled = numpy.empty_like(a) 
cuda.memcpy_dtoh(a_doubled, a_gpu) 
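
Printing both arrays then produces the output reported below; a minimal sketch (the label strings are taken from the sample output):

print("ORIGINAL MATRIX") 
print(a) 
print("DOUBLED MATRIX AFTER PyCUDA EXECUTION") 
print(a_doubled) 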

To run the example, type the following at the Command Prompt:

C:\>python heterogenousPycuda.py

The output should be like this:

ORIGINAL MATRIX
[[-0.59975582 1.93627465 0.65337795 0.13205571 -0.46468592]
[ 0.01441949 1.40946579 0.5343408 -0.46614054 -0.31727529]
[-0.06868593 1.21149373 -0.6035406 -1.29117763 0.47762445]
[ 0.36176383 -1.443097 1.21592784 -1.04906416 -1.18935871]
[-0.06960868 -1.44647694 -1.22041082 1.17092752 0.3686313 ]]

DOUBLED MATRIX AFTER PyCUDA EXECUTION
[[-1.19951165 3.8725493 1.3067559 0.26411143 -0.92937183]
[ 0.02883899 2.81893158 1.0686816 -0.93228108 -0.63455057]
[-0.13737187 2.42298746 -1.2070812 -2.58235526 0.95524889]
[ 0.72352767 -2.886194 2.43185568 -2.09812832 -2.37871742]
[-0.13921736 -2.89295388 -2.44082164 2.34185504 0.73726263]]