Accelerating computations with Numba and NumbaPro

In this section, we will discuss Numba and NumbaPro, two very exciting libraries for accelerating NumPy code. Numba and NumbaPro were created by Continuum Analytics, the same company that produces the Anaconda distribution. Numba is part of the standard Anaconda distribution, but NumbaPro is a commercial product that must be purchased separately as part of the Accelerate package. However, NumbaPro can be downloaded for a free trial period.

These libraries are unique in that they can accelerate existing code with the addition of only a few lines. As a first example, let's consider the following function to multiply two matrices:

def matrix_multiply(A, B):
    m, n = A.shape
    n, r = B.shape
    C = zeros((m, r), float64)
    for i in range(m):
        for j in range(r):
            acc = 0
            for k in range(n):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

The preceding code uses the straightforward definition of matrix multiplication and looks very much like code that would be written if we were implementing the algorithm in C. It is not Python-like and definitely not optimized. (In a real-world situation, one would simply use the NumPy built-in matrix multiplication.) Note, in particular, that the dimensions of the matrices are not checked: it is assumed that the number of columns of A is equal to the number of rows of B.
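If you do want to guard against incompatible inputs, a minimal sketch is shown below (matrix_multiply_checked is our own name, not part of the original code; it simply wraps the function above):

def matrix_multiply_checked(A, B):
    # hypothetical guard, not present in the original example
    if A.shape[1] != B.shape[0]:
        raise ValueError('number of columns of A must equal number of rows of B')
    return matrix_multiply(A, B)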

Let's first try the computation with small matrices, as follows:

A = array([[1,2,0],[1,-3,4],[0,-2,1],[3,7,-4]], dtype=float64)
B = array([[3,4],[-2,0],[2,4]], dtype=float64)
C = matrix_multiply(A, B)
print A
print B
print C

We start by defining the matrices A and B (note that the dimensions are compatible for multiplication). As in all examples in this section, we are careful to include a data type specification (this may improve optimization). Then, we simply call matrix_multiply, store the result in the array C, and print the three matrices. The result is the following:

[[ 1.  2.  0.]
 [ 1. -3.  4.]
 [ 0. -2.  1.]
 [ 3.  7. -4.]]
[[ 3.  4.]
 [-2.  0.]
 [ 2.  4.]]
[[ -1.   4.]
 [ 17.  20.]
 [  6.   4.]
 [-13.  -4.]]

You can verify that the algorithm is correct by manually checking a few entries. Alternatively, we can check whether the result agrees with the built-in matrix multiplication, as follows:

C - A.dot(B)
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
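
For larger matrices, inspecting the difference entry by entry quickly becomes impractical; a more convenient check (a small addition of our own, using only standard NumPy functions available in pylab mode) is allclose():

print allclose(C, A.dot(B))   # True if every entry agrees up to floating-point tolerance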

Everything seems to be fine. Now, we want to define some larger random matrices, as follows:

n = 100
A = rand(n, n)
B = rand(n, n)

The preceding lines of code produce matrices of 64-bit floats, the default floating-point type in NumPy. Next, we multiply the matrices and time the result as follows:

%%timeit
C = matrix_multiply(A, B)

The output of the preceding computation is as follows:

1 loops, best of 3: 472 ms per loop

The timing results will, of course, differ depending on the machine running the code. This example was run on an Intel Core i7 processor at 3.5 GHz with 16 GB of memory running a Microsoft Windows 7, 64-bit operating system.

Let's now see how we can quickly optimize this function. First, import the jit decorator from Numba, as follows:

from numba import jit

Then, define the function with the @jit decorator preceding it, as follows:

@jit
def matrix_multiply_jit(A, B):
    m, n = A.shape
    n, r = B.shape
    C = zeros((m, r), float64)
    for i in range(m):
        for j in range(r):
            acc = 0
            for k in range(n):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

Note that the only change to the code is the addition of the decorator. (We also changed the name of the function to avoid confusion, but this is not necessary.) Decorators are an advanced Python topic, but we do not need to go into the details of how they work. More information about decorators is available in the excellent blog post by Simeon Franklin at http://simeonfranklin.com/blog/2012/jul/1/python-decorators-in-12-steps/.
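
To give just a flavor of the mechanism (this toy example is ours and has nothing to do with Numba), a decorator is simply a function that takes a function and returns a new one; the @ syntax is shorthand for reassigning the name:

def shout(f):
    # a decorator: wraps f so that its result is returned in upper case
    def wrapper(*args, **kwargs):
        return str(f(*args, **kwargs)).upper()
    return wrapper

@shout
def greet(name):
    return 'hello, ' + name

# @shout is equivalent to writing: greet = shout(greet)
print greet('world')   # prints HELLO, WORLD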

Let's now time our code, as follows:

%%timeit
C = matrix_multiply_jit(A, B)

The following is the resultant output:

1000 loops, best of 3: 1.8 ms per loop

This is roughly a 260-fold improvement obtained with a single line of code! Keep things in perspective, however: this kind of acceleration cannot be expected for generic code. Remember that we purposefully wrote our code in a way that does not use the already optimized functions from NumPy. For the sake of comparison and full disclosure, let's compare this with the built-in dot() method:

%%timeit
C = A.dot(B)

The resultant output is as follows:

10000 loops, best of 3: 28.6 µs per loop

So, even with acceleration, our function cannot compete with built-in NumPy. We emphasize again that the goal of this section is to present an overview of acceleration techniques and not delve deeply into sophisticated optimization methods.

It is worth having some understanding of how the @jit decorator works. When a function decorated with @jit is called, the library infers the data types of the arguments and return value, produces a compiled version of the function on the fly, and then calls it. The compiled code is cached, so subsequent calls with the same argument types skip the compilation step. The result is a function call whose performance is comparable to that of code written in C.
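
You can observe this lazy, on-the-fly compilation directly: the first call pays the compilation cost, while later calls with the same argument types reuse the cached machine code. A small sketch, using only the standard time module (fresh_jit is our own name for a newly created compiled copy of the function):

import time
from numba import jit

fresh_jit = jit(matrix_multiply)     # a new jit-compiled copy, not yet compiled for any types

t0 = time.time()
fresh_jit(A, B)                      # first call: triggers compilation for float64 arrays
t1 = time.time()
fresh_jit(A, B)                      # second call: reuses the cached compiled code
t2 = time.time()
print 'first call (with compilation): %.4f s' % (t1 - t0)
print 'second call (compiled code reused): %.4f s' % (t2 - t1)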

Instead of letting the types of the arguments and return value be inferred, it is possible to specify the data types explicitly, which may result in improved performance. The following table lists the data types supported by Numba and the abbreviations used for them:

Data type     Abbreviation
boolean       b1
bool_         b1
byte          u1
uint8         u1
uint16        u2
uint32        u4
uint64        u8
char          i1
int8          i1
int16         i2
int32         i4
int64         i8
float_        f4
float32       f4
double        f8
float64       f8
complex64     c8
complex128    c16

These names are all defined in the Numba module. For example, to define a function that adds two floating-point values, we use the following code:

from numba import jit, f8
@jit(f8(f8, f8))
def my_sum(a, b):
    return a + b

Note the decorator syntax, which is as follows:

@jit(f8(f8, f8))

This specifies a function that takes two float64 arguments and returns a float64 value. The function is then called, as follows:

my_sum(3.5, 6.9)

This produces the expected result, 10.4. However, if we try something like the following code, we get an error, since the compiled function accepts only scalar float64 arguments:

a = array([1.4, 2.0])
b = array([2.3, 5.2])
my_sum(a, b)

It is, however, possible to use arrays with the @jit decorator. To define a function that adds two one-dimensional arrays, one would use the following lines of code:

@jit(f8[:](f8[:], f8[:]))
def vector_sum(a, b):
    return a + b

Note how a vector is specified. A two-dimensional array is denoted by f8[:,:], a three-dimensional array by f8[:,:,:], and so on.
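
As an illustration, the matrix multiplication function from the beginning of this section could be declared with an explicit signature. The following is a sketch of our own (matrix_multiply_typed is not a name used elsewhere in this chapter), assuming two-dimensional float64 inputs and output:

from numba import jit, f8

@jit(f8[:,:](f8[:,:], f8[:,:]))
def matrix_multiply_typed(A, B):
    m, n = A.shape
    n, r = B.shape
    C = zeros((m, r), float64)
    for i in range(m):
        for j in range(r):
            acc = 0
            for k in range(n):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

When the signature is given explicitly like this, Numba compiles the function at definition time rather than at the first call.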

NumbaPro is the commercial version of Numba and adds several enhancements. We will focus on parallel processing using the Graphics Processing Unit (GPU) as an example of an exciting new technology that is made easily available in a notebook.

To run the examples that follow, the reader must have NumbaPro, a CUDA-compatible GPU (henceforth called the "device"), and the latest CUDA-compatible driver.

A list of CUDA-compatible devices can be found at https://developer.nvidia.com/cuda-gpus. After verifying that you have a compatible device, download and install the latest version of the CUDA SDK from https://developer.nvidia.com/cuda-downloads for the appropriate platform. The CUDA toolkit comes with several examples that you can use to test the installation.

The NumbaPro download is available at https://store.continuum.io/cshop/accelerate/. Download and install the Accelerate library.

To test the setup, start an IPython notebook and run the following in a cell:

import numbapro
numbapro.check_cuda()

If everything is fine, this will print a list of the CUDA libraries installed by Anaconda as well as a list of the CUDA-compatible devices available in your system. You will also see a PASSED message at the end of the display.

Even though CUDA programming is a relatively easy path to massive parallelism, there are still some concepts that have to be mastered before we can tackle our first CUDA program. We will outline the basics of the architecture here, discussing only enough to run the examples that follow. For a complete specification, see the CUDA Programming Guide, available at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model.

GPUs were originally designed to process rendering operations with greater speed than the computer's CPU is capable of. This processing acceleration is achieved, in large measure, by massively parallelizing the graphical operations required by the rendering pipeline.

A CUDA-compatible GPU consists of an array of Streaming Multiprocessors (SMs). Each one of the SMs, by itself, cannot compete with current CPUs in terms of speed. However, the fact that many SMs can cooperate to solve a problem more than compensates for that. The SMs can also access memory that resides in the GPU, referred to as device memory.

In CUDA, there is a strict separation between code that runs on the CPU and code that runs on the device (the GPU). One particular restriction is that CPU code can only access regular computer memory, while device code can only access device memory. Code that runs on the device is specified in a function called a kernel. The kernel is compiled into a low-level language that is understood by the SMs and runs on each SM asynchronously (meaning that each SM proceeds at its own pace unless special synchronization instructions are encountered). Thus, a simple computation in CUDA usually requires the following three steps:

  1. Transfer input data from the computer memory to the device memory.
  2. Launch the kernel in the device.
  3. Transfer output data from the device memory to the computer memory so that it is accessible to the CPU again.

As you will see, the memory transfers are made transparent by Python CUDA. (It is still possible to control them programmatically if needed.)
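
For reference, explicit transfers look roughly like the following sketch; it assumes the cuda.to_device() and copy_to_host() helpers exposed by NumbaPro's CUDA module, and the remaining examples in this section simply pass NumPy arrays and let the library handle the transfers:

from numbapro import cuda

a = rand(1000)
d_a = cuda.to_device(a)        # step 1: copy the input to device memory
# ... launch a kernel that reads and writes d_a ...
a = d_a.copy_to_host()         # step 3: copy the result back to computer memory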

The kernel is launched simultaneously in an array of SMs, and each thread proceeds independently with the computation. Each SM can run several threads in parallel and can access all the device's memory. (The architecture is more complicated, and there are other kinds of memory available that will not be discussed here.) In the simplest case, each thread accesses only a few memory areas, each containing a 64-bit floating-point value, and the memory accessed by one thread is never accessed by any other thread, so there is no need for synchronization. In more complex problems, synchronization may be a major issue.

The set of threads being run features a two-level array structure:

  • Threads are organized in blocks. Each block is an array of threads with up to three dimensions. The dimensions of a block are stored in a variable called blockDim. Threads in a block are identified by the variable threadIdx. This is a structure with three integer fields: threadIdx.x, threadIdx.y, and threadIdx.z. These fields uniquely identify each thread in the block.
  • Blocks are organized in a grid. The grid is an array of blocks with up to three dimensions. The dimensions of the grid are stored in a variable called gridDim. Blocks in a grid are identified by the variable blockIdx. This is a structure with three integer fields: blockIdx.x, blockIdx.y, and blockIdx.z. These fields uniquely identify each block in the grid.

An example of this organizational structure is given in the following figure:

[Figure: the two-level thread hierarchy — a grid made up of rows and columns of blocks, each block consisting of a one-dimensional array of four threads]

In the preceding example, gridDim is (2, 3, 1) since there are two rows and three columns of blocks (and a single layer in the third dimension). All the blocks in the grid are one-dimensional, so blockDim is (4, 1, 1). The third thread in the first block of the bottom row, for example, is identified by the following values (remember that CUDA indices start at zero):

blockIdx.x=0, blockIdx.y=1, blockIdx.z=0
threadIdx.x=2, threadIdx.y=0, threadIdx.z=0

At runtime, each individual thread has access to this identifying information.

The key points of the CUDA architecture are the following:

  • All threads in a block are executed on a single SM, which remains assigned to the block until all the threads in the block have terminated
  • Different blocks can run concurrently or serially, depending on the availability of SMs to carry out the computation

We are now ready to define the kernel using Python CUDA. We will write a function that computes the sum of two vectors in the GPU. Run the following code in a cell:

from numbapro import cuda
@cuda.jit('void(float64[:], float64[:], float64[:])')
def sum(a, b, result):
    i = cuda.threadIdx.x    
    result[i] = a[i] + b[i]

We assume that there is only one block of threads, and each thread is responsible for adding the elements of the arrays at a single position. The array position that a thread is responsible for is given by the value of threadIdx.x. Note that the kernel has no return value. We need to specify an array, result, to hold the output of the computation.

Let's now see how this function is called. Note that the grid and block geometry is not defined in the kernel. (The kernel can obtain geometry information if necessary; more on that later.) The geometry is specified when the kernel is launched:

a = array([1,2,3,4,5], dtype=float64)
b = array([5,4,3,2,1], dtype=float64)
result = array([0,0,0,0,0], dtype=float64)
sum[1,5](a, b, result)
print result

The preceding lines of code give the following output:

[ 6.  6.  6.  6.  6.]

The main point in this code is the following line:

sum[1,5](a, b, result)

The preceding line launches the kernel in a grid with 1 block, with 5 threads in the block. Both the grid and the blocks are one-dimensional. Let's now add larger vectors:

n = 64
a = arange(0,n,dtype=float64)
b = arange(n,0,-1, dtype=float64)
result = zeros(n,dtype=float64)
sum[1,n](a, b, result)
print result[:5]

The preceding lines of code are essentially the same as before but a little more generic, in that the size of the arrays can be changed. What we want to do is increase the size of n. If you try a value such as n=10000, an error of type CUDA_ERROR_INVALID_VALUE occurs. The problem is that there is a hard limit on the number of threads that can be executed in a single block (typically 512 or 1024, depending on the device). To be able to handle large vectors, we need to modify the code so that it uses multiple blocks. To this end, change the definition of the sum() function in the following way:

from numbapro import cuda
@cuda.jit('void(float64[:], float64[:], float64[:], int32)')
def sum(a, b, result, n):
    tx = cuda.threadIdx.x    # index of the thread within its block
    bx = cuda.blockIdx.x     # index of the block within the grid
    bsz = cuda.blockDim.x    # number of threads per block
    i = tx + bx * bsz        # global index of the thread into the arrays
    if i < n:                # guard against threads past the end of the arrays
        result[i] = a[i] + b[i]

The first thing to note is that we include an argument of type int32 to hold the size of the arrays being added. The main point now is that threads in different blocks must address different areas of memory, so the computation of the index i associated with a thread is more complicated. Essentially, we must know the number of blocks that come before the current block, multiply that by the block dimension, and add the current thread index. Then, before adding the relevant memory positions, we check whether the index is valid. This prevents the thread from accessing areas that are not part of the input/output arrays and is an essential check in more complex code. To test the code, run the following:

n = 100000
a = arange(0,n,dtype=float64)
b = arange(n,0,-1, dtype=float64)
result = zeros(n,dtype=float64)
sum[2000,64](a, b, result, n)
print result[:5]

The preceding code should run without a hitch. Note that we are specifying a grid with 2000 blocks and 64 threads per block. The number of blocks in a grid can be very large (for practical purposes, unlimited), the device being responsible for scheduling the blocks on the available SMs in an optimal way. Note, however, that the number of threads launched must be large enough to cover the input/output arrays. In our case, this means blockDim.x * gridDim.x >= n, and indeed 2000 * 64 = 128000 >= 100000.
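
As an aside, the index arithmetic tx + bx * bsz is so common that the CUDA support in NumbaPro provides a convenience function for it, cuda.grid(). The following sketch assumes this helper is available in your version of the library and is otherwise equivalent to the kernel above:

from numbapro import cuda

@cuda.jit('void(float64[:], float64[:], float64[:], int32)')
def sum_grid(a, b, result, n):
    i = cuda.grid(1)    # shorthand for threadIdx.x + blockIdx.x * blockDim.x
    if i < n:
        result[i] = a[i] + b[i]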

We are now ready to compute with large vectors. Try the following code:

n = 100000
a = rand(n)
b = rand(n)
result = zeros(n, dtype=float64)
gd = 10000    # grid dimension: number of blocks
bd = 64       # block dimension: threads per block
if gd * bd < n:
    print 'Grid/block dimensions too small'
else:
    sum[gd, bd](a, b, result, n)
print result[:10]

The reader should experiment with different values of n, gd, and bd. Remember that the maximum number of threads per block (bd) depends on the device in your computer. An interesting experiment is to check how the computation scales for larger values of n.
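
A rough sketch of such a scaling experiment is given below; time_gpu_sum is our own helper name, and the measured times include the memory transfers between the computer and the device, which will dominate for a kernel as simple as this one:

import time

def time_gpu_sum(n, bd=64):
    a = rand(n)
    b = rand(n)
    result = zeros(n, dtype=float64)
    gd = (n + bd - 1) // bd      # just enough blocks to cover the arrays
    t0 = time.time()
    sum[gd, bd](a, b, result, n)
    return time.time() - t0

for n in [10**4, 10**5, 10**6]:
    print n, time_gpu_sum(n)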
