In this section, we will discuss Numba
and NumbaPro
, two very exciting libraries to accelerate the NumPy
code. Numba
and NumbaPro
were created by Continuum Analytics, the same company that produces the Anaconda distribution. Numba
is part of the standard Anaconda distribution, but NumbaPro
is a commercial product that must be purchased separately as part of the
Accelerate
package. However, NumbaPro
can be downloaded for a free trial period.
These libraries are unique in that they allow the acceleration of code with the addition of a few lines of code. As the first example, let's consider the following lines of code to multiply two matrices:
def matrix_multiply(A, B): m, n = A.shape n, r = B.shape C = zeros((m, r), float64) for i in range(m): for j in range(r): acc = 0 for k in range(n): acc += A[i, k] * B[k, j] C[i, j] = acc return C
The preceding code uses the straightforward definition of matrix multiplication and looks very much like code that would be written if we were implementing the algorithm in C. It is not Python-like and definitely not optimized. (In a real-world situation, one would simply use the NumPy
built-in matrix multiplication.) Note, in particular, that the dimensions of the matrices are not checked: it is assumed that the number of columns of A
is equal to the number of rows of B
.
Let's first try the computation with small matrices, as follows:
A = array([[1,2,0],[1,-3,4],[0,-2,1],[3,7,-4]], dtype=float64) B = array([[3,4],[-2,0],[2,4]], dtype = float64) C = matrix_multiply(A, B) print A print B print C
We start by defining the matrices A
and B
(note that the dimensions are compatible for multiplication). As in all examples in this section, we are careful to include a data type specification (this may improve optimization). Then, we simply call matrix_multiply
, store the result in the array C
, and print the three matrices. The result is the following:
[[ 1. 2. 0.] [ 1. -3. 4.] [ 0. -2. 1.] [ 3. 7. -4.]] [[ 3 4] [ -2 0] [ 2 4]] [[ -1. 4.] [ 17. 20.] [ 6. 4.] [ -13. -4.]]
You can verify that the algorithm is correct by manually checking a few entries. Alternatively, we can check whether the result agrees with the built-in matrix multiplication, as follows:
C - A.dot(B) array([[ 0. , 0.], [ 0. , 0.], [ 0. , 0.], [ 0. , 0.]])
Everything seems to be fine. Now, we want to define some larger random matrices, as follows:
n = 100 A = rand(n, n) B = rand(n, n)
In a 64-bit architecture, the preceding lines of code will automatically produce matrices of 64-bit floats. Next, we multiply the matrices and time the result as follows:
%%timeit C = matrix_multiply(A, B)
The output of the preceding computation is as follows:
1 loops, best of 3: 472 ms per loop
The timing results will, of course, differ depending on the machine running the code. This example was run on an Intel Core i7 processor at 3.5 GHz with 16 GB of memory running a Microsoft Windows 7, 64-bit operating system.
Let's now see how we can quickly optimize this function. First, load the jit
function from the Numba
module, as follows:
from numba import jit
Then, define the function with the @jit
decorator preceding it, as follows:
@jit def matrix_multiply_jit(A, B): m, n = A.shape n, r = B.shape C = zeros((m, r), float64) for i in range(m): for j in range(r): acc = 0 for k in range(n): acc += A[i, k] * B[k, j] C[i, j] = acc return C
Note that the only change to the code is the addition of the decorator. (We also changed the name of the function to avoid confusion, but this is not necessary.) Decorators are an advanced Python topic, but we do not need to go into the details of how they work. More information about decorators is available in the excellent blog postings by Simeon Franklin at http://simeonfranklin.com/blog/2012/jul/1/python-decorators-in-12-steps/.
Let's now time our code, as follows:
%%timeit C = matrix_multiply_jit(A, B)
The following is the resultant output:
1000 loops, best of 3: 1.8 ms per loop
This is a 260-fold improvement with a single line of code! You should keep things in perspective here since this kind of acceleration cannot be expected for generic code. Remember that we wrote our code purposefully in a way that does not use the already-optimized functions from NumPy
. For the sake of comparison and full disclosure, let's compare this with the built-in dot()
method:
%%timeit C = A.dot(B)
The resultant output is as follows:
10000 loops, best of 3: 28.6 µs per loop
So, even with acceleration, our function cannot compete with built-in NumPy
. We emphasize again that the goal of this section is to present an overview of acceleration techniques and not delve deeply into sophisticated optimization methods.
It is worth having some understanding of how the @jit
decorator works. When a function decorated by @jit
is called, the library attempts to infer the data type of the arguments and return value, and on the fly produces a compiled version of the function and then calls it. The result is a function call that is comparable to code written in C.
Instead of letting the type of arguments and return value be inferred, it is possible to specify the data types, which may result in improved performance. The following table consists of the data types supported and the abbreviations used by Numba
:
Data type |
Abbreviation |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
These names are all defined in the Numba
module. For example, to define a function that adds two floating-point values, we use the following code:
from numba import jit, f8 @jit (f8(f8,f8)) def my_sum(a, b): return a + b
Note the decorator syntax, which is as follows:
@jit (f8(f8,f8))
This specifies a function that takes two float64
arguments and returns a float64
value. The function is then called, as follows:
my_sum(3.5, 6.9)
This produces the expected result. However, if we try something like the following code, we get an error:
a = array([1.4, 2.0]) b = array([2.3, 5,2]) my_sum(a,b)
It is, however, possible to use arrays with the @jit
decorator. To define a function that adds two one-dimensional arrays, one would use the following lines of code:
@jit (f8[:](f8[:],f8[:])) def vector_sum(a, b): return a + b
Note how a vector is specified. A two-dimensional array is denoted by f8[:,:]
, a three-dimensional array by f8[:,:,:]
, and so on.
NumbaPro
is the commercial version of Numba
and adds several enhancements. We will focus on parallel processing using the Graphics Processing Unit (GPU) as an example of an exciting new technology that is made easily available in a notebook.
To run the examples that follow, the reader must have NumbaPro
, a CUDA
-compatible GPU (henceforth called the "device"), and the latest CUDA
-compatible driver.
A list of CUDA
-compatible devices can be found at https://developer.nvidia.com/cuda-gpus. After verifying that you have a compatible device, download and install the latest version of the CUDA SDK
from https://developer.nvidia.com/cuda-downloads for the appropriate platform. The CUDA
toolkit comes with several examples that you can use to test the installation.
The NumbaPro
download is available at https://store.continuum.io/cshop/accelerate/. Download and install the Accelerate
library.
To test the setup, start an IPython notebook and run the following in a cell:
import numbapro numbapro.check_cuda()
If everything is fine, this will print a list of the CUDA
libraries installed by Anaconda as well as a list of the CUDA
-compatible devices available in your system. You will also see a PASSED
message at the end of the display.
Even though CUDA
programming is a relatively easy path to massive parallelism, there are still some concepts that have to be mastered before we can tackle our first CUDA
program. We will outline the basics of the architecture here, discussing only enough to run the examples that follow. For a complete specification, see the CUDA Programming Guide, available at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model.
GPUs were originally designed to process rendering operations with greater speed than the computer's CPU is capable of. This processing acceleration is achieved, in large measure, by massively parallelizing the graphical operations required by the rendering pipeline.
A CUDA
-compatible GPU consists of an array of Streaming Multiprocessors (SMs). Each one of the SMs, by itself, cannot compete with current CPUs in terms of speed. However, the fact that many SMs can cooperate to solve a problem more than compensates for that. The SMs can also access memory that resides in the GPU, referred to as device memory.
In CUDA
, there is a strict separation between code that runs in the CPU and code that runs in the device (the GPU). One particular restriction is that while CPU code can only access regular computer memory, device code can only access device memory. Code that runs in the device is specified in a function called a kernel. The kernel is compiled into a low-level language that is understood by the SMs and runs into each SM asynchronously (meaning that each SM proceeds at its own pace unless special synchronization instructions are found). Thus, a simple computation in CUDA
usually requires the following three steps:
As you will see, the memory transfers are made transparent by Python CUDA
. (There is still the possibility to control it programmatically if needed.)
The kernel is launched simultaneously in an array of SMs, and each thread proceeds independently with the computation. Each SM can run several threads in parallel and can access all the device's memory. (The architecture is more complicated, and there are other kinds of memory available that will not be discussed here.) In the simplest case, each thread will access only a few memory areas, each containing a 64-bit floating value, and the memory accessed by a thread is never accessed by any other thread. So, there is no need for synchronization. In more complex problems, synchronization may be a major issue.
The set of threads being run features a two-level array structure:
3
dimensions The dimensions of a block are stored in a variable called blockDim
. Threads in a block are identified by the variable, threadIdx
. This is a structure with three integer fields: threadIdx.x
, threadIdx.y
, and threadIdx.z
. These fields uniquely identify each thread in the block.3
dimensions. The dimensions of the grid are stored in a variable called gridDim
. Blocks in a grid are identified by the variable, gridIdx
. This is a structure with three integer fields: gridIdx.x
, gridIdx.y
, and gridIdx.z
. These fields uniquely identify each block in the grid.An example of this organizational structure is given in the following figure:
In the preceding example, gridDim
is (2, 3, 1)
since there are two rows and three columns of blocks (and a single space dimension). All the blocks in the grid are one-dimensional, so blockDim
is (4, 1, 1)
. The third thread in the first block of the bottom row, for example, is identified by the following lines of code:
blockIdx.x=0, blockIdx.y=1, blockIdx.z=1 threadIdx.x=2, threadIdx.y=1, threadIdx.z=1
At runtime, each individual thread has access to this identifying information.
A key point of the
CUDA
architecture is the following:
We are now ready to define the kernel using Python CUDA
. We will write a function that computes the sum of two vectors in the GPU. Run the following code in a cell:
from numbapro import cuda @cuda.jit('void(float64[:], float64[:], float64[:])') def sum(a, b, result): i = cuda.threadIdx.x result[i] = a[i] + b[i]
We assume that there is only one block of threads, and each thread is responsible for adding the elements of the array at a single position. The array position that a thread is responsible for is identified by the value of threadIdx.x
. Note that the kernel has no return value. We need to specify an array, result
, to hold the return value of the computation.
Let's now see how this function is called. Note that the grid and block geometry is not defined in the kernel. (The kernel can obtain geometry information if necessary; more on that later.) This is done when the kernel is launched:
a = array([1,2,3,4,5], dtype=float64) b = array([5,4,3,2,1], dtype=float64) result = array([0,0,0,0,0], dtype=float64) sum[1,5](a, b, result) print result
The preceding lines of code give the following output:
[ 6. 6. 6. 6. 6.]
The main point in this code is the following line:
sum[1,5](a, b, result)
The preceding line launches the kernel in a grid with 1
block, with 5
threads in the block. Both the grid and the blocks are one-dimensional. Let's now add larger vectors:
n = 64 a = arange(0,n,dtype=float64) b = arange(n,0,-1, dtype=float64) result = zeros(n,dtype=float64) sum[1,n](a, b, result) print result[:5]
The preceding lines of code are essentially the same as before but a little more generic in that the size of the array can be changed. What we want to do is increase the size of n
. If you try a value such as n=10000
, an error of type CUDA_ERROR_INVALID_VALUE
occurs. The problem is that there is a hard limit on the number of threads that can be run by a single SM, that is, there is a limit to the number of threads that can be executed in a single block. To be able to handle large vectors, we need to modify the code so that it can handle multiple blocks. To this end, change the definition of the sum()
function in the following way:
from numbapro import cuda @cuda.jit('void(float64[:], float64[:], float64[:], int32)') def sum(a, b, result, n): tx = cuda.threadIdx.x bx = cuda.blockIdx.x bsz = cuda.blockDim.x i = tx + bx * bsz if i < n: result[i] = a[i] + b[i]
The first thing to note is that we include an argument of type int32
to hold the size of the arrays being added. The main point now is that threads in different blocks must address different areas of memory, so the computation of the index i
associated to a thread is more complicated. Essentially, we must know the number of blocks that come before the current block, multiply that by the block dimension, and add the current thread index. Then, before adding the relevant memory positions, we check if the index is valid. This prevents the thread from accessing areas that are not part of the input/output arrays and is an essential check in more complex code. To test the code, run the following:
n = 100000 a = arange(0,n,dtype=float64) b = arange(n,0,-1, dtype=float64) result = zeros(n,dtype=float64) sum[1000,64](a, b, result, n) print result[:5]
The preceding code should run without a hitch. Note that we are specifying a grid with 1000
blocks and 64
threads per block. The number of blocks in a grid is unlimited, the device being responsible for allocating the SMs in an optimal way. Note that the number of blocks must be large enough to cover the input/output arrays. In our case, this means blockDim.x * gridDim.x >= n
.
We are now ready to compute with large vectors. Try the following code:
n = 100000 a = rand(n) b = rand(n) result = zeros(n, dtype=float64) bd = 10000 gd = 64 if bd * gd < n: print 'Block/grid dimensions too small' else: sum[bd,gd](a, b, result, n) print result[:10]
The reader should experiment with different values of
n
, bd
, and gd
. Remember that the maximum value of gd
depends on the device in your computer. An interesting experiment is to check how the computation scales for larger values of n
.