In this final subsection, we will talk about Numba, probably one of the hottest ways to speed up your Python code with almost no changes. Numba compiles Python code (vanilla Python or NumPy-based) to optimized machine code using the LLVM compiler infrastructure. By doing so, and by applying a suite of optimizations along the way, it can drastically increase the speed of your code, especially code that makes heavy use of loops and NumPy arrays.
The great thing about Numba is that, in the best-case scenario, it will improve your code if you simply add a decorator over your function or class. If you're not so lucky, you'll have to work through the documentation and somewhat obscure error messages and experiment with datatype annotations. In some cases, Numba can even outperform NumPy. As if that weren't enough, Numba can also compile your code for CUDA, leveraging the raw performance of GPUs, which for suitable workloads can be an order of magnitude faster than CPUs.
Here is a simple example. The compute_distances function resembles the behavior of euclidean_distances and performs fairly well:
import numpy as np

def distance(p1, p2):
    distance = 0
    for c1, c2 in zip(p1, p2):
        distance += (c2 - c1) ** 2
    return np.sqrt(distance)

def compute_distances(points1, points2):
    A = np.zeros(shape=(len(points1), len(points2)))
    for i, p1 in enumerate(points1):
        for j, p2 in enumerate(points2):
            A[i, j] = distance(p1, p2)
    return A
%timeit compute_distances([(0, 0)]*100, [(1,1)]*200)
The performance (output) of the preceding code snippet is as follows:
>>> 43.8 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, once we add a decorator to each function, performance improves more than tenfold:
from numba import jit

@jit()
def distance(p1, p2):
    distance = 0
    for c1, c2 in zip(p1, p2):
        distance += (c2 - c1) ** 2
    return np.sqrt(distance)

@jit()
def compute_distances(points1, points2):
    A = np.zeros(shape=(len(points1), len(points2)))
    for i, p1 in enumerate(points1):
        for j, p2 in enumerate(points2):
            A[i, j] = distance(p1, p2)
    return A
%timeit compute_distances([(0, 0)]*100, [(1,1)]*200)
The performance (output) of the preceding code snippet is as follows:
>>> 3.02 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
On that run, Numba shows a deprecation warning: future versions will require a typed list instead of a plain Python list, but in the current version the code works as is.
In our experience, Numba shines for non-trivial, multi-nested computations, where it is easier to write the loops in pure Python (and optimize them with Numba) than to vectorize them in NumPy. At the same time, it is not as mature as NumPy, and breaking changes to its API happen fairly often.
In this section, we covered a few ways to improve the performance of Python code. Starting from a naive, slow, but easy algorithm implementation, we approached the problem from different angles: using vectorized C-based loops, choosing data structures that are efficient for the task, running operations on multiple cores or multiple machines, and using modern compilers. Some of those solutions can and should be combined. All of them have their own benefits, limitations, and requirements, such as larger memory, more CPUs and machines, or specific knowledge. Don't rush to implement any optimization before you're sure you need it. Once you are sure, though, a wide range of possibilities is available.
For more information on Numba, check out the following resources:
- Numba—Tell Those C++ Bullies to Get Lost, SciPy 2017 Tutorial, Gil Forsyth and Lorena Barba: https://www.youtube.com/watch?v=1AwG0T4gaO0
- Accelerating Python with the Numba JIT Compiler, SciPy 2015, Stanley Seibert: https://www.youtube.com/watch?v=eYIPEDnp5C4
Now, let's talk about an important topic we've ignored so far—concurrency.