Much can be said about performance optimization, but truthfully, if you have read the entire book up to this point, you know most of the Python-specific techniques to write fast code. The most important factor in application performance will always be the choice of algorithms and, by extension, the data structures. Searching for an item within a list is almost always a worse idea than searching for an item in a dict or set.
Within any application, the right choice of algorithm is by far the most important performance characteristic, which is why I am repeating it here and illustrating the results of a bad choice:
In [1]: a = list(range(1000000))

In [2]: b = dict.fromkeys(range(1000000))

In [3]: %timeit 'x' in a
10 loops, best of 3: 20.5 ms per loop

In [4]: %timeit 'x' in b
10000000 loops, best of 3: 41.6 ns per loop
Checking whether an item is within a list is an O(n) operation, while checking whether an item is within a dict is an O(1) operation. That is a huge difference with n=1000000: in this simple test, the dict lookup is roughly 500,000 times faster (20.5 ms versus 41.6 ns).
All other performance tips combined might make your code twice as fast, but using the right algorithm for the job can yield a much greater improvement. Using an algorithm that takes O(n) time instead of O(n^2) time will make your code 1000 times faster for n=1000, and with a larger n the difference only grows further.
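To make the asymptotic difference concrete, here is a minimal sketch (with hypothetical function names) of the same task solved in O(n^2) and O(n) time: checking whether a sequence contains duplicates.

```python
# O(n^2): compare every pair of items.
def has_duplicates_quadratic(items):
    for index, first in enumerate(items):
        for second in items[index + 1:]:
            if first == second:
                return True
    return False


# O(n): a set membership test is O(1) on average.
def has_duplicates_linear(items):
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both return the same answers, but for a million items the quadratic version performs up to half a trillion comparisons while the linear version performs a million set lookups.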
One of the most obscure components of the CPython interpreter is the global interpreter lock (GIL), a mutual exclusion lock (mutex) required to prevent memory corruption. The Python memory manager is not thread-safe and that is why the GIL is needed. Without the GIL, multiple threads might alter memory at the same time, causing all sorts of unexpected and potentially dangerous results.
So what is the impact of the GIL in a real-life application? Within single-threaded applications it makes no difference whatsoever and is actually an extremely fast method of keeping memory consistent. Within multithreaded applications, however, it can slow your application down, because only a single thread can hold the GIL, and therefore execute Python bytecode, at a time. So if your threads spend most of their time executing Python code, your application might benefit from some restructuring.
Luckily, Python offers a few other options for parallel processing: the asyncio module that we saw earlier and the multiprocessing library that we will see in Chapter 13, Multiprocessing – When a Single CPU Core Is Not Enough.
In many languages a try/except type of block incurs quite a performance hit, but within Python this is not the case. It's not that an if statement is heavy, but if you expect your try/except to succeed most of the time and only fail in rare cases, it's definitely a valid alternative. As always though, focus on readability and conveying the purpose of the code. If the intention of the code is clearer using an if statement, use the if statement. If try/except conveys the intention in a better way, use that.
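To illustrate, here is a small sketch (with hypothetical names) of the same lookup written both ways; the try/except version only pays a cost in the rare case where the key is actually missing:

```python
counts = {'spam': 3}


# The if statement spells out the precondition explicitly.
def get_count_if(key):
    if key in counts:
        return counts[key]
    return 0


# The try/except assumes success and handles failure as the exception.
def get_count_try(key):
    try:
        return counts[key]
    except KeyError:
        return 0
```

Both are perfectly valid Python; pick whichever states the intent of your code more clearly.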
Evaluating code lazily using generators is almost always a better idea than calculating the entire dataset. The most important rule of performance optimization is probably that you shouldn't calculate anything you're not going to use. If you're not sure that you are going to need it, don't calculate it.
Don't forget that you can easily chain multiple generators, so everything is calculated only when it's actually needed. Do be careful that this doesn't result in recalculation though; itertools.tee is generally a better idea than recalculating your results completely.
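As a small sketch of both points, here is a chained pair of generators shared between two consumers through itertools.tee, so the values are calculated only once:

```python
import itertools

squares = (x * x for x in range(10))               # lazy: nothing calculated yet
even_squares = (x for x in squares if x % 2 == 0)  # chained, still lazy

# tee gives two independent iterators over one underlying generator,
# so the chain does not have to be rebuilt and recalculated.
first, second = itertools.tee(even_squares)
total = sum(first)              # 0 + 4 + 16 + 36 + 64
count = sum(1 for _ in second)  # the same five values, not recalculated
```

Note that tee buffers the values internally, so if one consumer runs far ahead of the other, the memory savings of the generator are partly lost.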
You might have seen benchmarks claiming that using += is much slower than joining strings. At one point this made quite a lot of difference indeed. With Python 3, however, most of the difference has vanished.
In [1]: %%timeit
   ...: s = ''
   ...: for i in range(1000000):
   ...:     s += str(i)
   ...:
1 loops, best of 3: 362 ms per loop

In [2]: %%timeit
   ...: ss = []
   ...: for i in range(1000000):
   ...:     ss.append(str(i))
   ...: s = ''.join(ss)
   ...:
1 loops, best of 3: 332 ms per loop

In [3]: %timeit ''.join(str(i) for i in range(1000000))
1 loops, best of 3: 324 ms per loop

In [4]: %timeit ''.join([str(i) for i in range(1000000)])
1 loops, best of 3: 294 ms per loop
There are still some differences of course, but they are so small that I recommend simply ignoring them and choosing the most readable option instead.
As is the case with string concatenation, what was once a significant difference is now too small to mention.
In [1]: %%timeit
   ...: x = 0
   ...: for i in range(1000000):
   ...:     x += i
   ...:
10 loops, best of 3: 73.2 ms per loop

In [2]: %timeit x = sum(i for i in range(1000000))
10 loops, best of 3: 75.3 ms per loop

In [3]: %timeit x = sum([i for i in range(1000000)])
10 loops, best of 3: 71.2 ms per loop

In [4]: %timeit x = sum(range(1000000))
10 loops, best of 3: 25.6 ms per loop
What does help though is letting Python handle everything internally using native functions, as can be seen in the last example.
Once again, readability counts more than performance. There are a few cases where map is faster than list comprehensions and generators, but only if the map call can use a predefined function. As soon as you need to whip out a lambda, it's actually slower. Not that it matters much, since readability should be key anyhow: use generators or list comprehensions instead of map:
In [1]: %timeit list(map(lambda x: x/2, range(1000000)))
10 loops, best of 3: 182 ms per loop

In [2]: %timeit list(x/2 for x in range(1000000))
10 loops, best of 3: 122 ms per loop

In [3]: %timeit [x/2 for x in range(1000000)]
10 loops, best of 3: 84.7 ms per loop
As you can see, the list comprehension is obviously quite a bit faster than the generator. In many cases I would still recommend the generator over the list comprehension though, if only because of the memory usage and the potential laziness. If for some reason you are only going to use the first 10 items, you're still wasting a lot of resources by calculating the full list of items.
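For example, combining a generator with itertools.islice stops the calculation after the first 10 items, whereas the list comprehension would have computed all one million:

```python
import itertools

# Only the first 10 halves are ever calculated; the remaining
# 999,990 values are never produced at all.
halves = (x / 2 for x in range(1000000))
first_ten = list(itertools.islice(halves, 10))
```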
We have already covered the functools.lru_cache decorator in Chapter 5, Decorators – Enabling Code Reuse by Decorating, but its importance should not be underestimated. Regardless of how fast and smart your code is, not having to calculate results at all is always better, and that's what caching does. Depending on your use case, there are many options available. Within a simple script, functools.lru_cache is a very good contender, but between multiple executions of an application, the pickle module can be a life saver as well.
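As a quick reminder of how effective this can be, here is a small sketch using functools.lru_cache on the classic recursive Fibonacci function; the call counter is a hypothetical addition for illustration only:

```python
import functools

calls = 0


@functools.lru_cache(maxsize=None)
def fibonacci(n):
    # Count how often the body actually runs; cache hits skip it entirely.
    global calls
    calls += 1
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


result = fibonacci(30)
```

Without the cache, fibonacci(30) executes the function body well over a million times; with the cache, it runs exactly once per distinct value of n, so 31 times in total.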
If multiple servers are involved, I recommend taking a look at Redis. The Redis server is a single-threaded, in-memory server that is extremely fast and has many useful data structures available. If you see articles or tutorials about improving performance using Memcached, simply replace Memcached with Redis everywhere. Redis is superior to Memcached in every way, and in its most basic form the API is compatible.
A common problem in application load times is that everything is loaded immediately at the start of the program, while with many applications this is actually not needed and certain parts of the application only require loading when they are actually used. To facilitate this, one can occasionally move the imports inside of functions so they are loaded on demand. While it's a valid strategy in some cases, I don't generally recommend it, for two reasons: it makes your dependencies less obvious, since they are no longer listed at the top of the file, and it moves any import failure from startup to the moment the function is first called, which can be much harder to debug.
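Here is a minimal sketch of the pattern, using the lightweight json module as a stand-in for a genuinely heavy dependency:

```python
def export_report(data):
    # The import runs only when the function is first called,
    # not at application startup. Python caches modules, so
    # repeated calls do not re-import it.
    import json
    return json.dumps(data)
```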
This is actually a very broad tip, but useful nonetheless. If there's a highly optimized library that suits your purpose, you most likely won't be able to beat its performance without a significant amount of effort. Libraries such as numpy, pandas, scipy, and sklearn are highly optimized for performance, and their native operations can be incredibly fast. If they suit your purpose, be sure to give them a try. Just to illustrate how fast numpy can be compared to plain Python, refer to the following:
In [1]: import numpy

In [2]: a = list(range(1000000))

In [3]: b = numpy.arange(1000000)

In [4]: %timeit c = [x for x in a if x > 500000]
10 loops, best of 3: 44 ms per loop

In [5]: %timeit d = b[b > 500000]
1000 loops, best of 3: 1.61 ms per loop
The numpy code does exactly the same as the Python code, except that it uses numpy arrays instead of Python lists. This little difference makes the code more than 25 times faster.
Just-in-time (JIT) compiling is a method of dynamically compiling (parts of) an application during runtime. Because there is much more information available at runtime, this can have a huge effect and make your application much faster.
The numba package provides selective JIT compiling, allowing you to mark the functions that are JIT-compiler compatible. Essentially, if your functions follow the functional programming paradigm of basing the calculations only on the input, then they will most likely work with the JIT compiler.
Here is a basic example of how the numba JIT compiler can be used:

import numba


@numba.jit
def sum(array):
    total = 0.0
    for value in array:
        total += value
    return total
The use cases for this are limited, but if you are using numpy or pandas, you will most likely benefit from numba.
Another very interesting fact to note is that numba supports not only CPU-optimized execution but GPU execution as well. This means that for certain operations, you can use the fast processor in your video card to calculate the results.
We will see more about this in Chapter 14, Extensions in C/C++, System Calls, and C/C++ Libraries, but if high performance is really required, then a native C function can help quite a lot. This doesn't even have to be that difficult. The Cython module makes it trivial to write parts of your code with performance very close to native C code.
Following is an example from the Cython manual to approximate the value of pi:
cdef inline double recip_square(int i):
    return 1. / (i * i)

def approx_pi(int n=10000000):
    cdef double val = 0.
    cdef int k
    for k in range(1, n + 1):
        val += recip_square(k)
    return (6 * val) ** .5
While there are some small differences, such as cdef instead of def and type definitions for the values and parameters, the code is largely the same as regular Python would be, but certainly much faster.