Chapter 2. Profiling to Find Bottlenecks

Profiling lets us find bottlenecks so we can do the least amount of work to get the biggest practical performance gain. While we’d like to get huge gains in speed and reductions in resource usage with little work, practically you’ll aim for your code to run “fast enough” and “lean enough” to fit your needs. Profiling will let you make the most pragmatic decisions for the least overall effort.

Any measurable resource can be profiled (not just the CPU!). In this chapter we look at both CPU time and memory usage. You could apply similar techniques to measure network bandwidth and disk I/O too.

If a program is running too slowly or using too much RAM, then you’ll want to fix whichever parts of your code are responsible. You could, of course, skip profiling and fix what you believe might be the problem—but be wary, as you’ll often end up “fixing” the wrong thing. Rather than using your intuition, it is far more sensible to first profile, having defined a hypothesis, before making changes to the structure of your code.

Sometimes it’s good to be lazy. By profiling first, you can quickly identify the bottlenecks that need to be solved, and then you can solve just enough of these to achieve the performance you need. If you avoid profiling and jump to optimization, then it is quite likely that you’ll do more work in the long run. Always be driven by the results of profiling.

Profiling Efficiently

The first aim of profiling is to test a representative system to identify what’s slow (or using too much RAM, or causing too much disk I/O or network I/O). Profiling typically adds an overhead (10x to 100x slowdowns can be typical), but you still want your code to be exercised as similarly to a real-world situation as possible. Extract a test case and isolate the piece of the system that you need to test. Preferably, it’ll have been written to be in its own set of modules already.

The basic techniques that are introduced first in this chapter include the %timeit magic in IPython, time.time(), and a timing decorator. You can use these techniques to understand the behavior of statements and functions.

Then we will cover cProfile (“Using the cProfile Module”), showing you how to use this built-in tool to understand which functions in your code take the longest to run. This will give you a high-level view of the problem so you can direct your attention to the critical functions.

Next, we’ll look at line_profiler (“Using line_profiler for Line-by-Line Measurements”), which will profile your chosen functions on a line-by-line basis. The result will include a count of the number of times each line is called and the percentage of time spent on each line. This is exactly the information you need to understand what’s running slowly and why.

Armed with the results of line_profiler, you’ll have the information you need to move on to using a compiler ([Link to Come]).

In [Link to Come] ([Link to Come]), you’ll learn how to use perf stat to understand the number of instructions that are ultimately executed on a CPU and how efficiently the CPU’s caches are utilized. This allows for advanced-level tuning of matrix operations. You should take a look at that example when you’re done with this chapter.

After line_profiler, if you’re working with long-running systems, you’ll be interested in dowser ([Link to Come]), which introspects live objects via a web browser interface, and py-spy, which peeks into already-running Python processes.

To help you understand why your RAM usage is high, we’ll show you memory_profiler (“Using memory_profiler to Diagnose Memory Usage”). It is particularly useful for tracking RAM usage over time on a labeled chart, so you can explain to colleagues why certain functions use more RAM than expected.

Note

Whatever approach you take to profiling your code, you must remember to have adequate unit test coverage in your code. Unit tests help you to avoid silly mistakes and help to keep your results reproducible. Avoid them at your peril.

Always profile your code before compiling or rewriting your algorithms. You need evidence to determine the most efficient ways to make your code run faster.

Finally, we’ll give you an introduction to the Python bytecode inside CPython (“Using the dis Module to Examine CPython Bytecode”), so you can understand what’s happening “under the hood.” In particular, having an understanding of how Python’s stack-based virtual machine operates will help you to understand why certain coding styles run more slowly than others.

Before the end of the chapter, we’ll review how to integrate unit tests while profiling (“Unit Testing During Optimization to Maintain Correctness”), to preserve the correctness of your code while you make it run more efficiently.

We’ll finish with a discussion of profiling strategies (“Strategies to Profile Your Code Successfully”), so you can reliably profile your code and gather the correct data to test your hypotheses. Here you’ll learn about how dynamic CPU frequency scaling and features like TurboBoost can skew your profiling results and how they can be disabled.

To walk through all of these steps, we need an easy-to-analyze function. The next section introduces the Julia set. It is a CPU-bound function that’s a little hungry for RAM; it also exhibits nonlinear behavior (so we can’t easily predict the outcomes), which means we need to profile it at runtime rather than analyzing it offline.

Introducing the Julia Set

The Julia set is an interesting CPU-bound problem for us to begin with. It is a fractal sequence that generates a complex output image, named after Gaston Julia.

The code that follows is a little longer than a version you might write yourself. It has a CPU-bound component and a very explicit set of inputs. This configuration allows us to profile both the CPU usage and the RAM usage so we can understand which parts of our code are consuming two of our scarce computing resources. This implementation is deliberately suboptimal, so we can identify memory-consuming operations and slow statements. Later in this chapter we’ll fix a slow logic statement and a memory-consuming statement, and in [Link to Come] we’ll significantly speed up the overall execution time of this function.

We will analyze a block of code that produces both a false grayscale plot (Figure 2-1) and a pure grayscale variant of the Julia set (Figure 2-3), at the complex point c=-0.62772-0.42193j. A Julia set is produced by calculating each pixel in isolation; this is an “embarrassingly parallel problem” as no data is shared between points.

Julia set at -0.62772-0.42193i
Figure 2-1. Julia set plot with a false grayscale to highlight detail

If we chose a different c, then we’d get a different image. The location we have chosen has regions that are quick to calculate and others that are slow to calculate; this is useful for our analysis.

The problem is interesting because we calculate each pixel by applying a loop that could be applied an indeterminate number of times. On each iteration we test to see if this coordinate’s value escapes toward infinity, or if it seems to be held by an attractor. Coordinates that cause few iterations are colored darkly in Figure 2-1, and those that cause a high number of iterations are colored white. White regions are more complex to calculate and so take longer to generate.

We define a set of z-coordinates that we’ll test. The function that we calculate squares the complex number z and adds c:

f(z) = z² + c

We iterate on this function, using abs to test whether z has escaped beyond a boundary. If the condition abs(z) < 2 is False, we break out of the loop and record the number of iterations we performed at this coordinate. If the condition is never False, we stop after maxiter iterations. We will later turn this z’s result into a colored pixel representing this complex location.

In pseudocode, it might look like:

for z in coordinates:
    for iteration in range(maxiter):  # limited iterations per point
        if abs(z) < 2.0:  # has the escape condition been broken?
            z = z*z + c
        else:
            break
    # store the iteration count for each z and draw later

To explain this function, let’s try two coordinates.

First, we’ll use the coordinate that we draw in the top-left corner of the plot at -1.8-1.8j. We must test abs(z) < 2 before we can try the update rule:

z = -1.8-1.8j
print(abs(z))
2.54558441227

We can see that for the top-left coordinate the abs(z) test will be False on the zeroth iteration as 2.54 >= 2.0, so we do not perform the update rule. The output value for this coordinate is 0.

Now let’s jump to the center of the plot at z = 0 + 0j and try a few iterations:

c = -0.62772-0.42193j
z = 0+0j
for n in range(9):
    z = z*z + c
    print(f"{n}: z={z: .5f}, abs(z)={abs(z):0.3f}, c={c: .5f}")
0: z=-0.62772-0.42193j, abs(z)=0.756, c=-0.62772-0.42193j
1: z=-0.41171+0.10778j, abs(z)=0.426, c=-0.62772-0.42193j
2: z=-0.46983-0.51068j, abs(z)=0.694, c=-0.62772-0.42193j
3: z=-0.66777+0.05793j, abs(z)=0.670, c=-0.62772-0.42193j
4: z=-0.18516-0.49930j, abs(z)=0.533, c=-0.62772-0.42193j
5: z=-0.84274-0.23703j, abs(z)=0.875, c=-0.62772-0.42193j
6: z= 0.02630-0.02242j, abs(z)=0.035, c=-0.62772-0.42193j
7: z=-0.62753-0.42311j, abs(z)=0.757, c=-0.62772-0.42193j
8: z=-0.41295+0.10910j, abs(z)=0.427, c=-0.62772-0.42193j

We can see that each update to z for these first iterations leaves it with a value where abs(z) < 2 is True. For this coordinate we can iterate 300 times, and still the test will be True. We cannot tell how many iterations we must perform before the condition becomes False, and this may be an infinite sequence. The maximum iteration (maxiter) break clause will stop us iterating potentially forever.

In Figure 2-2 we see the first 50 iterations of the preceding sequence. For 0+0j (the solid line with circle markers) the sequence appears to repeat every eighth iteration, but each sequence of seven calculations has a minor deviation from the previous sequence—we can’t tell if this point will iterate forever within the boundary condition, or for a long time, or maybe for just a few more iterations. The dashed cutoff line shows the boundary at +2.

julia non convergence
Figure 2-2. Two coordinate examples evolving for the Julia set

For -0.82+0j (the dashed line with diamond markers), we can see that after the ninth update the absolute result has exceeded the +2 cutoff, so we stop updating this value.
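If you’d like to reproduce something like Figure 2-2 yourself, the following sketch (our own plotting code, not part of the Julia set module; it assumes matplotlib is installed) tracks abs(z) for the two example starting points and draws the +2 cutoff:

import matplotlib.pyplot as plt

c = -0.62772 - 0.42193j
for z0, style in [(0 + 0j, "o-"), (-0.82 + 0j, "D--")]:
    z = z0
    abs_history = []
    for _ in range(50):
        abs_history.append(abs(z))
        if abs(z) >= 2:
            break  # this point has escaped past the cutoff
        z = z * z + c
    plt.plot(abs_history, style, label=f"z0={z0}")

plt.axhline(2.0, linestyle=":", label="cutoff")
plt.xlabel("iteration")
plt.ylabel("abs(z)")
plt.legend()
plt.show()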

Calculating the Full Julia Set

In this section we break down the code that generates the Julia set. We’ll analyze it in various ways throughout this chapter. As shown in Example 2-1, at the start of our module we import the time module for our first profiling approach and define some coordinate constants.

Example 2-1. Defining global constants for the coordinate space
"""Julia set generator without optional PIL-based image drawing"""
import time

# area of complex space to investigate
x1, x2, y1, y2 = -1.8, 1.8, -1.8, 1.8
c_real, c_imag = -0.62772, -0.42193

To generate the plot, we create two lists of input data. The first is zs (complex z-coordinates), and the second is cs (a complex initial condition). Neither list varies, and we could optimize cs to a single c value as a constant. The rationale for building two input lists is so that we have some reasonable-looking data to profile when we profile RAM usage later in this chapter.

To build the zs and cs lists, we need to know the coordinates for each z. In Example 2-2 we build up these coordinates using xcoord and ycoord and a specified x_step and y_step. The somewhat verbose nature of this setup is useful when porting the code to other tools (e.g., to numpy) and to other Python environments, as it helps to have everything very clearly defined for debugging.

Example 2-2. Establishing the coordinate lists as inputs to our calculation function
def calc_pure_python(desired_width, max_iterations):
    """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set"""
    x_step = (x2 - x1) / desired_width
    y_step = (y1 - y2) / desired_width
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step
    # build a list of co-ordinates and the initial condition for each cell.
    # Note that our initial condition is a constant and could easily be removed,
    # we use it to simulate a real-world scenario with several inputs to our function
    zs = []
    cs = []
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
            cs.append(complex(c_real, c_imag))

    print("Length of x:", len(x))
    print("Total elements:", len(zs))
    start_time = time.time()
    output = calculate_z_serial_purepython(max_iterations, zs, cs)
    end_time = time.time()
    secs = end_time - start_time
    print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")

    # This sum is expected for a 1000^2 grid with 300 iterations
    # It ensures that our code evolves exactly as we'd intended
    assert sum(output) == 33219980

Having built the zs and cs lists, we output some information about the size of the lists and calculate the output list via calculate_z_serial_purepython. Finally, we sum the contents of output and assert that it matches the expected output value. Ian uses it here to confirm that no errors creep into the book.

As the code is deterministic, we can verify that the function works as we expect by summing all the calculated values. This is useful as a sanity check—when we make changes to numerical code it is very sensible to check that we haven’t broken the algorithm. Ideally we would use unit tests and we’d test more than one configuration of the problem.
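If the module is importable (for example, saved as julia1_nopil.py), a minimal pytest-style sanity check might look like the following sketch; it simply reuses the checksum assertion that calc_pure_python already performs for the default 1,000 × 1,000 grid with 300 iterations:

from julia1_nopil import calc_pure_python

def test_julia_checksum():
    # calc_pure_python asserts internally that sum(output) == 33219980,
    # so any regression in the numerics raises an AssertionError here
    calc_pure_python(desired_width=1000, max_iterations=300)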

Next, in Example 2-3, we define the calculate_z_serial_purepython function, which expands on the algorithm we discussed earlier. Notably, we also define an output list at the start that has the same length as the input zs and cs lists.

Example 2-3. Our CPU-bound calculation function
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Now we call the calculation routine in Example 2-4. By wrapping it in a __main__ check, we can safely import the module without starting the calculations for some of the profiling methods. Note that here we’re not showing the method used to plot the output.

Example 2-4. main for our code
if __name__ == "__main__":
    # Calculate the Julia set using a pure Python solution with
    # reasonable defaults for a laptop
    calc_pure_python(desired_width=1000, max_iterations=300)

Once we run the code, we see some output about the complexity of the problem:

# running the above produces:
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.087012767791748 seconds

In the false-grayscale plot (Figure 2-1), the high-contrast color changes gave us an idea of where the cost of the function was slow-changing or fast-changing. Here, in Figure 2-3, we have a linear color map: black is quick to calculate and white is expensive to calculate.

By showing two representations of the same data, we can see that lots of detail is lost in the linear mapping. Sometimes it can be useful to have various representations in mind when investigating the cost of a function.

Julia set at -0.62772-0.42193i
Figure 2-3. Julia plot example using a pure grayscale

Simple Approaches to Timing—print and a Decorator

After Example 2-4, we saw the output generated by several print statements in our code. On Ian’s laptop this code takes approximately 8 seconds to run using CPython 3.7. It is useful to note that there is always some variation in execution time. You must observe the normal variation when you’re timing your code, or you might incorrectly attribute an improvement in your code simply to a random variation in execution time.

Your computer will be performing other tasks while running your code, such as accessing the network, disk, or RAM, and these factors can cause variations in the execution time of your program.

Ian’s laptop is a Dell 9550 with an Intel Core i7-6700HQ (2.6 GHz, 6 MB cache, quad core with Hyperthreading) and 32 GB of RAM running Linux Mint 19.1 (Ubuntu 18.04).

In calc_pure_python (Example 2-2), we can see several print statements. This is the simplest way to measure the execution time of a piece of code inside a function. It is a very basic approach, but despite being quick and dirty it can be very useful when you’re first looking at a piece of code.

Using print statements is commonplace when debugging and profiling code. It quickly becomes unmanageable, but is useful for short investigations. Try to tidy them up when you’re done with them, or they will clutter your stdout.

A slightly cleaner approach is to use a decorator—here, we add one line of code above the function that we care about. Our decorator can be very simple and just replicate the effect of the print statements. Later, we can make it more advanced.

In Example 2-5 we define a new function, timefn, which takes a function as an argument: the inner function, measure_time, takes *args (a variable number of positional arguments) and **kwargs (a variable number of key/value arguments) and passes them through to fn for execution. Around the execution of fn we capture time.time() and then print the result along with fn.__name__. The overhead of using this decorator is small, but if you’re calling fn millions of times the overhead might become noticeable. We use @wraps(fn) to expose the function name and docstring to the caller of the decorated function (otherwise, we would see the function name and docstring for the decorator, not the function it decorates).

Example 2-5. Defining a decorator to automate timing measurements
from functools import wraps

def timefn(fn):
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print(f"@timefn: {fn.__name__} took {t2 - t1} seconds")
        return result
    return measure_time


@timefn
def calculate_z_serial_purepython(maxiter, zs, cs):
    ...

When we run this version (we keep the print statements from before), we can see that the execution time in the decorated version is very slightly quicker than the call from calc_pure_python. This is due to the overhead of calling a function (the difference is very tiny):

Length of x: 1000
Total elements: 1000000
@timefn: calculate_z_serial_purepython took 8.00485110282898 seconds
calculate_z_serial_purepython took 8.004898071289062 seconds
Note

The addition of profiling information will inevitably slow your code down—some profiling options are very informative and induce a heavy speed penalty. The trade-off between profiling detail and speed will be something you have to consider.

We can use the timeit module as another way to get a coarse measurement of the execution speed of our CPU-bound function. More typically, you would use this when timing different types of simple expressions as you experiment with ways to solve a problem.

Warning

Note that the timeit module temporarily disables the garbage collector. This might impact the speed you’ll see with real-world operations if the garbage collector would normally be invoked by your operations. See the Python documentation for help on this.
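If garbage collection is relevant to your workload, the timeit documentation suggests re-enabling it in the setup statement. A minimal sketch (with a trivial statement purely for illustration):

import timeit

# gc.enable() in the setup runs after timeit disables the collector,
# so the timed loop executes with garbage collection switched on
t = timeit.Timer("sum(range(1000))", setup="import gc; gc.enable()")
print(t.timeit(number=10_000))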

From the command line you can run timeit using:

python -m timeit -n 5 -r 1 -s "import julia1" "julia1.calc_pure_python(desired_width=1000, max_iterations=300)"

Note that you have to import the module as a setup step using -s, as calc_pure_python is inside that module. timeit has some sensible defaults for short sections of code, but for longer-running functions it can be sensible to specify the number of loops that are averaged for each test (-n 5) and the number of repetitions (-r); here we use a single repetition (-r 1) to keep this long-running example short. The best result of all the repetitions is given as the answer.

By default, if we run timeit on this function without specifying -n and -r it runs 10 loops with 5 repetitions, and this takes 6 minutes to complete. Overriding the defaults can make sense if you want to get your results a little faster.

We’re only interested in the best-case result, as the average and worst cases are probably skewed by interference from other processes. Our single repetition of five averaged loops gives us this result:

5 loops, best of 1: 8.45 sec per loop

Try running the benchmark several times to check if you get varying results—you may need more repetitions to settle on a stable fastest-result time. There is no “correct” configuration, so if you see a wide variation in your timing results, do more repetitions until your final result is stable.
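You can also gather the repetitions programmatically with timeit.repeat, which returns one timing per repetition so you can take the minimum yourself. A sketch (assuming the code is importable as a julia1 module, as in the command-line example above):

import timeit

timings = timeit.repeat(
    stmt="julia1.calc_pure_python(desired_width=1000, max_iterations=300)",
    setup="import julia1",
    repeat=5,   # number of repetitions
    number=1,   # loops per repetition
)
print(f"best of {len(timings)} repetitions: {min(timings):.2f} s")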

Our results show that the overall cost of calling calc_pure_python is 8.45 seconds (as the best case), while single calls to calculate_z_serial_purepython take 8.0 seconds as measured by the @timefn decorator. The difference is mainly the time taken to create the zs and cs lists.

Inside IPython, we can use the magic %timeit in the same way. If you are developing your code interactively in IPython or in a Jupyter Notebook then you can use:

In [1]: import julia1
In [2]: %timeit julia1.calc_pure_python(desired_width=1000, max_iterations=300)

It is worth considering the variation in load that you get on a normal computer. Many background tasks are running (e.g., Dropbox, backups) that could impact the CPU and disk resources at random. Scripts in web pages can also cause unpredictable resource usage. Figure 2-4 shows the single CPU being used at 100% for some of the timing steps we just performed; the other cores on this machine are each lightly working on other tasks.

System Monitor (Ubuntu) showing background CPU usage during timings
Figure 2-4. System Monitor on Ubuntu showing variation in background CPU usage while we time our function

Occasionally the System Monitor shows spikes of activity on this machine. It is sensible to watch your System Monitor to check that nothing else is interfering with your critical resources (CPU, disk, network).

Simple Timing Using the Unix time Command

We can step outside of Python for a moment to use a standard system utility on Unix-like systems. The following will record various views on the execution time of your program, and it won’t care about the internal structure of your code:

$ /usr/bin/time -p python julia1_nopil.py
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.279886722564697 seconds
real 8.84
user 8.73
sys 0.10

Note that we specifically use /usr/bin/time rather than time so we get the system’s time and not the simpler (and less useful) version built into our shell. If you try time --verbose and you get an error, you’re probably looking at the shell’s built-in time command and not the system command.

Using the -p portability flag, we get three results:

  • real records the wall clock or elapsed time.

  • user records the amount of time the CPU spent on your task outside of kernel functions.

  • sys records the time spent in kernel-level functions.

By adding user and sys, you get a sense of how much time was spent in the CPU. The difference between this and real might tell you about the amount of time spent waiting for I/O; it might also suggest that your system is busy running other tasks that are distorting your measurements.
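In the run above, user + sys is 8.73 + 0.10 = 8.83 seconds, which is very close to the real time of 8.84 seconds, so almost none of the elapsed time was spent waiting on I/O or on other processes.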

time is useful because it isn’t specific to Python. It includes the time taken to start the python executable, which might be significant if you start lots of fresh processes (rather than having a long-running single process). If you often have short-running scripts where the startup time is a significant part of the overall runtime, then time can be a more useful measure.

We can add the --verbose flag to get even more output:

$ /usr/bin/time --verbose python julia1_nopil.py
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.477287530899048 seconds
	Command being timed: "python julia1_nopil.py"
	User time (seconds): 8.97
	System time (seconds): 0.05
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.03
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 98620
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 26645
	Voluntary context switches: 1
	Involuntary context switches: 27
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Probably the most useful indicator here is Major (requiring I/O) page faults, as this indicates whether the operating system is having to load pages of data from the disk because the data no longer resides in RAM. This will cause a speed penalty.

In our example the code and data requirements are small, so no page faults occur. If you have a memory-bound process, or several programs that use variable and large amounts of RAM, you might find that this gives you a clue as to which program is being slowed down by disk accesses at the operating system level because parts of it have been swapped out of RAM to disk.

Using the cProfile Module

cProfile is a built-in profiling tool in the standard library. It hooks into the virtual machine in CPython to measure the time taken to run every function that it sees. This introduces a greater overhead, but you get correspondingly more information. Sometimes the additional information can lead to surprising insights into your code.

cProfile is one of two profilers in the standard library, alongside profile. profile is the original, slower, pure-Python profiler; cProfile has the same interface and is written in C for lower overhead. If you’re curious about the history of these libraries, see Armin Rigo’s 2005 request to include cProfile in the standard library.

A good practice when profiling is to generate a hypothesis about the speed of parts of your code before you profile it. Ian likes to print out the code snippet in question and annotate it. Forming a hypothesis ahead of time means you can measure how wrong you are (and you will be!) and improve your intuition about certain coding styles.

Warning

You should never avoid profiling in favor of a gut instinct (we warn you—you will get it wrong!). It is definitely worth forming a hypothesis ahead of profiling to help you learn to spot possible slow choices in your code, and you should always back up your choices with evidence.

Always be driven by results that you have measured, and always start with some quick-and-dirty profiling to make sure you’re addressing the right area. There’s nothing more humbling than cleverly optimizing a section of code only to realize (hours or days later) that you missed the slowest part of the process and haven’t really addressed the underlying problem at all.

Let’s hypothesize that calculate_z_serial_purepython is the slowest part of the code. In that function, we do a lot of dereferencing and make many calls to basic arithmetic operators and the abs function. These will probably show up as consumers of CPU resources.

Here, we’ll use the cProfile module to run a variant of the code. The output is spartan but helps us figure out where to analyze further.

The -s cumulative flag tells cProfile to sort by cumulative time spent inside each function; this gives us a view into the slowest parts of a section of code. The cProfile output is written to screen directly after our usual print results:

$ python -m cProfile -s cumulative julia1_nopil.py
...
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 11.498265266418457 seconds
         36221995 function calls in 12.234 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.234   12.234 {built-in method builtins.exec}
        1    0.038    0.038   12.234   12.234 julia1_nopil.py:1(<module>)
        1    0.571    0.571   12.197   12.197 julia1_nopil.py:23(calc_pure_python)
        1    8.369    8.369   11.498   11.498 julia1_nopil.py:9(calculate_z_serial_purepython)
 34219980    3.129    0.000    3.129    0.000 {built-in method builtins.abs}
  2002000    0.121    0.000    0.121    0.000 {method 'append' of 'list' objects}
        1    0.006    0.006    0.006    0.006 {built-in method builtins.sum}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Sorting by cumulative time gives us an idea about where the majority of execution time is spent. This result shows us that 36,221,995 function calls occurred in just over 12 seconds (this time includes the overhead of using cProfile). Previously our code took around 8 seconds to execute—we’ve just added a 4-second penalty by measuring how long each function takes to execute.

We can see that the entry point to the code julia1_nopil.py on line 1 takes a total of 12 seconds. This is just the __main__ call to calc_pure_python. ncalls is 1, indicating that this line is only executed once.

Inside calc_pure_python, the call to calculate_z_serial_purepython consumes 11 seconds. Both functions are only called once. We can derive that approximately 1 second is spent on lines of code inside calc_pure_python, separate to calling the CPU-intensive calculate_z_serial_purepython function. We can’t derive which lines, however, take the time inside the function using cProfile.

Inside calculate_z_serial_purepython, the time spent on lines of code (without calling other functions) is 8 seconds. This function makes 34,219,980 calls to abs, which take a total of 3 seconds, along with some other calls that do not cost much time.

What about the {abs} call? This line is measuring the individual calls to the abs function inside calculate_z_serial_purepython. While the per-call cost is negligible (it is recorded as 0.000 seconds), the total time for 34,219,980 calls is 3 seconds. We couldn’t predict in advance exactly how many calls would be made to abs, as the Julia function has unpredictable dynamics (that’s why it is so interesting to look at).

At best we could have said that it will be called a minimum of 1,000,000 times, as we’re calculating 1000*1000 pixels. At most it will be called 300,000,000 times, as we calculate 1,000,000 pixels with a maximum of 300 iterations. So, 34 million calls is roughly 10% of the worst case.

If we look at the original grayscale image (Figure 2-3) and, in our mind’s eye, squash the white parts together into a corner, we can estimate that the expensive white region accounts for roughly 10% of the image.

The next line in the profiled output, {method 'append' of 'list' objects}, details the creation of 2,002,000 list items.

Tip

Why 2,002,000 items? Before you read on, think about how many list items are being constructed.

This creation of the 2,002,000 items is occurring in calc_pure_python during the setup phase.

The zs and cs lists will be 1000*1000 items each (generating 1,000,000 * 2 calls), and these are built from a list of 1,000 x- and 1,000 y-coordinates. In total, this is 2,002,000 calls to append.

It is important to note that this cProfile output is not ordered by parent functions; it is summarizing the expense of all functions in the executed block of code. Figuring out what is happening on a line-by-line basis is very hard with cProfile, as we only get profile information for the function calls themselves, not each line within the functions.

Inside calculate_z_serial_purepython, we can account for the cost of the abs calls, which total approximately 3.1 seconds. We know that calculate_z_serial_purepython costs 11.4 seconds in total.

The final line of the profiling output refers to lsprof; this is the original name of the tool that evolved into cProfile and can be ignored.

To get some more control over the results of cProfile, we can write a statistics file and then analyze it in Python:

$ python -m cProfile -o profile.stats julia1_nopil.py

We can load this into Python as follows, and it will give us the same cumulative time report as before:

In [1]: import pstats
In [2]: p = pstats.Stats("profile.stats")
In [3]: p.sort_stats("cumulative")
Out[3]: <pstats.Stats at 0x7f77088edf28>

In [4]: p.print_stats()
Fri Jun 14 17:59:28 2019    profile.stats

         36221995 function calls in 12.169 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.169   12.169 {built-in method builtins.exec}
        1    0.033    0.033   12.169   12.169 julia1_nopil.py:1(<module>)
        1    0.576    0.576   12.135   12.135 julia1_nopil.py:23(calc_pure_python)
        1    8.266    8.266   11.429   11.429 julia1_nopil.py:9(calculate_z_serial_purepython)
 34219980    3.163    0.000    3.163    0.000 {built-in method builtins.abs}
  2002000    0.123    0.000    0.123    0.000 {method 'append' of 'list' objects}
        1    0.006    0.006    0.006    0.006 {built-in method builtins.sum}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

To trace which functions we’re profiling, we can print the caller information. In the following two listings we can see that calculate_z_serial_purepython is the most expensive function, and it is called from one place. If it were called from many places, these listings might help us narrow down the locations of the most expensive parents:

In [5]: p.print_callers()
   Ordered by: cumulative time

Function                                          was called by...
                                                      ncalls  tottime  cumtime
{built-in method builtins.exec}                   <-
julia1_nopil.py:1(<module>)                       <-       1    0.033   12.169  {built-in method builtins.exec}
julia1_nopil.py:23(calc_pure_python)              <-       1    0.576   12.135  julia1_nopil.py:1(<module>)
julia1_nopil.py:9(calculate_z_serial_purepython)  <-       1    8.266   11.429  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.abs}                    <- 34219980    3.163    3.163  julia1_nopil.py:9(calculate_z_serial_purepython)
{method 'append' of 'list' objects}               <- 2002000    0.123    0.123  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.sum}                    <-       1    0.006    0.006  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.print}                  <-       3    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.len}                    <-       2    0.000    0.000  julia1_nopil.py:9(calculate_z_serial_purepython)
                                                           2    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{built-in method time.time}                       <-       2    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{method 'disable' of '_lsprof.Profiler' objects}  <-

We can flip this around the other way to show which functions call other functions:

In [6]: p.print_callees()
   Ordered by: cumulative time

Function                                          called...
                                                      ncalls  tottime  cumtime
{built-in method builtins.exec}                   ->       1    0.033   12.169  julia1_nopil.py:1(<module>)
julia1_nopil.py:1(<module>)                       ->       1    0.576   12.135  julia1_nopil.py:23(calc_pure_python)
julia1_nopil.py:23(calc_pure_python)              ->       1    8.266   11.429  julia1_nopil.py:9(calculate_z_serial_purepython)
                                                           2    0.000    0.000  {built-in method builtins.len}
                                                           3    0.000    0.000  {built-in method builtins.print}
                                                           1    0.006    0.006  {built-in method builtins.sum}
                                                           2    0.000    0.000  {built-in method time.time}
                                                     2002000    0.123    0.123  {method 'append' of 'list' objects}
julia1_nopil.py:9(calculate_z_serial_purepython)  -> 34219980    3.163    3.163  {built-in method builtins.abs}
                                                           2    0.000    0.000  {built-in method builtins.len}
{built-in method builtins.abs}                    ->
{method 'append' of 'list' objects}               ->
{built-in method builtins.sum}                    ->
{built-in method builtins.print}                  ->
{built-in method builtins.len}                    ->
{built-in method time.time}                       ->
{method 'disable' of '_lsprof.Profiler' objects}  ->

cProfile is rather verbose, and you need a side screen to see it without lots of word-wrap. Since it is built in, though, it is a convenient tool for quickly identifying bottlenecks. Tools like line_profiler and memory_profiler, which we discuss later in this chapter, will then help you to drill down to the specific lines that you should pay attention to.
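You can also drive cProfile from inside Python rather than from the command line, which is handy when you only want to profile one call. The following is a sketch (it assumes the functions above are importable from julia1_nopil):

import cProfile
import pstats

from julia1_nopil import calc_pure_python

profiler = cProfile.Profile()
profiler.enable()
calc_pure_python(desired_width=1000, max_iterations=300)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries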

Visualizing cProfile Output with SnakeViz

snakeviz is a visualizer that draws the output of cProfile as a diagram in which larger boxes are areas of code that take longer to run. It replaces the older runsnake tool.

Use snakeviz to get a high-level understanding of a cProfile statistics file, particularly if you’re investigating a new project where you have few intuitions. The diagrams will help you visualize the CPU-usage behavior of the system, and they may highlight areas that you hadn’t expected to be expensive, potentially pointing you at easy wins.

To install SnakeViz use $ pip install snakeviz.
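Once installed, point snakeviz at a statistics file generated with cProfile’s -o flag and it will open an interactive diagram in your browser:

$ snakeviz profile.stats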

In Figure 2-5 we have the visual output of the profile.stats file we’ve just generated. The entry point for the program is shown at the top of the diagram. Each layer down is a function called from the function above.

The width of the diagram represents the entire time taken by the program’s execution. The fourth layer shows that the majority of the time is spent in calculate_z_serial_purepython. The fifth layer breaks this down further: the unannotated block to the right, occupying approximately 25% of that layer, represents the time spent in the abs function. Seeing these larger blocks quickly brings home how the time is spent inside your program.

snakeviz visualizing profile.stats
Figure 2-5. snakeviz visualizing profile.stats

The next section down shows a table that is a pretty-printed version of the statistics we’ve just been looking at; you can sort by cumtime (cumulative time), percall (cost per call), and ncalls (number of calls), among others. Starting with cumtime will tell you, overall, which functions cost the most, and they’re a good place to start your investigations.

If you’re comfortable looking at tables, the console output from cProfile may be adequate for you. To communicate with others, though, we’d strongly suggest you use diagrams, such as this output from snakeviz, to help them quickly understand the point you’re making.

Using line_profiler for Line-by-Line Measurements

In Ian’s opinion, Robert Kern’s line_profiler is the strongest tool for identifying the cause of CPU-bound problems in Python code. It works by profiling individual functions on a line-by-line basis, so you should start with cProfile and use the high-level view to guide which functions to profile with line_profiler.

It is worthwhile printing and annotating versions of the output from this tool as you modify your code, so you have a record of changes (successful or not) that you can quickly refer to. Don’t rely on your memory when you’re working on line-by-line changes.

To install line_profiler, issue the command pip install line_profiler.

A decorator (@profile) is used to mark the chosen function. The kernprof script is used to execute your code, and the CPU time and other statistics for each line of the chosen function are recorded.

Note

The requirement to modify the source code is a minor annoyance, as the addition of a decorator will break your unit tests unless you make a dummy decorator—see “No-op @profile Decorator”.
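A minimal sketch of one such dummy decorator: kernprof injects a profile builtin when it runs your script with -l, so when that name is absent we can fall back to a pass-through decorator at the top of the module:

try:
    profile  # defined by kernprof at runtime
except NameError:
    def profile(fn):
        return fn  # no-op fallback so the module still imports and tests run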

The arguments are -l for line-by-line (rather than function-level) profiling and -v for verbose output. Without -v you receive an .lprof output that you can later analyze with the line_profiler module. In Example 2-6, we’ll do a full run on our CPU-bound function.

Example 2-6. Running kernprof with line-by-line output on a decorated function to record the CPU cost of each line’s execution
$ kernprof -l -v julia1_lineprofiler.py
...
Wrote profile results to julia1_lineprofiler.py.lprof
Timer unit: 1e-06 s

Total time: 49.2011 s
File: julia1_lineprofiler.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3298.0   3298.0      0.0      output = [0] * len(zs)
    13   1000001     386087.0      0.4      0.8      for i in range(len(zs)):
    14   1000000     365713.0      0.4      0.7          n = 0
    15   1000000     420496.0      0.4      0.9          z = zs[i]
    16   1000000     402887.0      0.4      0.8          c = cs[i]
    17  34219980   18692480.0      0.5     38.0          while abs(z) < 2 and n < maxiter:
    18  33219980   15165236.0      0.5     30.8              z = z * z + c
    19  33219980   13357752.0      0.4     27.1              n += 1
    20   1000000     407153.0      0.4      0.8          output[i] = n
    21         1          1.0      1.0      0.0      return output

Introducing kernprof adds a substantial amount to the runtime. In this example, calculate_z_serial_purepython takes 49 seconds; this is up from 8 seconds using simple print statements and 12 seconds using cProfile. The gain is that we get a line-by-line breakdown of where the time is spent inside the function.

The % Time column is the most helpful: we can see that 38% of the time is spent on the while test. We don’t know whether the first part of the test (abs(z) < 2) is more expensive than the second (n < maxiter), though. Inside the loop, we see that the update to z is also fairly expensive. Even n += 1 is expensive! Python’s dynamic lookup machinery is at work on every iteration, even though we’re using the same types for each variable on each pass through the loop; this is where compiling and type specialization ([Link to Come]) give us a massive win. The creation of the output list and the updates on line 20 are relatively cheap compared to the cost of the while loop.

If you haven’t thought about the complexity of Python’s dynamic machinery before, think about what happens in that n += 1 operation. Python has to check whether the n object has an __add__ method (and if it doesn’t, to walk up any inherited classes to see if they provide it); then the other object (1 in this case) is passed in so that __add__ can decide how to handle the operation. Remember that the second argument might be a float or some other object that may or may not be compatible. This all happens dynamically.
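As a rough illustration (simplified: the full protocol also consults __radd__ on the right operand and handles NotImplemented), the addition behind n += 1 is ultimately resolved through a method lookup along these lines:

n = 1
result = type(n).__add__(n, 1)  # roughly what n + 1 dispatches to
print(result)  # 2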

The obvious way to further analyze the while statement is to break it up. While there has been some discussion in the Python community around the idea of rewriting the .pyc files with more detailed information for multipart, single-line statements, we are unaware of any production tools that offer a more fine-grained analysis than line_profiler.

In Example 2-7, we break the while logic into several statements. This additional complexity will increase the runtime of the function, as we have more lines of code to execute, but it might also help us to understand the costs incurred in this part of the code.

Tip

Before you look at the code, do you think we’ll learn about the costs of the fundamental operations this way? Might other factors complicate the analysis?

Example 2-7. Breaking the compound while statement into individual statements to record the cost of each part of the original parts
$ kernprof -l -v julia1_lineprofiler2.py
...
Wrote profile results to julia1_lineprofiler2.py.lprof
Timer unit: 1e-06 s

Total time: 82.88 s
File: julia1_lineprofiler2.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3309.0   3309.0      0.0      output = [0] * len(zs)
    13   1000001     421828.0      0.4      0.5      for i in range(len(zs)):
    14   1000000     392861.0      0.4      0.5          n = 0
    15   1000000     452537.0      0.5      0.5          z = zs[i]
    16   1000000     427382.0      0.4      0.5          c = cs[i]
    17   1000000     396178.0      0.4      0.5          while True:
    18  34219980   19150179.0      0.6     23.1              not_yet_escaped = abs(z) < 2
    19  34219980   15159735.0      0.4     18.3              iterations_left = n < maxiter
    20  34219980   14352481.0      0.4     17.3              if not_yet_escaped and iterations_left:
    21  33219980   16958693.0      0.5     20.5                  z = z * z + c
    22  33219980   14316108.0      0.4     17.3                  n += 1
    23                                                       else:
    24   1000000     409440.0      0.4      0.5                  break
    25   1000000     439254.0      0.4      0.5          output[i] = n
    26         1          0.0      0.0      0.0      return output

This version takes 82 seconds to execute, while the previous version took 49 seconds. Other factors did complicate the analysis. In this case, having extra statements that have to be executed 34,219,980 times each slows down the code. If we hadn’t used kernprof to investigate the line-by-line effect of this change, we might have drawn other conclusions about the reason for the slowdown, as we’d have lacked the necessary evidence.

At this point it makes sense to step back to the earlier timeit technique to test the cost of individual expressions:

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: z = 0+0j
In [2]: %timeit abs(z) < 2
97.6 ns ± 0.138 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: n = 1
In [4]: maxiter = 300
In [5]: %timeit n < maxiter
42.1 ns ± 0.0355 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

From this simple analysis, it looks as though the logic test on n is more than twice as fast as the call to abs. Since Python evaluates the parts of an and expression from left to right and stops as soon as the outcome is known (short-circuiting), it makes sense to put the cheapest test on the left side. On at most 1 in every 301 tests for each coordinate, the n < maxiter test will be False, so Python wouldn’t need to evaluate the other side of the and operator.

We never know whether abs(z) < 2 will be False until we evaluate it, and our earlier observations for this region of the complex plane suggest it is True around 10% of the time for all 300 iterations. If we wanted to have a strong understanding of the time complexity of this part of the code, it would make sense to continue the numerical analysis. In this situation, however, we want an easy check to see if we can get a quick win.

We can form a new hypothesis stating, “By swapping the order of the operators in the while statement we will achieve a reliable speedup.” We can test this hypothesis using kernprof, but the additional overheads of profiling this way might add too much noise. Instead, we can use an earlier version of the code, running a test comparing while abs(z) < 2 and n < maxiter: against while n < maxiter and abs(z) < 2:.

The result is a fairly stable improvement of approximately 0.1 seconds. This result is obviously minor and is very problem-specific, though using a more suitable approach to solve this problem (e.g., swapping to using Cython or PyPy, as described in [Link to Come]) would yield greater gains.

We can be confident in our result because:

  • We stated a hypothesis that was easy to test.

  • We changed our code so that only the hypothesis would be tested (never test two things at once!).

  • We gathered enough evidence to support our conclusion.

For completeness, we can run a final kernprof on the two main functions including our optimization to confirm that we have a full picture of the overall complexity of our code.

Example 2-8. Swapping the order of the compound while statement makes the function fractionally faster
$ kernprof -l -v julia1_lineprofiler3.py
...
Wrote profile results to julia1_lineprofiler3.py.lprof
Timer unit: 1e-06 s

Total time: 48.9154 s
File: julia1_lineprofiler3.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3312.0   3312.0      0.0      output = [0] * len(zs)
    13   1000001     383405.0      0.4      0.8      for i in range(len(zs)):
    14   1000000     362177.0      0.4      0.7          n = 0
    15   1000000     414681.0      0.4      0.8          z = zs[i]
    16   1000000     395037.0      0.4      0.8          c = cs[i]
    17  34219980   18686399.0      0.5     38.2          while n < maxiter and abs(z) < 2:
    18  33219980   15010549.0      0.5     30.7              z = z * z + c
    19  33219980   13256992.0      0.4     27.1              n += 1
    20   1000000     402857.0      0.4      0.8          output[i] = n
    21         1          1.0      1.0      0.0      return output

As expected, we can see from the output in Example 2-9 that calculate_z_serial_purepython takes most (97%) of the time of its parent function. The list-creation steps are minor in comparison.

Example 2-9. Testing the line-by-line costs of the setup routine
Total time: 88.334 s
File: julia1_lineprofiler3.py
Function: calc_pure_python at line 24

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    24                                           @profile
    25                                           def calc_pure_python(draw_output, desired_width, max_iterations):
    26                                               """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
...
    44         1          1.0      1.0      0.0      zs = []
    45         1          0.0      0.0      0.0      cs = []
    46      1001        651.0      0.7      0.0      for ycoord in y:
    47   1001000     591973.0      0.6      0.7          for xcoord in x:
    48   1000000     859713.0      0.9      1.0              zs.append(complex(xcoord, ycoord))
    49   1000000     902310.0      0.9      1.0              cs.append(complex(c_real, c_imag))
    50
    51         1         40.0     40.0      0.0      print("Length of x:", len(x))
    52         1          7.0      7.0      0.0      print("Total elements:", len(zs))
    53         1          4.0      4.0      0.0      start_time = time.time()
    54         1   85969310.0 85969310.0     97.3      output = calculate_z_serial_purepython(max_iterations, zs, cs)
    55         1          4.0      4.0      0.0      end_time = time.time()
    56         1          1.0      1.0      0.0      secs = end_time - start_time
    57         1         36.0     36.0      0.0      print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    58
    59         1       6345.0   6345.0      0.0      assert sum(output) == 33219980  # this sum is expected for 1000^2 grid with 300 iterations

Using memory_profiler to Diagnose Memory Usage

Just as Robert Kern’s line_profiler package measures CPU usage, the memory_profiler module by Fabian Pedregosa and Philippe Gervais measures memory usage on a line-by-line basis. Understanding the memory usage characteristics of your code allows you to ask yourself two questions:

  • Could we use less RAM by rewriting this function to work more efficiently?

  • Could we use more RAM and save CPU cycles by caching?

memory_profiler operates in a very similar way to line_profiler, but runs far more slowly. If you install the psutil package (optional but recommended), memory_profiler will run faster. Memory profiling may easily make your code run 10 to 100 times slower. In practice, you will probably use memory_profiler occasionally and line_profiler (for CPU profiling) more frequently.

Install memory_profiler with the command pip install memory_profiler (and optionally pip install psutil).
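memory_profiler also ships an mprof helper that samples a process’s RAM usage over time and plots it as a labeled chart (the approach mentioned at the start of this chapter). A typical invocation, assuming our script name, looks like the following (mprof plot needs matplotlib installed):

$ mprof run julia1_memoryprofiler.py
$ mprof plot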

As mentioned, the implementation of memory_profiler is not as performant as the implementation of line_profiler. It may therefore make sense to run your tests on a smaller problem that completes in a useful amount of time. Overnight runs might be sensible for validation, but you need quick and reasonable iterations to diagnose problems and hypothesize solutions. The code in Example 2-10 uses the full 1,000 × 1,000 grid, and the statistics took about 2 hours to collect on Ian’s laptop.

Note

The requirement to modify the source code is a minor annoyance. As with line_profiler, a decorator (@profile) is used to mark the chosen function. This will break your unit tests unless you make a dummy decorator—see “No-op @profile Decorator”.

When dealing with memory allocation, you must be aware that the situation is not as clear-cut as it is with CPU usage. Generally, it is more efficient to overallocate memory to a process into a local pool that can be used at leisure, as memory allocation operations are relatively expensive. Furthermore, garbage collection is not instant, so objects may be unavailable but still in the garbage collection pool for some time.

The outcome of this is that it is hard to really understand what is happening with memory usage and release inside a Python program, as a line of code may not allocate a deterministic amount of memory as observed from outside the process. Observing the gross trend over a set of lines is likely to lead to better insight than observing the behavior of just one line.

Let’s take a look at the output from memory_profiler in Example 2-10. Inside calculate_z_serial_purepython on line 12, we see that the allocation of 1,000,000 items causes approximately 7 MB of RAM to be added to this process. This does not mean that the output list is definitely 7 MB in size, just that the process grew by approximately 7 MB during the internal allocation of the list.

In the parent function on line 46, we see that the allocation of the zs and cs lists changes the Mem usage column from roughly 48 MiB to 126 MiB (an increase of approximately 78 MiB). Again, it is worth noting that this is not necessarily the true size of the arrays, just the size that the process grew by once these lists had been created.
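To get a feel for why process growth and object size differ, you can measure the objects directly. The following sketch (our own, separate from the profiler output) uses sys.getsizeof, which for a list reports only the container and its array of pointers, not the complex objects it refers to:

import sys

zs = [complex(0, 0) for _ in range(1_000_000)]
print(sys.getsizeof(zs))             # the list container only
print(sys.getsizeof(complex(0, 0)))  # one complex object
# the real footprint is roughly the container plus one complex object per entry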

At the time of writing, memory_profiler exhibits a bug: the Increment column does not always match the change in the Mem usage column. These columns were tracked correctly when the first edition of this book was written; you may want to check the status of this bug on GitHub: https://github.com/pythonprofilers/memory_profiler/issues/236. We recommend that you use the Mem usage column, as it correctly tracks the change in process size per line of code.

Example 2-10. memory_profiler’s result on both of our main functions, showing an unexpected memory use in calculate_z_serial_purepython
$ python -m memory_profiler julia1_memoryprofiler.py
...

Line #    Mem usage    Increment   Line Contents
================================================
     9  126.363 MiB  126.363 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                 """Calculate output list using Julia update rule"""
    12  133.973 MiB    7.609 MiB       output = [0] * len(zs)
    13  136.988 MiB    0.000 MiB       for i in range(len(zs)):
    14  136.988 MiB    0.000 MiB           n = 0
    15  136.988 MiB    0.000 MiB           z = zs[i]
    16  136.988 MiB    0.000 MiB           c = cs[i]
    17  136.988 MiB    0.258 MiB           while n < maxiter and abs(z) < 2:
    18  136.988 MiB    0.000 MiB               z = z * z + c
    19  136.988 MiB    0.000 MiB               n += 1
    20  136.988 MiB    0.000 MiB           output[i] = n
    21  136.988 MiB    0.000 MiB       return output

...

Line #    Mem usage    Increment   Line Contents
================================================
    24   48.113 MiB   48.113 MiB   @profile
    25                             def calc_pure_python(draw_output, desired_width, max_iterations):
    26                                 """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
    27   48.113 MiB    0.000 MiB       x_step = (x2 - x1) / desired_width
    28   48.113 MiB    0.000 MiB       y_step = (y1 - y2) / desired_width
    29   48.113 MiB    0.000 MiB       x = []
    30   48.113 MiB    0.000 MiB       y = []
    31   48.113 MiB    0.000 MiB       ycoord = y2
    32   48.113 MiB    0.000 MiB       while ycoord > y1:
    33   48.113 MiB    0.000 MiB           y.append(ycoord)
    34   48.113 MiB    0.000 MiB           ycoord += y_step
    35   48.113 MiB    0.000 MiB       xcoord = x1
    36   48.113 MiB    0.000 MiB       while xcoord < x2:
    37   48.113 MiB    0.000 MiB           x.append(xcoord)
    38   48.113 MiB    0.000 MiB           xcoord += x_step
    44   48.113 MiB    0.000 MiB       zs = []
    45   48.113 MiB    0.000 MiB       cs = []
    46  125.961 MiB    0.000 MiB       for ycoord in y:
    47  125.961 MiB    0.258 MiB           for xcoord in x:
    48  125.961 MiB    0.512 MiB               zs.append(complex(xcoord, ycoord))
    49  125.961 MiB    0.512 MiB               cs.append(complex(c_real, c_imag))
    50
    51  125.961 MiB    0.000 MiB       print("Length of x:", len(x))
    52  125.961 MiB    0.000 MiB       print("Total elements:", len(zs))
    53  125.961 MiB    0.000 MiB       start_time = time.time()
    54  136.609 MiB   10.648 MiB       output = calculate_z_serial_purepython(max_iterations, zs, cs)
    55  136.609 MiB    0.000 MiB       end_time = time.time()
    56  136.609 MiB    0.000 MiB       secs = end_time - start_time
    57  136.609 MiB    0.000 MiB       print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    58
    59  136.609 MiB    0.000 MiB       assert sum(output) == 33219980  # this sum is expected for 1000^2 grid with 300 iterations

Another way to visualize the change in memory use is to sample over time and plot the result. memory_profiler has a utility called mprof, used once to sample the memory usage and a second time to visualize the samples. It samples by time and not by line, so it barely impacts the runtime of the code.

Figure 2-6 is created using mprof run julia1_memoryprofiler.py. This writes a statistics file that is then visualized using mprof plot. Our two functions are bracketed: this shows where in time they are entered, and we can see the growth in RAM as they run. Inside calculate_z_serial_purepython, we can see the steady increase in RAM usage throughout the execution of the function; this is caused by all the small objects (int and float types) that are created.
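
At the command line the two steps look like this:

$ mprof run julia1_memoryprofiler.py
$ mprof plot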

Figure 2-6. memory_profiler report using mprof

In addition to observing the behavior at the function level, we can add labels using a context manager. The snippet in Example 2-11 is used to generate the graph in Figure 2-7. We can see the create_output_list label: it appears momentarily at around 1.5 seconds after calculate_z_serial_purepython and results in the process being allocated more RAM. We then pause for a second; the time.sleep(1) is an artificial addition to make the graph easier to understand.

Example 2-11. Using a context manager to add labels to the mprof graph
@profile
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    with profile.timestamp("create_output_list"):
        output = [0] * len(zs)
    time.sleep(1)
    with profile.timestamp("calculate_output"):
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and abs(z) < 2:
                z = z * z + c
                n += 1
            output[i] = n
    return output

In the calculate_output block that runs for most of the graph, we see a very slow, linear increase in RAM usage. This will be from all of the temporary numbers used in the inner loops. Using the labels really helps us to understand at a fine-grained level where memory is being consumed. Interestingly, the “peak RAM usage” line (a dashed vertical line just before 10 seconds) occurs before the program terminates. Potentially this is the garbage collector recovering some RAM from the temporary objects used during calculate_output.

Figure 2-7. memory_profiler report using mprof with labels

What happens if we simplify our code and remove the creation of the zs and cs lists? We then have to calculate these co-ordinates inside calculate_z_serial_purepython (so the same work is performed), but we save RAM by not storing them in lists. You can see the code in Example 2-12.

In Figure 2-8, we see a major change in behavior: the overall envelope of RAM usage drops from 140 MB to 60 MB, cutting our RAM usage by more than half!

Figure 2-8. memory_profiler after removing two large lists
Example 2-12. Creating complex co-ordinates on-the-fly to save RAM
@profile
def calculate_z_serial_purepython(maxiter, x, y):
    """Calculate output list using Julia update rule"""
    output = []
    for ycoord in y:
        for xcoord in x:
            z = complex(xcoord, ycoord)
            c = complex(c_real, c_imag)
            n = 0
            while n < maxiter and abs(z) < 2:
                z = z * z + c
                n += 1
            output.append(n)
    return output

If we want to measure the RAM used by several statements we can use the IPython magic %memit, which works just like %timeit. In [Link to Come] we will look at using %memit to measure the memory cost of lists and discuss various ways of using RAM more efficiently.
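
As a quick taste (the expression measured here is arbitrary), load the memory_profiler extension in an IPython session and measure a single statement; the figures reported will vary from machine to machine:

In [1]: %load_ext memory_profiler
In [2]: %memit [0] * int(1e7)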

memory_profiler offers an interesting aid to debugging a large process via the --pdb-mmem=XXX flag. The pdb debugger is activated once the process exceeds XXX MB. This drops you directly into your code at the point where too many allocations are occurring, which is particularly handy if you're working in a space-constrained environment.
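
A run along the following lines (the 100 MB threshold here is arbitrary) would start pdb as soon as the process grows beyond 100 MB:

$ python -m memory_profiler --pdb-mmem=100 julia1_memoryprofiler.py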

Introspecting an Existing Process with PySpy

py-spy is an intriguing new sampling profiler: rather than requiring any code changes, it introspects an already-running Python process and reports in the console with a top-like display. Being a sampling profiler, it has almost no run-time impact on your code. It is written in Rust and requires elevated privileges to introspect another process.

This tool could be very useful in a production environment with long-running processes or complicated installation requirements. It supports Windows, Mac, and Linux. Install it using pip install py-spy (note the dash; there's a separate pyspy project that isn't related). If your process is already running, you'll want to use ps to get its process identifier (the PID), which can then be passed into py-spy as shown in Example 2-13.

Example 2-13. Running PySpy at the command line
$ ps -A -o pid,rss,cmd | ack python
...
15953 96156 python julia1_nopil.py
...
$ sudo env "PATH=$PATH" py-spy --pid 15953

If you look at Figure 2-9, you'll see a static picture of a top-like display; in the console, this updates every second to show which functions are currently taking most of the time.

Figure 2-9. Introspecting a Python process using PySpy

PySpy can also export a flame chart. Here we'll run that option while asking PySpy to run our code directly, without requiring a PID, using $ py-spy --flame profile.svg -- python julia1_nopil.py. You'll see in Figure 2-10 that the width of the display represents the entire program's run-time, and each layer moving down the image represents functions called from above.

Figure 2-10. Flame chart for PySpy

Using the dis Module to Examine CPython Bytecode

So far we’ve reviewed various ways to measure the cost of Python code (both for CPU and RAM usage). We haven’t yet looked at the underlying bytecode used by the virtual machine, though. Understanding what’s going on “under the hood” helps to build a mental model of what’s happening in slow functions, and it’ll help when you come to compile your code. So, let’s introduce some bytecode.

The dis module lets us inspect the underlying bytecode that we run inside the stack-based CPython virtual machine. Having an understanding of what’s happening in the virtual machine that runs your higher-level Python code will help you to understand why some styles of coding are faster than others. It will also help when you come to use a tool like Cython, which steps outside of Python and generates C code.

The dis module is built in. You can pass it code or a module, and it will print out a disassembly. In Example 2-14 we disassemble the outer loop of our CPU-bound function.

Tip

You should try to disassemble one of your own functions and attempt to follow exactly how the disassembled code matches the original source. Can you match the following dis output to the original function?

Example 2-14. Using the built-in dis to understand the underlying stack-based virtual machine that runs our Python code
In [1]: import dis
In [2]: import julia1_nopil
In [3]: dis.dis(julia1_nopil.calculate_z_serial_purepython)
 11           0 LOAD_CONST               1 (0)
              2 BUILD_LIST               1
              4 LOAD_GLOBAL              0 (len)
              6 LOAD_FAST                1 (zs)
              8 CALL_FUNCTION            1
             10 BINARY_MULTIPLY
             12 STORE_FAST               3 (output)

 12          14 SETUP_LOOP              94 (to 110)
             16 LOAD_GLOBAL              1 (range)
             18 LOAD_GLOBAL              0 (len)
             20 LOAD_FAST                1 (zs)
             22 CALL_FUNCTION            1
             24 CALL_FUNCTION            1
             26 GET_ITER
        >>   28 FOR_ITER                78 (to 108)
             30 STORE_FAST               4 (i)

 13          32 LOAD_CONST               1 (0)
             34 STORE_FAST               5 (n)
...
 19     >>   98 LOAD_FAST                5 (n)
            100 LOAD_FAST                3 (output)
            102 LOAD_FAST                4 (i)
            104 STORE_SUBSCR
            106 JUMP_ABSOLUTE           28
        >>  108 POP_BLOCK

 20     >>  110 LOAD_FAST                3 (output)
            112 RETURN_VALUE

The output is fairly straightforward, if terse. The first column contains line numbers that relate to our original file. The second column contains several >> symbols; these are the destinations for jump points elsewhere in the code. The third column is the operation address along with the operation name. The fourth column contains the parameters for the operation. The fifth column contains annotations to help line up the bytecode with the original Python parameters.

Refer back to Example 2-3 to match the bytecode to the corresponding Python code. The bytecode starts by putting the constant value 0 onto the stack, and then it builds a single-element list. Next, it searches the namespaces to find the len function, puts it on the stack, searches the namespaces again to find zs, and then puts that onto the stack. Still within line 11, it calls the len function from the stack, which consumes the zs reference in the stack; then it applies a binary multiply to the last two arguments (the length of zs and the single-element list) and stores the result in output. That's the first line of our Python function now dealt with. Follow the next block of bytecode to understand the behavior of the second line of Python code (the outer for loop).

Tip

The jump points (>>) match to instructions like JUMP_ABSOLUTE and POP_JUMP_IF_FALSE. Go through your own disassembled function and match the jump points to the jump instructions.

Having introduced bytecode, we can now ask: what’s the bytecode and time cost of writing a function out explicitly versus using built-ins to perform the same task?

Different Approaches, Different Complexity

There should be one—and preferably only one—obvious way to do it. Although that way may not be obvious at first unless you’re Dutch…

Tim Peters, The Zen of Python

There will be various ways to express your ideas using Python. Generally it should be clear what the most sensible option is, but if your experience is primarily with an older version of Python or another programming language, then you may have other methods in mind. Some of these ways of expressing an idea may be slower than others.

You probably care more about readability than speed for most of your code, so your team can code efficiently, without being puzzled by performant but opaque code. Sometimes you will want performance, though (without sacrificing readability). Some speed testing might be what you need.

Take a look at the two code snippets in Example 2-15. They both do the same job, but the first will generate a lot of additional Python bytecode, which will cause more overhead.

Example 2-15. A naive and a more efficient way to solve the same summation problem
def fn_expressive(upper=1_000_000):
    total = 0
    for n in range(upper):
        total += n
    return total

def fn_terse(upper=1_000_000):
    return sum(range(upper))

assert fn_expressive() == fn_terse(), "We expect an identical result from these functions"

Both functions calculate the sum of a range of integers. A simple rule of thumb (but one you must back up using profiling!) is that more lines of bytecode will execute more slowly than fewer equivalent lines of bytecode that use built-in functions. In Example 2-16 we use IPython’s %timeit magic function to measure the best execution time from a set of runs. fn_terse runs over twice as fast as fn_expressive!

Example 2-16. Using %timeit to test our hypothesis that using built-in functions should be faster than writing our own functions
In [2]: %timeit fn_expressive()
52.4 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit fn_terse()
18.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

If we use the dis module to investigate the code for each function, as shown in Example 2-17, we can see that the virtual machine has 17 lines to execute with the more expressive function and only 6 to execute with the very readable but more terse second function.

Example 2-17. Using dis to view the number of bytecode instructions involved in our two functions
In [4]: import dis
In [5]: dis.dis(fn_expressive)
  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (total)

  3           4 SETUP_LOOP              24 (to 30)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (upper)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                12 (to 28)
             16 STORE_FAST               2 (n)

  4          18 LOAD_FAST                1 (total)
             20 LOAD_FAST                2 (n)
             22 INPLACE_ADD
             24 STORE_FAST               1 (total)
             26 JUMP_ABSOLUTE           14
        >>   28 POP_BLOCK

  5     >>   30 LOAD_FAST                1 (total)
             32 RETURN_VALUE

In [6]: dis.dis(fn_terse)
  8           0 LOAD_GLOBAL              0 (sum)
              2 LOAD_GLOBAL              1 (range)
              4 LOAD_FAST                0 (upper)
              6 CALL_FUNCTION            1
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

The difference between the two code blocks is striking. Inside fn_expressive(), we maintain two local variables and iterate over a range object using a for statement. On each pass, the for loop checks whether a StopIteration exception has been raised. Each iteration applies the total.__add__ method, which checks the type of the second variable (n) every time. These checks all add a little expense.

Inside fn_terse(), we call out to the built-in sum function, which is implemented in C and generates the final result without executing the loop as Python bytecode. This is much faster, although each iteration must still check the types of the objects being added together (in Chapter 4 we look at ways of fixing the type so we don't need to check it on each iteration).

As noted previously, you must profile your code—if you just rely on this heuristic, then you will inevitably write slower code at some point. It is definitely worth learning if Python has a shorter and still readable way to solve your problem built in. If so, it is more likely to be easily readable by another programmer, and it will probably run faster.

Unit Testing During Optimization to Maintain Correctness

If you aren’t already unit testing your code, then you are probably hurting your longer-term productivity. Ian (blushing) is embarrassed to note that once he spent a day optimizing his code, having disabled unit tests because they were inconvenient, only to discover that his significant speedup result was due to breaking a part of the algorithm he was improving. You do not need to make this mistake even once.

Tip

Add unit tests to your code for a saner life. You'll be giving your current self and your colleagues faith that your code works, and you'll be giving a present to your future self who has to maintain this code later. You really will save a lot of time in the long term by adding tests to your code.

In addition to unit testing, you should also strongly consider using coverage.py. It checks to see which lines of code are exercised by your tests and identifies the sections that have no coverage. This quickly lets you figure out whether you’re testing the code that you’re about to optimize, such that any mistakes that might creep in during the optimization process are quickly caught.
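
A typical workflow (the test file name here is illustrative) runs the tests under coverage and then reports any lines that were never exercised:

$ pip install coverage
$ coverage run -m pytest test_calc.py
$ coverage report -m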

No-op @profile Decorator

Your unit tests will fail with a NameError exception if your code uses @profile from line_profiler or memory_profiler. The reason is that the unit test framework will not be injecting the @profile decorator into the local namespace. The no-op decorator shown here solves this problem. It is easiest to add it to the block of code that you’re testing and remove it when you’re done.

With the no-op decorator, you can run your tests without modifying the code that you’re testing. This means you can run your tests after every profile-led optimization you make so you’ll never be caught out by a bad optimization step.

Let’s say we have the trivial ex.py module shown in Example 2-18. It has a test (for pytest) and a function that we’ve been profiling with either line_profiler or memory_profiler.

Example 2-18. Simple function and test case where we wish to use @profile
import time

def test_some_fn():
    """Check basic behaviours for our function"""
    assert some_fn(2) == 4
    assert some_fn(1) == 1
    assert some_fn(-1) == 1


@profile
def some_fn(useful_input):
    """An expensive function that we wish to both test and profile"""
    # artificial 'we're doing something clever and expensive' delay
    time.sleep(1)
    return useful_input ** 2


if __name__ == "__main__":
    print(f"Example call `some_fn(2)` == {some_fn(2)}")

If we run pytest on our code, we’ll get a NameError:

Example 2-19. A missing decorator during testing breaks our tests in an unhelpful way!
$ pytest utility.py
=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: noop_profile_decorator
plugins: cov-2.7.1
collected 0 items / 1 errors

================================================================== ERRORS ==================================================================
_______________________________________________________ ERROR collecting utility.py ________________________________________________________
utility.py:20: in <module>
    @profile
E   NameError: name 'profile' is not defined

The solution is to add a no-op decorator at the start of our module (you can remove it once you’re done with profiling). If the @profile decorator is not found in one of the namespaces (because line_profiler or memory_profiler is not being used), then the no-op version we’ve written is added. If line_profiler or memory_profiler has injected the new function into the namespace, then our no-op version is ignored.

For both line_profiler and memory_profiler, we can add the code in Example 2-20.

Example 2-20. Add a no-op @profile decorator to the namespace while unit testing
# check for line_profiler or memory_profiler in the local scope; both
# are injected by their respective tools, and they're absent
# if these tools aren't being used (in which case we need to substitute
# a dummy @profile decorator)
if 'line_profiler' not in dir() and 'profile' not in dir():
    def profile(func):
        return func

Having added the no-op decorator, we can now run pytest successfully, as shown in Example 2-21, along with our profilers, with no additional code changes.

Example 2-21. With the no-op decorator we have working tests and both of our profilers work correctly
$ pytest utility.py
=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: /home/ian/workspace/personal_projects/high_performance_python_book_2e/high-performance-python-2e/examples_ian/ian/ch02/noop_profile_decorator
plugins: cov-2.7.1
collected 1 item

utility.py .                                                                                                                         [100%]

========================================================= 1 passed in 3.04 seconds =========================================================

$ kernprof -l -v utility.py
Example call `some_fn(2)` == 4
...
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def some_fn(useful_input):
    22                                               """An expensive function that we wish to both test and profile"""
    23                                               # artificial 'we're doing something clever and expensive' delay
    24         1    1001345.0 1001345.0    100.0      time.sleep(1)
    25         1          7.0      7.0      0.0      return useful_input ** 2

$ python -m memory_profiler utility.py
Example call `some_fn(2)` == 4
Filename: utility.py

Line #    Mem usage    Increment   Line Contents
================================================
    20   48.277 MiB   48.277 MiB   @profile
    21                             def some_fn(useful_input):
    22                                 """An expensive function that we wish to both test and profile"""
    23                                 # artificial 'we're doing something clever and expensive' delay
    24   48.277 MiB    0.000 MiB       time.sleep(1)
    25   48.277 MiB    0.000 MiB       return useful_input ** 2

You can save yourself a few minutes by avoiding the use of these decorators, but once you’ve lost hours making a false optimization that breaks your code, you’ll want to integrate this into your workflow.

Strategies to Profile Your Code Successfully

Profiling requires some time and concentration. You will stand a better chance of understanding your code if you separate the section you want to test from the main body of your code. You can then unit test the code to preserve correctness, and you can pass in realistic fabricated data to exercise the inefficiencies you want to address.

Do remember to disable any BIOS-based accelerators, as they will only confuse your results. On Ian’s laptop the Intel TurboBoost feature can temporarily accelerate a CPU above its normal maximum speed if it is cool enough. This means that a cool CPU may run the same block of code faster than a hot CPU. Your operating system may also control the clock speed—a laptop on battery power is likely to more aggressively control CPU speed than a laptop on mains power. To create a more stable benchmarking configuration, we:

  • Disable TurboBoost in the BIOS (on Linux you may also be able to do this from the operating system; see the sketch after this list).

  • Disable the operating system’s ability to override the SpeedStep (you will find this in your BIOS if you’re allowed to control it).

  • Only use mains power (never battery power).

  • Disable background tools like backups and Dropbox while running experiments.

  • Run the experiments many times to obtain a stable measurement.

  • Possibly drop to run level 1 (Unix) so that no other tasks are running.

  • Reboot and rerun the experiments to double-confirm the results.
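
As a rough sketch for the first two items, on a Linux machine with the intel_pstate driver you may be able to disable TurboBoost and pin the CPU frequency governor from the operating system rather than the BIOS; the exact paths and tools vary by distribution and CPU:

$ # disable turbo (intel_pstate driver only)
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
$ # pin the CPU frequency governor so the clock speed stays steady
$ sudo cpupower frequency-set --governor performance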

Try to hypothesize the expected behavior of your code and then validate (or disprove!) the hypothesis with the result of a profiling step. Your choices will not change (you can only use the profiled results to drive your decisions), but your intuitive understanding of the code will improve, and this will pay off in future projects as you will be more likely to make performant decisions. Of course, you will verify these performant decisions by profiling as you go.

Do not skimp on the preparation. If you try to performance test code deep inside a larger project without separating it from the larger project, you are likely to witness side effects that will sidetrack your efforts. It is likely to be harder to unit test a larger project when you’re making fine-grained changes, and this may further hamper your efforts. Side effects could include other threads and processes impacting CPU and memory usage and network and disk activity, which will skew your results.

Naturally, you're already using source code control (e.g., git or mercurial), so you'll be able to run multiple experiments in different branches without ever losing “the versions that work well”. If you're not using source code control, do yourself a huge favour and start to do so!

For web servers, investigate dowser and dozer; you can use these to visualize in real time the behavior of objects in the namespace. Definitely consider separating the code you want to test out of the main web application if possible, as this will make profiling significantly easier.

Make sure your unit tests exercise all the code paths in the code that you’re analyzing. Anything you don’t test that is used in your benchmarking may cause subtle errors that will slow down your progress. Use coverage.py to confirm that your tests are covering all the code paths.

Unit testing a complicated section of code that generates a large numerical output may be difficult. Do not be afraid to output a text file of results to run through diff or to use a pickled object. For numeric optimization problems, Ian likes to create long text files of floating-point numbers and use diff—minor rounding errors show up immediately, even if they’re rare in the output.
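
A minimal sketch of that idea (the variable and file names are hypothetical): write one value per line with full repr precision so that diff pinpoints the first divergence, then compare the before-and-after files:

# dump the numeric results one per line, with full round-trip precision
results = some_numeric_routine()  # hypothetical function under optimization
with open("results_before.txt", "w") as f:
    for value in results:
        f.write(f"{value!r}\n")

$ diff results_before.txt results_after.txt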

If your code might be subject to numerical rounding issues due to subtle changes, then you are better off with a large output that can be used for a before-and-after comparison. One cause of rounding errors is the difference in floating-point precision between CPU registers and main memory. Running your code through a different code path can cause subtle rounding errors that might later confound you—it is better to be aware of this as soon as they occur.

Obviously, it makes sense to use a source code control tool while you are profiling and optimizing. Branching is cheap, and it will preserve your sanity.

Wrap-Up

Having looked at profiling techniques, you should have all the tools you need to identify bottlenecks around CPU and RAM usage in your code. Next we’ll look at how Python implements the most common containers, so you can make sensible decisions about representing larger collections of data.

1 memory_profiler measures memory usage according to the International Electrotechnical Commission’s MiB (mebibyte) of 2^20 bytes. This is slightly different from the more common but also more ambiguous MB (megabyte has two commonly accepted definitions!). 1 MiB is equal to 1.048576 (or approximately 1.05) MB. For the purposes of our discussion, unless dealing with very specific amounts, we’ll consider the two equivalent.
