Profiling lets us find bottlenecks so we can do the least amount of work to get the biggest practical performance gain. While we’d like to get huge gains in speed and reductions in resource usage with little work, practically you’ll aim for your code to run “fast enough” and “lean enough” to fit your needs. Profiling will let you make the most pragmatic decisions for the least overall effort.
Any measurable resource can be profiled (not just the CPU!). In this chapter we look at both CPU time and memory usage. You could apply similar techniques to measure network bandwidth and disk I/O too.
If a program is running too slowly or using too much RAM, then you’ll want to fix whichever parts of your code are responsible. You could, of course, skip profiling and fix what you believe might be the problem—but be wary, as you’ll often end up “fixing” the wrong thing. Rather than using your intuition, it is far more sensible to first profile, having defined a hypothesis, before making changes to the structure of your code.
Sometimes it’s good to be lazy. By profiling first, you can quickly identify the bottlenecks that need to be solved, and then you can solve just enough of these to achieve the performance you need. If you avoid profiling and jump to optimization, then it is quite likely that you’ll do more work in the long run. Always be driven by the results of profiling.
The first aim of profiling is to test a representative system to identify what’s slow (or using too much RAM, or causing too much disk I/O or network I/O). Profiling adds overhead (10× to 100× slowdowns are not uncommon), yet you still want your code to be exercised as similarly as possible to the way it runs in the real world. Extract a test case and isolate the piece of the system that you need to test. Preferably, it’ll have been written to be in its own set of modules already.
The basic techniques that are introduced first in this chapter include the %timeit magic in IPython, time.time(), and a timing decorator. You can use these techniques to understand the behavior of statements and functions.
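As a taste of the simplest of these techniques, you can record time.time() around a statement you care about. This is a minimal sketch timing a throwaway CPU-bound expression, not the Julia code that follows:

```python
import time

start = time.time()
total = sum(i * i for i in range(1_000_000))  # some CPU-bound work to time
elapsed = time.time() - start
print(f"sum of squares took {elapsed:.4f} seconds")
```

We’ll see this pattern again inside the Julia set code, and then tidy it up into a reusable decorator.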
Then we will cover cProfile (“Using the cProfile Module”), showing you how to use this built-in tool to understand which functions in your code take the longest to run. This will give you a high-level view of the problem so you can direct your attention to the critical functions.
Next, we’ll look at line_profiler (“Using line_profiler for Line-by-Line Measurements”), which will profile your chosen functions on a line-by-line basis. The result will include a count of the number of times each line is called and the percentage of time spent on each line. This is exactly the information you need to understand what’s running slowly and why.
Armed with the results of line_profiler, you’ll have the information you need to move on to using a compiler ([Link to Come]).
In [Link to Come] ([Link to Come]), you’ll learn how to use perf stat to understand the number of instructions that are ultimately executed on a CPU and how efficiently the CPU’s caches are utilized. This allows for advanced-level tuning of matrix operations. You should take a look at that example when you’re done with this chapter.
After line_profiler, if you’re working with long-running systems, you’ll be interested in dowser ([Link to Come]), which introspects live objects via a web browser interface, and in py-spy, which can peek into already-running Python processes.
To help you understand why your RAM usage is high, we’ll show you memory_profiler (“Using memory_profiler to Diagnose Memory Usage”). It is particularly useful for tracking RAM usage over time on a labeled chart, so you can explain to colleagues why certain functions use more RAM than expected.
Whatever approach you take to profiling your code, you must remember to have adequate unit test coverage in your code. Unit tests help you to avoid silly mistakes and help to keep your results reproducible. Avoid them at your peril.
Always profile your code before compiling or rewriting your algorithms. You need evidence to determine the most efficient ways to make your code run faster.
Finally, we’ll give you an introduction to the Python bytecode inside CPython (“Using the dis Module to Examine CPython Bytecode”), so you can understand what’s happening “under the hood.” In particular, having an understanding of how Python’s stack-based virtual machine operates will help you to understand why certain coding styles run more slowly than others.
Before the end of the chapter, we’ll review how to integrate unit tests while profiling (“Unit Testing During Optimization to Maintain Correctness”), to preserve the correctness of your code while you make it run more efficiently.
We’ll finish with a discussion of profiling strategies (“Strategies to Profile Your Code Successfully”), so you can reliably profile your code and gather the correct data to test your hypotheses. Here you’ll learn about how dynamic CPU frequency scaling and features like TurboBoost can skew your profiling results and how they can be disabled.
To walk through all of these steps, we need an easy-to-analyze function. The next section introduces the Julia set. It is a CPU-bound function that’s a little hungry for RAM; it also exhibits nonlinear behavior (so we can’t easily predict the outcomes), which means we need to profile it at runtime rather than analyzing it offline.
The Julia set is an interesting CPU-bound problem for us to begin with. It is a fractal sequence that generates a complex output image, named after Gaston Julia.
The code that follows is a little longer than a version you might write yourself. It has a CPU-bound component and a very explicit set of inputs. This configuration allows us to profile both the CPU usage and the RAM usage so we can understand which parts of our code are consuming two of our scarce computing resources. This implementation is deliberately suboptimal, so we can identify memory-consuming operations and slow statements. Later in this chapter we’ll fix a slow logic statement and a memory-consuming statement, and in [Link to Come] we’ll significantly speed up the overall execution time of this function.
We will analyze a block of code that produces both a false grayscale plot (Figure 2-1) and a pure grayscale variant of the Julia set (Figure 2-3), at the complex point c=-0.62772-0.42193j. A Julia set is produced by calculating each pixel in isolation; this is an “embarrassingly parallel” problem, as no data is shared between points.
If we chose a different c, then we’d get a different image. The location we have chosen has regions that are quick to calculate and others that are slow to calculate; this is useful for our analysis.
The problem is interesting because each pixel is calculated by applying a loop an indeterminate number of times. On each iteration we test to see if this coordinate’s value escapes toward infinity, or if it seems to be held by an attractor. Coordinates that cause few iterations are colored darkly in Figure 2-1, and those that cause a high number of iterations are colored white. White regions are more complex to calculate and so take longer to generate.
We define a set of z-coordinates that we’ll test. The function that we calculate squares the complex number z and adds c:
We iterate on this function while testing the escape condition using abs. If the escape condition is False, then we break out of the loop and record the number of iterations we performed at this coordinate. If the escape condition is never False, then we stop after maxiter iterations. We will later turn this z’s result into a colored pixel representing this complex location.
In pseudocode, it might look like:
```python
for z in coordinates:
    for iteration in range(maxiter):  # limited iterations per point
        if abs(z) < 2.0:  # has the escape condition been broken?
            z = z*z + c
        else:
            break
    # store the iteration count for each z and draw later
```
To explain this function, let’s try two coordinates.
First, we’ll use the coordinate that we draw in the top-left corner of the plot at -1.8-1.8j. We must test abs(z) < 2 before we can try the update rule:
```
>>> z = -1.8-1.8j
>>> abs(z)
2.54558441227
```
We can see that for the top-left coordinate the abs(z) test will be False on the zeroth iteration as 2.54 >= 2.0, so we do not perform the update rule. The output value for this coordinate is 0.
Now let’s jump to the center of the plot at z = 0 + 0j and try a few iterations:
```python
c = -0.62772-0.42193j
z = 0+0j
for n in range(9):
    z = z*z + c
    print(f"{n}: z={z: .5f}, abs(z)={abs(z):0.3f}, c={c: .5f}")
```

This produces:

```
0: z=-0.62772-0.42193j, abs(z)=0.756, c=-0.62772-0.42193j
1: z=-0.41171+0.10778j, abs(z)=0.426, c=-0.62772-0.42193j
2: z=-0.46983-0.51068j, abs(z)=0.694, c=-0.62772-0.42193j
3: z=-0.66777+0.05793j, abs(z)=0.670, c=-0.62772-0.42193j
4: z=-0.18516-0.49930j, abs(z)=0.533, c=-0.62772-0.42193j
5: z=-0.84274-0.23703j, abs(z)=0.875, c=-0.62772-0.42193j
6: z= 0.02630-0.02242j, abs(z)=0.035, c=-0.62772-0.42193j
7: z=-0.62753-0.42311j, abs(z)=0.757, c=-0.62772-0.42193j
8: z=-0.41295+0.10910j, abs(z)=0.427, c=-0.62772-0.42193j
```
We can see that each update to z for these first iterations leaves it with a value where abs(z) < 2 is True. For this coordinate we can iterate 300 times, and still the test will be True. We cannot tell how many iterations we must perform before the condition becomes False, and this may be an infinite sequence. The maximum iteration (maxiter) break clause will stop us iterating potentially forever.
In Figure 2-2 we see the first 50 iterations of the preceding sequence. For 0+0j (the solid line with circle markers) the sequence appears to repeat every eighth iteration, but each sequence of seven calculations has a minor deviation from the previous sequence—we can’t tell if this point will iterate forever within the boundary condition, or for a long time, or maybe for just a few more iterations. The dashed cutoff line shows the boundary at +2.
For -0.82+0j (the dashed line with diamond markers), we can see that after the ninth update the absolute result has exceeded the +2 cutoff, so we stop updating this value.
In this section we break down the code that generates the Julia set. We’ll analyze it in various ways throughout this chapter. As shown in Example 2-1, at the start of our module we import the time module for our first profiling approach and define some coordinate constants.
"""Julia set generator without optional PIL-based image drawing"""
import
time
# area of complex space to investigate
x1
,
x2
,
y1
,
y2
=
-
1.8
,
1.8
,
-
1.8
,
1.8
c_real
,
c_imag
=
-
0.62772
,
-.
42193
To generate the plot, we create two lists of input data. The first is zs (complex z-coordinates), and the second is cs (a complex initial condition). Neither list varies, and we could optimize cs to a single c value as a constant. The rationale for building two input lists is so that we have some reasonable-looking data to profile when we profile RAM usage later in this chapter.
To build the zs and cs lists, we need to know the coordinates for each z. In Example 2-2 we build up these coordinates using xcoord and ycoord and a specified x_step and y_step. The somewhat verbose nature of this setup is useful when porting the code to other tools (e.g., to numpy) and to other Python environments, as it helps to have everything very clearly defined for debugging.
```python
def calc_pure_python(desired_width, max_iterations):
    """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set"""
    x_step = (x2 - x1) / desired_width
    y_step = (y1 - y2) / desired_width
    x = []
    y = []
    ycoord = y2
    while ycoord > y1:
        y.append(ycoord)
        ycoord += y_step
    xcoord = x1
    while xcoord < x2:
        x.append(xcoord)
        xcoord += x_step
    # build a list of co-ordinates and the initial condition for each cell.
    # Note that our initial condition is a constant and could easily be removed,
    # we use it to simulate a real-world scenario with several inputs to our function
    zs = []
    cs = []
    for ycoord in y:
        for xcoord in x:
            zs.append(complex(xcoord, ycoord))
            cs.append(complex(c_real, c_imag))

    print("Length of x:", len(x))
    print("Total elements:", len(zs))
    start_time = time.time()
    output = calculate_z_serial_purepython(max_iterations, zs, cs)
    end_time = time.time()
    secs = end_time - start_time
    print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")

    # This sum is expected for a 1000^2 grid with 300 iterations
    # It ensures that our code evolves exactly as we'd intended
    assert sum(output) == 33219980
```
Having built the zs and cs lists, we output some information about the size of the lists and calculate the output list via calculate_z_serial_purepython. Finally, we sum the contents of output and assert that it matches the expected output value. Ian uses it here to confirm that no errors creep into the book.
As the code is deterministic, we can verify that the function works as we expect by summing all the calculated values. This is useful as a sanity check—when we make changes to numerical code it is very sensible to check that we haven’t broken the algorithm. Ideally we would use unit tests and we’d test more than one configuration of the problem.
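As a sketch of what such a test could look like, here are two plain assert-style tests (which a runner such as pytest could collect) exercising single points whose iteration counts we worked out earlier; the function body repeats the implementation given in Example 2-3:

```python
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Julia update rule, as implemented in this chapter."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output

def test_top_left_point_never_iterates():
    # abs(-1.8-1.8j) is already above 2, so no updates are performed
    out = calculate_z_serial_purepython(300, [-1.8 - 1.8j], [-0.62772 - 0.42193j])
    assert out == [0]

def test_centre_point_hits_maxiter():
    # 0+0j stays inside the boundary, so it runs for all 300 iterations
    out = calculate_z_serial_purepython(300, [0 + 0j], [-0.62772 - 0.42193j])
    assert out == [300]
```

Small single-point tests like these run in microseconds, so they can be kept in the test suite and run after every optimization pass.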
Next, in Example 2-3, we define the calculate_z_serial_purepython function, which expands on the algorithm we discussed earlier. Notably, we also define an output list at the start that has the same length as the input zs and cs lists.
```python
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output
```
Now we call the calculation routine in Example 2-4. By wrapping it in a __main__ check, we can safely import the module without starting the calculations for some of the profiling methods. Note that here we’re not showing the method used to plot the output.
```python
if __name__ == "__main__":
    # Calculate the Julia set using a pure Python solution with
    # reasonable defaults for a laptop
    calc_pure_python(desired_width=1000, max_iterations=300)
```
Once we run the code, we see some output about the complexity of the problem:
```
# running the above produces:
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.087012767791748 seconds
```
In the false-grayscale plot (Figure 2-1), the high-contrast color changes gave us an idea of where the cost of the function was slow-changing or fast-changing. Here, in Figure 2-3, we have a linear color map: black is quick to calculate and white is expensive to calculate.
By showing two representations of the same data, we can see that lots of detail is lost in the linear mapping. Sometimes it can be useful to have various representations in mind when investigating the cost of a function.
After Example 2-4, we saw the output generated by several print statements in our code. On Ian’s laptop this code takes approximately 8 seconds to run using CPython 3.7. It is useful to note that there is always some variation in execution time. You must observe the normal variation when you’re timing your code, or you might incorrectly attribute an improvement in your code simply to a random variation in execution time.
Your computer will be performing other tasks while running your code, such as accessing the network, disk, or RAM, and these factors can cause variations in the execution time of your program.
Ian’s laptop is a Dell 9550 with an Intel Core i7-6700HQ (2.6 GHz, 6 MB cache, quad core with Hyperthreading) and 32 GB of RAM running Linux Mint 19.1 (Ubuntu 18.04).
In calc_pure_python (Example 2-2), we can see several print statements. This is the simplest way to measure the execution time of a piece of code inside a function. It is a very basic approach, but despite being quick and dirty it can be very useful when you’re first looking at a piece of code.
Using print statements is commonplace when debugging and profiling code. It quickly becomes unmanageable, but is useful for short investigations. Try to tidy them up when you’re done with them, or they will clutter your stdout.
A slightly cleaner approach is to use a decorator—here, we add one line of code above the function that we care about. Our decorator can be very simple and just replicate the effect of the print statements. Later, we can make it more advanced.
In Example 2-5 we define a new function, timefn, which takes a function as an argument: the inner function, measure_time, takes *args (a variable number of positional arguments) and **kwargs (a variable number of key/value arguments) and passes them through to fn for execution. Around the execution of fn we capture time.time() and then print the result along with fn.__name__. The overhead of using this decorator is small, but if you’re calling fn millions of times the overhead might become noticeable. We use @wraps(fn) to expose the function name and docstring to the caller of the decorated function (otherwise, we would see the function name and docstring for the decorator, not the function it decorates).
```python
from functools import wraps

def timefn(fn):
    @wraps(fn)
    def measure_time(*args, **kwargs):
        t1 = time.time()
        result = fn(*args, **kwargs)
        t2 = time.time()
        print(f"@timefn: {fn.__name__} took {t2 - t1} seconds")
        return result
    return measure_time

@timefn
def calculate_z_serial_purepython(maxiter, zs, cs):
    ...
```
When we run this version (we keep the print statements from before), we can see that the execution time in the decorated version is very slightly quicker than the call from calc_pure_python. This is due to the overhead of calling a function (the difference is very tiny):
```
Length of x: 1000
Total elements: 1000000
@timefn: calculate_z_serial_purepython took 8.00485110282898 seconds
calculate_z_serial_purepython took 8.004898071289062 seconds
```
The addition of profiling information will inevitably slow your code down—some profiling options are very informative and induce a heavy speed penalty. The trade-off between profiling detail and speed will be something you have to consider.
We can use the timeit module as another way to get a coarse measurement of the execution speed of our CPU-bound function. More typically, you would use this when timing different types of simple expressions as you experiment with ways to solve a problem.
Note that the timeit module temporarily disables the garbage collector. This might impact the speed you’ll see with real-world operations if the garbage collector would normally be invoked by your operations. See the Python documentation for help on this.
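The Python documentation notes that GC can be re-enabled as the first statement of the setup string. A small sketch of the two measurements side by side (the timed expression here is arbitrary):

```python
import timeit

stmt = "sum(list(range(100_000)))"
# By default timeit disables the garbage collector around the timed loop;
# putting gc.enable() in the setup string re-enables it for the measurement
with_gc_disabled = timeit.timeit(stmt, number=50)
with_gc_enabled = timeit.timeit(stmt, setup="gc.enable()", number=50)
print(f"GC disabled: {with_gc_disabled:.4f}s  GC enabled: {with_gc_enabled:.4f}s")
```

For allocation-heavy workloads the GC-enabled figure is the more representative of the two, as it matches what your program will see in production.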
From the command line you can run timeit using:

```
python -m timeit -n 5 -r 1 -s "import julia1" "julia1.calc_pure_python(desired_width=1000, max_iterations=300)"
```
Note that you have to import the module as a setup step using -s, as calc_pure_python is inside that module. timeit has some sensible defaults for short sections of code, but for longer-running functions it can be sensible to specify the number of loops, whose results are averaged for each test (-n 5), and the number of repetitions (-r 5). The best result of all the repetitions is given as the answer.
By default, if we run timeit on this function without specifying -n and -r, it runs 10 loops with 5 repetitions, and this takes 6 minutes to complete. Overriding the defaults can make sense if you want to get your results a little faster.
We’re only interested in the best-case results, as the average and worst case are probably a result of interference by other processes. Choosing the best of five repetitions of five averaged results should give us a fairly stable result:
```
5 loops, best of 1: 8.45 sec per loop
```
Try running the benchmark several times to check if you get varying results—you may need more repetitions to settle on a stable fastest-result time. There is no “correct” configuration, so if you see a wide variation in your timing results, do more repetitions until your final result is stable.
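Inside Python, the same best-of-N approach can be sketched with timeit.repeat, which returns one total per repetition so you can take the minimum yourself:

```python
import timeit

# Five repetitions, each timing 200 runs of the statement; keep the fastest,
# as slower repetitions are most likely skewed by other processes
runs = timeit.repeat("[i * i for i in range(1000)]", repeat=5, number=200)
best = min(runs)
print(f"best of {len(runs)} repetitions: {best:.4f} sec for 200 loops")
```

Increasing repeat (rather than number) is the usual way to reach a stable fastest-result time on a noisy machine.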
Our results show that the overall cost of calling calc_pure_python is 8.45 seconds (as the best case), while single calls to calculate_z_serial_purepython take 8.0 seconds as measured by the @timefn decorator. The difference is mainly the time taken to create the zs and cs lists.
Inside IPython, we can use the magic %timeit in the same way. If you are developing your code interactively in IPython or in a Jupyter Notebook, then you can use:
```
In [1]: import julia1

In [2]: %timeit julia1.calc_pure_python(desired_width=1000, max_iterations=300)
```
It is worth considering the variation in load that you get on a normal computer. Many background tasks are running (e.g., Dropbox, backups) that could impact the CPU and disk resources at random. Scripts in web pages can also cause unpredictable resource usage. Figure 2-4 shows the single CPU being used at 100% for some of the timing steps we just performed; the other cores on this machine are each lightly working on other tasks.
Occasionally the System Monitor shows spikes of activity on this machine. It is sensible to watch your System Monitor to check that nothing else is interfering with your critical resources (CPU, disk, network).
We can step outside of Python for a moment to use a standard system utility on Unix-like systems. The following will record various views on the execution time of your program, and it won’t care about the internal structure of your code:
```
$ /usr/bin/time -p python julia1_nopil.py
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.279886722564697 seconds
real 8.84
user 8.73
sys 0.10
```
Note that we specifically use /usr/bin/time rather than time so we get the system’s time and not the simpler (and less useful) version built into our shell. If you try time --verbose and you get an error, you’re probably looking at the shell’s built-in time command and not the system command.
Using the -p portability flag, we get three results:

- real records the wall clock or elapsed time.
- user records the amount of time the CPU spent on your task outside of kernel functions.
- sys records the time spent in kernel-level functions.
By adding user and sys, you get a sense of how much time was spent in the CPU. The difference between this and real might tell you about the amount of time spent waiting for I/O; it might also suggest that your system is busy running other tasks that are distorting your measurements.
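Within a Python process you can sketch the same split using the standard time module: time.perf_counter() tracks wall-clock time, while time.process_time() tracks CPU time (roughly user plus sys) for the current process:

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()
sum(i * i for i in range(500_000))  # CPU-bound work: advances both clocks
time.sleep(0.2)                     # waiting: advances only the wall clock
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s  waiting: ~{wall - cpu:.3f}s")
```

For our CPU-bound Julia code the two figures are nearly identical; a large gap would point at I/O waits or a busy machine.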
time is useful because it isn’t specific to Python. It includes the time taken to start the python executable, which might be significant if you start lots of fresh processes (rather than having a long-running single process). If you often have short-running scripts where the startup time is a significant part of the overall runtime, then time can be a more useful measure.
We can add the --verbose flag to get even more output:
```
$ /usr/bin/time --verbose python julia1_nopil.py
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 8.477287530899048 seconds
	Command being timed: "python julia1_nopil.py"
	User time (seconds): 8.97
	System time (seconds): 0.05
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.03
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 98620
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 26645
	Voluntary context switches: 1
	Involuntary context switches: 27
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```
Probably the most useful indicator here is Major (requiring I/O) page faults, as this indicates whether the operating system is having to load pages of data from the disk because the data no longer resides in RAM. This will cause a speed penalty.
In our example the code and data requirements are small, so no page faults occur. If you have a memory-bound process, or several programs that use variable and large amounts of RAM, you might find that this gives you a clue as to which program is being slowed down by disk accesses at the operating system level because parts of it have been swapped out of RAM to disk.
cProfile is a built-in profiling tool in the standard library. It hooks into the virtual machine in CPython to measure the time taken to run every function that it sees. This introduces a greater overhead, but you get correspondingly more information. Sometimes the additional information can lead to surprising insights into your code.
cProfile is one of two profilers in the standard library, alongside profile. profile is the original, slower, pure-Python profiler; cProfile has the same interface and is written in C for a lower overhead. If you’re curious about the history of these libraries, see Armin Rigo’s 2005 request to include cProfile in the standard library.
A good practice when profiling is to generate a hypothesis about the speed of parts of your code before you profile it. Ian likes to print out the code snippet in question and annotate it. Forming a hypothesis ahead of time means you can measure how wrong you are (and you will be!) and improve your intuition about certain coding styles.
You should never avoid profiling in favor of a gut instinct (we warn you—you will get it wrong!). It is definitely worth forming a hypothesis ahead of profiling to help you learn to spot possible slow choices in your code, and you should always back up your choices with evidence.
Always be driven by results that you have measured, and always start with some quick-and-dirty profiling to make sure you’re addressing the right area. There’s nothing more humbling than cleverly optimizing a section of code only to realize (hours or days later) that you missed the slowest part of the process and haven’t really addressed the underlying problem at all.
Let’s hypothesize that calculate_z_serial_purepython is the slowest part of the code. In that function, we do a lot of dereferencing and make many calls to basic arithmetic operators and the abs function. These will probably show up as consumers of CPU resources.
Here, we’ll use the cProfile module to run a variant of the code. The output is spartan but helps us figure out where to analyze further.
The -s cumulative flag tells cProfile to sort by cumulative time spent inside each function; this gives us a view into the slowest parts of a section of code. The cProfile output is written to screen directly after our usual print results:
```
$ python -m cProfile -s cumulative julia1_nopil.py
...
Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 11.498265266418457 seconds
         36221995 function calls in 12.234 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.234   12.234 {built-in method builtins.exec}
        1    0.038    0.038   12.234   12.234 julia1_nopil.py:1(<module>)
        1    0.571    0.571   12.197   12.197 julia1_nopil.py:23(calc_pure_python)
        1    8.369    8.369   11.498   11.498 julia1_nopil.py:9(calculate_z_serial_purepython)
 34219980    3.129    0.000    3.129    0.000 {built-in method builtins.abs}
  2002000    0.121    0.000    0.121    0.000 {method 'append' of 'list' objects}
        1    0.006    0.006    0.006    0.006 {built-in method builtins.sum}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
Sorting by cumulative time gives us an idea about where the majority of execution time is spent. This result shows us that 36,221,995 function calls occurred in just over 12 seconds (this time includes the overhead of using cProfile). Previously our code took around 8 seconds to execute—we’ve just added a 4-second penalty by measuring how long each function takes to execute.
We can see that the entry point to the code, julia1_nopil.py on line 1, takes a total of 12 seconds. This is just the __main__ call to calc_pure_python. ncalls is 1, indicating that this line is executed only once.
Inside calc_pure_python, the call to calculate_z_serial_purepython consumes 11 seconds. Both functions are called only once. We can derive that approximately 1 second is spent on lines of code inside calc_pure_python, separate from calling the CPU-intensive calculate_z_serial_purepython function. However, we can’t derive which lines take the time inside that function using cProfile.
Inside calculate_z_serial_purepython, the time spent on lines of code (without calling other functions) is 8 seconds. This function makes 34,219,980 calls to abs, which take a total of 3 seconds, along with some other calls that do not cost much time.
What about the {abs} call? This line is measuring the individual calls to the abs function inside calculate_z_serial_purepython. While the per-call cost is negligible (it is recorded as 0.000 seconds), the total time for 34,219,980 calls is 3 seconds. We couldn’t predict in advance exactly how many calls would be made to abs, as the Julia function has unpredictable dynamics (that’s why it is so interesting to look at).
At best we could have said that it will be called a minimum of 1,000,000 times, as we’re calculating 1000*1000 pixels. At most it will be called 300,000,000 times, as we calculate 1,000,000 pixels with a maximum of 300 iterations. So, 34 million calls is roughly 10% of the worst case.
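We can also check that the observed call count is consistent with the checksum we asserted earlier: the while loop evaluates abs(z) once per completed iteration plus one final failing test per pixel, so the total should be sum(output) plus one call per pixel:

```python
pixels = 1000 * 1000
max_iterations = 300
observed_calls = 34_219_980  # the ncalls column for abs in the cProfile output
checksum = 33_219_980        # sum(output), asserted in calc_pure_python

assert observed_calls == checksum + pixels  # one extra failing test per pixel
# and the count sits between the best case (one test per pixel) and the worst
assert pixels <= observed_calls <= pixels * (max_iterations + 1)
print(f"abs() ran at {observed_calls / (pixels * max_iterations):.0%} of the worst case")
```

The two profiler figures agree exactly, which is a good sign that we understand what the code is doing.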
If we look at the original grayscale image (Figure 2-3) and, in our mind’s eye, squash the white parts together and into a corner, we can estimate that the expensive white region accounts for roughly 10% of the rest of the image.
The next line in the profiled output, {method 'append' of 'list' objects}, details the creation of 2,002,000 list items.
Why 2,002,000 items? Before you read on, think about how many list items are being constructed.
This creation of the 2,002,000 items is occurring in calc_pure_python during the setup phase.
The zs and cs lists will be 1000*1000 items each (generating 1,000,000 * 2 calls), and these are built from a list of 1,000 x- and 1,000 y-coordinates. In total, this is 2,002,000 calls to append.
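The arithmetic can be written out explicitly:

```python
width = height = 1000
axis_appends = width + height      # building the x and y coordinate lists
cell_appends = 2 * width * height  # one append each to zs and cs per cell
total_appends = axis_appends + cell_appends
assert total_appends == 2_002_000  # matches the ncalls figure from cProfile
```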
It is important to note that this cProfile output is not ordered by parent functions; it is summarizing the expense of all functions in the executed block of code. Figuring out what is happening on a line-by-line basis is very hard with cProfile, as we only get profile information for the function calls themselves, not each line within the functions.
Inside calculate_z_serial_purepython, we can account for the {abs} calls; in total, they cost approximately 3.1 seconds. We know that calculate_z_serial_purepython costs 11.4 seconds in total.
The final line of the profiling output refers to lsprof; this is the original name of the tool that evolved into cProfile and can be ignored.
To get more control over the results of cProfile, we can write a statistics file and then analyze it in Python:
$ python -m cProfile -o profile.stats julia1.py
We can load this into Python as follows, and it will give us the same cumulative time report as before:
In [1]: import pstats
In [2]: p = pstats.Stats("profile.stats")
In [3]: p.sort_stats("cumulative")
Out[3]: <pstats.Stats at 0x7f77088edf28>
In [4]: p.print_stats()
Fri Jun 14 17:59:28 2019    profile.stats

         36221995 function calls in 12.169 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.169   12.169 {built-in method builtins.exec}
        1    0.033    0.033   12.169   12.169 julia1_nopil.py:1(<module>)
        1    0.576    0.576   12.135   12.135 julia1_nopil.py:23(calc_pure_python)
        1    8.266    8.266   11.429   11.429 julia1_nopil.py:9(calculate_z_serial_purepython)
 34219980    3.163    0.000    3.163    0.000 {built-in method builtins.abs}
  2002000    0.123    0.000    0.123    0.000 {method 'append' of 'list' objects}
        1    0.006    0.006    0.006    0.006 {built-in method builtins.sum}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
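If you prefer to stay inside Python entirely, the same profile-then-analyze workflow can be driven programmatically; busy_work below is a hypothetical stand-in for the Julia code:

```python
# A sketch of driving cProfile and pstats from Python rather than the
# command line; busy_work is a hypothetical stand-in for your workload.
import cProfile
import io
import pstats

def busy_work():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
result = busy_work()
profiler.disable()

# Sort and pretty-print the collected statistics into a string buffer
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # show only the top five entries by cumulative time
print(stream.getvalue().strip().splitlines()[0])  # e.g. "... function calls in ... seconds"
```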
To trace which functions we’re profiling, we can print the caller information. In the following two listings we can see that calculate_z_serial_purepython
is the most expensive function, and it is called from one place. If it were called from many places, these listings might help us narrow down the locations of the most expensive parents:
In [5]: p.print_callers()
   Ordered by: cumulative time

Function                                          was called by...
                                                      ncalls  tottime  cumtime
{built-in method builtins.exec}                   <-
julia1_nopil.py:1(<module>)                       <-        1    0.033   12.169  {built-in method builtins.exec}
julia1_nopil.py:23(calc_pure_python)              <-        1    0.576   12.135  julia1_nopil.py:1(<module>)
julia1_nopil.py:9(calculate_z_serial_purepython)  <-        1    8.266   11.429  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.abs}                    <- 34219980    3.163    3.163  julia1_nopil.py:9(calculate_z_serial_purepython)
{method 'append' of 'list' objects}               <-  2002000    0.123    0.123  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.sum}                    <-        1    0.006    0.006  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.print}                  <-        3    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{built-in method builtins.len}                    <-        2    0.000    0.000  julia1_nopil.py:9(calculate_z_serial_purepython)
                                                            2    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{built-in method time.time}                       <-        2    0.000    0.000  julia1_nopil.py:23(calc_pure_python)
{method 'disable' of '_lsprof.Profiler' objects}  <-
We can flip this around the other way to show which functions call other functions:
In [6]: p.print_callees()
   Ordered by: cumulative time

Function                                          called...
                                                      ncalls  tottime  cumtime
{built-in method builtins.exec}                   ->        1    0.033   12.169  julia1_nopil.py:1(<module>)
julia1_nopil.py:1(<module>)                       ->        1    0.576   12.135  julia1_nopil.py:23(calc_pure_python)
julia1_nopil.py:23(calc_pure_python)              ->        1    8.266   11.429  julia1_nopil.py:9(calculate_z_serial_purepython)
                                                            2    0.000    0.000  {built-in method builtins.len}
                                                            3    0.000    0.000  {built-in method builtins.print}
                                                            1    0.006    0.006  {built-in method builtins.sum}
                                                            2    0.000    0.000  {built-in method time.time}
                                                      2002000    0.123    0.123  {method 'append' of 'list' objects}
julia1_nopil.py:9(calculate_z_serial_purepython)  -> 34219980    3.163    3.163  {built-in method builtins.abs}
                                                            2    0.000    0.000  {built-in method builtins.len}
{built-in method builtins.abs}                    ->
{method 'append' of 'list' objects}               ->
{built-in method builtins.sum}                    ->
{built-in method builtins.print}                  ->
{built-in method builtins.len}                    ->
{built-in method time.time}                       ->
{method 'disable' of '_lsprof.Profiler' objects}  ->
cProfile
is rather verbose, and you need a side screen to see it without lots of word-wrap. Since it is built in, though, it is a convenient tool for quickly identifying bottlenecks. Tools like line_profiler
and memory_profiler
, which we discuss later in this chapter, will then help you to drill down to the specific lines that you should pay attention to.
snakeviz is a visualizer that draws the output of cProfile as a diagram in which larger boxes are areas of code that take longer to run. It replaces the older runsnake tool.
Use snakeviz to get a high-level understanding of a cProfile statistics file, particularly if you’re investigating a new project where you have few intuitions. The diagrams will help you visualise the CPU-usage behaviour of the system, and they may highlight areas that you hadn’t expected to be expensive, potentially pointing you at easy wins.
To install SnakeViz use $ pip install snakeviz
.
In Figure 2-5 we have the visual output of the profile.stats
file we’ve just generated. The entry point for the program is shown at the top of the diagram. Each layer down is a function called from the function above.
The width of the diagram represents the entire time taken by the program’s execution. The fourth layer shows that the majority of the time is spent in calculate_z_serial_purepython. The fifth layer breaks this down some more - the unannotated block to the right, occupying approximately 25% of that layer, represents the time spent in the abs function. Seeing these larger blocks quickly brings home how the time is spent inside your program.
The next section down shows a table that is a pretty-printed version of the statistics we’ve just been looking at; you can sort by cumtime (cumulative time), percall (cost per call), and ncalls (number of calls altogether), among others. Starting with cumtime will tell you - overall - which functions cost the most. They’re a pretty good place to start your investigations.
If you’re comfortable looking at tables, then the console output for cProfile may be adequate for you. To communicate with others, though, we’d strongly suggest you use diagrams - such as this output from snakeviz - to help them quickly understand the point you’re making.
In Ian’s opinion, Robert Kern’s line_profiler
is the strongest tool for identifying the cause of CPU-bound problems in Python code. It works by profiling individual functions on a line-by-line basis, so you should start with cProfile
and use the high-level view to guide which functions to profile with line_profiler
.
It is worthwhile printing and annotating versions of the output from this tool as you modify your code, so you have a record of changes (successful or not) that you can quickly refer to. Don’t rely on your memory when you’re working on line-by-line changes.
To install line_profiler
, issue the command pip install line_profiler
.
A decorator (@profile
) is used to mark the chosen function. The kernprof
script is used to execute your code, and the CPU time and other statistics for each line of the chosen function are recorded.
The requirement to modify the source code is a minor annoyance, as the addition of a decorator will break your unit tests unless you make a dummy decorator—see “No-op @profile Decorator”.
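A minimal fallback along these lines might look as follows (a sketch: kernprof injects its own profile into builtins when it runs your script, so we only define the pass-through when it’s absent):

```python
# A no-op @profile decorator sketch: when kernprof hasn't injected `profile`
# into builtins, fall back to a pass-through decorator so decorated code
# (and its unit tests) still runs unmodified.
import builtins

if not hasattr(builtins, "profile"):
    def profile(func):
        return func  # no-op: return the function unchanged

@profile
def add(a, b):
    return a + b

print(add(2, 3))  # the decorated function behaves normally; prints 5
```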
The arguments are -l
for line-by-line (rather than function-level) profiling and -v
for verbose output. Without -v
you receive an .lprof output file that you can later analyze with the line_profiler
module. In Example 2-6, we’ll do a full run on our CPU-bound function.
$ kernprof -l -v julia1_lineprofiler.py
...
Wrote profile results to julia1_lineprofiler.py.lprof
Timer unit: 1e-06 s

Total time: 49.2011 s
File: julia1_lineprofiler.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3298.0   3298.0      0.0      output = [0] * len(zs)
    13   1000001     386087.0      0.4      0.8      for i in range(len(zs)):
    14   1000000     365713.0      0.4      0.7          n = 0
    15   1000000     420496.0      0.4      0.9          z = zs[i]
    16   1000000     402887.0      0.4      0.8          c = cs[i]
    17  34219980   18692480.0      0.5     38.0          while abs(z) < 2 and n < maxiter:
    18  33219980   15165236.0      0.5     30.8              z = z * z + c
    19  33219980   13357752.0      0.4     27.1              n += 1
    20   1000000     407153.0      0.4      0.8          output[i] = n
    21         1          1.0      1.0      0.0      return output
Introducing kernprof.py
adds a substantial amount to the runtime. In this example, calculate_z_serial_purepython
takes 49 seconds; this is up from 8 seconds using simple print
statements and 12 seconds using cProfile
. The gain is that we get a line-by-line breakdown of where the time is spent inside the function.
The % Time
column is the most helpful—we can see that 38% of the time is spent on the while
testing. We don’t know whether the first statement (abs(z) < 2
) is more expensive than the second (n < maxiter
), though. Inside the loop, we see that the update to z
is also fairly expensive. Even n += 1
is expensive! Python’s dynamic lookup machinery is at work for every loop, even though we’re using the same types for each variable in each loop—this is where compiling and type specialization ([Link to Come]) give us a massive win. The creation of the output
list and the updates on line 20 are relatively cheap compared to the cost of the while
loop.
If you haven’t thought about the complexity of Python’s dynamic machinery before, do think about what happens in that n += 1 operation. Python has to check that the n object has an __add__ function (and if it doesn’t, Python walks up any inherited classes to see if they provide this functionality), and then the other object (1 in this case) is passed in so that the __add__ function can decide how to handle the operation. Remember that the second argument might be a float or other object that may, or may not, be compatible. This all happens dynamically.
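We can make that dispatch visible by subclassing int and counting how often __add__ is invoked (an illustrative sketch, not how CPython implements integers internally):

```python
# Illustrating the dynamic dispatch behind `n += 1`: each += resolves
# through the object's __add__ method, which we instrument here.
class Counted(int):
    calls = 0

    def __add__(self, other):
        Counted.calls += 1  # record every dispatch through __add__
        return Counted(int(self) + other)

n = Counted(0)
for _ in range(5):
    n += 1  # each += goes through our __add__

print(n, Counted.calls)  # prints "5 5": five additions, five dispatches
```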
The obvious way to further analyze the while
statement is to break it up. While there has been some discussion in the Python community around the idea of rewriting the .pyc files with more detailed information for multipart, single-line statements, we are unaware of any production tools that offer a more fine-grained analysis than line_profiler
.
In Example 2-7, we break the while
logic into several statements. This additional complexity will increase the runtime of the function, as we have more lines of code to execute, but it might also help us to understand the costs incurred in this part of the code.
Before you look at the code, do you think we’ll learn about the costs of the fundamental operations this way? Might other factors complicate the analysis?
$ kernprof -l -v julia1_lineprofiler2.py
...
Wrote profile results to julia1_lineprofiler2.py.lprof
Timer unit: 1e-06 s

Total time: 82.88 s
File: julia1_lineprofiler2.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3309.0   3309.0      0.0      output = [0] * len(zs)
    13   1000001     421828.0      0.4      0.5      for i in range(len(zs)):
    14   1000000     392861.0      0.4      0.5          n = 0
    15   1000000     452537.0      0.5      0.5          z = zs[i]
    16   1000000     427382.0      0.4      0.5          c = cs[i]
    17   1000000     396178.0      0.4      0.5          while True:
    18  34219980   19150179.0      0.6     23.1              not_yet_escaped = abs(z) < 2
    19  34219980   15159735.0      0.4     18.3              iterations_left = n < maxiter
    20  34219980   14352481.0      0.4     17.3              if not_yet_escaped and iterations_left:
    21  33219980   16958693.0      0.5     20.5                  z = z * z + c
    22  33219980   14316108.0      0.4     17.3                  n += 1
    23                                                       else:
    24   1000000     409440.0      0.4      0.5                  break
    25   1000000     439254.0      0.4      0.5      output[i] = n
    26         1          0.0      0.0      0.0      return output
This version takes 82 seconds to execute, while the previous version took 49 seconds. Other factors did complicate the analysis. In this case, having extra statements that have to be executed 34,219,980 times each slows down the code. If we hadn’t used kernprof.py
to investigate the line-by-line effect of this change, we might have drawn other conclusions about the reason for the slowdown, as we’d have lacked the necessary evidence.
At this point it makes sense to step back to the earlier timeit
technique to test the cost of individual expressions:
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: z = 0 + 0j

In [2]: %timeit abs(z) < 2
97.6 ns ± 0.138 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: n = 1

In [4]: maxiter = 300

In [5]: %timeit n < maxiter
42.1 ns ± 0.0355 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
From this simple analysis, it looks as though the logic test on n is more than two times faster than the call to abs. Since the order of evaluation for Python statements is both left to right and opportunistic (it short-circuits), it makes sense to put the cheapest test on the left side of the equation. On 1 in every 301 tests for each coordinate, the n < maxiter test will be False, so Python wouldn’t need to evaluate the other side of the and operator.
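That short-circuiting behaviour is easy to demonstrate directly: with and, the right-hand expression only runs when the left-hand one is True:

```python
# Demonstrating short-circuit evaluation of `and`: when the cheap left-hand
# test is False, the expensive right-hand test is never evaluated.
calls = {"cheap": 0, "expensive": 0}

def cheap():
    calls["cheap"] += 1
    return False  # always fails, so `and` short-circuits here

def expensive():
    calls["expensive"] += 1
    return True

for _ in range(1000):
    if cheap() and expensive():
        pass

print(calls)  # {'cheap': 1000, 'expensive': 0}
```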
We never know whether abs(z) < 2
will be False
until we evaluate it, and our earlier observations for this region of the complex plane suggest it is True
around 10% of the time for all 300 iterations. If we wanted to have a strong understanding of the time complexity of this part of the code, it would make sense to continue the numerical analysis. In this situation, however, we want an easy check to see if we can get a quick win.
We can form a new hypothesis stating, “By swapping the order of the operators in the while
statement we will achieve a reliable speedup.” We can test this hypothesis using kernprof
, but the additional overheads of profiling this way might add too much noise. Instead, we can use an earlier version of the code, running a test comparing while abs(z) < 2 and n < maxiter:
against while n < maxiter and abs(z) < 2:
.
The result is a fairly stable improvement of approximately 0.1 seconds. This result is obviously minor and is very problem-specific, though using a more suitable approach to solve this problem (e.g., swapping to using Cython or PyPy, as described in [Link to Come]) would yield greater gains.
We can be confident in our result because:
We stated a hypothesis that was easy to test.
We changed our code so that only the hypothesis would be tested (never test two things at once!).
We gathered enough evidence to support our conclusion.
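The comparison itself can also be sketched in isolation with the timeit module (representative values assumed; n = 300 makes the cheap test False, so the second ordering can short-circuit past the abs call — in the real loop this happens only once per 301 tests, so the real-world gain is smaller):

```python
# Timing the two operand orderings in isolation. With n = 300 and
# maxiter = 300, `n < maxiter` is False, so the second ordering skips
# the abs call entirely via short-circuiting.
import timeit

setup = "z = 0 + 0j; n = 300; maxiter = 300"
t_abs_first = timeit.timeit("abs(z) < 2 and n < maxiter", setup=setup, number=1_000_000)
t_cmp_first = timeit.timeit("n < maxiter and abs(z) < 2", setup=setup, number=1_000_000)
print(f"abs(z) first: {t_abs_first:.3f}s  n < maxiter first: {t_cmp_first:.3f}s")
```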
For completeness, we can run a final kernprof
on the two main functions including our optimization to confirm that we have a full picture of the overall complexity of our code.
$ kernprof -l -v julia1_lineprofiler3.py
...
Wrote profile results to julia1_lineprofiler3.py.lprof
Timer unit: 1e-06 s

Total time: 48.9154 s
File: julia1_lineprofiler3.py
Function: calculate_z_serial_purepython at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                               """Calculate output list using Julia update rule"""
    12         1       3312.0   3312.0      0.0      output = [0] * len(zs)
    13   1000001     383405.0      0.4      0.8      for i in range(len(zs)):
    14   1000000     362177.0      0.4      0.7          n = 0
    15   1000000     414681.0      0.4      0.8          z = zs[i]
    16   1000000     395037.0      0.4      0.8          c = cs[i]
    17  34219980   18686399.0      0.5     38.2          while n < maxiter and abs(z) < 2:
    18  33219980   15010549.0      0.5     30.7              z = z * z + c
    19  33219980   13256992.0      0.4     27.1              n += 1
    20   1000000     402857.0      0.4      0.8          output[i] = n
    21         1          1.0      1.0      0.0      return output
As expected, we can see from the output in Example 2-9 that calculate_z_serial_purepython
takes most (97%) of the time of its parent function. The list-creation steps are minor in comparison.
Total time: 88.334 s
File: julia1_lineprofiler3.py
Function: calc_pure_python at line 24

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    24                                           @profile
    25                                           def calc_pure_python(draw_output, desired_width, max_iterations):
    26                                               """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
...
    44         1          1.0      1.0      0.0      zs = []
    45         1          0.0      0.0      0.0      cs = []
    46      1001        651.0      0.7      0.0      for ycoord in y:
    47   1001000     591973.0      0.6      0.7          for xcoord in x:
    48   1000000     859713.0      0.9      1.0              zs.append(complex(xcoord, ycoord))
    49   1000000     902310.0      0.9      1.0              cs.append(complex(c_real, c_imag))
    50
    51         1         40.0     40.0      0.0      print("Length of x:", len(x))
    52         1          7.0      7.0      0.0      print("Total elements:", len(zs))
    53         1          4.0      4.0      0.0      start_time = time.time()
    54         1   85969310.0 85969310.0   97.3      output = calculate_z_serial_purepython(max_iterations, zs, cs)
    55         1          4.0      4.0      0.0      end_time = time.time()
    56         1          1.0      1.0      0.0      secs = end_time - start_time
    57         1         36.0     36.0      0.0      print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    58
    59         1       6345.0   6345.0      0.0      assert sum(output) == 33219980  # this sum is expected for 1000^2 grid with 300 iterations
Just as Robert Kern’s line_profiler
package measures CPU usage, the memory_profiler
module by Fabian Pedregosa and Philippe Gervais measures memory usage on a line-by-line basis. Understanding the memory usage characteristics of your code allows you to ask yourself two questions:
Could we use less RAM by rewriting this function to work more efficiently?
Could we use more RAM and save CPU cycles by caching?
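The second question (trading RAM for CPU cycles) is often answered with a cache; here is a small sketch using functools.lru_cache:

```python
# Trading RAM for CPU cycles with a cache: functools.lru_cache stores each
# result, so repeated inputs cost a dictionary lookup instead of a recompute.
from functools import lru_cache

computed = 0

@lru_cache(maxsize=None)
def slow_square(n):
    global computed
    computed += 1  # counts real computations, not cache hits
    return n * n

results = [slow_square(i % 10) for i in range(1000)]
print(computed)  # 10: only the unique inputs were actually computed
```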
memory_profiler
operates in a very similar way to line_profiler
, but runs far more slowly. If you install the psutil
package (optional but recommended), memory_profiler
will run faster. Memory profiling may easily make your code run 10 to 100 times slower. In practice, you will probably use memory_profiler
occasionally and line_profiler
(for CPU profiling) more frequently.
Install memory_profiler
with the command pip install memory_profiler
(and optionally pip install psutil
).
As mentioned, the implementation of memory_profiler
is not as performant as the implementation of line_profiler
. It may therefore make sense to run your tests on a smaller problem that completes in a useful amount of time. Overnight runs might be sensible for validation, but you need quick and reasonable iterations to diagnose problems and hypothesize solutions. The code in Example 2-10 uses the full 1,000 × 1,000 grid, and the statistics took about 2 hours to collect on Ian’s laptop.
The requirement to modify the source code is a minor annoyance. As with line_profiler
, a decorator (@profile
) is used to mark the chosen function. This will break your unit tests unless you make a dummy decorator—see “No-op @profile Decorator”.
When dealing with memory allocation, you must be aware that the situation is not as clear-cut as it is with CPU usage. Generally, it is more efficient to overallocate memory to a process into a local pool that can be used at leisure, as memory allocation operations are relatively expensive. Furthermore, garbage collection is not instant, so objects may be unavailable but still in the garbage collection pool for some time.
The outcome of this is that it is hard to really understand what is happening with memory usage and release inside a Python program, as a line of code may not allocate a deterministic amount of memory as observed from outside the process. Observing the gross trend over a set of lines is likely to lead to better insight than observing the behavior of just one line.
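The delayed-release point can be demonstrated directly: objects in a reference cycle survive del and are only reclaimed when the garbage collector runs:

```python
# Objects in a reference cycle survive `del` and are only reclaimed when
# the garbage collector runs - one reason memory release isn't instant.
import gc

class Node:
    pass

a = Node()
b = Node()
a.partner = b
b.partner = a  # create a reference cycle

del a, b                  # the names are gone, but the cycle keeps both alive
collected = gc.collect()  # force a collection pass
print(collected > 0)      # True: the cycle was found and reclaimed
```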
Let’s take a look at the output from memory_profiler
in Example 2-10. Inside calculate_z_serial_purepython
on line 12, we see that the allocation of 1,000,000 items causes approximately 7 MB of RAM to be added to this process. This does not mean that the output
list is definitely 7 MB in size, just that the process grew by approximately 7 MB during the internal allocation of the list.
In the parent function on line 46, we see that the allocation of the zs
and cs
lists changes the Mem usage
column from 48MB to 125MB (a change of +77MB). Again, it is worth noting that this is not necessarily the true size of the arrays, just the size that the process grew by once these lists had been created.
At the time of writing the memory_profiler module exhibits a bug - the Increment column does not always match the change in the Mem usage column. During the first edition of this book these columns were correctly tracked; you might want to check the status of this bug on GitHub: https://github.com/pythonprofilers/memory_profiler/issues/236. We recommend you use the Mem usage column, as it correctly tracks the change in process size per line of code.
$ python -m memory_profiler julia1_memoryprofiler.py
...
Line #    Mem usage    Increment   Line Contents
================================================
     9  126.363 MiB  126.363 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):
    11                                 """Calculate output list using Julia update rule"""
    12  133.973 MiB    7.609 MiB       output = [0] * len(zs)
    13  136.988 MiB    0.000 MiB       for i in range(len(zs)):
    14  136.988 MiB    0.000 MiB           n = 0
    15  136.988 MiB    0.000 MiB           z = zs[i]
    16  136.988 MiB    0.000 MiB           c = cs[i]
    17  136.988 MiB    0.258 MiB           while n < maxiter and abs(z) < 2:
    18  136.988 MiB    0.000 MiB               z = z * z + c
    19  136.988 MiB    0.000 MiB               n += 1
    20  136.988 MiB    0.000 MiB           output[i] = n
    21  136.988 MiB    0.000 MiB       return output

...
Line #    Mem usage    Increment   Line Contents
================================================
    24   48.113 MiB   48.113 MiB   @profile
    25                             def calc_pure_python(draw_output, desired_width, max_iterations):
    26                                 """Create a list of complex co-ordinates (zs) and complex parameters (cs), build Julia set and display"""
    27   48.113 MiB    0.000 MiB       x_step = (x2 - x1) / desired_width
    28   48.113 MiB    0.000 MiB       y_step = (y1 - y2) / desired_width
    29   48.113 MiB    0.000 MiB       x = []
    30   48.113 MiB    0.000 MiB       y = []
    31   48.113 MiB    0.000 MiB       ycoord = y2
    32   48.113 MiB    0.000 MiB       while ycoord > y1:
    33   48.113 MiB    0.000 MiB           y.append(ycoord)
    34   48.113 MiB    0.000 MiB           ycoord += y_step
    35   48.113 MiB    0.000 MiB       xcoord = x1
    36   48.113 MiB    0.000 MiB       while xcoord < x2:
    37   48.113 MiB    0.000 MiB           x.append(xcoord)
    38   48.113 MiB    0.000 MiB           xcoord += x_step
    44   48.113 MiB    0.000 MiB       zs = []
    45   48.113 MiB    0.000 MiB       cs = []
    46  125.961 MiB    0.000 MiB       for ycoord in y:
    47  125.961 MiB    0.258 MiB           for xcoord in x:
    48  125.961 MiB    0.512 MiB               zs.append(complex(xcoord, ycoord))
    49  125.961 MiB    0.512 MiB               cs.append(complex(c_real, c_imag))
    50
    51  125.961 MiB    0.000 MiB       print("Length of x:", len(x))
    52  125.961 MiB    0.000 MiB       print("Total elements:", len(zs))
    53  125.961 MiB    0.000 MiB       start_time = time.time()
    54  136.609 MiB   10.648 MiB       output = calculate_z_serial_purepython(max_iterations, zs, cs)
    55  136.609 MiB    0.000 MiB       end_time = time.time()
    56  136.609 MiB    0.000 MiB       secs = end_time - start_time
    57  136.609 MiB    0.000 MiB       print(calculate_z_serial_purepython.__name__ + " took", secs, "seconds")
    58
    59  136.609 MiB    0.000 MiB       assert sum(output) == 33219980  # this sum is expected for 1000^2 grid with 300 iterations
Another way to visualize the change in memory use is to sample over time and plot the result. memory_profiler
has a utility called mprof
, used once to sample the memory usage and a second time to visualize the samples. It samples by time and not by line, so it barely impacts the runtime of the code.
Figure 2-6 is created using mprof run julia1_memoryprofiler.py
. This writes a statistics file that is then visualized using mprof plot
. Our two functions are bracketed: this shows where in time they are entered, and we can see the growth in RAM as they run. Inside calculate_z_serial_purepython
, we can see the steady increase in RAM usage throughout the execution of the function; this is caused by all the small objects (int
and float
types) that are created.
In addition to observing the behavior at the function level, we can add labels using a context manager. The snippet in Example 2-11 is used to generate the graph in Figure 2-7. We can see the create_output_list
label: it appears momentarily at around 1.5 seconds after calculate_z_serial_purepython
and results in the process being allocated more RAM. We then pause for a second; the time.sleep(1)
is an artificial addition to make the graph easier to understand.
@profile
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    with profile.timestamp("create_output_list"):
        output = [0] * len(zs)
    time.sleep(1)
    with profile.timestamp("calculate_output"):
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and abs(z) < 2:
                z = z * z + c
                n += 1
            output[i] = n
    return output
In the calculate_output
block that runs for most of the graph, we see a very slow, linear increase in RAM usage. This will be from all of the temporary numbers used in the inner loops. Using the labels really helps us to understand at a fine-grained level where memory is being consumed. Interestingly we see the “peak RAM usage” line - a dashed vertical line just before 10 seconds - occurring before the termination of the program. Potentially this is due to the garbage collector recovering some RAM from the temporary objects used during calculate_output
.
What happens if we simplify our code and remove the creation of the zs and cs lists? We then have to calculate these co-ordinates inside calculate_z_serial_purepython (so the same work is performed), but we’ll save RAM by not storing them in lists. You can see the code in Example 2-12.
In Figure 2-8, we see a major change in behavior - the overall envelope of RAM usage drops from 140MB to 60MB, reducing our RAM usage by more than half!
@profile
def calculate_z_serial_purepython(maxiter, x, y):
    """Calculate output list using Julia update rule"""
    output = []
    for ycoord in y:
        for xcoord in x:
            z = complex(xcoord, ycoord)
            c = complex(c_real, c_imag)
            n = 0
            while n < maxiter and abs(z) < 2:
                z = z * z + c
                n += 1
            output.append(n)
    return output
If we want to measure the RAM used by several statements we can use the IPython magic %memit
, which works just like %timeit
. In [Link to Come] we will look at using %memit
to measure the memory cost of lists and discuss various ways of using RAM more efficiently.
memory_profiler offers an interesting aid to debugging a large process via the --pdb-mmem=XXX flag. The pdb debugger will be activated once the process exceeds XXX MB. This will drop you in directly at the point in your code where too many allocations are occurring - useful if you’re working in a space-constrained environment.
py-spy is an intriguing new sampling profiler - rather than requiring any code changes, it introspects an already-running Python process and reports in the console with a top-like display. Being a sampling profiler, it has almost no runtime impact on your code. It is written in Rust and requires elevated privileges to introspect another process.
This tool could be very useful in a production environment with long-running processes or complicated installation requirements. It supports Windows, Mac and Linux. Install it using pip install py-spy
(note the dash, there’s a separate pyspy
project that isn’t related). If your process is already running then you’ll want to use ps
to get its process identifier (the PID), then this can be passed into py-spy
as shown in Example 2-13.
$ ps -A -o pid,rss,cmd | ack python
...
15953 96156 python julia1_nopil.py
...
$ sudo env "PATH=$PATH" py-spy --pid 15953
If you look at Figure 2-9 you’ll see a static picture of a top-like display; in the console this updates every second to show which functions are currently taking most of the time.
py-spy can also export a flame chart. Here we’ll run that option whilst asking py-spy to run our code directly, without requiring a PID, using $ py-spy --flame profile.svg -- python julia1_nopil.py. You’ll see in Figure 2-10 that the width of the display represents the entire program’s runtime, and each layer moving down the image represents functions called from above.
So far we’ve reviewed various ways to measure the cost of Python code (both for CPU and RAM usage). We haven’t yet looked at the underlying bytecode used by the virtual machine, though. Understanding what’s going on “under the hood” helps to build a mental model of what’s happening in slow functions, and it’ll help when you come to compile your code. So, let’s introduce some bytecode.
The dis
module lets us inspect the underlying bytecode that we run inside the stack-based CPython virtual machine. Having an understanding of what’s happening in the virtual machine that runs your higher-level Python code will help you to understand why some styles of coding are faster than others. It will also help when you come to use a tool like Cython, which steps outside of Python and generates C code.
The dis
module is built in. You can pass it code or a module, and it will print out a disassembly. In Example 2-14 we disassemble the outer loop of our CPU-bound function.
You should try to disassemble one of your own functions and attempt to follow exactly how the disassembled output matches the original code. Can you match the following dis output to the original function?
In [1]: import dis
In [2]: import julia1_nopil
In [3]: dis.dis(julia1_nopil.calculate_z_serial_purepython)
 11           0 LOAD_CONST               1 (0)
              2 BUILD_LIST               1
              4 LOAD_GLOBAL              0 (len)
              6 LOAD_FAST                1 (zs)
              8 CALL_FUNCTION            1
             10 BINARY_MULTIPLY
             12 STORE_FAST               3 (output)

 12          14 SETUP_LOOP              94 (to 110)
             16 LOAD_GLOBAL              1 (range)
             18 LOAD_GLOBAL              0 (len)
             20 LOAD_FAST                1 (zs)
             22 CALL_FUNCTION            1
             24 CALL_FUNCTION            1
             26 GET_ITER
        >>   28 FOR_ITER                78 (to 108)
             30 STORE_FAST               4 (i)

 13          32 LOAD_CONST               1 (0)
             34 STORE_FAST               5 (n)
...
 19     >>   98 LOAD_FAST                5 (n)
            100 LOAD_FAST                3 (output)
            102 LOAD_FAST                4 (i)
            104 STORE_SUBSCR
            106 JUMP_ABSOLUTE           28
        >>  108 POP_BLOCK

 20     >>  110 LOAD_FAST                3 (output)
            112 RETURN_VALUE
The output is fairly straightforward, if terse. The first column contains line numbers that relate to our original file. The second column contains several >>
symbols; these are the destinations for jump points elsewhere in the code. The third column is the operation address along with the operation name. The fourth column contains the parameters for the operation. The fifth column contains annotations to help line up the bytecode with the original Python parameters.
Refer back to Example 2-3 to match the bytecode to the corresponding Python code. The bytecode starts by putting the constant value 0 onto the stack, and then it builds a single-element list. Next, it searches the namespaces to find the len
function, puts it on the stack, searches the namespaces again to find zs
, and then puts that onto the stack. On line 12 it calls the len
function from the stack, which consumes the zs
reference in the stack; then it applies a binary multiply to the last two arguments (the length of zs
and the single-element list) and stores the result in output
. That’s the first line of our Python function now dealt with. Follow the next block of bytecode to understand the behavior of the second line of Python code (the outer for
loop).
The jump points (>>
) match to instructions like JUMP_ABSOLUTE
and POP_JUMP_IF_FALSE
. Go through your own disassembled function and match the jump points to the jump instructions.
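A small function of your own is a good place to practice; disassembling the loop below shows a conditional jump and the >> target it returns to (the exact opcodes vary by Python version):

```python
# Disassemble a tiny loop to practice matching >> jump targets with the
# jump instructions that point at them.
import dis

def count_up(limit):
    # simple loop: each iteration jumps back to re-test the condition
    n = 0
    while n < limit:
        n += 1
    return n

dis.dis(count_up)  # look for >> markers and the jumps aimed at them
```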
Having introduced bytecode, we can now ask: what’s the bytecode and time cost of writing a function out explicitly versus using built-ins to perform the same task?
There should be one—and preferably only one—obvious way to do it. Although that way may not be obvious at first unless you’re Dutch…
Tim Peters, The Zen of Python
There will be various ways to express your ideas using Python. Generally it should be clear what the most sensible option is, but if your experience is primarily with an older version of Python or another programming language, then you may have other methods in mind. Some of these ways of expressing an idea may be slower than others.
You probably care more about readability than speed for most of your code, so your team can code efficiently, without being puzzled by performant but opaque code. Sometimes you will want performance, though (without sacrificing readability). Some speed testing might be what you need.
Take a look at the two code snippets in Example 2-15. They both do the same job, but the first will generate a lot of additional Python bytecode, which will cause more overhead.
def fn_expressive(upper=1_000_000):
    total = 0
    for n in range(upper):
        total += n
    return total

def fn_terse(upper=1_000_000):
    return sum(range(upper))

assert fn_expressive() == fn_terse(), "We expect an identical result from these functions"
Both functions calculate the sum of a range of integers. A simple rule of thumb (but one you must back up using profiling!) is that more lines of bytecode will execute more slowly than fewer equivalent lines of bytecode that use built-in functions. In Example 2-16 we use IPython’s %timeit
magic function to measure the best execution time from a set of runs. fn_terse
runs over twice as fast as fn_expressive
!
In [2]: %timeit fn_expressive()
52.4 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit fn_terse()
18.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
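Outside IPython you can make a similar measurement with the standard-library timeit module. Here is a minimal sketch; the absolute numbers will differ from the %timeit runs above and depend on your machine:

```python
import timeit

def fn_expressive(upper=1_000_000):
    total = 0
    for n in range(upper):
        total += n
    return total

def fn_terse(upper=1_000_000):
    return sum(range(upper))

# timeit.timeit returns the total wall-clock time for `number` calls
for fn in (fn_expressive, fn_terse):
    seconds = timeit.timeit(fn, number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1e3:.1f} ms per call")
```
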
If we use the dis
module to investigate the code for each function, as shown in Example 2-17, we can see that the virtual machine has 17 lines to execute with the more expressive function and only 6 to execute with the very readable but more terse second function.
In [4]: import dis

In [5]: dis.dis(fn_expressive)
  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (total)

  3           4 SETUP_LOOP              24 (to 30)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_FAST                0 (upper)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                12 (to 28)
             16 STORE_FAST               2 (n)

  4          18 LOAD_FAST                1 (total)
             20 LOAD_FAST                2 (n)
             22 INPLACE_ADD
             24 STORE_FAST               1 (total)
             26 JUMP_ABSOLUTE           14
        >>   28 POP_BLOCK

  5     >>   30 LOAD_FAST                1 (total)
             32 RETURN_VALUE

In [6]: dis.dis(fn_terse)
  8           0 LOAD_GLOBAL              0 (sum)
              2 LOAD_GLOBAL              1 (range)
              4 LOAD_FAST                0 (upper)
              6 CALL_FUNCTION            1
              8 CALL_FUNCTION            1
             10 RETURN_VALUE
The difference between the two code blocks is striking. Inside fn_expressive()
, we maintain two local variables and iterate over a range using a for
statement. On each iteration, the for
loop checks whether a StopIteration
exception has been raised. Each iteration also applies the total.__add__
function, which checks the type of the second variable (n
). These checks all add a little expense.
Inside fn_terse()
, we call out to sum
, an optimized C function that knows how to generate the final result without creating intermediate Python objects. This is much faster, although each iteration must still check the types of the objects that are being added together (in Chapter 4 we look at ways of fixing the type so we don’t need to check it on each iteration).
As noted previously, you must profile your code—if you just rely on this heuristic, then you will inevitably write slower code at some point. It is definitely worth learning if Python has a shorter and still readable way to solve your problem built in. If so, it is more likely to be easily readable by another programmer, and it will probably run faster.
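If you want to apply this rule of thumb without reading a full disassembly, dis.get_instructions lets you count the instructions directly. This is a sketch; the exact counts vary across Python versions (the 17 versus 6 shown above came from an older CPython), so treat the comparison, not the absolute numbers, as the signal:

```python
import dis

def fn_expressive(upper=1_000_000):
    total = 0
    for n in range(upper):
        total += n
    return total

def fn_terse(upper=1_000_000):
    return sum(range(upper))

# Count the instructions the virtual machine must dispatch for each
n_expressive = sum(1 for _ in dis.get_instructions(fn_expressive))
n_terse = sum(1 for _ in dis.get_instructions(fn_terse))
print(n_expressive, n_terse)
```
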
If you aren’t already unit testing your code, then you are probably hurting your longer-term productivity. Ian (blushing) is embarrassed to note that once he spent a day optimizing his code, having disabled unit tests because they were inconvenient, only to discover that his significant speedup result was due to breaking a part of the algorithm he was improving. You do not need to make this mistake even once.
Add unit tests to your code for a saner life. You’ll give your current self and your colleagues confidence that your code works, and you’ll give a present to your future self who has to maintain this code later. You really will save a lot of time in the long term by adding tests to your code.
In addition to unit testing, you should also strongly consider using coverage.py
. It checks to see which lines of code are exercised by your tests and identifies the sections that have no coverage. This quickly lets you figure out whether you’re testing the code that you’re about to optimize, such that any mistakes that might creep in during the optimization process are quickly caught.
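A typical workflow with the coverage.py command line (assuming coverage and pytest are installed) might look like this:

```shell
# Run the test suite under coverage's measurement
coverage run -m pytest

# Terminal summary, including the line numbers that were never executed
coverage report -m

# Optional: browsable HTML report written to htmlcov/
coverage html
```
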
Your unit tests will fail with a NameError
exception if your code uses @profile
from line_profiler
or memory_profiler
. The reason is that the unit test framework will not be injecting the @profile
decorator into the local namespace. The no-op decorator shown here solves this problem. It is easiest to add it to the block of code that you’re testing and remove it when you’re done.
With the no-op decorator, you can run your tests without modifying the code that you’re testing. This means you can run your tests after every profile-led optimization you make so you’ll never be caught out by a bad optimization step.
Let’s say we have the trivial ex.py
module shown in Example 2-18. It has a test (for pytest
) and a function that we’ve been profiling with either line_profiler
or memory_profiler
.
import time

def test_some_fn():
    """Check basic behaviours for our function"""
    assert some_fn(2) == 4
    assert some_fn(1) == 1
    assert some_fn(-1) == 1

@profile
def some_fn(useful_input):
    """An expensive function that we wish to both test and profile"""
    # artificial 'we're doing something clever and expensive' delay
    time.sleep(1)
    return useful_input ** 2

if __name__ == "__main__":
    print(f"Example call `some_fn(2)` == {some_fn(2)}")
If we run pytest
on our code, we’ll get a NameError
:
$ pytest utility.py
=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: noop_profile_decorator
plugins: cov-2.7.1
collected 0 items / 1 errors
================================================================== ERRORS ==================================================================
_______________________________________________________ ERROR collecting utility.py ________________________________________________________
utility.py:20: in <module>
    @profile
E   NameError: name 'profile' is not defined
The solution is to add a no-op decorator at the start of our module (you can remove it once you’re done with profiling). If the @profile
decorator is not found in one of the namespaces (because line_profiler
or memory_profiler
is not being used), then the no-op version we’ve written is added. If line_profiler
or memory_profiler
has injected the new function into the namespace, then our no-op version is ignored.
For both line_profiler
and memory_profiler
, we can add the code in Example 2-20.
# check for line_profiler or memory_profiler in the local scope; both
# are injected by their respective tools, or they're absent
# if these tools aren't being used (in which case we need to substitute
# a dummy @profile decorator)
if 'line_profiler' not in dir() and 'profile' not in dir():
    def profile(func):
        return func
Having added the no-op decorator, we can now run pytest
successfully, as shown in Example 2-21, along with our profilers, with no additional code changes.
$ pytest utility.py
=========================================================== test session starts ============================================================
platform linux -- Python 3.7.3, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: /home/ian/workspace/personal_projects/high_performance_python_book_2e/high-performance-python-2e/examples_ian/ian/ch02/noop_profile_decorator
plugins: cov-2.7.1
collected 1 item

utility.py .                                                                                                                         [100%]

========================================================= 1 passed in 3.04 seconds =========================================================
$ kernprof -l -v utility.py
Example call `some_fn(2)` == 4
...
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def some_fn(useful_input):
    22                                               """An expensive function that we wish to both test and profile"""
    23                                               # artificial 'we're doing something clever and expensive' delay
    24         1  1001345.0 1001345.0    100.0      time.sleep(1)
    25         1        7.0      7.0      0.0       return useful_input ** 2
$ python -m memory_profiler utility.py
Example call `some_fn(2)` == 4
Filename: utility.py

Line #    Mem usage    Increment   Line Contents
================================================
    20   48.277 MiB   48.277 MiB   @profile
    21                             def some_fn(useful_input):
    22                                 """An expensive function that we wish to both test and profile"""
    23                                 # artificial 'we're doing something clever and expensive' delay
    24   48.277 MiB    0.000 MiB       time.sleep(1)
    25   48.277 MiB    0.000 MiB       return useful_input ** 2
You can save yourself a few minutes by avoiding the use of these decorators, but once you’ve lost hours making a false optimization that breaks your code, you’ll want to integrate this into your workflow.
Profiling requires some time and concentration. You will stand a better chance of understanding your code if you separate the section you want to test from the main body of your code. You can then unit test the code to preserve correctness, and you can pass in realistic fabricated data to exercise the inefficiencies you want to address.
Do remember to disable any BIOS-based accelerators, as they will only confuse your results. On Ian’s laptop the Intel TurboBoost feature can temporarily accelerate a CPU above its normal maximum speed if it is cool enough. This means that a cool CPU may run the same block of code faster than a hot CPU. Your operating system may also control the clock speed—a laptop on battery power is likely to more aggressively control CPU speed than a laptop on mains power. To create a more stable benchmarking configuration, we:
Disable TurboBoost in the BIOS.
Disable the operating system’s ability to override the SpeedStep (you will find this in your BIOS if you’re allowed to control it).
Only use mains power (never battery power).
Disable background tools like backups and Dropbox while running experiments.
Run the experiments many times to obtain a stable measurement.
Possibly drop to run level 1 (Unix) so that no other tasks are running.
Reboot and rerun the experiments to double-confirm the results.
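The “run the experiments many times” step can be sketched with timeit.repeat, which returns one timing per repeat so you can inspect the spread (work here is an illustrative stand-in for your code under test):

```python
import statistics
import timeit

def work():
    # Stand-in for the code under test
    return sum(i * i for i in range(100_000))

# Seven repeats of 10 calls each; the minimum approximates the
# least-disturbed run, while the spread hints at system noise
timings = [t / 10 for t in timeit.repeat(work, number=10, repeat=7)]
print(f"min    {min(timings) * 1e3:.2f} ms")
print(f"median {statistics.median(timings) * 1e3:.2f} ms")
print(f"stdev  {statistics.stdev(timings) * 1e6:.1f} µs")
```
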
Try to hypothesize the expected behavior of your code and then validate (or disprove!) the hypothesis with the result of a profiling step. Your choices will not change (you can only use the profiled results to drive your decisions), but your intuitive understanding of the code will improve, and this will pay off in future projects as you will be more likely to make performant decisions. Of course, you will verify these performant decisions by profiling as you go.
Do not skimp on the preparation. If you try to performance test code deep inside a larger project without separating it from the larger project, you are likely to witness side effects that will sidetrack your efforts. It is likely to be harder to unit test a larger project when you’re making fine-grained changes, and this may further hamper your efforts. Side effects could include other threads and processes impacting CPU and memory usage and network and disk activity, which will skew your results.
Naturally you’re already using source code control (e.g., git or mercurial), so you’ll be able to run multiple experiments in different branches without ever losing “the versions that work well.” If you’re not using source code control, do yourself a huge favor and start to do so!
For web servers, investigate dowser
and dozer
; you can use these to visualize in real time the behavior of objects in the namespace. Definitely consider separating the code you want to test out of the main web application if possible, as this will make profiling significantly easier.
Make sure your unit tests exercise all the code paths in the code that you’re analyzing. Anything you don’t test that is used in your benchmarking may cause subtle errors that will slow down your progress. Use coverage.py
to confirm that your tests are covering all the code paths.
Unit testing a complicated section of code that generates a large numerical output may be difficult. Do not be afraid to output a text file of results to run through diff
or to use a pickled
object. For numeric optimization problems, Ian likes to create long text files of floating-point numbers and use diff
—minor rounding errors show up immediately, even if they’re rare in the output.
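The diff-of-text-files approach can be sketched with the standard library alone. In this illustrative example, two code paths that “should” agree differ by one rounding step, and the diff pinpoints it:

```python
import difflib
import os
import tempfile

def dump(values, path):
    # One repr per line gives full precision and stable formatting
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v!r}\n")

with tempfile.TemporaryDirectory() as tmp:
    before = os.path.join(tmp, "before.txt")
    after = os.path.join(tmp, "after.txt")
    dump([x * 0.1 for x in range(5)], before)   # one code path
    dump([x / 10 for x in range(5)], after)     # an "equivalent" one
    with open(before) as f1, open(after) as f2:
        diff = list(difflib.unified_diff(
            f1.readlines(), f2.readlines(), "before", "after"))

# 3 * 0.1 and 3 / 10 round differently, so the diff is not empty
print("".join(diff) if diff else "outputs identical")
```
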
If your code might be subject to numerical rounding issues due to subtle changes, then you are better off with a large output that can be used for a before-and-after comparison. One cause of rounding errors is the difference in floating-point precision between CPU registers and main memory. Running your code through a different code path can cause subtle rounding errors that might later confound you—it is better to be aware of this as soon as they occur.
Obviously, it makes sense to use a source code control tool while you are profiling and optimizing. Branching is cheap, and it will preserve your sanity.
Having looked at profiling techniques, you should have all the tools you need to identify bottlenecks around CPU and RAM usage in your code. Next we’ll look at how Python implements the most common containers, so you can make sensible decisions about representing larger collections of data.
1 memory_profiler
measures memory usage according to the International Electrotechnical Commission’s MiB (mebibyte) of 2^20 bytes. This is slightly different from the more common but also more ambiguous MB (megabyte has two commonly accepted definitions!). 1 MiB is equal to 1.048576 (or approximately 1.05) MB. For the purposes of our discussion, unless dealing with very specific amounts, we’ll consider the two equivalent.
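As a quick sanity check of the footnote’s arithmetic:

```python
# 1 MiB = 2**20 bytes; 1 MB = 10**6 bytes
mib_in_mb = 2**20 / 10**6
print(mib_in_mb)  # 1.048576
```
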