cProfile – finding the slowest components

The profile module makes it easy to analyze the relative amount of CPU time used within a script or application. Be very careful not to compare these numbers with the results from the timeit module. While the timeit module tries its best to give an accurate benchmark of the absolute amount of time it takes to execute a code snippet, the profile module is only useful for relative results. The reason is that the profiling code itself incurs such a slowdown that the results are not comparable with non-profiled code. There is a way to make it a bit more accurate, but more about that later.
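As a minimal sketch of the distinction (the snippet being measured is arbitrary), timeit is the tool to reach for when you need an absolute number, while the profiling modules are for comparing parts of your code against each other:

```python
import timeit

# timeit gives an absolute benchmark of a snippet with minimal overhead;
# use it on non-profiled code when the actual duration matters.
duration = timeit.timeit('sum(range(100))', number=10000)
print('%.6f seconds for 10000 runs' % duration)
```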

Note

Within this section we will be talking about the profile module, but in the examples we will actually use the cProfile module. The cProfile module is a high-performance C implementation with the same interface as the pure Python profile module.

First profiling run

Let's profile our Fibonacci function from Chapter 5, Decorators – Enabling Code Reuse by Decorating, both with and without the cache function. First, the code:

import sys
import functools


@functools.lru_cache()
def fibonacci_cached(n):
    if n < 2:
        return n
    else:
        return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)


def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


if __name__ == '__main__':
    n = 30
    if sys.argv[-1] == 'cache':
        fibonacci_cached(n)
    else:
        fibonacci(n)

Note

For readability's sake, all cProfile statistics in this section will be stripped of the percall and cumtime columns. These columns are irrelevant for the purposes of these examples.

First we'll execute the function without cache:

# python3 -m cProfile -s calls test_fibonacci.py no_cache
         2692557 function calls (21 primitive calls) in 0.815
         seconds

   Ordered by: call count

   ncalls tottime percall filename:lineno(function)
2692537/1   0.815   0.815 test_fibonacci.py:13(fibonacci)
        7   0.000   0.000 {built-in method builtins.getattr}
        5   0.000   0.000 {built-in method builtins.setattr}
        1   0.000   0.000 {method 'update' of 'dict' objects}
        1   0.000   0.000 {built-in method builtins.isinstance}
        1   0.000   0.000 functools.py:422(decorating_function)
        1   0.000   0.815 test_fibonacci.py:1(<module>)
        1   0.000   0.000 {method 'disable' of '_lsprof.Profiler'}
        1   0.000   0.815 {built-in method builtins.exec}
        1   0.000   0.000 functools.py:43(update_wrapper)
        1   0.000   0.000 functools.py:391(lru_cache)

That's quite a lot of calls, isn't it? Apparently, we called the fibonacci function nearly 3 million times. That is where the profiling modules provide a lot of insight. Let's analyze the metrics a bit further:

  • ncalls: The number of calls that were made to the function
  • tottime: The total time in seconds spent within this function, with all sub-functions excluded
  • percall: tottime / ncalls
  • cumtime: The total time spent within this function, including sub-functions
  • percall: cumtime / ncalls
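As an illustrative sketch (the work function here is made up), the same sorting can also be done programmatically through the pstats module, where the 'calls' key mirrors the -s calls flag used on the command line:

```python
import cProfile
import io
import pstats


def work():
    # A trivial function so the profiler has something to measure.
    return sum(i * i for i in range(1000))


profiler = cProfile.Profile()
profiler.runcall(work)

stream = io.StringIO()
# 'calls' sorts by ncalls, just like `-s calls` on the command line.
pstats.Stats(profiler, stream=stream).sort_stats('calls').print_stats()
report = stream.getvalue()
print(report)
```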

Which is the most useful depends on your use case. The sort order of the default output is easy to change using the -s parameter. But now let's see what the result is with the cached version. Once again, with stripped output:

# python3 -m cProfile -s calls test_fibonacci.py cache
         51 function calls (21 primitive calls) in 0.000 seconds

   Ordered by: call count

   ncalls tottime percall filename:lineno(function)
     31/1   0.000   0.000 test_fibonacci.py:5(fibonacci_cached)
        7   0.000   0.000 {built-in method builtins.getattr}
        5   0.000   0.000 {built-in method builtins.setattr}
        1   0.000   0.000 test_fibonacci.py:1(<module>)
        1   0.000   0.000 {built-in method builtins.isinstance}
        1   0.000   0.000 {built-in method builtins.exec}
        1   0.000   0.000 functools.py:422(decorating_function)
        1   0.000   0.000 {method 'disable' of '_lsprof.Profiler'}
        1   0.000   0.000 {method 'update' of 'dict' objects}
        1   0.000   0.000 functools.py:391(lru_cache)
        1   0.000   0.000 functools.py:43(update_wrapper)

This time we see a tottime of 0.000 because it's simply too fast to measure. Also, while the fibonacci_cached function is still the most executed function, it is only executed 31 times instead of nearly 3 million.
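We can confirm those 31 calls without a profiler at all: functools.lru_cache exposes a cache_info() method on the wrapped function, and each distinct value of n is only computed once:

```python
import functools


@functools.lru_cache()
def fibonacci_cached(n):
    if n < 2:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)


fibonacci_cached(30)
# Each distinct n (0 through 30) is computed exactly once, so
# cache_info() reports 31 misses, matching the 31 calls above.
print(fibonacci_cached.cache_info())
```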

Calibrating your profiler

To illustrate the difference between profile and cProfile, let's try the uncached run again, this time with the profile module. Just a heads-up: this is much slower, so don't be surprised if it stalls a little:

# python3 -m profile -s calls test_fibonacci.py no_cache
         2692558 function calls (22 primitive calls) in 7.696 seconds

   Ordered by: call count

   ncalls tottime percall filename:lineno(function)
2692537/1   7.695   7.695 test_fibonacci.py:13(fibonacci)
        7   0.000   0.000 :0(getattr)
        5   0.000   0.000 :0(setattr)
        1   0.000   0.000 :0(isinstance)
        1   0.001   0.001 :0(setprofile)
        1   0.000   0.000 :0(update)
        1   0.000   0.000 functools.py:43(update_wrapper)
        1   0.000   7.696 profile:0(<code object <module> ...>)
        1   0.000   7.695 test_fibonacci.py:1(<module>)
        1   0.000   0.000 functools.py:391(lru_cache)
        1   0.000   7.695 :0(exec)
        1   0.000   0.000 functools.py:422(decorating_function)
        0   0.000         profile:0(profiler)

Huge difference, isn't it? Now the code is nearly 10 times slower, and the only difference is using the pure Python profile module instead of the cProfile module. This points to a big problem with the profile module: its own overhead is large enough to skew the results, which means we should account for that offset. That is what the Profile.calibrate() method takes care of; it calculates the bias incurred by the profile module. To calculate the bias, we can use the following script:

import profile

if __name__ == '__main__':
    profiler = profile.Profile()
    for i in range(10):
        print(profiler.calibrate(100000))

The numbers will vary slightly, but you should be able to get a fair estimate of the bias using this code. If the numbers still vary a lot, you can increase the number of trials from 100000 to something larger. This type of calibration only works for the profile module. However, if you are looking for more accurate results and the cProfile module does not work for you, due to inheritance issues or lack of support on your platform, you can use this bias to set your offset globally and get more accurate results:

import profile


# The number here is bias calculated earlier
profile.Profile.bias = 2.0939406059394783e-06

For a specific Profile instance:

import profile


profiler = profile.Profile(bias=2.0939406059394783e-06)

Note that, in general, it is better to use a bias that is slightly too small than one that is too large, because a bias that is too large can produce very strange results; in some cases, you will even get negative timings. Let's give it a try for our Fibonacci code:

import sys
import pstats
import profile
import functools


@functools.lru_cache()
def fibonacci_cached(n):
    if n < 2:
        return n
    else:
        return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)


def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


if __name__ == '__main__':
    profiler = profile.Profile(bias=2.0939406059394783e-06)
    n = 30

    if sys.argv[-1] == 'cache':
        profiler.runcall(fibonacci_cached, n)
    else:
        profiler.runcall(fibonacci, n)

    stats = pstats.Stats(profiler).sort_stats('calls')
    stats.print_stats()

While running it, it indeed appears that I've used a bias that's too large:

# python3 test_fibonacci.py no_cache
         2692539 function calls (3 primitive calls) in -0.778
         seconds

   Ordered by: call count

   ncalls tottime percall filename:lineno(function)
2692537/1  -0.778  -0.778 test_fibonacci.py:15(fibonacci)
        1   0.000   0.000 :0(setprofile)
        1   0.000  -0.778 profile:0(<function fibonacci at 0x...>)
        0   0.000         profile:0(profiler)

Still, it shows how the code can be used properly. You can even incorporate the bias calculation within the script using a snippet like this:

import profile


if __name__ == '__main__':
    profiler = profile.Profile()
    profiler.bias = profiler.calibrate(100000)

Selective profiling using decorators

Calculating simple timings is easy enough using decorators, but profiling is also important. Both are useful but serve different goals. Let's look at both options:

import cProfile
import datetime
import functools


def timer(function):
    @functools.wraps(function)
    def _timer(*args, **kwargs):
        start = datetime.datetime.now()
        try:
            return function(*args, **kwargs)
        finally:
            end = datetime.datetime.now()
            print('%s: %s' % (function.__name__, end - start))
    return _timer


def profiler(function):
    @functools.wraps(function)
    def _profiler(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            profiler.enable()
            return function(*args, **kwargs)
        finally:
            profiler.disable()
            profiler.print_stats()
    return _profiler


@profiler
def profiled_fibonacci(n):
    return fibonacci(n)


@timer
def timed_fibonacci(n):
    return fibonacci(n)


def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


if __name__ == '__main__':
    timed_fibonacci(32)
    profiled_fibonacci(32)

The code is simple enough: just a basic timer and a profiler printing some default statistics. Which works best for you depends on your use case, of course, but they definitely both have their uses. The added advantage of this selective profiling is that the output is more limited, which helps readability:

# python3 test_fibonacci.py
 timed_fibonacci: 0:00:01.050200
         7049157 function calls (3 primitive calls) in 2.024
         seconds

   Ordered by: standard name

   ncalls tottime percall filename:lineno(function)
        1   0.000   2.024 test_fibonacci.py:31(profiled_fibonacci)
7049155/1   2.024   2.024 test_fibonacci.py:41(fibonacci)
        1   0.000   0.000 {method 'disable' of '_lsprof.Profiler'}

As you can see, the profiler still makes the code about twice as slow, but it's definitely usable.

Using profile statistics

To get some more intricate profiling results, we will profile the pystone script. The pystone script is an internal Python performance test which benchmarks the Python interpreter fairly thoroughly (note that newer Python 3 releases no longer bundle the test.pystone module, so this example requires an older version). First, let's create the statistics using this script:

from test import pystone
import cProfile


if __name__ == '__main__':
    profiler = cProfile.Profile()
    profiler.runcall(pystone.main)
    profiler.dump_stats('pystone.profile')

When executing the script, you should get something like this:

# python3 test_pystone.py
Pystone(1.2) time for 50000 passes = 0.725432
This machine benchmarks at 68924.4 pystones/second

After running the script, you should have a pystone.profile file containing the profiling results. These statistics can be viewed through the pstats module, which is bundled with Python:

import pstats


stats = pstats.Stats('pystone.profile')
stats.strip_dirs()
stats.sort_stats('calls', 'cumtime')
stats.print_stats(10)

In some cases, it can be interesting to combine the results from multiple measurements. That is possible by specifying multiple files or by using stats.add(*filenames). But first, let's look at the regular output:

# python3 parse_statistics.py

         1050012 function calls in 0.776 seconds

   Ordered by: call count, cumulative time
   List reduced from 21 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   150000    0.032    0.000    0.032    0.000 pystone.py:214(Proc7)
   150000    0.027    0.000    0.027    0.000 pystone.py:232(Func1)
   100000    0.016    0.000    0.016    0.000 {built-in method builtins.chr}
   100000    0.010    0.000    0.010    0.000 {built-in method builtins.ord}
    50002    0.029    0.000    0.029    0.000 pystone.py:52(__init__)
    50000    0.127    0.000    0.294    0.000 pystone.py:144(Proc1)
    50000    0.094    0.000    0.094    0.000 pystone.py:219(Proc8)
    50000    0.048    0.000    0.077    0.000 pystone.py:60(copy)
    50000    0.051    0.000    0.061    0.000 pystone.py:240(Func2)
    50000    0.031    0.000    0.043    0.000 pystone.py:171(Proc3)
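The stats.add() call mentioned earlier can be sketched as follows. The filenames here are hypothetical; each file would normally come from a separate profiling run written out with dump_stats():

```python
import cProfile
import pstats


def busy():
    return sum(i * i for i in range(10000))


# Write two separate measurement files (hypothetical names).
for name in ('run_1.profile', 'run_2.profile'):
    profiler = cProfile.Profile()
    profiler.runcall(busy)
    profiler.dump_stats(name)

# Merge the second run into the first and report the combined totals.
stats = pstats.Stats('run_1.profile')
stats.add('run_2.profile')
stats.sort_stats('calls', 'cumtime')
stats.print_stats(5)
```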

Obviously, the parameters can easily be modified to change the sort order and the number of output lines. But printing is not the only way to use the statistics. There are quite a few packages around that can parse these results and visualize them. One option is RunSnakeRun, which, although useful, does not currently run on Python 3. Another is QCacheGrind, a very nice visualizer for profile statistics, which requires some manual compiling to get running, or some searching for binaries, of course.

Let's look at the output from QCacheGrind. On Windows, the QCacheGrindWin package provides a binary; on Linux, it is most likely available through your package manager; and on OS X, you can try brew install qcachegrind --with-graphviz. There is one more package you will need: pyprof2calltree. It transforms the profile output into a format that QCacheGrind understands. So, after a simple pip install pyprof2calltree, we can convert the profile file into a callgrind file:

# pyprof2calltree -i pystone.profile -o pystone.callgrind
writing converted data to: pystone.callgrind
# qcachegrind pystone.callgrind

This launches the QCacheGrind application. After switching to the appropriate tabs, you should see something like the following image:

(Screenshot: QCacheGrind visualizing the pystone profile statistics)

For a simple script such as this, pretty much all output works. However, with full applications, a tool such as QCacheGrind is invaluable. Looking at the output generated by QCacheGrind, it is immediately obvious which process took the most time. The structure at the top right shows bigger rectangles if the amount of time taken was greater, which is a very useful visualization of the chunks of CPU time that were used. The list at the left is very similar to cProfile and therefore nothing new. The tree at the bottom right can be very valuable or very useless as it is in this case. It shows you the percentage of CPU time taken in a function and more importantly, the relationship of that function with the other functions.

Because these tools scale depending on the input, the results are useful for just about any application. Whether a function takes 100 milliseconds or 100 minutes makes no difference: the output will show a clear overview of the slow parts, which is what we will try to fix.
