CPU profiling of Python applications

by Roman Podoliaka, PyCon Ukraine 2016, Lviv, 23/04/2016

@rpodoliaka

Jupyter notebook: http://bit.ly/1qYFIJ2

slides: http://podoliaka.org/talks/profiling/

Why this topic

  • I'm a development manager at Mirantis - don't write much code, mostly debug (performance) issues

  • Mirantis is a contributor to OpenStack (~ "free" AWS)

  • OpenStack, as a cloud platform, must work fast and be able to scale - performance / scale issues are in every single service

  • when working on a performance issue, (even experienced) engineers tend to guess (wrong) instead of using proven methods

Why you should care about performance

They often say:

  • modern CPUs are "fast enough"

  • Python is one of the slowest programming languages out there - if we wanted performance, we would choose C/C++/Rust/Java/Go/etc

... that's true but:

  • Python can be fast, if used properly

  • writing efficient code saves money in modern "cloud economy" (e.g. you need fewer VM instances or the same number of VMs for a shorter period of time, or just less expensive VMs)

  • even if you don't use clouds, a particular problem domain can have strict performance requirements (e.g. when you have to process a chunk of data in time before the next chunk arrives)

Case study: Huffman coding

  • a basic data compression algorithm

  • the underlying idea: more common symbols are represented using fewer bits than less common symbols

  • a CPU-bound task
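The underlying idea can be sketched in a few lines with heapq and collections.Counter - a toy illustration only, not the implementation used below (which deliberately stays naive so there is something to profile):

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(data):
    """Toy Huffman code assignment: frequent symbols get shorter bit strings."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(freq, next(tie), {sym: ''}) for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))  # ...are merged
    return heap[0][2]

codes = huffman_codes('aaaabbc')
print(codes)  # 'a' (the most frequent symbol) gets the shortest code
```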

In [149]:
def count_freqs(f, size=1):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    
    freqs = {}
    
    for symb in iter(lambda: f.read(size), b''):
        if symb in freqs:
            freqs[symb] += 1
        else:
            freqs[symb] = 1
            
    return freqs
In [150]:
import functools


@functools.total_ordering
class Node(object):
    """Represents a node in a Huffman tree. If the symbol is None, this is an internal node."""
    
    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right
        
    def __eq__(self, rhs):
        return self.freq == rhs.freq
    
    def __lt__(self, rhs):
        return self.freq < rhs.freq

In [151]:
def build_tree(freqs):
    """Build a Huffman tree given a symbol frequencies map."""
    
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    
    while len(queue) > 1:
        queue.sort(reverse=True)
        
        right = queue.pop()
        left = queue.pop()
        
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
        
    return queue.pop()
In [152]:
def huffman_prepare(f):
    return build_tree(count_freqs(f))
    
In [153]:
# a piece of art

war_and_peace = open('data/War_and_Peace.txt', 'rb').read()
print(len(war_and_peace))
3226757
In [155]:
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)

    
%timeit benchmark()
1 loop, best of 3: 1.85 s per loop

~2 seconds for a 3 MB text file - that's not really fast.

How to make it run faster?

Start trying things at random!..

In [156]:
@functools.total_ordering
class Node(object):
    # slots must be faster!
    __slots__ = ('freq', 'symb', 'left', 'right')
    
    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right
        
    def __eq__(self, rhs):
        return self.freq == rhs.freq
    
    def __lt__(self, rhs):
        return self.freq < rhs.freq
In [157]:
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)

    
%timeit benchmark()
1 loop, best of 3: 1.82 s per loop
In [158]:
import collections


def count_freqs(f, size=1):
    # defaultdict is definitely better!
    freqs = collections.defaultdict(int)
    
    for symb in iter(lambda: f.read(size), b''):
        freqs[symb] += 1
            
    return freqs
In [159]:
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)

    
%timeit benchmark()
1 loop, best of 3: 1.72 s per loop
In [160]:
import operator


def build_tree(freqs):
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    
    while len(queue) > 1:
        # key + operator.attrgetter make sorting fast!!!
        queue.sort(key=operator.attrgetter('freq'), reverse=True)
        
        right = queue.pop()
        left = queue.pop()
        
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
        
    return queue.pop()
In [161]:
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)

    
%timeit benchmark()
1 loop, best of 3: 1.71 s per loop

... and developers keep walking in circles

Stop trying things at random!

In [162]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
...

In the face of ambiguity, refuse the temptation to guess.

What you should really do

  • make observations: benchmark your code and understand how much time it takes to execute each part of it

  • choose the part to be improved (ideally, you should be able to predict the performance effect of the improvement in advance)

  • analyze and improve the code part in question, based on the observations done

  • repeat until satisfied with results (e.g. execution time)

Amdahl's law

  • lets you find the maximum expected improvement to an overall system when only one part of the system is improved
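The law can be sketched numerically: if a fraction p of the total runtime is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated s times."""
    return 1.0 / ((1.0 - p) + p / s)

# Optimizing a part that accounts for 40% of the runtime:
print(amdahl_speedup(0.4, 10))   # roughly 1.56
print(amdahl_speedup(0.4, 1e9))  # approaches 1 / (1 - p) ~ 1.67, the hard ceiling
```

However fast you make that 40%, the remaining 60% caps the overall gain - which is why finding the right part to optimize matters.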

Now you need to get a profile of your code, i.e. to understand how it runs and where it spends time

Deterministic profiling

  • user code is somehow instrumented to get precise timings for function calls and returns (e.g. at compile time for languages like C or C++)

  • for interpreted / VM-powered languages, support for deterministic profiling is usually provided by the runtime

  • provides not only precise timings for calls, but also the exact number of calls (the latter has an interesting use case: e.g. SQLAlchemy tests check for unexpected changes in function call counts)

sys.setprofile(profilefunc)

  • the passed function will be called on each function call / return / exception raised

  • the profile function should take three arguments: frame, event, and arg

  • thread-specific (i.e. the hook must be set for each thread being profiled)
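A minimal sketch of the hook (events are recorded into a list rather than printed; `profiler` and `work` are names made up for the illustration):

```python
import sys

events = []

def profiler(frame, event, arg):
    # fires on Python-level 'call'/'return'/'exception' events and on
    # 'c_call'/'c_return'/'c_exception' events for built-in functions
    events.append((event, frame.f_code.co_name))

def work():
    return sum(range(10))

sys.setprofile(profiler)
work()
sys.setprofile(None)  # unset the hook for the current thread

print(('call', 'work') in events)  # True
```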

profile

In [163]:
# https://github.com/python/cpython/blob/2.7/Lib/profile.py

def trace_dispatch(self, frame, event, arg):
    timer = self.timer
    t = timer()
    t = t[0] + t[1] - self.t - self.bias

    if event == "c_call":
        self.c_func_name = arg.__name__

    if self.dispatch[event](self, frame, t):
        t = timer()
        self.t = t[0] + t[1]
    else:
        r = timer()
        self.t = r[0] + r[1] - t # put back unrecorded delta

cProfile

  • same as profile, but implemented in C to reduce the profiling overhead

  • default choice when you need to profile Python code

cProfile usage: profiling a standalone script

python -m cProfile -s cumulative my_script.py

cProfile usage: profiling a Python statement

import cProfile
import re

cProfile.run('re.compile("foo|bar")', 'restats')

cProfile usage: programmatically

import cProfile


pr = cProfile.Profile()
pr.enable()

# ... do something ...

pr.disable()
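To actually see the collected numbers, feed the Profile object to pstats - a sketch (Python 3):

```python
import cProfile
import io
import pstats

pr = cProfile.Profile()
pr.enable()
total = sum(i * i for i in range(100000))  # the code under investigation
pr.disable()

stream = io.StringIO()
stats = pstats.Stats(pr, stream=stream)
stats.sort_stats('cumulative').print_stats(10)  # top 10 rows by cumulative time
report = stream.getvalue()
print(report)
```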

back to the case study...

In [165]:
# profiling in Jupyter/iPython

%prun -s cumulative benchmark()
 


6454139 function calls in 3.258 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    3.258    3.258 benchmark
      1    0.000    0.000    3.257    3.257 huffman_prepare
      1    1.457    1.457    3.256    3.256 count_freqs
3226758    1.119    0.000    1.799    0.000 <lambda>
3226758    0.680    0.000    0.680    0.000 {method 'read' of '_io.BufferedReader' objects}
      1    0.000    0.000    0.001    0.001 <ipython-input-160-4e0db1ec3d41>:4(build_tree)
     87    0.001    0.000    0.001    0.000 {method 'sort' of 'list' objects}

In [166]:
# profiling in Jupyter/iPython

%prun -D data/cprofile.result benchmark()
 
*** Profile stats marshalled to file 'data/cprofile.result'. 

In [170]:
# now that we know where to look for improvement, let's reduce the number of read() calls

def count_freqs(f, size=65536):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    
    freqs = {}
    
    for chunk in iter(lambda: f.read(size), b''):
        for symb in chunk:
            if symb in freqs:
                freqs[symb] += 1
            else:
                freqs[symb] = 1
            
    return freqs
In [171]:
# now we are talking!

def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)

    
%timeit benchmark()
1 loop, best of 3: 604 ms per loop

let's check how the cProfile report changed...

In [173]:
# profiling in Jupyter/iPython

%prun -s cumulative benchmark()
 

   725 function calls in 0.684 seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.680    0.680    0.682    0.682 count_freqs
    51    0.000    0.000    0.002    0.000 <lambda>
    51    0.002    0.000    0.002    0.000 {method 'read' of '_io.BufferedReader' objects}
     1    0.000    0.000    0.001    0.001 build_tree
    87    0.001    0.000    0.001    0.000 {method 'sort' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1    0.000    0.000    0.000    0.000 {built-in method iter}

Deterministic profiling (summary)

Pros:

  • all events are monitored (function calls, returns, exceptions)
  • precise timings are made for the intervals between events

Cons:

  • (relatively) high overhead, as all calls are instrumented
  • calls done in native libraries / modules are not profiled

Statistical profiling

  • a thread's call stack is examined at regular intervals

  • the sampling period can be tuned to achieve an acceptable execution overhead

  • function calls shorter than the sampling period will be missed

Call stack

/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(61)<module>()
-> benchmark()
  /Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(57)benchmark()
-> return huffman_prepare(f)
  /Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(52)huffman_prepare()
-> return build_tree(count_freqs(f))
> /Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(10)count_freqs()
-> if symb in freqs:

Sampling: implementation

import collections
import signal


class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(frame.f_code.co_name,
                                              frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
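Driving such a sampler end-to-end might look as follows - a self-contained, Unix-only sketch (the class is repeated, with a `stop()` added, so the snippet runs on its own; `busy` is a made-up CPU-bound workload):

```python
import collections
import signal

class Sampler(object):
    """Samples the current call stack on every SIGVTALRM (CPU-time timer)."""

    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            stack.append('{}({})'.format(frame.f_code.co_name,
                                         frame.f_globals.get('__name__')))
            frame = frame.f_back
        self.stack_counts[';'.join(reversed(stack))] += 1
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)  # re-arm

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def stop(self):
        signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # cancel the timer

def busy():
    total = 0
    for i in range(3000000):  # enough CPU time to collect a few hundred samples
        total += i * i
    return total

sampler = Sampler()
sampler.start()
busy()
sampler.stop()

# the most frequently sampled stacks are where the CPU time went
for stack, count in sorted(sampler.stack_counts.items(),
                           key=lambda kv: -kv[1])[:3]:
    print(count, stack)
```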

plop

python -m plop.collector <program.py>
In [176]:
import plop.collector as plop


collector = plop.Collector(interval=0.01, mode="prof")
collector.start()

benchmark()

collector.stop()
In [177]:
formatter = plop.PlopFormatter()
formatter.store(collector, 'data/plop.result')

Statistical profiling (summary)

Pros:

  • low overhead, thus it's possible to have profiling enabled in production

  • it's possible to get call stacks from native libraries

Cons:

  • timings are not precise - you can easily miss short function calls

  • you can't get the exact number of function calls done

What else to check out

The take-home message

  • don't try to guess when unsure - use the existing methodologies

  • the performance bottleneck is usually not where you would expect it to be

  • pick the right tool for the job (exact numbers vs execution overhead)

  • Python can be fast. Let's profile our software before pointing a finger!

Thank you!