Comparison Of Various Code Optimization Methods¶
There are a number of ways to optimize the performance of Python code. Below is a short summary of various optimization strategies, adapted from http://people.duke.edu/~ccc14/sta-663/MakingCodeFast.html.
There is a traditional sequence for writing code, and it goes like this:
- Make it run
- Make it right (testing)
- Make it fast (optimization)
Making it fast is the last step, and you should only optimize when it is necessary. Also, it is good to know when a program is “fast enough” for your needs. Optimization has a price:
- Cost in programmer time
- Optimized code is often more complex
- Optimized code is often less generic
However, having fast code is often necessary for statistical computing, so we will spend some time learning how to make code run faster. To do so, we need to understand why our code is slow. Code can be slow because of different resource limitations:
- CPU-bound - CPU is working flat out
- Memory-bound - out of RAM, swapping to hard disk
- IO-bound - lots of data transfer to and from the hard disk
- Network-bound - CPU is idle, waiting for data to arrive over the network or from memory ("starvation")
Different bottlenecks may require different approaches. However, there is a natural order to making code fast:
1. Cheat
  - Use a better machine (e.g. if RAM is limiting, buy more RAM)
  - Solve a simpler problem (e.g. will a subsample of the data suffice?)
  - Solve a different problem
2. Find out what is slowing down the code (profiling; see the sketch after this list)
  - Using timeit
  - Using cProfile
  - Using memory_profiler
3. Use better algorithms and data structures
4. Off-load heavy computations to numpy/scipy
5. Use compiled code written in another language
  - Calling code written in C (ctypes, cython)
  - Calling code written in Fortran (f2py)
  - Calling code written in Julia (pyjulia)
6. Convert Python code to compiled code
  - Using numexpr
  - Using numba
  - Using cython
7. Write parallel programs
  - Amdahl's and Gustafson's laws
  - Embarrassingly parallel problems
  - Problems requiring communication and synchronization
8. Execute in parallel
  - On multi-core machines
  - On multiple machines
  - On GPUs
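Before optimizing anything, it helps to measure where the time actually goes. Below is a minimal profiling sketch using the standard-library cProfile module; the toy_workload function is purely hypothetical, a stand-in for whatever code you want to profile.
import cProfile
import pstats
import io

def toy_workload(n=200_000):
    # Hypothetical stand-in for the code you want to profile
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

pr = cProfile.Profile()
pr.enable()
toy_workload()
pr.disable()

# Print the five most expensive entries, sorted by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())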
This notebook will focus on items 4 and 6 above. We will use the example of calculating the pairwise Euclidean distances between all points in a set. Examples are adapted from http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html.
%matplotlib inline
%precision 2
import numpy as np
import matplotlib.pyplot as plt
import numexpr as ne
from numba import jit
xs = np.random.random((1000, 3))
xs.shape
(1000, 3)
Python¶
This is the pure Python version of the algorithm, used as a baseline for comparison.
def pdist_python(xs):
    n, p = xs.shape
    # np.float is deprecated in modern NumPy; use np.float64 explicitly
    D = np.empty((n, n), np.float64)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                s += tmp * tmp
            D[i, j] = s ** 0.5
    return D
%timeit -n 10 pdist_python(xs)
10 loops, best of 3: 3.17 s per loop
Numpy¶
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
Library documentation: http://www.numpy.org/
def pdist_numpy(xs):
return np.sqrt(((xs[:,None,:] - xs)**2).sum(-1))
%timeit -n 100 pdist_numpy(xs)
100 loops, best of 3: 50.3 ms per loop
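The one-liner above relies on broadcasting: xs[:, None, :] has shape (1000, 1, 3), subtracting xs (shape (1000, 3)) broadcasts to a (1000, 1000, 3) array of coordinate differences, and summing over the last axis gives the (1000, 1000) matrix of squared distances. A quick sketch to confirm the shapes and check the result against the pure Python baseline defined above (assuming the cells above have been run):
diff = xs[:, None, :] - xs            # (1000, 1, 3) - (1000, 3) broadcasts to (1000, 1000, 3)
D = np.sqrt((diff ** 2).sum(-1))      # (1000, 1000) distance matrix
print(diff.shape, D.shape)
print(np.allclose(D, pdist_python(xs)))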
Numexpr¶
Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.
In addition, its multi-threaded capabilities can make use of all your cores, which may accelerate computations, especially those that are not memory-bound (e.g. those using transcendental functions).
Library documentation: https://github.com/pydata/numexpr
def pdist_numexpr(xs):
a = xs[:, np.newaxis, :]
return np.sqrt(ne.evaluate('sum((a-xs)**2, axis=2)'))
%timeit -n 100 pdist_numexpr(xs)
100 loops, best of 3: 19.4 ms per loop
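Numexpr picks a thread count automatically, but it can also be set explicitly to see how much of the speed-up comes from multi-threading. A small sketch, assuming the ne import from the setup cell (the thread counts here are illustrative, not recommendations):
ne.set_num_threads(1)                          # force single-threaded evaluation; returns the previous setting
%timeit -n 10 pdist_numexpr(xs)
ne.set_num_threads(ne.detect_number_of_cores())  # restore one thread per detected core
%timeit -n 10 pdist_numexpr(xs)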
Numba¶
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Library documentation: http://numba.pydata.org/
pdist_numba = jit(pdist_python)
%timeit -n 100 pdist_numba(xs)
100 loops, best of 3: 11.1 ms per loop
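Above, jit is applied as a plain function call to the existing pure Python implementation; the more common pattern is to use it as a decorator. Note that the first call triggers compilation, so warm the function up before timing it. A sketch, reusing the setup-cell imports (the function name is just illustrative; nopython=True asks Numba to fail loudly rather than fall back to slower object mode):
@jit(nopython=True)
def pdist_numba_decorated(xs):
    n, p = xs.shape
    D = np.empty((n, n), np.float64)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                s += tmp * tmp
            D[i, j] = s ** 0.5
    return D

pdist_numba_decorated(xs)    # first call compiles; later calls run the cached machine code
%timeit -n 100 pdist_numba_decorated(xs)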
Cython¶
Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It makes writing C extensions for Python as easy as Python itself. Cython gives you the combined power of Python and C to let you:
- write Python code that calls back and forth from and to C or C++ code natively at any point
- easily tune readable Python code into plain C performance by adding static type declarations
- use combined source code level debugging to find bugs in your Python, Cython and C code
- interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays
- quickly build your applications within the large, mature and widely used CPython ecosystem
- integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications
Library documentation: http://cython.org/
%load_ext Cython
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pdist_cython(double[:, ::1] xs):
    cdef int n = xs.shape[0]
    cdef int p = xs.shape[1]
    cdef int i, j, k
    cdef double tmp, d
    # np.float is deprecated; allocate a float64 buffer and view it as a typed memoryview
    cdef double[:, ::1] D = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
%timeit -n 100 pdist_cython(xs)
100 loops, best of 3: 11 ms per loop
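As a quick sanity check (assuming all of the cells above have been run), the compiled versions should agree with the NumPy result to floating-point tolerance:
print(np.allclose(pdist_cython(xs), pdist_numpy(xs)))
print(np.allclose(pdist_numba(xs), pdist_numpy(xs)))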
Scipy¶
Scipy has an optimized version of this particular function already built in. It exploits symmetry in the problem that we do not take advantage of in the "naive" implementations above.
Library documentation: http://www.scipy.org/
from scipy.spatial.distance import pdist as pdist_scipy
%timeit -n 100 pdist_scipy(xs)
100 loops, best of 3: 5.79 ms per loop
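Note that scipy's pdist returns a condensed distance matrix: a 1-D array holding only the n*(n-1)/2 upper-triangle entries, which is part of how it exploits the symmetry. To compare its output directly with the square matrices produced above, convert it with squareform:
from scipy.spatial.distance import squareform

d_condensed = pdist_scipy(xs)         # shape (499500,) for n = 1000
D_square = squareform(d_condensed)    # full symmetric (1000, 1000) matrix with a zero diagonal
print(d_condensed.shape, D_square.shape)
print(np.allclose(D_square, pdist_numpy(xs)))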
Summary¶
Here are all of the implementations together.
print('Python')
%timeit -n 10 pdist_python(xs)
print('Numpy')
%timeit -n 100 pdist_numpy(xs)
print('Numexpr')
%timeit -n 100 pdist_numexpr(xs)
print('Numba')
%timeit -n 100 pdist_numba(xs)
print('Cython')
%timeit -n 100 pdist_cython(xs)
print('Scipy')
%timeit -n 100 pdist_scipy(xs)
Python
10 loops, best of 3: 3.15 s per loop
Numpy
100 loops, best of 3: 49.5 ms per loop
Numexpr
100 loops, best of 3: 19.3 ms per loop
Numba
100 loops, best of 3: 11 ms per loop
Cython
100 loops, best of 3: 11 ms per loop
Scipy
100 loops, best of 3: 5.68 ms per loop
Some observations:
- Pure Python is much, much slower than all of the other methods (close to a 1000x difference!)
- Simply using NumPy where possible results in a huge speed-up
- Numba is surprisingly effective given how easy it is to use, on par with the compiled C code generated by Cython
- Algorithm-level optimizations (such as the symmetry exploited by the Scipy implementation) can easily trump other methods
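To visualize the gap (and put the matplotlib import from the setup cell to use), the timings can also be collected programmatically with the timeit module and plotted on a log scale. A minimal sketch, assuming xs and all of the pdist_* functions above are defined; the repeat count is kept small because the pure Python version alone takes seconds per call:
import timeit

funcs = {
    'python': pdist_python,
    'numpy': pdist_numpy,
    'numexpr': pdist_numexpr,
    'numba': pdist_numba,
    'cython': pdist_cython,
    'scipy': pdist_scipy,
}
# Average seconds per call for each implementation
times = {name: timeit.timeit(lambda f=f: f(xs), number=3) / 3 for name, f in funcs.items()}

plt.bar(range(len(times)), list(times.values()))
plt.xticks(range(len(times)), list(times.keys()), rotation=45)
plt.yscale('log')                 # log scale: pure Python is roughly 1000x slower than Scipy
plt.ylabel('seconds per call')
plt.show()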