Comparison Of Various Code Optimization Methods¶
There are a number of ways to optimize the performance of Python code. Below is a short summary of various optimization strategies, adapted from http://people.duke.edu/~ccc14/sta-663/MakingCodeFast.html.
There is a traditional sequence for writing code, and it goes like this:
- Make it run
- Make it right (testing)
- Make it fast (optimization)
Making it fast is the last step, and you should only optimize when it is necessary. Also, it is good to know when a program is “fast enough” for your needs. Optimization has a price:
- Cost in programmer time
- Optimized code is often more complex
- Optimized code is often less generic
However, having fast code is often necessary for statistical computing, so we will spend some time learning how to make code run faster. To do so, we need to understand why our code is slow. Code can be slow because of different resource limitations:
- CPU-bound - CPU is working flat out
- Memory-bound - out of RAM, swapping to hard disk
- IO-bound - lots of data transfer to and from the hard disk
- Network-bound - CPU is idle, waiting for data to arrive over the network or from memory ("starvation")
Different bottlenecks may require different approaches. However, there is a natural order to making code fast:
1. Cheat
  - Use a better machine (e.g. if RAM is limiting, buy more RAM)
  - Solve a simpler problem (e.g. will a subsample of the data suffice?)
  - Solve a different problem
2. Find out what is slowing down the code (profiling; see the sketch after this list)
  - Using timeit
  - Using cProfile
  - Using memory_profiler
3. Use better algorithms and data structures
4. Off-load heavy computations to numpy/scipy
5. Use compiled code written in another language
  - Calling code written in C (ctypes, cython)
  - Calling code written in Fortran (f2py)
  - Calling code written in Julia (pyjulia)
6. Convert Python code to compiled code
  - Using numexpr
  - Using numba
  - Using cython
7. Write parallel programs
  - Amdahl's and Gustafson's laws
  - Embarrassingly parallel problems
  - Problems requiring communication and synchronization
8. Execute in parallel
  - On multi-core machines
  - On multiple machines
  - On GPUs
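Before optimizing anything, it helps to measure where the time actually goes. Below is a minimal profiling sketch using the standard-library cProfile module; the toy_workload function is purely hypothetical, a stand-in for whatever code you want to profile.
import cProfile
import pstats
import io

def toy_workload(n=200_000):
    # Hypothetical stand-in for the code you want to profile
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

pr = cProfile.Profile()
pr.enable()
toy_workload()
pr.disable()

# Print the five most expensive entries, sorted by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())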
This notebook will focus on items 4 and 6 above. We will use the example of calculating the pairwise Euclidean distances between all points in a set. Examples are adapted from http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html.
%matplotlib inline
%precision 2
import numpy as np
import matplotlib.pyplot as plt
import numexpr as ne
from numba import jit
xs = np.random.random((1000, 3))
xs.shape
(1000, 3)
Python¶
This is the pure Python version of the algorithm, used as a baseline for comparison.
def pdist_python(xs):
    n, p = xs.shape
    # np.float is deprecated in modern NumPy; use np.float64 explicitly
    D = np.empty((n, n), np.float64)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                s += tmp * tmp
            D[i, j] = s ** 0.5
    return D
%timeit -n 10 pdist_python(xs)
10 loops, best of 3: 3.17 s per loop
Numpy¶
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
Library documentation: http://www.numpy.org/
def pdist_numpy(xs):
return np.sqrt(((xs[:,None,:] - xs)**2).sum(-1))
%timeit -n 100 pdist_numpy(xs)
100 loops, best of 3: 50.3 ms per loop
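The one-liner above relies on broadcasting: xs[:, None, :] has shape (1000, 1, 3), subtracting xs (shape (1000, 3)) broadcasts to a (1000, 1000, 3) array of coordinate differences, and summing over the last axis gives the (1000, 1000) matrix of squared distances. A quick sketch to confirm the shapes and check the result against the pure Python baseline defined above (assuming the cells above have been run):
diff = xs[:, None, :] - xs            # (1000, 1, 3) - (1000, 3) broadcasts to (1000, 1000, 3)
D = np.sqrt((diff ** 2).sum(-1))      # (1000, 1000) distance matrix
print(diff.shape, D.shape)
print(np.allclose(D, pdist_python(xs)))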
Numexpr¶
Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.
In addition, its multi-threaded capabilities can make use of all your cores, which may accelerate computations, especially those that are not memory-bound (e.g. those using transcendental functions).
Library documentation: https://github.com/pydata/numexpr
def pdist_numexpr(xs):
a = xs[:, np.newaxis, :]
return np.sqrt(ne.evaluate('sum((a-xs)**2, axis=2)'))
%timeit -n 100 pdist_numexpr(xs)
100 loops, best of 3: 19.4 ms per loop
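Numexpr picks a thread count automatically, but it can also be set explicitly to see how much of the speed-up comes from multi-threading. A small sketch, assuming the ne import from the setup cell (the thread counts here are illustrative, not recommendations):
ne.set_num_threads(1)                          # force single-threaded evaluation; returns the previous setting
%timeit -n 10 pdist_numexpr(xs)
ne.set_num_threads(ne.detect_number_of_cores())  # restore one thread per detected core
%timeit -n 10 pdist_numexpr(xs)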
Numba¶
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
Library documentation: http://numba.pydata.org/
pdist_numba = jit(pdist_python)
%timeit -n 100 pdist_numba(xs)
100 loops, best of 3: 11.1 ms per loop
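Above, jit is applied as a plain function call to the existing pure Python implementation; the more common pattern is to use it as a decorator. Note that the first call triggers compilation, so warm the function up before timing it. A sketch, reusing the setup-cell imports (the function name is just illustrative; nopython=True asks Numba to fail loudly rather than fall back to slower object mode):
@jit(nopython=True)
def pdist_numba_decorated(xs):
    n, p = xs.shape
    D = np.empty((n, n), np.float64)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                s += tmp * tmp
            D[i, j] = s ** 0.5
    return D

pdist_numba_decorated(xs)    # first call compiles; later calls run the cached machine code
%timeit -n 100 pdist_numba_decorated(xs)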
Cython¶
Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It makes writing C extensions for Python as easy as Python itself. Cython gives you the combined power of Python and C to let you:
- write Python code that calls back and forth from and to C or C++ code natively at any point
- easily tune readable Python code into plain C performance by adding static type declarations
- use combined source code level debugging to find bugs in your Python, Cython and C code
- interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays
- quickly build your applications within the large, mature and widely used CPython ecosystem
- integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications
Library documentation: http://cython.org/
%load_ext Cython
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pdist_cython(double[:, ::1] xs):
    cdef int n = xs.shape[0]
    cdef int p = xs.shape[1]
    cdef int i, j, k
    cdef double tmp, d
    # np.float is deprecated; allocate a float64 buffer and view it as a typed memoryview
    cdef double[:, ::1] D = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
%timeit -n 100 pdist_cython(xs)
100 loops, best of 3: 11 ms per loop
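As a quick sanity check (assuming all of the cells above have been run), the compiled versions should agree with the NumPy result to floating-point tolerance:
print(np.allclose(pdist_cython(xs), pdist_numpy(xs)))
print(np.allclose(pdist_numba(xs), pdist_numpy(xs)))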
Scipy¶
Scipy has an optimized version of this particular function already built in. It exploits symmetry in the problem that we do not take advantage of in the "naive" implementations above.
Library documentation: http://www.scipy.org/
from scipy.spatial.distance import pdist as pdist_scipy
%timeit -n 100 pdist_scipy(xs)
100 loops, best of 3: 5.79 ms per loop
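Note that scipy's pdist returns a condensed distance matrix: a 1-D array holding only the n*(n-1)/2 upper-triangle entries, which is part of how it exploits the symmetry. To compare its output directly with the square matrices produced above, convert it with squareform:
from scipy.spatial.distance import squareform

d_condensed = pdist_scipy(xs)         # shape (499500,) for n = 1000
D_square = squareform(d_condensed)    # full symmetric (1000, 1000) matrix with a zero diagonal
print(d_condensed.shape, D_square.shape)
print(np.allclose(D_square, pdist_numpy(xs)))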
Summary¶
Here are all of the implementations together.
print('Python')
%timeit -n 10 pdist_python(xs)
print('Numpy')
%timeit -n 100 pdist_numpy(xs)
print('Numexpr')
%timeit -n 100 pdist_numexpr(xs)
print('Numba')
%timeit -n 100 pdist_numba(xs)
print('Cython')
%timeit -n 100 pdist_cython(xs)
print('Scipy')
%timeit -n 100 pdist_scipy(xs)
Python
10 loops, best of 3: 3.15 s per loop
Numpy
100 loops, best of 3: 49.5 ms per loop
Numexpr
100 loops, best of 3: 19.3 ms per loop
Numba
100 loops, best of 3: 11 ms per loop
Cython
100 loops, best of 3: 11 ms per loop
Scipy
100 loops, best of 3: 5.68 ms per loop
Some observations:
- Pure Python is much, much slower than all of the other methods (close to a 1000x difference!)
- Simply using NumPy where possible results in a huge speed-up
- Numba is surprisingly effective given how easy it is to use, on par with the compiled C code generated by Cython
- Algorithm-level optimizations (such as the symmetry exploited by the Scipy implementation) can easily trump other methods
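To visualize the gap (and put the matplotlib import from the setup cell to use), the timings can also be collected programmatically with the timeit module and plotted on a log scale. A minimal sketch, assuming xs and all of the pdist_* functions above are defined; the repeat count is kept small because the pure Python version alone takes seconds per call:
import timeit

funcs = {
    'python': pdist_python,
    'numpy': pdist_numpy,
    'numexpr': pdist_numexpr,
    'numba': pdist_numba,
    'cython': pdist_cython,
    'scipy': pdist_scipy,
}
# Average seconds per call for each implementation
times = {name: timeit.timeit(lambda f=f: f(xs), number=3) / 3 for name, f in funcs.items()}

plt.bar(range(len(times)), list(times.values()))
plt.xticks(range(len(times)), list(times.keys()), rotation=45)
plt.yscale('log')                 # log scale: pure Python is roughly 1000x slower than Scipy
plt.ylabel('seconds per call')
plt.show()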