
I have a frequently occurring problem: two arrays of the same length, one with values and one with dynamic decay factors, from which I wish to calculate a vector of the decayed cumulative sum at each position. Using a Python loop to express the desired recurrence, we have the following:

import numpy as np

# x: values, d: decay factors (arrays of the same length)
c = np.empty_like(x)
c[0] = x[0]
for i in range(1, len(x)):
    c[i] = c[i-1] * d[i] + x[i]

The Python code is very clear and readable, but it slows things down significantly. I can get around this by using Numba to JIT-compile it or Cython to precompile it. If the decay factors were a fixed constant (which is not the case), I could have used SciPy's signal module and applied an lfilter (see https://stackoverflow.com/a/47971187/1017986).
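For reference, the fixed-decay case from that answer would look roughly like this (d_const is a hypothetical constant decay factor, which my d is not):

import numpy as np
from scipy.signal import lfilter

d_const = 0.9                            # hypothetical fixed decay factor
x = np.array([2.0, 3.2, 4.0, 5.7])
c = lfilter([1.0], [1.0, -d_const], x)   # c[i] = d_const * c[i-1] + x[i]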

Is there a more "Numpythonic" way to express this without sacrificing clarity or efficiency?

  • The first thing that comes to mind is to pull the x[i] additions out of the loop and add the whole array afterwards with c[1:] += x[1:]. That only addresses part of the problem, though. Commented Jan 3, 2024 at 15:40
  • This problem is serial in nature. Some ufuncs have an accumulate method, which is used in functions like cumsum, but it is hard to generalize this or to use the usual parallel-oriented NumPy building blocks. So Numba/Cython is the best way to get speed, even if it isn't as readable. Commented Jan 3, 2024 at 16:48
  • This loop has a long dependency chain of operations, and you want to save every value along the chain. As a result, executing this chain of operations is already algorithmically optimal (assuming d[i] can take any possible value). NumPy mainly supports basic or parallel-like operations (the same operation applied to all items). A parallel-like operation is fundamentally less efficient here, since it would mean doing more work. That being said, accessing NumPy items from CPython is very inefficient; Numba or Cython can indeed fix that (as can any module providing exactly this vectorized function). Commented Jan 3, 2024 at 16:49
  • You can see Numba and Cython as a way to extend NumPy's vectorized functions. Since you say that this is a "frequently occurring problem", you can write the function once and reuse it many times. You could even put it in a precompiled module (and share it if this is not the only useful function). Note that the readability overhead of Numba is just one line (@nb.njit(...)). Commented Jan 3, 2024 at 16:53
  • Yes, it is frequently occurring because I have reduced a lot of things down to it... :-) This is part of the reason I'm surprised not to find it (but I guess it falls through the cracks between NumPy and SciPy). Commented Jan 5, 2024 at 10:52

2 Answers


This is a self-answer to guide future visitors with concrete advice.

TL;DR: prefer Numba, avoid pure Python.

It is possible to get a numerically stable version of Riccardo Bucco's answer by performing the calculations in log-domain. Assuming d[0] == 1, the implementation looks like this:

def f_numpy_stable(x, d):
    p = np.cumsum(np.log(d))
    return np.exp(p + np.logaddexp.accumulate(np.log(x) - p))

Comparing it against the reference implementation in pure Python with random inputs gives outputs that pass an np.allclose test, which I consider to be stable enough. Intuitively, such complicated calculations should be slower than a simple loop that does not involve conversions to and from the log domain. Let's test that intuition.
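For concreteness, a check along these lines is what I mean by passing np.allclose (the inputs are illustrative; note that both x and d have to be strictly positive, since both go through np.log):

import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000) + 0.1          # strictly positive values
d = 0.1 + 0.9 * rng.random(1_000)    # decay factors in (0.1, 1)
d[0] = 1.0                           # assumed above

# reference: the plain Python recurrence from the question
ref = np.empty_like(x)
ref[0] = x[0]
for i in range(1, len(x)):
    ref[i] = ref[i-1] * d[i] + x[i]

print(np.allclose(f_numpy_stable(x, d), ref))  # should print True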

Collating the above with the question, the comments, and the other answer (thank you, Riccardo Bucco), there are five solutions:

  1. Native Python (slow, readable)
  2. JIT-compile with Numba (fast, readable)
  3. Precompile with Cython (fast)
  4. Decompose into cumulative product and sum (pure NumPy, potentially numerically unstable)
  5. Decompose but calculate in log space (pure NumPy)

I coded up and tested all five with fairly large arrays (10k–100M entries). The result (on my Intel MacBook Pro) was that Numba is fastest, followed by Cython (up to 50% slower than Numba, less for larger data sets), followed by Riccardo Bucco's pure NumPy solution (about 3x slower than Numba), followed (surprisingly) by the numerically stable version of the pure NumPy solution (less than 10x slower than the unstable version), with the pure Python implementation being by far the slowest. Here is the full table (all times in seconds):

Array length    Python      Stable NumPy   NumPy       Cython      Numba
10,000          0.003'840   0.000'546      0.000'062   0.000'030   0.000'019
100,000         0.039'600   0.005'550      0.000'545   0.000'296   0.000'192
1,000,000       0.401       0.056'500      0.009'880   0.003'860   0.002'550
10,000,000      3.850       0.590          0.092'600   0.040'300   0.031'900
100,000,000     40.600      7.020          1.660       0.667       0.551

Here is the code I used (the Cython version follows in a separate cell). I made sure to trigger Numba's JIT compilation with a separate call before running timeit; a minimal timing harness is sketched after the Cython cell.

import numba
import numpy as np

def f_python(x, d):
    result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1,x.shape[0]):
        result[i] = result[i-1] * d[i] + x[i]
    return result

@numba.jit
def f_numba(x, d):
    result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1,x.shape[0]):
        result[i] = result[i-1] * d[i] + x[i]
    return result

def f_numpy(x, d):
    result = np.cumprod(d)
    return result * np.cumsum(x / result)

def f_numpy_stable(x, d):
    p = np.cumsum(np.log(d))
    return np.exp(p + np.logaddexp.accumulate(np.log(x) - p))

With the Cython implementation being compiled in a separate notebook cell:

%%cython
import numpy as np
cimport numpy as np

cpdef np.ndarray[np.float64_t, ndim=1] f_cython(np.ndarray[np.float64_t, ndim=1] x, np.ndarray[np.float64_t, ndim=1] d):
    cdef:
        int i = 0
        int N = x.shape[0]
        np.ndarray[np.float64_t, ndim=1] result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1, N):
        result[i] = result[i-1] * d[i] + x[i]
    return result
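
For completeness, a minimal timing harness along these lines reproduces the comparison (the seed, array size, and repetition count here are illustrative, not my exact benchmark script):

import timeit

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.random(n) + 0.1          # strictly positive so the log-domain variant is defined
d = 0.1 + 0.9 * rng.random(n)    # decay factors in (0.1, 1)
d[0] = 1.0                       # assumed by the decomposition-based variants

f_numba(x, d)  # warm-up call so JIT compilation is not included in the timings

# Note: f_numpy underflows for inputs like these (exactly the instability discussed
# above); the timings are still meaningful, but expect RuntimeWarnings.
for f in (f_python, f_numpy_stable, f_numpy, f_cython, f_numba):
    t = timeit.timeit(lambda: f(x, d), number=10) / 10
    print(f"{f.__name__:>15}: {t:.6f} s")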

I was a bit surprised that Numba did better than Cython. Given that the decomposition into separate cumulative product and sum is either numerically unstable or fairly slow (although still faster than pure Python), my advice would definitely be to avoid it (especially for large data sets) and go with Cython or Numba. Given that the easier solution of just importing numba and adding a decorator is also the faster one, it seems like a slam-dunk choice to me.

Happy to be shown all the ways I could improve the Cython implementation! :-)


1 Comment

amazing, thanks a lot for sharing your experiments!!

Mathematically speaking, this is a first-order non-homogeneous recurrence relation with variable coefficients.

Please note that the coefficients (in your case, the values in d[1:]) must be non-zero.

Here is a way to solve your recurrence relation using NumPy functions. Note that d[0] is supposed to be equal to 1 (your algorithm is not using it anyway). In this solution, as you can see, the values in c do not depend on previous values in c. I.e., c[i] does not depend on c[i-1].

g = np.insert(x[1:], 0, 0)

pd = np.cumprod(d)
c = pd * (x[0] + np.cumsum(g / pd))
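
To sketch why this works (the full details are on the Wikipedia page linked at the end): with $P_i = \prod_{j \le i} d_j$ (and $d_0 = 1$, so $P_0 = 1$), dividing the recurrence by $P_i$ turns it into a plain cumulative sum:

$$\frac{c_i}{P_i} = \frac{c_{i-1}}{P_{i-1}} + \frac{x_i}{P_i} \quad\Longrightarrow\quad c_i = P_i \left( x_0 + \sum_{j=1}^{i} \frac{x_j}{P_j} \right),$$

which is exactly pd * (x[0] + np.cumsum(g / pd)).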

Let's make an example. Here I define two functions, one that computes the recurrence relation with your code, and one that uses NumPy:

def rec(d, x):
    c = [x[0]]
    for i in range(1, len(x)):
        c.append(d[i] * c[-1] + x[i])
    return np.array(c)

def num(d, x):
    g = np.insert(x[1:], 0, 0)
    pd = np.cumprod(d)
    return pd * (x[0] + np.cumsum(g / pd))

Here are some usage examples:

>>> rec(np.array([3, 4, 5.2, 6.1]), np.array([2, 3.2, 4, 5.7]))
array([  2.   ,  11.2  ,  62.24 , 385.364])
>>> num(np.array([3, 4, 5.2, 6.1]), np.array([2, 3.2, 4, 5.7]))
array([  2.   ,  11.2  ,  62.24 , 385.364])

>>> rec(np.array([3, 4.3, 5.2]), np.array([6.7, 7.1, 1.2]))
array([  6.7  ,  35.91 , 187.932])
>>> num(np.array([3, 4.3, 5.2]), np.array([6.7, 7.1, 1.2]))
array([  6.7  ,  35.91 , 187.932])

If you're interested in the mathematical details behind this answer (which are out of scope here), check this Wikipedia page.

Edit

This solution, while mathematically correct, might introduce numerical issues during the computation (because of the numpy.cumprod function), especially if d contains many values that are close to 0.
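
A hypothetical illustration of that failure mode (the numbers are purely for demonstration): with a long run of small decay factors, the cumulative product underflows to exactly zero, so the division produces inf and the final product nan, even though the recurrence itself stays perfectly well-behaved:

import numpy as np

d = np.full(1000, 0.1)
d[0] = 1.0
x = np.ones(1000)

pd = np.cumprod(d)                   # underflows to exactly 0.0 after a few hundred terms
g = np.insert(x[1:], 0, 0)
c = pd * (x[0] + np.cumsum(g / pd))  # expect divide-by-zero / invalid-value RuntimeWarnings

print(pd[-1])  # 0.0
print(c[-1])   # nan, while the true value of the recurrence is about 1.111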

2 Comments

Thanks! In my case, d[0] == 1 (by definition), and it looks like that simplifies to pd * np.cumsum(x / pd). Numerical stability may be an issue, though. Thanks also for the pointer to the math behind it.
Hey, your solution inspired me to try a variant in the log domain that is both faster than pure Python and numerically stable. Adding it in a self-answer, cheers!
