
I have a frequently occurring problem: two arrays of the same length, one with values and one with dynamic decay factors, from which I wish to calculate a vector of the decayed cumulative sum at each position. Using a Python loop to express the desired recurrence, we have the following:

import numpy as np

# x: values, d: decay factors (arrays of the same length)
c = np.empty_like(x)
c[0] = x[0]
for i in range(1, len(x)):
    c[i] = c[i-1] * d[i] + x[i]

The Python code is very clear and readable, but it slows things down significantly. I can get around this by using Numba to JIT-compile it or Cython to precompile it. If the decay factors were a fixed constant (which is not the case), I could have used SciPy's signal module and applied an lfilter (see https://stackoverflow.com/a/47971187/1017986).
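For reference, the fixed-decay case from that answer would look roughly like this (d_const is a hypothetical constant decay factor, which my d is not):

import numpy as np
from scipy.signal import lfilter

d_const = 0.9                            # hypothetical fixed decay factor
x = np.array([2.0, 3.2, 4.0, 5.7])
c = lfilter([1.0], [1.0, -d_const], x)   # c[i] = d_const * c[i-1] + x[i]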

Is there a more "Numpythonic" way to express this without sacrificing clarity or efficiency?

  • The first thing that comes to mind is to pull the x[i] additions out of the loop and add the whole array afterwards with c[1:] += x[1:]. That only addresses part of the problem, though. Commented Jan 3, 2024 at 15:40
  • This problem is serial in nature. Some ufuncs have an accumulate method, which is used in functions like cumsum, but it is hard to generalize this or to use the usual parallel-oriented NumPy building blocks. So Numba/Cython is the best way to get speed, even if it isn't as readable. Commented Jan 3, 2024 at 16:48
  • This loop has a long dependency chain of operations, and you want to save every value along the chain. As a result, executing this chain of operations is already algorithmically optimal (assuming d[i] can take any possible value). NumPy mainly supports basic or parallel-like operations (the same operation applied to all items). A parallel-like operation is fundamentally less efficient here, since it would mean doing more work. That being said, accessing NumPy items from CPython is very inefficient; Numba or Cython can indeed fix that (as can any module providing exactly this vectorized function). Commented Jan 3, 2024 at 16:49
  • You can see Numba and Cython as a way to extend NumPy's vectorized functions. Since you say that this is a "frequently occurring problem", you can write the function once and reuse it many times. You could even put it in a precompiled module (and share it if this is not the only useful function). Note that the readability overhead of Numba is just one line (@nb.njit(...)). Commented Jan 3, 2024 at 16:53
  • Yes, it is frequently occurring because I have reduced a lot of things down to it... :-) This is part of the reason I'm surprised not to find it (but I guess it falls through the cracks between NumPy and SciPy). Commented Jan 5, 2024 at 10:52

2 Answers


This is a self-answer to guide future visitors with concrete advice.

TL;DR: prefer Numba, avoid pure Python.

It is possible to get a numerically stable version of Riccardo Bucco's answer by performing the calculations in log-domain. Assuming d[0] == 1, the implementation looks like this:

def f_numpy_stable(x, d):
    p = np.cumsum(np.log(d))
    return np.exp(p + np.logaddexp.accumulate(np.log(x) - p))

Comparing it against the reference implementation in pure Python with random inputs gives outputs that pass an np.allclose test, which I consider to be stable enough. Intuitively, such complicated calculations should be slower than a simple loop that does not involve conversions to and from the log domain. Let's test that intuition.
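For concreteness, a check along these lines is what I mean by passing np.allclose (the inputs are illustrative; note that both x and d have to be strictly positive, since both go through np.log):

import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000) + 0.1          # strictly positive values
d = 0.1 + 0.9 * rng.random(1_000)    # decay factors in (0.1, 1)
d[0] = 1.0                           # assumed above

# reference: the plain Python recurrence from the question
ref = np.empty_like(x)
ref[0] = x[0]
for i in range(1, len(x)):
    ref[i] = ref[i-1] * d[i] + x[i]

print(np.allclose(f_numpy_stable(x, d), ref))  # should print True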

Collating the above with the question, the comments, and the other answer (thank you, Riccardo Bucco), there are five solutions:

  1. Native Python (slow, readable)
  2. JIT-compile with Numba (fast, readable)
  3. Precompile with Cython (fast)
  4. Decompose into cumulative product and sum (pure NumPy, potentially numerically unstable)
  5. Decompose but calculate in log space (pure NumPy)

I coded up and tested all five with fairly large arrays (10k–100M entries). The result (on my Intel MacBook Pro) was that Numba is fastest, followed by Cython (up to 50% slower than Numba, less for larger data sets), followed by Riccardo Bucco's pure NumPy solution (about 3x slower than Numba), followed (surprisingly) by the numerically stable version of the pure NumPy solution (less than 10x slower than the unstable version), with the pure Python implementation being by far the slowest. Here is the full table (all times in seconds):

Array length    Python      Stable NumPy   NumPy       Cython      Numba
10,000          0.003'840   0.000'546      0.000'062   0.000'030   0.000'019
100,000         0.039'600   0.005'550      0.000'545   0.000'296   0.000'192
1,000,000       0.401       0.056'500      0.009'880   0.003'860   0.002'550
10,000,000      3.850       0.590          0.092'600   0.040'300   0.031'900
100,000,000     40.600      7.020          1.660       0.667       0.551

Here is the code I used (the Cython version follows in a separate cell). I made sure to trigger Numba's JIT compilation with a separate call before running timeit; a minimal timing harness is sketched after the Cython cell.

import numba
import numpy as np

def f_python(x, d):
    result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1,x.shape[0]):
        result[i] = result[i-1] * d[i] + x[i]
    return result

@numba.jit
def f_numba(x, d):
    result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1,x.shape[0]):
        result[i] = result[i-1] * d[i] + x[i]
    return result

def f_numpy(x, d):
    result = np.cumprod(d)
    return result * np.cumsum(x / result)

def f_numpy_stable(x, d):
    p = np.cumsum(np.log(d))
    return np.exp(p + np.logaddexp.accumulate(np.log(x) - p))

With the Cython implementation being compiled in a separate notebook cell:

%%cython
import numpy as np
cimport numpy as np

cpdef np.ndarray[np.float64_t, ndim=1] f_cython(np.ndarray[np.float64_t, ndim=1] x, np.ndarray[np.float64_t, ndim=1] d):
    cdef:
        int i = 0
        int N = x.shape[0]
        np.ndarray[np.float64_t, ndim=1] result = np.empty_like(x)
    result[0] = x[0]
    for i in range(1, N):
        result[i] = result[i-1] * d[i] + x[i]
    return result
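
For completeness, a minimal timing harness along these lines reproduces the comparison (the seed, array size, and repetition count here are illustrative, not my exact benchmark script):

import timeit

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.random(n) + 0.1          # strictly positive so the log-domain variant is defined
d = 0.1 + 0.9 * rng.random(n)    # decay factors in (0.1, 1)
d[0] = 1.0                       # assumed by the decomposition-based variants

f_numba(x, d)  # warm-up call so JIT compilation is not included in the timings

# Note: f_numpy underflows for inputs like these (exactly the instability discussed
# above); the timings are still meaningful, but expect RuntimeWarnings.
for f in (f_python, f_numpy_stable, f_numpy, f_cython, f_numba):
    t = timeit.timeit(lambda: f(x, d), number=10) / 10
    print(f"{f.__name__:>15}: {t:.6f} s")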

I was a bit surprised that Numba did better than Cython. Given that the decomposition into separate cumulative product and sum is either numerically unstable or fairly slow (although still faster than pure Python), my advice would definitely be to avoid it (especially for large data sets) and go with Cython or Numba. Given that the easier solution of just importing numba and adding a decorator is also the faster one, it seems like a slam-dunk choice to me.

Happy to be shown all the ways I could improve the Cython implementation! :-)


1 Comment

amazing, thanks a lot for sharing your experiments!!

Mathematically speaking, this is a first-order non-homogeneous recurrence relation with variable coefficients.

Please note that the coefficients (in your case, the values in d[1:]) must be non-zero.

Here is a way to solve your recurrence relation using NumPy functions. Note that d[0] is supposed to be equal to 1 (your algorithm is not using it anyway). In this solution, as you can see, the values in c do not depend on previous values in c. I.e., c[i] does not depend on c[i-1].

g = np.insert(x[1:], 0, 0)

pd = np.cumprod(d)
c = pd * (x[0] + np.cumsum(g / pd))
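
To sketch why this works (the full details are on the Wikipedia page linked at the end): with $P_i = \prod_{j \le i} d_j$ (and $d_0 = 1$, so $P_0 = 1$), dividing the recurrence by $P_i$ turns it into a plain cumulative sum:

$$\frac{c_i}{P_i} = \frac{c_{i-1}}{P_{i-1}} + \frac{x_i}{P_i} \quad\Longrightarrow\quad c_i = P_i \left( x_0 + \sum_{j=1}^{i} \frac{x_j}{P_j} \right),$$

which is exactly pd * (x[0] + np.cumsum(g / pd)).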

Let's make an example. Here I define two functions, one that computes the recurrence relation with your code, and one that uses NumPy:

def rec(d, x):
    c = [x[0]]
    for i in range(1, len(x)):
        c.append(d[i] * c[-1] + x[i])
    return np.array(c)

def num(d, x):
    g = np.insert(x[1:], 0, 0)
    pd = np.cumprod(d)
    return pd * (x[0] + np.cumsum(g / pd))

Here are some usage examples:

>>> rec(np.array([3, 4, 5.2, 6.1]), np.array([2, 3.2, 4, 5.7]))
array([  2.   ,  11.2  ,  62.24 , 385.364])
>>> num(np.array([3, 4, 5.2, 6.1]), np.array([2, 3.2, 4, 5.7]))
array([  2.   ,  11.2  ,  62.24 , 385.364])

>>> rec(np.array([3, 4.3, 5.2]), np.array([6.7, 7.1, 1.2]))
array([  6.7  ,  35.91 , 187.932])
>>> num(np.array([3, 4.3, 5.2]), np.array([6.7, 7.1, 1.2]))
array([  6.7  ,  35.91 , 187.932])

If you're interested in the mathematical details behind this answer (which are out of scope here), check this Wikipedia page.

Edit

This solution, while mathematically correct, might introduce numerical issues during the computation (because of the numpy.cumprod function), especially if d contains many values that are close to 0.
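
A hypothetical illustration of that failure mode (the numbers are purely for demonstration): with a long run of small decay factors, the cumulative product underflows to exactly zero, so the division produces inf and the final product nan, even though the recurrence itself stays perfectly well-behaved:

import numpy as np

d = np.full(1000, 0.1)
d[0] = 1.0
x = np.ones(1000)

pd = np.cumprod(d)                   # underflows to exactly 0.0 after a few hundred terms
g = np.insert(x[1:], 0, 0)
c = pd * (x[0] + np.cumsum(g / pd))  # expect divide-by-zero / invalid-value RuntimeWarnings

print(pd[-1])  # 0.0
print(c[-1])   # nan, while the true value of the recurrence is about 1.111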

2 Comments

Thanks! In my case, d[0] == 1 (by definition), and it looks like that simplifies to pd * np.cumsum(x / pd). Numerical stability may be an issue, though. Thanks also for the pointer to the math behind it.
Hey, your solution inspired me to try a variant in the log domain that is both faster than pure Python and numerically stable. Adding it in a self-answer, cheers!
