1

I have a function I need to implement which does not vectorize well, and so I am implementing it in Cython to make the nested for-loops workable. My problem is of a similar complexity as naive matrix-matrix multiplication, but at least one of the matrices is binary and often very sparse. The sparse matrix also changes throughout the program, so the simplest way I thought to implement it is with the nested list format (a list of all the non-zero indices).

I suspect my slowdown is due to the use of python lists, but since I have to remove and append to the inner lists I think this makes other formats (namely csr) less efficient. I've looked at other questions here regarding sparse matrices in Cython, and none seem to have the remove-and-append feature that is necessary for my application.

Here are the sparse and dense functions:

import numpy as np
cimport numpy as np
from libc.math cimport exp

def spgreedy(np.ndarray[double, ndim=2] J, 
             np.ndarray[double, ndim=1] h,
             list[list[int]] S,
             double temp,
             ):

    cdef int n, d 
    n = len(S)
    d = J.shape[0]

    cdef int j, k
    cdef double dot, curr, prob

    for s in S:
        for j in range(d):

            if j in s:
                s.remove(j)

            dot = h[j]
            for k in s:
                dot += J[j, k]

            curr = dot / temp

            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))

            if np.random.rand() < prob:
                s.append(j)

    return S

def greedy(np.ndarray[double, ndim=2] J, 
           np.ndarray[double, ndim=1] h,
           np.ndarray[int, ndim=2] S,
           double temp,
           ):

    cdef int n, d 
    n = len(S)
    d = J.shape[0]

    cdef int i, j, k
    cdef double dot, curr, prob

    for i in range(n):
        for j in range(d):

            dot = h[j]
            for k in range(d):
                dot += J[j, k] * S[i,k]

            curr = dot / temp

            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))

            S[i,j] = 1*(np.random.rand() < prob)

    return S

and when I compare the two on some random sparse data

d = 1000
n = 50

J = 1.0*np.random.choice([-1,0,1], size=(d,d), p=[0.05,0.9,0.05])
J = np.triu(J,1) + np.triu(J,1).T
h = -np.ones(d)

S = np.random.choice([0,1], size=(n, d), p=[0.95, 0.05])
Slist = [np.where(s)[0].tolist() for s in S]

t0 = time()
_ = greedy(J, h, S, 1e-5)
print(time() - t0)

t0 = time()
_ = spgreedy(J, h, Slist, 1e-5)
print(time() - t0)

I find that the dense version is almost an order of magnitude faster?

dense: 0.04369068145751953
sparse: 0.18976712226867676

I suspect that the discrepancy comes from the use of python lists, but I also feel kind of bad cycling through that inner for k in range(d) loop when the vast majority are contributing nothing to the sum! Maybe there's no way to take advantage of this given how dynamic it is, but I thought I'd ask in case there is another type which is better suited.

1
  • 1
    I don't think "remove and append" is likely to work hugely well with Python lists in general just because they're contiguous in memory so removing necessarily involves a lot of copying. I don't know what the right solution is but I think list is just the wrong structure here Commented May 22 at 19:27

2 Answers 2

1

Several issue could explain this performance gap.

First of all, remove is expensive because it has to travel all items of the list, remove the target item and then shift all the following ones. It has a complexity of O(len(s)) here. The thing is the order does not seems to be important here. Thus, you can use another data structure for fast insertion/deletion. For example, a hash-set can be used to insert/remove items in a (amortised) constant time. Even better: you can design your own custom fast hash-set specifically designed for this use-case. Indeed, you can use 2 flat arrays containing indices (sparse part, containing indices to items of the second arrays) and values (dense part, containing the actual items currently stored in s). The two arrays can be preallocated (since the size of s is bounded by d). The index array enables you to know whether the item exist in constant time (just check if index[i] is a valid index). The value array enables you to iterate over the values of s efficiently. You can add a new item i in the data structure by first checking index[i] (technically you do not need that in this case since the item is supposed not to be in the data structure yet), then add an entry in the end of value array (you need to store the position of the last item with a preallocated array -- e.g. in a variable called lastEntryIndex), store the position of the added entry in index[i]. To remove an item, you need to check index[i], swap values[index[i]] with value[lastEntryIndex] and update index[lastEntryIndex] to the value index[i] before resetting index[i] to an invalid index. All of this is done in constant time. Moreover, iterating over the values can then be fast (as opposed to some hash-set implementations). Not to mention arrays are generally faster than lists (if there is no bound checking nor wrap-around enabled).

Moreover, the innermost loop of greedy can benefit from SIMD units of the CPU (depending on how you compile the program -- which is not mentioned in the question) while the one of sgreedy cannot. Indeed, the former can read contiguous arrays so a C compiler can generate vector instruction reading fixed-size blocks from memory multiply/add blocks together, before adding all items of the block together. On mainstream compiler targetting x86-64 CPUs, the default SIMD size is 128-bit (SSE2) so 2 double-precision numbers per vector register. Modern x86-64 CPU typically have 2 128-bit SIMD units. Most of them nowadays also have at least 2 256-bit SIMD units (which can be used with the compilation flag -mavx in GCC/Clang). This is not really possible to benefit from SIMD units in sgreedy because of the indexing. Technically, it could be possible with SIMD gather instructions on relatively-recent x86-64 CPUs, but they are not very efficient on most CPUs anyway (and the instruction set is certainly also not even enabled by default). This means the greedy innermost loop can be executed much more efficiently than the one in sgreedy. There is not much to do about this (though using gather instructions manually may help a bit to improve performance). This is the price to pay for sparse operations.

Finally, regarding the actual assembly code generated by the underlying C compiler, the innermost loop of sgreedy might be bound by the latency of addition operations (forming a long chain of dependent instructions). This typically happens if the compiler does not unroll the loop at all. Manual unrolling can be used to break the dependency chain and mitigate this latency issue. Here is an example:

i = 0
s_len = len(s)

# Compute items 2 per 2
while i + 1 < s_len:
    dot1 += J[j, s[i]]
    dot2 += J[j, s[i+1]]
    i += 2

# Last item
if i < s_len:
    dot1 += J[j, s[i]]

dot = dot1 + dot2

Be careful to the type if you use this code in Cython (e.g. i, s_len, dot1, etc. must be typed for sake of performance). You could compute items 4 by 4 to possibly get better performance (this is dependent of the target CPU and compilation flags). You should also disable wrap-around and bound checking to avoid expensive checks in this loop (at the expense of a possibly less-safe execution).

Moreover, note that the outermost loop can certainly benefit from multithreading (but be careful about possible race conditions).

Last but not least, please note that np.random.rand() can take a significant time in Cython (because it is not a native function but a pure-Python one) compared to a native random function. That being said, the target random function must be guaranteed to be thread-safe so to avoid race conditions. rand_r is an example of thread-safe native function. On my machine (AMD 5945WX CPU), np.random.rand() takes 210 ns/call while rand_r should be at least 10 times faster.

Sign up to request clarification or add additional context in comments.

1 Comment

thanks for this very detailed response! I posted a full response with a new implementation based on my understanding of your suggestions, and it already has helped a fair amount. I couldn't find documentation for rand_r so I tried implementing one myself, which seems faster but not quite 10 times.
1

Though I would update with my current best solution based on the responses. Thanks to @Jérôme Richard for the detailed analysis, I've re-implemented in a way that I think incorporates the most impactful suggestions:

cimport numpy as cnp

ctypedef cnp.npy_uint32 UINT32_t

from libc.math cimport exp
from libc.stdlib cimport malloc, free

cdef inline UINT32_t DEFAULT_SEED = 1
cdef enum:
    # Max value for our rand_r replacement (near the bottom).
    # We don't use RAND_MAX because it's different across platforms and
    # particularly tiny on Windows/MSVC.
    # It corresponds to the maximum representable value for
    # 32-bit signed integers (i.e. 2^31 - 1).
    RAND_R_MAX = 2147483647

##################################################################

cdef class FiniteSet:
    """
    Custom unordered set with O(1) add/remove methods
    which takes advantage of the fixed maximum set size
    """

    cdef:
        int* index     # Index of element in value array
        int* value     # Array of contained element
        int size       # Current number of elements
        
    def __cinit__(self, cnp.ndarray[int] indices, int maximum):
        self.size = len(indices)
        
        # Allocate arrays
        self.index = <int*>malloc(maximum * sizeof(int))
        self.value = <int*>malloc(maximum * sizeof(int))
        
        if not self.index or not self.value:
            raise MemoryError("Could not allocate memory")
        
        # Initialize index array to -1 (invalid)
        cdef int i
        for i in range(maximum):
            self.index[i] = -1
            
        # Set up initial values
        for i in range(self.size):
            self.value[i] = indices[i]
            self.index[indices[i]] = i 

    def __dealloc__(self):
        """Cleanup C memory"""
        if self.index:
            free(self.index)
        if self.value:
            free(self.value)

    cdef inline bint contains(self, int i) nogil:
        return self.index[i] >= 0 

    cdef inline void add(self, int i) nogil:
        """
        Add element to set
        """
        # if not self.contains(i):
        if self.index[i] < 0:
            self.value[self.size] = i
            self.index[i] = self.size
            ## Increase
            self.size += 1

    cdef inline void remove(self, int i) nogil:
        """
        Remove element from set
        """
        # if self.contains(i):
        if self.index[i] >= 0:
            self.value[self.index[i]] = self.value[self.size - 1]
            self.index[self.value[self.size - 1]] = self.index[i]
            self.index[i] = -1
            self.size -= 1

cdef class XORRNG:
    """
    Custom XORRNG sampler that I copied from the scikit-learn source code
    """

    cdef UINT32_t state

    def __cinit__(self, UINT32_t seed=DEFAULT_SEED):
        if (seed == 0):
            seed = DEFAULT_SEED
        self.state = seed

    cdef inline double sample(self) nogil:
        """Generate a pseudo-random np.uint32 from a np.uint32 seed"""
        self.state ^= <UINT32_t>(self.state << 13)
        self.state ^= <UINT32_t>(self.state >> 17)
        self.state ^= <UINT32_t>(self.state << 5)

        # Use the modulo to make sure that we don't return a values greater than the
        # maximum representable value for signed 32bit integers (i.e. 2^31 - 1).
        # Note that the parenthesis are needed to avoid overflow: here
        # RAND_R_MAX is cast to UINT32_t before 1 is added.
        return <double>(self.state % ((<UINT32_t>RAND_R_MAX) + 1))/RAND_R_MAX

def xorrng_py(int seed=1):
    cdef XORRNG prng = XORRNG(seed)
    return prng.sample()

###############################################################

cdef XORRNG prng = XORRNG()

def spgreedy(cnp.ndarray[double, ndim=2] J, 
             cnp.ndarray[double, ndim=1] h,
             FiniteSet s,
             double temp,
             ):

    cdef int d
    d = J.shape[0]

    cdef int j, k
    cdef double dot, curr, prob

    for j in range(d):

        s.remove(j)

        dot = h[j]
        for k in range(s.size):
            dot += J[j, s.value[k]]

        curr = dot / temp

        if curr < -100:
            prob = 0.0
        elif curr > 100:
            prob = 1.0
        else:
            prob = 1.0 / (1.0 + exp(-curr))

        if (prng.sample() < prob):
            s.add(j)

    return s

which is now better than the dense algorithm (in the problems I've tested on). It also takes a similar amount of time as numpy's matrix multiplication (just doing J@s+h) which is both reassuring, since that code is very optimized I assume, but a bit disappointing because I don't get much benefit from the sparsity after all. I also don't know much about Cython so it's very possible I did something silly in this implementation as well!

average run time of each function on synthetic data as in the original question

5 Comments

Good. Here are few minor improvements. Divisions and modulus are a bit expensive. You can pre-compute the inverse of temp and RAND_R_MAX so to use (faster) multiplications instead; Since prob is a probability, it certainly does not need to be precise so you can use single precision instead which may be slightly faster (especially using expf instead of exp).
You can use a view of J[j, :] so to be sure there is no 2D indexing in the innermost loop. A good compiler should already optimise the later if bound-checking and wrap-around are disabled but they seems enabled (they are by default). You can also disable them (at least for this function) for a minor performance boost.
You can compact the index/value arrays using uint16_t if you know the values are smaller than 65536. This will only speed up the computation if the arrays are quite big (i.e. >2000) so they may not fit in cache. I guess this is fine here so it likely not worth it.
Last but not least: multithreading is missing here and it can certainly make this significantly faster. But this is not so easy (especially in Cython) to do since you need to take care of using thread-local data in order to avoid race condition on shared data structures. Maybe this is done outside of the function and not shown?
Oh and I strongly advise you to check the loop unrolling optimisation. You might get a significant speed up now since J[j, s.value[k]] is quite fast now. This is probably the most important optimisation to perform with multithreading. Possibly a twice faster code if you are lucky. The best would be to profile the code so to know the bottleneck though.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.