I have a function to implement that does not vectorize well, so I am implementing it in Cython to make the nested for-loops workable. My problem is of similar complexity to naive matrix-matrix multiplication, but at least one of the matrices is binary and often very sparse. The sparse matrix also changes throughout the program, so the simplest representation I could think of is a nested-list format: for each row, a list of its non-zero column indices.
I suspect my slowdown is due to the use of Python lists, but since I have to remove from and append to the inner lists, other formats (namely CSR) seem less suitable. I've looked at other questions here regarding sparse matrices in Cython, and none seem to need the remove-and-append pattern that is essential for my application.
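For concreteness, the format I mean is one list of non-zero column indices per row; a toy example (not my real data):

import numpy as np

row_dense = np.array([0, 1, 0, 0, 1])
row_sparse = np.where(row_dense)[0].tolist()   # [1, 4]: indices of the non-zeros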
Here are the sparse and dense functions:
import numpy as np
cimport numpy as np
from libc.math cimport exp
def spgreedy(np.ndarray[double, ndim=2] J,
             np.ndarray[double, ndim=1] h,
             list[list[int]] S,
             double temp,
             ):
    cdef int n, d
    n = len(S)
    d = J.shape[0]
    cdef int j, k
    cdef double dot, curr, prob
    for s in S:
        for j in range(d):
            if j in s:          # O(len(s)) membership test on a Python list
                s.remove(j)     # O(len(s)) removal
            dot = h[j]
            for k in s:         # iterate only over the non-zero indices
                dot += J[j, k]
            curr = dot / temp
            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))
            if np.random.rand() < prob:
                s.append(j)
    return S
def greedy(np.ndarray[double, ndim=2] J,
           np.ndarray[double, ndim=1] h,
           np.ndarray[int, ndim=2] S,
           double temp,
           ):
    cdef int n, d
    n = len(S)
    d = J.shape[0]
    cdef int i, j, k
    cdef double dot, curr, prob
    for i in range(n):
        for j in range(d):
            dot = h[j]
            for k in range(d):      # full dense dot product, zeros included
                dot += J[j, k] * S[i, k]
            curr = dot / temp
            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))
            S[i, j] = 1 * (np.random.rand() < prob)
    return S
and when I compare the two on some random sparse data:

from time import time

d = 1000
n = 50
J = 1.0 * np.random.choice([-1, 0, 1], size=(d, d), p=[0.05, 0.9, 0.05])
J = np.triu(J, 1) + np.triu(J, 1).T   # symmetrize, zero diagonal
h = -np.ones(d)
S = np.random.choice([0, 1], size=(n, d), p=[0.95, 0.05]).astype(np.intc)  # intc matches the C int buffer in greedy
Slist = [np.where(s)[0].tolist() for s in S]

t0 = time()
_ = greedy(J, h, S, 1e-5)
print(time() - t0)

t0 = time()
_ = spgreedy(J, h, Slist, 1e-5)
print(time() - t0)
I find that the dense version is more than four times faster:
dense: 0.04369068145751953
sparse: 0.18976712226867676
I suspect that the discrepancy comes from the use of Python lists, but I also feel bad cycling through that inner for k in range(d) loop when the vast majority of entries contribute nothing to the sum! Maybe there's no way to take advantage of the sparsity given how dynamic the structure is, but I thought I'd ask in case there is another type that is better suited.

A Python list is just the wrong structure here.
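Every j in s and s.remove(j) is a linear scan over boxed Python ints, and the for k in s loop pays Python iteration overhead on every element, so the sparse version never runs at C speed despite being compiled. One way to keep the remove-and-append pattern but move it to C level is a fixed-size index buffer per row plus a position table (a "sparse set"), which makes membership, removal, and append all O(1). Below is a minimal sketch of that idea; the names spgreedy_arr, idx, pos, and cnt are mine, not from the question, and I've kept the Python-level np.random.rand() call for parity with the original, though it will become the next bottleneck once the list overhead is gone:

import numpy as np
from libc.math cimport exp

def spgreedy_arr(double[:, ::1] J,
                 double[::1] h,
                 int[:, ::1] idx,   # idx[i, :cnt[i]] holds the non-zero columns of row i
                 int[:, ::1] pos,   # pos[i, j] = position of j in idx[i], or -1 if absent
                 int[::1] cnt,      # number of non-zeros in each row
                 double temp,
                 ):
    cdef int n = idx.shape[0]
    cdef int d = J.shape[0]
    cdef int i, j, k, p, last
    cdef double dot, curr, prob
    for i in range(n):
        for j in range(d):
            p = pos[i, j]
            if p >= 0:
                # O(1) swap-remove: overwrite slot p with the last active element
                last = cnt[i] - 1
                idx[i, p] = idx[i, last]
                pos[i, idx[i, p]] = p
                pos[i, j] = -1
                cnt[i] = last
            dot = h[j]
            for k in range(cnt[i]):   # touch only the non-zeros, all raw C ints
                dot += J[j, idx[i, k]]
            curr = dot / temp
            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))
            if np.random.rand() < prob:
                # O(1) append at the end of the active region
                pos[i, j] = cnt[i]
                idx[i, cnt[i]] = j
                cnt[i] += 1
    return idx, cnt

Building the buffers from your Slist would look something like this (again just a sketch):

idx = np.full((n, d), -1, dtype=np.intc)
pos = np.full((n, d), -1, dtype=np.intc)
cnt = np.zeros(n, dtype=np.intc)
for i, s in enumerate(Slist):
    for p, j in enumerate(s):
        idx[i, p] = j
        pos[i, j] = p
    cnt[i] = len(s)

The O(d) memory per row is the price for O(1) operations everywhere; if that is too much, dropping the pos table and finding p with a linear scan over the C int buffer would still be far cheaper than list.remove, which scans boxed Python objects.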