
I learned scientific computing back when C and C++ were the thing, and I'm relatively new to Python. Back then I could estimate the algorithmic cost fairly easily because, even with the STL, I knew how the data was being handled (of course there was already low-level optimisation, so your code might do better than expected, but you still had a guideline).

Now that I use numpy heavily, I'm at a loss: I have never found information on the complexity or expected CPU cost of numpy functions.

This puzzles me when I need to choose between options, e.g. whether I should store some data representation or just recompute it each time, depending on how many calls I expect to make.

To give one concrete example: I have N vector arrays, which have no reason to be contiguous in memory, and I will need to use them M times for element-by-element multiplication, where the N arrays appear as a single concatenated vector. Depending on M and N, when should I store the concatenated array? (Assume the entire data set is very small compared to RAM.) Or is this so hard to know (portably, or even within a single environment) that I just shouldn't care?

  • How is this different from the same scenario in C++, using N vectors? — Commented May 29 at 17:52
  • Big parts of numpy are written in C and C++: github.com/numpy/numpy/tree/main/numpy/_core/src. You should assume that whatever algorithms you call or implement will use data layout and loops just like any other C code would. — Commented May 29 at 19:49

1 Answer


If you allocate a numpy array and don't make a non-contiguous view of it, then it will be contiguous in memory, even if it's multidimensional; see https://numpy.org/doc/stable/dev/internals.html . If you make a non-contiguous view, you are back to strided access, which is the same thing that happens in C when you address a linear array as foo[stride*z] with stride != 1. If you build your own non-contiguous structures in numpy, you get the same pointer chasing as in C.
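As a small illustration (not from the original post, array names are made up), the flags and strides of an array or view tell you which case you are in:

```python
import numpy as np

a = np.zeros((1000, 1000))        # freshly allocated: contiguous, row-major
print(a.flags['C_CONTIGUOUS'])    # True
print(a.strides)                  # (8000, 8) for 8-byte doubles

v = a[:, ::2]                     # a view that skips every other column
print(v.flags['C_CONTIGUOUS'])    # False -- strided access, no copy was made
print(v.strides)                  # (8000, 16)

c = np.ascontiguousarray(v)       # forces a contiguous copy if you need one
print(c.flags['C_CONTIGUOUS'])    # True
```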

So Q = A + B will be roughly 3N loads and stores (two loads and one store per element), in numpy just as in C. If A or B is more expensive to recompute than two loads from RAM (e.g. it needs multiple uncached loads itself), then storing it may be sensible. The same complexity assumptions as in C should typically hold for operations implemented as ufuncs: most of these are implemented in C/C++, as Wolfgang Bangerth said, so their per-element cost should be close to what you would get in C/C++. Operations like sorting have their complexity written in the docs and are also implemented in C/C++.
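To connect this to the store-or-recompute question: concatenation itself costs one copy of the whole data, so concatenating once and reusing the result saves roughly M-1 such copies over M multiplications. A rough way to measure it yourself (a sketch with made-up sizes and names, not part of the original benchmark):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# 50 separate arrays standing in for the N non-contiguous vectors
chunks = [rng.random(10_000) for _ in range(50)]
w = rng.random(50 * 10_000)          # the vector they get multiplied with

def recompute():
    # pays the full concatenation copy on every call
    return np.concatenate(chunks) * w

stored = np.concatenate(chunks)      # pay the copy once up front
def reuse():
    return stored * w

M = 100                              # number of multiplications
print("recompute each time:", timeit.timeit(recompute, number=M))
print("store and reuse:    ", timeit.timeit(reuse, number=M))
```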

Tangentially related: I have personally wondered whether Q = A + B + C + ... produces temporaries and is therefore possibly inferior to a loop over the summands, so I wrote a small test script for it (see the end of the post). Once the arrays are big enough, I typically see results like the following (statistics over 5 runs of 20 adds each):

              mean      stddev    min       max
Q=A+B         0.474155  0.007966  0.468307  0.485545
Q=A+B+C       0.671866  0.004024  0.667896  0.678112
Q=sum(A, B)   0.343913  0.002298  0.341384  0.346913
Q=sum(A,B,C)  0.553073  0.001132  0.551913  0.554523
Q=sum(4)      0.765597  0.001289  0.763745  0.767078

Here sum(...) is implemented by first copying A into Q and then adding the remaining operands as binary in-place operations, avoiding temporaries. This suggests to me that something besides adding is happening in Q = A + B + .... In any case, as you can see, there is a (roughly) linear relationship between the number of operands and how long the sum takes, regardless of how you compute it. As for the per-element cost: the numbers above are for 20 operations at a time, so the minimum of sum(A,B) comes to 0.0174085 s per vector add, while a STREAM ADD benchmark of the same size on the same machine gives 0.014576 s, which is not too far off.
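If the temporaries themselves are the concern, a common alternative (not part of the measurements above) is to use the out= argument that numpy ufuncs accept, which writes each partial result straight into the destination array:

```python
import numpy as np

A, B, C = (np.ones(1 << 23) for _ in range(3))
Q = np.empty_like(A)

np.add(A, B, out=Q)   # Q = A + B, written directly into Q, no temporary
np.add(Q, C, out=Q)   # Q += C in place
```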

Test code:

```python
import timeit
import numpy as np
from scipy import stats

def add2(Q, A, B):
    Q[:] = A[:] + B[:]

def add3(Q, A, B, C):
    Q[:] = A[:] + B[:] + C[:]

def addk(Q, *args):
    Q[:] = args[0]
    for arg in args[1:]:
        Q[:] += arg

outprec = 8

def pprint_header(prelen):
    cols = ["mean", "stddev", "min", "max"]
    flen = outprec + 1
    fmt = ("{:<%d}" % flen) * len(cols)
    print(" " * prelen, fmt.format(*cols))

def pprint(lab, r, prelen):
    sr = stats.describe(r)
    dats = [sr.mean, np.sqrt(sr.variance), sr.minmax[0], sr.minmax[1]]
    fmt = ("{:<%df} " % outprec) * len(dats)
    print(f"{lab:<{prelen}}", fmt.format(*dats))

N = 1 << 23  # sufficient to be memory-bound and not suffer from python-level inefficiencies, hopefully
Q = np.zeros(N)
A = np.ones(N)
B = np.ones(N)
C = np.ones(N)
D = np.ones(N)

tests = [lambda: add2(Q, A, B),
         lambda: add3(Q, A, B, C),
         lambda: addk(Q, A, B),
         lambda: addk(Q, A, B, C),
         lambda: addk(Q, A, B, C, D)]
names = ["Q=A+B", "Q=A+B+C", "Q=sum(A, B)", "Q=sum(A,B,C)", "Q=sum(4)"]
prelen = max(len(x) for x in names)

for _ in range(3):
    tests[-1]()  # warmup

pprint_header(prelen)
for test, name in zip(tests, names):
    r = timeit.repeat(test, number=20, repeat=5)
    pprint(name, r, prelen)
```
  • You are right, that was an excellent copy-paste failure. Running with the edited version doesn't show much timing difference, probably since it's already past the relevant cache sizes. — Commented Jun 1 at 23:59
