A freshly allocated numpy array is contiguous in memory, even if it is multidimensional, as long as you don't take a non-contiguous view of it; see https://numpy.org/doc/stable/dev/internals.html . If you do take a non-contiguous view, you are back to strided access, which is the same thing that happens in C when you address a linear array as foo[stride*z] with stride != 1. And if you build your own non-contiguous structures in numpy, you get the same pointer chasing as in C.
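To make this concrete, here is a minimal sketch (array shape chosen arbitrarily): a fresh 2-D array is contiguous, while slicing every other column produces a strided view, visible both in the `C_CONTIGUOUS` flag and in the strides:

```python
import numpy as np

a = np.zeros((4, 5))            # freshly allocated 2-D array
print(a.flags['C_CONTIGUOUS'])  # True: rows are laid out back to back

v = a[:, ::2]                   # every other column: a strided view
print(v.flags['C_CONTIGUOUS'])  # False: elements are no longer adjacent
print(a.strides, v.strides)     # strides are in bytes; v skips a column per step
```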
So Q = A + B costs O(3N) loads and stores in numpy as well (load A, load B, store Q). If A or B are more expensive to recompute than to reload from RAM (e.g. they each need multiple uncached loads), then storing them can be sensible. The same complexity assumptions as in C should typically hold for operations implemented as ufuncs: most of these are implemented in C/C++, as Wolfgang Bangerth said, so their per-element cost should be close to what you get in C/C++. Operations like sorting have their complexity stated in the docs and are also implemented in C/C++.
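If you want to keep the traffic at those ~3N loads/stores without allocating a fresh result array, ufuncs accept an explicit output buffer. A minimal sketch (array size picked arbitrarily for illustration):

```python
import numpy as np

A = np.random.rand(1_000_000)
B = np.random.rand(1_000_000)
Q = np.empty_like(A)

# One C-level pass over the data, writing directly into Q:
# no temporary array is allocated for the sum.
np.add(A, B, out=Q)
```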
Tangentially related: I've personally been wondering whether Q = A + B + C + ... produces temporaries and is therefore possibly inferior to a loop over the summands. So I wrote a small test script for it (see the end of the post). Once the array is big enough, I typically saw results like this (statistics over 5 runs of 20 adds each):
```
              mean      stddev    min       max
Q=A+B         0.474155  0.007966  0.468307  0.485545
Q=A+B+C       0.671866  0.004024  0.667896  0.678112
Q=sum(A, B)   0.343913  0.002298  0.341384  0.346913
Q=sum(A,B,C)  0.553073  0.001132  0.551913  0.554523
Q=sum(4)      0.765597  0.001289  0.763745  0.767078
```
Here sum(...) is implemented by first copying A into Q and then adding the remaining operands in place as binary operations, avoiding temporaries. This suggests to me that something besides the raw adds is happening in Q = A + B + ... . In any case, as you can see, there is a (roughly) linear relationship between the number of operands and the runtime, regardless of how you sum them. As for the per-element cost: the numbers above are for 20 operations at a time, so the min of sum(A,B) comes to 0.0174085 s per vector add, while a STREAM ADD benchmark of the same size on the same machine takes 0.014576 s, which isn't too far off.
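The difference between the two strategies can be reproduced directly with ufunc calls. A minimal sketch (array size chosen arbitrarily) showing that the in-place accumulation used by sum(...), i.e. what addk() in the test script below does, gives the same result as the expression form, just without a full-size temporary for A + B:

```python
import numpy as np

A, B, C = (np.random.rand(1 << 10) for _ in range(3))

# Expression form: numpy first materializes (A + B) in a temporary,
# then adds C to that temporary to produce Q1.
Q1 = A + B + C

# Accumulation form: copy once, then add each remaining operand in
# place, so no intermediate full-size arrays are allocated.
Q2 = A.copy()
np.add(Q2, B, out=Q2)
np.add(Q2, C, out=Q2)

print(np.allclose(Q1, Q2))  # True: same result, same operation order
```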
Test code:
```python
import timeit

import numpy as np
from scipy import stats


def add2(Q, A, B):
    Q[:] = A[:] + B[:]

def add3(Q, A, B, C):
    Q[:] = A[:] + B[:] + C[:]

def addk(Q, *args):
    Q[:] = args[0]
    for arg in args[1:]:
        Q[:] += arg

outprec = 8

def pprint_header(prelen):
    cols = ["mean", "stddev", "min", "max"]
    flen = outprec + 1
    fmt = ("{:<%d}" % flen) * len(cols)
    print(" " * prelen, fmt.format(*cols))

def pprint(lab, r, prelen):
    sr = stats.describe(r)
    dats = [sr.mean, np.sqrt(sr.variance), sr.minmax[0], sr.minmax[1]]
    fmt = ("{:<%df} " % outprec) * len(dats)
    print(f"{lab:<{prelen}}", fmt.format(*dats))

N = 1 << 23  # large enough to be memory-bound and not suffer from python-level inefficiencies, hopefully
Q = np.zeros(N)
A = np.ones(N)
B = np.ones(N)
C = np.ones(N)
D = np.ones(N)

tests = [lambda: add2(Q, A, B),
         lambda: add3(Q, A, B, C),
         lambda: addk(Q, A, B),
         lambda: addk(Q, A, B, C),
         lambda: addk(Q, A, B, C, D)]
names = ["Q=A+B", "Q=A+B+C", "Q=sum(A, B)", "Q=sum(A,B,C)", "Q=sum(4)"]
prelen = max(len(x) for x in names)

for _ in range(3):
    tests[-1]()  # warmup

pprint_header(prelen)
for test, name in zip(tests, names):
    r = timeit.repeat(test, number=20, repeat=5)
    pprint(name, r, prelen)
```