
I have a question on how to calculate distances in numpy as fast as possible:

def getR1(VVm, VVs, HHm, HHs):
    t0 = time.time()
    R = VVs.flatten()[numpy.newaxis, :] - VVm.flatten()[:, numpy.newaxis]
    R *= R
    R1 = HHs.flatten()[numpy.newaxis, :] - HHm.flatten()[:, numpy.newaxis]
    R1 *= R1
    R += R1
    del R1
    print "R1\t", time.time() - t0, R.shape,  # 11.7576191425 (108225, 10500)
    print numpy.max(R)                        # 4176.26290975
    # uses 17.5 Gb ram
    return R

def getR2(VVm, VVs, HHm, HHs):
    t0 = time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    deltas = precomputed_flat[None, :, :] - measured_flat[:, None, :]
    #print time.time() - t0, deltas.shape  # 5.861109972 (108225, 10500, 2)
    R = numpy.einsum('ijk,ijk->ij', deltas, deltas)
    print "R2\t", time.time() - t0, R.shape,  # 14.5291359425 (108225, 10500)
    print numpy.max(R)                        # 4176.26290975
    # uses 26 Gb ram
    return R

def getR3(VVm, VVs, HHm, HHs):
    from numpy.core.umath_tests import inner1d
    t0 = time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    deltas = precomputed_flat[None, :, :] - measured_flat[:, None, :]
    #print time.time() - t0, deltas.shape  # 5.861109972 (108225, 10500, 2)
    R = inner1d(deltas, deltas)
    print "R3\t", time.time() - t0, R.shape,  # 12.6972110271 (108225, 10500)
    print numpy.max(R)                        # 4176.26290975
    # uses 26 Gb ram
    return R

def getR4(VVm, VVs, HHm, HHs):
    import scipy.spatial.distance as spdist
    t0 = time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    R = spdist.cdist(precomputed_flat, measured_flat, 'sqeuclidean')  #.T
    print "R4\t", time.time() - t0, R.shape,  # 17.7022118568 (108225, 10500)
    print numpy.max(R)                        # 4176.26290975
    # uses 9 Gb ram
    return R

def getR5(VVm, VVs, HHm, HHs):
    import scipy.spatial.distance as spdist
    t0 = time.time()
    precomputed_flat = numpy.column_stack((VVs.flatten(), HHs.flatten()))
    measured_flat = numpy.column_stack((VVm.flatten(), HHm.flatten()))
    R = spdist.cdist(precomputed_flat, measured_flat, 'euclidean')  #.T
    print "R5\t", time.time() - t0, R.shape,  # 15.6070930958 (108225, 10500)
    print numpy.max(R)                        # 64.6240118667
    # uses only 9 Gb ram
    return R

def getR6(VVm, VVs, HHm, HHs):
    from scipy.weave import blitz
    t0 = time.time()
    R = VVs.flatten()[numpy.newaxis, :] - VVm.flatten()[:, numpy.newaxis]
    blitz("R=R*R")     # R *= R
    R1 = HHs.flatten()[numpy.newaxis, :] - HHm.flatten()[:, numpy.newaxis]
    blitz("R1=R1*R1")  # R1 *= R1
    blitz("R=R+R1")    # R += R1
    del R1
    print "R6\t", time.time() - t0, R.shape,  # 11.7576191425 (108225, 10500)
    print numpy.max(R)                        # 4176.26290975
    return R

results in the following times:

R1  11.7737319469  (108225, 10500)  4909.66881791
R2  15.1279799938  (108225, 10500)  4909.66881791
R3  12.7408981323  (108225, 10500)  4909.66881791
R4  17.3336868286  (10500, 108225)  4909.66881791
R5  15.7530870438  (10500, 108225)  70.0690289494
R6  11.670968771   (108225, 10500)  4909.66881791

R5 gives sqrt((VVm-VVs)^2+(HHm-HHs)^2), while the others give (VVm-VVs)^2+(HHm-HHs)^2. This is not really important, since further on in my code I take the minimum of R[i,:] for each i, and the sqrt doesn't change which value is the minimum anyway. (If I am interested in the actual distance, I just take sqrt(value) of that single entry instead of taking the sqrt over the entire array, so there is really no timing difference due to that.)

The question remains: how come the first solution is the best? (The reason the second and third are slower is that the deltas = ... step alone takes 5.8 seconds, which is also why those two methods use 26 Gb of RAM.) And why is sqeuclidean slower than euclidean?

sqeuclidean should just compute (VVm-VVs)^2+(HHm-HHs)^2, but I think it does something different. Does anyone know how to find the source code (C or whatever is at the bottom) of that method? I suspect it computes sqrt((VVm-VVs)^2+(HHm-HHs)^2)^2, which is the only reason I can think of why it would be slower than (VVm-VVs)^2+(HHm-HHs)^2. I know that's a silly reason; does anyone have a more logical one?
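For reference, a small synthetic test along these lines (made-up array sizes, not my real data) should show whether the metric itself makes the difference:

import time
import numpy
from scipy.spatial.distance import cdist

# made-up sizes, just to compare the two metrics on identical input
a = numpy.random.rand(10000, 2)
b = numpy.random.rand(1000, 2)
for metric in ('sqeuclidean', 'euclidean'):
    t0 = time.time()
    cdist(a, b, metric)
    print metric, time.time() - t0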

Since I know nothing of C, how would I inline this with scipy.weave? And is that code compiled and run the same way as normal Python, or do I need special tools installed for that?

Edit: OK, I tried it with scipy.weave.blitz (the R6 method), and that is slightly faster, but I assume someone who knows more C than me can still improve this speed. I just took the lines of the form a += b or a *= b, looked up how they would look in C, and put them in the blitz statement. I guess that if I also wrote the lines with flatten and newaxis in C, it would go faster too, but I don't know how to do that (can someone who knows C explain?). Right now, the difference between the blitz version and my first method is not big enough to really be caused by C vs numpy, I guess?

I guess the other methods, like the one with deltas = ..., could also go much faster if I wrote them in C?
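For what it's worth, here is an untested sketch of how the whole computation could be pushed into a single C++ loop with scipy.weave.inline (the function name getR_inline and the loop structure are illustrative, not from the original post; weave only exists in old, Python-2-era SciPy and needs a C++ compiler installed):

import numpy
from scipy import weave
from scipy.weave import converters

def getR_inline(VVm, VVs, HHm, HHs):
    # hypothetical helper: compute (VVm-VVs)**2 + (HHm-HHs)**2 in one C++ loop
    vm = VVm.flatten()
    hm = HHm.flatten()
    vs = VVs.flatten()
    hs = HHs.flatten()
    n = vm.shape[0]
    m = vs.shape[0]
    R = numpy.empty((n, m))
    code = """
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            double dv = vs(j) - vm(i);
            double dh = hs(j) - hm(i);
            R(i, j) = dv*dv + dh*dh;
        }
    }
    """
    # blitz type converters let the C++ code index numpy arrays as a(i, j)
    weave.inline(code, ['vm', 'hm', 'vs', 'hs', 'n', 'm', 'R'],
                 type_converters=converters.blitz)
    return R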

  • Consider trying something along the lines of jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2 (esp. the 'numpy with broadcasting' part). Commented Jul 8, 2013 at 13:08
  • You could probably shave a few seconds off by not allocating memory for R (i.e., just use R1 += R3). Commented Jul 8, 2013 at 13:11
  • @bogatron yeah, same way as R1 *= R1, but still, that won't reduce it down to like 1 sec or so (which I assume should happen when it is fully in C from numpy)? Commented Jul 8, 2013 at 13:12
  • Not down to 1 sec, but if you're using 32-bit floats, that would save you having to allocate about 4 GB of RAM, which is significant. And if it lets you avoid using swap, then it will be a significant improvement. I'd be surprised if it can get down to 1 sec in C (even with no Python), considering how much memory it requires (unless you have a lot of RAM and are significantly multithreaded). Commented Jul 8, 2013 at 13:16
  • Got about 50 Gb of RAM, so it's not swapping yet. Commented Jul 8, 2013 at 13:30

1 Answer


Whenever you have multiplications and sums, try to use one of the dot product functions or np.einsum. Since you are preallocating your arrays, rather than having different arrays for horizontal and vertical coordinates, stack them both together:

precomputed_flat = np.column_stack((svf.flatten(), shf.flatten()))
measured_flat = np.column_stack((VVmeasured.flatten(), HHmeasured.flatten()))
deltas = precomputed_flat - measured_flat[:, None, :]

From here, the simplest would be:

dist = np.einsum('ijk,ijk->ij', deltas, deltas) 
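As a quick illustration (not part of the original answer), that einsum contraction is the same as squaring deltas and summing over the last axis; on small random stand-in arrays:

import numpy as np

x = np.random.rand(5, 2)   # stand-in for measured_flat
y = np.random.rand(7, 2)   # stand-in for precomputed_flat
deltas = y - x[:, None, :]                      # shape (5, 7, 2)
d1 = np.einsum('ijk,ijk->ij', deltas, deltas)   # sum of squares over the last axis
d2 = (deltas ** 2).sum(axis=-1)
assert np.allclose(d1, d2)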

You could also try something like:

from numpy.core.umath_tests import inner1d
dist = inner1d(deltas, deltas)

There is of course also SciPy's spatial module cdist:

from scipy.spatial.distance import cdist
dist = cdist(precomputed_flat, measured_flat, 'euclidean')
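A further variant, not in the original answer but in the spirit of the dot-product suggestion above, is to expand the squared distance as |a|^2 + |b|^2 - 2*a.b, which avoids materializing the (N, M, 2) deltas array entirely (the function and variable names below are illustrative):

import numpy as np

def sq_dists_dot(measured_flat, precomputed_flat):
    # measured_flat: (N, 2), precomputed_flat: (M, 2); returns (N, M) squared distances
    a2 = (measured_flat ** 2).sum(axis=1)[:, None]     # |a|^2, shape (N, 1)
    b2 = (precomputed_flat ** 2).sum(axis=1)[None, :]  # |b|^2, shape (1, M)
    return a2 + b2 - 2.0 * measured_flat.dot(precomputed_flat.T)

Floating-point cancellation can leave tiny negative entries, so clip at zero if exact non-negativity matters.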

EDIT I cannot run tests on such a large dataset, but these timings are rather enlightening:

len_a, len_b = 10000, 1000
a = np.random.rand(2, len_a)
b = np.random.rand(2, len_b)
c = np.random.rand(len_a, 2)
d = np.random.rand(len_b, 2)

In [3]: %timeit a[:, None, :] - b[..., None]
10 loops, best of 3: 76.7 ms per loop

In [4]: %timeit c[:, None, :] - d
1 loops, best of 3: 221 ms per loop

For the above smaller dataset, I can get a slight speed up over your method with scipy.spatial.distance.cdist and match it with inner1d, by arranging data differently in memory:

precomputed_flat = np.vstack((svf.flatten(), shf.flatten()))
measured_flat = np.vstack((VVmeasured.flatten(), HHmeasured.flatten()))
deltas = precomputed_flat[:, None, :] - measured_flat

import scipy.spatial.distance as spdist
from numpy.core.umath_tests import inner1d

In [13]: %timeit r0 = a[0, None, :] - b[0, :, None]; r1 = a[1, None, :] - b[1, :, None]; r0 *= r0; r1 *= r1; r0 += r1
10 loops, best of 3: 146 ms per loop

In [14]: %timeit deltas = (a[:, None, :] - b[..., None]).T; inner1d(deltas, deltas)
10 loops, best of 3: 145 ms per loop

In [15]: %timeit spdist.cdist(a.T, b.T)
10 loops, best of 3: 124 ms per loop

In [16]: %timeit deltas = a[:, None, :] - b[..., None]; np.einsum('ijk,ijk->jk', deltas, deltas)
10 loops, best of 3: 163 ms per loop
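One small follow-up, not part of the original timings: the [15] case uses cdist's default 'euclidean' metric; passing 'sqeuclidean' skips the square root and returns the same quantity the question's R1 computes:

import numpy as np
import scipy.spatial.distance as spdist

a = np.random.rand(2, 10000)
b = np.random.rand(2, 1000)
# 'sqeuclidean' returns squared distances, i.e. no sqrt at the end
R = spdist.cdist(a.T, b.T, 'sqeuclidean')   # shape (10000, 1000)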

13 Comments

As an alternative to np.einsum one can use np.tensordot(), which also has a very flexible notation...
Sadly, all 3 methods you suggest are slower (the deltas = ... step alone takes six seconds, which is why they are slower).
Funny how memory management ruins the best laid plans... I don't fully understand what's going on, but see my edit. You may want to try the above methods on your huge arrays to see if the timings behave differently, but there may be some margin to win with scipy.
Your fastest way is still slower than mine, but I think that is because in my case I just do (x-x')**2+(y-y')**2. Later on I take the minimal value from it, and taking the sqrt won't change where the minimum is, so I skip it in my calculation, while cdist does take it (I assume). Your [15] comes in at 15.7 sec on my data, while mine comes in at 11.8 sec, but I assume that's due to the sqrt being taken in the cdist routine; if there were some routine like that without the sqrt, I really think it would end up much faster.
According to the docs, spdist.cdist(a.T, b.T, 'sqeuclidean') should do just that; I can't test it right now. Anyway, interesting to see how memory handling becomes everything when you are using a lot of it!
