
I have already written the following piece of code, which does exactly what I want, but it is far too slow. I am certain that there is a way to make it faster, but I can't seem to find how it should be done. The first part of the code just shows which array has which shape.

    # two images of measurements (VV1 and HH1)
    # precomputed values, VV simulated and HH simulated, which both depend on
    # 3 parameters (precomputed for (101, 31, 11) values)
    # the last axis of length 2 just puts the VV and HH images in the same
    # ndarray, instead of making two 3d arrays
    VV1 = numpy.ndarray((54, 43)).flatten()
    HH1 = numpy.ndarray((54, 43)).flatten()
    precomp = numpy.ndarray((101, 31, 11, 2))

    # two of the three parameters we let vary
    comp = numpy.zeros((len(parameter1), len(parameter2)))
    for i, (vv, hh) in enumerate(zip(VV1, HH1)):
        comp0 = numpy.zeros((len(parameter1), len(parameter2)))
        for j in range(len(parameter1)):
            for jj in range(len(parameter2)):
                comp0[j, jj] = numpy.min((vv - precomp[j, jj, :, 0])**2
                                         + (hh - precomp[j, jj, :, 1])**2)
        comp += comp0

The obvious thing I know I should do is get rid of as many for-loops as I can, but I don't know how to make numpy.min behave properly when working with more dimensions.
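The key is numpy.min's axis argument, which reduces over a single dimension of a multi-dimensional array instead of collapsing everything to one scalar. A tiny self-contained illustration (the array here is made up for demonstration):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)  # shape (2, 3, 4)

# reduce over the last axis only: the result has shape (2, 3)
m = np.min(a, axis=2)
print(m.shape)    # (2, 3)
print(m[0, 0])    # 0, the minimum of a[0, 0, :] == [0, 1, 2, 3]
```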

A second thing I noticed (less important if the code can be vectorized, but still interesting) is that it is CPU-bound, not RAM-bound. I have searched for a long time already, but I can't find a way to write something like MATLAB's "parfor" instead of "for". (Is it possible to make an @parallel decorator, if I just put the for-loop in a separate method?)

edit: in reply to Janne Karila: yes, that definitely improves it a lot.

    for (vv, hh) in zip(VV1, HH1):
        comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

is definitely a lot faster, but is there any possibility of removing the outer for-loop too? And is there a way to make a for-loop parallel, with an @parallel decorator or something similar?


3 Answers


This can replace the inner loops over j and jj:

    comp0 = numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

This may be a replacement for the whole loop, though all this indexing is stretching my mind a bit. (It does create a large intermediate array, though.)

    comp = numpy.sum(
        numpy.min((VV1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 0])**2
                  + (HH1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 1])**2,
                  axis=3),
        axis=0)

4 Comments

yeah, that definitely improves it a lot: for (vv, hh) in zip(VV1, HH1): comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2) is definitely a lot faster, but is there any possibility to remove the outer for-loop too?
@usethedeathstar I made an attempt to remove the outer loop too, check it out.
yeah, that last version you posted is definitely going to swap (instead of 101, 31, 11, a more likely amount would be 300 for each parameter; with float64 that already makes a 216 MB matrix, before multiplying by the 2k+ pixels in the extra dimension, and I'm limited to 50 GB of RAM ;-) but I can of course just limit the number of values I take for each parameter, and then zoom in later on the part of interest
@usethedeathstar If you want to get by with only one call to numpy.min, you need to have the entire array in memory. weave.blitz might be interesting to you. I'm not too familiar with it, but it should offer a way to avoid temporaries.
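One middle ground between the per-pixel loop and the single big broadcast, given the memory concerns raised above, is to process the flattened pixels in chunks so the temporary array stays bounded. This is a sketch under the shapes from the question; the chunk size and the random test data are my own choices, not part of the original code:

```python
import numpy as np

# stand-in data with the question's shapes
rng = np.random.default_rng(0)
VV1 = rng.random(54 * 43)
HH1 = rng.random(54 * 43)
precomp = rng.random((101, 31, 11, 2))

comp = np.zeros(precomp.shape[:2])
chunk = 256  # tune to available RAM; the temporary holds chunk*101*31*11 floats
for start in range(0, VV1.size, chunk):
    vv = VV1[start:start + chunk].reshape(-1, 1, 1, 1)
    hh = HH1[start:start + chunk].reshape(-1, 1, 1, 1)
    # broadcast against precomp: shape (chunk, 101, 31, 11)
    d = (vv - precomp[np.newaxis, ..., 0])**2 + (hh - precomp[np.newaxis, ..., 1])**2
    # minimize over the third parameter, then accumulate over the pixels
    comp += np.min(d, axis=3).sum(axis=0)
```

This keeps the broadcasting win for each chunk while never materializing the full (npixels, 101, 31, 11) array at once.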

One way to parallelize the loop is to construct it in such a way as to use map. In that case, you can then use multiprocessing.Pool to use a parallel map.

I would change this:

    for (vv, hh) in zip(VV1, HH1):
        comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

To something like this:

    def buildcomp(vvhh):
        vv, hh = vvhh
        return numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

    if __name__ == '__main__':
        from multiprocessing import Pool
        nthreads = 2
        p = Pool(nthreads)
        complist = p.map(buildcomp, np.column_stack((VV1, HH1)))
        comp = np.dstack(complist).sum(-1)

Note that the dstack assumes each element of complist is 2d, because it adds a third axis and sums along it. This will slow things down a bit because you have to build the list, stack it, then sum it, but these are all either parallel or numpy operations.
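The dstack-then-sum step can be checked in isolation; here is a tiny self-contained illustration (the arrays are made up for demonstration) showing it is equivalent to accumulating in place:

```python
import numpy as np

parts = [np.full((2, 3), i, dtype=float) for i in range(4)]  # four 2x3 arrays

# accumulate in place, as in the original loop
acc = np.zeros((2, 3))
for p in parts:
    acc += p

# stack along a new third axis, then sum that axis away
stacked = np.dstack(parts).sum(-1)

print(np.array_equal(acc, stacked))  # True
```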

I also changed the zip to a numpy operation, np.column_stack, since zip is much slower for long arrays, assuming they're already 1d (which they are in your example).

I can't easily test this so if there's a problem, feel free to let me know.

4 Comments

can this construction with the column_stack of a tuple then be used to make an @parallel decorator?
by "@parallel decorator", I assume you refer to the IPython parallel setup, which I've never used. This looks like a case that would work for it. My answer is already parallel using multiprocessing, but you could look into the ipython solution as well.
No, by @parallel I mean a decorator that is general in use, and that executes in parallel whatever function you apply it to (it could be useful for image processing, or anything that uses ndarrays heavily for simulations, etc.)
@usethedeathstar As far as I know there is no general-use parallel decorator, as there are several different ways to implement parallelism in Python; some use threads, some use processes, and they therefore handle the GIL differently. The decorator I was referring to is the one provided by IPython's parallel construct, but I don't believe it can be used outside of IPython. Using numpy instead of builtins like zip will also run in parallel.
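A minimal, general-purpose helper along the lines discussed above can be sketched with multiprocessing. The caveats are real: the worker must be a picklable, module-level function, and it is safer to wrap it under a different name rather than use @ syntax, since rebinding the module-level name to the wrapper breaks pickling. The names here (parallel, square) are my own invention, not a standard API:

```python
from multiprocessing import Pool

def parallel(f, nprocs=2):
    """Return a version of f(item) that maps f over an iterable in a process pool."""
    def mapped(iterable):
        with Pool(nprocs) as p:
            return p.map(f, iterable)
    return mapped

def square(x):
    # the worker must live at module level so the child processes can pickle it
    return x * x

if __name__ == '__main__':
    # note: wrapped under a new name, not applied as @parallel on square itself
    par_square = parallel(square)
    print(par_square(range(5)))  # [0, 1, 4, 9, 16]
```

This is the same Pool.map pattern as the answer above, just packaged as a reusable higher-order function rather than a decorator.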

In computer science, there is the concept of Big O notation, used for getting an approximation of how much work is required to do something. To make a program fast, do as little as possible.

This is why Janne's answer is so much faster: you do fewer calculations. Taking this principle further, we can apply the concept of memoization, because you are CPU-bound instead of RAM-bound. You can use a memoization library if it needs to be more complex than the following example.

    class AutoVivification(dict):
        """Implementation of perl's autovivification feature."""
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value

    memo = AutoVivification()

    def memoize(n, arr, end):
        # ndarrays are not hashable, so key on the scalar n and the slice
        # index end; arr is always precomp here, so that is enough
        if end not in memo[n]:
            memo[n][end] = (n - arr[..., end])**2
        return memo[n][end]

    for (vv, hh) in zip(VV1, HH1):
        first = memoize(vv, precomp, 0)
        second = memoize(hh, precomp, 1)
        comp += numpy.min(first + second, axis=2)

Anything that has already been computed gets saved in the dictionary, so we can look it up later instead of recomputing it. You can even break the math down into smaller steps that are each memoized, if necessary.

The AutoVivification dictionary is just to make it easier to save the results inside of memoize, because I'm lazy. Again, you can memoize any of the math you do, so if numpy.min is slow, memoize it too.
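For hashable arguments, the standard library already ships a memoizing decorator, functools.lru_cache, which does what the hand-rolled memo above does; the caveat for this question is that it cannot take ndarrays as arguments, only hashable values. The classic recursive example:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Each fib(n) is computed exactly once and then served from the cache."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(60))  # returns instantly thanks to the cache
```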

1 Comment

hmm, I know of memoize already, but that's only useful if you may need to calculate the same values many times (like fibonacci and other recursive things), which probably won't be the case here. And the reason her answer is faster is (as far as I understand it) that numpy.min is used only once, but on the entire matrix, which pushes the for-loops down into the C code numpy uses, working on ndarrays instead of floats. Also, your memoize is implemented a bit differently from the decorator I use (if I think I might need it).
