Your explanations are very detailed, and I have to admit they are too detailed for me to work through without a minimal working example. Therefore, view this as some ideas for your first, short version of the question. I will present two different approaches: the first takes very long but consumes almost no memory, while the second is very fast but uses a lot of memory during the computation. Finally, I'll show you how to combine them so that they work within your memory restrictions.

If I understood you correctly, you have a function which tells you whether or not a combination {0,1,0,0,1,1,..} is relevant. I will call this function selector, and the one I use for the demonstration here is very simple: a combination is relevant if it contains exactly two 1's. A simple definition of this selector is therefore

selector[combination_] := Plus @@ combination === 2; 

Whether or not my idea works for your problem depends partially on your selector. The questions here are, for instance, whether we can include your selector in compiled code and how fast it is.

Another thing you'll notice is that I don't make use of ParallelTable or related parallel constructs. The reason is simple: since those functions distribute your problem to many subkernels, they need to distribute parts of your data as well. If you are not absolutely sure you know what you are doing there, this can very easily consume a lot of memory while the speed-up is often non-existent. In fact, I will use another kind of parallelization in my second approach.

General idea

The general idea is based on the fact that it is quite easy (and memory efficient) to create all tuples of {0,1} if you remember that all numbers are stored in binary inside your computer. Therefore, to create all combinations of length 4, you only need to count from 0 to 15 and look at the binary representations. Mathematica has the IntegerDigits function to give you the representation of a number in a different base. An equivalent algorithm to Tuples[{0, 1}, 4] is therefore the following:

IntegerDigits[Range[0, 15], 2, 4]
(* {{0,0,0,0},{0,0,0,1},{0,0,1,0},{0,0,1,1},
    {0,1,0,0},{0,1,0,1},{0,1,1,0},{0,1,1,1},
    {1,0,0,0},{1,0,0,1},{1,0,1,0},{1,0,1,1},
    {1,1,0,0},{1,1,0,1},{1,1,1,0},{1,1,1,1}} *)
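As a quick check with the toy size n=4, the counting order matches the order Tuples produces:

IntegerDigits[Range[0, 15], 2, 4] === Tuples[{0, 1}, 4]
(* True *)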

The idea has two advantages. Firstly, you can go through all combinations very easily by counting. Secondly, you don't need to store the {0,1,..} combinations explicitly, because you can keep them very memory-efficiently as a list of integers.
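For example, a stored integer can be expanded back into its combination on demand:

IntegerDigits[9, 2, 4]
(* {1, 0, 0, 1} *)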

Approach 1: Slow but memory efficient

With the above idea, the most direct algorithm to test all n-tuples of {0,1} is to go through all numbers from 0 to 2^n-1 and test whether each binary representation is relevant. If it is, you store it. With this you consume (more or less) only the memory that your final result list would require anyway. Again, remember that you don't need to store a list of combinations as the result, because a list of integers representing the combinations is sufficient.

To try this, we define a function iterativeSelect which implements the idea. It also sets a progress variable so that we get a feeling for how long it will take. To store a relevant combination, I Sow the integer representing the combination; all result combinations are collected with Reap on the outside. Reaping and sowing is a memory-efficient way to collect results, and I strongly advise against using something like AppendTo.

iterativeSelect[n_, test_] := Last@Last@Reap[
   Do[
    If[Mod[i, 2^10] === 0, progress = i];
    If[test[IntegerDigits[i, 2, n]],
     Sow[i]
    ],
    {i, 0, 2^n - 1}
   ]
  ]

With selector defined as at the beginning of this answer, let's test this approach for n=25:

n = 25;
ProgressIndicator[Dynamic[progress], {0, 2^n - 1}]
m1 = MemoryInUse[]/2^20.;
result = iterativeSelect[n, selector]; // AbsoluteTiming
m2 = MemoryInUse[]/2^20.;
m2 - m1

This takes about 156 seconds here and consumes almost no memory (~5kB). The equivalent run with Tuples

Select[Tuples[{0, 1}, 25], selector] // Length 

occupies about 17GB of memory during the run (but frees it when it is finished) and takes maybe 1.5 minutes.

If we want to extrapolate how long this method would run for n=30, we can remember that

$$2^{30} = 32\cdot2^{25}$$

so I guess something around $156\cdot32$ seconds, which is about 80 minutes.
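Spelled out, the extrapolation is simply

$$32 \cdot 156\,\text{s} = 4992\,\text{s} \approx 83\,\text{min}.$$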

Approach 2: Fast but memory consuming

The second approach I'm suggesting is to build a compiled function which takes a list of integers representing the combinations and tests the whole list in parallel. What you get is a list of True|False values which you can use with Pick to select only the relevant combinations.
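To illustrate Pick with such a mask, here is a toy example (not part of the actual computation):

Pick[{10, 20, 30, 40}, {True, False, True, False}]
(* {10, 30} *)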

This is clearly memory consuming, because we need the complete integer list during the run. For n=30 this requires about 8GB:

ByteCount[Range[2^30]]/2^20. 

Remember that we need a second list of True|False values, which is the result of the parallel selector. Furthermore, we don't know what Pick does internally.

Another restriction is that you have to be able to compile your selector function completely; otherwise, this approach will not be of much use. Long story short, in the next code block the options to Compile are significant:

parallelSelector = Compile[{{n, _Integer}},
   Plus @@ IntegerDigits[n, 2] === 2,
   CompilationTarget -> "C",
   Parallelization -> True,
   RuntimeAttributes -> {Listable},
   RuntimeOptions -> "Speed"];

What you should always do is use CompilePrint from the <<CompiledFunctionTools` package to check whether your compiled function is free of any MainEvaluate call.
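For the parallelSelector defined above, such a check could look like this (the exact printout depends on your Mathematica version):

<< CompiledFunctionTools`
CompilePrint[parallelSelector]
(* scan the printout: no MainEvaluate instruction should appear *)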

A speed test of the parallelSelector function alone reveals how fast it is compared to the first approach:

parallelSelector[Range[0, 2^25]]; // AbsoluteTiming
(* 3.094913 seconds *)

To create all relevant combinations, we now build the True|False list and select the True entries using Pick:

result = Block[{data = Range[0, 2^25]},
    Pick[data, parallelSelector[data]]
   ]; // AbsoluteTiming

This takes about 5 seconds here, and with 32GB of RAM I didn't see any noticeable memory consumption during the run; a list of 2^25 integers needs only 256MB of memory anyway. If I do the same calculation for n=28 on my machine, it finishes quite fast, within 40 seconds, but needs a peak memory of about 11GB during the run:
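As a sanity check for the toy selector: the count of length-25 combinations with exactly two 1's must equal Binomial[25, 2] = 300, and indeed

Length[result] === Binomial[25, 2]
(* True *)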

[Screenshot: system monitor showing the ~11GB memory peak during the n=28 run]

Combining the two approaches

Combining the two approaches is now easy. Assume your machine has enough memory to run parallelSelector for n=25 easily, and you need to calculate n=30. Then you make a sequential loop that iterates to 2^30 with a step size of 2^25. In each step i (of the 32 steps) you calculate the interval [i, i+2^25-1] in parallel. This is similar to what Sjoerd suggested in his first comment. We can even approximate the runtime quite accurately: since one parallel run for n=25 takes about 5 seconds on my machine and we need 32 steps to iterate to 2^30, I expect about 160 seconds. And indeed,

sequencialParallelIterate[n_, step_, parallelTest_] :=
 Flatten@Last@Last@Reap@Do[
    Sow@Block[{data = Range[i, i + 2^step - 1]},
      Pick[data, parallelTest[data]]
     ],
    {i, 0, 2^n - 1, 2^step}
   ]

sequencialParallelIterate[30, 25, parallelSelector]; // AbsoluteTiming

needs 166 seconds here. If I'm allowed to refer to my system monitor once more:

[Screenshot: system monitor showing CPU usage (first row) and memory in use (second row) during the combined run]

The first row shows the CPU usage and the second one again the memory in use. You see that the memory consumption stays low, because only the data of a single parallel run is needed at any time. When you correlate this with the CPU usage, you see that the memory peaks occur where the parallel selector runs. Additionally, you can clearly see every single iteration, because that is when all my cores work hard.

Conclusion

The benefit of the presented method is that you can easily tune it to your memory/runtime needs. If you are low on memory, you calculate smaller chunks in parallel and iterate more often. If you have enough memory, you can make the chunks that are computed with low-level parallelization larger, and you gain speed.
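For instance, to trade speed for memory you could run the same n=30 job in 1024 chunks of size 2^20 instead of 32 chunks of size 2^25 (I haven't timed this variant; the step size is only an illustration):

sequencialParallelIterate[30, 20, parallelSelector]; // AbsoluteTiming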

Whether the fast parallel method works depends strongly on your selector not doing very complex things. If your selector depends on global variables, they can most probably be included in the compiled function (using With or another strategy), but if you need something non-trivial like a call to Solve, things get complicated, if not impossible.
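As a minimal sketch of the With strategy (requiredOnes is a hypothetical global variable, not something from your problem), the value is injected as a literal before compilation:

requiredOnes = 2;
parallelSelectorInjected = With[{k = requiredOnes},
   Compile[{{n, _Integer}},
    Plus @@ IntegerDigits[n, 2] === k,
    CompilationTarget -> "C",
    Parallelization -> True,
    RuntimeAttributes -> {Listable},
    RuntimeOptions -> "Speed"]
  ];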