Revisions to ParallelTable 70 times slower on 16 cores than Table on single core

deleted 249 characters in body

edited Sep 4, 2013 at 0:43

I cannot run the OP's Table version instead of my Map version because, as I mentioned above, the Table version would require ~31 GB of memory. This computer doesn't have that much memory so it would slow down considerably due to swapping.

deleted 4 characters in body

edited Sep 3, 2013 at 20:04

Szabolcs

`blackbox[{a_, b_}] := SelectbyWRange[-Im[SmoothDFT[a, ht, 2]*Conjugate[SmoothDFT[b, ht, 2]]], {-834., 834.}, {19.5, 20.5}]`

So, at least with version 9.0.1 and when using Map, I can't reproduce the slowdown. I get the expected speedup. But I did not test with v8.

Update: I have now tested with Mathematica 8.0.4. The results are the same as with 9. I cannot reproduce the problem.

I cannot run the OP's Table version instead of my Map version because, as I mentioned above, the Table version would require ~31 GB of memory. This computer doesn't have that much memory so it would slow down considerably due to swapping.

answered Sep 2, 2013 at 23:15

Szabolcs

Short version:

I tested your code with a slight rewrite on v9, and I cannot reproduce the slowdown. I get a slight speedup, precisely as expected.

I tested your code with version 9 on a 4-core machine. Note that this CPU has hyper-threading so Mathematica is actually running 8 subkernels.

To try to isolate your specific code from the problem, I packaged it up into a "blackbox" function. Let's just define it and forget about what it does exactly for a moment.

`blackbox[{a_, b_}] := SelectbyWRange[-Im[SmoothDFT[a, ht, 2]*Conjugate[SmoothDFT[b, ht, 2]]], {-834., 834.}, {19.5, 20.5}]`

Notice that you use the Table index n only for indexing into an array, so the problem can be reformulated as a Map. This way ParallelMap avoids transferring all of the input vectors to every subkernel. It transfers only the parts each subkernel needs to process, which reduces both the transfer time and the memory usage. a and b together take 1.8 GB of memory. If you duplicate them to 16 subkernels, you'll need at least 17*1.8 ≈ 31 GB of memory in your machine (16 subkernel copies plus the original in the main kernel). This may actually be the cause of the slowdown you see.
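The memory estimate above can be checked along the following lines. This is only a sketch: `aDemo` and `bDemo` are small made-up stand-ins for the real arrays, and it assumes the worst case in which every subkernel ends up holding a full copy.

```mathematica
(* Small stand-ins for the real a and b; the sizes are made up for illustration *)
aDemo = RandomReal[1, {100, 1000}];
bDemo = RandomReal[1, {100, 1000}];

(* Bytes held by both arrays in a single kernel *)
bytesPerCopy = ByteCount[aDemo] + ByteCount[bDemo];

(* Assumed worst case: one full copy per subkernel, plus the main kernel's copy *)
nSubkernels = 16;
totalGB = N[(nSubkernels + 1)*bytesPerCopy/10^9]

(* The same estimate with the 1.8 GB figure quoted above: about 30.6, i.e. roughly 31 GB *)
(16 + 1)*1.8
```

Replacing the stand-ins with the real `a` and `b` would give the estimate for the actual data.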

Let's define

c = Transpose[{a, b}];

Then the calculation is simply

blackbox /@ c
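To illustrate why the two formulations are interchangeable, here is a minimal sketch with a cheap stand-in for `blackbox`. The names `toy`, `aDemo`, `bDemo` and `cDemo` are hypothetical and the array sizes are arbitrary; none of this is the original code.

```mathematica
(* Hypothetical small inputs and a cheap stand-in for blackbox *)
aDemo = RandomReal[1, {8, 1000}];
bDemo = RandomReal[1, {8, 1000}];
toy[{x_, y_}] := Total[x*y];

(* Index-based formulation: the body refers to the whole arrays *)
viaTable = Table[toy[{aDemo[[n]], bDemo[[n]]}], {n, Length[aDemo]}];

(* Map-based formulation: each element carries its own pair of rows *)
cDemo = Transpose[{aDemo, bDemo}];
viaMap = toy /@ cDemo;

(* Both formulations should agree *)
viaTable === viaMap

(* The parallel variant only has to send each subkernel its share of cDemo *)
viaParallel = ParallelMap[toy, cDemo];
```

In the indexed form a parallel evaluation has to make the full arrays available wherever the body runs, while in the mapped form the data travels with the elements being processed.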

Since I'm impatient, I only fed part of c to the function and used

blackbox /@ Take[c, 64];

for benchmarking. Note that to achieve a reasonable speedup, you should use an input list which is longer than the number of cores you have, preferably much longer. In your example you use only 2 elements on a 16-core machine. This doesn't make much sense: it would give a 2x speedup at most.
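The "2x at most" bound can be made concrete with a rough model. The helper `idealSpeedup` below is hypothetical and assumes that all tasks take equally long and that there is no communication overhead, so it gives an upper bound rather than a prediction.

```mathematica
(* Idealised upper bound: the work runs in Ceiling[nTasks/nKernels] rounds *)
idealSpeedup[nTasks_Integer, nKernels_Integer] :=
  nTasks/Ceiling[nTasks/nKernels]

idealSpeedup[2, 16]   (* 2: two tasks can keep at most two kernels busy *)
idealSpeedup[64, 8]   (* 8: every kernel is busy in every round *)
idealSpeedup[17, 16]  (* 17/2: one leftover task forces a second round *)
```

The last case shows why a list only slightly longer than the kernel count is also a poor choice for benchmarking.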

Now let's do the benchmarking. On my four-core machine I get:

Timing[blackbox /@ Take[c, 64];] --> 28 s
AbsoluteTiming[blackbox /@ Take[c, 64];] --> 9.4 s
AbsoluteTiming[ParallelMap[blackbox, Take[c, 64]];] --> 6.9 s

Notice that Timing reports a time about 3 times longer than AbsoluteTiming for the sequential Map. This is because Timing measures the CPU time used on each core separately and adds it up, while AbsoluteTiming measures elapsed wall-clock time. Checking the task manager, I see that the sequential calculation uses 300% CPU, i.e. 3 cores are running at the same time. Something in blackbox must already be parallelized in the kernel directly (most likely Fourier).
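The difference between the two timers can be seen on any operation that the kernel may run multithreaded internally. The sketch below uses a dense matrix product as an assumed example of such an operation. The matrix size is arbitrary, and whether Timing counts the extra threads depends on the operating system, so the ratio may well come out close to 1 on some machines.

```mathematica
(* A dense matrix product is typically multithreaded inside the kernel *)
m = RandomReal[1, {2000, 2000}];

cpu = First[Timing[m . m;]];           (* CPU time, possibly summed over threads *)
wall = First[AbsoluteTiming[m . m;]];  (* elapsed wall-clock time *)

(* A ratio well above 1 suggests several cores were busy at once *)
{cpu, wall, cpu/wall}
```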

Thus by further high-level parallelization we can expect a speedup of about 30%, which corresponds to going from 300% CPU usage to a full 400% on this four-core machine. This is essentially what happens on my machine when I use ParallelMap: the timing went from 9.4 s to 6.9 s.
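As a quick check of the arithmetic, using only the figures quoted above:

```mathematica
(* Expected gain from raising CPU usage from 300% to 400% *)
expected = N[400/300]   (* about 1.33 *)

(* Measured gain: sequential Map versus ParallelMap, wall-clock time *)
measured = 9.4/6.9      (* about 1.36 *)

(* Fraction of the sequential wall-clock time saved *)
1 - 6.9/9.4             (* about 0.27 *)
```

The measured ratio is close to the expected one, which is consistent with the sequential run already keeping about three cores busy.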

So, at least with version 9.0.1 and when using Map, I can't reproduce the slowdown. I get the expected speedup. But I did not test with v8.

Post Made Community Wiki by Szabolcs

occurred Sep 2, 2013 at 23:15
