Short version:

I tested your code with a slight rewrite on v9, and I cannot reproduce the slowdown. I get a slight speedup, precisely as expected.

----

I tested your code with version 9 on a 4-core machine. Note that this CPU has hyper-threading so Mathematica is actually running 8 subkernels.

To separate the parallelization question from the details of your code, I packaged it up into a "blackbox" function. Let's just define it and forget about what it does exactly for a moment.

 blackbox[{a_, b_}] := SelectbyWRange[-Im[SmoothDFT[a, ht, 2]*Conjugate[SmoothDFT[b, ht, 2]]], {-834., 834.}, {19.5, 20.5}]

Notice that you only use the `Table` index `n` for indexing into an array, so the problem can be reformulated as a `Map`. This way `ParallelMap` avoids transferring *all* of the input vectors to *all* subkernels: each subkernel receives only the parts it needs to process. This reduces both the transfer time and the memory usage. `a` and `b` together take 1.8 GB of memory; if they are duplicated to each of 16 subkernels, you need at least 17 × 1.8 GB ≈ 31 GB of memory in your machine. This may actually be the cause of the slowdown you see.
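For reference, here is roughly what I understand the original `Table` formulation to look like (a sketch; `a` and `b` are the two large input arrays from the question):

 (* n is used only to index into a and b, so a ParallelTable over this
    expression must distribute a and b in full to every subkernel: *)
 Table[blackbox[{a[[n]], b[[n]]}], {n, Length[a]}]

The `Map` reformulation below avoids exactly this full duplication.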

Let's define

 c = Transpose[{a, b}];

Then the calculation is simply

 blackbox /@ c

Since I'm impatient, I only fed part of `c` to the function and used

 blackbox /@ Take[c, 64];

for benchmarking. Note that to achieve a reasonable speedup, you should use an input list that is longer than the number of cores you have, preferably much longer. In your example you map over only 2 elements on a 16-core machine. That doesn't make much sense: it caps the speedup at 2x, no matter how many cores you have.
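A toy sketch illustrating this cap (using `Pause` as a stand-in for expensive work; exact timings will vary with overhead and kernel count):

 (* with only 2 elements, at most 2 subkernels can be busy,
    so adding more cores cannot help *)
 AbsoluteTiming[ParallelMap[(Pause[1]; #) &, Range[2]];]

 (* with 16 elements, the work spreads over all subkernels,
    and wall-clock time scales roughly as 16/$KernelCount *)
 AbsoluteTiming[ParallelMap[(Pause[1]; #) &, Range[16]];]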

Now let's do the benchmarking. On my four core machine I get:

 Timing[blackbox /@ Take[c, 64];] --> 28 s
 AbsoluteTiming[blackbox /@ Take[c, 64];] --> 9.4 s
 AbsoluteTiming[ParallelMap[blackbox, Take[c, 64]];] --> 6.9 s

Notice that `Timing` reports roughly three times the value of `AbsoluteTiming` for the sequential `Map`. This is because `Timing` measures CPU time on each core separately and adds them up, while `AbsoluteTiming` measures wall-clock time. Checking the task manager, I see that the sequential calculation uses 300% CPU, i.e. three cores are busy at the same time. Something in `blackbox` must already be parallelized directly in the kernel (most likely `Fourier`).
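You can see this effect in isolation with a function that is internally multithreaded (a sketch; the exact ratio depends on how many threads `Fourier` uses on your machine):

 data = RandomReal[1, 2^24];

 (* CPU time, summed over all cores working on the FFT *)
 Timing[Do[Fourier[data], {10}];]

 (* wall-clock time; smaller than Timing when several cores are busy *)
 AbsoluteTiming[Do[Fourier[data], {10}];]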

Thus from further high-level parallelization we can only expect a speedup of about 30%, corresponding to going from 300% CPU usage to the full 400% on this four-core machine (9.4 s × 3/4 ≈ 7.1 s expected). This is exactly what happens on my machine when I use `ParallelMap`: the timing went from 9.4 s to 6.9 s.

So, at least with version 9.0.1 and when using `Map`, I can't reproduce the slowdown. I get the expected speedup. But I did not test with v8.

----

**Update:** I have now tested with Mathematica 8.0.4. The results are the same as with 9. I cannot reproduce the problem.

I cannot run the OP's `Table` version instead of my `Map` version because, as I mentioned above, the `Table` version would require ~31 GB of memory. This computer doesn't have that much memory, so it would slow down considerably due to swapping.