Short version:
I tested your code with a slight rewrite on v9, and I cannot reproduce the slowdown. I get a slight speedup, precisely as expected.
----
I tested your code with version 9 on a 4-core machine. Note that this CPU has hyper-threading so Mathematica is actually running 8 subkernels.
To isolate the problem from the specifics of your code, I packaged it up into a "blackbox" function. Let's just define it and, for the moment, forget about what exactly it does.
    blackbox[{a_, b_}] := SelectbyWRange[-Im[SmoothDFT[a, ht, 2]*Conjugate[SmoothDFT[b, ht, 2]]], {-834., 834.}, {19.5, 20.5}]
Notice that you only use the `Table` index `n` for indexing into an array, so the problem can be reformulated as a `Map`. This way `ParallelMap` will avoid transferring *all* of the input vectors to *all* subkernels: it only transfers the parts that each subkernel actually processes. This reduces both the transfer time and the memory usage. `a` and `b` together take 1.8 GB of memory. If they are duplicated into each of 16 subkernels, you need at least 17 × 1.8 GB ≈ 31 GB of memory in your machine (the main kernel plus 16 copies). This may well be the cause of the slowdown you see.
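One way to see this memory pressure directly (a sketch; `a` and `b` are the arrays from the question) is to watch the subkernels' memory use while distributing the definitions, which is effectively what the index-based `Table` version forces on every subkernel:

```mathematica
LaunchKernels[];
ParallelEvaluate[MemoryInUse[]]  (* baseline memory use in each subkernel *)
DistributeDefinitions[a, b];     (* copies the full arrays to every subkernel *)
ParallelEvaluate[MemoryInUse[]]  (* each figure should jump by roughly ByteCount[a] + ByteCount[b] *)
```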
Let's define
    c = Transpose[{a, b}];
Then the calculation is simply
    blackbox /@ c
Since I'm impatient, I only fed part of `c` to the function and used
    blackbox /@ Take[c, 64];
for benchmarking. Note that to achieve a reasonable speedup, the list you map over should be longer than the number of cores you have, preferably much longer. In your example you map over only 2 elements on a 16-core machine. This doesn't make much sense: it would give a 2x speedup at most.
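The effect of input length can be seen with a toy example (`f` is a hypothetical stand-in doing one second of "work" per element):

```mathematica
f[x_] := (Pause[1]; x)
AbsoluteTiming[ParallelMap[f, Range[2]]]   (* ~1 s: at most 2 subkernels are busy, the rest sit idle *)
AbsoluteTiming[ParallelMap[f, Range[32]]]  (* ~Ceiling[32/$KernelCount] s: all subkernels stay busy *)
```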
Now let's do the benchmarking. On my four core machine I get:
    Timing[blackbox /@ Take[c, 64];]                     --> 28 s
    AbsoluteTiming[blackbox /@ Take[c, 64];]             --> 9.4 s
    AbsoluteTiming[ParallelMap[blackbox, Take[c, 64]];]  --> 6.9 s
Notice that `Timing` gives a 3 times longer time than `AbsoluteTiming` for the sequential `Map`. This is because `Timing` measures the CPU time spent on each core separately and adds them up, while `AbsoluteTiming` measures elapsed wall-clock time. Checking the task manager, I see that the sequential calculation uses 300% CPU, i.e. 3 cores are working at the same time. Something in `blackbox` must already be parallelized directly in the kernel (most likely `Fourier`).
Thus by further high level parallelization we can expect a speedup of about 30% that corresponds to going from a 300% CPU usage to a full 400% in this four-core machine. This is exactly what happens on my machine when I use `ParallelMap`: the timing went from 9.4 s to 6.9 s.
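A quick way to check whether a function is internally multi-threaded (a sketch, using `Fourier` as the suspect here) is to compare the two timers on it directly:

```mathematica
data = RandomReal[1, 2^22];
Timing[Do[Fourier[data], {10}]][[1]]         (* total CPU time, summed over all cores *)
AbsoluteTiming[Do[Fourier[data], {10}]][[1]] (* elapsed wall-clock time *)
(* if the first number is much larger, the kernel is already using multiple cores for this function *)
```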
So, at least with version 9.0.1 and when using `Map`, I can't reproduce the slowdown. I get the expected speedup. But I did not test with v8.
----
**Update:** I have now tested with Mathematica 8.0.4. The results are the same as with 9. I cannot reproduce the problem.
I cannot run the OP's `Table` version instead of my `Map` version because, as I mentioned above, the `Table` version would require ~31 GB of memory. This computer doesn't have that much memory so it would slow down considerably due to swapping.