Having an activity that is formally I/O bound doesn't mean it can't be parallelized. As a radically marginal but expressive example, suppose you have to read something from tape drives, where an average seek takes 5 minutes, and the data sits on two different tapes, each loaded into its own drive (device). If you issue the requests in parallel, the total time is approximately 5 minutes; if you issue them one after the other, it's 10 minutes.
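A minimal Python sketch of the same idea, with `time.sleep` standing in for the slow blocking seek (scaled down from minutes to fractions of a second); the function name `read_from_tape` is purely illustrative. Since a blocking wait releases the GIL, plain threads are enough to overlap the two waits:

```python
import threading
import time

# Hypothetical stand-in for a slow, blocking I/O request (a tape seek);
# time.sleep releases the GIL just like a real blocking read would.
def read_from_tape(tape_id, delay=0.2):
    time.sleep(delay)
    return f"data from tape {tape_id}"

# Sequential: total time is the sum of both waits.
start = time.monotonic()
read_from_tape(1)
read_from_tape(2)
sequential = time.monotonic() - start

# Parallel: both requests wait at the same time,
# so total time is roughly the longest single wait.
start = time.monotonic()
threads = [threading.Thread(target=read_from_tape, args=(i,)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.monotonic() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

With two 0.2 s waits, the sequential run takes about 0.4 s while the threaded run takes roughly 0.2 s, mirroring the 10-minute vs 5-minute tape case.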
If I understood correctly, your case runs the same request set in a single process instead of separate processes. At a glance, I'd suspect the kernel I/O scheduler distinguishes threads from processes and applies some kind of I/O bandwidth limiting with a bucket per process. Another possibility is that your implementation spends too much time on transitions between Python and C land. But without real measurements these are just speculations.
The problem is that performance is really hard. People spend man-years tuning their code, hunting for the one tiny detail that affects everything or, vice versa, rewriting entire layers to achieve a 1-2% speedup. And after that, the next change in an underlying layer (CPU, kernel, etc.) can void all those results. So if the difference you see is less than, say, 30%, just pick the variant that looks best for now and move on to another task :)