Having an activity that is formally I/O bound doesn't mean it can't be parallelized. As a radically marginal but expressive example, suppose you have to read something from tape drives, where an average seek takes 5 minutes, and the data sits on two different tapes, each loaded into its own drive (device). If you issue the requests in parallel, the total time is approximately 5 minutes; if you issue them one after the other, it's 10 minutes.
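A minimal Python sketch of the same idea, with `time.sleep` standing in for the slow blocking seek (scaled down from minutes to fractions of a second); the function name `read_from_tape` is purely illustrative. Since a blocking wait releases the GIL, plain threads are enough to overlap the two waits:

```python
import threading
import time

# Hypothetical stand-in for a slow, blocking I/O request (a tape seek);
# time.sleep releases the GIL just like a real blocking read would.
def read_from_tape(tape_id, delay=0.2):
    time.sleep(delay)
    return f"data from tape {tape_id}"

# Sequential: total time is the sum of both waits.
start = time.monotonic()
read_from_tape(1)
read_from_tape(2)
sequential = time.monotonic() - start

# Parallel: both requests wait at the same time,
# so total time is roughly the longest single wait.
start = time.monotonic()
threads = [threading.Thread(target=read_from_tape, args=(i,)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.monotonic() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

With two 0.2 s waits, the sequential run takes about 0.4 s while the threaded run takes roughly 0.2 s, mirroring the 10-minute vs 5-minute tape case.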
If I understood correctly, your case runs the same request set in a single process instead of separate processes. At a glance, I'd suspect the kernel I/O scheduler distinguishes threads from processes and applies some kind of I/O bandwidth limiting with a bucket per process. Another possibility is that your implementation spends too much time on transitions between Python and C land. But without real measurements these are just speculations.
The problem is that performance is really hard. People spend man-years tuning their code, hunting for the one tiny detail that affects everything or, vice versa, rewriting entire layers to achieve a 1-2% speedup. And after that, the next change in an underlying layer (CPU, kernel, etc.) can void all those results. So if the difference you see is less than, say, 30%, just pick the variant that looks best for now and move on to another task :)