Why are the iterations of my CMA-ES implementation slowing down with multiprocessing?

Question

I've implemented CMA-ES in Python using the cma package. In an effort to speed up the algorithm, I'm initializing several instances of it in parallel using the multiprocessing package. This is in addition to using multiprocessing to evaluate the fitness function.

My problem is the iteration time: it roughly doubles whenever I double the number of instances. This is confusing to me, as the computer I'm running on has 8 cores and 16 threads. If I run 2 instances and use 4 processes to evaluate the fitness function, there shouldn't be any obvious bottleneck, yet the iteration time doubles.

I realize that multiprocessing is intended for computers with two or more actual CPUs rather than cores, but can't I leverage my CPU's multiple cores somehow?

What am I missing?

I've tried disabling SMT / Hyperthreading but that made no difference.

I feel like initializing several instances of cma is a pretty good approach, but it's not doing what I thought it would.

Hard to tell without the code or any reproducible example. I guess this is the overhead of IPC and pickling, as usual with multiprocessing. It could also be because the program is memory-bound. Did you profile it?
– Jérôme Richard, Commented Apr 12, 2024 at 10:41
The number of sockets used should have no significant impact (only NUMA effects, but they should not be a problem with multiprocessing AFAIK). Thus, two sockets with 4-core CPUs or one 8-core CPU should be equivalent in your case in terms of computational power. SMT is only a problem if you run more threads or processes than cores, which is not the case here. It can be a problem with a bad binding though, but the same can still happen without SMT.
– Jérôme Richard, Commented Apr 12, 2024 at 10:44
Are your CPU cores actually all busy? How long is one evaluation when it runs single-threaded? CMA-ES needs to run a serial update step for each iteration. It has to block until the slowest parallel evaluation is done, and then do an update step that doesn't easily parallelize (with complexity N^2).
– maxy, Commented May 3, 2024 at 8:44
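
To make maxy's point concrete, here is a toy timing model (not a measurement of cma itself). It assumes each evaluation is handed to whichever worker becomes free first, and that the update step runs serially once all evaluations are back. The function name `iteration_time` and the numbers are made up for illustration.

```python
import heapq

def iteration_time(eval_times, n_workers, update_time):
    """Wall-clock time of one iteration under a simple scheduling model."""
    # Each heap entry is the time at which a worker becomes free.
    workers = [0.0] * n_workers
    heapq.heapify(workers)
    for t in eval_times:
        # Give the next evaluation to the worker that is free earliest.
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + t)
    # The iteration blocks until the slowest worker is done,
    # then the serial update step runs.
    return max(workers) + update_time

print(iteration_time([1, 1, 1, 1], 4, 0.5))  # 1.5
print(iteration_time([1, 1, 1, 1], 2, 0.5))  # 2.5
print(iteration_time([3, 1, 1, 1], 4, 0.5))  # 3.5
```

In this model, a single slow evaluation or a long update step caps the speed-up no matter how many workers are added.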

paulduf · Accepted Answer · 2024-05-17 12:22:50Z

The code is already multicore, as it is calling optimized libraries (e.g. NumPy) under the hood for costly operations (mostly linear algebra). See, for example, eigenmethod in the GaussianSampler (link). So you should not bother implementing optimizations on this side. It's still a good idea to optimize the independent part of the code related to the objective computation though.
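
If several instances each run multithreaded linear algebra while worker pools are also evaluating the objective, the machine may end up with more runnable threads than cores. One thing worth trying (this is an assumption about the cause, not something confirmed for this setup) is to cap the number of threads the linear algebra backend uses in each process. The environment variables below are the ones commonly read by OpenMP, OpenBLAS and MKL; which one applies depends on how NumPy was built. The helper name `limit_blas_threads` is hypothetical.

```python
import os

def limit_blas_threads(n_threads=1):
    """Cap the thread count of common BLAS/OpenMP backends for this process."""
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n_threads)

# Call this before NumPy (and therefore cma) is imported for the first time;
# the backends typically read these variables only when they are loaded.
limit_blas_threads(1)

# import numpy, cma, ... only after this point
```

Processes started afterwards with multiprocessing inherit the environment, so the cap also applies to the workers. Comparing iteration times with and without the cap should show whether oversubscription is part of the slowdown.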
