I've implemented CMA-ES in Python using the cma package. To speed up the optimization, I run several instances of the algorithm in parallel with the multiprocessing package, and each instance also uses multiprocessing to evaluate the fitness function.
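Here is a minimal sketch of the setup described above. It makes a few assumptions. It uses the ask/tell interface of `cma.CMAEvolutionStrategy`. It runs one `multiprocessing.Process` per instance, because `multiprocessing.Pool` workers are daemonic and cannot start a pool of their own. It also uses a stand-in CPU-bound fitness function. `slow_sphere`, `run_instance` and `main` are hypothetical names, not the original code.

```python
import multiprocessing as mp
import time

import numpy as np


def slow_sphere(x):
    """Hypothetical stand-in fitness: the sphere function, recomputed to burn CPU."""
    total = 0.0
    for _ in range(20_000):
        total += float(np.dot(x, x))
    return total / 20_000


def run_instance(seed, n_workers, dim, max_iter, results):
    """Run one CMA-ES instance that evaluates its population on its own pool."""
    import cma  # imported here so the parent process does not need to import it

    es = cma.CMAEvolutionStrategy(dim * [1.0], 0.5,
                                  {"seed": seed, "maxiter": max_iter, "verbose": -9})
    start = time.perf_counter()
    with mp.Pool(processes=n_workers) as pool:
        while not es.stop():
            solutions = es.ask()
            fitnesses = pool.map(slow_sphere, solutions)  # parallel evaluation
            es.tell(solutions, fitnesses)
    elapsed = time.perf_counter() - start
    results.put((seed, es.result.fbest, elapsed / max(es.countiter, 1)))


def main(n_instances=2, n_workers=4, dim=10, max_iter=50):
    """Start one process per CMA-ES instance and report seconds per iteration."""
    results = mp.Queue()
    procs = [mp.Process(target=run_instance,
                        args=(seed, n_workers, dim, max_iter, results))
             for seed in range(1, n_instances + 1)]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    out = [results.get() for _ in procs]
    for p in procs:
        p.join()
    for seed, fbest, sec_per_iter in sorted(out):
        print(f"seed={seed}: fbest={fbest:.3g}, {sec_per_iter * 1e3:.1f} ms/iteration")
```

To compare, call `main(n_instances=1, n_workers=4)` and then `main(n_instances=2, n_workers=4)` from inside an `if __name__ == "__main__":` block, which the spawn start method requires. Then compare the ms/iteration lines.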
My problem is the time per iteration: it roughly doubles when I double the number of instances. This confuses me, because the machine I'm running on has 8 cores and 16 threads. If I run 2 instances and use 4 processes to evaluate the fitness function, there shouldn't be any obvious bottleneck, yet the time per iteration still doubles.
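The process-count reasoning above can be written out as a small sketch. It assumes that the 4 evaluation processes are per instance and that each instance's own process sits idle while its pool evaluates. `busy_processes` and `workers_per_instance` are hypothetical helpers. The sketch counts processes only. Any threads a process may start on its own, for example inside NumPy's linear-algebra routines, are not included.

```python
import os


def busy_processes(n_instances, workers_per_instance):
    """Processes that can be evaluating fitness at the same time."""
    # Assumption: each instance's own process only waits on pool.map,
    # so only the pool workers are counted as CPU-bound.
    return n_instances * workers_per_instance


def workers_per_instance(n_instances, n_cpus=None):
    """Largest pool size per instance that keeps busy processes <= n_cpus."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1  # logical CPUs (16 on an 8-core/16-thread CPU)
    return max(1, n_cpus // n_instances)


if __name__ == "__main__":
    n_cpus = os.cpu_count() or 1
    for n in (1, 2, 4):
        k = workers_per_instance(n, n_cpus)
        print(f"{n} instance(s) x {k} workers = {busy_processes(n, k)} busy processes")
```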
I understand that multiprocessing is intended for computers with two or more physical CPUs rather than multiple cores, but can't I leverage my CPU's multiple cores somehow?
What am I missing?
I've tried disabling SMT/Hyper-Threading, but that made no difference.
I feel like running several instances of CMA-ES is a pretty good approach, but it's not doing what I thought it would.