
Original Question

I am trying to use multiprocessing Pool in Python. This is my code:

    def f(x):
        return x

    def foo():
        p = multiprocessing.Pool()
        mapper = p.imap_unordered
        for x in xrange(1, 11):
            res = list(mapper(f, bar(x)))

This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU runs at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?

How can I resolve this problem?

Minimal, complete, verifiable example

To replicate my problem, I have created this example: it's a simple problem of generating n-grams from a string.

    #!/usr/bin/python
    import time
    import itertools
    import threading
    import multiprocessing
    import random

    def f(x):
        return x

    def ngrams(input_tmp, n):
        input = input_tmp.split()
        if n > len(input):
            n = len(input)
        output = []
        for i in range(len(input)-n+1):
            output.append(input[i:i+n])
        return output

    def foo():
        p = multiprocessing.Pool()
        mapper = p.imap_unordered
        num = 100000000 #100
        rand_list = random.sample(xrange(100000000), num)
        rand_str = ' '.join(str(i) for i in rand_list)
        for n in xrange(1, 100):
            res = list(mapper(f, ngrams(rand_str, n)))

    if __name__ == '__main__':
        start = time.time()
        foo()
        print 'Total time taken: '+str(time.time() - start)

When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.

Caution: When num is too large it may crash your system/VM.

  • What does bar return? Commented May 7, 2015 at 7:50
  • bar returns an igraph object and x determines the depth of the graph. Will that have any effect on this? Commented May 7, 2015 at 7:54
  • If the igraph is big enough, it may be spending more time pickling the parameters and results and pushing them over the pipes than doing actual work (especially if your actual work is just return x!), which would definitely serialize everything to one CPU. And it's certainly plausible that a graph of depth 5 wouldn't have this problem, but a graph of depth 9 would. Commented May 7, 2015 at 7:56
  • Can you give us a minimal, complete, verifiable example with some good fake data? Then we might be able to analyze and/or test it rather than having to guess. Commented May 7, 2015 at 7:59
  • OK, it definitely is what I thought. You're spending about 75% of your time fighting over the queues, 25% of your time on multiprocessing overhead, and 0% of your time doing actual work. In 2.7, the overhead is serialized, so everything happens on one core. In 3.4, the overhead is parallelized, so your other cores have a bit of work to do, but it's still just overhead (useless work). Commented May 7, 2015 at 9:17

2 Answers

8

First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
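Here's roughly what that test looks like (my own cut-down sketch, with a much smaller input than the OP's; the prints are just markers):

    import multiprocessing
    import random

    def f(x):
        return x

    def ngrams(input_tmp, n):
        words = input_tmp.split()
        if n > len(words):
            n = len(words)
        return [words[i:i+n] for i in range(len(words) - n + 1)]

    if __name__ == '__main__':
        rand_str = ' '.join(str(i) for i in random.sample(xrange(1000000), 10000))
        p = multiprocessing.Pool()
        for n in xrange(1, 100):
            print 'building ngrams for n = %d' % n
            args = ngrams(rand_str, n)  # serial: runs in the main process only
            print 'mapping %d ngrams' % len(args)
            res = list(p.imap_unordered(f, args))  # only this part uses the pool
        p.close()
        p.join()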

If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.

So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.

Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
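You can see that overhead swamping the work with a quick timing comparison (my own sketch, not from the original post; absolute numbers will vary by machine). Expect the pool version to come out much slower here, since each item pays the full round trip for no computation:

    import time
    import multiprocessing

    def f(x):
        return x

    if __name__ == '__main__':
        data = range(1000000)

        start = time.time()
        res = map(f, data)  # plain serial map: no pickling, no queues
        print 'serial map: %.2fs' % (time.time() - start)

        p = multiprocessing.Pool()
        start = time.time()
        # every item is pickled, queued, unpickled, repickled, and requeued
        res = list(p.imap_unordered(f, data))
        print 'pool map:   %.2fs' % (time.time() - start)
        p.close()
        p.join()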

From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.

This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.

If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happen with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.

But in CPython 3.4, the pickling and unpickling happen outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
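Schematically, the difference looks something like this (a loose paraphrase of the relevant logic, not the literal CPython source):

    # CPython 2.7's SimpleQueue.put, roughly: pickling happens while the lock is held
    def put(self, obj):
        self._wlock.acquire()
        try:
            self._writer.send(obj)  # send() pickles obj -- inside the lock
        finally:
            self._wlock.release()

    # CPython 3.4's SimpleQueue.put, roughly: pickle first, lock only to write the bytes
    def put(self, obj):
        obj = ForkingPickler.dumps(obj)  # pickling happens outside the lock
        with self._wlock:
            self._writer.send_bytes(obj)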

Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.

And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.

A few observations:

  • In your real code, if you can find a way to batch up larger tasks (explicitly; just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem. See the batching sketch after this list.
  • If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (the sketch after the second answer shows this pattern).
  • If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
  • If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
  • Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
  • It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
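For the first bullet, explicit batching might look like this (my own illustration; f_batch and the batch size of 10000 are arbitrary choices, not anything from the OP's code):

    import multiprocessing

    def f(x):
        return x

    def f_batch(batch):
        # one task = one sizeable list, so the pickle/queue round trip
        # is paid once per batch instead of once per item
        return [f(x) for x in batch]

    def batches(seq, size):
        for i in xrange(0, len(seq), size):
            yield seq[i:i+size]

    if __name__ == '__main__':
        data = range(1000000)
        p = multiprocessing.Pool()
        results = []
        for out in p.imap_unordered(f_batch, batches(data, 10000)):
            results.extend(out)
        p.close()
        p.join()

For the second bullet, Pool accepts initializer and initargs arguments, so a large read-only object can be sent to each worker once at startup instead of with every task.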

2

The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.

ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.

To make things clearer, consider that this piece of code...

    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

...is equivalent to this:

    for n in xrange(1, 100):
        arg = ngrams(rand_str, n)
        res = list(mapper(f, arg))

Also the following is a CPU-intensive operation that is being performed on your main process:

    num = 100000000
    rand_list = random.sample(xrange(100000000), num)

You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
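As a rough sketch of the first option (mine, not the answerer's code; init_worker and work are made-up names), ngrams() can run inside the workers, with a pool initializer so the big string is shipped to each worker once instead of with every task:

    import multiprocessing
    import random

    _rand_str = None

    def init_worker(rand_str):
        # runs once in each worker process at pool startup
        global _rand_str
        _rand_str = rand_str

    def ngrams(input_tmp, n):
        words = input_tmp.split()
        if n > len(words):
            n = len(words)
        return [words[i:i+n] for i in range(len(words) - n + 1)]

    def work(n):
        # the heavy computation now happens in a worker, not the main process
        return ngrams(_rand_str, n)

    if __name__ == '__main__':
        num = 10000  # keep this modest; huge values exhaust memory
        rand_list = random.sample(xrange(100000000), num)
        rand_str = ' '.join(str(i) for i in rand_list)
        p = multiprocessing.Pool(initializer=init_worker, initargs=(rand_str,))
        for res in p.imap_unordered(work, xrange(1, 100)):
            pass  # each res is the full n-gram list for one n
        p.close()
        p.join()

Note that the n-gram lists still have to be pickled on the way back to the main process, so for huge outputs the overhead discussed in the first answer still applies.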

4 Comments

  • But ngrams finishes before anything gets mapped at all, and even after it finishes, the whole thing is still on one core, so this isn't the explanation.
  • @abarnert: starting a huge number of processes that do nothing but exit soon will never put a high load on the CPUs.
  • He's not starting a huge number of processes, he's starting 8 processes in a pool. And they don't do nothing—they pull a huge value off a shared queue, unpickle it, repickle it, and put it on a shared queue. And that definitely puts a high load on the CPUs—but only on one core.
  • The question is, why only on one core? And I'm pretty sure the answer is either (a) the queue contention itself swamps the pickling work, or (b) the pickling is happening with the queue locked; I'm still trying to rule out (b).
