Conversation

@pablogsal (Member) commented Jan 9, 2019

This is a very simple fix for this problem using Process.sentinel. If you want a more sophisticated solution, please advise.

import multiprocessing
import time

CONCURRENCY = 1
NTASK = 100

def noop():
    pass

with multiprocessing.Pool(CONCURRENCY, maxtasksperchild=1) as pool:
    start_time = time.monotonic()
    results = [pool.apply_async(noop, ()) for _ in range(NTASK)]
    for result in results:
        result.get()
    dt = time.monotonic() - start_time
    pool.terminate()
    pool.join()

print("Total: %.1f sec" % dt)

Before this PR:

Total: 10.2 sec 

After this PR:

Total: 0.5 sec 

https://bugs.python.org/issue35493

while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
    pool._maintain_pool()
    time.sleep(0.1)
    pool._wait_for_updates(timeout=0.2)
@pablogsal (Member Author) Jan 9, 2019

The timeout is needed for detecting changes in thread._state

Member

asyncio uses the "self-pipe" pattern to wake up itself when it gets an event from a different thread or when it gets a Unix signal. Would it be possible to use a self-pipe (or something else) to wake up the wait when thread._state changes?
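For readers unfamiliar with the pattern, here is a minimal sketch of a self-pipe (illustrative only, not code from asyncio or this PR): a waiter blocks in select() on the read end of a pipe, and any other thread wakes it immediately by writing a byte to the write end.

```python
import os
import select

# Minimal self-pipe sketch: the waiter blocks in select() on the read
# end; any other thread (or a signal handler) writes one byte to the
# write end to wake it up immediately.
r, w = os.pipe()

def wake():
    os.write(w, b"\0")

def wait_for_event(timeout):
    ready, _, _ = select.select([r], [], [], timeout)
    if ready:
        os.read(r, 512)  # drain pending wake-up bytes
        return True
    return False  # timed out with no wake-up

wake()
print(wait_for_event(0.0))  # True: the pending byte wakes us without blocking
```

The same idea works with any waitable object; the PR under discussion ends up waiting on Connection/Process sentinels instead of a raw pipe.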

Member Author

We would also need to handle the case when pool._cache is empty.

Member

> @vstinner I have used the self-pipe pattern to receive notifications on pool._cache and thread._state changes. I maintained the 0.2 timeout for making sure the old behaviour is maintained if some change is not notified using the self._change_notifier queue (by mistake or because of external reasons).

Why 0.2 and not 0.1 or 1.0? I understand that replacing 0.1 with 0.2 doubles the latency of the thread pool. Am I right?

Member Author

I used 0.2 to maintain backwards compatibility with the old behaviour. See my other comment explaining why I maintained the timeout.

@pablogsal (Member Author) commented Jan 12, 2019

@vstinner I have used the self-pipe pattern to receive notifications on pool._cache and thread._state changes. I maintained the 0.2 timeout for making sure the old behaviour is maintained if some change is not notified using the self._change_notifier queue (by mistake or because of external reasons).



@classmethod
def _terminate_pool(cls, taskqueue, inqueue, outqueue, pool,
def _terminate_pool(cls, taskqueue, inqueue, outqueue, pool, pool_notifier,
Member

I would prefer to reuse the same variable name: pool_notifier => change_notifier.

@bedevere-bot
When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal force-pushed the bpo35493 branch 8 times, most recently from 7ba415b to fc3fa24 (January 14, 2019 22:09)
@vstinner (Member) left a comment

Oh wow, multiprocessing is so complex... I'm happy that @pablogsal handles it instead of me :-D

*self_notifier_sentinels]
wait(sentinels, timeout=timeout)
while not self._change_notifier.empty():
    self._change_notifier.get()
Member

I don't think that this code is safe, it looks like a race condition: https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use

I suggest calling get(block=False) in a loop until you get an Empty exception.

Note: the race condition is not really critical, since it's fine if we miss a few events.

Member

There is no race condition as long as this is the only thread that pops from the queue.

Member Author

Also, sadly queue.SimpleQueue has no block=False option (we could add one, but I think it is not needed).
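For reference, the drain pattern under discussion can be sketched like this (a minimal illustration assuming a queue.SimpleQueue notifier; it is safe with a single consumer, and a missed wake-up only costs one extra timeout cycle):

```python
import queue

notifier = queue.SimpleQueue()

def drain(q):
    # Drain pending wake-up tokens. This is safe as long as this is the
    # only consumer: empty() can only flip to True under us if someone
    # else calls get(), which never happens in the single-consumer case.
    count = 0
    while not q.empty():
        q.get()
        count += 1
    return count

for _ in range(3):
    notifier.put(None)
print(drain(notifier))  # 3
```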

sentinels = [*task_queue_sentinels,
             *worker_sentinels,
             *self_notifier_sentinels]
wait(sentinels, timeout=timeout)
Member

Can you please add a comment on wait() to explain that it completes when at least one sentinel is set, and that it's important not to wait until all sentinels complete but to exit frequently to refresh the pool?

This point is non-trivial and it surprised me when I wrote PR #11136, my comment #11136 (comment):

> My change doesn't work: self._worker_state_event isn't set when a worker completes, whereas _maintain_pool() should be called frequently to check when a worker completed.
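The point about wait() returning on the *first* ready sentinel can be illustrated with a small sketch (reap_one is a hypothetical helper, not code from the PR):

```python
import multiprocessing
import time
from multiprocessing.connection import wait

# Illustrative sketch: multiprocessing.connection.wait() returns as soon
# as *any* sentinel is ready, not when all of them are. A maintenance
# loop therefore wakes on each completion and can refresh the pool
# (e.g. reap finished workers) instead of sleeping until everyone exits.
def reap_one(procs, timeout=30.0):
    sentinels = [p.sentinel for p in procs]
    ready = wait(sentinels, timeout=timeout)  # first completion wakes us
    return [p for p in procs if p.sentinel in ready]

if __name__ == "__main__":
    fast = multiprocessing.Process(target=time.sleep, args=(0.1,))
    slow = multiprocessing.Process(target=time.sleep, args=(30.0,))
    fast.start()
    slow.start()
    done = reap_one([fast, slow])  # returns once `fast` exits
    print(done == [fast])
    slow.terminate()
    fast.join()
    slow.join()
```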

Member

I'm not asking to document the behavior of wait(), but to be more explicit that we stop as soon as the first event completes, on purpose.

Member

It looks obvious to me, especially as the function is named wait_for_updates, but I guess it doesn't hurt to add a comment.

@@ -0,0 +1,3 @@
Use :func:`multiprocessing.connection.wait` instead of polling each 0.2
Member

Oops, I clicked on the wrong button :-( Your NEWS entry still doesn't explain that only process pools are affected.

Another issue: "polling each 0.2 seconds": currently the code uses 0.1 seconds.

Member Author

Sorry, I completely forgot about that while fighting Windows issues :/

@pablogsal (Member Author)

I made some changes to eliminate some problems that I found on Windows. I have run all multiprocessing tests in a loop with this patch manually on almost all our Windows buildbots and they pass without problems, so I think the pipe/socket solution is also resilient on Windows.

@pablogsal (Member Author)

@vstinner @pitrou I think I have addressed all the review comments. Could you review again? Thanks!

@pablogsal force-pushed the bpo35493 branch 2 times, most recently from f271edb to 71d5fc9 (February 12, 2019 20:11)
@pablogsal (Member Author) commented Feb 12, 2019

@vstinner @pitrou I had to rebase since the changes in https://bugs.python.org/issue35378 make handle_workers independent of the pool itself, so we cannot rely on the thread keeping the pool alive or on using self in the _wait_for_updates function. Some new changes are needed; for example, gathering the sentinels needs to be done in the constructor to avoid references to self:

https://github.com/python/cpython/pull/11488/files#diff-2d95253d6de7bbeebbeb131c5f3aecd9R213

The if blocks are needed so that the ThreadPool constructor does not crash.

Also, the minimal timeout still needs to exist because, now that the thread does not keep the pool alive, it needs a way to exit from _wait_for_updates if the pool does not unblock the sentinels because it is dead.

To avoid hanging if the pool dies too quickly, I have changed __del__ to push a notification to unblock the worker thread:

https://github.com/python/cpython/pull/11488/files#diff-2d95253d6de7bbeebbeb131c5f3aecd9R269

@pitrou (Member) left a comment

A few minor points below.

else:
    self_notifier_sentinels = []

sentinels = [*task_queue_sentinels, *self_notifier_sentinels]
Member

How about a dedicated method that returns this list and that you can override in ThreadPool?

maxtasksperchild, wrap_exception)

worker_sentinels = [worker.sentinel for worker in
                    pool if hasattr(worker, "sentinel")]
Member

Same here: how about a dedicated method to get worker sentinels that you can override in ThreadPool?

Member Author

The problem here is that we don't have a reference to the pool object (self) after https://bugs.python.org/issue35378, so we can only call class/static methods. The most we can do here is add a static/class method that takes the list of workers (called pool here) and returns the list of sentinels.

What do you think?

Member

That sounds fine to me.

self._setup_queues()
self._taskqueue = queue.SimpleQueue()
self._cache = {}
# The _change_notifier queue exist to wake ip self._handle_workers()
Member

"wake up"

self._taskqueue = queue.SimpleQueue()
self._cache = {}
# The _change_notifier queue exist to wake ip self._handle_workers()
# when the cache (self._cache) is empty or when ther is a change in
Member

"when there is"

def __delitem__(self, item):
    super().__delitem__(item)
    if not self:
        self.notifier.put(None)
Member

Can you add a comment explaining why it's important to wake up when the cache is emptied?

@vstinner (Member)

I tested manually example from https://bugs.python.org/issue35493#msg331797:

  • Without the change: 10.2 sec
  • With the change: 0.3 sec

It's 34x faster, nice :-)

@vstinner (Member) left a comment

LGTM, but I would prefer that @pitrou or @applio also review the change.

LGTM means that I reviewed the change and it seems like you covered all the changes which can wake up _wait_for_updates().

It seems like @pitrou proposed a different implementation, but I don't really care about this level of detail. I let you deal with that :-)

I tested manually that the PR fixes the bug that I reported: see my previous comment and the initial message at https://bugs.python.org/issue35493#msg331797.

if self._state == RUN:
    _warn(f"unclosed running multiprocessing pool {self!r}",
          ResourceWarning, source=self)
if getattr(self, '_change_notifier') is not None:
Member

Why not just self._change_notifier?

getattr(obj, attr) raises AttributeError if the attribute doesn't exist. Maybe you want to write getattr(self, '_change_notifier', None)?
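The difference between the two forms in a nutshell (a plain illustration of built-in getattr behavior, not code from the PR):

```python
class Pool:
    pass

p = Pool()

# Two-argument getattr raises AttributeError for a missing attribute,
# so it is no safer than plain attribute access here.
try:
    getattr(p, "_change_notifier")
except AttributeError:
    print("raised")

# The three-argument form returns the default instead of raising, which
# is what you want in __del__, where __init__ may not have completed.
print(getattr(p, "_change_notifier", None))  # None
```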

"""
def __init__(self, *args, notifier=None, **kwds):
    self.notifier = notifier
    super().__init__(*args, **kwds)
Member

PEP 8, please add an empty line between methods.

@pablogsal (Member Author)

@pitrou Commit 41cf470 should address everything. Could you take a final look?

@pablogsal pablogsal requested a review from pitrou March 10, 2019 22:17
@pitrou (Member) commented Mar 16, 2019

I added a minor comment, otherwise LGTM. Thanks @pablogsal !

@pablogsal pablogsal merged commit 7c99454 into python:master Mar 16, 2019
@bedevere-bot

@pablogsal: Please replace # with GH- in the commit message next time. Thanks!

@pablogsal pablogsal deleted the bpo35493 branch March 16, 2019 22:34
@pablogsal (Member Author)

Hummm... weird. GitHub notified me that there was a problem merging the PR, and when I clicked "retry" it did not use the message I wrote for the commit.

Anyway, thank you to everyone who participated in the review :)

@rgommers (Contributor) commented Apr 10, 2020

@pablogsal this seems to have caused https://bugs.python.org/issue38501 (hangs on both macOS and Windows with Python 3.8). That bug shows up in multiple SciPy modules as hangs (e.g. scipy/scipy#11835). Could you please have a look at [bpo-38501](https://bugs.python.org/issue38501)?
