Profile dask distributed clusters with py-spy.
import dask import distributed from dask_pyspy import pyspy client = distributed.Client() df = dask.datasets.timeseries( start="2000-01-01", end="2000-01-14", partition_freq="1h", freq="60s", ) with pyspy("worker-profiles"): df.set_index("id").mean().compute()Using pyspy or pyspy_on_scheduler attaches a profiler to the Python process, records a profile, and sends the file(s) back to the client.
By default, py-spy profiles are recorded in speedscope format.
dask-pyspy (and, transitively, py-spy) must be installed in the environment where the scheduler is running.
dask-pyspy tries hard to work out-of-the-box, but if your cluster is running inside Docker, or on macOS, you'll need to configure things so it's allowed to run. See the privileges for py-spy section.
python -m pip install dask-pyspy Make sure this package is also installed in the software environment of your cluster!
The pyspy and pyspy_on_scheduler functions are context managers. Entering them starts py-spy on the workers / scheduler. Exiting them stops py-spy, sends the profile data back to the client, and writes it to disk.
with pyspy_on_scheduler("scheduler-profile.json"): # Profile the scheduler. # Writes to the `scheduler-profile.json` file locally. x.compute() with pyspy("worker-profiles"): # Most basic usage. # Writes a profile per worker to the `worker-profiles` directory locally. # Files are named by worker addresses. x.compute() with pyspy("worker-profiles", native=True): # Collect stack traces from native extensions written in Cython, C or C++. # You should usually turn this on to get much richer profiling information. # However, only recommended when your cluster is running on Linux. x.compute() with pyspy("worker-profiles", workers=2): # Only profile 2 workers (selected randomly) x.compute() with pyspy("worker-profiles", workers=['tcp://10.0.1.2:4567', 'tcp://10.0.1.3:5678']): # Profile specific workers by specifying their addresses x.compute() with pyspy("worker-profiles", format="flamegraph", gil=True, idle=False, nonblocking=True, extra_pyspy_args=["--foo", "bar"]): # Look, you can pass any arguments you want to `py-spy`! # Refer to the `py-spy` command reference for what these mean. # You don't usually need to do this though. We've picked good defaults for you. x.compute() with pyspy("worker-profiles", log_level="info"): # Set py-spy's internal log level. # Useful if py-spy isn't behaving. # Refer to the ``env_logger`` crate for details: # https://docs.rs/env_logger/latest/env_logger/index.html#enabling-logging x.compute()For more information, refer to the docstrings of the functions.
By default, profiles are recorded in speedscope format, so just drop them into https://www.speedscope.app to view them.
This is a handy pattern:
with pyspy("worker-profiles"): input("Press enter when done profiling") # or maybe: # time.sleep(10)Ways you can use it:
- Open a second terminal/Jupyter session/etc.
- In that session, connect to your existing cluster.
- Run the block above. Press enter to stop profiling once you feel like you've got enough.
persisted = my_thing.persist() with pyspy("worker-profiles"): # Watch the dashboard, hit enter when you think you've got enough. input("Press enter when done profiling") # optional, to get actual result: persisted.compute() del persistedtl;dr:
- On macOS clusters, you have to launch your cluster with
sudo - For
docker run, pass--cap-add SYS_PTRACE, or download this newerseccomp.jsonfile and use--seccomp=default.json. - On Windows clusters, you're on your own, sorry.
You may need to run the dask process as root for py-spy to be able to profile it (especially on macOS). See https://github.com/benfred/py-spy#when-do-you-need-to-run-as-sudo.
In a Docker container, dask-pyspy will "just work" for Docker/moby versions >= 21.xx. As of right now (Nov 2022), Docker 21.xx doesn't exist yet, so read on.
moby/moby#42083 allowlisted by default the process_vm_readv system call that py-spy uses, which used to be blocked unless you set --cap-add SYS_PTRACE. Allowing this specific system call in unprivileged containers has been safe to do for a while (since linux kernel versions > 4.8), but just wasn't enabled in Docker. So your options right now are:
- (low/no security impact) Download the newer
seccomp.jsonfile from moby/master and pass it to Docker via--seccomp=default.json. - (more convenient) Pass
--cap-add SYS_PTRACEto Docker. This enables more than you need, but it's one less step.
On Ubuntu-based containers, ptrace system calls are further blocked: processes are prohibited from ptracing each other even within the same UID. To work around this, dask-pyspy automatically uses prctl(2) to mark the scheduler process as ptrace-able by itself and any child processes, then launches py-spy as a child process.
- If you're running something that crashes your cluster, you probably won't be able to get a profile out of it. Transferring results back to the client relies on a stable connection and things in dask all working properly.
- Profiling slows things down. Especially if using
pyspy_on_scheduler, expect noticeably slower results. This is probably not a thing you want to have always-on. - This package is very much in development. I made it for my personal use and am sharing in case it's useful. Please don't be mad if it breaks.
Install Poetry. To create a virtual environment, install dev dependencies, and install the package for local development:
$ poetry install There is one very very basic end-to-end test for py-spy. Running it requires Docker and docker-compose, though the building and running of the containers is managed by pytest-docker-compose, so all you have to do is:
$ pytest tests and wait a long time for the image to build and run.