Warn but proceed if we are not able to retrieve Slurm partition information #5465
Conversation
Will add a test later; no need to allow the workflow to run for now (if it's pending the approval of a contributor). Created PR after talking with @mr-c and @adamnovak on today's CWL4HPC call, just so I don't forget :)
      except Exception as e:
-         log.warning("Could not retrieve exceutor log due to: '%s'.", e)
+         log.warning("Could not retrieve executor log due to: '%s'.", e)
I was looking for an example in the project of how warnings are logged and found this random typo with the IDE 👍
ac516a6 to 5de69b2

Test added, addressed @mr-c's feedback. I think this will be difficult for a reviewer to reproduce; I am sorry. I did test it just now on BSC MN5. It passed the partition problem, but failed somewhere else (will comment in the CWL4HPC Element room). Cheers.
@kinow We actually had someone with a similar problem with their Slurm cluster not liking to have memory requested of it, and there's a

Then the workflow needs to already know to request the right number of cores to ensure it gets the right amount of memory, though, and that's not portable.

Is the fact that your cluster doesn't allow memory requests, and the amount of memory you get assigned per core for a particular partition, available in a machine-readable form?

I guess we could add a feature where you can tell Toil how much memory per core a cluster gives, and it can compute a core count to request to ensure jobs all have their minimum required memory.
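The core-count idea in that last paragraph could be sketched like this. This is only an illustration, not an existing Toil API: `cores_for_memory` and the per-cluster `mem_per_core_mib` input are hypothetical names for the proposed feature.

```python
import math

def cores_for_memory(required_mem_mib: int, mem_per_core_mib: int, min_cores: int = 1) -> int:
    # Request enough whole cores that the cluster's per-core memory grant
    # covers the job's minimum memory requirement, on clusters that
    # refuse explicit memory requests.
    return max(min_cores, math.ceil(required_mem_mib / mem_per_core_mib))

# A job needing 16 GiB on a partition that grants 2 GiB per core:
print(cores_for_memory(16384, 2048))  # 8
```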
adamnovak left a comment
This is OK, but I would adjust the test to try and keep it out of the guts of the PartitionSet, and also refactor the code to make not knowing partitions a thing that's more obviously contemplated.
      try:
          self._get_partition_info()
          self._get_gpu_partitions()
      except CalledProcessErrorStderr as e:
I think in a couple of other places we've gotten in trouble because some ways of not being able to call a child process can produce, IIRC, a FileNotFoundError or some kind of permission error.

Can we expect sinfo to always be a binary we can start?
Ah, I think that might happen too.
Can we expect sinfo to always be a binary we can start?
In Autosubmit, we have a mysterious bug that happens very rarely on... BSC MN5 and CSC LUMI, I think? Definitely BSC MN5. What happens is that sometimes sbatch cannot be found, or just fails to run at all with a random error. We detected that while trying to understand why our workflow retries were not working.

So in the ideal world, yes, it will always be a binary.

In the real world, I think it's quite possible a parallel file system or NFS might just get out of sync, lose its connection to the metadata server, or hit a random vendor bug and cause Slurm commands to disappear completely.

Should we watch for both CalledProcessErrorStderr and FileNotFoundError here (or another one; I can test calling a random binary name to confirm the exact exception)?
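The quick check mentioned here can be run anywhere: invoking a binary that does not exist (the name below is made up) raises FileNotFoundError before the child ever starts, so it never becomes any flavor of CalledProcessError.

```python
import subprocess

# The exec fails before a child process exists, so no exit status is
# ever produced; Python raises FileNotFoundError instead.
try:
    subprocess.check_output(["no-such-binary-for-testing"])
except FileNotFoundError as e:
    print(type(e).__name__)  # FileNotFoundError
```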
Since this happens once at Toil startup, I think we want the case where, at that very moment, an sinfo binary that should exist wasn't available, to result in a fatal error. We don't want Toil to behave differently for the whole run because of a transient error during startup.
So if we don't expect any sites to deliberately not have an sinfo binary at all, then we should not handle FileNotFoundError and we should let it kill Toil.
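A minimal sketch of the agreed behavior, not Toil's actual code: the stdlib `subprocess.CalledProcessError` stands in for Toil's `CalledProcessErrorStderr`, and the sinfo arguments are the ones visible in the test log later in this thread. A failing sinfo run produces a warning and an empty result; a missing sinfo binary raises FileNotFoundError and is left to kill the startup.

```python
import logging
import subprocess
from typing import Optional

log = logging.getLogger(__name__)

def load_partition_info() -> Optional[list]:
    # Warn and proceed without partition info when sinfo runs but fails.
    # Deliberately do NOT catch FileNotFoundError: a missing sinfo at
    # startup should be a fatal error, not a silently degraded run.
    try:
        out = subprocess.check_output(
            ["sinfo", "-a", "-o", "%P %G %l %p %c %m"], text=True
        )
    except subprocess.CalledProcessError as e:
        log.warning("Could not retrieve Slurm partition info due to: '%s'.", e)
        return None
    return out.splitlines()
```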
          # NOTE: This is by design. The class has a type annotation, but no variable
          # assignment; so upon a failure, nothing is created. Maybe it would be better
          # to always have a variable with a type (even if Optional[Any]).
          self.assertFalse(hasattr(partition_set, 'all_partitions'))
I think you are right and it would make sense to refactor this to be nullable and not just completely absent for the case where we don't know anything about partitions.
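A minimal sketch of that nullable refactor, using a simplified stand-in for the real PartitionSet: giving the annotation a `None` default means the attribute always exists, and "we don't know the partitions" is represented explicitly rather than by absence.

```python
from typing import Optional

class PartitionSet:
    # Simplified stand-in for Toil's class. An annotation *with* a
    # default means the attribute exists even when partition detection
    # never ran; None explicitly encodes "partitions unknown".
    all_partitions: Optional[list] = None

ps = PartitionSet()
print(hasattr(ps, "all_partitions"))  # True
print(ps.all_partitions is None)      # True
```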
It also looks like this PR is triggering a bug(?) in sphinx-autoapi, where during https://ucsc-ci.com/databiosphere/toil/-/jobs/105157#L1213

We then fail the docs build, because a "warning" during the docs build might be a broken link. Sometimes a duplicate import can cause this, but I don't see any here immediately; it might be necessary to delete bits of the code until
5de69b2 to a89502d
Oh, it was a duplicated import, @adamnovak! Sorry. Removed that, tested before/after the change locally, and removing the duplicate fixed the build, as you commented. Rebased, and edited my commit.
BTW, tested this PR on BSC MN5 with
Just writing the error message from the GitLab CI/CD runner here so I can access it more easily later.

SKIPPED [1] src/toil/test/wdl/wdltoil_test_kubernetes.py:52: Set TOIL_TEST_INTEGRATIVE=True to include this integration test, or run `make integration_test_local` to run all integration tests.
SKIPPED [1] src/toil/test/src/jobTest.py:40: Skipped because TOIL_TEST_QUICK is "True"
FAILED src/toil/test/batchSystems/test_slurm.py::SlurmTest::test_sinfo_not_permitted - AttributeError: 'PartitionSet' object has no attribute 'all_partitions'. Did you mean: 'get_partition'?
= 1 failed, 265 passed, 608 skipped, 1 xpassed, 40 warnings, 32 subtests passed in 467.05s (0:07:47) =
make: *** [Makefile:193: test_offline] Error 1

I think I ran the test I changed in PyCharm. I did get this error before, which is why I added a comment in the code about the variable having only a type annotation and no default value, so it's never created... that sounds a lot like the error above. Will have a look at it later.
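That AttributeError matches how bare class-level annotations behave in Python; a minimal reproduction (the class name is illustrative):

```python
class Annotated:
    # A bare class-level annotation records the type in __annotations__
    # but creates no class attribute and no instance attribute, so
    # accessing it on an instance raises AttributeError.
    value: int

obj = Annotated()
print(hasattr(obj, "value"))                 # False
print("value" in Annotated.__annotations__)  # True
```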
Yeah, I was wondering how it worked in the case where we didn't actually run partition detection, since there weren't a bunch of

I think the right answer is to default the data structures to
Sounds good to me. Let me try that, and run this test locally too (in PyCharm and on the command line now). Thanks Adam.
Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>
9dc03f3 to 05a552b
Set the properties to None, updated types. Ran the Slurm tests in the IDE; all passed:
Mypy passed:
(venv) kinow@ranma:~/Development/python/workspace/toil$ make mypy
MYPYPATH=/home/kinow/Development/python/workspace/toil/contrib/mypy-stubs mypy --strict /home/kinow/Development/python/workspace/toil/src/toil/{cwl/cwltoil.py,test/cwl/cwlTest.py}
Success: no issues found in 2 source files
/home/kinow/Development/python/workspace/toil/contrib/admin/mypy-with-ignore.py
/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (7.0.1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
Success: no issues found in 131 source files

Running export TOIL_TEST_QUICK=True; make test there were 11 failures, but reading the logs I'm not sure if they are caused by the Slurm change. e.g.,
================================================================= FAILURES =================================================================
_________________________________________________ [doctest] toil.deferred.DeferredFunction _________________________________________________
[gw0] linux -- Python 3.13.7 /home/kinow/Development/python/workspace/toil/venv/bin/python
036
037 >>> from collections import defaultdict
038 >>> df = DeferredFunction.create(defaultdict, None, {'x':1}, y=2)
039 >>> df
040 DeferredFunction(defaultdict, ...)
041 >>> df.invoke() == defaultdict(None, x=1, y=2)
Expected:
    True
Got nothing
/home/kinow/Development/python/workspace/toil/src/toil/deferred.py:41: DocTestFailure
----------------------------------------------------------- Captured stdout call -----------------------------------------------------------
True
------------------------------------------------------------ Captured log call -------------------------------------------------------------
20:26:47 DEBUG Module dir is /usr/lib/python3.13, our prefix is /home/kinow/Development/python/workspace/toil/venv, virtualenv: True
20:26:47 DEBUG Running deferred function DeferredFunction(defaultdict, ...).
20:26:47 WARNING The localize() method should only be invoked on a worker.
____________________________________________________ [doctest] toil.lib.retry.old_retry ____________________________________________________
[gw0] linux -- Python 3.13.7 /home/kinow/Development/python/workspace/toil/venv/bin/python
553 >>> false = lambda _:False
554 >>> i = 0
555 >>> for attempt in old_retry( delays=[0], timeout=.1, predicate=true ):
556 ...     with attempt:
557 ...         i += 1
558 ...         raise RuntimeError('foo')
559 Traceback (most recent call last):
560 ...
561 RuntimeError: foo
562 >>> i > 1
Expected:
    True
Got nothing
/home/kinow/Development/python/workspace/toil/src/toil/lib/retry.py:562: DocTestFailure
...
20:26:47 ERROR Got foo and no time is left to retry
__________________________________________________________ [doctest] toil.lib.web __________________________________________________________
[gw0] linux -- Python 3.13.7 /home/kinow/Development/python/workspace/toil/venv/bin/python
015
016 Contains functions for making web requests with Toil.
017
018 All web requests should go through this module, to make sure they use the right
019 user agent.
020 >>> httpserver = getfixture("httpserver")
021 >>> handler = httpserver.expect_request("/path").respond_with_json({})
022 >>> from toil.lib.web import web_session
023 >>> web_session.get(httpserver.url_for("/path"))
Expected:
    <Response [200]>
Got nothing
/home/kinow/Development/python/workspace/toil/src/toil/lib/web.py:23: DocTestFailure
----------------------------------------------------------- Captured stdout call -----------------------------------------------------------
<Response [200]>
------------------------------------------------------------ Captured log call -------------------------------------------------------------
20:26:53 INFO 127.0.0.1 - - [17/Mar/2026 20:26:53] "GET /path HTTP/1.1" 200 -

This is the only failure that seemed to be related:
FAILED src/toil/test/cwl/cwlTest.py::TestCWLWorkflow::test_slurm_node_memory - subprocess.TimeoutExpired: Command '['toil-cwl-runner', '--jobStore=/tmp/pytest-of-kinow/pytest-4/test_slurm_node_memory0/jobStoreDir',...

But I cannot tell if it's failing due to my changes, or if it's something missing in my environment (I had to install several requirements-* files to fix some WDL/Kubernetes/etc. tests).
/home/kinow/Development/python/workspace/toil/venv/bin/python /home/kinow/Development/python/pycharm-2024.1.4/plugins/python-ce/helpers/pycharm/_jb_pytest_runner.py --target toil/test/cwl/cwlTest.py::TestCWLWorkflow.test_slurm_node_memory
Testing started at 8:48 PM ...
Launching pytest with arguments toil/test/cwl/cwlTest.py::TestCWLWorkflow::test_slurm_node_memory --no-header --no-summary -q in /home/kinow/Development/python/workspace/toil/src
/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (7.0.1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
============================= test session starts ==============================
collecting ... collected 1 item
toil/test/cwl/cwlTest.py::TestCWLWorkflow::test_slurm_node_memory
=================== 1 failed, 1 warning in 80.71s (0:01:20) ====================
FAILED [100%]
/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (7.0.1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/htcondor/__init__.py:49: UserWarning: Neither the environment variable CONDOR_CONFIG, /etc/condor/, /usr/local/etc/, nor ~condor/ contain a condor_config source. Therefore, we are using a null condor_config.
  _warnings.warn(message)
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.lib.history] Recording workflow creation of 39286742-d094-40c9-842d-dfa5f4bc9b21 in file:/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/jobStoreDir
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.cwl.cwltoil] Importing tool-associated files...
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.cwl.cwltoil] Importing input files...
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.cwl.cwltoil] Starting workflow
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.lib.history] Workflow 39286742-d094-40c9-842d-dfa5f4bc9b21 is a run of /home/kinow/Development/python/workspace/toil/src/toil/test/cwl/measure_default_memory.cwl
sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source
[2026-03-17T20:48:11+0100] [MainThread] [W] [toil.batchSystems.slurm] Could not retrieve Slurm partition info due to: 'Command '['sinfo', '-a', '-o', '%P %G %l %p %c %m']' exit status 1: sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source
'.
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil] Running Toil version 9.3.0a1-71d0b8fe4e0a372ab0188c1d60effc179e8acc52 on host ranma.
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2026-03-17T20:48:11+0100] [MainThread] [I] [toil.leader] Issued job 'CWLJob' measure_default_memory.cwl kind-CWLJob/instance-n43j3gny v1 with job batch system ID: 1 and disk: 3.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source
[2026-03-17T20:48:12+0100] [Thread-2] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Errored operation submitJob, code 1: sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source
[2026-03-17T20:48:12+0100] [Thread-2] [I] [toil.lib.retry] Got Command '['sbatch', '-J', 'toil_job_1_measure_default_memory.cwl', '--signal=B:INT@30', '--export=ALL,TOIL_RT_LOGGING_ADDRESS=192.168.1.28:58406,TOIL_RT_LOGGING_LEVEL=INFO,OMP_NUM_THREADS=1', '--mem=0', '--cpus-per-task=1', '-o', '/tmp/toil_39286742-d094-40c9-842d-dfa5f4bc9b21.1.%j.out.log', '-e', '/tmp/toil_39286742-d094-40c9-842d-dfa5f4bc9b21.1.%j.err.log', '--wrap=exec _toil_worker CWLJob file:/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/jobStoreDir kind-CWLJob/instance-n43j3gny --context gASVzgAAAAAAAACMIXRvaWwuYmF0Y2hTeXN0ZW1zLmNsZWFudXBfc3VwcG9ydJSMFFdvcmtlckNsZWFudXBDb250ZXh0lJOUKYGUfZSMEXdvcmtlckNsZWFudXBJbmZvlIwldG9pbC5iYXRjaFN5c3RlbXMuYWJzdHJhY3RCYXRjaFN5c3RlbZSMEVdvcmtlckNsZWFudXBJbmZvlJOUKE5OjCQzOTI4Njc0Mi1kMDk0LTQwYzktODQyZC1kZmE1ZjRiYzliMjGUjAZhbHdheXOUdJSBlHNiLg==']' exit status 1: sbatch: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sbatch: error: fetch_config: DNS SRV lookup failed
sbatch: error: _establish_config_source: failed to fetch config
sbatch: fatal: Could not establish a configuration source
, trying again in 1s.
squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
[2026-03-17T20:48:13+0100] [MainThread] [E] [toil.batchSystems.abstractGridEngineBatchSystem] Errored operation getRunningJobIDs, code 1: squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
[2026-03-17T20:48:13+0100] [MainThread] [I] [toil.lib.retry] Got Command '['squeue', '-h', '--format', '%i %t %M']' exit status 1: squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
, trying again in 1s.
[2026-03-17T20:49:11+0100] [MainThread] [I] [toil.utils.toilKill] Toil process 118499 successfully terminated.
[2026-03-17T20:49:12+0100] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2026-03-17T20:49:12+0100] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
src/toil/test/cwl/cwlTest.py:570 (TestCWLWorkflow.test_slurm_node_memory)
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/toil/src/toil/test/cwl/cwlTest.py", line 600, in test_slurm_node_memory
    output, _ = child.communicate(timeout=60)
                ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/usr/lib/python3.13/subprocess.py", line 1222, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/subprocess.py", line 2129, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/subprocess.py", line 1269, in _check_timeout
    raise TimeoutExpired(
    ...<2 lines>...
        stderr=b''.join(stderr_seq) if stderr_seq else None)
subprocess.TimeoutExpired: Command '['toil-cwl-runner', '--jobStore=/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/jobStoreDir', '--clean=never', '--batchSystem=slurm', '--no-cwl-default-ram', '--slurmDefaultAllMem=True', '--outdir', '/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/outdir', '/home/kinow/Development/python/workspace/toil/src/toil/test/cwl/measure_default_memory.cwl']' timed out after 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/runner.py", line 353, in from_call
    result: TResult | None = func()
                             ~~~~^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/runner.py", line 245, in <lambda>
    lambda: runtest_hook(item=item, **kwds),
            ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 167, in _multicall
    raise exception
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
    ~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/logging.py", line 850, in pytest_runtest_call
    yield
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
    ~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/capture.py", line 900, in pytest_runtest_call
    return (yield)
           ^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
    ~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 53, in run_old_style_hookwrapper
    return result.get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_result.py", line 103, in get_result
    raise exc.with_traceback(tb)
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 38, in run_old_style_hookwrapper
    res = yield
          ^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 139, in _multicall
    teardown.throw(exception)
    ~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/skipping.py", line 268, in pytest_runtest_call
    return (yield)
           ^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 121, in _multicall
    res = hook_impl.function(*args)
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/runner.py", line 179, in pytest_runtest_call
    item.runtest()
    ~~~~~~~~~~~~^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/python.py", line 1720, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 167, in _multicall
    raise exception
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/pluggy/_callers.py", line 121, in _multicall
    res = hook_impl.function(*args)
  File "/home/kinow/Development/python/workspace/toil/venv/lib/python3.13/site-packages/_pytest/python.py", line 166, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/kinow/Development/python/workspace/toil/src/toil/test/cwl/cwlTest.py", line 607, in test_slurm_node_memory
    child.communicate(timeout=20)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/usr/lib/python3.13/subprocess.py", line 1222, in communicate
    stdout, stderr =
self._communicate(input, endtime, timeout) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.13/subprocess.py", line 2129, in _communicate self._check_timeout(endtime, orig_timeout, stdout, stderr) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.13/subprocess.py", line 1269, in _check_timeout raise TimeoutExpired( ...<2 lines>... stderr=b''.join(stderr_seq) if stderr_seq else None) subprocess.TimeoutExpired: Command '['toil-cwl-runner', '--jobStore=/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/jobStoreDir', '--clean=never', '--batchSystem=slurm', '--no-cwl-default-ram', '--slurmDefaultAllMem=True', '--outdir', '/tmp/pytest-of-kinow/pytest-5/test_slurm_node_memory0/outdir', '/home/kinow/Development/python/workspace/toil/src/toil/test/cwl/measure_default_memory.cwl']' timed out after 20 seconds Process finished with exit code 1Can someone trigger the build to see what happens, please, @mr-c or @adamnovak . Thanks!
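The failure above is `Popen.communicate(timeout=...)` raising `subprocess.TimeoutExpired`. When that fires, the `subprocess` documentation recommends killing the child and calling `communicate()` again to reap it, so a timed-out test does not leave a hung process behind. A minimal sketch of that pattern, with `sleep` standing in for the hung `toil-cwl-runner` invocation:

```python
import subprocess

# Stand-in for a hung toil-cwl-runner process: sleeps far longer than the timeout.
child = subprocess.Popen(["sleep", "60"], stdout=subprocess.PIPE)
try:
    output, _ = child.communicate(timeout=1)
except subprocess.TimeoutExpired:
    # Kill the child and reap it, as the subprocess docs recommend,
    # so the timed-out process is not leaked by the test.
    child.kill()
    output, _ = child.communicate()

# On POSIX, a killed child reports a negative return code (-signal number).
```

Pytest itself will not do this cleanup; without the `except` branch, the `TimeoutExpired` shown in the traceback propagates and the child keeps running.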
| I think the … The Makefile has some comments about this and defines a special … We never actually run the full … The failing Slurm test is because you have … I'll pull this in for CI testing. |
| Thanks @adamnovak ! Let me know if it needs any further tweaking, tests, docs, etc. 👍 |
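The change under review makes Toil's Slurm partition discovery warn and continue when the query fails, instead of aborting startup. A minimal sketch of that warn-but-proceed pattern, using stdlib exceptions in place of Toil's `CalledProcessErrorStderr`; the `sinfo -h -o %P` invocation is an assumption for illustration, not necessarily the exact query Toil runs:

```python
import logging
import subprocess

log = logging.getLogger(__name__)


def get_partitions(sinfo_cmd=("sinfo", "-h", "-o", "%P")):
    """Ask Slurm for partition names; warn and return [] if the query fails."""
    try:
        result = subprocess.run(
            sinfo_cmd, capture_output=True, text=True, check=True
        )
        return result.stdout.split()
    except (OSError, subprocess.CalledProcessError) as e:
        # Warn but proceed: job submission can still go ahead without
        # partition information, just with less informed scheduling.
        log.warning("Could not retrieve Slurm partition information: '%s'.", e)
        return []
```

With this shape, a failing or missing `sinfo` yields an empty partition list plus a warning rather than a crash, while a working query returns the parsed names.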


Closes #5461
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
`issues/XXXX-fix-the-thing` in the Toil repo, or from an external repo.
`camelCase` that want to be in `snake_case`.
docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist