
I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on the worker node(s):

[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/user/workspace/ddp/main.py", line 159, in <module>
[rank4]:     main()
[rank4]:   File "/home/user/workspace/ddp/main.py", line 90, in main
[rank4]:     ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
[rank4]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank4]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank4]:   File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank4]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank4]: ncclInternalError: Internal check failed.
[rank4]: Last error:
[rank4]: Bootstrap : no socket interface found
[rank4]:[W131 14:34:49.202068506 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0131 14:34:49.846516 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700596 closing signal SIGTERM
W0131 14:34:49.847558 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700598 closing signal SIGTERM
E0131 14:34:49.944460 2700574 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2700595) of binary: /home/user/workspace/ddp/.venv3.11/bin/python3.11
Traceback (most recent call last):
  File "/home/user/workspace/ddp/.venv3.11/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-31_14:34:49
  host      : *****
  rank      : 6 (local_rank: 2)
  exitcode  : 1 (pid: 2700597)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-31_14:34:49
  host      : *****
  rank      : 4 (local_rank: 0)
  exitcode  : 1 (pid: 2700595)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
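
For context, the setup around main.py line 90 looks roughly like this. This is a simplified sketch, not the full script: build_model() is a placeholder, the training loop is omitted, and the torchrun command in the comment uses placeholder values inferred from the ranks in the log above.

# Launched on the second node roughly as (placeholder values):
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
#            --master_addr=<head-node-ip> --master_port=29500 main.py
import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK/RANK/WORLD_SIZE in each worker's environment.
LOCAL_RANK = int(os.environ["LOCAL_RANK"])

def main():
    # Process-group init; with the nccl backend this is where NCCL gets involved.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(LOCAL_RANK)

    model = build_model().cuda(LOCAL_RANK)  # build_model() is a placeholder

    # main.py line 90 -- the call that raises the NCCL error shown above.
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model,
        find_unused_parameters=True,
        device_ids=[LOCAL_RANK],
        output_device=LOCAL_RANK,
    )
    # ... training loop omitted ...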

Environment:

  • PyTorch: 2.6.0
  • NCCL: 2.21.5
  • CUDA: 12.4
  • Python: 3.11

I tried changing the DDP backend from nccl to gloo in my argument parser:

parser.add_argument("--backend", type=str, default="nccl", choices=["nccl", "gloo", "mpi"], help="DDP backend") 
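
The flag is only used to pick the backend when the process group is initialized and to decide which device the model lands on, roughly like this (simplified; build_model() is a placeholder):

# Simplified wiring of the --backend flag (the actual script has more around this):
dist.init_process_group(backend=args.backend)
device = torch.device(f"cuda:{LOCAL_RANK}") if args.backend == "nccl" else torch.device("cpu")
model = build_model().to(device)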

When I set --backend=gloo, the script runs without errors, but training happens on the CPU instead of the GPU. Since I need GPU acceleration I have to use nccl, and that is exactly where the failure occurs. How can I fix the "Bootstrap : no socket interface found" NCCL error so the multi-node job runs on the GPUs with the nccl backend?
