ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)
After setting up a Ray cluster with 2 nodes of a single GPU each (and also trying a direct PyTorch distributed run … with the same nodes), the distributed processes register successfully, starting with 2 processes with the NCCL backend.
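A minimal sketch of how the cluster state can be confirmed before launching training (assuming the cluster was already started with `ray start` on both nodes):

```python
import ray

# Connect to the already running cluster.
ray.init(address="auto")

# Both nodes and both GPUs are expected to show up here before training starts.
print(ray.cluster_resources())  # e.g. {'CPU': ..., 'GPU': 2.0, ...}
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("GPU"), node["Alive"])
```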
NCCL INFO:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=423719, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) distributed_backend=nccl
(RayExecutor pid=508760) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760)
(RayExecutor pid=508760) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=508760) NCCL version 2.14.3+cuda11.7
```
But as soon as this message appears, I get an ncclInternalError: Internal check failed:

```
RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed. Last error: Proxy Call to rank 0 failed (Connect)
```
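The failure happens during Lightning's broadcast of the log directory, i.e. the very first NCCL collective. To isolate whether this is an NCCL transport problem rather than a Ray/Lightning problem, a bare `torch.distributed` broadcast between the same two nodes can be tried. This is only a minimal sketch; the master address/port, the interface name, and the per-node `RANK` values are placeholders for my setup:

```python
# nccl_check.py -- run once per node: RANK=0 on the head node, RANK=1 on the worker.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "172.16.96.59")   # placeholder: an IP reachable from both nodes
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("NCCL_DEBUG", "INFO")            # verbose NCCL logging
os.environ.setdefault("NCCL_SOCKET_IFNAME", "enp3s0")  # pin the interface NCCL should use

rank = int(os.environ["RANK"])
dist.init_process_group("nccl", rank=rank, world_size=2)

# Same kind of collective that fails inside Lightning's broadcast of the log dir.
t = torch.ones(1, device="cuda")
dist.broadcast(t, src=0)
print(f"rank {rank}: broadcast ok, value={t.item()}")
dist.destroy_process_group()
```

If this bare broadcast also fails to connect, the problem is in the NCCL network path between the two nodes (interface/subnet reachability) rather than in ray_lightning itself.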
I am running this on an on-premise cluster without any containerization, and the single-GPU version of the code runs successfully (with batch size 16).

### Versions / Dependencies

```
torch, ray version: ('1.13.1', '2.3.0')
ray_lightning => 0.3.0

CPU: 8-Core 11th Gen Intel Core i7-11700 (-MT MCP-) speed/min/max: 901/800/4800 MHz
Kernel: 5.4.0-128-generic x86_64
Up: 4h 46m
Mem: 9283.6/64016.7 MiB (14.5%)
Storage: 2.05 TiB (18.7% used)
Procs: 332
Shell: bash 5.0.17
inxi: 3.0.38

No LSB modules are available.
Distributor ID: Linuxmint
Description:    Linux Mint 20.3
Release:        20.3
Codename:       una
```

### Reproduction script

```python
import ray
import ray_lightning as rl
import torch
from pytorch_lightning import Trainer
from torch.utils.data import DataLoader

# utils, AutoEncoder and AutoEncoderDataModule are project-local modules,
# shipped to the workers via the runtime_env working_dir.
ray.init(runtime_env={"working_dir": utils.ROOT_PATH})

dataset_params = utils.config_parse('AUTOENCODER_DATASET')
dataset = AutoEncoderDataModule(**dataset_params)
dataset.setup()

model = AutoEncoder()
autoencoder_params = utils.config_parse('AUTOENCODER_TRAIN')
print(autoencoder_params)
print(torch.cuda.device_count())

dist_env_params = utils.config_parse('DISTRIBUTED_ENV')
strategy = None
if int(dist_env_params['horovod']) == 1:
    strategy = rl.HorovodRayStrategy(use_gpu=True, num_workers=2)
elif int(dist_env_params['model_parallel']) == 1:
    strategy = rl.RayShardedStrategy(use_gpu=True, num_workers=2)
elif int(dist_env_params['data_parallel']) == 1:
    strategy = rl.RayStrategy(use_gpu=True, num_workers=2)

trainer = Trainer(**autoencoder_params, strategy=strategy)
trainer.fit(model, dataset)
```

(`utils.config_parse` is a small project-local helper; a hypothetical stand-in is sketched at the end of this report.)

pytorch-lightning parameters:

```ini
[AUTOENCODER_TRAIN]
max_epochs = 100
weights_summary = full
precision = 16
gradient_clip_val = 0.0
auto_lr_find = True
auto_scale_batch_size = True
auto_select_gpus = True
check_val_every_n_epoch = 1
fast_dev_run = False
enable_progress_bar = True
detect_anomaly = True
```

Run with:

```
python run.py
```

### Issue Severity

High: It blocks me from completing my task.
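For anyone trying to run the reproduction script: `utils.config_parse` reads one section of an INI file (like `[AUTOENCODER_TRAIN]` above) into a dict of keyword arguments. This is a hypothetical stand-in, not the actual project implementation:

```python
# Hypothetical stand-in for the project-local utils.config_parse helper.
import ast
from configparser import ConfigParser

def config_parse(section, path="config.ini"):
    """Return one INI section as a dict of Python-typed values."""
    parser = ConfigParser()
    parser.read(path)
    params = {}
    for key, raw in parser.items(section):
        try:
            params[key] = ast.literal_eval(raw)   # "100" -> int, "True" -> bool, "0.0" -> float
        except (ValueError, SyntaxError):
            params[key] = raw                     # leave plain strings like "full" as-is
    return params
```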