
ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)

After setting up a Ray cluster with 2 nodes of a single GPU each … (and also with a direct PyTorch distributed run on the same nodes), my distributed processes get registered and training starts with 2 processes on the NCCL backend.

NCCL INFO:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=423719, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) distributed_backend=nccl
(RayExecutor pid=508760) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760)
(RayExecutor pid=508760) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=508760) NCCL version 2.14.3+cuda11.7

But as soon as this message appears, I get ncclInternalError: Internal check failed

RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed. Last error: Proxy Call to rank 0 failed (Connect)

I am running it on an on-premise cluster without any containerization, and the single-GPU code works successfully (with batch size 16).

### Versions / Dependencies

torch 1.13.1, ray 2.3.0, ray_lightning 0.3.0
CPU: 8-Core 11th Gen Intel Core i7-11700 (-MT MCP-) speed/min/max: 901/800/4800 MHz
Kernel: 5.4.0-128-generic x86_64  Up: 4h 46m
Mem: 9283.6/64016.7 MiB (14.5%)  Storage: 2.05 TiB (18.7% used)  Procs: 332
Shell: bash 5.0.17  inxi: 3.0.38
No LSB modules are available.
Distributor ID: Linuxmint
Description: Linux Mint 20.3
Release: 20.3
Codename: una

### Reproduction script
import torch
import ray
import ray_lightning as rl
from pytorch_lightning import Trainer
from torch.utils.data import DataLoader

# utils is a project-local helper module; AutoEncoder and AutoEncoderDataModule
# are project-local classes (their imports were not shown in the original post).
import utils

ray.init(runtime_env={"working_dir": utils.ROOT_PATH})

dataset_params = utils.config_parse('AUTOENCODER_DATASET')
dataset = AutoEncoderDataModule(**dataset_params)
dataset.setup()

model = AutoEncoder()
autoencoder_params = utils.config_parse('AUTOENCODER_TRAIN')
print(autoencoder_params)
print(torch.cuda.device_count())

dist_env_params = utils.config_parse('DISTRIBUTED_ENV')
strategy = None
if int(dist_env_params['horovod']) == 1:
    strategy = rl.HorovodRayStrategy(use_gpu=True, num_workers=2)
elif int(dist_env_params['model_parallel']) == 1:
    strategy = rl.RayShardedStrategy(use_gpu=True, num_workers=2)
elif int(dist_env_params['data_parallel']) == 1:
    strategy = rl.RayStrategy(use_gpu=True, num_workers=2)

trainer = Trainer(**autoencoder_params, strategy=strategy)
trainer.fit(model, dataset)
pytorch-lightning parameters:

[AUTOENCODER_TRAIN]
max_epochs = 100
weights_summary = full
precision = 16
gradient_clip_val = 0.0
auto_lr_find = True
auto_scale_batch_size = True
auto_select_gpus = True
check_val_every_n_epoch = 1
fast_dev_run = False
enable_progress_bar = True
detect_anomaly = True

python run.py

### Issue Severity

High: It blocks me from completing my task.
  • Sorry for the bad formatting; Stack Overflow wouldn't render it the way I wanted and only accepts it like this. Commented Mar 14, 2023 at 16:08
  • NCCL will open a TCP connection between ranks before starting, so I'd make sure the two nodes you have can communicate. I see you're using some kind of Linux-on-Windows setup, so checking the firewalls there would be the first thing I'd look at (see the connectivity sketch after these comments). Commented Mar 25, 2023 at 5:20
  • Yes, that's the problem! :) Thanks. Commented Mar 27, 2023 at 10:57
  • Great! Feel free to answer your own question with how you checked/enabled firewall :) Commented Mar 27, 2023 at 15:43
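
Following up on the TCP point in the comments, here is a minimal reachability check. This is only a sketch: 172.16.96.59 is the worker IP taken from the NCCL log above, while 29500 is just a placeholder for whatever master/rendezvous port is actually configured. Run it from each node against the other.

import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    # Try to open a plain TCP connection, the same kind NCCL needs for bootstrap.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"cannot reach {host}:{port} -> {exc}")
        return False

# Placeholder values: substitute the peer node's IP and the port you actually use.
print(can_connect("172.16.96.59", 29500))

If this prints False from either node, a firewall (or a wrong network interface) between the nodes is the likely culprit.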

1 Answer


In my case, which used the accelerate package, updating it from 0.26.1 to 0.27.2 solved the problem.

Some additional steps:

  • Make sure there are no firewall rules preventing the nodes (and hence the GPUs) from connecting to each other.
  • Use tools like ping and netcat to ensure the servers can communicate without problems; a standalone NCCL test is sketched after this list.
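
If the nodes can reach each other at the OS level, a two-process torch.distributed smoke test run without Ray or Lightning can confirm that NCCL itself can rendezvous across the machines. This is only a sketch: the master address, port, and single-GPU-per-node assumption mirror the setup in the question, and NCCL_DEBUG / NCCL_SOCKET_IFNAME are standard NCCL environment variables for diagnosing which interface gets picked.

# nccl_smoke_test.py -- run one copy on each node, for example:
#   node 0: MASTER_ADDR=172.16.0.2 MASTER_PORT=29500 RANK=0 WORLD_SIZE=2 NCCL_DEBUG=INFO python nccl_smoke_test.py
#   node 1: MASTER_ADDR=172.16.0.2 MASTER_PORT=29500 RANK=1 WORLD_SIZE=2 NCCL_DEBUG=INFO python nccl_smoke_test.py
# (the address and port are placeholders; use the head node's IP and any open port)
import torch
import torch.distributed as dist

def main():
    # The default env:// init method reads MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(0)  # each node has a single GPU in this setup
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # should yield 2.0 on both ranks if NCCL connectivity works
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this hangs or fails with the same ncclInternalError, the problem is node-to-node connectivity (firewall, routing, or interface selection) rather than Ray, Lightning, or accelerate.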