Run NCCL on custom GKE clusters that use A4X

This page describes how to run NCCL/gIB tests on provisioned clusters that use GPUDirect RDMA. It describes tests for the following scenarios:

  • Test on two nodes
  • Test with TAS

Test on two nodes

  1. Connect to your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION

    Replace the following variables:

    • CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on the DEPLOYMENT_NAME.
    • COMPUTE_REGION: the compute region of your cluster.
  2. To deploy an NCCL test workload that runs two test Pods on two A4X nodes, run the following command:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml 
  3. Check that both Pods are scheduled and running:

    kubectl get pods nccl-test-host-1 nccl-test-host-2 

    If the two Pods show a Running status, you can proceed to the next step.

  4. Trigger an all-gather test for the A4X nodes:

    kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2 

    The output is similar to the following:

    #                                                           out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
            1024            32     float    none      -1    21.20    0.05    0.04       0    20.56    0.05    0.04       0
            2048            64     float    none      -1    21.03    0.10    0.09       0    20.82    0.10    0.09       0
            4096           128     float    none      -1    21.11    0.19    0.17       0    20.98    0.20    0.17       0
            8192           256     float    none      -1    21.51    0.38    0.33       0    21.15    0.39    0.34       0
           16384           512     float    none      -1    21.85    0.75    0.66       0    21.72    0.75    0.66       0
           32768          1024     float    none      -1    24.08    1.36    1.19       0    23.73    1.38    1.21       0
           65536          2048     float    none      -1    24.68    2.66    2.32       0    24.02    2.73    2.39       0
          131072          4096     float    none      -1    24.93    5.26    4.60       0    24.30    5.40    4.72       0
          262144          8192     float    none      -1    24.86   10.55    9.23       0    24.33   10.78    9.43       0
          524288         16384     float    none      -1    25.10   20.89   18.28       0    24.48   21.41   18.74       0
         1048576         32768     float    none      -1    25.43   41.24   36.09       0    24.82   42.25   36.97       0
         2097152         65536     float    none      -1    32.30   64.93   56.81       0    31.28   67.04   58.66       0
         4194304        131072     float    none      -1    45.92   91.34   79.92       0    44.22   94.84   82.99       0
         8388608        262144     float    none      -1    71.38  117.52  102.83       0    68.98  121.61  106.41       0
        16777216        524288     float    none      -1    74.17  226.20  197.93       0    72.37  231.83  202.85       0
        33554432       1048576     float    none      -1    116.6  287.84  251.86       0    112.7  297.75  260.54       0
        67108864       2097152     float    none      -1    188.9  355.27  310.86       0    184.0  364.71  319.12       0
       134217728       4194304     float    none      -1    309.6  433.56  379.36       0    299.7  447.83  391.85       0
       268435456       8388608     float    none      -1    559.0  480.23  420.20       0    540.3  496.85  434.75       0
       536870912      16777216     float    none      -1   1053.7  509.52  445.83       0   1021.4  525.64  459.93       0
      1073741824      33554432     float    none      -1   2087.4  514.39  450.10       0   2013.8  533.19  466.54       0
      2147483648      67108864     float    none      -1   4154.7  516.88  452.27       0   3987.4  538.57  471.25       0
      4294967296     134217728     float    none      -1   8289.2  518.14  453.37       0   7907.4  543.16  475.26       0
      8589934592     268435456     float    none      -1    16556  518.85  453.99       0    15726  546.24  477.96       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 175.233
    #
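
To check only the summary line, you can filter the test output. The following command is a minimal sketch that re-runs the same test and pipes the output through grep; the grep filter is an illustrative addition and not part of the run_nccl_tests.sh script, and the interactive flags are omitted so that piping works cleanly:

    kubectl exec nccl-test-host-1 -- /usr/local/gib/scripts/run_nccl_tests.sh \
        -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2 | grep "Avg bus bandwidth"

When you finish testing, you can remove the test Pods by deleting the manifest that you applied:

    kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml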

Test with TAS

To validate the functionality of the provisioned cluster, you can run the following NCCL test with Topology Aware Scheduling (TAS).

Configure Kueue with TAS enabled

  1. Install Kueue with TAS enabled.
  2. Configure Kueue with TAS enabled by creating a file named a4x-kueue-config.yaml with the following content:

    apiVersion: kueue.x-k8s.io/v1alpha1
    kind: Topology
    metadata:
      name: "a4x-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gke-nodepool"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    kind: ResourceFlavor
    apiVersion: kueue.x-k8s.io/v1beta1
    metadata:
      name: "a4x"
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-gb200
      topologyName: "a4x-default"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: NoSchedule
      - key: "kubernetes.io/arch"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "a4x"
    spec:
      namespaceSelector: {} # match all.
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: "a4x"
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: 1_000_000_000
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "default"
      name: "a4x"
    spec:
      clusterQueue: "a4x"
  3. Apply the configuration:

    kubectl apply -f a4x-kueue-config.yaml 
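
Optionally, before you schedule the test, you can confirm that the Kueue objects exist. The following commands are a quick check, assuming the Kueue CRDs from the installation step are present; the object names come from a4x-kueue-config.yaml:

    kubectl get topologies.kueue.x-k8s.io a4x-default
    kubectl get resourceflavors.kueue.x-k8s.io a4x
    kubectl get clusterqueues.kueue.x-k8s.io a4x
    kubectl get localqueues.kueue.x-k8s.io a4x --namespace=default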

Schedule a topology-aware NCCL test with Kueue with TAS enabled

The following workload must be placed within a single NVLink Domain sub-block. The kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock" annotation in the workload manifest tells Kueue's TAS to enforce this placement.
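
To see how your A4X nodes map to these topology levels, you can optionally list the node labels that TAS uses for placement. This check assumes that your nodes carry the same labels that the Topology object references:

    kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-gb200 \
        -L cloud.google.com/gce-topology-block \
        -L cloud.google.com/gce-topology-subblock \
        -L cloud.google.com/gce-topology-host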

  1. Install JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit. Ensure that your non-GPU node pools have enough resources to schedule the JobSet controllers.
  2. Create a file named nccl-tas-test.yaml with the following content. Replace NUM_NODES with the number of nodes to run the NCCL test on, up to 18:

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: nccl-test-compute-domain
    spec:
      numNodes: NUM_NODES
      channel:
        resourceClaimTemplate:
          name: nccl-test-compute-domain-channel
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: kueue-tas-nccl-all-gather
      labels:
        kueue.x-k8s.io/queue-name: a4x
    spec:
      ttlSecondsAfterFinished: 1200
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        template:
          spec:
            parallelism: NUM_NODES
            completions: NUM_NODES
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth2","network":"rdma-0"},
                      {"interfaceName":"eth3","network":"rdma-1"},
                      {"interfaceName":"eth4","network":"rdma-2"},
                      {"interfaceName":"eth5","network":"rdma-3"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-gb200
                tolerations:
                - key: nvidia.com/gpu
                  operator: Equal
                  value: present
                  effect: NoSchedule
                - key: kubernetes.io/arch
                  operator: Equal
                  value: arm64
                  effect: NoSchedule
                setHostnameAsFQDN: true
                volumes:
                - name: gib
                  hostPath:
                    path: /home/kubernetes/bin/gib
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: lib64
                  hostPath:
                    path: /lib64
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                resourceClaims:
                - name: compute-domain-channel
                  resourceClaimTemplateName: nccl-test-compute-domain-channel
                containers:
                - name: nccl-test
                  stdin: true
                  tty: true
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.4
                  env:
                  - name: MY_NODE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: spec.nodeName
                  - name: OMPI_ALLOW_RUN_AS_ROOT
                    value: "1"
                  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                    value: "1"
                  - name: N_NODES
                    value: "NUM_NODES"
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  command:
                  - bash
                  - -c
                  - |
                    set -x
                    echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"

                    # Install ping
                    apt update -y
                    apt install -y iputils-ping

                    # Start sshd
                    /scripts/container_entry.sh daemon &

                    # Get helper variables to form all hostnames
                    export POSTFIX=$(hostname | cut -d . -f 2-)
                    export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                    export NODE_RANK=$JOB_COMPLETION_INDEX

                    # For every worker, wait till online and add to hostfile
                    for i in `seq 0 $(($N_NODES-1))`; do
                      OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                      until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                        echo Waiting for ${OTHER}...
                        sleep 10
                      done
                      echo ${OTHER} port=222 slots=4 | tee -a /tmp/hostfile;
                    done
                    cat /tmp/hostfile

                    # Launch from head node
                    if [[ "${NODE_RANK}" -eq "0" ]]; then
                      # World Level = 0x0, Rail Aligned = 0x7
                      export NCCL_TESTS_SPLIT_MASK="0x0";

                      # Force use of libnccl-gib
                      export NCCL_NET=gIB

                      # Set all the correct libnccl-gib environment variables
                      source /usr/local/gib/scripts/set_nccl_env.sh

                      # Get all relevant NCCL / env vars to pass to all workers
                      ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')

                      mpirun --hostfile /tmp/hostfile \
                        -x $ENV_VARS \
                        -mca plm_rsh_no_tree_spawn 1 \
                        --mca orte_keep_fqdn_hostnames 1 \
                        --mca btl self,tcp \
                        --mca btl_tcp_if_include eth0 \
                        --bind-to none \
                        --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                        /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                    else
                      while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                        sleep 5
                      done
                    fi
                    exit 0
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: gib
                    mountPath: /usr/local/gib
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      nvidia.com/gpu: 4
                    requests:
                      nvidia.com/gpu: 4
                    claims:
                    - name: compute-domain-channel
                restartPolicy: Never
  3. Run the test:

    kubectl apply -f nccl-tas-test.yaml 
  4. Check the test result by reviewing the logs:

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep kueue-tas-nccl-all-gather-worker-0-0)

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
            1024             8     float    none      -1    56.72    0.02    0.02       0    56.12    0.02    0.02       0
            2048            16     float    none      -1    56.85    0.04    0.03       0    56.87    0.04    0.03       0
            4096            32     float    none      -1    57.53    0.07    0.07       0    57.47    0.07    0.07       0
            8192            64     float    none      -1    58.43    0.14    0.14       0    58.27    0.14    0.14       0
           16384           128     float    none      -1    59.29    0.28    0.27       0    58.87    0.28    0.27       0
           32768           256     float    none      -1    60.02    0.55    0.53       0    59.60    0.55    0.53       0
           65536           512     float    none      -1    61.83    1.06    1.03       0    61.64    1.06    1.03       0
          131072          1024     float    none      -1    70.99    1.85    1.79       0    70.82    1.85    1.79       0
          262144          2048     float    none      -1    71.56    3.66    3.55       0    71.07    3.69    3.57       0
          524288          4096     float    none      -1    72.62    7.22    6.99       0    71.90    7.29    7.06       0
         1048576          8192     float    none      -1    72.80   14.40   13.95       0    72.31   14.50   14.05       0
         2097152         16384     float    none      -1    73.40   28.57   27.68       0    72.96   28.74   27.85       0
         4194304         32768     float    none      -1    73.86   56.78   55.01       0    73.44   57.12   55.33       0
         8388608         65536     float    none      -1    102.5   81.86   79.30       0    101.4   82.69   80.11       0
        16777216        131072     float    none      -1    158.3  105.97  102.66       0    156.8  107.02  103.68       0
        33554432        262144     float    none      -1    158.4  211.89  205.26       0    157.5  212.99  206.33       0
        67108864        524288     float    none      -1    250.7  267.68  259.32       0    248.7  269.81  261.38       0
       134217728       1048576     float    none      -1    417.7  321.29  311.25       0    414.1  324.13  314.01       0
       268435456       2097152     float    none      -1    728.8  368.32  356.81       0    721.5  372.08  360.45       0
       536870912       4194304     float    none      -1   1226.5  437.72  424.04       0   1216.1  441.46  427.66       0
      1073741824       8388608     float    none      -1   2268.4  473.35  458.56       0   2247.0  477.86  462.93       0
      2147483648      16777216     float    none      -1   4330.6  495.88  480.39       0   4291.6  500.39  484.76       0
      4294967296      33554432     float    none      -1   8640.9  497.05  481.52       0   8544.0  502.69  486.98       0
      8589934592      67108864     float    none      -1    17258  497.75  482.19       0    17052  503.75  488.00       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth : 157.091
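
If the worker Pods stay in a Pending state, you can check whether Kueue admitted the workload. These commands are optional checks; the JobSet name comes from nccl-tas-test.yaml, and the Workload object is created by Kueue in the default namespace:

    kubectl get workloads --namespace=default
    kubectl get jobsets kueue-tas-nccl-all-gather --namespace=default

After you review the results, you can clean up the test resources by deleting the manifest that you applied. If you no longer need the Kueue configuration, you can also delete it:

    kubectl delete -f nccl-tas-test.yaml
    kubectl delete -f a4x-kueue-config.yaml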

What's next