I'm working on a "cluster" that currently has only one compute node, with 8x H100 GPUs.

Slurm is configured so that each GPU is available either as a whole GPU or as 20 shards. The relevant part of slurm.conf (as far as I understand it) says:

NodeName=computenode01 RealMemory=773630 Boards=1 SocketsPerBoard=2 CoresPerSocket=96 ThreadsPerCore=2 Gres=gpu:8,shard:160 Feature=location=local
PartitionName="defq" Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=computenode01
SchedulerType=sched/backfill
GresTypes=gpu,shard
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu,gres/shard
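
(A quick way to sanity-check what the controller actually picked up from this config; nothing here is specific to my setup except the node name:)

sinfo -N -o "%N %G"                                # per-node GRES column; given the Gres= line above this should list gpu:8,shard:160
scontrol show node computenode01 | grep -i gres    # Gres= and CfgTRES= lines as the controller sees them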

and the gres.conf looks like this:

AutoDetect=NVML
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia0
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia1
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia2
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia3
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia4
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia5
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia6
NodeName=computenode01 Name=gpu Count=1 File=/dev/nvidia7
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia0
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia1
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia2
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia3
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia4
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia5
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia6
NodeName=computenode01 Name=shard Count=20 File=/dev/nvidia7
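
(On the compute node itself, slurmd can print the GRES configuration it parses/detects, which is a quick way to confirm gres.conf is being read as intended; typically run as root on computenode01:)

slurmd -G    # print the GRES configuration slurmd sees (the gpu and shard entries above) and exit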

Now if I submit a job using sbatch script.sh, and my resource request inside script.sh is

#SBATCH --gres=gpu:1 

this job might be assigned (depending on availability) to the same physical GPU as previously submitted, still-running jobs that had requested, for example,

#SBATCH --gres=shard:4 

even though the Slurm documentation says that should not happen:

Note the same GPU can be allocated either as a GPU type of GRES or as a shard type of GRES, but not both. In other words, once a GPU has been allocated as a gres/gpu resource it will not be available as a gres/shard. Likewise, once a GPU has been allocated as a gres/shard resource it will not be available as a gres/gpu.

This happens regardless of whether I use srun inside the script (without any further parameters, e.g. srun python my_job.py) or just run dummy commands like echo $CUDA_VISIBLE_DEVICES (which lets me check which device the job has been assigned to).
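
For concreteness, this is roughly how I trigger it. The script names and the sleep are made up for the sketch, but the #SBATCH lines match what I described above. First a script that grabs a few shards and stays running, say shard_job.sh:

#!/bin/bash
#SBATCH --job-name=shard-job
#SBATCH --gres=shard:4
# Print the device this job landed on, then idle so the allocation stays active.
echo "shard job sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
sleep 600

and then a script that asks for a whole GPU, say gpu_job.sh:

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --gres=gpu:1
# Per the documentation quoted above, this should never share a physical GPU with a running shard job.
echo "gpu job sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
sleep 600

Submitting shard_job.sh first (enough copies that free GPUs become scarce) and then gpu_job.sh is how, depending on what else is running, the gpu job ends up on a GPU that already holds shards.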

I'm running Slurm 23.02.5.
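
(Aside, in case it helps anyone checking the same thing: the overlap can also be seen from Slurm's side rather than via CUDA_VISIBLE_DEVICES. The job IDs below are placeholders, and the exact output format may vary between versions, but the detailed output should contain a GRES=...(IDX:n) entry per node.)

scontrol -d show job <shard_job_id> | grep GRES    # expect something like GRES=shard:4(IDX:0)
scontrol -d show job <gpu_job_id> | grep GRES      # with the bug, e.g. GRES=gpu:1(IDX:0), i.e. the same index
scontrol -d show node computenode01 | grep -i gres # GresUsed= gives the combined per-node picture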

P[P*]S: I found this mailing list post from 2023 describing the same problem, but there was never any response.

I also found a bug report that lists the issue as still open, but commenters there say it no longer appears in version 24.11.1.

Finally, the release notes for 23.11.5 state that this particular bug has been fixed:

-- Fix issue where you could have a gpu allocated as well as a shard on that gpu allocated at the same time.

Our Slurm installation is in the process of being upgraded, and I'll confirm whether the upgrade resolves the issue.

1 Answer

After an upgrade to Slurm 24.11.3 I can no longer reproduce the issue.
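
For anyone re-checking after their own upgrade, something along these lines (reusing the hypothetical shard_job.sh / gpu_job.sh scripts sketched in the question) is enough:

scontrol --version                                 # should now report 24.11.3 or later
sbatch shard_job.sh                                # occupy a few shards on one GPU
sbatch gpu_job.sh                                  # request a whole GPU
scontrol -d show node computenode01 | grep -i gres # the gpu and shard allocations should no longer share an index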
