
We are looking for some advice on slurm salloc GPU allocations. Currently, given:

% salloc -n 4 -c 2 --gres=gpu:1
% srun env | grep CUDA
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0

However, we desire more than just device 0 to be used.
Is there a way to specify an salloc with srun/mpirun to get the following?

CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=2
CUDA_VISIBLE_DEVICES=3

The goal is for each task to get 1 GPU, with overall GPU usage spread across the 4 available devices (see gres.conf below), rather than all tasks getting device 0.

That way each task is not waiting on device 0 to free up from other tasks, as is currently the case.

Or is this expected behavior even if we have more than 1 gpu available/free (4 total) for the 4 tasks? What are we missing or misunderstanding?

  • salloc / srun parameter?
  • slurm.conf or gres.conf setting?

Summary: We want to use slurm and MPI such that each rank/task uses 1 GPU, but the job can spread the ranks across the 4 GPUs. Currently it appears we are limited to device 0 only. We also want to avoid multiple srun submissions within a single salloc/sbatch, because of how we use MPI.

OS: CentOS 7

Slurm version: 16.05.6

Are we forced to use wrapper-based methods for this?

Are there differences between slurm versions (14 to 16) in how GPUs are allocated?

Thank you!

Reference: gres.conf

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
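For context, the matching slurm.conf entries would be along the lines of the sketch below (the node name and CPU count are placeholders, not our actual file):

# slurm.conf sketch; NodeName and CPUs are placeholders
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:4 CPUs=8 State=UNKNOWN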

3 Answers


First of all, try requesting four GPUs with

% salloc -n 4 -c 2 --gres=gpu:4

With --gres=gpu:1, it is the expected behaviour that all tasks see only one GPU. With --gres=gpu:4, the output would be

CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3

To get what you want, you can use a wrapper script, or modify your srun command like this:

srun bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA 

then you will get

CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=2
CUDA_VISIBLE_DEVICES=3
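The wrapper-script route mentioned above could be sketched roughly as follows (the file name gpu_wrapper.sh is made up, and the script assumes the job has been allocated all GPUs on the node; SLURM_LOCALID is the task's node-local rank set by srun):

#!/bin/bash
# gpu_wrapper.sh -- illustrative sketch, not a tested production script
# Bind each task to the GPU whose index matches its node-local rank
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

It would be launched as, for example, srun ./gpu_wrapper.sh ./my_mpi_app (my_mpi_app being a placeholder for the real binary).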

2 Comments

Thank you for the reply. We were expecting --gres=gpu:1 to really mean --gres_per_task=gpu:1, like the behavior of the -c, --cpus-per-task= option, but it appears to behave more like --gres_per_node=gpu:1. We are also hoping to avoid any wrapper-based methods. We had assumed slurm would be able to handle this use case, since we expected it to be fairly common.
@CharlieHemlock Yes, --gres is per node, not per task. I am not sure a per-task request would be that common. Most of the time, either the tasks are independent and are submitted as job arrays, or they are not independent and are part of an MPI job that then has full control over all GPUs of the node and distributes tasks to the GPUs in whatever way is best for the application at hand.
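As an illustration of the job-array pattern mentioned in this comment, a minimal sbatch script could look like the sketch below (counts and the application name are placeholders):

#!/bin/bash
# Minimal job-array sketch: four independent jobs, each requesting one GPU
#SBATCH --array=0-3
#SBATCH --gres=gpu:1
#SBATCH -c 2
# my_app is a placeholder for the real binary
srun ./my_app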

This feature (a per-task GPU request) is planned for 19.05. See https://bugs.schedmd.com/show_bug.cgi?id=4979 for details.

Be warned that the 'srun bash ...' solution suggested above will break if your job doesn't request all GPUs on that node, because another process may be in control of GPU 0.
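Once that feature is available, a per-task request should be expressible directly; a hedged sketch of what that might look like on Slurm 19.05 or newer (not applicable to the 16.05 cluster in the question):

% salloc -n 4 -c 2 --gpus-per-task=1
% srun env | grep CUDA

with each task then expected to be bound to its own GPU (exactly how CUDA_VISIBLE_DEVICES is numbered per task depends on the cgroup/device-constraint configuration).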



To accomplish one GPU per task you need to use the --gpu-bind switch of the srun command. For example, if I have three nodes with 8 GPUs each and I wish to run eight tasks per node each bound to a unique GPU, the following command would do the trick:

srun -p gfx908_120 -n 24 -G gfx908_120:24 --gpu-bind=single:1 -l bash -c 'echo $(hostname):$ROCR_VISIBLE_DEVICES' 
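On an NVIDIA node the analogous command would presumably be along these lines (partition name, task count, and GPU count are placeholders, and these GPU options require a much newer Slurm than the 16.05 in the question; the visible-device variable is CUDA_VISIBLE_DEVICES rather than ROCR_VISIBLE_DEVICES):

srun -p gpu_partition -n 4 --gpus=4 --gpu-bind=single:1 -l bash -c 'echo $(hostname):$CUDA_VISIBLE_DEVICES'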

