...

  1. Looking at the status of the GPUs on a node through nvidia-smi

    1. The difference between process exclusive and shared mode on NVIDIA GPUs (Compute Modes)

    2. Important terminology notes

  2. Identifying and running on specific GPUs by the CUDA_VISIBLE_DEVICES environment variable
    1. How processes are distributed when CUDA_VISIBLE_DEVICES is set
  3. Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS) 
    1. Changes to behavior of identifying GPUs
    2. Limitations of this feature

Looking at the status of the GPUs on a node through nvidia-smi

...

Lilac is configured to give you access only to the devices you request, and it renumbers those GPUs starting at 0. So if you request 2 GPUs and you get the GPUs in physical slots 1 and 3, you will only see 2 GPUs through nvidia-smi and they will be given the indices 0 and 1.

Unsetting CUDA_VISIBLE_DEVICES is the same as setting it to the list of indices for all available GPUs. For example, if you have 3 GPUs assigned to you in slots 0, 1 and 2, then unsetting CUDA_VISIBLE_DEVICES is the same as setting CUDA_VISIBLE_DEVICES=0,1,2.
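The equivalence above can be illustrated with a small Python sketch. The helper name effective_visible_devices and its arguments are hypothetical and exist only for this illustration; it is not part of CUDA or of Lilac, and it does not model how the CUDA runtime treats invalid entries.

```python
import os


def effective_visible_devices(assigned_gpu_count, env=None):
    """Return the device indices a process would see (illustrative helper).

    assigned_gpu_count: number of GPUs assigned to the job (renumbered from 0).
    env: mapping to read CUDA_VISIBLE_DEVICES from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset: equivalent to listing every assigned GPU.
        return list(range(assigned_gpu_count))
    # Set: use the listed indices in the order given.
    return [int(token) for token in raw.split(",") if token.strip()]


# Three assigned GPUs, variable unset: same as CUDA_VISIBLE_DEVICES=0,1,2
unset_view = effective_visible_devices(3, env={})
explicit_view = effective_visible_devices(3, env={"CUDA_VISIBLE_DEVICES": "0,1,2"})
print(unset_view == explicit_view)
```

Passing env explicitly keeps the sketch independent of whatever the current shell has exported.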

How processes are distributed when CUDA_VISIBLE_DEVICES is set

This section covers how processes are distributed to GPUs when the CUDA_VISIBLE_DEVICES environment variable is set. Remember that the behavior for an unset CUDA_VISIBLE_DEVICES is also defined. Although Lilac jobs are started in exclusive mode, it's important to cover the shared mode behavior because of the Multi-Process Service (MPS) covered in the next section.

  • exclusive mode:
    • CUDA_VISIBLE_DEVICES is checked
    • For each device in CUDA_VISIBLE_DEVICES, in order
      • If no process is running on the device, start the process there
      • Otherwise, move on to the next device
    • If the process has not started and all devices have been looped through
      • Fail state, "No Available Devices"
  • shared mode:
    • CUDA_VISIBLE_DEVICES is checked
    • Pick the first device in CUDA_VISIBLE_DEVICES
      • Start process on device
    • If out of memory
      • Fail
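As you can see, shared mode always picks the first device in CUDA_VISIBLE_DEVICES, so spreading processes across GPUs in shared mode requires you to manage CUDA_VISIBLE_DEVICES for each process yourself.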
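The placement rules above can be modeled with a short Python simulation. This is only a sketch of the behavior as described in this section, not a call into CUDA: the names place_process and env_for_worker are hypothetical, the out-of-memory failure in shared mode is not modeled, and assigning one device per worker in round-robin order is just one possible way to manage the variable.

```python
def place_process(visible_devices, busy_devices, mode):
    """Return the device a new process would land on, per the rules above.

    visible_devices: indices from CUDA_VISIBLE_DEVICES, in listed order.
    busy_devices: set of indices that already host a process.
    mode: "exclusive" or "shared".
    """
    if mode == "exclusive":
        # Walk the list and take the first device with no process on it.
        for device in visible_devices:
            if device not in busy_devices:
                return device
        raise RuntimeError("No Available Devices")
    if mode == "shared":
        # Shared mode always takes the first listed device.
        if not visible_devices:
            raise RuntimeError("No Available Devices")
        return visible_devices[0]
    raise ValueError(f"unknown mode: {mode!r}")


def env_for_worker(worker_index, devices):
    """Build a per-worker environment entry that pins one device (round-robin)."""
    device = devices[worker_index % len(devices)]
    return {"CUDA_VISIBLE_DEVICES": str(device)}


# Device 0 is busy: exclusive mode skips to 1, shared mode still picks 0.
print(place_process([0, 1], {0}, "exclusive"))
print(place_process([0, 1], {0}, "shared"))

# Giving each worker its own CUDA_VISIBLE_DEVICES spreads them in shared mode.
for worker in range(4):
    print(worker, env_for_worker(worker, [0, 1]))
```

A dictionary like the one returned by env_for_worker could be merged into the environment passed to each launched process, so that the "first device" each process sees is a different physical GPU.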


...