...

Looking at the status of the GPUs on a node through nvidia-smi
1. The difference between process exclusive and shared mode on NVIDIA GPUs (Compute Modes)
2. Important terminology notes
Identifying and running on specific GPUs with the CUDA_VISIBLE_DEVICES environment variable
1. How processes are distributed when CUDA_VISIBLE_DEVICES is set
Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS)
1. Changes to behavior of identifying GPUs
2. Limitations of this feature

Looking at the status of the GPUs on a node through `nvidia-smi`

...

Lilac is configured to give you access only to the devices you request, and it renumbers those GPUs starting at 0. So if you request 2 GPUs and receive the GPUs in physical slots 1 and 3, you will see only 2 GPUs through nvidia-smi, and they will be given the indices 0 and 1.
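The renumbering can be pictured with a small sketch. This is only an illustration, not Lilac's actual mechanism: the function name `renumber` is hypothetical, and the sketch assumes indices are handed out in ascending order of physical slot, which the text above does not state.

```python
def renumber(physical_slots):
    """Map each granted physical slot to the index the job sees.

    Assumption (not confirmed by the guide): indices follow the
    ascending order of the physical slot numbers.
    """
    return {slot: index for index, slot in enumerate(sorted(physical_slots))}


# A job granted the GPUs in physical slots 1 and 3 sees indices 0 and 1.
mapping = renumber([1, 3])
print(mapping)
```

Under that assumption, the job in the example refers to the GPU in slot 1 as device 0 and the GPU in slot 3 as device 1.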

Unsetting CUDA_VISIBLE_DEVICES is the same as setting it to the list of indices for all available GPUs. For example, if you have 3 GPUs assigned to you in slots 0, 1, and 2, then unsetting CUDA_VISIBLE_DEVICES is the same as setting CUDA_VISIBLE_DEVICES=0,1,2.
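As a rough sketch of that equivalence, the helper below returns the device indices a process would consider, given how many GPUs were assigned. The name `effective_visible_devices` and its parameters are hypothetical, and the sketch only handles a comma-separated list of integer indices; it does not model everything the CUDA runtime accepts in this variable.

```python
import os


def effective_visible_devices(assigned_count, environ=None):
    """Return the device indices implied by CUDA_VISIBLE_DEVICES.

    assigned_count: number of GPUs assigned to the job (hypothetical input).
    environ: mapping to read from; defaults to the real environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: behaves like listing every assigned GPU.
        return list(range(assigned_count))
    # Set: only the listed indices, in the order given.
    return [int(token) for token in value.split(",") if token.strip()]


# With 3 assigned GPUs, unset and "0,1,2" give the same list.
print(effective_visible_devices(3, {}))
print(effective_visible_devices(3, {"CUDA_VISIBLE_DEVICES": "0,1,2"}))
```

Passing an explicit mapping keeps the example independent of whatever the current shell has exported.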

How processes are distributed when CUDA_VISIBLE_DEVICES is set

This section covers how processes are distributed to GPUs when the CUDA_VISIBLE_DEVICES environment variable is set. Remember that the behavior for an unset CUDA_VISIBLE_DEVICES is also defined. Although Lilac jobs are started in exclusive mode, it's important to cover the shared mode behavior because of the Multi-Process Service (MPS) covered in the next section.

Exclusive mode:
- CUDA_VISIBLE_DEVICES is checked
- For each device in CUDA_VISIBLE_DEVICES, in order
  - If no process is on the device, start the process there and stop looping
  - Otherwise, move on to the next device
- If all devices have been looped through and the process has not started
  - Fail state, "No Available Devices"
Shared mode:
- CUDA_VISIBLE_DEVICES is checked
- Pick the first device in CUDA_VISIBLE_DEVICES
  - Start process on device
- If the device is out of memory
  - Fail

As you can see, shared mode requires some management of CUDA_VISIBLE_DEVICES, because every process is started on the first listed device unless you change the list.
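The two placement rules described in this section can be simulated in a few lines. This is a toy model under stated assumptions, not the driver's implementation: the function names, the `busy` set, and the `free_mem` dictionary (memory in arbitrary units) are all hypothetical.

```python
def place_exclusive(visible, busy):
    """Exclusive mode: take the first visible device with no process on it."""
    for device in visible:
        if device not in busy:
            busy.add(device)  # the device now hosts exactly one process
            return device
    raise RuntimeError("No Available Devices")


def place_shared(visible, free_mem, needed):
    """Shared mode: always use the first visible device, or fail on memory."""
    if not visible:
        raise RuntimeError("No Available Devices")
    device = visible[0]
    if free_mem[device] < needed:
        raise MemoryError(f"out of memory on device {device}")
    free_mem[device] -= needed
    return device


# Exclusive mode spreads two processes over devices 0 and 1.
busy = set()
print(place_exclusive([0, 1], busy), place_exclusive([0, 1], busy))

# Shared mode puts the first process on device 0; a second one that no
# longer fits fails, even though device 1 is idle.
free_mem = {0: 16, 1: 16}
print(place_shared([0, 1], free_mem, 10))
try:
    place_shared([0, 1], free_mem, 10)
except MemoryError as err:
    print("failed:", err)
```

In this model, running the second shared-mode process with the visible list set to `[1]` instead of `[0, 1]` succeeds, which is the kind of manual management of CUDA_VISIBLE_DEVICES that shared mode calls for.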
