...
- Looking at the status of the GPUs on a node through nvidia-smi
- The difference between process exclusive and shared mode on nVidia GPUs (Compute Modes)
- Important terminology notes
- Identifying and running on specific GPUs by the CUDA_VISIBLE_DEVICES environment variable
- How processes are distributed when CUDA_VISIBLE_DEVICES is set
- Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS)
- Limitations of this feature
- Changes to behavior of identifying GPUs
- Limitations of this feature
Looking at the status of the GPUs on a node through nvidia-smi
...
Lilac is configured to give you access only to the devices you request, and it renumbers those GPUs starting at 0. So if you request 2 GPUs and you get the GPUs in physical slots 1 and 3, you will only see 2 GPUs through nvidia-smi, and they will be given the indices 0 and 1.
Unsetting CUDA_VISIBLE_DEVICES is the same as setting it to the list of indices for all available GPUs. For example, if you have 3 GPUs assigned to you in slots 0, 1 and 2, then unsetting CUDA_VISIBLE_DEVICES is the same as setting CUDA_VISIBLE_DEVICES=0,1,2.
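The renumbering and the unset-variable rule above can be modeled with a small Python sketch. This is a simplified illustration, not how Lilac or the CUDA runtime is implemented: the function names `renumbered_indices` and `effective_devices` are hypothetical, and the model only handles plain integer indices.

```python
import os


def renumbered_indices(physical_slots):
    """Indices a job sees for the physical slots it was given.

    Models the renumbering described above: whatever slots are
    assigned, the job sees indices starting at 0.
    """
    return list(range(len(physical_slots)))


def effective_devices(assigned_count, env=None):
    """Device indices a process may use, given the job's GPU count.

    env is the mapping to read CUDA_VISIBLE_DEVICES from; it defaults
    to os.environ. Hypothetical helper, for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: treated the same as listing every assigned index.
        return list(range(assigned_count))
    return [int(token) for token in value.split(",") if token.strip()]


if __name__ == "__main__":
    print(renumbered_indices([1, 3]))   # slots 1 and 3 are seen as 0 and 1
    print(effective_devices(3, {}))     # unset behaves like "0,1,2"
    print(effective_devices(3, {"CUDA_VISIBLE_DEVICES": "0,1,2"}))
```

Passing an explicit `env` dictionary keeps the sketch testable without touching the real environment of your job.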
How processes are distributed when CUDA_VISIBLE_DEVICES is set
This section covers how processes are distributed to GPUs when the CUDA_VISIBLE_DEVICES environment variable is set. Remember that the behavior for an unset CUDA_VISIBLE_DEVICES is also defined: it is treated as the list of all available GPUs. Although Lilac jobs are started in exclusive mode, it's important to cover the shared mode behavior because of the Multi-Process Service (MPS) covered in the next section.
exclusive mode:
- CUDA_VISIBLE_DEVICES is checked
- For each device in CUDA_VISIBLE_DEVICES
  - If no process on device, start process
  - Loop until process has started
- If process has not started and all devices have been looped through
  - Fail state, "No Available Devices"
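The exclusive mode steps above can be sketched as a small Python model. This is only a simulation of the placement rule as described here, not driver code: `place_exclusive` and the `busy` set are hypothetical names introduced for illustration.

```python
def place_exclusive(visible_devices, busy):
    """Model of exclusive-mode placement.

    visible_devices: ordered indices taken from CUDA_VISIBLE_DEVICES.
    busy: set of indices that already host a process (mutated here).
    Returns the index the new process lands on.
    """
    for device in visible_devices:
        if device not in busy:
            # First device without a process wins.
            busy.add(device)
            return device
    # Every listed device already hosts a process.
    raise RuntimeError("No Available Devices")


if __name__ == "__main__":
    busy = set()
    print(place_exclusive([0, 1], busy))  # first process takes device 0
    print(place_exclusive([0, 1], busy))  # second process takes device 1
    try:
        place_exclusive([0, 1], busy)     # third process has nowhere to go
    except RuntimeError as err:
        print(err)
```

In this model, a job with two GPUs can run two processes without touching CUDA_VISIBLE_DEVICES, and a third process fails.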
shared mode:
- CUDA_VISIBLE_DEVICES is checked
- Pick the first device in CUDA_VISIBLE_DEVICES
- Start process on device
  - If out of memory
    - Fail
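The shared mode steps above can be modeled the same way. Again this is a hedged sketch and not driver code: `place_shared`, the `free_memory` dictionary and the memory figures are hypothetical and exist only to show that the first listed device is always chosen.

```python
def place_shared(visible_devices, free_memory, needed):
    """Model of shared-mode placement.

    visible_devices: ordered indices taken from CUDA_VISIBLE_DEVICES.
    free_memory: dict of index -> free memory (arbitrary units, mutated).
    needed: memory the new process requires, in the same units.
    """
    if not visible_devices:
        raise RuntimeError("No Available Devices")
    device = visible_devices[0]  # always the first entry, never a later one
    if free_memory[device] < needed:
        # No fallback to the other listed devices.
        raise MemoryError("out of memory on device %d" % device)
    free_memory[device] -= needed
    return device


if __name__ == "__main__":
    free = {0: 100, 1: 100}
    print(place_shared([0, 1], free, 60))      # lands on device 0
    try:
        place_shared([0, 1], free, 60)         # device 0 again, now too full
    except MemoryError as err:
        print(err)
    print(place_shared([1, 0], free, 60))      # reordering the list reaches device 1
```

The second call fails even though device 1 is idle, because only the order of CUDA_VISIBLE_DEVICES decides where a process starts.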
As you can see, shared mode always starts a process on the first device listed, so spreading processes across GPUs requires you to manage CUDA_VISIBLE_DEVICES yourself.
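One possible way to manage the variable is to give each process its own value before it starts. The sketch below builds one environment per worker and assigns devices round-robin. The round-robin policy and the `worker_envs` helper are assumptions made for illustration, not a Lilac recommendation.

```python
import os


def worker_envs(num_workers, visible_devices, base_env=None):
    """Build one environment per worker, each pinned to a single GPU.

    visible_devices: indices available to the job, e.g. [0, 1].
    base_env: environment to copy; defaults to the current one.
    Devices are handed out round-robin (an illustrative policy).
    """
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for rank in range(num_workers):
        env = dict(base)
        device = visible_devices[rank % len(visible_devices)]
        env["CUDA_VISIBLE_DEVICES"] = str(device)
        envs.append(env)
    return envs


if __name__ == "__main__":
    for rank, env in enumerate(worker_envs(4, [0, 1], base_env={})):
        print(rank, env["CUDA_VISIBLE_DEVICES"])
```

Each dictionary can then be passed as the `env` argument of `subprocess.Popen` when launching the corresponding worker.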
...