...
- Looking at the status of the GPUs on a node through nvidia-smi
- The difference between process exclusive and shared mode on nVidia GPUs (Compute Modes)
- Important terminology notes
- Identifying and running on specific GPUs by the CUDA_VISIBLE_DEVICES environment variable
- The difference between Process Exclusive and Shared mode on nVidia GPUs
- Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS)
  - Limitations of this feature
  - Changes to behavior of identifying GPUs
...
nVidia provides a tool to view the status of GPUs, such as their current memory load, temperature, and operating mode. The command to see this info is nvidia-smi, but it can only be run on nodes which have GPUs, so the command fails on the head nodes. Here is a sample output from a Lilac node with 4 GPUs on it.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   39C    P8     6W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
|  0%   36C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   35C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
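If you want to watch the GPUs over time rather than taking a single snapshot, nvidia-smi can also refresh its display on a loop. A minimal sketch (the 5 second interval is an arbitrary choice), which again must be run on a node that actually has GPUs:

```bash
# Redraw the status table every 5 seconds; stop with Ctrl+C.
nvidia-smi -l 5
```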
...
- GPU: The index of the GPU in this physical hardware slot that is available to your processes. Indexing always starts from 0, but your job may not have GPU 0 assigned to it, so your indices may be different.
  - This example: GPU = 0
  - Important: The GPU index is not always mapped to the physical hardware slots.
    - There are ways to re-order the GPUs, or even mask some GPUs from showing up to a given job.
    - Lilac jobs are given GPUs typically indexed starting at 0, even if your job is running on GPUs in different physical slots on the hardware.
    - This number cannot change during a job, so you don't have to worry about your GPUs being re-assigned or shuffled while a job is running.
- Name: The human-readable name of the GPU; quite often this is simply the model name, truncated to fit.
  - This example: Name = GeForce GTX 1080
- Memory-Usage: How much memory is being consumed, and how much memory is available on this GPU. As you run jobs on the GPU, this memory is consumed. This is NOT the same as the RAM you request as part of your job; this is the GPU's own on-board memory, and its total is fixed per GPU.
  - This example: Memory-Usage = 2MiB / 8113MiB
  - Fun Trivia: MiB is "mebibyte", a base-2 unit of memory, as opposed to the base-10 megabyte. The two are often used interchangeably, even though they are not actually equal, just close: 1 MB = 1000 kB, while 1 MiB = 1024 KiB.
- Compute M.: The "Compute Mode" of the GPU, i.e. how the GPU handles process and thread execution on it. The valid options for the GTX 1080s are E. Process and Default, which correspond to the exclusive process and shared modes referenced in the Intro to Lilac documentation.
  - This example: Compute M. = E. Process
  - The Lilac cluster's GPUs natively run in E. Process, or "exclusive process", mode.
  - Some nVidia GPUs support a thread exclusive mode, but the GTX 1080s do not, so no further discussion of it will be held here.
...
- If the GPUs are in shared or exclusive mode
- The value of the CUDA_VISIBLE_DEVICES environment variable.
CUDA_VISIBLE_DEVICES identifies which GPU slots are available for processes to start on. Lilac LSF jobs automatically set the CUDA_VISIBLE_DEVICES environment variable on each node to match the GPUs that are available to your job. For instance, suppose you request 2 GPUs per node on 2 nodes, and you are given the GPUs in slots 0 and 1 on NodeA and the GPUs in slots 1 and 2 on NodeB. Your CUDA_VISIBLE_DEVICES variables will look like this:
- NodeA: CUDA_VISIBLE_DEVICES=0,1
- NodeB: CUDA_VISIBLE_DEVICES=1,2
These indices match up with the GPU entry from the nvidia-smi output above. They are assigned to you when your job starts; however, you can change them to control exactly which device your process starts on. If you tell your job to sign into NodeA and then export CUDA_VISIBLE_DEVICES=0, any process which uses the GPU can then only execute on the GPU in slot 0. You can only set this to devices assigned to your job; if you attempt to set it to GPUs not assigned to you, your processes will not start.
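As a concrete illustration, the fragment below is a minimal sketch of what this could look like on NodeA from the example above; my_gpu_app and its input files are hypothetical stand-ins for whatever GPU program you actually run:

```bash
# LSF has already set CUDA_VISIBLE_DEVICES for this job, e.g. "0,1" on NodeA.
echo "GPUs assigned to this job: $CUDA_VISIBLE_DEVICES"

# Pin one process to each assigned GPU by overriding the variable per command.
CUDA_VISIBLE_DEVICES=0 ./my_gpu_app input_a.dat &   # hypothetical program
CUDA_VISIBLE_DEVICES=1 ./my_gpu_app input_b.dat &
wait   # wait for both background processes to finish
```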
Lilac is configured to only give you access to the devices you request, and renumbers the GPUs starting at 0. So if you request 2 GPUs and you get the GPUs in physical slots 1 and 3, you will only see 2 GPUs through nvidia-smi, and they will be indexed as 0 and 1.
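Under that assumption (a job that asked for 2 GPUs and landed on physical slots 1 and 3), a quick check from inside the job might look like the sketch below; the exact output will of course depend on the node and GPUs you are given:

```bash
# From inside the job: only the GPUs assigned to the job are visible,
# and they are renumbered starting at 0 rather than keeping slots 1 and 3.
echo $CUDA_VISIBLE_DEVICES
# 0,1

nvidia-smi --query-gpu=index,name --format=csv,noheader
# 0, GeForce GTX 1080
# 1, GeForce GTX 1080
```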