
What are the GPU Resources on the Lilac cluster, and what do I need?

This page documents the various GPU settings and modes on the Lilac cluster in more detail than the Lilac Cluster Intro page provides. It focuses on the new GTX 1080 GPUs, although many of the options apply to other Nvidia GPUs as well. Where applicable, this primer discusses how these properties interact with the Lilac cluster.

Topics Covered in this Primer

  1. Looking at the status of the GPUs on a node through nvidia-smi

    1. Important terminology notes

  2. Identifying and running on specific GPUs with the CUDA_VISIBLE_DEVICES environment variable
  3. The difference between exclusive-process and shared mode on Nvidia GPUs
  4. Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS) 
    1. Limitations of this feature
    2. Changes to behavior of identifying GPUs

Looking at the status of the GPUs on a node through nvidia-smi

Nvidia provides a tool, nvidia-smi, to view the status of GPUs, such as their current memory load, temperature, and operating mode. Here is sample output from a Lilac node with 4 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   39C    P8     6W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
|  0%   36C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   35C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                          
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The header entries match the per-GPU entries by position; e.g., the Bus-Id label in the top center cell corresponds to 0000:02:00.0 in the second row, center cell. There is a lot to digest here, but let's cover only the items most likely to matter for your jobs.

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   39C    P8     6W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

The following items of note refer to the shortened example above.

  • GPU: The index of the GPU among this machine's physical hardware slots. Indexing always starts from 0, but your job may not have GPU 0 assigned to it, so your indices may differ.
    • This example: GPU = 0
  • Name: The human-readable name of the GPU; quite often this is simply the model name, truncated to fit.
    • This example: Name = GeForce GTX 1080
  • Memory-Usage: How much memory is in use, and how much memory is available on this GPU. As you run jobs on the GPU, this memory is consumed. This is NOT the same as the RAM you request as part of your job; it is on-GPU memory and is fixed per GPU.
    • This example: Memory-Usage = 2MiB / 8113MiB
    • Fun Trivia: MiB is "mebibyte", a base-2 unit of memory, as opposed to the base-10 megabyte (MB). The two are often used interchangeably, even though in reality they are not equal, just close: 1 MB = 1,000,000 bytes, while 1 MiB = 1024 × 1024 = 1,048,576 bytes. The 8113 MiB above is therefore about 8.5 GB.
  • Compute M.: The "Compute Mode" of the GPU, i.e., how the GPU handles process and thread execution. Valid options for the GTX 1080s are E. Process and Default, which correspond to the exclusive-process and shared modes referenced in the Intro to Lilac documentation.
    • This example: Compute M. = E. Process
    • The Lilac cluster's GPUs natively run in E. Process, or "exclusive process", mode
    • Some Nvidia GPUs support a thread-exclusive mode, but the GTX 1080s do not, so it will not be discussed further here.
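
The fields above can also be pulled in machine-readable form with nvidia-smi's query flags (field names here come from the nvidia-smi documentation; check nvidia-smi --help-query-gpu for the exact list your driver supports). A sketch, using a sample line that mirrors GPU 0 above in place of live output:

```shell
# On a live GPU node you would run:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total,compute_mode --format=csv,noheader
# Here we stand in a sample line matching GPU 0 above and parse it with awk:
sample='0, GeForce GTX 1080, 2 MiB, 8113 MiB, Exclusive_Process'
echo "$sample" | awk -F', ' '{print "GPU " $1 " (" $2 "): " $3 " of " $4 " used, mode " $5}'
# prints: GPU 0 (GeForce GTX 1080): 2 MiB of 8113 MiB used, mode Exclusive_Process
```

This CSV form is handy in job scripts, where scraping the human-readable table is fragile.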

Important Terminology Notes

  • Because of the possible confusion involving the word "Default" as it relates to the Compute Mode of the GPUs, the Compute M. = Default case will be referred to as shared mode throughout this document.
  • Shared in this document refers only to the compute mode of the GPU, in that multiple contexts can share the GPU. Shared does not mean that a GPU is shared between different jobs submitted to the cluster.
  • Exclusive process will be shortened to exclusive, since there are not multiple modes of exclusivity on the GTX 1080s
  • Exclusive in this document will only refer to the compute mode of the GPU


Identifying and running on specific GPUs with the CUDA_VISIBLE_DEVICES environment variable

Each physical machine has some number of physical GPUs attached to it. When you start a process on a GPU, which physical GPU it starts on depends on a few factors:

  1. If the GPUs are in shared or exclusive mode
  2. The value of the CUDA_VISIBLE_DEVICES environment variable.
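
As a sketch of how the environment variable works (the GPU indices here are hypothetical): setting CUDA_VISIBLE_DEVICES before launching a CUDA program restricts which physical GPUs that program can see, and CUDA renumbers the visible devices starting from 0.

```shell
# Expose only physical GPUs 1 and 3 to the next CUDA program launched
# from this shell; inside that program, physical GPU 1 appears as
# device 0 and physical GPU 3 as device 1.
export CUDA_VISIBLE_DEVICES=1,3
echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"
# prints: Visible GPUs: 1,3
```

Note that on a managed cluster the scheduler typically sets this variable for your job, so check its value before overriding it.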