What are the GPU Resources on the Lilac cluster, and what do I need?

This page documents the various GPU settings and modes for the Lilac cluster in more detail than the Lilac Cluster Intro page provides. It looks specifically at the new GTX 1080 GPUs, although many of the options apply to other NVIDIA GPUs as well. Where applicable, this primer discusses how these properties interact with the Lilac cluster.

Topics Covered in this Primer

  1. Looking at the status of the GPUs on a node through nvidia-smi

    1. The difference between process exclusive and shared mode on NVIDIA GPUs (Compute Modes)

    2. Important terminology notes

  2. Identifying and running on specific GPUs with the CUDA_VISIBLE_DEVICES environment variable
  3. Running multiple processes on the same GPU through the CUDA Multi-Process Service (MPS) 
    1. Limitations of this feature
    2. Changes to behavior of identifying GPUs

Looking at the status of the GPUs on a node through nvidia-smi

NVIDIA provides a tool to view the status of GPUs, such as their current memory load, temperature, and operating mode. The command is nvidia-smi, but it can only be run on nodes that have GPUs, so it fails on the head nodes. Here is sample output from a Lilac node with 4 GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   39C    P8     6W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
|  0%   36C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   35C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                          
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The entries in the header match up with the per-GPU entries by position, e.g. the Bus-Id label in the top center cell corresponds to the 0000:02:00.0 in the center cell of the row for GPU 0. There is a lot to digest here, so let's cover only the items that are most likely to be important to your jobs.

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   39C    P8     6W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

The individual items of note below refer to this shortened example.

Important Terminology Notes


Identifying and running on specific GPUs with the CUDA_VISIBLE_DEVICES environment variable

Each physical machine has some number of physical GPUs attached to it. When you start a process on a GPU, which physical GPU it starts on depends on a few factors:

  1. Whether the GPUs are in shared or exclusive compute mode (see the query sketch below this list)
  2. The value of the CUDA_VISIBLE_DEVICES environment variable.
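
The compute mode of each GPU can be checked with nvidia-smi's query interface. A minimal sketch, using standard nvidia-smi options rather than anything Lilac-specific; the output shown is illustrative and matches the "E. Process" (Exclusive_Process) mode in the sample output above:

$ nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader
0, Exclusive_Process
1, Exclusive_Process
2, Exclusive_Process
3, Exclusive_Process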

CUDA_VISIBLE_DEVICES identifies which GPU slots are available for processes to start on. Lilac LSF jobs automatically set the CUDA_VISIBLE_DEVICES environment variable on each node to match the GPUs that are available to your job. For instance, suppose you request 2 GPUs per node on 2 nodes, and you are given the GPUs in slots 0 and 1 on NodeA and the GPUs in slots 1 and 2 on NodeB. Your CUDA_VISIBLE_DEVICES variables will look like this:
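
NodeA: CUDA_VISIBLE_DEVICES=0,1
NodeB: CUDA_VISIBLE_DEVICES=1,2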

These indices match up with the GPU entry in the nvidia-smi output above. They are assigned to you when your job starts; however, you can change them to control exactly which device your process starts on. If you log in to NodeA and then export CUDA_VISIBLE_DEVICES=0, any process that uses the GPU can then only execute on the GPU in slot 0. You can only set this variable to devices assigned to your job; if you attempt to set it to GPUs that are not assigned to you, your processes will not start.
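
As a minimal sketch of pinning processes to specific GPUs (my_gpu_app is a placeholder for your own GPU-enabled program, and this assumes a shell on a node where slots 0 and 1 are assigned to your job):

# Launch one process on the GPU in slot 0 and another on the GPU in slot 1.
CUDA_VISIBLE_DEVICES=0 ./my_gpu_app &
CUDA_VISIBLE_DEVICES=1 ./my_gpu_app &
wait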

Lilac is configured to only give you access to the devices you request, and it renumbers those GPUs starting at 0. So if you request 2 GPUs and are given the GPUs in physical slots 1 and 3, you will only see 2 GPUs through nvidia-smi, and they will be numbered 0 and 1.
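
For example, inside such a 2-GPU job you might see something like the following (illustrative output only; the exact values depend on your allocation):

$ echo $CUDA_VISIBLE_DEVICES
0,1
$ nvidia-smi --query-gpu=index,name --format=csv,noheader
0, GeForce GTX 1080
1, GeForce GTX 1080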

