Cluster resources
The Lilac cluster currently has approximately 124 nodes, 4,180 CPUs, and 508 GPUs, with 4.1 PB of usable computational storage (/Lilac) and 3.1 PB of usable warm storage (/warm). There are 13 different node configurations. A detailed summary of the Lilac cluster hardware is available at http://hpc.mskcc.org/compute-accounts/
Realtime Lilac cluster information is on Grafana https://hpc-grafana.mskcc.org/d/000000005/cluster-dashboard?refresh=10s&orgId=1&var-cluster=Lilac&var-GPUs=All&var-gpuhost=All
RTM has information about the Lilac LSF cluster: http://Lilac-rtm01.mskcc.org/cacti/index.php
User name: guest
Getting cluster info from LSF on the command line
These are some of the commands you can use to get current information about the compute nodes and LSF configuration on the Lilac cluster.
- bhosts displays hosts and their static and dynamic resources.
- lshosts displays hosts and their static resource information.
- bqueues displays information about queues.
- bmgroup displays the host groups.
- bjobs -u all displays all running and pending jobs.
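For example, a quick look at cluster state from a login session might combine these commands as follows (the -l flag requests detailed output; gpuqueue is the GPU queue described below):
    bhosts                 # status and load of every host
    lshosts                # static resources of every host
    bqueues -l gpuqueue    # detailed limits and status for one queue
    bjobs -u all           # all running and pending jobs on the cluster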
Logging in
Access to the Lilac cluster is by ssh only. The login node for the Lilac cluster is Lilac.mskcc.org. Please do not run compute jobs on the login node; use the data transfer server juno-xfer01.mskcc.org for moving large datasets.
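As an illustration, a typical login and a large data transfer might look like the lines below; <username> and the paths are placeholders, and the rsync options shown are just one common choice:
    ssh <username>@Lilac.mskcc.org
    rsync -avP /path/to/local/data <username>@juno-xfer01.mskcc.org:/path/on/cluster/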
LSF cluster defaults
We reserve ~12 GB of RAM on each Lilac host for the operating system and GPFS.
Each LSF job slot corresponds to one CPU hyperthread. All Lilac compute nodes have access to the Internet. The default LSF queue, ‘cpuqueue’, should be used for CPU jobs only; the gpuqueue queue should be used for GPU jobs only. It is a good idea to always specify the number of threads, the memory per thread, and the expected wall time for every job.
These are the LSF job default parameters if you do not specify them in your bsub command:
Queue name: cpuqueue
Number of CPUs (-n): 1
Memory (RAM): 2 GB
Resource span: span[hosts=1] (all job slots on a single host)
Walltime (job runtime): 1 hour
The maximum walltime for jobs on Lilac is 7 days.
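The submissions below are a sketch of overriding these defaults for a CPU job and a GPU job. The script names and resource values are placeholders, the memory value is assumed to be in GB per slot, and the -gpu option syntax is the standard LSF form, which may be tuned differently in the local configuration:
    bsub -q cpuqueue -n 4 -R "rusage[mem=4] span[hosts=1]" -W 6:00 -o job.%J.out ./my_cpu_job.sh
    bsub -q gpuqueue -n 1 -gpu "num=1" -W 2:00 -o gpu.%J.out ./my_gpu_job.sh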
Local scratch space on nodes
All nodes have a local 1 TB /scratch drive.
Some nodes have an additional local 2 TB NVMe /fscratch drive. Nodes with NVMe /fscratch can be requested with the bsub argument -R fscratch.
Please clean up your data on scratch drives when your job finishes. Don’t use /tmp for scratch space or any job output.
The cleanup policy for scratch drives is: files with an access time older than 31 days are deleted daily, at 6:15 AM for /scratch and 6:45 AM for /fscratch.
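One way a job can request an /fscratch node and clean up after itself is sketched below; the per-job directory convention using $USER and the LSF-provided $LSB_JOBID variable is illustrative, not a site requirement:
    bsub -q cpuqueue -n 2 -R fscratch -W 4:00 ./my_fscratch_job.sh
    # inside my_fscratch_job.sh:
    SCRATCHDIR=/fscratch/$USER/$LSB_JOBID
    mkdir -p "$SCRATCHDIR"
    # ... run the computation, writing temporary files to $SCRATCHDIR ...
    rm -rf "$SCRATCHDIR"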
Service Level Agreements
Some subsets of compute nodes were purchased by partner PIs or individual departments. These systems are placed into the cluster under special queue configurations that enable prioritization for the contributing group. All users benefit from such systems, because jobs from any user may run on them while they are idle or under low utilization. The rules for group-owned nodes are defined in the LSF scheduler configuration as Service Level Agreements (SLAs), which give specific users prioritized access to subsets of the nodes and define their loan policies.
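For groups that own nodes, the usual LSF pattern is to inspect the available SLAs and attach jobs to one by name; whether and how the -sla option applies on Lilac depends on the local configuration, and the SLA name below is a placeholder:
    bsla                                              # list configured SLAs and their guarantees
    bsub -sla mygroup_sla -q cpuqueue -n 2 -W 12:00 ./my_job.sh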