Why can’t my job run now?

Once you submit your job to LSF using bsub, it enters the PENDING state. You can see all your pending jobs with

bjobs -p 
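The output shows one line per pending job, followed by the pending reason. A rough sketch of what to expect (the job ID, user, queue, and host names here are illustrative):

JOBID   USER      STAT  QUEUE     FROM_HOST    EXEC_HOST   JOB_NAME   SUBMIT_TIME
123456  someuser  PEND  cpuqueue  lilac-ln01               myjob      Sep  8 16:01
 Job dependency condition not satisfied;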


You can see the detailed status of a particular job using its job ID. Look for PENDING REASONS: in the output.

bjobs -p3 -l <jobid>

....

PENDING REASONS:

Job dependency condition not satisfied;

.....
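The dependency reason shown above, for example, appears when a job was submitted with a dependency expression and the upstream job has not yet finished. A sketch of such a submission (the job ID and script name are illustrative):

bsub -w "done(123456)" ./myjob.sh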

It can be difficult to interpret the PENDING REASONS that LSF reports in the bjobs output. The cluster may just be very busy. You can see current cluster activity at https://hpc-grafana.mskcc.org/

Other LSF commands such as bhosts, lshosts, and lshosts -gpu will give you current information about the available nodes and resources on the command line. You can also use RTM to view LSF details as the guest user at http://lila-rtm01.mskcc.org/cacti/index.php and http://juno-rtm01.mskcc.org/cacti/index.php
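For a quick command-line view, bhosts summarizes the job slots on each node; the host names and numbers below are illustrative:

> bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
ls01               ok              -     72     60     60      0      0      0
lt01               closed          -     72     72     72      0      0      0

STATUS shows whether a host is accepting new jobs, MAX is its total number of job slots, and NJOBS/RUN show how many slots are occupied.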

Things to Check: 

The more resources you request, the longer it will take LSF to accumulate enough free resources to satisfy the request. Jobs that request resources the cluster does not possess will remain in the PENDING state indefinitely. The maximum walltime on Lilac is 7 days and on Juno is 31 days. Jobs requesting less than 6 hours of walltime can run on any node, but jobs longer than 6 hours can only run on nodes with an SLA or on a subset of the shared nodes.
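A minimal sketch of an explicit resource request (the core count, walltime, memory value, and script name are illustrative, and the memory units follow whatever convention your cluster uses in the rusage string):

bsub -n 4 -W 4:00 -R "rusage[mem=8]" -o output.%J ./myjob.sh

In general, the smaller and more accurate the request, the sooner LSF can find free resources for it.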


More examples forthcoming

Different pending reasons and how to check for them:

  1. Requested CPUs are not available for the requested walltime. The cluster is busy (example 1).
  2. Requested RAM (memory) is not available for the requested walltime.
  3. Requested RAM (memory) doesn't exist on any single host in the cluster. The output from bjobs won't tell you this reason directly; see the lshosts sketch after this list.
  4. Requested GPUs are not available for the requested walltime.

    > bjobs -p3 -l ..
    Tue Sep 8 16:01:28: Submitted from host , CWD <$HOME>, Requested Resources <select[gpu_model0=='geforcegtx1000']>, Requested GPU;
    PENDING REASONS:
    Candidate host pending reasons (0 of 123 hosts).
    Non-candidate host pending reasons (123 of 123 hosts):
    Job's resource requirements not satisfied: lp35, lx10, lx11, lx12, lx13, lx14, boson, lt01, lt02, lt03, lt04, lt05, lt06, lt07, lt08 ...
    Not specified in job submission: ld01, ld02, ld03, ld04, ld05, ld07, lv01, li01, lila-sched01, lila-sched02;
    Load information unavailable: ld06, lg05, lp08, lp09, ls10, ls18, lp21, lp26;
    Closed by LSF administrator: lw01, lw02, ls05, lu04, lu05, lx09;
    RUNLIMIT
    10.0 min


    This job won't run because the gpu_model0 value in the bsub request is not correct, and that resource is not available on the Lilac cluster (Candidate host: 0). The correct name is GeForceGTX1080:

    > lshosts -gpu
    HOST_NAME   gpu_id   gpu_model        gpu_driver   gpu_factor   numa_id
    ls01        0        GeForceGTX1080   440.33.01    6.1          0
                1        GeForceGTX1080   440.33.01    6.1          0
                2        GeForceGTX1080   440.33.01    6.1          1
                3        GeForceGTX1080   440.33.01    6.1          1


  5. The requested GPU type doesn't exist on the cluster. The correct names for the available GPUs can be found with lshosts -gpu; see the corrected request sketched after this list.
  6. Nodes are in a system-level reservation used for a rolling upgrade or for a scheduled cluster-level downtime.
  7. Nodes are reserved under an SLA.
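For reason 3, you can compare your memory request against the maxmem column reported by lshosts; the host name and values below are illustrative:

> lshosts
HOST_NAME  type    model           cpuf  ncpus  maxmem  maxswp  server  RESOURCES
ls01       X86_64  Intel_Platinum  15.0  72     503G    -       Yes     ()

If the mem value in your -R "rusage[mem=...]" request is larger than the maxmem of every host, the job will pend indefinitely.

For reasons 4 and 5, a corrected version of the GPU request above would use the model name exactly as lshosts -gpu reports it (the other bsub options here are illustrative):

bsub -gpu "num=1" -R "select[gpu_model0=='GeForceGTX1080']" ./myjob.sh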




When will my job start to run?

bjobs -l <jobid>

Check “ESTIMATION” in the output

Details forthcoming


Why did my job exit abnormally?

bhist -l <jobid>

bhist -n 0 -l <jobid>

The -n 0 option searches all of the event log files instead of only the most recent one, which helps for jobs that finished a while ago.

Details forthcoming