Why can’t my job run now?
Once you submit your job to LSF using bsub, it enters the PENDING state. You can see all your pending jobs with
bjobs -p
You can see the status of a particular job with its JobID. Look for PENDING REASONS: in the output.
bjobs -p3 -l <jobid>
....
PENDING REASONS:
Job dependency condition not satisfied;
.....
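The bjobs -l output is long, so it can help to filter out everything except the pending reasons. The following is a minimal sketch, not an official tool: pending_reasons is a hypothetical helper name, and it assumes the block starts at the line containing "PENDING REASONS:" and ends at the next blank line.

```shell
# Print only the PENDING REASONS block from `bjobs -l` style text on stdin.
# Assumption: the block starts at a line containing "PENDING REASONS:" and
# ends at the next blank line (or at the end of the input).
pending_reasons() {
    awk '
        /PENDING REASONS:/ { in_block = 1; next }      # skip the header itself
        in_block && /^[[:space:]]*$/ { exit }          # stop at the first blank line
        in_block { sub(/^[[:space:]]+/, ""); print }   # trim leading indentation
    '
}

# Self-contained demonstration using sample text instead of a live cluster.
sample=' PENDING REASONS:
 Job dependency condition not satisfied;

 SCHEDULING PARAMETERS:'
printf '%s\n' "$sample" | pending_reasons
```

On the cluster the same function could be fed from the real command, for example bjobs -p3 -l <jobid> | pending_reasons.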
It can be difficult to interpret the PENDING REASONS in the bjobs output. The cluster may just be very busy. You can see the cluster activity at https://hpc-grafana.mskcc.org/
Other LSF commands such as bhosts, lshosts and lshosts -gpu will give you current information about the available nodes and resources on the command line. You can also use RTM to view LSF details as the guest user at http://lila-rtm01.mskcc.org/cacti/index.php and http://juno-rtm01.mskcc.org/cacti/index.php
Things to check for:
- Typos in your bsub command.
- Any requested GPU models exist. lshosts -gpu will list them with the correct syntax.
- The memory requested with -R rusage[mem=4] is in GB (gigabytes) and applies PER SLOT (-n), not per job.
- Make sure that you are in the SLA (Service Level Agreement) for any nodes that you specifically request.
- Your job must be able to finish before any scheduled downtime reservation.
The more resources you request, the longer it will take for LSF to accumulate the resources to satisfy the request. Jobs which request resources that the cluster does not possess will remain in the pending state indefinitely. The maximum walltime on lilac is 7 days and on Juno is 31 days. Jobs shorter than 6 hours can run on any node, but jobs longer than 6 hours can run only on nodes with an SLA or on a subset of the shared nodes.
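The walltime rules above can be checked before submitting. This is a rough sketch using only the limits quoted in this section; classify_walltime is a hypothetical helper, it takes whole hours, and it assumes a job of exactly 6 hours counts as a long job, which the text does not specify.

```shell
# Classify a requested walltime (whole hours) against the limits quoted above:
# lilac 7 days, Juno 31 days, and the 6-hour boundary for short jobs.
# Assumption: exactly 6 hours is treated as a long job.
classify_walltime() {
    cluster=$1
    hours=$2
    case "$cluster" in
        lilac) max_hours=$((7 * 24)) ;;    # 168 hours
        juno)  max_hours=$((31 * 24)) ;;   # 744 hours
        *) echo "unknown cluster: $cluster"; return 2 ;;
    esac
    if [ "$hours" -gt "$max_hours" ]; then
        echo "over limit: exceeds the ${max_hours}-hour maximum on $cluster"
    elif [ "$hours" -lt 6 ]; then
        echo "short: can run on any node"
    else
        echo "long: SLA nodes or a subset of shared nodes only"
    fi
}

classify_walltime lilac 4
classify_walltime juno 200
classify_walltime lilac 200
```

The three calls cover one case each: a short job, a long job that fits within the Juno limit, and a request that is over the lilac limit.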
Examples of different pending reasons and how to check for them.
Requested CPUs are not available. The cluster is busy. The job has been submitted to the ls03 host but ls03 doesn't have 30 free slots.
Requested RAM (memory) is not available. The job asked for 400 GB of memory per slot for 5 slots, all placed on one host by span[ptile=5], which is 2,000 GB of RAM in total: bsub -n 5 -R "span[ptile=5]" -R "rusage[mem=400]"
Requested RAM (memory) doesn't exist on any single host in the cluster. Requested memory is per slot, not per job.
Requested GPUs are not available. Again the cluster is busy.
GPU type doesn't exist on cluster or there is a typo or syntax problem with the GPU in the request.
Nodes are in system level reservation used for rolling upgrade or scheduled cluster level downtime.
Nodes are reserved under SLA.
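For the two memory examples above, the per-slot arithmetic is easy to get wrong, so it can be worth computing before you submit. The sketch below only does the multiplication; mem_per_host_gb and mem_total_gb are hypothetical helper names, and the arguments mirror -n, span[ptile=...] and rusage[mem=...] from the bsub command.

```shell
# Memory LSF must find on each host for a request.
# Arguments: total slots (-n), slots per host (span[ptile=...]),
#            GB per slot (rusage[mem=...]).
mem_per_host_gb() {
    slots=$1
    ptile=$2
    mem_per_slot=$3
    # A host never holds more slots than the job has in total.
    if [ "$ptile" -gt "$slots" ]; then
        ptile=$slots
    fi
    echo $((ptile * mem_per_slot))
}

# Memory requested across the whole job: slots times GB per slot.
mem_total_gb() {
    echo $(($1 * $2))
}

# The example above: bsub -n 5 -R "span[ptile=5]" -R "rusage[mem=400]"
mem_per_host_gb 5 5 400   # prints 2000
mem_total_gb 5 400        # prints 2000
```

If the per-host figure is larger than the memory of the biggest node, the job falls into the second case and will stay pending. Comparing it against the memory values reported by lshosts is one way to check.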
When will my job start to run?
bjobs -l <jobid>
Check for "ESTIMATION" in the output.
Details forthcoming
Why did my job exit abnormally?
bhist -l <jobid>
bhist -n 0 -l <jobid>
Details forthcoming