Managing Jobs
Contents
- Understanding Job States
- View Job Information
- Changing Job Order Within Queues
- Switch Jobs from One Queue to Another
- Forcing Job Execution
- Suspending and Resuming Jobs
- Killing Jobs
- Sending a Signal to a Job
- Using Job Groups
- Handling Job Exceptions
Understanding Job States
The bjobs command displays the current state of the job.
Normal job states
Most jobs enter only three states:
Job state  Description
PEND       Waiting in a queue for scheduling and dispatch
RUN        Dispatched to a host and running
DONE       Finished normally with a zero exit value
Suspended job states
A suspended job is in one of three states:
Job state  Description
PSUSP      Suspended by its owner or the LSF administrator while in PEND state
USUSP      Suspended by its owner or the LSF administrator after being dispatched
SSUSP      Suspended by the LSF system after being dispatched
State transitions
A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
Pending jobs
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions
Maximum pending job threshold
If the user or user group submitting the job has reached the pending job threshold as specified by MAX_PEND_JOBS (either in the User section of lsb.users, or cluster-wide in lsb.params), LSF will reject any further job submission requests sent by that user or user group. The system will continue to send the job submission requests with the interval specified by SUB_TRY_INTERVAL in lsb.params until it has made a number of attempts equal to the LSB_NTRIES environment variable. If LSB_NTRIES is undefined and LSF rejects the job submission request, the system will continue to send the job submission requests indefinitely as the default behavior.
Suspended jobs
A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF.
After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.
If the load on the execution host or hosts becomes too high, batch jobs could be interfering among themselves or could be interfering with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.
LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.
Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.
A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
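The load-based suspension decision described above can be sketched in Python. This is a simplified illustration, not LSF's scheduler: the function name pick_job_to_suspend, the dictionary layouts, and the assumptions that a larger value is always worse for every load index and that a larger number means a higher queue priority are all hypothetical.

```python
def pick_job_to_suspend(load, host_stop, queue_stop, running_jobs):
    """Return the (job_id, queue, priority) tuple to suspend, or None.

    load:         current load indices on the host, e.g. {"r1m": 2.5}
    host_stop:    per-host suspending thresholds (hypothetical layout)
    queue_stop:   per-queue suspending thresholds, keyed by queue name
    running_jobs: list of (job_id, queue_name, queue_priority) tuples
    """
    def exceeded(thresholds):
        # Simplification: a larger value counts as "beyond" the threshold.
        return any(load.get(index, 0) > limit
                   for index, limit in thresholds.items())

    over = exceeded(host_stop) or any(
        exceeded(queue_stop.get(queue, {})) for _, queue, _ in running_jobs
    )
    if not over or not running_jobs:
        return None
    # Suspend the job from the lowest-priority queue first.
    return min(running_jobs, key=lambda job: job[2])
```

For example, with jobs from a priority-43 queue and a priority-15 queue running on a host whose r1m load exceeds the host threshold, the sketch selects the job from the priority-15 queue.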
WAIT state (chunk jobs)
If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.
See Chapter 32, "Chunk Job Dispatch" for more information about chunk jobs.
Exited jobs
An exited job ended with a non-zero exit status.
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job is not able to be dispatched before it reaches its termination deadline set by bsub -t, and thus is terminated by LSF.
- The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
- The application exits with a non-zero exit code.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Handling Host-level Job Exceptions for more information.
Post-execution states
Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.
The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.
After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF.
See Chapter 38, "Pre-Execution and Post-Execution Commands" for more information.
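The job states described in this section can be summarized as a small state model. The following Python sketch is illustrative only: the JobState enum, the TRANSITIONS table, and the can_move helper are hypothetical names, and the table lists only transitions implied by the state descriptions in this document (the PSUSP to PEND resume path in particular is an assumption). It is not LSF's internal implementation, and it leaves out the WAIT, POST_DONE, and POST_ERR states.

```python
from enum import Enum


class JobState(Enum):
    PEND = "PEND"    # waiting in a queue
    PSUSP = "PSUSP"  # suspended while pending
    RUN = "RUN"      # dispatched and running
    USUSP = "USUSP"  # suspended by owner or administrator after dispatch
    SSUSP = "SSUSP"  # suspended by the LSF system after dispatch
    DONE = "DONE"    # finished with a zero exit value
    EXIT = "EXIT"    # terminated abnormally


# Transitions implied by the text; termination (EXIT) can happen from
# any state that is not already final.
TRANSITIONS = {
    JobState.PEND: {JobState.RUN, JobState.PSUSP, JobState.EXIT},
    JobState.PSUSP: {JobState.PEND, JobState.EXIT},  # resume path assumed
    JobState.RUN: {JobState.DONE, JobState.USUSP, JobState.SSUSP,
                   JobState.EXIT},
    JobState.USUSP: {JobState.SSUSP, JobState.EXIT},
    JobState.SSUSP: {JobState.RUN, JobState.EXIT},
    JobState.DONE: set(),
    JobState.EXIT: set(),
}


def can_move(current, target):
    """Return True if the sketch allows moving from current to target."""
    return target in TRANSITIONS[current]
```

A table like this makes it easy to see, for example, that a pending job cannot reach DONE without first being dispatched.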
View Job Information
The bjobs command is used to display job information. By default, bjobs displays information for the user who invoked the command. For more information about bjobs, see the LSF Reference and the bjobs(1) man page.
View all jobs for all users
- Run bjobs -u all to display all jobs for all users.
Job information is displayed in the following order:
- Running jobs
- Pending jobs in the order in which they are scheduled
- Jobs in high-priority queues are listed before those in lower-priority queues
For example:
bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1004  user1 RUN   short    hostA     hostA     job0     Dec 16 09:23
1235  user3 PEND  priority hostM               job1     Dec 11 13:55
1234  user2 SSUSP normal   hostD     hostM     job3     Dec 11 10:09
1250  user1 PEND  short    hostA               job4     Dec 11 13:59
View jobs for specific users
- Run bjobs -u user_name to display jobs for a specific user:
bjobs -u user1
JOBID USER  STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2225  user1 USUSP normal hostA               job1     Nov 16 11:55
2226  user1 PSUSP normal hostA               job2     Nov 16 12:30
2227  user1 PSUSP normal hostA               job3     Nov 16 12:31
View running jobs
- Run bjobs -r to display running jobs.
View done jobs
- Run bjobs -d to display recently completed jobs.
View pending job information
- Run bjobs -p to display the reason why a job is pending.
- Run busers -w all to see the maximum pending job threshold for all users.
View suspension reasons
- Run bjobs -s to display the reason why a job was suspended.
View chunk job wait status and wait reason
- Run bhist -l to display jobs in WAIT status. Jobs are shown as Waiting ...
The bjobs -l command does not display a WAIT reason in the list of pending jobs.
View post-execution states
- Run bhist to display the POST_DONE and POST_ERR states.
The resource usage of post-processing is not included in the job resource usage.
View exception status for jobs (bjobs)
- Run bjobs to display job exceptions. bjobs -l shows exception information for unfinished jobs, and bjobs -x -l shows finished as well as unfinished jobs.
For example, the following bjobs command shows that job 2 is running longer than the configured JOB_OVERRUN threshold, and is consuming no CPU time. bjobs displays the job idle factor, and both job overrun and job idle exceptions. Job 1 finished before the configured JOB_UNDERRUN threshold, so bjobs shows exception status of underrun:
bjobs -x -l -a
Job <2>, User, Project , Status , Queue , Command
Wed Aug 13 14:23:35: Submitted from host , CWD <$HOME>, Output File , Specified Hosts;
Wed Aug 13 14:23:43: Started on, Execution Home , Execution CWD ;
Resource usage collected.
IDLE_FACTOR(cputime/runtime): 0.00
MEM: 3 Mbytes; SWAP: 4 Mbytes; NTHREAD: 3
PGID: 5027; PIDs: 5027 5028 5029
SCHEDULING PARAMETERS:
          r15s r1m r15m ut pg io ls it tmp swp mem
loadSched -    -   -    -  -  -  -  -  -   -   -
loadStop  -    -   -    -  -  -  -  -  -   -   -
          cpuspeed bandwidth
loadSched -        -
loadStop  -        -
EXCEPTION STATUS: overrun idle
------------------------------------------------------------------------------
Job <1>, User, Project , Status , Command
Wed Aug 13 14:18:00: Submitted from host , CWD <$HOME>, Output File , Specified Hosts < hostB>;
Wed Aug 13 14:18:10: Started on, Execution Home , Execution CWD ;
Wed Aug 13 14:18:50: Done successfully. The CPU time used is 0.2 seconds.
SCHEDULING PARAMETERS:
          r15s r1m r15m ut pg io ls it tmp swp mem
loadSched -    -   -    -  -  -  -  -  -   -   -
loadStop  -    -   -    -  -  -  -  -  -   -   -
          cpuspeed bandwidth
loadSched -        -
loadStop  -        -
EXCEPTION STATUS: underrun
Use bacct -l -x to trace the history of job exceptions.
Changing Job Order Within Queues
By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come, first-served), subject to availability of suitable server hosts.
Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any users' jobs.
bbot
Moves jobs relative to your last job in the queue.
If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.
btop
Moves jobs relative to your first job in the queue.
If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.
Moving a job to the top of the queue
In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.
Note that user1's job is still in the same position on the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of that user's jobs moves up the queue, the rest move down.
bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5309  user2 PEND night  hostA               /s200    Oct 23 11:04
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5311  user2 PEND night  hostA               /s700    Oct 23 18:17
btop 5311
Job <5311> has been moved to position 1 from top.
bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5311  user2 PEND night  hostA               /s700    Oct 23 18:17
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5309  user2 PEND night  hostA               /s200    Oct 23 11:04
Switch Jobs from One Queue to Another
You can use the command bswitch to change jobs from one queue to another. This is useful if you submit a job to the wrong queue, or if the job is suspended because of queue thresholds or run windows and you would like to resume the job.
Switch a single job to a different queue
- Run bswitch to move pending and running jobs from queue to queue.
In the following example, job 5309 is switched to the priority queue:
bswitch priority 5309
Job <5309> is switched to queue
bjobs -u all
JOBID USER  STAT QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal   hostA     hostD     /job500  Oct 23 10:16
5309  user2 RUN  priority hostA     hostB     /job200  Oct 23 11:04
5311  user2 PEND night    hostA               /job700  Oct 23 18:17
5310  user1 PEND night    hostB               /myjob   Oct 23 13:45
Switch all jobs to a different queue
- Run bswitch -q from_queue to_queue 0 to switch all the jobs in a queue to another queue.
The -q option operates on all jobs in a queue, and the job ID 0 selects every job in that queue. The example below selects jobs from the night queue and switches them to the idle queue:
bswitch -q night idle 0
Job <5308> is switched to queue
Job <5310> is switched to queue
Forcing Job Execution
A pending job can be forced to run with the brun command. This operation can only be performed by an LSF administrator.
You can force a job to run on a particular host or to run until completion, and you can apply other restrictions. For more information, see the brun command.
When a job is forced to run, any other constraints associated with the job such as resource requirements or dependency conditions are ignored.
In this situation you may see some job slot limits, such as the maximum number of jobs that can run on a host, being violated. A job that is forced to run cannot be preempted.
Force a pending job to run
- Run brun -m hostname job_ID to force a pending job to run.
You must specify the host on which the job will run.
For example, the following command will force the sequential job 104 to run on hostA:
brun -m hostA 104
Suspending and Resuming Jobs
A job can be suspended by its owner or the LSF administrator. These jobs are considered user-suspended and are displayed by bjobs as USUSP.
If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming. This can be avoided by configuring preemptive queues.
Suspend a job
- Run bstop job_ID.
Your job goes into USUSP state if the job is already started, or into PSUSP state if it is pending.
bstop 3421
Job <3421> is being stopped
The above example suspends job 3421.
UNIX
bstop sends the following signals to the job:
- SIGTSTP for parallel or interactive jobs. SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.
- SIGSTOP for sequential jobs. SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.
Windows
bstop causes the job to be suspended.
Resume a job
- Run bresume job_ID:
bresume 3421
Job <3421> is being resumed
The above example resumes job 3421.
Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
Killing Jobs
The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.
Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.
On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
Kill a job
- Run bkill job_ID.
For example, the following command kills job 3421:
bkill 3421
Job <3421> is being terminated
Kill multiple jobs
- Run bkill 0 to kill all pending jobs in the cluster or use bkill 0 with the -g, -J, -m, -q, or -u options to kill all jobs that satisfy these options.
The following command kills all jobs dispatched to the hostA host:
bkill -m hostA 0
Job <267> is being terminated
Job <268> is being terminated
Job <271> is being terminated
The following command kills all jobs in the groupA job group:
bkill -g groupA 0
Job <2083> is being terminated
Job <2085> is being terminated
Kill a large number of jobs rapidly
Killing multiple jobs with
bkill 0
and other commands is usually sufficient for moderate numbers of jobs. However, killing a large number of jobs (approximately greater than 1000 jobs) can take a long time to finish.
- Run bkill -b to kill a large number of jobs faster than with normal means. However, jobs killed in this manner are not logged to lsb.acct.
Local pending jobs are killed immediately and cleaned up as soon as possible, ignoring the time interval specified by CLEAN_PERIOD in lsb.params. Other jobs are killed as soon as possible but cleaned up normally (after the CLEAN_PERIOD time interval).
If the -b option is used with bkill 0, it kills all applicable jobs and silently skips the jobs that cannot be killed.
The -b option is ignored if used with -r or -s.
Force removal of a job from LSF
- Run bkill -r to force the removal of the job from LSF. Use this option when a job cannot be killed in the operating system.
The bkill -r command removes a job from the LSF system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.
Sending a Signal to a Job
LSF uses signals to control jobs, to enforce scheduling policies, or in response to user requests. The principal signals LSF uses are SIGSTOP to suspend a job, SIGCONT to resume a job, and SIGKILL to terminate a job.
Occasionally, you may want to override the default actions. For example, instead of suspending a job, you might want to kill or checkpoint it. You can override the default job control actions by defining the JOB_CONTROLS parameter in your queue configuration. Each queue can have its own job control actions.
You can also send a signal directly to a job. You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF does allow you to kill, suspend and resume pending jobs.
You must be the owner of a job or an LSF administrator to send signals to a job.
You use the bkill -s command to send a signal to a job. If you issue bkill without the -s option, a SIGKILL signal is sent to the specified jobs to kill them. Twenty seconds before SIGKILL is sent, SIGTERM and SIGINT are sent to give the job a chance to catch the signals and clean up.
On Windows, job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.
Signals on different platforms
LSF translates signal numbers across different platforms because different host types may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued.
For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on HP-UX and SIGTSTP is defined as signal number 25, LSF sends signal 25 to the job.
Send a signal to a job
On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) man pages. On Windows, only customized applications are able to process job control messages specified with the -s option.
- Run bkill -s signal job_id, where signal is either the signal name or the signal number:
bkill -s TSTP 3421
Job <3421> is being signaled
The above example sends the TSTP signal to job 3421.
Using Job Groups
A collection of jobs can be organized into job groups for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees. Users can submit, view, and control jobs according to their groups rather than looking at individual jobs.
How job groups are created
Job groups can be created explicitly or implicitly:
- A job group is created explicitly with the bgadd command.
- A job group is created implicitly by the bsub -g or bmod -g command when the specified group does not exist. Job groups are also created implicitly when a default job group is configured (DEFAULT_JOBGROUP in lsb.params or LSB_DEFAULT_JOBGROUP environment variable).
Job groups created when jobs are attached to an SLA service class at submission are implicit job groups (bsub -sla service_class_name -g job_group_name). Job groups attached to an SLA service class with bgadd are explicit job groups (bgadd -sla service_class_name job_group_name).
The GRP_ADD event in lsb.events indicates how the job group was created:
- 0x01 - job group was created explicitly
- 0x02 - job group was created implicitly
For example:
"GRP_ADD" "7.02" 1193032735 1285 1193032735 0 "/Z" "" "user1" "" "" 2 0 "" -1 1
means job group /Z is an explicitly created job group.
Child groups can be created explicitly or implicitly under any job group.
Only an implicitly created job group which has no job group limit (bgadd -L) and is not attached to any SLA can be automatically deleted once it becomes empty. An empty job group is a job group that has no jobs associated with it (including finished jobs). NJOBS displayed by bjgroup is 0.
Job group hierarchy
Jobs in job groups are organized into a hierarchical tree similar to the directory structure of a file system. Like a file system, the tree contains groups (which are like directories) and jobs (which are like files). Each group can contain other groups or individual jobs. Job groups are created independently of jobs, and can have dependency conditions which control when jobs within the group are considered for scheduling.
Job group path
The job group path is the name and location of a job group within the job group hierarchy. Multiple levels of job groups can be defined to form a hierarchical tree. A job group can contain jobs and sub-groups.
Root job group
LSF maintains a single tree under which all jobs in the system are organized. The top-most level of the tree is represented by a top-level "root" job group, named "/". The root group is owned by the primary LSF Administrator and cannot be removed. Users and administrators create new groups under the root group. By default, if you do not specify a job group path name when submitting a job, the job is created under the top-level "root" job group, named "/".
The root job group is not displayed by job group query commands, and you cannot specify the root job group in commands.
Job group owner
Each group is owned by the user who created it. The login name of the user who creates the job group is the job group owner. Users can add job groups into groups that are owned by other users, and they can submit jobs to groups owned by other users. Child job groups are owned by the creator of the job group and the creators of any parent groups.
Job control under job groups
Job owners can control their own jobs attached to job groups as usual. Job group owners can also control any job under the groups they own and below.
For example:
- Job group /A is created by user1
- Job group /A/B is created by user2
- Job group /A/B/C is created by user3
All users can submit jobs to any job group, and control the jobs they own in all job groups. For jobs submitted by other users:
- user1 can control jobs submitted by other users in all 3 job groups: /A, /A/B, and /A/B/C
- user2 can control jobs submitted by other users only in 2 job groups: /A/B and /A/B/C
- user3 can control jobs submitted by other users only in job group /A/B/C
The LSF administrator can control jobs in any job group.
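The ownership rules in the example above can be expressed as a short check. The following Python sketch is only an illustration of the rule as stated here; the function can_control and its arguments are hypothetical and are not part of any LSF interface.

```python
def can_control(user, job_owner, job_group, group_owners, admins=()):
    """Return True if user may control a job attached to job_group.

    group_owners maps a job group path to the user who created it.
    admins is the collection of LSF administrator names.
    """
    # Job owners and LSF administrators can always control the job.
    if user == job_owner or user in admins:
        return True
    # Otherwise the user must own the job's group or one of its parents.
    path = job_group
    while path and path != "/":
        if group_owners.get(path) == user:
            return True
        path = path.rsplit("/", 1)[0]  # step up to the parent group
    return False
```

With group_owners set to {"/A": "user1", "/A/B": "user2", "/A/B/C": "user3"}, the check lets user1 control another user's job in /A/B/C, but does not let user3 control another user's job in /A/B.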
Default job group
You can specify a default job group for jobs submitted without explicitly specifying a job group. LSF associates the job with the job group specified with DEFAULT_JOBGROUP in lsb.params. The LSB_DEFAULT_JOBGROUP environment variable overrides the setting of DEFAULT_JOBGROUP. The bsub -g job_group_name option overrides both LSB_DEFAULT_JOBGROUP and DEFAULT_JOBGROUP.
Default job group specification supports macro substitution for project name (%p) and user name (%u). When you specify bsub -P project_name, the value of %p is the specified project name. If you do not specify a project name at job submission, %p is the project name defined by setting the environment variable LSB_DEFAULTPROJECT, or the project name specified by DEFAULT_PROJECT in lsb.params. The default project name is default.
For example, a default job group name specified by DEFAULT_JOBGROUP=/canada/%p/%u is expanded to the value for the LSF project name and the user name of the job submission user (for example, /canada/projects/user1).
Job group names must follow this format:
- Job group names must start with a slash character (/). For example, DEFAULT_JOBGROUP=/A/B/C is correct, but DEFAULT_JOBGROUP=A/B/C is not correct.
- Job group names cannot end with a slash character (/). For example, DEFAULT_JOBGROUP=/A/ is not correct.
- Job group names cannot contain more than one slash character (/) in a row. For example, job group names like DEFAULT_JOBGROUP=/A//B or DEFAULT_JOBGROUP=AB are not correct.
- Job group names cannot contain spaces. For example, DEFAULT_JOBGROUP=/A/B C/D is not correct.
- Project names and user names used for macro substitution with %p and %u cannot start or end with slash character (/).
- Project names and user names used for macro substitution with %p and %u cannot contain spaces or more than one slash character (/) in a row.
- Project names or user names containing slash character (/) will create separate job groups. For example, if the project name is canada/projects, DEFAULT_JOBGROUP=/%p results in a job group hierarchy /canada/projects.
Job group limits
Job group limits specified with bgadd -L apply to the job group hierarchy. The job group limit is an integer greater than or equal to zero (0), specifying the maximum number of running and suspended jobs under the job group (including child groups). If limit is zero (0), no jobs under the job group can run.
By default, a job group has no limit. Limits persist across mbatchd restart and reconfiguration.
You cannot specify a limit for the root job group. The root job group has no job limit. Job groups added with no limits specified inherit any limits of existing parent job groups. The -L option only limits the lowest level job group created.
The maximum number of running and suspended jobs (including USUSP and SSUSP) in a job group cannot exceed the limit defined on the job group and its parent job group.
The job group limit is based on the number of running and suspended jobs in the job group. If you specify a job group limit as 2, at most 2 jobs can run under the group at any time, regardless of how many jobs or job slots are used. If the number of currently available job slots is zero (0), LSF cannot dispatch a job to the job group even if the job group limit is not exceeded.
If a parallel job requests 2 CPUs (bsub -n 2), the job group limit is per job, not per slots used by the job.
A job array may also be under a job group, so job arrays also support job group limits.
Job group limits are not supported at job submission for job groups created automatically with bsub -g. Use bgadd -L before job submission.
Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards the job group limit.
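The way limits apply along a branch of the hierarchy can be sketched in Python. This is a minimal model of the rule stated above (a job can be dispatched only if the group and every parent group with a limit still has room); the function names, the dictionary layout, and the numbers in the usage note are hypothetical and do not reflect how mbatchd stores limits.

```python
def ancestors(path):
    """Yield a job group path and each parent, e.g. /a/b -> /a/b, /a."""
    while path and path != "/":
        yield path
        path = path.rsplit("/", 1)[0]


def can_dispatch(group, limits, active):
    """Return True if one more job may run under group.

    limits: job group path -> job limit (a missing key means no limit)
    active: job group path -> running and suspended jobs attached
            directly to that group
    """
    for g in ancestors(group):
        limit = limits.get(g)
        if limit is None:
            continue  # no limit defined at this level
        # Count jobs in g itself and in all of its child groups.
        used = sum(n for p, n in active.items()
                   if p == g or p.startswith(g + "/"))
        if used >= limit:
            return False
    return True
```

For example, with a limit of 12 on /canada and 6 on /canada/qa/auto, 7 active jobs in /canada/projects/test1 and 5 in /canada/qa/auto already fill the /canada limit, so can_dispatch("/canada/qa/auto", ...) returns False even though the group's own limit of 6 has not been reached.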
Examples
bgadd -L 6 /canada/projects/test
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.
The limits on child job groups cannot exceed the parent job group limit. For example, if /canada/projects has a limit of 5:
bgadd -L 6 /canada/projects/test
is rejected because /canada/projects/test attempts to increase the limit of its parent /canada/projects from 5 to 6.
Example job group hierarchy with limits
In this configuration:
- Every node is a job group, including the root (/) job group
- The root (/) job group cannot have any limit definition
- By default, child groups have the same limit definition as their direct parent group, so /asia, /asia/projects, and /asia/projects/test all have no limit
- The number of running and suspended jobs in a job group (including all of its child groups) cannot exceed the defined limit
- If there are 7 running or suspended jobs in job group /canada/projects/test1, even though the job limit of group /canada/qa/auto is 6, /canada/qa/auto can only have a maximum of 5 running and suspended jobs (12-7=5)
- When a job is submitted to a job group, LSF checks the limits for the entire job group. For example, for a job submitted to job group /canada/qa/auto, LSF checks the limits on groups /canada/qa/auto, /canada/qa and /canada. If any one limit in the branch of the hierarchy is exceeded, the job remains pending
- The zero (0) job limit for job group /canada/qa/manual means no job in the job group can enter running status
Create a job group
- Use the bgadd command to create a new job group.
You must provide full group path name for the new job group. The last component of the path is the name of the new group to be created:
bgadd /risk_group
The above example creates a job group named risk_group under the root group /.
bgadd /risk_group/portfolio1
The above example creates a job group named portfolio1 under job group /risk_group.
bgadd /risk_group/portfolio1/current
The above example creates a job group named current under job group /risk_group/portfolio1.
If the group hierarchy /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy.
Add a job group limit (bgadd)
- Run bgadd -L limit /job_group_name to specify a job limit for a job group.
Here, limit is an integer greater than or equal to zero (0), specifying the maximum number of running and suspended jobs under the job group (including child groups). If limit is zero (0), no jobs under the job group can run.
For example:
bgadd -L 6 /canada/projects/test
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.
Submit jobs under a job group
- Use the -g option of bsub to submit a job into a job group.
The job group does not have to exist before submitting the job.
bsub -g /risk_group/portfolio1/current myjob
Job <105> is submitted to default queue.
Submits myjob to the job group /risk_group/portfolio1/current.
If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.
If group /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy and the job is attached to the group.
-g and -sla options
Tip: Use -sla with -g to attach all jobs in a job group to a service class and have them scheduled as SLA jobs. Multiple job groups can be created under the same SLA. You can submit additional jobs to the job group without specifying the service class name again.
MultiCluster
In MultiCluster job forwarding mode, job groups only apply on the submission cluster, not on the execution cluster. LSF treats the execution cluster as an execution engine, and only enforces job group policies at the submission cluster.
Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards job group limits.
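As a rough illustration of how a job group limit and the MultiCluster rule above could interact, the following Python sketch counts running and suspended jobs under a group (including child groups), skips forwarded jobs, and checks every limited group on the path to the root. It is a model only, not LSF code: the Job record, the usage and has_room helpers, and the choice to count the RUN, SSUSP, and USUSP states are all assumptions made for this example.

```python
from typing import Dict, List, NamedTuple

# Assumption: "running and suspended" is modeled as these three job states.
COUNTED_STATES = {"RUN", "SSUSP", "USUSP"}


class Job(NamedTuple):
    group: str       # job group path, e.g. "/canada/projects/test"
    state: str       # job state as shown by bjobs
    forwarded: bool  # True if forwarded to an execution cluster (MultiCluster)


def usage(jobs: List[Job], group: str) -> int:
    """Count jobs charged against `group`, including its child groups."""
    prefix = group + "/"
    return sum(
        1
        for job in jobs
        if (job.group == group or job.group.startswith(prefix))
        and job.state in COUNTED_STATES
        and not job.forwarded  # forwarded jobs are not counted towards limits
    )


def has_room(jobs: List[Job], limits: Dict[str, int], group: str) -> bool:
    """True if no limited group on the path from `group` up to the root is full."""
    path = group
    while path:
        limit = limits.get(path)  # groups without an entry have no own limit
        if limit is not None and usage(jobs, path) >= limit:
            return False
        path = path.rsplit("/", 1)[0]  # step up to the parent group
    return True


# Hypothetical data for illustration only.
limits = {"/canada": 3, "/canada/projects/test": 2}
jobs = [
    Job("/canada/projects/test", "RUN", False),
    Job("/canada/projects/test", "RUN", True),   # forwarded, not counted
    Job("/canada/projects", "USUSP", False),
]
print(usage(jobs, "/canada"))
print(has_room(jobs, limits, "/canada/projects/test"))
```

With a limit of zero on any group in the path, has_room always returns False, which matches the statement that no jobs under such a group can run.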
View jobs in job groups
View job group information, and jobs running in specific job groups.
View information about job groups (bjgroup)
- Use the bjgroup command to see information about jobs in job groups.
bjgroup
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH  SLA  JLIMIT  OWNER
/A          0      0     0    0      0      0       ()   0/10    user1
/X          0      0     0    0      0      0       ()   0/-     user2
/A/B        0      0     0    0      0      0       ()   0/5     user1
/X/Y        0      0     0    0      0      0       ()   0/5     user2
- Use bjgroup -s to sort job groups by group hierarchy.
For example, for job groups named /A, /A/B, /X and /X/Y, bjgroup -s displays:
bjgroup -s
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH  SLA  JLIMIT  OWNER
/A          0      0     0    0      0      0       ()   0/10    user1
/A/B        0      0     0    0      0      0       ()   0/5     user1
/X          0      0     0    0      0      0       ()   0/-     user2
/X/Y        0      0     0    0      0      0       ()   0/5     user2
- Specify a job group name to show the hierarchy of a single job group:
bjgroup -s /X
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH  SLA      JLIMIT  OWNER
/X          25     0     25   0      0      0       puccini  25/100  user1
/X/Y        20     0     20   0      0      0       puccini  20/30   user1
/X/Z        5      0     5    0      0      0       puccini  5/10    user2
- Specify a job group name with a trailing slash character (/) to show only the root job group:
bjgroup -s /X/
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH  SLA      JLIMIT  OWNER
/X          25     0     25   0      0      0       puccini  25/100  user1
- Use bjgroup -N to display job group information by job slots instead of number of jobs. NSLOTS, PEND, RUN, SSUSP, USUSP, RSV are all counted in slots rather than number of jobs:
bjgroup -N
GROUP_NAME  NSLOTS  PEND  RUN  SSUSP  USUSP  RSV  SLA      OWNER
/X          25      0     25   0      0      0    puccini  user1
/A/B        20      0     20   0      0      0    wagner   batch
-N by itself shows job slot information for all job groups, and can be combined with -s to sort the job groups by hierarchy:
bjgroup -N -s
GROUP_NAME  NSLOTS  PEND  RUN  SSUSP  USUSP  RSV  SLA      OWNER
/A          0       0     0    0      0      0    wagner   batch
/A/B        0       0     0    0      0      0    wagner   user1
/X          25      0     25   0      0      0    puccini  user1
/X/Y        20      0     20   0      0      0    puccini  batch
/X/Z        5       0     5    0      0      0    puccini  batch
View jobs for a specific job group (bjobs)
- Run bjobs -g and specify a job group path to view jobs attached to the specified group.
bjobs -g /risk_group
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
113    user1  PEND  normal  hostA                 myjob     Jun 17 16:15
111    user2  RUN   normal  hostA      hostA      myjob     Jun 14 15:13
110    user1  RUN   normal  hostB      hostA      myjob     Jun 12 05:03
104    user3  RUN   normal  hostA      hostC      myjob     Jun 11 13:18
bjobs -l displays the full path to the group to which a job is attached:
bjobs -l -g /risk_group
Job <101>, User, Project , Job Group , Status , Queue , Command
Tue Jun 17 16:21:49: Submitted from host , CWD ;
...
Control jobs in job groups
Suspend and resume jobs in job groups, move jobs to different job groups, terminate jobs in job groups, and delete job groups.
Suspend jobs (bstop)
- Use the -g option of bstop and specify a job group path to suspend jobs in a job group:
bstop -g /risk_group 106
Job <106> is being stopped
- Use job ID 0 (zero) to suspend all jobs in a job group:
bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped
Resume suspended jobs (bresume)
- Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:
bresume -g /risk_group 106
Job <106> is being resumed
- Use job ID 0 (zero) to resume all jobs in a job group:
bresume -g /risk_group 0
Job <109> is being resumed
Job <110> is being resumed
Job <112> is being resumed
Move jobs to a different job group (bmod)
- Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another.
bmod -g /risk_group/portfolio2/monthly 105
moves job 105 to job group /risk_group/portfolio2/monthly.
Like bsub -g, if the job group does not exist, LSF creates it.
bmod -g cannot be combined with other bmod options. It can only operate on pending jobs. It cannot operate on running or finished jobs.
You can modify your own job groups and job groups that other users create under your job groups. The LSF administrator can modify job groups of all users.
You cannot move job array elements from one job group to another, only entire job arrays. If any job array elements in a job array are running, you cannot move the job array to another group. A job array can only belong to one job group at a time.
You cannot modify the job group of a job attached to a service class.
bhist -l shows job group modification information:
bhist -l 105
Job <105>, User, Project , Job Group , Command
Wed May 14 15:24:07: Submitted from host , to Queue , CWD <$HOME/lsf51/5.1/sparc-sol7-64/bin>;
Wed May 14 15:24:10: Parameters of Job are changed: Job group changes to: /risk_group/portfolio2/monthly;
Wed May 14 15:24:17: Dispatched to ;
Wed May 14 15:24:17: Starting (Pid 8602);
...
Terminate jobs (bkill)
- Use the -g option of bkill and specify a job group path to terminate jobs in a job group.
bkill -g /risk_group 106
Job <106> is being terminated
- Use job ID 0 (zero) to terminate all jobs in a job group:
bkill -g /risk_group 0
Job <1413> is being terminated
Job <1414> is being terminated
Job <1415> is being terminated
Job <1416> is being terminated
bkill only kills jobs in the job group you specify. It does not kill jobs in lower level job groups in the path. For example, jobs are attached to job groups /risk_group and /risk_group/consolidate:
bsub -g /risk_group myjob
Job <115> is submitted to default queue.
bsub -g /risk_group/consolidate myjob2
Job <116> is submitted to default queue.
The following bkill command only kills jobs in /risk_group, not the subgroup /risk_group/consolidate:
bkill -g /risk_group 0
Job <115> is being terminated
To kill jobs in /risk_group/consolidate, specify the path to the consolidate job group explicitly:
bkill -g /risk_group/consolidate 0
Job <116> is being terminated
Delete a job group manually (bgdel)
- Use the bgdel command to manually remove a job group. The job group cannot contain any jobs.
bgdel /risk_group
Job group /risk_group is deleted.
deletes the job group /risk_group and all its subgroups.
Normal users can only delete the empty groups they own that are specified by the requested job_group_name. These groups can be explicit or implicit.
- Run bgdel 0 to delete all empty job groups you own. These groups can be explicit or implicit.
- LSF administrators can use bgdel -u user_name 0 to delete all empty job groups created by specific users. These groups can be explicit or implicit.
Run bgdel -u all 0 to delete all the users' empty job groups and their subgroups. LSF administrators can delete empty job groups created by any user. These groups can be explicit or implicit.
- Run bgdel -c job_group_name to delete all empty groups below the requested job_group_name, including job_group_name itself.
Modify a job group limit (bgmod)
- Run bgmod to change a job group limit.
bgmod [-L limit | -Ln] /job_group_name
-L limit changes the limit of job_group_name to the specified value. If the job group has parent job groups, the new limit cannot exceed the limits of any higher level job groups. Similarly, if the job group has child job groups, the new value must be greater than any limits on the lower level job groups.
-Ln removes the existing job limit for the job group. If the job group has parent job groups, the modified job group automatically inherits any limits from its direct parent job group.
You must provide the full group path name for the modified job group. The last component of the path is the name of the job group to be modified.
Only root, LSF administrators, the job group creator, or the creator of the parent job groups can use bgmod to modify a job group limit.
The following command only modifies the limit of group /canada/projects/test1. It does not modify limits of /canada or /canada/projects.
bgmod -L 6 /canada/projects/test1
To modify limits of /canada or /canada/projects, you must specify the exact group name:
bgmod -L 6 /canada
or
bgmod -L 6 /canada/projects
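The parent and child constraints on a new limit can be pictured with a small Python model. This is only a sketch of the rules as stated for bgmod -L, not how LSF implements them: the check_new_limit helper and the sample limits are hypothetical, and treating a new limit equal to a child's limit as acceptable is an assumption (the text says "greater than").

```python
from typing import Dict, List, Optional


def ancestors(path: str) -> List[str]:
    """Return the ancestor group paths of `path`, nearest parent first."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def check_new_limit(
    limits: Dict[str, Optional[int]], path: str, new_limit: int
) -> List[str]:
    """Return a list of rule violations; an empty list means the limit fits."""
    problems = []
    if new_limit < 0:
        problems.append("limit must be zero or greater")
    # Rule 1: the new limit cannot exceed the limit of any higher level group.
    for group in ancestors(path):
        limit = limits.get(group)
        if limit is not None and new_limit > limit:
            problems.append(f"{new_limit} exceeds limit {limit} of {group}")
    # Rule 2: the new limit must not fall below limits on lower level groups.
    # Assumption: equality is accepted here.
    prefix = path.rstrip("/") + "/"
    for group, limit in limits.items():
        if group.startswith(prefix) and limit is not None and new_limit < limit:
            problems.append(f"{new_limit} is below limit {limit} of {group}")
    return problems


# Hypothetical hierarchy; None means the group has no limit of its own.
limits = {"/canada": 10, "/canada/projects": None, "/canada/projects/test1": 4}
print(check_new_limit(limits, "/canada/projects/test1", 6))
print(check_new_limit(limits, "/canada", 3))
```

In this model, raising /canada/projects/test1 to 6 passes because 6 does not exceed the limit of 10 on /canada, while lowering /canada to 3 is reported because a lower level group already has a limit of 4.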
Automatic job group cleanup
When an implicitly created job group becomes empty, it can be automatically deleted by LSF. Job groups that can be automatically deleted cannot:
- Have limits specified, including on their child groups
- Have explicitly created child job groups
- Be attached to any SLA
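The eligibility conditions above can be expressed as a simple predicate. The Python sketch below is an illustration under stated assumptions rather than LSF's actual logic: the JobGroup record and can_auto_delete function are hypothetical, njobs is assumed to include jobs in child groups, "child" is read as any descendant, and the sample hierarchy (/X, /X/Y, /X/Y/Z, /X/Y/Z/W with only /X/Y created explicitly) is made up to mirror this section's example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobGroup:
    explicit: bool               # True if created explicitly, False if implicit
    njobs: int = 0               # assumed to include jobs in child groups
    limit: Optional[int] = None  # job group limit, if any
    sla: Optional[str] = None    # attached service class, if any


def descendants(groups: Dict[str, JobGroup], path: str) -> List[str]:
    """Return the paths of all groups below `path`."""
    prefix = path + "/"
    return [p for p in groups if p.startswith(prefix)]


def can_auto_delete(groups: Dict[str, JobGroup], path: str) -> bool:
    """Apply the automatic cleanup conditions to one job group."""
    group = groups[path]
    # Only empty, implicitly created groups are candidates.
    if group.explicit or group.njobs > 0:
        return False
    # The group must not be attached to an SLA or carry a limit.
    if group.sla is not None or group.limit is not None:
        return False
    # No lower level group may be explicit or carry a limit.
    for child in descendants(groups, path):
        if groups[child].explicit or groups[child].limit is not None:
            return False
    return True


# Hypothetical hierarchy: only /X/Y was created explicitly.
groups = {
    "/X": JobGroup(explicit=False),
    "/X/Y": JobGroup(explicit=True),
    "/X/Y/Z": JobGroup(explicit=False),
    "/X/Y/Z/W": JobGroup(explicit=False),
}
print(sorted(p for p in groups if can_auto_delete(groups, p)))
```

Under these assumptions the predicate selects only /X/Y/Z and /X/Y/Z/W, leaving /X/Y (explicit) and /X (explicit child) in place.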
Configure JOB_GROUP_CLEAN=Y in lsb.params to enable automatic job group deletion.
For example, for a hierarchy of job groups /X, /X/Y, /X/Y/Z, and /X/Y/Z/W:
When automatic job group deletion is enabled, LSF only deletes job groups /X/Y/Z/W and /X/Y/Z. Job group /X/Y is not deleted because it is an explicitly created job group. Job group /X is also not deleted because it has an explicitly created child job group /X/Y.
Automatic job group deletion does not delete job groups attached to SLA service classes. Use bgdel to manually delete job groups attached to SLAs.
Handling Job Exceptions
You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize what exceptions are detected and their corresponding actions. By default, LSF does not detect any exceptions.
Run bjobs -d -m host_name to see exited jobs for a particular host.
Job exceptions LSF can detect
If you configure job exception handling in your queues, LSF detects the following job exceptions:
- Job underrun - job ends too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.
- Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
- Job estimated run time exceeded - the job's actual run time has exceeded the estimated run time.
- Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such "black hole", or "job-eating" hosts exit abnormally. By default, LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
If EXIT_RATE is not specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Customize job exception actions with the eadmin script
When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host.
You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.
In some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin.
Email job exception details
Set LSF to send you an email about job exceptions with details such as JOB_ID, RUN_TIME, IDLE_FACTOR (if job has been idle), USER, QUEUE, EXEC_HOST, and JOB_NAME.
- In lsb.params, set EXTEND_JOB_EXCEPTION_NOTIFY=Y.
- Set the format option in the eadmin script (LSF_SERVERDIR/eadmin on the master host).
  - Uncomment the JOB_EXCEPTION_EMAIL_FORMAT line and add a value for the format:
    JOB_EXCEPTION_EMAIL_FORMAT=fixed: The eadmin shell generates an exception email with a fixed length for the job exception information. For any given field, the characters truncate when the maximum is reached (between 10-19).
    JOB_EXCEPTION_EMAIL_FORMAT=full: The eadmin shell generates an exception email without a fixed length for the job exception information.
Default eadmin actions
For host-level exceptions, LSF closes the host and sends email to the LSF administrator. The email contains the host name, job exit rate for the host, and other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in lsb.events, and displayed by badmin hist and badmin hhist.
For job exceptions, LSF sends email to the LSF administrator. The email contains the job ID, exception type (overrun, underrun, idle job), and other job information.
An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 5 minutes, and 1 overrun job and 2 idle jobs are detected, after 5 minutes, eadmin is invoked and only one email is sent. If another overrun job is detected in the next 5 minutes, another email is sent.
Handling job initialization failures
By default, LSF handles job exceptions for jobs that exit after they have started running. You can also configure LSF to handle jobs that exit during initialization because of an execution environment problem, or because of a user action or LSF policy.
LSF detects that the jobs are exiting before they actually start running, and takes appropriate action when the job exit rate exceeds the threshold for specific hosts (EXIT_RATE in lsb.hosts) or for all hosts (GLOBAL_EXIT_RATE in lsb.params).
Use EXIT_RATE_TYPE in lsb.params to include job initialization failures in the exit rate calculation. The following table summarizes the exit rate types you can configure:
Table 1: Exit rate types you can configure
Exit rate type ...    Includes ...
JOBEXIT
  - Local exited jobs
  - Remote job initialization failures
  - Parallel job initialization failures on hosts other than the first execution host
  - Jobs exited by user action (e.g., bkill, bstop, etc.) or LSF policy (e.g., load threshold exceeded, job control action, advance reservation expired, etc.)
JOBEXIT_NONLSF (this is the default when EXIT_RATE_TYPE is not set)
  - Local exited jobs
  - Remote job initialization failures
  - Parallel job initialization failures on hosts other than the first execution host
JOBINIT
  - Local job initialization failures
  - Parallel job initialization failures on the first execution host
HPCINIT
  - Job initialization failures for Platform LSF HPC jobs
Job exits excluded from exit rate calculation
By default, jobs that are exited for non-host related reasons (user actions and LSF policies) are not counted in the exit rate calculation. Only jobs that exit for what LSF considers host-related problems are used to calculate a host exit rate.
The following cases are not included in the exit rate calculations:
- bkill, bkill -r
- brequeue
- RERUNNABLE jobs killed when a host is unavailable
- Resource usage limit exceeded (for example, PROCESSLIMIT, CPULIMIT, etc.)
- Queue-level job control action TERMINATE and TERMINATE_WHEN
- Checkpointing a job with the kill option (bchkpnt -k)
- Rerunnable job migration
- Job killed when an advance reservation has expired
- Remote lease job start fails
- Any jobs with an exit code found in SUCCESS_EXIT_VALUES, where a particular exit value is deemed as successful.
Excluding LSF and user-related job exits
To explicitly exclude jobs exited because of user actions or LSF-related policies from the job exit calculation, set EXIT_RATE_TYPE = JOBEXIT_NONLSF in lsb.params. JOBEXIT_NONLSF tells LSF to include all job exits except those that are related to user action or LSF policy. This is the default value for EXIT_RATE_TYPE.
To include all job exit cases in the exit rate count, you must set EXIT_RATE_TYPE = JOBEXIT in lsb.params. JOBEXIT considers all job exits.
Jobs killed by a signal external to LSF are still counted towards the exit rate.
Jobs killed because of job control SUSPEND action and RESUME action are still counted towards the exit rate. This is because LSF cannot distinguish between jobs killed from SUSPEND action and jobs killed by external signals.
If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
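To make the interaction between the exit rate types concrete, the following Python sketch resolves an EXIT_RATE_TYPE value into the set of exit categories it would count, using only the table and rules described in this section. It is a reading aid, not LSF code: the counted_categories function and the category labels are invented for this example, and it assumes multiple types are given as a space-separated list.

```python
from typing import Dict, Set

# Category labels are hypothetical shorthand for the rows of Table 1.
INCLUDES: Dict[str, Set[str]] = {
    "JOBEXIT": {
        "local_exit",
        "remote_init_failure",
        "parallel_init_failure_other_hosts",
        "user_or_policy_exit",
    },
    "JOBEXIT_NONLSF": {
        "local_exit",
        "remote_init_failure",
        "parallel_init_failure_other_hosts",
    },
    "JOBINIT": {"local_init_failure", "parallel_init_failure_first_host"},
    "HPCINIT": {"hpc_init_failure"},
}


def counted_categories(exit_rate_type: str = "") -> Set[str]:
    """Return the exit categories counted for a given EXIT_RATE_TYPE value."""
    # JOBEXIT_NONLSF is the default when EXIT_RATE_TYPE is not set.
    types = exit_rate_type.split() or ["JOBEXIT_NONLSF"]
    unknown = [t for t in types if t not in INCLUDES]
    if unknown:
        raise ValueError(f"unknown exit rate type(s): {unknown}")
    # If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
    if "JOBEXIT" in types and "JOBEXIT_NONLSF" in types:
        types = [t for t in types if t != "JOBEXIT"]
    counted: Set[str] = set()
    for t in types:
        counted |= INCLUDES[t]
    return counted


print(sorted(counted_categories()))
print("user_or_policy_exit" in counted_categories("JOBEXIT"))
print("user_or_policy_exit" in counted_categories("JOBEXIT JOBEXIT_NONLSF"))
```

The last two calls show the precedence rule: user and policy exits are counted with JOBEXIT alone, but not when JOBEXIT_NONLSF is also present.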
Local jobs
When EXIT_RATE_TYPE=JOBINIT, various job initialization failures are included in the exit rate calculation, including:
- Host-related failures; for example, incorrect user account, user permissions, incorrect directories for checkpointable jobs, host name resolution failed, or other execution environment problems
- Job-related failures; for example, pre-execution or setup problem, job file not created, etc.
Parallel jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failure on the first execution host does not count in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are counted in the exit rate calculation.
When EXIT_RATE_TYPE=JOBINIT, job initialization failures that happen on the first execution host are counted in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are not counted in the exit rate calculation.
tip:
For parallel job exit exceptions to be counted for all hosts, specify EXIT_RATE_TYPE=HPCINIT or EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT.
Remote jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures are counted as exited jobs on the remote execution host and are included in the exit rate calculation for that host. To include only local job initialization failures on the execution cluster in the exit rate calculation, set EXIT_RATE_TYPE to include only JOBINIT or HPCINIT.
Scaling and tuning job exit rate by number of slots
On large, multiprocessor hosts, use ENABLE_EXIT_RATE_PER_SLOT=Y in lsb.params to scale the job exit rate so that the host is only closed when the job exit rate is high enough in proportion to the number of processors on the host. This avoids having a relatively low exit rate close a host inappropriately.
Use a float value for GLOBAL_EXIT_RATE in lsb.params to tune the exit rate on multislot hosts. The actual calculated exit rate value is never less than 1.
Example: exit rate of 5 on single processor and multiprocessor hosts
On a single-processor host, a job exit rate of 5 is much more severe than on a 20-processor host. If a stream of jobs to a single-processor host is consistently failing, it is reasonable to close the host or take some other action after 5 failures.
On the other hand, for the same stream of jobs on a 20-processor host, it is possible that 19 of the processors are busy doing other work that is running fine. To close this host after only 5 failures would be wrong because effectively less than 5% of the jobs on that host are actually failing.
Example: float value for GLOBAL_EXIT_RATE on multislot hosts
Using a float value for GLOBAL_EXIT_RATE allows the exit rate to be less than the number of slots on the host. For example, on a host with 4 slots, GLOBAL_EXIT_RATE=0.25 gives an exit rate of 1. The same value on an 8 slot machine would be 2 and so on. On a single-slot host, the value is never less than 1.
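The arithmetic in this example can be written out as a short Python sketch. It assumes per-slot scaling is enabled (ENABLE_EXIT_RATE_PER_SLOT=Y) and that the scaled value is simply the rate multiplied by the slot count with a floor of 1; the effective_exit_rate function is hypothetical, and how LSF treats products that are not whole numbers is not stated here.

```python
def effective_exit_rate(global_exit_rate: float, slots: int) -> float:
    """Scale a per-slot exit rate by the slot count, never going below 1."""
    if slots < 1:
        raise ValueError("a host must have at least one slot")
    return max(1.0, global_exit_rate * slots)


# Values taken from the example: GLOBAL_EXIT_RATE=0.25 on 4, 8, and 1 slots.
for slots in (4, 8, 1):
    print(slots, effective_exit_rate(0.25, slots))
```

For 4 and 8 slots this gives 1 and 2, and on a single-slot host the floor of 1 applies, matching the values quoted in the example.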
For more information
- See Handling Host-level Job Exceptions for information about configuring host-level job exceptions.
- See Handling Job Exceptions in Queues for information about configuring job exceptions in queues.
Platform Computing Inc. www.platform.com |