Introduction to the LSF System

Original article: https://blog.csdn.net/augusdi/article/details/45587373

See also: https://blog.csdn.net/appleml/article/details/46712971

 

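LSF (Load Sharing Facility) is a distributed resource management tool used to schedule, monitor, and analyze the load on networked computers. Through centralized monitoring and scheduling, it lets the computers' CPU, memory, disk, license, and other resources be fully shared.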

 

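LSF sites in China


http://scc.ustc.edu.cn/zh_CN/ Supercomputing Center of the University of Science and Technology of China
http://www.sccas.cn/gb/index.html Supercomputing Center of the Chinese Academy of Sciences
http://www.ssc.net.cn/ Shanghai Supercomputer Center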

 

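LSF terminology

Cluster
A group of computers that run the LSF software and are connected by a TCP/IP network. The resources within a cluster are monitored and scheduled as a whole.
Server Host
A computer in the cluster that can both submit and execute jobs.
Client Host
A computer in the cluster that can only submit jobs.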

 

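Environment settings

The LSF configuration is normally loaded at login, so no further setup is needed. The following commands can be used to check the LSF settings: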

env | grep LSF 

LSF_ENVDIR=/opt/lsf/conf

LSF_BINDIR=/opt/lsf/6.2/hpuxia64/bin

LSF_LIBDIR=/opt/lsf/6.2/hpuxia64/lib

LSF_SERVERDIR=/opt/lsf/6.2/hpuxia64/etc

XLSF_UIDDIR=/opt/lsf/6.2/hpuxia64/lib/uid

echo $PATH

...:/opt/lsf/6.2/hpuxia64/etc:/opt/lsf/6.2/hpuxia64/bin:…
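The check above can also be scripted. The following is a minimal sketch, assuming that the four LSF_* variables shown in the env output are the ones that matter on your system; the function name missing_lsf_vars is made up for illustration.

```shell
# Print the name of every LSF variable that is unset or empty.
# The variable names are the ones shown in the `env | grep LSF` output above.
missing_lsf_vars() {
    for name in LSF_ENVDIR LSF_BINDIR LSF_LIBDIR LSF_SERVERDIR; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}

missing=$(missing_lsf_vars)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "LSF variables not set:" $missing
else
    echo "LSF environment looks configured"
fi
```

If any variable is reported as missing, the LSF profile was probably not loaded at login.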

Basic commands

The basic commands are: bsub, bqueues, bhosts, bjobs, bkill, bhist, and bacct.

bsub    

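Description

bsub submits a job. Commonly used options are -n, -q, -o, -e, and -J.

-n

Number of cores the job requires. Optional; the default is 1.

-q

Name of the queue in which the job runs. Optional; the default is xfer.

-o

File to which the job's standard output (stdout) is written. Optional; the default name is $job_id.out.

-e

File to which the job's standard error (stderr) is written. Optional; the default name is $job_id.err.

-J

Name of the job as shown in the queue. Optional; the default is the name of the program being run.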

 

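Examples

1. Serial (non-parallel) program, where my_program is the program to run:

bsub ./my_program

bsub -n 1 -q xfer -o output.txt -e error.txt -J TEST ./my_program

Both commands run the same program and submit it to the xfer queue. They differ in where stdout and stderr are written. In addition, the second job appears in the queue under the name TEST, whereas the first appears as my_program.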
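The same options can also be kept in a job script instead of on the command line: when a script is passed to bsub on standard input, lines beginning with #BSUB are read as options. The following is a minimal sketch that writes such a script; the file name job.lsf and its location are hypothetical choices, and the options mirror the second command above.

```shell
# Write a job script whose #BSUB lines carry the same options as the
# second bsub command above. The file name job.lsf is a hypothetical choice.
job_script="${TMPDIR:-/tmp}/job.lsf"
cat > "$job_script" <<'EOF'
#!/bin/sh
#BSUB -n 1
#BSUB -q xfer
#BSUB -o output.txt
#BSUB -e error.txt
#BSUB -J TEST
./my_program
EOF

# bsub only reads the #BSUB lines when the script arrives on standard input,
# so the submission command is printed here rather than run.
echo "Submit with: bsub < $job_script"
```

Keeping the options in a file makes a submission easy to repeat and to review later.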

  

                  

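2. Parallel program, where my_small_job and my_parallel_job are the programs to run:

bsub -n 4 -q xfer /work1/my_small_job

bsub -n 16 -q mono my_parallel_job

The two examples differ in how the program's location is given: if the full path is not specified, the program is assumed to be in the current directory.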

 

             

 

3. MPI program, where mpi_program is the program to run:

 

 

$ bsub -n 16 -q mono mpirun -np 4 mpi_program 
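Note that the command above requests 16 cores from LSF but starts 4 MPI ranks; whether that is intended depends on the program (for example, each rank may run several threads). The following sketch assumes the simpler case of one rank per requested core and only prints the command line instead of submitting it. The function name mpi_bsub_cmd is made up for illustration.

```shell
# Print (not run) a bsub command line that starts as many MPI ranks as
# cores requested, so that -n and -np cannot drift apart.
# Assumption: one MPI rank per core is what the program wants.
mpi_bsub_cmd() {
    cores=$1
    queue=$2
    program=$3
    echo "bsub -n $cores -q $queue mpirun -np $cores $program"
}

mpi_bsub_cmd 16 mono mpi_program
```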

 

bqueues  

Description

Shows the status of each queue.

Example

bqueues

QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
mono        50    Open:Active  80   80    4     -     224    160   64   0
xfer        50    Open:Active  80   32    4     -     0      0     0    0
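To see at a glance which queues are backed up, the output can be filtered with awk. This is a minimal sketch that assumes the column layout shown above, where PEND is the 9th column; the function name queues_with_pending is made up, and the sample data is copied from the output above.

```shell
# Read bqueues output on stdin and print "queue pending_jobs" for every
# queue that has jobs waiting. Assumes PEND is the 9th column, as above.
queues_with_pending() {
    awk 'NR > 1 && $9 > 0 { print $1, $9 }'
}

# Sample data from the output above. On a cluster this would be:
#   bqueues | queues_with_pending
queues_with_pending <<'EOF'
QUEUE_NAME PRIO STATUS      MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
mono       50   Open:Active 80  80   4    -    224   160  64  0
xfer       50   Open:Active 80  32   4    -    0     0    0   0
EOF
```

With the sample data, only mono is listed, with 160 pending jobs.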

 

bhosts  

Description

Shows the status of each host in the queueing system.

Example

bhosts

 

HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
hale       ok      -     88   64     64   0      0      0
halen      ok      -     8    0      0    0      0      0
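The number of free job slots on each host can be estimated as MAX minus NJOBS. This is a minimal sketch assuming the column layout shown above (MAX in column 4, NJOBS in column 5); hosts whose MAX is not a number are skipped. The function name free_slots is made up, and the sample data is copied from the output above.

```shell
# Read bhosts output on stdin and print "host free_slots" per host,
# where free_slots = MAX - NJOBS. Hosts without a numeric MAX are skipped.
free_slots() {
    awk 'NR > 1 && $4 ~ /^[0-9]+$/ { print $1, $4 - $5 }'
}

# Sample data from the output above. On a cluster this would be:
#   bhosts | free_slots
free_slots <<'EOF'
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
hale      ok     -    88  64    64  0     0     0
halen     ok     -    8   0     0   0     0     0
EOF
```

With the sample data, hale has 24 free slots and halen has 8.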

 

 bjobs          

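Description

Shows the status of jobs in the queueing system. Commonly used options are -u, -q, -l, and -p.

-u  User account to query. Optional; the default is the current account.

-q  Queue name to query.

-l  Long format: displays detailed job information.

-p  Shows only pending jobs.

With no options, bjobs shows only the current user's running and pending jobs. To see other users' jobs, pass the account name to -u, or use all to query every user. (Note: all can also be used as the queue name.)

Example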

bjobs -u all

JOBID  USER     STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
4001   u11aaa0  RUN   mono   halen      32*hale    *igen-1010  Jan 1 02:03
4002   u12bbb0  RUN   mono   halen      32*hale    *igen-1020  Jan 1 03:04
4003   u13ccc0  RUN   xfer   halen      hale       *igen-1030  Jan 1 04:05
4004   u14ddd0  PEND  mono   halen                 *igen-1040  Jan 1 05:06
4005   u15eee0  PEND  mono   halen                 *igen-1050  Jan 1 06:07
4006   u16fff0  PEND  mono   halen                 *igen-1060  Jan 1 07:08
4007   u17ggg0  PEND  mono   halen                 *igen-1070  Jan 1 08:09
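A quick summary of how many jobs are running and how many are waiting can be derived from this listing. This is a minimal sketch assuming STAT is the 3rd column, as in the output above; the function name count_by_state is made up, and the sample data is copied from the listing above.

```shell
# Read bjobs output on stdin and count jobs in the RUN and PEND states.
# Assumes STAT is the 3rd column; blank lines are ignored.
count_by_state() {
    awk 'NR > 1 && NF > 0 { n[$3]++ }
         END { printf "RUN=%d PEND=%d\n", n["RUN"], n["PEND"] }'
}

# Sample data from the listing above. On a cluster this would be:
#   bjobs -u all | count_by_state
count_by_state <<'EOF'
JOBID  USER     STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
4001   u11aaa0  RUN   mono   halen      32*hale    *igen-1010  Jan 1 02:03
4002   u12bbb0  RUN   mono   halen      32*hale    *igen-1020  Jan 1 03:04
4003   u13ccc0  RUN   xfer   halen      hale       *igen-1030  Jan 1 04:05
4004   u14ddd0  PEND  mono   halen                 *igen-1040  Jan 1 05:06
4005   u15eee0  PEND  mono   halen                 *igen-1050  Jan 1 06:07
4006   u16fff0  PEND  mono   halen                 *igen-1060  Jan 1 07:08
4007   u17ggg0  PEND  mono   halen                 *igen-1070  Jan 1 08:09
EOF
```

With the sample data, 3 jobs are running and 4 are pending.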

 

bkill  

Description

Terminates or suspends a job.

Example

$ bkill 4001

  

bhist  

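Description

Shows the execution history of jobs (including finished jobs) and the reason they ended. Commonly used options are -b and -l.

-b  Brief format.
-l  Detailed information.

Example

$ bhist -l 4001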

Job <4001>, User , Project , Command

Sat Jan 1 21:31:06: Submitted from host , to Queue , CWD , Output File , 32 Processors Requested;

Sun Jan 2 17:10:38: Dispatched to 32 Hosts/Processors <32*hale>;

Sun Jan 2 17:10:38: Starting (Pid 18479);

Sun Jan 2 17:10:38: Running with execution home , Execution CWD , Execution Pid <10000>;

Summary of time in seconds spent in various states by Sun Jan 2 16:04:18

 

PEND    PSUSP  RUN     USUSP  SSUSP  UNKWN  TOTAL
243572  0      168820  0      0      0      412392
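The summary line shows how the job's lifetime was split between states: here PEND plus RUN equals TOTAL (243572 + 168820 = 412392 seconds). As a small worked example, the share of time spent waiting can be computed as follows; the function name pend_percent is made up for illustration.

```shell
# Print the percentage of a job's total time that was spent pending.
# Arguments: PEND seconds, TOTAL seconds (from the bhist summary line).
pend_percent() {
    awk -v pend="$1" -v total="$2" 'BEGIN {
        if (total > 0) printf "%.1f\n", 100 * pend / total
        else           print "n/a"
    }'
}

# Values taken from the summary above.
pend_percent 243572 412392
```

For job 4001 this gives about 59.1, meaning the job spent more time waiting in the queue than running.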

  

bacct  

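Description

Reports CPU time and related accounting statistics for jobs run in the queueing system. Commonly used options are -C, -l, -q, and -u.

-C  Time interval. Optional; if omitted, the interval runs from the start of the system's records to the present.

-l  Detailed information. Optional; if omitted, only summary statistics are shown, not each individual record.

-q  Queue name. Optional; if omitted, all queues are included.

-u  User account. Optional.

Example

bacct -u user -q mono -C 03/01,05/31

Reports usage for the jobs that user submitted to the mono queue between March 1 and May 31 of the current year.

Detailed usage of all the commands above is available from the man pages, for example: man bacct, man bjobs, …

The following is part of the manual shown by man bsub in a terminal; it lists more bsub options: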

         -a
          Specifies one or more application-specific esub executables
          that you want LSF to associate with the job.


          -app
          Submits the job to the specified application profile.


          -ar
          Specifies that the job is autoresizable.


          -B
          Sends mail to you when the job is dispatched and begins
          execution.


          -b
          Dispatches the job for execution on or after the specified
          date and time.


          -C
          Sets a per-process (soft) core file size limit for all the
          processes that belong to this job.


          -c
          Limits the total CPU time the job can use.


          -clusters
          MultiCluster only. Specifies cluster names when submitting
          jobs.


          -cwd
          Specifies the current working directory for job execution.


          -D
          Sets a per-process (soft) data segment size limit for each of
          the processes that belong to the job.


          -dc_chkpntvm
          Dynamic Cluster only. Enable VM job checkpointing by
          specifying an initial checkpoint time and recurring checkpoint
          interval.


          -dc_livemigvm
          Dynamic Cluster only. Specifies whether the job can be live
          migrated when its hypervisor host is selected for host memory
          defragmentation.


          -dc_mtype
          Dynamic Cluster only. Specifies the machine type for the job.


          -dc_tmpl
          Dynamic Cluster only. Specifies the Dynamic Cluster templates
          that the job can use.


          -dc_vmaction
          Dynamic Cluster only. Specifies the VM behavior if this job is
          preempted.


          -E
          Runs the specified job-based pre-execution command on the
          execution host before actually running the job.


          -e
          Appends the standard error output of the job to the specified
          file path.


          -env
          Controls the propagation of the specified job submission
          environment variables to the execution hosts.


          -eo
          Overwrites the standard error output of the job to the
          specified file path.


          -Ep
          Runs the specified job-based post-execution command on the
          execution host after the job finishes.


          -ext
          Specifies application-specific external scheduling options for
          the job.


          -F
          Sets a per-process (soft) file size limit for each of the
          processes that belong to the job.


          -f
          Copies a file between the local (submission) host and the
          remote (execution) host.


          -freq
          Specifies a CPU frequency for a job.


          -G
          For fairshare scheduling. Associates the job with the
          specified group.


          -g
          Submits jobs in the specified job group.


          -H
          Holds the job in the PSUSP state when the job is submitted.


          -hl
          Enables job-level host-based memory and swap limit enforcement
  
          -hostfile
          Submits a job with a user-specified host file.


          -I
          Submits an interactive job.


          -i
          Gets the standard input for the job from specified file path.


          -Ip
          Submits an interactive job and creates a pseudo-terminal when
          the job starts.


          -IS
          Submits an interactive job under a secure shell (ssh).


          -Is
          Submits an interactive job and creates a pseudo-terminal with
          shell mode when the job starts.


          -is
          Gets the standard input for the job from the specified file
          path, but allows you to modify or remove the input file before
          the job completes.


          -ISp
          Submits an interactive job under a secure shell (ssh) and
          creates a pseudo-terminal when the job starts.


          -ISs
          Submits an interactive job under a secure shell (ssh) and
          creates a pseudo-terminal with shell mode support when the job
          starts.


          -IX
          Submits an interactive X-Window job.


          -J
          Assigns the specified name to the job, and, for job arrays,
          specifies the indices of the job array and optionally the
          maximum number of jobs that can run at any given time.


          -Jd
          Assigns the specified description to the job; for job arrays,
          specifies the same job description for all elements in the job
          array.


          -jsdl
          Submits a job using a JSDL file that uses the LSF extension to
          specify job submission options.


          -jsdl_strict
          Submits a job using a JSDL file that only uses the standard
          JSDL elements and POSIX extensions to specify job submission
          options.


          -K
          Submits a job and waits for the job to complete. Sends job
          status messages to the terminal.


          -k
          Makes a job checkpointable and specifies the checkpoint
          directory.


          -L
          Initializes the execution environment using the specified
          login shell.


          -Lp
          Assigns the job to the specified License Scheduler project.


          -M
          Sets a per-process (soft) memory limit for all the processes
          that belong to this job.


          -m
          Runs the job on one of the specified hosts or host groups, or
          within the specified compute units.


          -mig
          Specifies the migration threshold for checkpointable or
          rerunnable jobs, in minutes.


          -N
          Sends the job report to you by mail when the job finishes.


          -n
          Submits a parallel job and specifies the number of tasks in
          the job.


          -network
          For LSF IBM Parallel Environment (IBM PE) integration.
          Specifies the network resource requirements to enable
          network-aware scheduling for IBM PE jobs.


          -o
          Appends the standard output of the job to the specified file
          path.


          -oo
          Overwrites the standard output of the job to the specified
          file path.


          -outdir
          Creates the job output directory.


          -P
          Assigns the job to the specified project.


          -p
          Sets the limit of the number of processes to the specified
          value for the whole job.


          -pack
          Submits job packs instead of an individual job.


          -Q
          Specify automatic job requeue exit values.


          -q
          Submits the job to one of the specified queues.


          -R
          Runs the job on a host that meets the specified resource
          requirements.


          -r
          Reruns a job if the execution host or the system fails; it
          does not rerun a job if the job itself fails.


          -rn
          Specifies that the job is never rerunnable.


          -rnc
          Specifies the full path of an executable to be invoked on the
          first execution host when the job allocation has been modified
          (both shrink and grow).


          -S
          Sets a per-process (soft) stack segment size limit for each of
          the processes that belong to the job.


          -s
          Sends the specified signal when a queue-level run window
          closes.


          -sla
          Specifies the service class where the job is to run.


          -sp
          Specifies user-assigned job priority that orders all jobs
          (from all users) in a queue.


          -T
          Sets the limit of the number of concurrent threads to the
          specified value for the whole job.


          -t
          Specifies the job termination deadline.


          -ti
          Enables automatic orphan job termination at the job level for
          a job with a dependency expression (set using -w).


          -tty
          When submitting an interactive job, displays output/error
          messages on the screen (except pre-execution output/error
          messages).


          -U
          If an advance reservation has been created with the brsvadd
          command, the job makes use of the reservation.


          -u
          Sends mail to the specified email destination.


          -ul
          Passes the current operating system user shell limits for the
          job submission user to the execution host.


          -v
          Sets the total process virtual memory limit to the specified
          value for the whole job.


          -W
          Sets the runtime limit of the job.


          -w
          LSF does not place your job unless the dependency expression
          evaluates to TRUE.


          -wa
          Specifies the job action to be taken before a job control
          action occurs.


          -We
          Specifies an estimated run time for the job.


          -wt
          Specifies the amount of time before a job control action
          occurs that a job warning action is to be taken.


          -x
          Puts the host running your job into exclusive execution mode.


          -XF
          Submits a job using SSH X11 forwarding.
      

 

 

 
