1. Edit $LSF_ENVDIR/lsf.conf. The LSF installation might already have added some of the following parameters:
Note: LSF uses -j
2. From a Cray login node, run the $LSF_BINDIR/genVnodeConf command.
This command generates a list of compute nodes in BATCH mode. You can add the compute nodes to $LSF_ENVDIR/lsf.cluster.
HOSTNAME model type server r1m mem swp RESOURCES
nid00038 ! ! 1 3.5 () () (craylinux vnode)
nid00039 ! ! 1 3.5 () () (craylinux vnode)
nid00040 ! ! 1 3.5 () () (craylinux vnode)
nid00041 ! ! 1 3.5 () () (craylinux vnode)
nid00042 ! ! 1 3.5 () () (craylinux vnode gpu)
nid00043 ! ! 1 3.5 () () (craylinux vnode gpu)
nid00044 ! ! 1 3.5 () () (craylinux vnode)
nid00045 ! ! 1 3.5 () () (craylinux vnode)
nid00046 ! ! 1 3.5 () () (craylinux vnode)
nid00047 ! ! 1 3.5 () () (craylinux vnode)
nid00048 ! ! 1 3.5 () () (craylinux vnode)
nid00049 ! ! 1 3.5 () () (craylinux vnode)
nid00050 ! ! 1 3.5 () () (craylinux vnode)
nid00051 ! ! 1 3.5 () () (craylinux vnode)
nid00052 ! ! 1 3.5 () () (craylinux vnode gpu)
nid00053 ! ! 1 3.5 () () (craylinux vnode gpu)
nid00054 ! ! 1 3.5 () () (craylinux vnode)
nid00055 ! ! 1 3.5 () () (craylinux vnode)
nid00056 ! ! 1 3.5 () () (craylinux vnode)
nid00057 ! ! 1 3.5 () () (craylinux vnode)
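Before copying the generated list into the cluster file, it can help to check which nodes carry which resources. The following is a minimal sketch, assuming the column layout shown above with the resource list as the last parenthesised group on each line. The helper name `parse_host_lines` is hypothetical and is not part of LSF.

```python
import re

def parse_host_lines(text):
    """Return {hostname: [resources]} for host lines like the listing above."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and the header row.
        if not line or line.startswith("#") or line.startswith("HOSTNAME"):
            continue
        name = line.split()[0]
        # Assumption: the resource list is the last (...) group on the line.
        groups = re.findall(r"\(([^)]*)\)", line)
        hosts[name] = groups[-1].split() if groups else []
    return hosts

sample = """\
HOSTNAME model type server r1m mem swp RESOURCES
nid00041 ! ! 1 3.5 () () (craylinux vnode)
nid00042 ! ! 1 3.5 () () (craylinux vnode gpu)
"""

hosts = parse_host_lines(sample)
gpu_nodes = sorted(h for h, res in hosts.items() if "gpu" in res)
print(gpu_nodes)
```

Running the same function over the full genVnodeConf output gives a quick way to confirm that every compute node lists both craylinux and vnode.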
3. Configure $LSF_ENVDIR/hosts. Make sure that the IP addresses of the compute nodes do not conflict with any IP address that is already in use.
cat $LSF_ENVDIR/hosts
10.128.0.34 nid00033 c0-0c1s0n3 sdb001 sdb002
10.128.0.61 nid00060 c0-0c1s1n0 login login1 castor-p2
10.128.0.36 nid00035 c0-0c1s1n3
10.128.0.59 nid00058 c0-0c1s2n0
10.128.0.38 nid00037 c0-0c1s2n3
10.128.0.57 nid00056 c0-0c1s3n0
10.128.0.58 nid00057 c0-0c1s3n1
10.128.0.39 nid00038 c0-0c1s3n2
10.128.0.40 nid00039 c0-0c1s3n3
10.128.0.55 nid00054 c0-0c1s4n0
10.128.0.56 nid00055 c0-0c1s4n1
10.128.0.41 nid00040 c0-0c1s4n2
10.128.0.42 nid00041 c0-0c1s4n3
10.128.0.53 nid00052 c0-0c1s5n0
10.128.0.54 nid00053 c0-0c1s5n1
10.128.0.43 nid00042 c0-0c1s5n2
10.128.0.44 nid00043 c0-0c1s5n3
10.128.0.51 nid00050 c0-0c1s6n0
10.128.0.52 nid00051 c0-0c1s6n1
10.128.0.45 nid00044 c0-0c1s6n2
10.128.0.46 nid00045 c0-0c1s6n3
10.128.0.49 nid00048 c0-0c1s7n0
10.128.0.50 nid00049 c0-0c1s7n1
10.128.0.47 nid00046 c0-0c1s7n2
10.128.0.48 nid00047 c0-0c1s7n3
10.131.255.251 sdb sdb-p2 syslog ufs
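The conflict check in this step can be automated. The sketch below is one possible approach, assuming the usual hosts-file layout of an IP address followed by one or more names. It reports IP addresses that appear on more than one line and names that map to more than one address. The function name `find_conflicts` and the conflicting sample entry are made up for illustration.

```python
from collections import defaultdict

def find_conflicts(hosts_text):
    """Return (duplicate_ips, duplicate_names) found in hosts-file text."""
    ip_entries = defaultdict(list)   # ip -> list of name lists, one per line
    name_ips = defaultdict(set)      # name -> set of IPs it is mapped to
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        ip_entries[ip].append(names)
        for name in names:
            name_ips[name].add(ip)
    dup_ips = {ip: e for ip, e in ip_entries.items() if len(e) > 1}
    dup_names = {n: sorted(ips) for n, ips in name_ips.items() if len(ips) > 1}
    return dup_ips, dup_names

sample = """\
10.128.0.39 nid00038 c0-0c1s3n2
10.128.0.40 nid00039 c0-0c1s3n3
10.128.0.40 nid00099 c0-0c9s9n9   # hypothetical conflicting entry
"""

dup_ips, dup_names = find_conflicts(sample)
print("duplicate IPs:", dup_ips)
print("duplicate names:", dup_names)
```

This only detects conflicts inside the file itself. Addresses already in use elsewhere on the network still need to be checked separately.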
4. Edit $LSF_ENVDIR/lsbatch/
Begin Host
HOST_NAME MXJ r1m pg ls tmp DISPATCH_WINDOW # Keywords
nid00060 9999 () () () () () # Example
nid00062 9999 () () () () () # Example
default ! () () () () () # Example
End Host
5. Edit $LSF_ENVDIR/lsbatch/
loadSched/loadStop lines.
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
NICE = 20
PREEMPTION = PREEMPTABLE
JOB_CONTROLS = SUSPEND[bmig $LSB_BATCH_JID]
RERUNNABLE = Y
#RUN_WINDOW = 5:19:00-1:8:30 20:00-8:30
#r1m = 0.7/2.0 # loadSched/loadStop
#r15m = 1.0/2.5
#pg = 4.0/8
#ut = 0.2
#io = 50/240
#CPULIMIT = 180/hostA # 3 hours of hostA
#FILELIMIT = 20000
#DATALIMIT = 20000 # jobs data segment limit
#CORELIMIT = 20000
#TASKLIMIT = 5 # job task limit
#USERS = all # users who can submit jobs to this queue
#HOSTS = all # hosts on which jobs in this queue can run
#PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#APS_PRIORITY = WEIGHT[[RSRC, 10.0] [MEM, 20.0] [PROC, 2.5] [QPRIORITY, 2.0]] \
#LIMIT[[RSRC, 3.5] [QPRIORITY, 5.5]] \
#GRACE_PERIOD[[QPRIORITY, 200s] [MEM, 10m] [PROC, 2h]]
DESCRIPTION = For normal low priority jobs, running only if hosts are lightly loaded.
End Queue
Begin Queue
QUEUE_NAME = owners
PRIORITY = 43
JOB_CONTROLS = SUSPEND[bmig $LSB_BATCH_JID]
RERUNNABLE = YES
PREEMPTION = PREEMPTIVE
NICE = 10
#RUN_WINDOW = 5:19:00-1:8:30 20:00-8:30
r1m = 1.2/2.6
#r15m = 1.0/2.6
#r15s = 1.0/2.6
pg = 4/15
io = 30/200
swp = 4/1
tmp = 1/0
#CPULIMIT = 24:0/hostA # 24 hours of hostA
#FILELIMIT = 20000
#DATALIMIT = 20000 # jobs data segment limit
#CORELIMIT = 20000
#TASKLIMIT = 5 # job task limit
#USERS = user1 user2
#HOSTS = hostA hostB
#ADMINISTRATORS = user1 user2
#PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
DESCRIPTION = For owners of some machines, only users listed in the HOSTS\
section can submit jobs to this queue.
End Queue
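To review which loadSched/loadStop thresholds are currently active in each queue, a small parser can list the uncommented load-index lines per queue. This is only a sketch: it assumes one `KEY = value` pair per line, as in the listings above, and the set of load-index names is the built-in LSF set as I understand it. The function name `active_thresholds` is hypothetical.

```python
# Assumption: built-in LSF load index names that accept loadSched/loadStop values.
LOAD_INDICES = {"r15s", "r1m", "r15m", "ut", "pg", "io", "ls", "it", "tmp", "swp", "mem"}

def active_thresholds(queues_text):
    """Map each queue name to its uncommented (index, value) threshold lines."""
    result = {}
    name, found = None, []
    for raw in queues_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            # Section markers, comments, and continuation lines land here.
            if line == "End Queue" and name is not None:
                result[name] = found
                name, found = None, []
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.split("#", 1)[0].strip()  # drop trailing comments
        if key == "QUEUE_NAME":
            name = value
        elif key in LOAD_INDICES:
            found.append((key, value))
    return result

sample = """\
Begin Queue
QUEUE_NAME = normal
#r1m = 0.7/2.0 # loadSched/loadStop
End Queue
Begin Queue
QUEUE_NAME = owners
r1m = 1.2/2.6
pg = 4/15
#r15m = 1.0/2.6
End Queue
"""

thresholds = active_thresholds(sample)
for queue, lines in thresholds.items():
    print(queue, lines)
```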
6. Edit $LSF_ENVDIR/lsf.shared.
Make sure that the following Boolean resources are defined in the Resource section:
vnode Boolean () () (sim node)
gpu Boolean () () (gpu)
frontnode Boolean () () (login/service node)
craylinux Boolean () () (Cray XT/XE MPI)
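A quick way to confirm this step is to scan the file for the four resource names. The sketch below assumes each definition sits on one line with the resource name first and the type second, as shown above. For simplicity it does not restrict the search to the Resource section. `missing_boolean_resources` is a hypothetical helper name.

```python
REQUIRED = {"vnode", "gpu", "frontnode", "craylinux"}

def missing_boolean_resources(shared_text, required=REQUIRED):
    """Return the required resource names with no Boolean definition in the text."""
    defined = set()
    for line in shared_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2 and fields[1] == "Boolean":
            defined.add(fields[0])
    return sorted(set(required) - defined)

sample = """\
vnode Boolean () () (sim node)
gpu Boolean () () (gpu)
craylinux Boolean () () (Cray XT/XE MPI)
"""

missing = missing_boolean_resources(sample)
print("missing:", missing)
```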
7. By default, LSF_CRAY_RUR_ACCOUNTING=Y is enabled so that LSF uses Resource Utilization Reporting (RUR). If RUR is not installed in your environment, you must disable RUR by setting LSF_CRAY_RUR_ACCOUNTING=N in lsf.conf.
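The effective setting can be read back from lsf.conf with a few lines of code. This sketch assumes plain `NAME=value` lines and treats a missing parameter as enabled, matching the default described in this step. `rur_accounting_enabled` is a hypothetical helper name.

```python
def rur_accounting_enabled(conf_text):
    """Return True unless lsf.conf text sets LSF_CRAY_RUR_ACCOUNTING to N."""
    value = "Y"  # default described in this step: enabled
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        if line.startswith("LSF_CRAY_RUR_ACCOUNTING="):
            # The last assignment in the file wins in this sketch.
            value = line.split("=", 1)[1].strip().strip('"')
    return value.upper() == "Y"

print(rur_accounting_enabled("LSF_CRAY_RUR_ACCOUNTING=N\n"))
```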
8. Edit /etc/opt/cray/rur/rur.conf.
Disable the default prolog and epilog scripts by commenting out the following lines in the apsys section:
apsys
# prologPath - location of the executable file to be run before application
# prologPath /usr/local/adm/sbin/prolog
# epilogPath - location of the executable file to be run after application
# epilogPath /usr/local/adm/sbin/epilog
# prologTimeout - time in seconds before prolog is aborted as "hung"
# prologTimeout 10
# epilogTimeout - time in seconds before epilog is aborted as "hung"
# epilogTimeout 10
# prologPath /opt/cray/rur/default/bin/rur_prologue.py
# epilogPath /opt/cray/rur/default/bin/rur_epilogue.py
# prologTimeout 100
# epilogTimeout 100
/apsys
9. Edit /etc/opt/cray/alps/alps.conf.
Disable the default prolog and epilog scripts by commenting out the following lines in the apsys section:
apsys
# prologPath - location of the executable file to be run before application
# prologPath /usr/local/adm/sbin/prolog
# epilogPath - location of the executable file to be run after application
# epilogPath /usr/local/adm/sbin/epilog
# prologTimeout - time in seconds before prolog is aborted as "hung"
# prologTimeout 10
# epilogTimeout - time in seconds before epilog is aborted as "hung"
# epilogTimeout 10
# prologPath /opt/cray/rur/default/bin/rur_prologue.py
# epilogPath /opt/cray/rur/default/bin/rur_epilogue.py
# prologTimeout 100
# epilogTimeout 100
/apsys
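After editing, it is easy to miss one line that is still active. The following sketch scans the text between the `apsys` and `/apsys` markers and returns any prolog or epilog setting that is not commented out. It assumes the markers appear on their own lines, as in the listing above. `active_apsys_settings` is a hypothetical helper name.

```python
def active_apsys_settings(conf_text):
    """Return uncommented prolog/epilog lines found between 'apsys' and '/apsys'."""
    keys = ("prologPath", "epilogPath", "prologTimeout", "epilogTimeout")
    inside, active = False, []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line == "apsys":
            inside = True
        elif line == "/apsys":
            inside = False
        elif inside and line and not line.startswith("#"):
            # Only report the four settings this step asks to disable.
            if line.split()[0] in keys:
                active.append(line)
    return active

sample = """\
apsys
# prologPath /opt/cray/rur/default/bin/rur_prologue.py
epilogPath /opt/cray/rur/default/bin/rur_epilogue.py
# prologTimeout 100
/apsys
"""

still_active = active_apsys_settings(sample)
print(still_active)
```

An empty list means that every prolog and epilog line in the section is commented out.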
10. Restart the alps daemon on the login node to apply the changes to the alps.conf and rur.conf files.
/etc/init.d/alps restart
11. Use the service command to start and stop the LSF services as needed.