Today I submitted a job on the cluster and it stayed stuck in the queued state (Q). After switching to a different node, the job ran normally. The cluster consists of one head node and 10 child nodes; all Maui and TORQUE configuration lives on the head node. Version information:
Ubuntu 12.04.4 LTS
Torque PBS 2.5.12
Maui 3.3.1
The queue status according to qstat:
# qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
66625.head testpy19 qz 0 Q temp
66626.child09 testpy19 qz 0 R temp
Tracing the job shows that no resources have been allocated to it:
# checkjob 66625
checking job 66625
State: Idle EState: Deferred
Creds: user:qz group:qz class:batch qos:DEFAULT
WallTime: 00:00:00 of 1:00:00:00
SubmitTime: Wed Oct 14 16:52:37
(Time Queued Total: 00:00:31 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for job '66625' (intital reservation attempt)
)
Holds: Defer (hold reason: NoResources)
PE: 1.00 StartPriority: 1
cannot select job 66625 for partition DEFAULT (job hold active)
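A Deferred hold like this clears on its own after Maui's DEFERTIME elapses, or it can be released manually with Maui's releasehold command. When checking many jobs, the defer reason can also be pulled out of the checkjob output with a little text processing; a minimal sketch, using the sample line from above (in practice you would pipe `checkjob <jobid>` into the same sed expression):

```shell
# Extract the defer reason from a line of checkjob output.
# The sample text is copied from the checkjob run above.
checkjob_out="job is deferred.  Reason:  NoResources  (cannot create reservation for job '66625' (intital reservation attempt)"
reason=$(printf '%s\n' "$checkjob_out" | sed -n 's/.*Reason:[[:space:]]*\([A-Za-z]*\).*/\1/p')
echo "$reason"   # NoResources
```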
At first I suspected a problem with the queue or Maui configuration, but since the problem only occurred on the head node while all the child nodes ran fine, the configuration could not be at fault. The next step was to check each node individually:
# checknode head
checking node head
State: Down (in current state for 00:00:00)
Configured Resources: PROCS: 24 MEM: 15G SWAP: 16G DISK: 1M
Utilized Resources: PROCS: 24
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.120
Network: [DEFAULT]
Features: [temp][normal][mpi][long][bigmem]
Attributes: [Batch]
Classes: [temp 24:24][normal 24:24][mpi 24:24][long 24:24]
Total Time: INFINITY Up: INFINITY (96.56%) Active: INFINITY (42.95%)
Reservations:
NOTE: no reservations on node
# checknode child09
checking node child09
State: Idle (in current state for 00:40:17)
Configured Resources: PROCS: 12 MEM: 31G SWAP: 47G DISK: 1M
Utilized Resources: SWAP: 5290M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [temp][normal][mpi][long]
Attributes: [Batch]
Classes: [temp 12:12][normal 12:12][mpi 12:12][long 12:12]
Total Time: INFINITY Up: INFINITY (98.61%) Active: INFINITY (17.80%)
Reservations:
NOTE: no reservations on node
Clearly the head node is not working properly, even though pbsnodes reports its state as free. After carefully comparing the status of the two nodes, I noticed that the head node has no sessions at all (nsessions=0), and that there is an error message saying the spool filesystem is full:
# pbsnodes head
head
state = free
np = 24
properties = normal,bigmem,long,mpi,temp
ntype = cluster
status = rectime=1444852351,varattr=,jobs=,state=free,netload=124602597243261,gres=,message=ERROR: torque spool filesystem full,loadave=0.00,ncpus=24,physmem=264108356kb,availmem=276266268kb,totmem=295356736kb,idletime=128,nusers=0,nsessions=0,uname=Linux mobs-head 3.5.0-45-generic #68~precise1-Ubuntu SMP Wed Dec 4 16:18:46 UTC 2013 x86_64,opsys=linux
gpus = 0
# pbsnodes child09
child09
state = free
np = 12
properties = normal,long,mpi,temp
ntype = cluster
status = rectime=1444852532,varattr=,jobs=,state=free,netload=77659293583815,gres=,loadave=0.00,ncpus=12,physmem=32901268kb,availmem=43868400kb,totmem=49285264kb,idletime=9836391,nusers=4,nsessions=10,sessions=510 1075 1101 1217 1233 1260 1295 1423 10483 12024,uname=Linux mobs-child09 3.5.0-45-generic #68~precise1-Ubuntu SMP Wed Dec 4 16:18:46 UTC 2013 x86_64,opsys=linux
gpus = 0
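This kind of failure is easy to catch before it bites. A small sketch of a disk check on the spool path; /var/spool/torque is the default spool location, and the 90% threshold is an arbitrary choice of mine, not a Torque setting:

```shell
# Warn when the filesystem holding the Torque spool is nearly full.
# SPOOL_DIR defaults to the standard spool path; 90% is an example threshold.
SPOOL_DIR=${SPOOL_DIR:-/var/spool/torque}
usage=$(df -P "$SPOOL_DIR" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
    echo "WARNING: filesystem for $SPOOL_DIR is ${usage}% full"
else
    echo "OK: filesystem for $SPOOL_DIR is ${usage}% full"
fi
```

Dropped into cron on every node, a check like this would surface the full spool long before jobs start piling up in Q.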
So, time to delete the useless files… about 26 GB of them:
# rm /var/spool/torque/server_logs/*
# rm /var/spool/torque/undelivered/*
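A blanket rm works in an emergency, but for routine housekeeping a retention-based cleanup is safer, since recent server logs are still useful for debugging. A sketch; the 30-day retention is an arbitrary example value:

```shell
# Prune Torque server logs older than 30 days instead of deleting everything.
# LOG_DIR defaults to the path used above; adjust the retention to taste.
LOG_DIR=${LOG_DIR:-/var/spool/torque/server_logs}
find "$LOG_DIR" -type f -mtime +30 -print -delete
```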
After restarting pbs_mom, everything returned to normal:
# ps aux | grep pbs_mom
root 1169 0.0 0.0 32792 22428 ? SLsl 15:32 0:00 pbs_mom -p
root 46849 0.0 0.0 9392 944 pts/1 S+ 15:57 0:00 grep --color=auto pbs_mom
# kill -9 1169
# pbs_mom -p
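To confirm the daemon really came back, the ps | grep pattern above can be replaced with pgrep, which avoids matching the grep process itself; a small sketch:

```shell
# Check that pbs_mom is running; pgrep -x matches the exact process name,
# so the "grep --color=auto pbs_mom" self-match seen above cannot occur.
if pgrep -x pbs_mom >/dev/null; then
    echo "pbs_mom is running"
else
    echo "pbs_mom is NOT running"
fi
```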