Maui/Torque Job Deferred

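Today I submitted a job on the cluster and found that it stayed in the queued state (Q) after submission. After switching to a different node, the job ran normally. The cluster consists of one head node with 10 child nodes, and all Maui and TORQUE configuration lives on the head node. Version information: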

Ubuntu 12.04.4 LTS

Torque PBS 2.5.12

Maui 3.3.1

 

Output of qstat:

# qstat

Job id                    Name             User            Time Use S Queue

------------------------- ---------------- --------------- -------- - -----

66625.head               testpy19         qz                  0 Q temp

66626.child09            testpy19         qz                  0 R temp
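On a busier cluster the stuck job is harder to spot by eye. As a minimal sketch, the filter below keeps only the rows whose state column is Q. The function name `queued_jobs` is hypothetical, and the filter assumes the default six-column qstat layout shown above.

```shell
# Hypothetical helper: keep only jobs whose state column (5th field) is Q.
# Assumes the default qstat layout: Job id, Name, User, Time Use, S, Queue.
# The header and separator rows are dropped because their 5th field is not "Q".
queued_jobs() {
    awk '$5 == "Q"'
}

# On a live cluster this would be fed from qstat:
#   qstat | queued_jobs
```

With the listing above as input, this would leave only the row for job 66625.head.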

 

Tracing the job with checkjob shows that no resources were allocated to it:

 

# checkjob 66625

checking job 66625

 

State: Idle  EState: Deferred

Creds:  user:qz  group:qz  class:batch  qos:DEFAULT

WallTime: 00:00:00 of 1:00:00:00

SubmitTime: Wed Oct 14 16:52:37

 (Time Queued  Total: 00:00:31  Eligible: 00:00:00)

 

 Total Tasks: 1

 

 Req[0]  TaskCount: 1  Partition: ALL

 Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0

 Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]

 NodeCount: 1

 

 IWD: [NONE]  Executable:  [NONE]

 Bypass: 0  StartCount: 0

 PartitionMask: [ALL]

 Flags:       RESTARTABLE

 

 job is deferred.  Reason:  NoResources  (cannot create reservation for job '66625' (intital reservation attempt)

 )

 Holds:    Defer  (hold reason:  NoResources)

 PE:  1.00  StartPriority:  1

 cannot select job 66625 for partition DEFAULT (job hold active)

 

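At first I suspected the queue configuration or the Maui configuration. But since the problem occurred only on the head node and all the child nodes were running fine, the configuration was not the cause. The next step was to check the individual nodes: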

 

# checknode head

checking node head

State:      Down  (in current state for 00:00:00)

Configured Resources: PROCS: 24  MEM: 15G  SWAP: 16G  DISK: 1M

Utilized   Resources: PROCS: 24

Dedicated  Resources: [NONE]

Opsys:         linux  Arch:      [NONE]

Speed:      1.00  Load:       0.120

Network:    [DEFAULT]

Features:   [temp][normal][mpi][long][bigmem]

Attributes: [Batch]

Classes:    [temp 24:24][normal 24:24][mpi 24:24][long 24:24]

 

Total Time:   INFINITY  Up:   INFINITY (96.56%)  Active:   INFINITY (42.95%)

 

Reservations:

NOTE:  no reservations on node

 

# checknode child09

checking node child09

 

State:      Idle  (in current state for 00:40:17)

Configured Resources: PROCS: 12  MEM: 31G  SWAP: 47G  DISK: 1M

Utilized   Resources: SWAP: 5290M

Dedicated  Resources: [NONE]

Opsys:         linux  Arch:      [NONE]

Speed:      1.00  Load:       0.000

Network:    [DEFAULT]

Features:   [temp][normal][mpi][long]

Attributes: [Batch]

Classes:    [temp 12:12][normal 12:12][mpi 12:12][long 12:12]

 

Total Time:   INFINITY  Up:   INFINITY (98.61%)  Active:   INFINITY (17.80%)

 

Reservations:

NOTE:  no reservations on node

 

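Clearly the head node is not working properly (Maui reports it as Down), yet pbsnodes shows its state as free. After carefully comparing the status of the two nodes, I found that the head node has no sessions (nsessions=0) and carries an error message saying that the spool filesystem is full: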

# pbsnodes head

head

     state = free

     np = 24

     properties = normal,bigmem,long,mpi,temp

     ntype = cluster

     status = rectime=1444852351,varattr=,jobs=,state=free,netload=124602597243261,gres=,message=ERROR: torque spool filesystem full,loadave=0.00,ncpus=24,physmem=264108356kb,availmem=276266268kb,totmem=295356736kb,idletime=128,nusers=0,nsessions=0,uname=Linux mobs-head 3.5.0-45-generic #68~precise1-Ubuntu SMP Wed Dec 4 16:18:46 UTC 2013 x86_64,opsys=linux

     gpus = 0

 

# pbsnodes child09

child09

     state = free

     np = 12

     properties = normal,long,mpi,temp

     ntype = cluster

     status = rectime=1444852532,varattr=,jobs=,state=free,netload=77659293583815,gres=,loadave=0.00,ncpus=12,physmem=32901268kb,availmem=43868400kb,totmem=49285264kb,idletime=9836391,nusers=4,nsessions=10,sessions=510 1075 1101 1217 1233 1260 1295 1423 10483 12024,uname=Linux mobs-child09 3.5.0-45-generic #68~precise1-Ubuntu SMP Wed Dec 4 16:18:46 UTC 2013 x86_64,opsys=linux

     gpus = 0
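The error is easy to miss inside the long status line. A rough sketch for pulling it out is shown here; the helper name `node_message` is hypothetical, and it assumes the status line is a comma-separated list of key=value pairs, as in the output above. A message that itself contains a comma would be cut short.

```shell
# Hypothetical helper: print the message= field from a pbsnodes status line.
# Splits the comma-separated key=value pairs onto separate lines, then keeps
# only the value of the "message" key. Prints nothing if there is no message.
node_message() {
    tr ',' '\n' | sed -n 's/^message=//p'
}

# On a live cluster this would be fed from pbsnodes:
#   pbsnodes head | grep 'status =' | node_message
```

For the head node output above this would print the spool error, and for child09 it would print nothing.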

 

So I deleted the files that were no longer needed, about 26G in total:

# rm /var/spool/torque/server_logs/*

# rm /var/spool/torque/undelivered/*
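Removing everything with `rm *` also discards recent logs that might help explain why the spool filled up. A more conservative sketch, assuming GNU find and du, is to look at what is taking the space first and then delete only files older than a cutoff. The function names and the 30-day cutoff are illustrative assumptions, not part of the original fix.

```shell
# Hypothetical helper: show the largest entries under a directory (sizes in KB).
# $1 = directory, $2 = how many entries to show (default 10).
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Hypothetical helper: delete regular files older than N days, printing each one.
# $1 = directory, $2 = age in days.
prune_old_files() {
    find "$1" -type f -mtime +"$2" -print -delete
}

# On the head node this might look like (paths taken from the commands above):
#   largest_entries /var/spool/torque
#   prune_old_files /var/spool/torque/server_logs 30
```

Because `prune_old_files` prints every path it removes, its output can be kept as a record of what was deleted.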

 

After restarting pbs_mom, everything went back to normal:

# ps aux | grep pbs_mom

root      1169  0.0  0.0  32792 22428 ?        SLsl 15:32   0:00 pbs_mom -p

root     46849  0.0  0.0   9392   944 pts/1    S+   15:57   0:00 grep --color=auto pbs_mom

# kill -9 1169

# pbs_mom -p
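To confirm the restart took effect, Maui's view of the node can be checked again. As a rough sketch, the helper below pulls the state word out of checknode output; the name `node_state` is hypothetical, and the pattern assumes the `State:` line format shown in the checknode output earlier.

```shell
# Hypothetical helper: print the node state (e.g. Down, Idle) from checknode output.
# Matches the line beginning with "State:" and keeps the first word after it.
node_state() {
    sed -n 's/^State:[[:space:]]*\([A-Za-z]*\).*/\1/p'
}

# On a live cluster this would be fed from checknode:
#   checknode head | node_state
```

Before the cleanup this reports Down for the head node. Once pbs_mom is healthy again it should report something else.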

 
