Torque PBS installation: jobs stay in the Q state and never run

Question: jobs can be submitted and enter the queue, but they stay in the Q state and are never scheduled.

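The first cause of jobs sitting in the Q state was that the server's scheduling attribute had not been configured. This attribute is set inside qmgr with the command: set server scheduling=true. However, running that command produced the following error: qmgr obj= svr=default: Illegal attribute or resource value for scheduling. The attribute could not be set, so I wiped the existing server configuration, including the old queues, with the command: pbs_server -t create. After that, I configured the queue attributes again: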

Qmgr: create queue myque queue_type=execution
Qmgr: set server default_queue=myque
Qmgr: set queue myque started=true
Qmgr: set queue myque enabled=true
Qmgr: set server scheduling=true
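The same directives can also be fed to qmgr non-interactively with its -c option, which makes the setup easier to repeat after another pbs_server -t create. The following is only a sketch: the function name queue_setup_commands and the APPLY switch are made up for illustration, and by default the script just prints the directives instead of changing anything.

```shell
#!/usr/bin/env bash
# Print the qmgr directives used above for a given queue name.
# (queue_setup_commands is a hypothetical helper, not part of Torque.)
queue_setup_commands() {
    local queue="$1"
    cat <<EOF
create queue ${queue} queue_type=execution
set server default_queue=${queue}
set queue ${queue} started=true
set queue ${queue} enabled=true
set server scheduling=true
EOF
}

# Only touch the server when APPLY=1 is set and qmgr is installed;
# otherwise just show what would be sent.
if [ "${APPLY:-0}" = "1" ] && command -v qmgr >/dev/null 2>&1; then
    queue_setup_commands myque | while IFS= read -r directive; do
        qmgr -c "$directive"
    done
    qmgr -c "print server"   # dump the resulting server and queue settings
else
    queue_setup_commands myque
fi
```

Running it as root on the server host with APPLY=1 would apply the directives one by one; the final print server call makes it easy to confirm that scheduling really ended up as True.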

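After this configuration, submitted jobs still stayed in the Q state, and qstat -f showed that no execution node was assigned to a job after submission. Forcing a job to run with qrun assigned it to a node that was currently in the free state, but the job still did not execute.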

Output of qstat:

1.node90 STDIN admin 0 Q myque
3.node90 testpbs freeman 0 Q myque

After running qrun 1.node90, qstat -f shows:

Job Id: 1.node90
    Job_Name = STDIN
    Job_Owner = admin@node90
    job_state = Q
    queue = myque
    server = node90
    Checkpoint = u
    ctime = Sat Jun  7 21:29:40 2014
    Error_Path = node90:/var/spool/torque/STDIN.e32
    exec_host = nodelhj/0
    exec_port = 15003
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Jun  7 21:44:48 2014
    Output_Path = node90:/var/spool/torque/STDIN.o32
    Priority = 0
    qtime = Sat Jun  7 21:29:40 2014
    Rerunable = True
    substate = 10
    Variable_List = PBS_O_QUEUE=myque,PBS_O_HOME=/home/admin,
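(The remaining, less important fields are omitted because the output is too long.) The job started with qrun was assigned an exec_host but still did not execute, and the jobs that were not started with qrun still had no execution node.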

Output of tracejob 1.node90:

06/08/2014 10:35:18 S enqueuing into myque, state 1 hop 1
06/08/2014 10:35:18 A queue=myque
06/08/2014 10:45:47 S enqueuing into myque, state 1 hop 1
06/08/2014 10:45:47 S Requeueing job, substate: 10 Requeued in queue: myque
06/08/2014 10:51:55 S enqueuing into myque, state 1 hop 1
06/08/2014 10:51:55 S Requeueing job, substate: 10 Requeued in queue: myque
06/08/2014 10:52:38 S Job Run at request of root@node90
06/08/2014 10:52:38 S unable to run job, MOM rejected/rc=-1
06/08/2014 10:52:38 S unable to run job, send to MOM '168036859' failed

Looking at server_logs then shows the following errors:

06/08/2014 16:27:36;0001;PBS_Server.31118;Svr;PBS_Server;LOG_ERROR::Operation now in progress (115) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 10.4.9.251:15003]
06/08/2014 16:27:36;0001;PBS_Server.31118;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host nodelhj:15003
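Since the log says the TCP connection to the MOM failed, one quick thing to try from the server host is whether that address and port can be reached at all. This is a rough sketch, not a Torque tool: failed_addr and can_connect are hypothetical helper names, the sample line is shortened from the log above, and the probe relies on bash's /dev/tcp feature and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Pull "host port" out of an "[addr = host:port]" fragment in a server log line.
failed_addr() {
    sed -n 's/.*\[addr = \([^]:]*\):\([0-9]*\)].*/\1 \2/p'
}

# Return 0 if a TCP connection to host:port opens within 3 seconds.
can_connect() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Shortened copy of the error line from server_logs shown above.
line='tcp_connect_sockaddr, connect() failed [rc = -2] [addr = 10.4.9.251:15003]'
addr="$(printf '%s\n' "$line" | failed_addr)"
host="${addr% *}"
port="${addr#* }"
echo "server failed to reach ${host} port ${port}"

# Usage on the server host (commented out so the sketch makes no network calls):
# can_connect "$host" "$port" && echo "reachable" || echo "not reachable"
```

If the port is not reachable, it may be worth checking that pbs_mom is actually running on the compute node and that no firewall sits between the two hosts, before looking at anything else.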

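This means the compute node rejected the job. Someone later suggested that the cause might be that passwordless SSH login was not set up, so I configured passwordless SSH. For the details of that configuration, see: http://blog.csdn.net/leexide/article/details/17252369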

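That still did not solve the problem. I then noticed that the clock on the MOM node was set to a US time zone. After I changed the time zone, jobs started with qrun executed correctly. This is how I changed the time zone: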

[root@nodelhj torque]# date
Thu Jun  5 06:01:59 PDT 2014
[root@nodelhj torque]# set date
[root@nodelhj torque]# tzselect
Please identify a location so that time zone rules can be set correctly.
Please select a continent or ocean.
 1) Africa
 2) Americas
 3) Antarctica
 4) Arctic Ocean
 5) Asia
 6) Atlantic Ocean
 7) Australia
 8) Europe
 9) Indian Ocean
10) Pacific Ocean
11) none - I want to specify the time zone using the Posix TZ format.
#? 5
Please select a country.
 1) Afghanistan  18) Israel    35) Palestine
 2) Armenia  19) Japan    36) Philippines
 3) Azerbaijan  20) Jordan    37) Qatar
 4) Bahrain  21) Kazakhstan    38) Russia
 5) Bangladesh  22) Korea (North)    39) Saudi Arabia
 6) Bhutan  23) Korea (South)    40) Singapore
 7) Brunei  24) Kuwait    41) Sri Lanka
 8) Cambodia  25) Kyrgyzstan    42) Syria
 9) China  26) Laos    43) Taiwan
10) Cyprus  27) Lebanon    44) Tajikistan
11) East Timor  28) Macau    45) Thailand
12) Georgia  29) Malaysia    46) Turkmenistan
13) Hong Kong  30) Mongolia    47) United Arab Emirates
14) India  31) Myanmar (Burma)    48) Uzbekistan
15) Indonesia  32) Nepal    49) Vietnam
16) Iran  33) Oman    50) Yemen
17) Iraq  34) Pakistan
#? 9
Please select one of the following time zone regions.
1) east China - Beijing, Guangdong, Shanghai, etc.
2) Heilongjiang (except Mohe), Jilin
3) central China - Sichuan, Yunnan, Guangxi, Shaanxi, Guizhou, etc.
4) most of Tibet & Xinjiang
5) west Tibet & Xinjiang
#? 1


The following information has been given:


China
east China - Beijing, Guangdong, Shanghai, etc.


Therefore TZ='Asia/Shanghai' will be used.
Local time is now: Thu Jun  5 21:04:44 CST 2014.
Universal Time is now: Thu Jun  5 13:04:44 UTC 2014.
Is the above information OK?
1) Yes
2) No
#? 1


You can make this change permanent for yourself by appending the line
TZ='Asia/Shanghai'; export TZ
to the file '.profile' in your home directory; then log out and log in again.


Here is that TZ value again, this time on standard output so that you
can use the /usr/bin/tzselect command in shell scripts:
Asia/Shanghai

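However, after rebooting the machine the time zone change had not persisted, so I changed it by hand (edit the clock configuration file and replace /etc/localtime; the change takes effect as soon as it is saved):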

vi /etc/sysconfig/clock
ZONE=Asia/Shanghai
UTC=false
ARC=false
(the valid ZONE names are the files under /usr/share/zoneinfo)

rm /etc/localtime

ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
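To confirm that the manual change took effect, and that it survives a reboot, the symlink target and the current zone can be printed on each node. A small sketch, assuming /etc/localtime is a symlink into a zoneinfo directory as set up above; zone_from_link is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Derive the zone name from the target of a localtime symlink,
# e.g. /usr/share/zoneinfo/Asia/Shanghai -> Asia/Shanghai.
# A path without "/zoneinfo/" in it is printed unchanged.
zone_from_link() {
    local target="$1"
    printf '%s\n' "${target#*/zoneinfo/}"
}

# Show what this host currently uses.
if [ -L /etc/localtime ]; then
    zone_from_link "$(readlink -f /etc/localtime)"
fi
date '+%Z %z'   # zone abbreviation and UTC offset, e.g. "CST +0800" for Asia/Shanghai
```

Running the same two checks on the server (node90) and on the MOM node (nodelhj) and comparing the output should show whether both hosts now agree.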

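With that problem solved, every job can be executed with qrun, but jobs are still not scheduled automatically.

The scheduler log under sched_logs finally contains entries now, yet no scheduling actually takes place. So I have started installing Maui, hoping that it will schedule and run the jobs.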
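Because pbs_server does not dispatch queued jobs by itself and depends on a separate scheduler daemon, it can help to check which scheduler, if any, is running before and after switching to Maui. The sketch below assumes the daemon process names are pbs_sched (Torque's bundled scheduler) and maui; find_scheduler is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Given a process listing on stdin (one command name per line), print any
# known scheduler daemon found and return 0; return 1 if none is found.
find_scheduler() {
    awk '$NF == "pbs_sched" || $NF == "maui" { print $NF; found = 1 }
         END { exit !found }'
}

if ps -e -o comm= | find_scheduler; then
    echo "a scheduler daemon is running"
else
    echo "no scheduler daemon found; queued jobs will likely start only via qrun"
fi
```

Only one scheduler should normally be talking to pbs_server at a time, so if both names show up it is probably worth stopping pbs_sched before relying on Maui.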


