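Question: jobs can be submitted and enter the queue, but they stay in the Q state indefinitely and are never scheduled.
The first cause of jobs sitting in the Q state without running was that the server's scheduling attribute had not been set. This attribute is configured inside qmgr with the command: set server scheduling=true. Running that command, however, produced the following error: qmgr obj= svr=default: Illegal attribute or resource value for scheduling. The attribute simply could not be set, so I wiped the existing queues with the command pbs_server -t create and then configured the queue attributes again: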
Qmgr: create queue myque queue_type=execution
Qmgr: set server default_queue=myque
Qmgr: set queue myque started=true
Qmgr: set queue myque enabled=true
Qmgr: set server scheduling=true
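The same settings can also be applied non-interactively with qmgr -c, which makes them easy to re-run after pbs_server -t create. The following is only a minimal sketch, assuming qmgr is on the PATH and is run by a Torque manager account (typically root on the server host). The function name apply_queue_config and the DRY_RUN variable are made up for illustration and are not part of Torque.

```shell
#!/bin/sh
# Hypothetical helper: apply the queue settings shown above via `qmgr -c`.
# Note the underscores in queue_type and default_queue.
# With DRY_RUN=1 the commands are only printed, not executed.
apply_queue_config() {
    queue="$1"
    for cmd in \
        "create queue $queue queue_type=execution" \
        "set server default_queue=$queue" \
        "set queue $queue started=true" \
        "set queue $queue enabled=true" \
        "set server scheduling=true"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "qmgr -c \"$cmd\""
        else
            # Stop at the first command that qmgr rejects.
            qmgr -c "$cmd" || return 1
        fi
    done
}
```

Setting DRY_RUN=1 before calling apply_queue_config myque prints the five commands so they can be reviewed before anything is changed on the server.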
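After this configuration, submitted jobs still stayed in the Q state, and qstat -f showed that no execution node was assigned to them. When a job was forced with qrun, it was assigned to a node currently in the free state, but it still did not run.
qstat then showed: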
1.node90 STDIN admin 0 Q myque
3.node90 testpbs freeman 0 Q myque
After qrun 1.node90, qstat -f showed:
Job Id: 1.node90
Job_Name = STDIN
Job_Owner = admin@node90
job_state = Q
queue = myque
server = node90
Checkpoint = u
ctime = Sat Jun 7 21:29:40 2014
Error_Path = node90:/var/spool/torque/STDIN.e32
exec_host = nodelhj/0
exec_port = 15003
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sat Jun 7 21:44:48 2014
Output_Path = node90:/var/spool/torque/STDIN.o32
Priority = 0
qtime = Sat Jun 7 21:29:40 2014
Rerunable = True
substate = 10
Variable_List = PBS_O_QUEUE=myque,PBS_O_HOME=/home/admin,
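The remaining, less important fields are omitted because the output is too long. The job started with qrun was assigned an exec_host but still did not run, while the jobs that were not started with qrun still had no execution node.
tracejob 1.node90 then showed: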
06/08/2014 10:35:18 S enqueuing into myque, state 1 hop 1
06/08/2014 10:35:18 A queue=myque
06/08/2014 10:45:47 S enqueuing into myque, state 1 hop 1
06/08/2014 10:45:47 S Requeueing job, substate: 10 Requeued in queue: myque
06/08/2014 10:51:55 S enqueuing into myque, state 1 hop 1
06/08/2014 10:51:55 S Requeueing job, substate: 10 Requeued in queue: myque
06/08/2014 10:52:38 S Job Run at request of root@node90
06/08/2014 10:52:38 S unable to run job, MOM rejected/rc=-1
06/08/2014 10:52:38 S unable to run job, send to MOM '168036859' failed
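The number in send to MOM '168036859' failed appears to be an IPv4 address written as a single 32-bit integer. Decoding it is plain arithmetic and gives 10.4.9.251, the same address that shows up for nodelhj in the server_logs error. A small sketch of the conversion (the function name int_to_ip is illustrative only):

```shell
#!/bin/sh
# Convert a 32-bit integer into dotted-quad IPv4 notation
# by extracting each of the four bytes, most significant first.
int_to_ip() {
    n="$1"
    echo "$(( ($n >> 24) & 255 )).$(( ($n >> 16) & 255 )).$(( ($n >> 8) & 255 )).$(( $n & 255 ))"
}

int_to_ip 168036859   # prints 10.4.9.251
```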
Looking at server_logs then revealed the following errors:
06/08/2014 16:27:36;0001;PBS_Server.31118;Svr;PBS_Server;LOG_ERROR::Operation now in progress (115) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 10.4.9.251:15003]
06/08/2014 16:27:36;0001;PBS_Server.31118;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host nodelhj:15003
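Both messages say the server could not open a TCP connection to the MOM host, so one quick check is whether that address and port accept connections at all from the server. The sketch below assumes bash and the coreutils timeout command are installed; check_port is a made-up helper name, not a Torque tool. A stopped pbs_mom or a firewall on the compute node would both show up as "unreachable" here, so the result narrows the problem down without identifying the cause.

```shell
#!/bin/sh
# Hypothetical helper: report whether a TCP port accepts connections.
# Relies on bash's /dev/tcp pseudo-device; gives up after 3 seconds.
check_port() {
    host="$1"
    port="$2"
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port unreachable"
        return 1
    fi
}
```

For the failure logged above, the call on the server host would be check_port 10.4.9.251 15003.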
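This means the compute node rejected the job. Someone then suggested that the cause might be that passwordless SSH login had not been set up, so I configured passwordless SSH login; for that setup see: http://blog.csdn.net/leexide/article/details/17252369
That still did not solve the problem. I later found that the clock on the MOM node was set to US time; after I changed the time zone, jobs started with qrun ran correctly. The time zone was changed as follows: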
[root@nodelhj torque]# date
Thu Jun 5 06:01:59 PDT 2014
[root@nodelhj torque]# set date
[root@nodelhj torque]# tzselect
Please identify a location so that time zone rules can be set correctly.
Please select a continent or ocean.
1) Africa
2) Americas
3) Antarctica
4) Arctic Ocean
5) Asia
6) Atlantic Ocean
7) Australia
8) Europe
9) Indian Ocean
10) Pacific Ocean
11) none - I want to specify the time zone using the Posix TZ format.
#? 5
Please select a country.
1) Afghanistan 18) Israel 35) Palestine
2) Armenia 19) Japan 36) Philippines
3) Azerbaijan 20) Jordan 37) Qatar
4) Bahrain 21) Kazakhstan 38) Russia
5) Bangladesh 22) Korea (North) 39) Saudi Arabia
6) Bhutan 23) Korea (South) 40) Singapore
7) Brunei 24) Kuwait 41) Sri Lanka
8) Cambodia 25) Kyrgyzstan 42) Syria
9) China 26) Laos 43) Taiwan
10) Cyprus 27) Lebanon 44) Tajikistan
11) East Timor 28) Macau 45) Thailand
12) Georgia 29) Malaysia 46) Turkmenistan
13) Hong Kong 30) Mongolia 47) United Arab Emirates
14) India 31) Myanmar (Burma) 48) Uzbekistan
15) Indonesia 32) Nepal 49) Vietnam
16) Iran 33) Oman 50) Yemen
17) Iraq 34) Pakistan
#? 9
Please select one of the following time zone regions.
1) east China - Beijing, Guangdong, Shanghai, etc.
2) Heilongjiang (except Mohe), Jilin
3) central China - Sichuan, Yunnan, Guangxi, Shaanxi, Guizhou, etc.
4) most of Tibet & Xinjiang
5) west Tibet & Xinjiang
#? 1
The following information has been given:
China
east China - Beijing, Guangdong, Shanghai, etc.
Therefore TZ='Asia/Shanghai' will be used.
Local time is now: Thu Jun 5 21:04:44 CST 2014.
Universal Time is now: Thu Jun 5 13:04:44 UTC 2014.
Is the above information OK?
1) Yes
2) No
#? 1
You can make this change permanent for yourself by appending the line
TZ='Asia/Shanghai'; export TZ
to the file '.profile' in your home directory; then log out and log in again.
Here is that TZ value again, this time on standard output so that you
can use the /usr/bin/tzselect command in shell scripts:
Asia/Shanghai
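After rebooting the machine, however, the time zone change had not persisted, so I changed it by hand (edit the files below; the change takes effect as soon as it is saved):
vi /etc/sysconfig/clock
ZONE=Asia/Shanghai
UTC=false
ARC=false
(the ZONE value is the name of a file under /usr/share/zoneinfo)
rm /etc/localtime
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime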
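To confirm after a reboot that the symlink is still in place, the zone name can be read back from the path /etc/localtime points to. This is a minimal sketch; zone_from_path is an illustrative helper, and it assumes the zone files live under /usr/share/zoneinfo as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: strip the zoneinfo directory prefix from a path,
# leaving only the zone name.
zone_from_path() {
    echo "${1#/usr/share/zoneinfo/}"
}

# On the node itself one could run:
#   zone_from_path "$(readlink -f /etc/localtime)"
#   date +%Z
zone_from_path /usr/share/zoneinfo/Asia/Shanghai   # prints Asia/Shanghai
```

Because the time zone only changes how the time is displayed, running date -u on both the server and the MOM node is a quick additional way to see whether the underlying clocks agree as well.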
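With that problem solved, every job could be run with the qrun command, but jobs were still not scheduled automatically.
The scheduler log under sched_logs finally contained entries, but no scheduling actually took place. So I started installing Maui, hoping that it would schedule and run the jobs.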
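Before swapping schedulers, it may be worth confirming two things: that the scheduling attribute really is set on the server, and that a scheduler daemon is running. The sketch below assumes qmgr -c 'print server' prints the attribute as a line of the form "set server scheduling = True" and that the stock Torque scheduler process is named pbs_sched; scheduling_enabled is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `qmgr -c 'print server'` output on stdin and
# succeed only if the scheduling attribute is set to True.
scheduling_enabled() {
    grep -Eiq '^set server scheduling *= *true'
}

# On the server host one could run:
#   qmgr -c 'print server' | scheduling_enabled && echo "scheduling is on"
#   pgrep -x pbs_sched >/dev/null || echo "pbs_sched is not running"
```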