On a colleague's freshly installed SGE, only one job would run after submission; the remaining jobs were stuck in the t state, as shown below:
# qstat -u '*'
job-ID  prior    name        user    state  submit/start at      queue           slots  ja-task-ID
--------------------------------------------------------------------------------------------------
55      0.60500  I_ZC170000  robots  r      02/23/2017 11:27:45  all.q@###       6
56      0.50500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@###       1
57      0.50500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@###       1
58      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@###       6
59      0.50500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@###       1
60      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  6
61      0.50500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  1
63      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  6
65      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  6
68      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  6
70      0.60500  II_ZC17000  robots  t      02/23/2017 11:29:45  all.q@Analysis  6
Inspecting one of the stuck jobs with qstat -j 56 shows the following error:
error reason 1: can not find an unused add_grp_id
1: can not find an unused add_grp_id
1: can not find an unused add_grp_id
1: can not find an unused add_grp_id
Solution
The cause is that gid_range in the SGE configuration does not provide enough group IDs. For example:
[root@Analysis gridengine]# qconf -sconf | grep gid_range
gid_range 21000
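This setting is supposed to be a range, but my colleague had mistakenly set it to a single number, so only one job could run normally. Change the value to a range (for example with qconf -mconf) and then restart sgemaster: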
[root@Analysis gridengine]# qconf -sconf | grep gid_range
gid_range 20000-21000
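To see why a single number admits only one job, it can help to count how many group IDs a given gid_range value actually provides. The function below is a minimal sketch, not part of SGE: the name gid_range_count is hypothetical, and it assumes the value follows the comma-separated n-m format described in the sge_conf man page.

```shell
# Hypothetical helper: count how many group IDs a gid_range value provides.
# Accepts a comma-separated list of "n-m" ranges or single numbers "m".
gid_range_count() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { total += 1 }            # a single number m is treated as m-m
        NF == 2 { total += $2 - $1 + 1 }  # a range n-m is inclusive on both ends
        END     { print total + 0 }'
}

gid_range_count "21000"         # the broken setting: one ID
gid_range_count "20000-21000"   # the corrected setting
```

On a live system the current value could be fed in with something like gid_range_count "$(qconf -sconf | awk '$1 == "gid_range" {print $2}')".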
From the man page:
[root@Analysis ~]# man sge_conf
.........
gid_range
The gid_range is a comma separated list of range expressions of the
form n-m (n as well as m are integer numbers greater than 99), where m
is an abbreviation for m-m. These numbers are used in sge_execd(8) to
identify processes belonging to the same job.
Each sge_execd(8) may use a separate set up group ids for this purpose.
All number in the group id range have to be unused supplementary group
ids on the system, where the sge_execd(8) is started.
Changing gid_range will take immediate effect. There is no default for
gid_range. The administrator will have to assign a value for gid_range
during installation of Sun Grid Engine.
The global configuration entry for this value may be overwritten by the
execution host local configuration.
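The man page requires that every number in the range be an unused group ID on the execution host. One rough way to check this is to list any existing groups whose GID falls inside the chosen range. This is only a sketch: the function name used_gids_in_range is hypothetical, it reads group database lines in the usual name:passwd:gid:members format from standard input, and it only sees groups that the input source enumerates.

```shell
# Hypothetical helper: print "gid:name" for every group whose GID lies in [lo, hi].
# Reads group database lines (name:passwd:gid:members) on stdin.
used_gids_in_range() {
    awk -F: -v lo="$1" -v hi="$2" '$3 >= lo && $3 <= hi { print $3 ":" $1 }'
}

# Example: check the range used in this post against the local group database.
getent group | used_gids_in_range 20000 21000
```

If nothing is printed, no enumerated group uses a GID in that range. Any line that does appear points to a conflict worth resolving before assigning the range to gid_range.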
Reference: http://arc.liv.ac.uk/pipermail/gridengine-users/2005-September/007056.html