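Installing GT4 on vm1
First install the required packages. From the installation CD I installed postgresql-lib, postgresql7.3.4 and postgresql-server; I also installed jdk-1_5_0_05-linux-i586.bin and apache-ant-1.6.5-bin.tar. Check whether gcc, g++, sed, make, perl, sudo and tar are installed. The Globus package used is gt4.0.2-x86_rh_9-installer.tar, which is a binary installer, so installation is very fast.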
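The tool check mentioned above can be scripted. The following is only a minimal sketch: the function name check_tools is made up for illustration, and the tool list is the one given in the text.

```shell
#!/bin/sh
# check_tools (hypothetical helper): print each tool that is not on PATH
# and return non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool list taken from the text above.
check_tools gcc g++ sed make perl sudo tar || echo "install the missing tools before continuing"
```

If nothing is printed, all of the listed tools were found.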
For the configuration of Globus after installation, see http://blog.csdn.net/jcwKyl/archive/2009/07/18/4360031.aspx or http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html.
Installing the Globus SGE adapter
See the documentation at http://www.globusconsortium.org/tutorial/ch8/page_2.php.
Download four packages:
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_gram_job_manager_setup_sge-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_scheduler_event_generator_sge-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_scheduler_event_generator_sge_setup-1.1.tar.gz
[whb@jcwkyl gridsoft]$ wget http://www.lesc.ic.ac.uk/projects/globus_wsrf_gram_service_java_setup_sge-1.1.tar.gz
[globus@vm1 globus]$ cd $SGE_ROOT
[globus@vm1 sge]$ source default/common/settings.sh
[globus@vm1 sge]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[globus@vm1 sge]$ cd
[globus@vm1 globus]$ gpt-build /software/globus_gram_job_manager_setup_sge-1.1.tar.gz
[globus@vm1 globus]$ gpt-build /software/globus_scheduler_event_generator_sge-1.1.tar.gz gcc32dbg
[globus@vm1 globus]$ gpt-build /software/globus_scheduler_event_generator_sge_setup-1.1.tar.gz
[globus@vm1 globus]$ gpt-build /software/globus_wsrf_gram_service_java_setup_sge-1.1.tar.gz
[globus@vm1 globus]$ gpt-postinstall
Now we can test the WS GRAM SGE jobmanager.
First start the container (PostgreSQL first, then the container itself).
-bash-2.05b$ postmaster -i -D /opt/pgsql/data/ > logfile 2>&1 &
[globus@vm1 globus]$ globus-start-container > logfile 2>&1 &
While globus-start-container is starting, the following warning appears:
2009-11-28 14:46:33,893 WARN usefulrp.GLUEResourceProperty [GLUE refresher 0,runScript:315] Script Execution error when executing shell /opt/globus-4.0.2/libexec/globus-scheduler-provider-sge
java.io.IOException: java.io.IOException: /opt/globus-4.0.2/libexec/globus-scheduler-provider-sge: not found
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
at java.lang.Runtime.exec(Runtime.java:591)
at java.lang.Runtime.exec(Runtime.java:429)
at java.lang.Runtime.exec(Runtime.java:326)
The page http://www.globusconsortium.org/tutorial/ch8/page_3.php says this message can be ignored, yet job submission always produced an error. A Google search shows that the GT4/SGE integration method given there comes from http://www.lesc.ic.ac.uk/projects/SGE-GT4.html, and the SGE integration section of the Globus developer's guide (http://docs.huihoo.com/globus/toolkit/4.0/execution/wsgram/developer-index.html) also points to www.lesc.ic.ac.uk.
Concretely, every job submission ends in an Unsubmitted error, as follows:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/echo "just a test"
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:5b2ab3c0-dcf3-11de-96f9-080027f48588
Termination time: 11/30/2009 14:27 GMT
At this point it hangs, and after a long wait it prints:
Current job state: Unsubmitted
In fact, the job has already been executed by SGE. The job above was submitted as user guest on vm1, and on vm2 we can see:
[guest@vm2 guest]$ ls
5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stderr test
5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stdout transfer.xfr
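The name at the start of these files is the uuid shown when the job was submitted, and the file 5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stdout contains "just a test". So the adapter does work: the submitted job really was executed by SGE, and only the status information is wrong.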
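To find the right output file without copying the uuid by hand, the name can be derived from the "Job ID" line that globusrun-ws prints. This is just a sketch: job_stdout_name is a made-up helper, and the <uuid>.0.stdout naming is assumed only from the directory listing above.

```shell
#!/bin/sh
# job_stdout_name (hypothetical helper): read globusrun-ws output on stdin
# and print the stdout file name that the listing above suggests for it.
job_stdout_name() {
    sed -n 's/^Job ID: uuid:\(.*\)$/\1.0.stdout/p'
}

# Example with the Job ID line from the submission shown earlier.
printf 'Job ID: uuid:5b2ab3c0-dcf3-11de-96f9-080027f48588\n' | job_stdout_name
```

With the sample line this prints 5b2ab3c0-dcf3-11de-96f9-080027f48588.0.stdout, which matches the file seen on vm2.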
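Googling turned up this link: http://dev.uabgrid.uab.edu/uabgrid-stage/wiki/BuildTheStage. The author of that article describes the same situation and says "Many people have reported this bug, but could not find any solution yet." The author's workaround was to rebuild globus_scheduler_event_generator_sge-1.1.tar.gz with gpt-build using the gcc64dbg flavor. I tried to imitate this but did not know which flavor to build: gcc64dbg certainly fails, and I could not work out which flavors were available. gpt-build has an -all-flavors option, but that failed as well. So this line of attack was also shelved for the moment.
With nothing else to try, I submitted a job to see how SGE's reporting file records it, in order to verify that the job really is executed correctly by SGE even when globusrun-ws reports Unsubmitted.
We can tell from the log file.
Clear the log file: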
[sgeadmin@vm2 sgeadmin]$ cd /opt/sge/default/common/
[sgeadmin@vm2 common]$ echo "" > reporting
Submit the job again:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:b4329c1a-dcfa-11de-8a96-080027f48588
Termination time: 11/30/2009 15:20 GMT
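In the command above, -s streams the job's standard output and standard error (this is what produces the output files), -F specifies the factory, -Ft specifies the factory type (here SGE), and -c gives the command to run.
Looking on vm2, we can see that this job has already finished: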
[root@vm2 root]# su - guest
[guest@vm2 guest]$ ls
b4329c1a-dcfa-11de-8a96-080027f48588.0.stderr test
b4329c1a-dcfa-11de-8a96-080027f48588.0.stdout transfer.xfr
The contents of the log file:
1259508019:new_job:1259508019:26:-1:NONE:sge_job_script.13114:guest:guest::defaultdepartment:sge:1024
1259508019:job_log:1259508019:pending:26:-1:NONE::guest:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:new job
1259508026:job_log:1259508026:sent:26:0:NONE:t:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:sent to execd
1259508026:queue_consumable:all.q:vm3:1259508026::slots=1.000000=1.000000
1259508026:job_log:1259508026:delivered:26:0:NONE:r:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job received by execd
1259508027:acct:all.q:vm3:guest:guest:sge_job_script.13114:26:sge:0:1259508019:1259508025:1259508025:0:0:0:0:0:0.000000:0:0:0:0:4424:6324:0:0.000000:0:0:0:0:0:0:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:NONE:0.000000:NONE:0.000000
1259508027:job_log:1259508027:finished:26:0:NONE:r:execution daemon:vm3:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job exited
1259508027:job_log:1259508027:finished :26:0:NONE:r:master:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job waits for schedds deletion
1259508027:queue_consumable:all.q:vm3:1259508027::slots=0.000000=1.000000
1259508041:job_log:1259508041:deleted :26:0:NONE:T:scheduler:vm1:0:1024:1259508019:sge_job_script.13114:guest:guest::defaultdepartment:sge:job deleted by schedd
From these records we can roughly see that the job was placed in the all.q@vm3 queue and executed by the execd on vm3. Look at the job's output:
[root@vm2 root]# su - guest
[guest@vm2 guest]$ cat b4329c1a-dcfa-11de-8a96-080027f48588.0.stdout
vm3
The job runs /bin/hostname, and it ran on vm3, so the output is vm3.
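The reporting records shown earlier are colon-separated, so the life cycle of a job can be pulled out with awk. The sketch below is not an official tool: summarize_job_log is a made-up name, and the field positions (4 = event, 5 = job number, 10 = host, last = message) are inferred only from the sample records in this post, not from the SGE documentation.

```shell
#!/bin/sh
# summarize_job_log FILE (hypothetical helper): print one short line per
# job_log record. Field positions are guessed from the sample records above.
summarize_job_log() {
    awk -F: '$2 == "job_log" {
        event = $4
        gsub(/ +$/, "", event)   # some events carry a trailing space, e.g. "finished "
        printf "%s %s job=%s host=%s: %s\n", $1, event, $5, $10, $NF
    }' "$1"
}

# Path taken from the transcript above; skipped if the file is not readable.
REPORTING=/opt/sge/default/common/reporting
if [ -r "$REPORTING" ]; then
    summarize_job_log "$REPORTING"
fi
```

Records of other types (new_job, acct, queue_consumable) are skipped, which keeps the output to the pending, sent, delivered, finished and deleted steps of each job.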
Next, submit an array job:
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /opt/sge/examples/jobs/array_submitter.sh 7
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:fd714f56-dcfb-11de-a6c4-080027f48588
Termination time: 11/30/2009 15:29 GMT
qstat shows the execution state at one moment, as follows:
[guest@vm2 guest]$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@vm1 BIP 1/1 0.00 lx24-x86
29 0.55500 StepB guest r 11/29/2009 10:30:26 1 1
----------------------------------------------------------------------------
all.q@vm2 BIP 0/1 0.00 lx24-x86
----------------------------------------------------------------------------
all.q@vm3 BIP 1/1 0.00 lx24-x86
29 0.55500 StepB guest r 11/29/2009 10:30:26 1 2
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
29 0.00000 StepB guest qw 11/29/2009 10:29:41 1 3-7:1
The output files are as follows:
[guest@vm2 guest]$ ls
fd714f56-dcfb-11de-a6c4-080027f48588.0.stderr StepA.e28.6 StepA.o28.6 StepB.e29.6 StepB.o29.6
fd714f56-dcfb-11de-a6c4-080027f48588.0.stdout StepA.e28.7 StepA.o28.7 StepB.e29.7 StepB.o29.7
StepA.e28.1 StepA.o28.1 StepB.e29.1 StepB.o29.1 test
StepA.e28.2 StepA.o28.2 StepB.e29.2 StepB.o29.2 transfer.xfr
StepA.e28.3 StepA.o28.3 StepB.e29.3 StepB.o29.3
StepA.e28.4 StepA.o28.4 StepB.e29.4 StepB.o29.4
StepA.e28.5 StepA.o28.5 StepB.e29.5 StepB.o29.5
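file shows that globusrun-ws is an ELF binary. Debugging it with gdb shows only assembly code, probably because GT4 was installed from the binary package. I will try the source installer once, to see whether that solves the problem or reveals its root cause.
An idea: since the warning printed when the container starts can be ignored, and it demonstrably does not stop jobs from being submitted to SGE and executed, I wanted to get rid of it. I copied $GLOBUS_LOCATION/libexec/globus-gram-jobmanager-fork and renamed the copy to globus-gram-jobmanager-sge, then restarted the container. The warning did disappear, but the Unsubmitted error remained.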
https://www.nbcr.net/pub/wiki/index.php?title=GT4_Installation_and_Configuration
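This article gives a concise description of the GT4 installation process.
Note: the way of suppressing the warning described above is not a proper fix. I did it only because the warning is of no consequence.
In the Globus developer's guide (http://www.globus.org/toolkit/docs/4.0/execution/wsgram/developer-index.html) I found a solution for the same problem with PBS. Following those steps, what I have done so far is edit $GLOBUS_LOCATION/container-log4j.properties and turn on all of the debug options in it.
Submitting jobs directly to SGE, I found that shell-script jobs work, but submitting a binary directly, for example qsub /bin/hostname, fails, whereas a shell script that calls hostname works. So I wrote such a script and submitted it with globusrun-ws. The result was still Unsubmitted.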
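A wrapper script of the kind described above could be generated as follows. This is a sketch only: make_wrapper and the file name run_hostname.sh are made up for illustration, and the submission commands are left as comments because they need a working SGE and GT4 installation.

```shell
#!/bin/sh
# make_wrapper BINARY OUTFILE (hypothetical helper): write a small shell
# script that simply runs BINARY with whatever arguments it is given.
make_wrapper() {
    cat > "$2" <<EOF
#!/bin/sh
# wrapper so that the scheduler receives a script rather than a binary
exec $1 "\$@"
EOF
    chmod +x "$2"
}

WRAPPER="${TMPDIR:-/tmp}/run_hostname.sh"
make_wrapper /bin/hostname "$WRAPPER"

# On a machine with SGE and GT4 available, the wrapper would then be
# submitted in the same way as in the transcripts above, e.g.:
#   qsub "$WRAPPER"
#   globusrun-ws -submit -s -F vm1 -Ft SGE -c "$WRAPPER"
# SGE's qsub also has a "-b y" option for submitting binaries directly;
# whether the installed version supports it is an assumption to check.
```

As noted above, wrapping the binary makes plain qsub work but does not change the Unsubmitted result from globusrun-ws.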
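Also, while gpt-build was building the four GT4-SGE adapter packages, I noticed a globus_core-4.30 directory under BUILD. Would switching to that Globus version make the problem go away?
Returning to the earlier idea of building every flavor: after gpt-build had built the four packages, I had already searched the whole BUILD directory with find . -name "*" -exec grep flavor {} \; and found nothing. Searching again this time turned up something unexpected: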
[globus@vm1 globus_scheduler_event_generator_sge-1.1]$ grep flavor *
aclocal.m4:#extract whether the package is built with flavors from the src metadata
aclocal.m4: GLOBUS_FLAVOR_NAME="noflavor"
aclocal.m4:AC_ARG_WITH(flavor,
aclocal.m4: [ --with-flavor=<FL> Specify the globus build flavor or without-flavor for a flavor independent ],
aclocal.m4: echo "Please specify a globus build flavor" >&2
aclocal.m4: if test "x$GLOBUS_FLAVOR_NAME" = "xnoflavor"; then
aclocal.m4: echo "Warning: package doesn't build with flavors $withval ignored" >&2
aclocal.m4: if test ! -f "$GLOBUS_LOCATION/etc/globus_core/flavor_$GLOBUS_FLAVOR_NAME.gpt"; then
aclocal.m4: echo "Please specify a globus build flavor" >&2
aclocal.m4:if test "x$GLOBUS_FLAVOR_NAME" != "xnoflavor" ; then
config.log: $ /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure --with-threads=pthreads --with-flavor=gcc32pthr
config.status: with options \"'--with-threads=pthreads' '--with-flavor=gcc32pthr'\"
config.status: echo "running /bin/sh /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure " '--with-threads=pthreads' '--with-flavor=gcc32pthr' $ac_configure_extra_args " --no-create --no-recursion" >&6
config.status: exec /bin/sh /home/globus/BUILD/globus_scheduler_event_generator_sge-1.1//configure '--with-threads=pthreads' '--with-flavor=gcc32pthr' $ac_configure_extra_args --no-create --no-recursion
configure: --with-flavor=<FL> Specify the globus build flavor or without-flavor for a flavor independent
configure:#extract whether the package is built with flavors from the src metadata
configure: GLOBUS_FLAVOR_NAME="noflavor"
configure:# Check whether --with-flavor or --without-flavor was given.
configure:if test "${with_flavor+set}" = set; then
configure: withval="$with_flavor"
configure: echo "Please specify a globus build flavor" >&2
configure: if test "x$GLOBUS_FLAVOR_NAME" = "xnoflavor"; then
configure: echo "Warning: package doesn't build with flavors $withval ignored" >&2
configure: if test ! -f "$GLOBUS_LOCATION/etc/globus_core/flavor_$GLOBUS_FLAVOR_NAME.gpt"; then
configure: echo "Please specify a globus build flavor" >&2
configure:if test "x$GLOBUS_FLAVOR_NAME" != "xnoflavor" ; then
globus_automake_pre:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
globus_automake_pre:## flavorinclude = [ HEADERS ]
Makefile:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
Makefile.in:flavorincludedir = $(GLOBUS_LOCATION)/include/$(GLOBUS_FLAVOR_NAME)
The last two lines were a hint. I quickly ran ls on $GLOBUS_LOCATION/include and found:
[globus@vm1 globus_scheduler_event_generator_sge-1.1]$ ls $GLOBUS_LOCATION/include
gcc32 gcc32dbg gcc32dbgpthr gcc32pthr
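The flavor list can also be collected by a script instead of being typed by hand. The sketch below follows the guess made above that each directory under $GLOBUS_LOCATION/include corresponds to one installed flavor; list_flavors is a made-up helper, and the script only prints the gpt-build command rather than running it.

```shell
#!/bin/sh
# list_flavors GLOBUS_DIR (hypothetical helper): print the names of the
# directories under GLOBUS_DIR/include on one line, separated by spaces.
list_flavors() {
    for d in "$1"/include/*; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done | tr '\n' ' ' | sed 's/ $//'
}

# Dry run: only show the command. The fallback path is the install
# location that appears in the container log earlier in this post.
PKG=/software/globus_scheduler_event_generator_sge-1.1.tar.gz
echo "gpt-build -force $PKG $(list_flavors "${GLOBUS_LOCATION:-/opt/globus-4.0.2}")"
```

Printing the command first makes it easy to check the flavor names before starting a rebuild.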
So I ran gpt-build once more:
...
[globus@vm1 globus]$ gpt-build -force /software/globus_scheduler_event_generator_sge-1.1.tar.gz gcc32 gcc32dbg gcc32dbgpthr gcc32pthr
...
This time the problem was finally solved!
[guest@vm1 guest]$ ps ax
...
5424 pts/0 S 0:00 /opt/globus-4.0.2/libexec/globus-scheduler-event-generator -s fork -t 125
5442 pts/0 S 0:00 /opt/globus-4.0.2/libexec/globus-scheduler-event-generator -s sge -t 1259
...
[guest@vm1 guest]$ globusrun-ws -submit -s -F vm1 -Ft SGE -c /bin/hostname
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:84413516-dfac-11de-9018-080027f48588
Termination time: 12/04/2009 01:38 GMT
Current job state: Pending
Current job state: Active
vm1
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
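The ps listing above shows one globus-scheduler-event-generator process started with -s fork and one with -s sge. A quick check for this could look like the sketch below. seg_running is a made-up helper, and the reading that the "-s sge" process is what makes the job states show up is an inference from this transcript, not something taken from the Globus documentation.

```shell
#!/bin/sh
# seg_running LRM (hypothetical helper): read a process listing on stdin and
# succeed if it shows a scheduler event generator started with "-s LRM".
# The [g] in the pattern keeps grep from matching its own command line.
seg_running() {
    grep -q -- "[g]lobus-scheduler-event-generator -s $1"
}

for lrm in fork sge; do
    if ps ax | seg_running "$lrm"; then
        echo "event generator for $lrm is running"
    else
        echo "no event generator for $lrm found"
    fi
done
```

Run on vm1 after the container has started, this gives a quick hint before submitting a test job.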
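This was my first time working with a system as large as Globus. When debugging I had no idea where to start, so I simply guessed and experimented with little understanding of the principles behind it, just to coax out a working result. Treat this document as a reference only. With this, most of the task is done. What remains is installing and configuring csf and vjm, which is comparatively simple work.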