http://kaldi-asr.org/doc/queue.html (Kaldi official documentation)
http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml (Installation of the Grid Engine Execution Host)
http://gridscheduler.sourceforge.net/CompileGridEngineSource.html
http://blog.csdn.net/leijunan/article/details/39608849 (cluster environment setup)
Environment: CentOS 7, 64-bit
On this system Grid Engine has to be compiled from source, which is fairly tedious. Only after finishing did I learn that on some other Linux distributions the packages can be installed directly:
gridengine-master gridengine-client gridengine-exec
1. Download GE2011.11p1.tar.gz (corresponds to version 6.2u5)
Go to http://gridscheduler.sourceforge.net/ and download it from the "Download GridEngine/Grid Scheduler" tab.
2. Unpack
tar -zxvf GE2011.11p1.tar.gz
3. Run the following commands to build GE
cd GE2011.11p1/source
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump
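Build errors are usually caused by a mismatch between the build options and the software installed on the system. The errors I ran into during the build are listed below.
Error 1
When running ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump, the build fails with:
../utilbin/authuser.c:72:31: fatal error: security/pam_appl.h: No such file or directory
#include <security/pam_appl.h>
Solution
The security include directory has no pam_appl.h because the PAM development files are not installed. Download openpam-20130907.tar.gz and build it:
cd openpam-20130907
./configure
sudo make install
Then rerun the SGE build commands.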
Error 2
In file included from ../Xmt310/Xmt/All.c:23:0:
../Xmt310/Xmt/Xmt.h:56:19: fatal error: Xm/Xm.h: No such file or directory
#include <Xm/Xm.h>
Solution
cd GE2011.11p1/source
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -only-depend
./scripts/zerodepend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -no-qmon
The -no-qmon option skips building qmon, because X11 is not installed on this system. For the other options, see: http://gridscheduler.sourceforge.net/CompileGridEngineSource.html
One thing to watch: adding -no-qmon to the ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend command also seemed to produce an error. It has to go on the last command only; I do not know the exact reason.
Error 3
rm -f gethost
gcc -o gethost -DSGE_ARCH_STRING=\"linux-x64\" -O3 -Wall -Wstrict-prototypes -DUSE_POLL -DLINUX -DLINUXX64 -DLINUXX64 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8 -DHAS_VSNPRINTF -DHAS_IN_PORT_T -I/build/berkeleydb/include/ -DTARGET_64BIT -DSPOOLING_classic -Wno-strict-aliasing -DNO_JNI -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__ -D__SGE_NO_USERMAPPING__ -DTHREADBINDING -DHWLOC -Wno-error -DPROG_NAME='"qtcsh"' -DLINUXX64 -I. -I.. -D_PATH_TCSHELL='"/usr/local/bin/tcsh"' -I../../../libs/gdi -I../../../libs/gdi ../gethost.c -lncurses -lcrypt -L../../../LINUXX64 -R/lib/linux-x64 -L/build/berkeleydb/lib/ -L. -Wl,-rpath,\$ORIGIN/../../lib/linux-x64 -lsge -lpthread -ldl
gcc: error: unrecognized command line option '-R'
This error appeared with GE2011.11.tar.gz. On the surface it is a problem with the '-R' option, but gcc does not normally fail this way. The likely cause is that -no-qmon was not added to the last command (./aimk -no-java -no-jni -no-secure -spool-classic -no-dump depend). Since the build later succeeded with GE2011.11p1.tar.gz, I did not test this any further.
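Errors 1 and 2 both come down to a missing header, so it may be worth checking for the two headers before starting a long build. The following is a minimal sketch, not part of the Grid Engine tooling: the function name aimk_extra_flags and the INCLUDE_DIR variable are made up for this example, and /usr/include is only the usual header location on CentOS.

```shell
#!/bin/sh
# Sketch: look for the two headers whose absence caused errors 1 and 2.
# INCLUDE_DIR is an assumption; adjust it if headers live elsewhere.
INCLUDE_DIR="${INCLUDE_DIR:-/usr/include}"

# aimk_extra_flags (hypothetical helper): prints the extra aimk flag that is
# needed, and warns on stderr if the PAM header is missing.
aimk_extra_flags() {
    inc="$1"
    flags=""
    if [ ! -f "$inc/security/pam_appl.h" ]; then
        echo "missing $inc/security/pam_appl.h: install the PAM headers first" >&2
    fi
    if [ ! -f "$inc/Xm/Xm.h" ]; then
        # No Motif headers: skip qmon, as in the fix for error 2.
        flags="-no-qmon"
    fi
    echo "$flags"
}
```

The result would only be appended to the last build command, for example ./aimk -no-java -no-jni -no-secure -spool-classic -no-dump $(aimk_extra_flags "$INCLUDE_DIR").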
4. Set the environment variables
mkdir /opt/ge2011
export SGE_ROOT=/opt/ge2011
export SGE_CELL=default
5. Run: scripts/distinst -all -local -noexit
This command installs install_qmaster, install_execd and the other components under $SGE_ROOT.
Error 1
Installing: sge_qmaster sge_execd sge_shadowd sge_shepherd sge_coshepherd qstat qsub qalter qconf qdel qacct qmod qsh utilbin jobs qmon qhost qmake qtcsh qping qloadsensor.exe sgepasswd qquota qrsub qrstat qrdel common
Architectures: –noexit
Base directory: /opt/ge2011
OK [Y/N][Y]: OK [Y/N][Y]: y
Installing "3rd_party/" directory tree
cp: cannot stat 'dist/3rd_party': No such file or directory
This command failed: cp -r dist/3rd_party /opt/ge2011
Installation failed. Exiting.
Solution
This is a path problem. I had changed into the scripts directory and run ./distinst from there, so the script's relative paths were wrong and it could not find the dist/3rd_party directory. Running scripts/distinst -all -local -noexit from the directory that contains scripts works.
Error 2
Installing "3rd_party/" directory tree
Installing "inst_sge", "install_qmaster" and "install_execd"
Installing "util/" directory tree
chmod: cannot access '/opt/ge2011/util/DetectJvmLibrary.jar': No such file or directory
This command failed: chmod 644 /opt/ge2011/util/install_modules/backup_template.conf /opt/ge2011/util/install_modules/DB_CONFIG /opt/ge2011/util/install_modules/inst_berkeley.sh /opt/ge2011/util/install_modules/inst_common.sh /opt/ge2011/util/install_modules/inst_execd.sh /opt/ge2011/util/install_modules/inst_execd_uninst.sh /opt/ge2011/util/install_modules/inst_qmaster.sh /opt/ge2011/util/install_modules/inst_qmaster_uninst.sh /opt/ge2011/util/install_modules/inst_schedd_high.conf /opt/ge2011/util/install_modules/inst_schedd_max.conf /opt/ge2011/util/install_modules/inst_schedd_normal.conf /opt/ge2011/util/install_modules/inst_st.sh /opt/ge2011/util/install_modules/inst_template.conf /opt/ge2011/util/rctemplates/darwin_template /opt/ge2011/util/rctemplates/sgebdb_template /opt/ge2011/util/rctemplates/sgeexecd_template /opt/ge2011/util/rctemplates/sgemaster_template /opt/ge2011/util/sgeCA/sge_ca.cnf /opt/ge2011/util/sgeCA/sge_ssl.cnf /opt/ge2011/util/sgeCA/sge_ssl_template.cnf /opt/ge2011/util/sgeSMF/bdb_template.xml /opt/ge2011/util/sgeSMF/execd_template.xml /opt/ge2011/util/sgeSMF/qmaster_template.xml /opt/ge2011/util/sgeSMF/shadowd_template.xml /opt/ge2011/util/sgeSMF/sge_smf_support.sh /opt/ge2011/util/DetectJvmLibrary.jar /opt/ge2011/util/resources/calendars/day /opt/ge2011/util/resources/calendars/day_s /opt/ge2011/util/resources/calendars/night /opt/ge2011/util/resources/calendars/night_s /opt/ge2011/util/resources/centry/arch /opt/ge2011/util/resources/centry/calendar /opt/ge2011/util/resources/centry/cpu /opt/ge2011/util/resources/centry/display_win_gui /opt/ge2011/util/resources/centry/h_core /opt/ge2011/util/resources/centry/h_cpu /opt/ge2011/util/resources/centry/h_data /opt/ge2011/util/resources/centry/h_fsize /opt/ge2011/util/resources/centry/hostname /opt/ge2011/util/resources/centry/h_rss /opt/ge2011/util/resources/centry/h_rt /opt/ge2011/util/resources/centry/h_stack /opt/ge2011/util/resources/centry/h_vmem 
/opt/ge2011/util/resources/centry/load_avg /opt/ge2011/util/resources/centry/load_long /opt/ge2011/util/resources/centry/load_medium /opt/ge2011/util/resources/centry/load_short /opt/ge2011/util/resources/centry/m_core /opt/ge2011/util/resources/centry/mem_free /opt/ge2011/util/resources/centry/mem_total /opt/ge2011/util/resources/centry/mem_used /opt/ge2011/util/resources/centry/min_cpu_interval /opt/ge2011/util/resources/centry/m_socket /opt/ge2011/util/resources/centry/m_topology /opt/ge2011/util/resources/centry/m_topology_inuse /opt/ge2011/util/resources/centry/np_load_avg /opt/ge2011/util/resources/centry/np_load_long /opt/ge2011/util/resources/centry/np_load_medium /opt/ge2011/util/resources/centry/np_load_short /opt/ge2011/util/resources/centry/num_proc /opt/ge2011/util/resources/centry/qname /opt/ge2011/util/resources/centry/rerun /opt/ge2011/util/resources/centry/s_core /opt/ge2011/util/resources/centry/s_cpu /opt/ge2011/util/resources/centry/s_data /opt/ge2011/util/resources/centry/seq_no /opt/ge2011/util/resources/centry/s_fsize /opt/ge2011/util/resources/centry/slots /opt/ge2011/util/resources/centry/s_rss /opt/ge2011/util/resources/centry/s_rt /opt/ge2011/util/resources/centry/s_stack /opt/ge2011/util/resources/centry/s_vmem /opt/ge2011/util/resources/centry/swap_free /opt/ge2011/util/resources/centry/swap_rate /opt/ge2011/util/resources/centry/swap_rsvd /opt/ge2011/util/resources/centry/swap_total /opt/ge2011/util/resources/centry/swap_used /opt/ge2011/util/resources/centry/tmpdir /opt/ge2011/util/resources/centry/virtual_free /opt/ge2011/util/resources/centry/virtual_total /opt/ge2011/util/resources/centry/virtual_used /opt/ge2011/util/resources/pe/make /opt/ge2011/util/resources/pe/make.sge_pqs_api /opt/ge2011/util/resources/schemas/qhost/qhost.xsd /opt/ge2011/util/resources/schemas/qquota/qquota.xsd /opt/ge2011/util/resources/schemas/qrstat/qrstat.xsd /opt/ge2011/util/resources/schemas/qstat/detailed_job_info_cb.xsd 
/opt/ge2011/util/resources/schemas/qstat/detailed_job_info.xsd /opt/ge2011/util/resources/schemas/qstat/message.xsd /opt/ge2011/util/resources/schemas/qstat/qstat_cb.xsd /opt/ge2011/util/resources/schemas/qstat/qstat.xsd /opt/ge2011/util/resources/usersets/arusers /opt/ge2011/util/resources/usersets/deadlineusers /opt/ge2011/util/resources/usersets/defaultdepartment
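Installation failed. Exiting.
Solution
Wrong: scripts/distinst -all -local –noexit
Right: scripts/distinst -all -local -noexit
The error says that DetectJvmLibrary.jar does not exist. We built with -no-java, so the missing file is expected, but the chmod step does not check for it, and without -noexit the script aborts. The dash in front of noexit was the culprit: it had silently turned into a typographic dash (–) when I pressed Enter, so the option was not recognized. I was lucky to spot it.
The same problem is reported here: https://sourceforge.net/p/gridscheduler/mailman/message/35610855/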
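Since a typographic dash is hard to see in a terminal, a small check on the arguments can catch it before the command runs. This is a sketch under my own assumptions: has_typographic_dash is a made-up helper, and it only looks for the en dash (U+2013) and em dash (U+2014), which are the usual results of editor or web-page autocorrection.

```shell
#!/bin/sh
# Build the two characters from their UTF-8 bytes so the script stays ASCII.
EN_DASH=$(printf '\342\200\223')   # U+2013
EM_DASH=$(printf '\342\200\224')   # U+2014

# has_typographic_dash (hypothetical helper): returns 0 if any argument
# contains an en dash or em dash, 1 otherwise.
has_typographic_dash() {
    for arg in "$@"; do
        case "$arg" in
            *"$EN_DASH"*|*"$EM_DASH"*) return 0 ;;
        esac
    done
    return 1
}
```

For example, has_typographic_dash -all -local –noexit returns 0, which signals that the option has to be retyped with a plain ASCII hyphen.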
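At this point SGE is installed. The next step is to configure it.
6. Change the cluster port numbers
Edit /etc/services.
The cluster needs two unused port numbers. The defaults are: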
sge_qmaster 6444/tcp sge-qmaster # Grid Engine Qmaster Service
sge_qmaster 6444/udp sge-qmaster # Grid Engine Qmaster Service
sge_execd 6445/tcp sge-execd # Grid Engine Execution Service
sge_execd 6445/udp sge-execd # Grid Engine Execution Service
Change them to less commonly used port numbers:
sge_qmaster 27100/tcp
sge_qmaster 27100/udp
sge_execd 27101/tcp
sge_execd 27101/udp
Note: the ports must be set on every machine that will be part of the cluster.
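Because the same entries have to exist on every machine, a quick lookup helps confirm what a given host actually has registered. The sketch below parses a services file with awk; the function names service_port and service_entry_count are made up for this example, and the file argument defaults to /etc/services.

```shell
#!/bin/sh
# service_port (hypothetical helper): prints the port of the first entry
# that matches the given service name and protocol.
service_port() {
    name="$1"; proto="$2"; file="${3:-/etc/services}"
    awk -v n="$name" -v p="$proto" '
        $1 == n { split($2, a, "/"); if (a[2] == p) { print a[1]; exit } }
    ' "$file"
}

# service_entry_count (hypothetical helper): prints how many entries exist
# for the name/protocol pair; more than 1 means a duplicate entry.
service_entry_count() {
    awk -v n="$1" -v p="$2" '
        $1 == n && $2 ~ ("/" p "$") { c++ }
        END { print c + 0 }
    ' "${3:-/etc/services}"
}
```

Running service_port sge_qmaster tcp on each host should print 27100 with the settings above, and a count above 1 suggests the old default entry is still in the file.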
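7. Configure the network file system (NFS)
NFS is a network file system; here it carries files between the master host and the execution hosts. Transfers within a LAN are very fast.
7.1 Configure host names
On every host, open the hosts file with:
vim /etc/hosts
Add the host name of every machine that will act as an execution host, in this format: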
192.168.0.21 hostname1
192.168.0.22 hostname2
...
IP hostnameN
Note: IP is each host's IP address, which the ifconfig command shows;
hostname is the name printed by the hostname command.
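7.2 Configure the shared directories
On the machine that will be the master host, open the exports file with:
vim /etc/exports
and edit it in the following format.
(Version that can only be mounted from the master host itself)
/opt/ge2011 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
/usr/wxf/kaldi 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
Note: the first column is the path to share, the second column is the client allowed to mount it, and the options are in parentheses. /opt/ge2011 is the cluster root directory and /usr/wxf/kaldi is the Kaldi installation path.
Note: with 192.168.1.216 as the client, the share can only be mounted on the master host itself. Mounting from any other host fails with:
mount.nfs: access denied by server while mounting 192.168.1.216:/opt/ge2011
So the correct configuration is:
(Version that other hosts can mount)
/opt/ge2011 *(rw,insecure,no_all_squash,no_root_squash,sync)
/usr/wxf/kaldi *(rw,insecure,no_all_squash,no_root_squash,sync)
Then apply the configuration with:
exportfs -av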
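A missing space between the path and the client is easy to overlook in /etc/exports, so a rough format check can be run before exportfs. This is only a heuristic sketch: malformed_exports is a made-up helper that reports every non-comment line with fewer than two whitespace-separated fields, which also flags lines that deliberately list a path with no client.

```shell
#!/bin/sh
# malformed_exports (hypothetical helper): prints "file:line: text" for each
# non-empty, non-comment line without whitespace between path and client.
malformed_exports() {
    awk '
        /^[ \t]*(#|$)/ { next }                      # skip comments and blanks
        NF < 2 { print FILENAME ":" NR ": " $0 }     # path and client run together
    ' "${1:-/etc/exports}"
}
```

An empty result means every entry has at least a path and a client field; it says nothing about whether the options themselves are valid.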
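7.3 Mount the master host's directories on every execution host
Create the mount points and mount:
mkdir /opt/ge2011 /usr/wxf /usr/wxf/kaldi
mount 192.168.1.216:/opt/ge2011 /opt/ge2011
mount 192.168.1.216:/usr/wxf/kaldi /usr/wxf/kaldi
Note: the part before the colon is the master host's IP address or host name, and the last argument is the mount point.
The mount succeeded if:
1) the commands finish without errors,
2) running mount on the execution host lists the mounted paths, and
3) on every execution host, /opt/ge2011 and /usr/wxf/kaldi show all the files the master host has under those paths.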
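Check 2) can be scripted by reading the kernel's mount table instead of scanning the output of mount by eye. In this sketch, is_nfs_mounted is a made-up helper; it assumes the usual /proc/mounts layout (device, mount point, file system type, options) and treats any type starting with nfs as an NFS mount.

```shell
#!/bin/sh
# is_nfs_mounted (hypothetical helper): returns 0 if the given directory is
# listed as an nfs/nfs4 mount point in the mounts table, 1 otherwise.
is_nfs_mounted() {
    awk -v m="$1" '
        $2 == m && $3 ~ /^nfs/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

On an execution host this could be used as: is_nfs_mounted /opt/ge2011 && is_nfs_mounted /usr/wxf/kaldi && echo "both shares mounted".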
Possible causes of mount failures:
1) The NFS service is not running. Start it on the master host and the execution hosts with:
service rpcbind restart
service nfs restart
After restarting, reconfigure the firewall's port filtering, or configure fixed ports for NFS.
2) The firewall is blocking NFS. Open the required ports as follows.
2.1) iptables
vim /etc/sysconfig/iptables
Add the six NFS-related rules (--dport 2049, 111, 32803, 892, 875 and 662) and save:
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state NEW -m tcp -p tcp --dport 2049 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 32803 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 892 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 875 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 662 -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
service iptables restart
2.2) firewalld
firewall-cmd --permanent --add-service=nfs
firewall-cmd --reload
If none of the above helps, the firewall may have to be turned off:
chkconfig iptables off
or:
systemctl stop firewalld.service
systemctl disable firewalld.service # keep firewalld from starting at boot
firewall-cmd --state # show the firewall status
Note: NFS reference:
http://www.unixmen.com/nfs-server-installation-and-configuration-in-centos-6-3-rhel-6-3-and-scientific-linux-6-3/
Error 1
Mounting locally works, but on an execution host mount fails with:
mount.nfs: access denied by server while mounting 192.168.1.216:/opt/ge2011
What I checked:
1) The master host responds to ping.
2) rpcbind and nfs are running:
[root@hadoop-0 wxf]# service rpcbind restart
Redirecting to /bin/systemctl restart rpcbind.service
[root@hadoop-0 wxf]# service nfs restart
Redirecting to /bin/systemctl restart nfs.service
rpcinfo -p localhost
3) The firewall is turned off:
systemctl stop firewalld.service
systemctl disable firewalld.service # keep firewalld from starting at boot
firewall-cmd --state
Solution
The cause is the client restriction on the /opt/ge2011 export. The entry
/opt/ge2011 192.168.1.216(rw,insecure,no_all_squash,no_root_squash,sync)
should be:
/opt/ge2011 *(rw,insecure,no_all_squash,no_root_squash,sync)
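1. As root, change to the SGE directory:
cd $SGE_ROOT
2. Create a file named hostlist containing the execution host names, one per line:
hostname1
hostname2
...
hostnameN
3. Run install_qmaster. The full procedure is in the document "主节点安装.docx" (master node installation).
Master node installation:
Prompts that matter: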
under an user id other than >root< (y/n) [y] >>y
Please enter a valid user name >> sgeadmin
Are you going to install Windows Execution Hosts? (y/n) [n] >> (press Enter)
Do you want to enable the JMX MBeanserver (y/n) [n] >> (press Enter)
Please enter a range [20000-20100]>>2000-21000
Do you want to use a file which contains the list of hosts (y/n) [n]>>y
Please enter the file name which contains the host list: hostlist
Do you want to add your shadow host(s)now? (y/n) [y] >>n
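The installer resolves every name in hostlist, so before running install_qmaster it can help to confirm that each name has an entry in /etc/hosts (step 7.1). The sketch below is one way to do that; unresolved_hosts is a made-up helper and it only compares names textually, it does not query DNS.

```shell
#!/bin/sh
# unresolved_hosts (hypothetical helper): prints every name from the host
# list file that does not appear as a host name in the hosts file.
unresolved_hosts() {
    hostlist="$1"; hosts="${2:-/etc/hosts}"
    while read -r name; do
        [ -n "$name" ] || continue          # skip empty lines
        awk -v h="$name" '
            /^[ \t]*#/ { next }             # skip comment lines
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts" || echo "$name"
    done < "$hostlist"
}
```

Running unresolved_hosts hostlist in $SGE_ROOT should print nothing when every execution host is listed in /etc/hosts.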
The full procedure is in the document "执行节点安装.docx" (execution node installation).
Execution node:
1. Create the user:
sudo adduser sgeadmin
2. Set the SGE ports:
vim /etc/services and change the ports to:
sge_qmaster 27100/tcp sge-qmaster # Grid Engine Qmaster Service
sge_qmaster 27100/udp sge-qmaster # Grid Engine Qmaster Service
sge_execd 27101/tcp sge-execd # Grid Engine Execution Service
sge_execd 27101/udp sge-execd # Grid Engine Execution Service
Note: 1) Use the same settings on all hosts.
2) Watch out for duplicate entries, which can keep the port change from taking effect.
3) If the master node cannot be reached, open the ports above in the firewall on the master node:
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27100/tcp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27100/udp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27101/udp --permanent
[root@hadoop-0 /]# firewall-cmd --zone=public --add-port=27101/tcp --permanent
[root@hadoop-0 /]# firewall-cmd --reload
3. Source the settings file to set the environment variables, otherwise later commands will run into problems: . /opt/ge2011/default/common/settings.sh
4. Run /opt/ge2011/install_execd. The prompts to watch for:
Do you want to configure a different spool directory
for this host (y/n) [n] >>y
Enter the spool directory now! >>/home/sgeadmin/hadoop-0
Start at boot
1. Add calls to /etc/init.d/sgemaster.p27100 and /etc/init.d/sgeexecd.p27100 to /etc/rc.local
2. Add . /opt/ge2011/default/common/settings.sh to /etc/profile
The command-line tools are in this path:
/opt/ge2011/bin/linux-x64
Configuring execution hosts
/opt/ge2011/bin/linux-x64/qconf -sel
qconf -ae hostname    add an execution host
qconf -de hostname    delete an execution host
qconf -sel            show the execution host list
Configuring administrative hosts
/opt/ge2011/bin/linux-x64/qconf -sh
qconf -ah hostname    add an administrative host
qconf -dh hostname    delete an administrative host
qconf -sh             show the administrative host list
Configuring submit hosts
/opt/ge2011/bin/linux-x64/qconf -ss
qconf -as hostname    add a submit host
qconf -ds hostname    delete a submit host
qconf -ss             show the submit host list
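With more than a handful of machines, the -ah and -as commands can be driven from the same hostlist file used during installation. This is a sketch under my own assumptions: register_hosts is a made-up helper, and the QCONF variable exists only so the loop can be dry-run with QCONF=echo before touching the cluster.

```shell
#!/bin/sh
# QCONF can be overridden (for example QCONF=echo) for a dry run.
QCONF="${QCONF:-qconf}"

# register_hosts (hypothetical helper): registers every host named in the
# given file as an administrative host and as a submit host.
register_hosts() {
    while read -r name; do
        [ -n "$name" ] || continue              # skip empty lines
        "$QCONF" -ah "$name" < /dev/null        # add administrative host
        "$QCONF" -as "$name" < /dev/null        # add submit host
    done < "$1"
}
```

A dry run such as QCONF=echo followed by register_hosts hostlist prints the qconf arguments without changing anything; with the default setting it has to be run on the master host by an SGE manager.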
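Configuring queues
qconf -aq queuename   add a cluster queue
qconf -dq queuename   delete a cluster queue
qconf -mq queuename   modify a cluster queue's configuration
qconf -sq queuename   show a cluster queue's configuration
qconf -sql            show the cluster queue list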
Configuring host groups
qconf -ahgrp groupname    add a host group
qconf -mhgrp groupname    modify a host group's members
qconf -shgrp groupname    show a host group's members
Host status
/opt/ge2011/bin/linux-x64/qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
hadoop-0 linux-x64 4 0.18 15.4G 5.3G 7.8G 0.0
Cluster status
/opt/ge2011/bin/linux-x64/qstat -f
The cluster status output looks like this:
queuename qtype resv/used/tot. load_avg arch states
-------------------------------------------------------------------------------
all.q@hadoop-0 BIP 0/0/4 0.17 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-2 BIP 0/0/4 0.58 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-5 BIP 0/0/4 0.01 linux-x64
-------------------------------------------------------------------------------
all.q@hadoop-8 BIP 0/0/4 0.01 linux-x64
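There are four execution nodes. The tot. value is the number of slots (cores) each node contributes, and all four nodes have 4 cores.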
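The total capacity of the cluster can be read off the same output by summing the tot. part of the resv/used/tot. column. The helper below is a sketch: total_slots is a made-up name, and it assumes the qstat -f layout shown above, where each queue instance line has the queue@host name in the first field and the slot counts in the third.

```shell
#!/bin/sh
# total_slots (hypothetical helper): reads qstat -f output on stdin and
# prints the sum of the "tot." values over all queue instances.
total_slots() {
    awk '
        $1 ~ /@/ && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            split($3, s, "/")       # s[1]=resv, s[2]=used, s[3]=tot.
            sum += s[3]
        }
        END { print sum + 0 }
    '
}
```

It would be used as qstat -f | total_slots; for the four nodes listed above the sum is 16.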