Host OS: RHEL 6.0 (any number of hosts)
- useradd yejk
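- passwd yejk (set the password to westos)
Note: the export below must include no_root_squash. Without it, the NFS server maps root on an NFS client to the user nfsnobody, which causes permission problems. Once the directory is exported, the other nodes can mount it directly with the mount command, and the mount should also be written to /etc/fstab.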
- yum install nfs-utils rpcbind -y
- vim /etc/exports
- /home/yejk 192.168.0.0/24(rw,no_root_squash)
- exportfs -r #apply the configuration
- exportfs -v #show the current exports
- /etc/init.d/rpcbind start #NFS requires rpcbind
- /etc/init.d/nfs start
- mount 192.168.0.80:/home/yejk /home/yejk
- vim /etc/fstab
- 192.168.0.80:/home/yejk /home/yejk nfs defaults 0 0
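Because the same fstab entry has to be added on every client node, it helps if the edit is safe to repeat. The following is a minimal sketch, assuming a hypothetical helper named add_line_once (not part of nfs-utils) and a scratch file standing in for the real /etc/fstab:

```shell
#!/bin/sh
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists.
# (Hypothetical helper name, introduced only for this illustration.)
add_line_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demo on a scratch file; point it at /etc/fstab only after checking the output.
fstab_demo=$(mktemp)
entry='192.168.0.80:/home/yejk /home/yejk nfs defaults 0 0'
add_line_once "$fstab_demo" "$entry"
add_line_once "$fstab_demo" "$entry"   # second call changes nothing
cat "$fstab_demo"
```

Running the helper twice leaves a single copy of the entry, so the script can be rerun on a node without duplicating the mount line.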
At this point, changes made in the shared home directory on any node take effect on every node.
- su - yejk
- ssh-keygen (press Enter at every prompt)
- ssh-copy-id -i ~/.ssh/id_rsa.pub desktop80.example.com (set up trust with this host itself)
- ssh-copy-id -i ~/.ssh/id_rsa.pub desktop83.example.com (set up trust with the other nodes)
- ssh-copy-id -i ~/.ssh/id_rsa.pub desktop26.example.com (set up trust with the other nodes)
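With more nodes, typing one ssh-copy-id command per host gets error-prone. The sketch below only prints the commands (a dry run) for the three hosts named above; the variable names are illustrative, and nothing is executed against the network:

```shell
#!/bin/sh
# Print one ssh-copy-id command per node; nothing is executed (dry run).
nodes="desktop80.example.com desktop83.example.com desktop26.example.com"
copy_cmds=$(for node in $nodes; do
    # The ~ stays literal inside double quotes and is expanded when the command is run.
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $node"
done)
echo "$copy_cmds"
```

After reviewing the printed list, the commands can be pasted into the yejk shell one by one.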
2. Install mpich2 on every node
- yum install mpich2 -y
- su - yejk
- cd /home/yejk
- vim .mpd.conf
- secretword=westos #the cluster password; it must be identical on every node
- chmod 600 .mpd.conf
Create mpd.hosts, the file that lists the cluster nodes
- vim mpd.hosts
- desktop80.example.com
- desktop83.example.com
Note: if mpd is started as root, create the configuration file as /etc/mpd.conf (without the leading ".") and put mpd.hosts in root's home directory.
- mpd
- mpdtrace
- desktop80.example.com
- mpdallexit
- mpdboot -n 2 -f mpd.hosts (-n 2 is the number of machines to start; -f mpd.hosts names the file listing the nodes to start them on)
- mpdtrace
- desktop80.example.com
- desktop83.example.com
- mpdallexit
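The value passed to -n has to match the number of hosts you want in the ring. As a rough sketch, the count can be derived from mpd.hosts itself. This assumes, as in this setup, that the local host is listed in the file; skipping blank lines and lines starting with # is an extra tidiness assumption:

```shell
#!/bin/sh
# Build a scratch copy of the host list used in this walkthrough.
hosts_file=$(mktemp)
printf '%s\n' desktop80.example.com desktop83.example.com > "$hosts_file"

# Count lines that are neither blank nor comments.
node_count=$(grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hosts_file")
boot_cmd="mpdboot -n $node_count -f mpd.hosts"
echo "$boot_cmd"
```

For the two-host file above this prints the same mpdboot command that was typed by hand.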
3.3 Single-node test
- ./icpi-64
- Enter the number of intervals: (0 quits) 1000000000
- pi is approximately 3.1415926535921401, Error is 0.0000000000023470
- wall clock time = 46.571311
- Enter the number of intervals: (0 quits) 10000
- pi is approximately 3.1415926544231341, Error is 0.0000000008333410
- wall clock time = 0.000542
- Enter the number of intervals: (0 quits) 0
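To see where these numbers come from, here is a small sketch of the calculation, assuming icpi uses the same midpoint-rule integration of 4/(1+x^2) over [0,1] as MPICH's cpi example. The function name estimate_pi is illustrative:

```shell
#!/bin/sh
# Midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0,1].
estimate_pi() {
    awk -v n="$1" 'BEGIN {
        h = 1.0 / n                      # width of each interval
        for (i = 1; i <= n; i++) {
            x = h * (i - 0.5)            # midpoint of interval i
            sum += 4.0 / (1.0 + x * x)
        }
        printf "%.10f\n", h * sum
    }'
}

pi_10000=$(estimate_pi 10000)
echo "$pi_10000"
```

With 10000 intervals the estimate agrees with the icpi output above to the ten digits printed, and the work grows linearly with the interval count, which is why the 1000000000-interval run takes so much longer.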
3.4 Cluster test
- mpdboot -n 2 -f mpd.hosts
- mpiexec -n 2 /root/icpi-64
- Enter the number of intervals: (0 quits) 1000000000
- pi is approximately 3.1415926535899761, Error is 0.0000000000001830
- wall clock time = 15.530082
- Enter the number of intervals: (0 quits) 10000
- pi is approximately 3.1415926544231323, Error is 0.0000000008333392
- wall clock time = 0.006318
- Enter the number of intervals: (0 quits) 0
- mpdallexit
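The two 1000000000-interval timings can be compared directly. A minimal sketch of the arithmetic, using the wall clock times copied from the two runs above:

```shell
#!/bin/sh
# Wall clock times reported above for 1000000000 intervals.
t_single=46.571311    # ./icpi-64 run directly
t_cluster=15.530082   # mpiexec -n 2
speedup=$(awk -v a="$t_single" -v b="$t_cluster" 'BEGIN { printf "%.2f", a / b }')
echo "speedup with 2 processes: $speedup"
```

A ratio above 2 for two processes may mean the two runs were not made under identical conditions, so the figure is best treated as indicative only.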
- Use mpdcheck to get diagnostic information
- mpdcheck -pc
- mpdcheck -l
- Check for errors using the mpd.hosts file
- mpdcheck -f mpd.hosts
- mpdcheck -f mpd.hosts -ssh
- Troubleshoot the connection between any two machines
- On station11:
- mpdcheck -s
- It prints the hostname and the port it is listening on
- server listening at INADDR_ANY on: station11 52576
- On station12:
- mpdcheck -c 192.168.0.1 40782 (pass the host and port printed by mpdcheck -s)
- client successfully recvd ack from server: ack_from_server_to_client
- station11 then reports the result of the message exchange
- server has conn on from ('192.168.0.12', 54438)
- server successfully recvd msg from client: hello_from_client_to_server
- On station11:
- mpd -e &
- It prints the port in use
- [1]12703
- mpd_port=42498
- On station12:
- mpd -h station11 -p 41563 &
- [1]5122
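In the two-machine check, the client has to be given exactly the host and port that the server side printed. The sketch below extracts both from the sample server line shown above and builds the matching client command. The variable names are illustrative:

```shell
#!/bin/sh
# Sample line copied from the mpdcheck -s output above.
server_line='server listening at INADDR_ANY on: station11 52576'

# The last two fields are the host and the port.
client_cmd=$(printf '%s\n' "$server_line" | awk '{ print "mpdcheck -c", $(NF-1), $NF }')
echo "$client_cmd"
```

The same idea applies to mpd -e: the mpd_port value it prints is the port to pass to mpd -h ... -p on the other machine.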
- yum install gcc gcc-c++ make -y
- tar zxf torque-3.0.0.tar.gz
- cd torque-3.0.0
- ./configure --with-rcp=scp --with-default-server=desktop26.example.com (--with-rcp=scp is for the SSH key approach)
- make
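- make install (torque's configuration directory: /var/spool/torque)
- make packages (builds the install packages for the compute nodes, i.e. the packages installed on station11 and station12; make sure all compute nodes and the server node share the same architecture)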
- cp contrib/init.d/pbs_server /etc/init.d/
- cp contrib/init.d/pbs_sched /etc/init.d/
- scp contrib/init.d/pbs_mom desktop80.example.com:/etc/init.d
- scp contrib/init.d/pbs_mom desktop83.example.com:/etc/init.d
- vi /var/spool/torque/server_priv/nodes (define the compute nodes; the server node can also act as a compute node)
- desktop80.example.com
- desktop83.example.com
- ./torque.setup root (set the torque administrator account)
- qterm -t quick (stop torque)
- service pbs_server start (start torque)
- service pbs_sched start (start the scheduler)
- scp torque-package-clients-linux-x86_64.sh torque-package-mom-linux-x86_64.sh desktop80.example.com:~
- scp torque-package-clients-linux-x86_64.sh torque-package-mom-linux-x86_64.sh desktop83.example.com:~
- ./torque-package-clients-linux-x86_64.sh --install
- ./torque-package-mom-linux-x86_64.sh --install
- tar zxf torque-3.0.0.tar.gz
- ./configure --with-rcp=scp --with-default-server=server1.example.com
- make
- make install_mom install_clients
- vi /var/spool/torque/mom_priv/config (apply this configuration on all compute nodes)
- $pbsserver server1.example.com
- $logevent 255
- service pbs_mom start (run this on all compute nodes to start the compute node daemon)
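If the mom_priv/config file is written from a script rather than an editor, the leading $ in its directives needs care, because an unquoted here-document would expand $pbsserver and $logevent as shell variables. A minimal sketch, writing into a scratch directory instead of /var/spool/torque:

```shell
#!/bin/sh
# Write the pbs_mom configuration into a scratch directory for inspection.
mom_dir=$(mktemp -d)
# The quoted 'EOF' stops the shell from expanding $pbsserver and $logevent.
cat > "$mom_dir/config" <<'EOF'
$pbsserver server1.example.com
$logevent 255
EOF
cat "$mom_dir/config"
```

The printed file should match the two lines above character for character before it is copied to the compute nodes.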
- su - yejk
- cd /home/yejk
- vim job1.pbs (serial job)
- #!/bin/bash
- #PBS -N job1
- #PBS -o job1.log
- #PBS -e job1.err
- #PBS -q batch
- cd /home/yejk
- echo Running on hosts `hostname`
- echo Time is `date`
- echo Directory is $PWD
- echo This job runs on the following nodes:
- cat $PBS_NODEFILE
- echo This job has allocated 1 node
- ./prog
- vi job2.pbs (parallel job)
- #!/bin/bash
- #PBS -N job2
- #PBS -o job2.log
- #PBS -e job2.err
- #PBS -q batch
- #PBS -l nodes=2
- cd /home/yejk
- echo Time is `date`
- echo Directory is $PWD
- echo This job runs on the following nodes:
- cat $PBS_NODEFILE
- NPROCS=`wc -l < $PBS_NODEFILE`
- echo This job has allocated $NPROCS nodes
- mpiexec -machinefile $PBS_NODEFILE -np $NPROCS ./prog
- vi prog
- #!/bin/bash
- echo 1000000000 | ./icpi-64 (icpi ships with MPI; just copy it over)
- chmod +x prog
- qsub jobx.pbs (submit the job)
- qstat (list jobs)
- pbsnodes (list nodes)
- [yejk@desktop26 ~]$ cat job1.log
- Running on hosts desktop80.example.com
- Time is Sat Jun 2 05:33:32 CST 2012
- Directory is /home/yejk
- This job runs on the following nodes:
- desktop80.example.com
- This job has allocated 1 node
- Enter the number of intervals: (0 quits) pi is approximately 3.1415926535899708, Error is 0.0000000000001776
- wall clock time = 43.059767
- Enter the number of intervals: (0 quits) No number entered; quitting
- [yejk@desktop26 ~]$ cat job2.log
- Time is Sat Jun 2 05:49:29 CST 2012
- Directory is /home/yejk
- This job runs on the following nodes:
- desktop83.example.com
- desktop80.example.com
- This job has allocated 2 nodes
- Enter the number of intervals: (0 quits) pi is approximately 3.1415926535900072, Error is 0.0000000000002141
- wall clock time = 23.318623
- Enter the number of intervals: (0 quits) No number entered; quitting
- You have mail in /var/spool/mail/yejk
- [yejk@desktop26 ~]$
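To compare jobs without reading each log by hand, the wall clock line can be pulled out of every log file. A small sketch, using scratch copies that contain only the timing lines shown above. The directory and variable names are illustrative:

```shell
#!/bin/sh
# Scratch logs holding the wall clock lines shown above.
log_dir=$(mktemp -d)
echo 'wall clock time = 43.059767' > "$log_dir/job1.log"
echo 'wall clock time = 23.318623' > "$log_dir/job2.log"

# Print "<job> <seconds>" for each log.
wall_times=$(for log in "$log_dir"/job1.log "$log_dir"/job2.log; do
    awk -v job="$(basename "$log" .log)" '/wall clock time/ { print job, $NF }' "$log"
done)
echo "$wall_times"
```

Pointing the loop at the real job1.log and job2.log in /home/yejk gives the same two-column summary for the submitted jobs.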