I. Problems encountered when configuring MPICH2 on IA64:
(1) /home/zhxue/mpich2-1.1.1p1/configure --prefix=/opt/app/mpich2/ 2>&1 | tee c.txt
configure: error:
The nemesis channel was selected yet no native atomic primitives are
available on this platform. OpenPA can emulate atomic primitives using
locks by specifying --with-atomic-primitives=no but performance will be
very poor. This override should only be specified for correctness
testing purposes.
configure: error: /home/zhxue/mpich2-1.1.1p1/src/mpid/ch3/channels/nemesis/configure failed for channels/nemesis
configure: error: Configure of src/mpid/ch3 failed!
The fix is described here:
http://trac.mcs.anl.gov/projects/mpich2/ticket/764
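The error text above already names one override, --with-atomic-primitives=no. A second route, assumed here rather than taken from the ticket, is to select a channel other than nemesis, such as ch3:sock. The sketch below only prints the two candidate command lines so they can be reviewed before running; configure_cmd is a hypothetical helper, and the paths are the ones used above.

```shell
#!/bin/sh
# Sketch only: print a configure command line for one of two workarounds.
# SRC and PREFIX are the paths used earlier in this note.
SRC=/home/zhxue/mpich2-1.1.1p1
PREFIX=/opt/app/mpich2

# configure_cmd is a hypothetical helper, not part of MPICH2.
configure_cmd() {
    case "$1" in
        emulate) extra="--with-atomic-primitives=no" ;;  # named in the error text
        sock)    extra="--with-device=ch3:sock" ;;       # assumption: avoid nemesis
        *)       echo "usage: configure_cmd emulate|sock" >&2; return 1 ;;
    esac
    echo "$SRC/configure --prefix=$PREFIX $extra"
}

configure_cmd emulate
configure_cmd sock
```

As the error message warns, emulated atomics are very slow, so the first variant is suitable for correctness testing only.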
(2) Running the following command produces an error message:
[root@c2402 root]# mpdboot -n 2 -f /opt/app/mpd.hosts
mpdboot_c2402 (handle_mpd_output 415): failed to connect to mpd on c2403
It fails because the firewall blocks mpd. Set a port range in your mpd.conf file, open the same range in
/etc/sysconfig/iptables, and then run service iptables restart. The mpd.conf entry looks like the following:
MPD_PORT_RANGE=55000:56000
If you encounter a "no port" error message, make sure Python 2.6 or later is installed.
In addition, mpdboot itself starts mpd on the other nodes listed in mpd.hosts. If you have already run mpd & by hand on those nodes, mpdboot will report an error.
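To make the firewall step concrete, here is a minimal sketch that prints a rule line for /etc/sysconfig/iptables matching the MPD_PORT_RANGE shown above. The helper name port_range_rule and the default chain INPUT are assumptions; some distributions use a different chain name in that file, so check the existing rules before adding the line.

```shell
#!/bin/sh
# Sketch only: print an iptables-save style rule that accepts TCP traffic
# on a port range written as first:last.
# port_range_rule is a hypothetical helper; the chain defaults to INPUT.
port_range_rule() {
    range="$1"
    chain="${2:-INPUT}"
    echo "-A $chain -p tcp -m tcp --dport $range -j ACCEPT"
}

# Same range as MPD_PORT_RANGE in mpd.conf.
port_range_rule 55000:56000
```

The printed line goes before any final REJECT rule in the file, on every node, followed by service iptables restart.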
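II. Problems encountered when configuring MVAPICH2 on Mellanox hardware:
(1) The NIC driver could never be found
Installing OFED fixed this: it installs the NIC driver, MVAPICH2, and everything else in one go. It has to be installed on the 2.6.18 kernel, though; installation did not succeed on other kernels.
(2) Nodes cannot communicate with each other
On every node, run: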
service openibd start
service opensmd start
That fixes it. Before opensmd was started, each node could only communicate with itself and never with the other node.
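One way to check whether the subnet manager has done its work is to look at the port state: without a running subnet manager, an InfiniBand port typically stays in the Initializing state instead of becoming Active. The sketch below assumes the ibstat tool shipped with OFED is installed; first_port_state is a hypothetical helper, and the sample input is illustrative, not captured from the machines in this note.

```shell
#!/bin/sh
# Sketch only: print the logical state of the first port found in
# ibstat-style text read from stdin.
# first_port_state is a hypothetical helper.
first_port_state() {
    # Match "State:" at the start of a line (after indentation), which
    # skips the separate "Physical state:" line.
    awk -F': ' '/^[ \t]*State:/ { print $2; exit }'
}

# Real use on a node (assumes ibstat is installed):
#   ibstat | first_port_state
# Illustrative input only:
printf "Port 1:\n\t\tState: Initializing\n\t\tPhysical state: LinkUp\n" | first_port_state
```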
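(3) Works as root but not as a non-root user
After passwordless login between the nodes was set up, jobs ran as root but not as user zhxue.
The real cause turned out to be the locked-memory (memlock) limit applied to ordinary users.
Add the following to /etc/security/limits.conf: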
#begin by zhxue
* soft memlock unlimited
* hard memlock unlimited
#end by zhxue
After that, the following command runs successfully:
[zhxue@mpi002 /]$ /usr/mpi/gcc/mvapich2-1.6/bin/mpiexec -np 50 -hosts mpi002,mpi006 /home/zhxue/mpiprog/cpi
Process 2 of 50 is on mpi002
Process 6 of 50 is on mpi002
Process 11 of 50 is on mpi006
Process 10 of 50 is on mpi002
...
Process 17 of 50 is on mpi006
Process 16 of 50 is on mpi002
pi is approximately 3.1415926544231274, Error is 0.0000000008333343
wall clock time = 0.830070
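The limits.conf change in (3) only takes effect for new login sessions, so it helps to confirm the limit the current shell actually has before launching a job. A minimal sketch, assuming bash (where ulimit -l reports the value in KB); memlock_unlimited is a hypothetical helper.

```shell
#!/bin/bash
# Sketch only: report whether the locked-memory limit is unlimited.
# memlock_unlimited is a hypothetical helper; it succeeds only when the
# value passed in is the string "unlimited".
memlock_unlimited() {
    [ "$1" = "unlimited" ]
}

current=$(ulimit -l 2>/dev/null) || current=unknown
if memlock_unlimited "$current"; then
    echo "memlock limit: unlimited"
else
    echo "memlock limit: $current (log in again after editing limits.conf)"
fi
```

Run it as the non-root user on each node, since the limit is per session and per node.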
(4) mpirun_rsh works on node mpi006 (installed on local disk) but not on node mpi002 (a diskless system that is a full copy of mpi006)
[zhxue@mpi006 mvapich2-1.6]$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -hostfile /home/zhxue/mpiprog/mpi.hosts -np 2 /home/zhxue/mpiprog/cpi
Process 0 of 1 is on mpi006
Process 0 of 1 is on mpi002
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000500
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000524
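However, this result does not show one job split into several processes across the nodes: both processes report "Process 0 of 1", so they are two independent single-process runs. Note that the launcher here comes from mvapich-1.2.0, whereas the mpiexec used in (3) comes from mvapich2-1.6.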
[zhxue@mpi002 /]$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -hostfile /home/zhxue/mpiprog/mpi.hosts -np 2 /home/zhxue/mpiprog/cpi
Child exited abnormally!
Killing remote processes...Signal 15 received.
DONE
To debug it, I tried:
[root@mpi002 mpiprog]# /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun_rsh -debug -hostfile /home/zhxue/mpiprog/mpi.hosts -np 2 /home/zhxue/mpiprog/cpi
debug enabled !
RSH/SSH command failed!: No such file or directory
RSH/SSH command failed!: No such file or directory
Child exited abnormally!
Killing remote processes...DONE
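But running the same command on mpi006 also fails, with exactly the same output as above, which nearly misled me. The -debug option was not much help.
(5) "unable to change wdir" problem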
[root@mpi002 ~]# su zhxue
[zhxue@mpi002 root]$ pwd
/root
[zhxue@mpi002 root]$ /usr/mpi/gcc/mvapich2-1.6/bin/mpiexec -np 10 -hosts mpi002,mpi006 /home/zhxue/mpiprog/cpi
[proxy:0:0@mpi002] launch_procs (./pm/pmiserv/pmip_cb.c:665): unable to change wdir to /root (Permission denied)
Killed
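Solution: after su zhxue the shell is still in /root, which user zhxue cannot enter, so change to a directory that zhxue can access before launching: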
[zhxue@mpi002 root]$ cd /home/zhxue
[zhxue@mpi002 ~]$ /usr/mpi/gcc/mvapich2-1.6/bin/mpiexec -np 2 -hosts mpi002,mpi006 /home/zhxue/mpiprog/cpi
Process 0 of 2 is on mpi002
Process 1 of 2 is on mpi006
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000290
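The check in (5) can be done up front. The sketch below tests whether the current user can enter the current directory on the local node; it says nothing about the remote nodes, where the same path also has to be accessible. can_enter is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: succeed when the given path is a directory that the
# current user is allowed to enter.
# can_enter is a hypothetical helper.
can_enter() {
    [ -d "$1" ] && [ -x "$1" ]
}

if can_enter "$PWD"; then
    echo "ok: $PWD is accessible"
else
    echo "cd to an accessible directory (for example \$HOME) before running mpiexec"
fi
```

Hydra's mpiexec also accepts a -wdir option for setting the working directory explicitly; whether that avoids the error in this setup was not tested here.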