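Reference: MPI Cluster Environment Setup (MPICH)
Note: I originally ran my distributed programs with MPICH 3.0 and never got them to work; switching to OpenMPI 2.1.1 solved the problem. For installation, see the corresponding section of this post. Nearly all of the cluster setup only has to be done on the Master node (node1).
If you are renting servers to build the cluster, start with a single Master machine: set up the runtime environment, upload the code and datasets, and make sure the program runs on that one machine. Then save an image of the Master, use that image to rent the other server nodes, and carry out the steps below.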
Everything below only needs to be run on the master.
On the master, create a file named local_hosts that lists the IP addresses of all machines, with the following content:
127.0.0.1 localhost
47.104.97.177 node1
47.104.216.8 node2
x.x.x.x node3
x.x.x.x node4
Write the content of local_hosts into the hosts file of every machine (this simply replaces each node's original hosts file):
cat local_hosts > /etc/hosts # update the master
scp /etc/hosts root@node2:/etc/hosts # update node2
scp /etc/hosts root@node3:/etc/hosts # update node3
scp /etc/hosts root@node4:/etc/hosts # update node4
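With more than a handful of nodes, the per-node scp lines above can be driven from local_hosts itself. The following is only a sketch: the function names `list_nodes` and `push_hosts` and the `DRY_RUN` switch are made up for illustration and are not part of the original setup. It assumes each line of local_hosts has the form `IP name` and that root login over scp works as in the commands above.

```shell
# Print the node names found in a hosts-style file,
# skipping localhost, blank lines and comment lines.
list_nodes() {
    awk 'NF >= 2 && $1 !~ /^#/ && $2 != "localhost" { print $2 }' "$1"
}

# push_hosts HOSTS_FILE MASTER_NAME
# Copy HOSTS_FILE to /etc/hosts on every listed node except the master.
# With DRY_RUN=1 the scp commands are printed instead of executed.
push_hosts() {
    hosts_file=$1
    master=$2
    for node in $(list_nodes "$hosts_file"); do
        [ "$node" = "$master" ] && continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp $hosts_file root@$node:/etc/hosts"
        else
            scp "$hosts_file" "root@$node:/etc/hosts"
        fi
    done
}
```

Running `DRY_RUN=1 push_hosts local_hosts node1` first shows which commands would be issued; dropping `DRY_RUN=1` performs the copies.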
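If a machine does not have a public key yet, generate one on each machine as follows:
ssh-keygen -t rsa # run this command and press Enter at every prompt
ls ~/.ssh # list the generated files
Everything below is run on the master.
cd ~/.ssh # the commands below use paths relative to this directory
# Collect the public keys of all nodes
cp id_rsa.pub node1.pub
scp root@node2:/root/.ssh/id_rsa.pub node2.pub # node2
scp root@node3:/root/.ssh/id_rsa.pub node3.pub # node3
scp root@node4:/root/.ssh/id_rsa.pub node4.pub # node4
# Merge the public keys
cat node*.pub >> authorized_keys
# Distribute the authorized_keys file, which now contains every node's public key, to all machines
scp authorized_keys root@node2:/root/.ssh/authorized_keys
scp authorized_keys root@node3:/root/.ssh/authorized_keys
scp authorized_keys root@node4:/root/.ssh/authorized_keys
All machines can now log in to each other without a password. To test:
ssh root@node2 # logs straight into node2 without asking for a password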
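In the directory of the program you want to run, create a file named hostfile as shown below. Note that this format is the one OpenMPI requires: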
node1 slots=1
node2 slots=1
node3 slots=1
node4 slots=1
For MPICH the format is:
node1:1
node2:1
node3:1
node4:1
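Since the two formats differ only in how the slot count is attached to the node name, both files can be produced from one list of nodes. This is a minimal sketch; `gen_hostfile` is a hypothetical helper name, and it assumes every node gets the same number of slots.

```shell
# gen_hostfile FORMAT SLOTS NODE...
# FORMAT is "openmpi" (node slots=N) or "mpich" (node:N).
gen_hostfile() {
    format=$1
    slots=$2
    shift 2
    for node in "$@"; do
        case "$format" in
            openmpi) echo "$node slots=$slots" ;;
            mpich)   echo "$node:$slots" ;;
            *)       echo "unknown format: $format" >&2; return 1 ;;
        esac
    done
}
```

For example, `gen_hostfile openmpi 1 node1 node2 node3 node4 > hostfile` writes the OpenMPI variant shown above.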
Then copy it to the same location on every machine:
scp hostfile root@node2:/home/code/YourProject
scp hostfile root@node3:/home/code/YourProject
scp hostfile root@node4:/home/code/YourProject
Run the program:
mpirun -n 2 -hostfile hostfile --allow-run-as-root ./your_app
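The command above starts 2 processes even though the hostfile lists four nodes. If the intent is to use every declared slot, the process count can be derived from the hostfile instead of being typed by hand. The sketch below assumes the OpenMPI-style format shown earlier; `total_slots` is a made-up helper, and counting a line without `slots=` as one slot is a simplification of this sketch, not necessarily what OpenMPI itself does.

```shell
# Sum the slots declared in an OpenMPI-style hostfile ("node1 slots=1").
# Blank lines and comment lines are ignored; a line without slots=N
# is counted as a single slot (assumption of this sketch).
total_slots() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            total += n
        }
        END { print total + 0 }
    ' "$1"
}
```

It could then be used as `mpirun -n "$(total_slots hostfile)" -hostfile hostfile --allow-run-as-root ./your_app`.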
If you get firewall-related errors, you can turn the firewall off:
https://blog.csdn.net/qq_15256443/article/details/101289412
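1. Check the current firewall status
sudo ufw status
2. Enable the firewall
sudo ufw enable
3. Disable the firewall
sudo ufw disable
4. Show the firewall version
sudo ufw version
5. Allow incoming connections to this machine by default
sudo ufw default allow
6. Deny incoming connections to this machine by default
sudo ufw default deny
7. Allow external access to port 53
sudo ufw allow 53
8. Deny external access to port 53
sudo ufw deny 53
9. Allow a given IP address to access all ports on this machine
sudo ufw allow from 192.168.0.1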
————————————————
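Copyright notice: the list above is from an original article by CSDN blogger 「ITROOKIEIS」, licensed under CC 4.0 BY-SA; reproductions must include a link to the original source and this notice.
Original link: https://blog.csdn.net/qq_15256443/article/details/101289412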
https://blog.csdn.net/liu_feng_zi_/article/details/108405806
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: node2
Remote host: 172.31.168.80
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
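This looks like a firewall problem, so the firewall has to be turned off. However, ufw status already reports Status: inactive and the error still appears, while top shows that firewalld is still running. In that case, try the following to open things up further:
Reference: https://blog.51cto.com/u_14841814/3007756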
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
Offending key for IP in /root/.ssh/known_hosts:6
Warning: the ECDSA host key for 'node1' differs from the key for the IP address '47.104.186.95'
Offending key for IP in /root/.ssh/known_hosts:6
Matching host key in /root/.ssh/known_hosts:8
Are you sure you want to continue connecting (yes/no)? yes
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-173-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
New release '20.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
The node1-to-IP mapping previously saved on this machine may no longer match the current one, so delete the old records:
rm ~/.ssh/known_hosts
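Another possibility is that the machines have not yet made their first connection to each other. Run ssh nodex on every machine, where x ranges over the ids of all machines, including the machine itself. The first connection asks for confirmation and you have to answer yes, as shown below; this prompt may be one reason a run appears to hang. After the first login, ssh nodex connects directly with no interaction.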
root@node3:/home/code/Ingress# ssh node4
The authenticity of host 'node4 (47.104.97.177)' can't be established.
ECDSA key fingerprint is SHA256:Zts/zzID/TYAQMN/Hs0m42+jzzVhKbouyvHeoVC81nI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node4,47.104.97.177' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-173-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
New release '20.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Mon Nov 7 18:43:06 2022 from 118.190.205.234
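The interactive yes prompt seen in this log can be handled in one pass instead of logging in to every node by hand. The sketch below is an assumption-laden alternative, not what was done in this post: `warm_up_ssh` is a hypothetical helper, and the `accept-new` value of `StrictHostKeyChecking` requires OpenSSH 7.6 or newer. `BatchMode=yes` makes ssh fail instead of asking for a password, so the same call also checks that passwordless login works.

```shell
# warm_up_ssh NODE...
# Connect once to every node so its host key is stored in known_hosts.
# Prints one status line per node and returns non-zero if any node failed.
warm_up_ssh() {
    failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new \
               -o ConnectTimeout=5 "root@$node" true; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    return "$failed"
}
```

It would be run on each machine as `warm_up_ssh node1 node2 node3 node4`. Note that `accept-new` trusts whatever key a node presents on first contact, which is a trade-off to weigh on a network you do not control.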
[node3][[21043,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
[node2][[21043,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
[node1][[21043,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
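In summary, the most likely cause is iptables. A simple fix:
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F # running this command solved it for me
If that does not help, try running the commands above on every machine.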
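Flushing all rules and accepting everything leaves each machine fully open. A narrower option, which was not tested in this post and is offered only as a sketch, is to accept traffic from the cluster's own addresses. The helper name `allow_peer_rules` is made up; it assumes a hosts-style file with one `IP name` pair per line, and the addresses in it must be the ones the peers actually connect from (the errors above mention 172.31.x.x addresses, which may differ from the public IPs in local_hosts).

```shell
# allow_peer_rules HOSTS_FILE
# Print one iptables command per peer address in a hosts-style file,
# skipping localhost, blank lines and comment lines. Nothing is executed.
allow_peer_rules() {
    awk 'NF >= 2 && $1 !~ /^#/ && $2 != "localhost" {
        print "iptables -I INPUT -s " $1 " -j ACCEPT"
    }' "$1"
}
```

`allow_peer_rules local_hosts` prints the commands for review; piping the output to `sh` as root would apply them.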
References:
HPC: openmpi no route to host
Resolving the "failed: No route to host" problem
Installing OpenMPI on Linux: download, install, environment variables, and a test example
Download OpenMPI: https://www.open-mpi.org/
# Pick a directory to hold the downloaded files
cd software
# 1. Download and extract
wget https://download.open-mpi.org/release/open-mpi/v2.1/openmpi-2.1.1.tar.gz
tar zxf openmpi-2.1.1.tar.gz
# 2. Configure the install prefix (choose your own path), then build and install
cd openmpi-2.1.1
./configure --prefix=/home/mpi/openmpi2.1.1
make
make install # do not skip this step
# 3. Set the environment variables, using your own install path. Edit the file:
vim ~/.bashrc # applies to the current user only; for all users see: https://blog.csdn.net/White_Idiot/article/details/78253004
## Add or modify the following lines
MPI_HOME=/home/mpi/openmpi2.1.1
export PATH=${MPI_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
export MANPATH=${MPI_HOME}/share/man:$MANPATH
# 4. Reload the modified environment variables
source ~/.bashrc
whereis mpirun # show where mpirun is installed
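One detail of the export lines above: each time ~/.bashrc is sourced they prepend the same directories again, and when LD_LIBRARY_PATH starts out empty the result ends with a colon, which the dynamic loader treats as the current directory. A small guard avoids both. This is a sketch with hypothetical helper names (`path_has`, `path_prepend`), not part of the original instructions.

```shell
# path_has DIR LIST
# Succeeds if DIR is one of the entries of the colon-separated LIST.
path_has() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# path_prepend DIR LIST
# Print LIST with DIR put in front, unless DIR is already an entry.
# An empty LIST yields just DIR, with no trailing colon.
path_prepend() {
    if path_has "$1" "$2"; then
        echo "$2"
    else
        echo "$1${2:+:$2}"
    fi
}
```

In ~/.bashrc this could be used as `export LD_LIBRARY_PATH="$(path_prepend "$MPI_HOME/lib" "$LD_LIBRARY_PATH")"`, and likewise for PATH and MANPATH.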
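Configuring the environment for other users:
Setting and viewing environment variables on Ubuntu
To make an environment variable take effect for all users, edit the profile file:
sudo vim /etc/profile
Add the line:
export CLASS_PATH=./JAVA_HOME/lib:$JAVA_HOME/jre/lib
Logging out or rebooting applies the change. To make the new variable take effect immediately:
source /etc/profile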
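The scripts below correspond to the steps described earlier:
1_clear_old_ip.sh:
################################################################################
# Clears the cached mappings between old IP addresses and node names.
# Alternatively, simply delete the file: rm /root/.ssh/known_hosts
################################################################################
node_num=4
known_hosts=/root/.ssh/known_hosts
# Remove each node's old entry while the file still exists
for (( i = 1; i <= node_num; i++ ))
do
    # e.g. ssh-keygen -f "/root/.ssh/known_hosts" -R "node1"
    echo "ssh-keygen -f \"$known_hosts\" -R \"node${i}\""
    ssh-keygen -f "$known_hosts" -R "node${i}"
done
# Then delete the old known_hosts itself, which also drops entries stored by IP
# (-f: no error if the file is already gone)
rm -f "$known_hosts"
echo "finish clear old ip map!"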
2_gen_assign_hosts.sh:
################################################################################
# About this script:
# - Main purpose: distribute the manually prepared file (local_hosts) to every machine in the
#   cluster, i.e. replace each machine's own /etc/hosts with local_hosts.
################################################################################
# Install sshpass
echo 'install sshpass'
sudo apt install -y sshpass
echo -e 'install finish \n---------------------------------------------------\n'
# Ask for the login password shared by all machines in the cluster
read -s -p "Enter Cluster unified password: " passwd
echo
node_num=4
#echo $passwd
cp local_hosts /etc/hosts # update this machine first (a local copy needs no password)
# Distribute the hosts file to all machines
for (( i = 1; i <= node_num; i++ ))
do
    # Print the command without the password, then run it
    echo "scp local_hosts root@node${i}:/etc/hosts"
    sshpass -p "$passwd" scp local_hosts "root@node${i}:/etc/hosts"
done
echo "finish assignment local_hosts -> /etc/hosts!"
3_gen_assign_key.sh:
################################################################################
# About this script:
# - Main purpose: when a cluster is built from my image, automatically set up passwordless
#   login between the machines inside the cluster.
# - For convenience, my local machine also logs in to the cluster without a password, so its
#   public key (authorized_keys_ys_lab) has been uploaded to the directory /root/.ssh.
# - All machines in the cluster share one user name (root) and one password (passwd). The code
#   is fairly hard-coded, but it is easy to read and adapt to your needs.
# - Until passwordless login is configured, a password is needed to reach the other machines,
#   so the shared password is first read interactively from the user and stored in passwd.
# Others:
# - The password is passed on with sshpass; install it if it is missing, see
#   https://blog.csdn.net/nfer_zhuang/article/details/42646849
################################################################################
# Install sshpass
# sudo apt install sshpass
# Ask for the login password shared by all machines in the cluster
read -s -p "Enter Cluster unified password: " passwd
echo
node_num=4
cd /root/.ssh
rm -rf temp_key
mkdir temp_key
# Take the local machine's key, so that it can talk to the cluster
cp authorized_keys_ys_lab temp_key/local_ys_key
# Take the master node's key
cp id_rsa.pub temp_key/node1.pub
cd temp_key
# Fetch the keys of all nodes
for (( i = 1; i <= node_num; i++ ))
do
    sshpass -p "$passwd" scp "root@node${i}:/root/.ssh/id_rsa.pub" "node${i}.pub"
done
echo "finish get public key!"
# Merge the keys
cat local_ys_key >> authorized_keys
cat node*.pub >> authorized_keys
# Distribute the authorized_keys file, which contains every node's public key, to all machines
for (( i = 1; i <= node_num; i++ ))
do
    sshpass -p "$passwd" scp authorized_keys "root@node${i}:/root/.ssh/authorized_keys"
done
echo "finish assignment public key!"
4_gen_assign_hostfile.sh:
################################################################################
# About this script:
# - Main purpose: distribute the manually prepared file (hostfile) to every machine in the
#   cluster, i.e. copy hostfile into the project directory on all machines.
# hostfile (OpenMPI format) looks like this:
# node1 slots=1
# node2 slots=1
# node3 slots=1
# node4 slots=1
# hostfile (MPICH format) looks like this:
# node1:1
# node2:1
# node3:1
# node4:1
#
################################################################################
echo "Before running, fill in the path of the distributed project (YourProject_Path) in this file"
# Destination path for hostfile
YourProject_Path=
if [ -z "$YourProject_Path" ];
then
    echo " ERROR: YourProject_Path is empty."
    exit 1
else
    echo " YourProject_Path=$YourProject_Path"
fi
# Number of nodes
node_num=4
# Distribute hostfile to all machines
for (( i = 1; i <= node_num; i++ ))
do
    scp hostfile "root@node${i}:${YourProject_Path}"
done
echo "finish assignment hostfile!"