Installing OpenMPI and Running a Distributed Project

目录

  • Cloud-server setup
  • Configuring hosts
  • Configuring passwordless SSH
  • Running the distributed program
  • Firewall
  • Other issues
    • 1. Multi-node MPI run: HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from x to y (No route to host)
    • 2. A process or daemon was unable to complete a TCP connection to another process:
    • 3. Warning: the ECDSA host key for 'node1' differs from the key for the IP address '47.104.186.95'
    • 4. The run hangs with no output
    • 5. [node3][[21043,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
  • Installing OpenMPI
  • Cluster-setup scripts

Adapted from: "MPI cluster environment setup (MPICH)" (in Chinese)
Note: I originally used MPICH 3.0 for the distributed program and never got it running; switching to OpenMPI 2.1.1 solved the problem (see the installation section in this article). Also, almost all of the cluster setup only needs to be done on the master node (node1).

Cloud-server setup

If you rent cloud servers to build the cluster, start by provisioning a single master machine: set up the runtime environment, upload the code and dataset, and verify that everything runs on that one machine. Then save an image of the master, use that image to provision the remaining nodes, and continue with the steps below.

Configuring hosts

Everything in this section is executed on the master only.
On the master, create a file named local_hosts that lists the IPs of all machines:

127.0.0.1       localhost

47.104.97.177 node1
47.104.216.8  node2
x.x.x.x  node3
x.x.x.x  node4

Write the contents of local_hosts into each machine's hosts file (this simply replaces each node's original /etc/hosts):

cat local_hosts > /etc/hosts  # update the master
scp /etc/hosts root@node2:/etc/hosts  # update node2
scp /etc/hosts root@node3:/etc/hosts  # update node3
scp /etc/hosts root@node4:/etc/hosts  # update node4

Configuring passwordless SSH

If a machine does not have a key pair yet, generate one on it first:

ssh-keygen -t rsa # run this, then just press Enter through every prompt
ls ~/.ssh         # inspect the generated files

Everything below is executed on the master, inside /root/.ssh:

# collect every node's public key
cp id_rsa.pub node1.pub
scp root@node2:/root/.ssh/id_rsa.pub node2.pub # node2
scp root@node3:/root/.ssh/id_rsa.pub node3.pub # node3
scp root@node4:/root/.ssh/id_rsa.pub node4.pub # node4
# merge the public keys
cat node* >> authorized_keys
# distribute the merged authorized_keys file to every machine
scp authorized_keys root@node2:/root/.ssh/authorized_keys
scp authorized_keys root@node3:/root/.ssh/authorized_keys
scp authorized_keys root@node4:/root/.ssh/authorized_keys

At this point every machine can log in to every other machine without a password; test it:

ssh root@node2  # you should land in node2 without a password prompt
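To confirm the setup on all nodes at once, a short loop can attempt a non-interactive login to each one. This is a minimal sketch: the node names follow the hosts file above, and `SSH_CMD` is an override hook of my own (not an ssh feature) so the loop can be dry-run.

```shell
#!/bin/sh
# check_login.sh: verify passwordless SSH to every node.
# BatchMode=yes makes ssh fail instead of prompting for a password.
node_num=4
ssh_cmd=${SSH_CMD:-ssh}   # override hook, e.g. SSH_CMD=true for a dry run

check_all_nodes() {
  i=1
  while [ "$i" -le "$node_num" ]; do
    if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=5 "root@node${i}" true
    then
      echo "node${i}: passwordless login OK"
    else
      echo "node${i}: login FAILED (missing key? firewall?)"
    fi
    i=$((i + 1))
  done
}

# usage: check_all_nodes
```

Any FAILED line points at a node whose authorized_keys was not updated or that is unreachable.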

Running the distributed program

In the project directory, create the following hostfile. Note that this format is what OpenMPI expects:

node1 slots=1
node2 slots=1
node3 slots=1
node4 slots=1

If you use MPICH instead, the format is:

node1:1
node2:1
node3:1
node4:1

Then copy it to the same path on every machine:

scp hostfile root@node2:/home/code/YourProject
scp hostfile root@node3:/home/code/YourProject
scp hostfile root@node4:/home/code/YourProject

Run the code (here with 2 processes; -n can go up to the total slots in hostfile):

mpirun -n 2 -hostfile hostfile --allow-run-as-root ./your_app
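Before launching the real application, it helps to smoke-test the launch path with plain `hostname` as the program: one line per rank, printed from the right machines, shows that hostfile, SSH, and the network all work. A sketch; `MPIRUN` is an override hook of my own, not an OpenMPI option.

```shell
#!/bin/sh
# smoke_test.sh: run `hostname` across the cluster before the real app.
mpirun_cmd=${MPIRUN:-mpirun}
smoke_cmd="$mpirun_cmd -n 4 -hostfile hostfile --allow-run-as-root hostname"
echo "running: $smoke_cmd"
# word-splitting of $smoke_cmd is intentional here
$smoke_cmd || echo "smoke test failed: check PATH, hostfile, and firewalls"
```

If this hangs or fails, fix SSH and firewall issues (see below) before debugging your own program.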

Firewall

If you hit firewall-related errors, you can disable the firewall. The following ufw cheat sheet is quoted from:
https://blog.csdn.net/qq_15256443/article/details/101289412

1. Check the current firewall status
sudo ufw status
2. Enable the firewall
sudo ufw enable
3. Disable the firewall
sudo ufw disable
4. Show the firewall version
sudo ufw version
5. Allow external access to this machine by default
sudo ufw default allow
6. Deny external access to this machine by default
sudo ufw default deny
7. Allow external access to port 53
sudo ufw allow 53
8. Deny external access to port 53
sudo ufw deny 53
9. Allow one IP address to access all local ports
sudo ufw allow from 192.168.0.1
(Original author of the list above: CSDN blogger ITROOKIEIS, CC 4.0 BY-SA.)

Other issues

1. Multi-node MPI run: HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from x to y (No route to host)

https://blog.csdn.net/liu_feng_zi_/article/details/108405806

2. A process or daemon was unable to complete a TCP connection to another process:

------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    node2
  Remote host:   172.31.168.80
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

This is most likely a firewall problem and the firewall needs to be turned off. However, even when ufw status reports Status: inactive, the error can persist, and top shows that firewalld is still running. In that case the following commands open things up further (adapted from: https://blog.51cto.com/u_14841814/3007756):

    iptables -P INPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -P OUTPUT ACCEPT
    iptables -F

3. Warning: the ECDSA host key for 'node1' differs from the key for the IP address '47.104.186.95'


Warning: the ECDSA host key for 'node1' differs from the key for the IP address '47.104.186.95'
Offending key for IP in /root/.ssh/known_hosts:6
Matching host key in /root/.ssh/known_hosts:8
Are you sure you want to continue connecting (yes/no)? yes
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-173-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
New release '20.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

The node1 host key cached on this machine no longer matches the node's current IP, so delete the stale entries:

rm ~/.ssh/known_hosts
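Deleting the whole known_hosts file also forgets every unrelated machine. A gentler alternative is to remove only the stale entries with `ssh-keygen -R`; a sketch, where `clean_stale_keys` is a hypothetical helper name of my own:

```shell
#!/bin/sh
# Remove only the stale host-key entries instead of deleting the whole file.
# ssh-keygen -R rewrites the file and keeps a backup at <file>.old.
clean_stale_keys() {
  kh=$1; shift
  for h in "$@"; do
    ssh-keygen -f "$kh" -R "$h" >/dev/null 2>&1
  done
}

# usage (on the machine that printed the warning):
# clean_stale_keys /root/.ssh/known_hosts node1 47.104.186.95
```

Pass both the node name and its IP, since known_hosts may hold separate entries for each.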

4. The run hangs with no output

The machines may never have made initial SSH contact with each other. On every machine, run ssh nodex for every node x in the cluster, including the machine itself. The first connection asks you to answer yes to the host-key prompt, and an unanswered prompt is one possible cause of the hang. After that first login, ssh nodex works without any interaction.

root@node3:/home/code/Ingress# ssh node4
The authenticity of host 'node4 (47.104.97.177)' can't be established.
ECDSA key fingerprint is SHA256:Zts/zzID/TYAQMN/Hs0m42+jzzVhKbouyvHeoVC81nI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node4,47.104.97.177' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-173-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
New release '20.04.5 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


Welcome to Alibaba Cloud Elastic Compute Service !

Last login: Mon Nov  7 18:43:06 2022 from 118.190.205.234
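Instead of logging in to every pair of machines by hand, the host keys can be collected non-interactively with ssh-keyscan, so no machine ever stops at the yes/no prompt. A sketch, run on every machine; the node names come from the hosts setup earlier, and `KNOWN_HOSTS` is an override hook of my own:

```shell
#!/bin/sh
# Prefetch every node's host key into known_hosts with ssh-keyscan,
# so later ssh/mpirun launches never block on the yes/no prompt.
kh=${KNOWN_HOSTS:-$HOME/.ssh/known_hosts}

collect_host_keys() {
  for i in 1 2 3 4; do
    ssh-keyscan -T 5 "node${i}" 2>/dev/null
  done >> "$kh"
}

# usage: collect_host_keys
```

This only caches host keys; passwordless login still requires the authorized_keys setup from earlier.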

5.[node3][[21043,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)

[node3][[21043,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
[node2][[21043,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)
[node1][[21043,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 172.31.168.86 failed: No route to host (113)

In short, this usually comes down to iptables; a simple fix:

iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F  # this one solved it for me

If that is not enough, try running the commands on every machine.
References (both in Chinese):
HPC: openmpi "no route to host"
Solving "failed: No route to host"

Installing OpenMPI

Download, install, set environment variables, and run a test on Linux.
Download OpenMPI from: https://www.open-mpi.org/

# pick a directory to hold the download
cd software
# 1. download and extract
wget https://download.open-mpi.org/release/open-mpi/v2.1/openmpi-2.1.1.tar.gz
tar zxf openmpi-2.1.1.tar.gz

# 2. set the install prefix, then build and install (choose your own prefix)
cd openmpi-2.1.1
./configure --prefix=/home/mpi/openmpi2.1.1
make
make install  # don't forget this step

# 3. set environment variables (paths must match the prefix above):
vim ~/.bashrc  # per-user; to apply to all users see: https://blog.csdn.net/White_Idiot/article/details/78253004
## add the following lines
MPI_HOME=/home/mpi/openmpi2.1.1
export PATH=${MPI_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
export MANPATH=${MPI_HOME}/share/man:$MANPATH

# 4. reload the modified environment
source ~/.bashrc
whereis mpirun  # confirm where mpirun is installed
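After reloading the environment, a one-file MPI program makes a good sanity check. A sketch; the compile and run steps only succeed once mpicc and mpirun are actually on PATH:

```shell
#!/bin/sh
# Write, build, and run a minimal MPI hello-world on the local machine.
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

if command -v mpicc >/dev/null 2>&1; then
  mpicc hello_mpi.c -o hello_mpi
  mpirun -n 2 --allow-run-as-root ./hello_mpi
else
  echo "mpicc not found; check that \$MPI_HOME/bin is on PATH"
fi
```

Two "hello from rank ..." lines mean the single-machine install works; after that, test across nodes with the hostfile.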

Configuring the environment for other users:
Reference: "Setting and viewing environment variables on Ubuntu" (in Chinese)

To make environment variables visible to all users, edit the profile file:
sudo vim /etc/profile
and add a line such as (example from the referenced post):
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
Log out or reboot for the change to take effect; to apply it immediately:
source /etc/profile

Cluster-setup scripts

The scripts below automate the steps described above.
1_clear_old_ip.sh:

################################################################################
#  Clears the stale cached host-key entries that map node names to old IPs.
#  The blunt option is to delete the whole file: rm /root/.ssh/known_hosts
################################################################################
node_num=4

# delete the old /root/.ssh/known_hosts wholesale; comment this out if you
# prefer only the targeted per-node removal in the loop below
rm /root/.ssh/known_hosts

for(( i = 1; i <= ${node_num}; i = i + 1 ))
do
  # ssh-keygen -f "/root/.ssh/known_hosts" -R "node1"
  cmd="ssh-keygen -f \"/root/.ssh/known_hosts\" -R \"node${i}\""
  echo -e $cmd
  eval $cmd
done
echo "finish clear old ip map!"

2_gen_assign_hosts.sh:

################################################################################
# What this script does:
#  - Distributes the manually prepared local_hosts file to every node in the
#    cluster, i.e. replaces each machine's own /etc/hosts with local_hosts.
################################################################################

# install sshpass
echo 'install sshpass'
sudo apt install sshpass
echo -e 'install finish \n---------------------------------------------------\n'

# read the cluster-wide login password interactively
read -s -p "Enter Cluster unified password: "  passwd
node_num=4

#echo $passwd

cp local_hosts /etc/hosts  # update this machine first (plain local copy)

# distribute the hosts file to every machine
for(( i = 1; i <= ${node_num}; i = i + 1 ))
do
  cmd="sshpass -p $passwd scp local_hosts root@node${i}:/etc/hosts"  # update node${i}
  echo $cmd
  eval $cmd
done

echo "finish assignment local_hosts -> /etc/hosts!"

3_gen_assign_key.sh:

################################################################################
# What this script does:
#  - Automates passwordless-login setup between all cluster machines when the
#    cluster is built from my machine image.
#  - For convenience, my local machine also gets passwordless access to the
#    cluster, so its public key (authorized_keys_ys_lab) was uploaded to
#    /root/.ssh beforehand.
#  - Every cluster machine uses the same username (root) and the same password
#    (passwd). The code hard-codes this layout, but it is easy to read and
#    adapt to your own setup.
#  - Until passwordless login is in place, reaching the other machines requires
#    the password, so the script first reads the shared password interactively
#    into the variable passwd.
#  Notes:
#   - Non-interactive password entry relies on sshpass; install it if missing.
#     See: https://blog.csdn.net/nfer_zhuang/article/details/42646849
################################################################################

# install sshpass
# sudo apt install sshpass

# read the cluster-wide login password interactively
read -s -p "Enter Cluster unified password: "  passwd
node_num=4

cd /root/.ssh
rm -rf temp_key
mkdir temp_key
# grab the local machine's key so it can also reach the cluster
cp authorized_keys_ys_lab temp_key/local_ys_key
# grab the master node's key
cp id_rsa.pub temp_key/node1.pub
cd temp_key
# fetch the keys of all remaining nodes
for(( i = 1; i <= ${node_num}; i = i + 1 ))
do
  sshpass -p $passwd scp root@node${i}:/root/.ssh/id_rsa.pub node${i}.pub
done
echo "finish get public key!"

# merge the keys
cat local_ys_key >> authorized_keys
cat node* >> authorized_keys
# distribute the merged authorized_keys file to every machine
for(( i = 1; i <= ${node_num}; i = i + 1 ))
do
  sshpass -p $passwd scp authorized_keys root@node${i}:/root/.ssh/authorized_keys
done

echo "finish assignment public key!"

4_gen_assign_hostfile.sh:

################################################################################
# What this script does:
#  - Distributes the manually prepared hostfile to the project directory on
#    every node in the cluster.
# hostfile contents (OpenMPI format):
#   node1 slots=1
#   node2 slots=1
#   node3 slots=1
#   node4 slots=1
# hostfile contents (MPICH format):
#   node1:1
#   node2:1
#   node3:1
#   node4:1
#
################################################################################

echo "Before running, set the target project path (YourProject_Path) in this file"
# destination path that hostfile is copied to
YourProject_Path=

if [ -z "$YourProject_Path" ];
then
  echo " ERROR: YourProject_Path is empty."
  exit
else
  echo " YourProject_Path=$YourProject_Path"
fi

# number of nodes
node_num=4

# distribute the hostfile to every machine
for(( i = 1; i <= ${node_num}; i = i + 1 ))
do
  scp hostfile root@node${i}:${YourProject_Path}
done

echo "finish assignment hostfile!"
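The four scripts are meant to run in order. A small wrapper can chain them; a sketch, where run_all.sh is a hypothetical file name of my own and the scripts are assumed to sit in the current directory:

```shell
#!/bin/sh
# run_all.sh: execute the four cluster-setup scripts in order,
# stopping at the first missing or failing script.
run_all() {
  for s in 1_clear_old_ip.sh 2_gen_assign_hosts.sh 3_gen_assign_key.sh \
           4_gen_assign_hostfile.sh; do
    if [ ! -f "$s" ]; then
      echo "missing $s" >&2
      return 1
    fi
    echo "==> $s"
    sh "$s" || return 1
  done
  echo "cluster setup finished"
}

# usage: run_all
```

Note that scripts 2 and 3 still prompt for the cluster password, so the wrapper is convenient but not fully unattended.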
