MPI Distributed Programming -- 3. OpenMPI Multi-Node Run Errors

1. OpenMPI multi-node run error

Problem: node 1 (host3) uses mpirun to launch the MPI program on node 2 (host4), and the run fails with the following error.

$ mpirun -np 1 --host host4 ./main
 [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
 [[INVALID],INVALID]-[[59225,0],0] mca_oob_tcp_peer_try_connect: connect to 255.255.255.255:51754 failed: Network is unreachable (101)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------


Solution

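First confirm that node 1 and node 2 can each run OpenMPI programs on their own, as a single node. Then check whether the two nodes have the same OpenMPI version installed. If the versions differ, reinstall OpenMPI so that both nodes run the same version.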
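
The version check can be sketched as a small shell script. This is only an illustration under a few assumptions: passwordless ssh works to both hosts, mpirun is on the PATH of a non-interactive ssh session, and the first line of `mpirun --version` has the form "mpirun (Open MPI) X.Y.Z". The function names (ompi_version_of, versions_match, check_hosts) are hypothetical helpers, not part of OpenMPI.

```shell
#!/bin/sh
# Hypothetical helper: extract the version number from `mpirun --version`
# output, assuming the first line looks like "mpirun (Open MPI) 4.1.2".
ompi_version_of() {
    printf '%s\n' "$1" | head -n 1 | awk '{print $NF}'
}

# Hypothetical helper: succeed (exit 0) only when both version strings match.
versions_match() {
    [ "$1" = "$2" ]
}

# Hypothetical helper: query both hosts over ssh and report the result.
# Assumes passwordless ssh and mpirun on the remote non-interactive PATH.
check_hosts() {
    v_a=$(ompi_version_of "$(ssh "$1" mpirun --version)")
    v_b=$(ompi_version_of "$(ssh "$2" mpirun --version)")
    if versions_match "$v_a" "$v_b"; then
        echo "OK: both hosts run Open MPI $v_a"
    else
        echo "MISMATCH: $1 has $v_a, $2 has $v_b"
        return 1
    fi
}
```

With the host names from this post, the check would be run as `check_hosts host3 host4`. If ssh cannot find mpirun at all, that points to the PATH / LD_LIBRARY_PATH cause listed in the error message rather than a version mismatch.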



References

[1] OpenMPI error troubleshooting: https://www.slothparadise.com/fix-orte-error-unknown-option-hnp-topo-sig/
