A step-by-step guide to installing NCCL on Ubuntu

Multi-GPU training with the PaddlePaddle framework failed with the following error:

Traceback (most recent call last):
  File "train.py", line 210, in <module>
    do_train()
  File "train.py", line 91, in do_train
    paddle.distributed.init_parallel_env()
  File "/home/th/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/distributed/parallel.py", line 225, in init_parallel_env
    parallel_helper._init_parallel_ctx()
  File "/home/th/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:285)

INFO 2022-04-10 14:18:14,425 launch_utils.py:320] terminate process group gid:6221
INFO 2022-04-10 14:18:18,430 launch_utils.py:341] terminate all the procs
ERROR 2022-04-10 14:18:18,430 launch_utils.py:604] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-04-10 14:18:22,434 launch_utils.py:341] terminate all the procs
INFO 2022-04-10 14:18:22,435 launch.py:311] Local processes completed.

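My environment is Ubuntu 16.04, CUDA 10.2, and PaddlePaddle 2.2.2.

The cause is that Paddle's multi-GPU training requires NCCL, and libnccl.so cannot be found among the CUDA 10.2 libraries. Checking the /usr/local/cuda-10.2/lib64 directory confirms that there is no libnccl.so file there.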
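The same check can be repeated from a terminal. This is only a sketch: the helper name count_nccl_libs is made up for illustration, and the directory is the one from my setup, so adjust it to your own CUDA install.

```shell
# Hypothetical helper: count files named libnccl.so* directly inside a directory.
count_nccl_libs() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # A missing directory certainly contains no NCCL library.
        echo 0
        return
    fi
    find "$dir" -maxdepth 1 -name 'libnccl.so*' | wc -l | tr -d ' '
}

# Directory from this post; a result of 0 means libnccl.so is missing there.
count_nccl_libs /usr/local/cuda-10.2/lib64
```

A result of 0 matches the "cannot open shared object file" message in the traceback.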

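Solution: download and install NCCL

Official installation guide: Installation Guide :: NVIDIA Deep Learning NCCL Documentation

However, the official process requires registering an account to look up which NCCL version matches your CUDA version. The versions for CUDA 10.0, CUDA 10.1, and CUDA 10.2 are given below, so no registration is needed.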

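I. Steps to install NCCL on Ubuntu

The following are the official steps for Ubuntu. For other systems, see the official guide linked above.

Installing NCCL on Ubuntu requires first adding a repository that contains the NCCL packages to the APT system, then installing the NCCL packages through APT. Two repositories are available: a local repository and a network repository. The network repository is recommended, because it makes it easy to pick up upgrades when newer versions are released.

In the commands below, replace <architecture> with your CPU architecture (x86_64, ppc64le, or sbsa) and <distro> with your Ubuntu release, for example ubuntu1604, ubuntu1804, or ubuntu2004.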
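The two placeholders can also be derived from the running system instead of typed by hand. The sketch below is illustrative only: the helper names make_distro, map_arch, and cuda_repo_url are hypothetical, and mapping aarch64 to sbsa is an assumption that holds for Arm servers.

```shell
# Hypothetical helper: combine ID and VERSION_ID from /etc/os-release,
# e.g. "ubuntu" and "16.04" become "ubuntu1604".
make_distro() {
    echo "$1$(echo "$2" | tr -d '.')"
}

# Hypothetical helper: map a `uname -m` value to <architecture>.
# Assumption: aarch64 means an Arm server (sbsa); other values pass through.
map_arch() {
    case "$1" in
        aarch64) echo sbsa ;;
        *) echo "$1" ;;
    esac
}

# Hypothetical helper: build the repository URL from <distro> and <architecture>.
cuda_repo_url() {
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$1/$2/"
}

# Print the repository URL for the current machine, if os-release is available.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    cuda_repo_url "$(make_distro "${ID:-}" "${VERSION_ID:-}")" "$(map_arch "$(uname -m)")"
fi
```

On my Ubuntu 16.04 x86_64 machine the placeholders would be ubuntu1604 and x86_64.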

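1. Install the keys. Remember to replace <distro> and <architecture>.

(1) When installing from the network repository on Ubuntu 20.04/18.04:

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub

(2) When installing from the network repository on Ubuntu 16.04:

sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub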

2. Install the repository.

The network repository is used here:

sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/ /"

3. Update the APT database:

sudo apt update

4. Install libnccl2 with APT:

cuda10.2:

sudo apt install libnccl2=2.9.6-1+cuda10.2 libnccl-dev=2.9.6-1+cuda10.2

cuda10.1:

sudo apt install libnccl2=2.4.8-1+cuda10.1 libnccl-dev=2.4.8-1+cuda10.1

cuda10.0:

sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
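The three commands differ only in the version suffix, so the choice can be scripted. A minimal sketch, where nccl_pkg_version is a made-up helper that covers only the three CUDA versions listed above; running apt-cache madison libnccl2 shows which versions your configured repository actually offers.

```shell
# Hypothetical helper: map a CUDA version to the NCCL package version used in this post.
nccl_pkg_version() {
    case "$1" in
        10.2) echo "2.9.6-1+cuda10.2" ;;
        10.1) echo "2.4.8-1+cuda10.1" ;;
        10.0) echo "2.4.8-1+cuda10.0" ;;
        *)
            # Anything else is outside the table given in this post.
            echo "no NCCL version listed for CUDA $1" >&2
            return 1
            ;;
    esac
}

# Print the install command for CUDA 10.2 rather than running it, so it can be reviewed first.
ver=$(nccl_pkg_version 10.2)
echo "sudo apt install libnccl2=$ver libnccl-dev=$ver"
```

Pinning libnccl2 and libnccl-dev to the same version keeps the runtime library and the headers consistent with each other.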

II. Add NCCL to the environment variables

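First, find where NCCL is installed. Running whereis nccl in a terminal reports /usr/include/nccl.h on my machine, but that is the header file. LD_LIBRARY_PATH must list directories that contain shared libraries, so what matters is the directory holding libnccl.so, which ldconfig -p | grep libnccl shows. For the APT packages on x86_64 this is typically /usr/lib/x86_64-linux-gnu.

Then run vim ~/.bashrc in a terminal to open the file, and append the following at the end of it:

# Directory of the CUDA libraries (match your installed CUDA version; mine is 10.2)
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# Add the directory that contains libnccl.so to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu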

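After saving, run source ~/.bashrc in the terminal so that the configuration takes effect.

Then run echo $LD_LIBRARY_PATH to check that the environment variable was set correctly.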
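Beyond reading the output by eye, each entry can be checked mechanically, since the dynamic loader treats LD_LIBRARY_PATH entries as directories to search. The sketch below uses a hypothetical helper, check_lib_path, that reports for each colon-separated entry whether it is a directory.

```shell
# Hypothetical helper: print "ok" or "not-a-directory" for each colon-separated entry.
check_lib_path() {
    (
        # Split on colons only; the subshell keeps the IFS change local.
        IFS=:
        for entry in $1; do
            if [ -d "$entry" ]; then
                echo "ok $entry"
            else
                echo "not-a-directory $entry"
            fi
        done
    )
}

# Check the current value; an unset variable produces no output.
check_lib_path "${LD_LIBRARY_PATH:-}"
```

An entry that points at a file, such as a header, shows up as not-a-directory and does not help the loader find libnccl.so.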

That's it. Distributed multi-GPU training finally works.

III. References

Fixing the "RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so)..." error on Linux, 深度科研的博客 (CSDN blog)

Installation Guide :: NVIDIA Deep Learning NCCL Documentation
