Multi-GPU training with the PaddlePaddle framework fails with the following error:
Traceback (most recent call last):
File "train.py", line 210, in <module>
do_train()
File "train.py", line 91, in do_train
paddle.distributed.init_parallel_env()
File "/home/th/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/distributed/parallel.py", line 225, in init_parallel_env
parallel_helper._init_parallel_ctx()
File "/home/th/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
__parallel_ctx__clz__.init()
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:285)
INFO 2022-04-10 14:18:14,425 launch_utils.py:320] terminate process group gid:6221
INFO 2022-04-10 14:18:18,430 launch_utils.py:341] terminate all the procs
ERROR 2022-04-10 14:18:18,430 launch_utils.py:604] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-04-10 14:18:22,434 launch_utils.py:341] terminate all the procs
INFO 2022-04-10 14:18:22,435 launch.py:311] Local processes completed.
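My environment is Ubuntu 16.04, CUDA 10.2, and PaddlePaddle 2.2.2.
The error occurs because Paddle's multi-GPU training requires NCCL, and libnccl.so cannot be found among the CUDA 10.2 libraries. Looking in /usr/local/cuda-10.2/lib64 confirms that there is no libnccl.so file there.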
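The manual directory check described above can be scripted. The following is a minimal sketch, not part of the original procedure: find_nccl is a hypothetical helper name, and the directories in the example call are only common locations assumed for illustration, so adjust them to your machine.

```shell
# find_nccl DIR...: print every libnccl.so* file found directly inside the
# given directories. Returns 0 if at least one file was printed, 1 otherwise.
# (find_nccl is a hypothetical helper written for this post.)
find_nccl() {
    local dir file found=1
    for dir in "$@"; do
        [ -d "$dir" ] || continue          # skip paths that are not directories
        for file in "$dir"/libnccl.so*; do
            [ -e "$file" ] || continue     # an unmatched glob stays literal; skip it
            echo "$file"
            found=0
        done
    done
    return "$found"
}

# Example call. These directories are assumed common locations, not a complete list.
find_nccl /usr/local/cuda-10.2/lib64 /usr/lib/x86_64-linux-gnu /usr/local/lib \
    || echo "libnccl.so not found in the listed directories"
```

If nothing is printed except the final message, NCCL is most likely not installed at all, which matches the error Paddle reports.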
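Solution: download and install NCCL.
Official installation guide: Installation Guide :: NVIDIA Deep Learning NCCL Documentation
However, the installation process requires registering an account to look up the NCCL version number that matches your CUDA version. The versions for CUDA 10.0, CUDA 10.1, and CUDA 10.2 are listed below, so you do not need to register.
The official steps for Ubuntu follow; for other systems, see the official guide linked above.
Installing NCCL on Ubuntu requires first adding a repository that contains the NCCL packages to the APT system, and then installing the NCCL packages through APT. Two repositories are available: a local repository and a network repository. The latter is recommended, because it makes upgrades easy to retrieve when newer versions are released.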
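1. Install the keys. Note that you must replace the two empty path segments after repos/ in the URLs below with your distribution and architecture, as listed in the official guide.
(1) When installing from the network repository on Ubuntu 20.04/18.04:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos///7fa2af80.pub
(2) When installing from the network repository on Ubuntu 16.04:
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos///7fa2af80.pub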
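2. Install the repository.
The network repository is used here. Replace the two empty path segments after repos/ with your distribution and architecture.
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/// /"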
3. Update the APT database:
sudo apt update
4. Install libnccl2 with APT:
CUDA 10.2:
sudo apt install libnccl2=2.9.6-1+cuda10.2 libnccl-dev=2.9.6-1+cuda10.2
CUDA 10.1:
sudo apt install libnccl2=2.4.8-1+cuda10.1 libnccl-dev=2.4.8-1+cuda10.1
CUDA 10.0:
sudo apt install libnccl2=2.4.8-1+cuda10.0 libnccl-dev=2.4.8-1+cuda10.0
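The repository and install steps above can be parameterized so that the distribution, architecture, and CUDA version are set in one place. This is only a sketch: nccl_repo_url and nccl_packages are hypothetical helper names, and ubuntu1604, x86_64, and 10.2 are assumed example values for the author's setup, not values taken from the commands above. The script prints the commands instead of running them so they can be reviewed first.

```shell
# nccl_repo_url DISTRO ARCH: print the network repository URL with the two
# path segments after repos/ filled in. (Hypothetical helper.)
nccl_repo_url() {
    echo "https://developer.download.nvidia.com/compute/cuda/repos/$1/$2"
}

# nccl_packages CUDA_VERSION: print the pinned package arguments from step 4.
# Only the three CUDA versions listed in this post are known. (Hypothetical helper.)
nccl_packages() {
    local ver
    case "$1" in
        10.2) ver="2.9.6-1+cuda10.2" ;;
        10.1) ver="2.4.8-1+cuda10.1" ;;
        10.0) ver="2.4.8-1+cuda10.0" ;;
        *) echo "no pinned NCCL version listed for CUDA $1" >&2; return 1 ;;
    esac
    echo "libnccl2=$ver libnccl-dev=$ver"
}

# Assumed example values; replace them with the ones for your machine.
distro="ubuntu1604"
arch="x86_64"
cuda="10.2"

# Print the commands rather than executing them.
echo "sudo add-apt-repository \"deb $(nccl_repo_url "$distro" "$arch") /\""
echo "sudo apt update"
echo "sudo apt install $(nccl_packages "$cuda")"
```

Pinning libnccl2 and libnccl-dev to the same version string keeps the runtime library and the development files (which provide the unversioned libnccl.so name that Paddle asks for) in sync.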
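Next, make sure the dynamic loader can find the library. First, locate the directory that contains libnccl.so. Running whereis nccl in a terminal reports /usr/include/nccl.h on my machine, but that is the header file, not the shared library, so it must not be added to LD_LIBRARY_PATH. On x86_64 systems the APT packages typically place libnccl.so in /usr/lib/x86_64-linux-gnu; dpkg -L libnccl-dev lists the exact paths.
Then run vim ~/.bashrc in a terminal and add the following at the end of the file:
# Set the CUDA library directory (keeps any existing entries)
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# Add the directory that contains libnccl.so to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu
After saving, run source ~/.bashrc in a terminal so that the configuration takes effect.
Then run echo $LD_LIBRARY_PATH to check that the environment variable was set correctly.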
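Reading the output of echo $LD_LIBRARY_PATH by eye makes it easy to miss an entry that points at a file, such as a header, instead of a directory. The sketch below checks each entry; check_ld_path is a hypothetical helper name introduced here, and it only looks for a file named exactly libnccl.so, which is the name in the error message.

```shell
# check_ld_path PATHLIST: inspect a colon-separated library search path.
# Prints one line per entry and returns 0 only if some entry is a directory
# that contains libnccl.so. (Hypothetical helper written for this post.)
check_ld_path() {
    local entry found=1 old_ifs="$IFS"
    IFS=:
    set -- $1                # split the argument on colons
    IFS="$old_ifs"
    for entry in "$@"; do
        if [ -z "$entry" ]; then
            continue                              # ignore empty entries
        elif [ ! -d "$entry" ]; then
            echo "NOT A DIRECTORY: $entry"        # e.g. a header file path
        elif [ -e "$entry/libnccl.so" ]; then
            echo "HAS libnccl.so:  $entry"
            found=0
        else
            echo "no libnccl.so:   $entry"
        fi
    done
    return "$found"
}

check_ld_path "${LD_LIBRARY_PATH:-}" \
    || echo "libnccl.so is not reachable through LD_LIBRARY_PATH"
```

A negative result here does not necessarily mean loading will fail, because the loader also searches the system cache and default directories; ldconfig -p | grep libnccl shows what the cache knows about.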
That is all: distributed multi-GPU training finally works.
References:
Solving the "RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so)..." error on Linux, 深度科研's blog, CSDN
Installation Guide :: NVIDIA Deep Learning NCCL Documentation