Reference: https://github.com/NVIDIA/nccl

1. Install NCCL

(1) Clone the NCCL repository

yum -y install git

git clone https://github.com/NVIDIA/nccl.git

(2) Build

cd nccl

There are three options:

i. Build with the default settings

make -j src.build

ii. Set the CUDA path if CUDA is not in the default location (CUDA_HOME defaults to /usr/local/cuda)

make src.build CUDA_HOME=/path/to/cuda

iii. Build only for the GTX 1080 Ti architecture (compute capability 6.1), which gives a smaller binary

make -j src.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"
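The NVCC_GENCODE value follows from the GPU's compute capability (6.1 for the GTX 1080 Ti). As a minimal sketch, the hypothetical helper gencode_for below (not part of NCCL) builds the flag from a capability string, assuming a single target architecture:

```shell
# gencode_for is a hypothetical helper, not part of NCCL.
# It turns a compute capability such as "6.1" into an NVCC_GENCODE value.
gencode_for() {
    cc=$(printf '%s' "$1" | tr -d '.')   # "6.1" -> "61"
    printf '%s\n' "-gencode=arch=compute_${cc},code=sm_${cc}"
}

gencode_for 6.1
# Possible use: make -j src.build NVCC_GENCODE="$(gencode_for 6.1)"
```

For another card, pass its compute capability instead of 6.1.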

(3) Install (CentOS)

sudo yum install rpm-build rpmdevtools

make pkg.redhat.build

ls build/pkg/rpm/
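make pkg.redhat.build only produces the RPM packages; they still have to be installed. A sketch, assuming the packages end up somewhere under build/pkg/rpm/ (the exact subdirectory may differ between NCCL versions) and using a hypothetical helper list_rpms:

```shell
# list_rpms is a hypothetical helper: print every .rpm file under a
# directory (default: build/pkg/rpm), one path per line, sorted.
list_rpms() {
    find "${1:-build/pkg/rpm}" -type f -name '*.rpm' | sort
}

# Assumed install step (needs root), run from the nccl directory:
# sudo rpm -ivh $(list_rpms)
```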

2. NCCL tests

(1) Clone the nccl-tests repository and enter it

git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests

(2) Build

make NCCL_HOME=/root/nccl/build

General form: make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

First check that CUDA is usable (nvcc -V). If nvcc is not found, add the CUDA paths to ~/.bashrc:

vim ~/.bashrc

export LD_LIBRARY_PATH="/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH"

export PATH="/usr/local/cuda-9.0/bin:$PATH"

source ~/.bashrc

(3) Run a test

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
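In this command, -b and -e set the minimum and maximum message size, -f is the multiplication factor between sizes, and -g is the number of GPUs per thread. As a rough sketch of the sizes that get swept, assuming the M suffix means 1024*1024 bytes, the hypothetical helper sweep_sizes prints them:

```shell
# sweep_sizes is a hypothetical helper that mimics the size sweep:
# start at MIN bytes and multiply by FACTOR until MAX is exceeded.
# FACTOR must be greater than 1, otherwise the loop never ends.
sweep_sizes() {
    size=$1
    while [ "$size" -le "$2" ]; do
        echo "$size"
        size=$((size * $3))
    done
}

# -b 8 -e 128M -f 2, with 128M taken as 128*1024*1024 bytes
sweep_sizes 8 $((128 * 1024 * 1024)) 2
```

Under that assumption the sweep covers 25 sizes, from 8 bytes up to 134217728 bytes.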

Error 1

./build/all_reduce_perf: error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory

Fix (make sure nvcc works first, then copy the libraries)

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcublas.so.9.0 /usr/local/lib/libcublas.so.9.0 && sudo ldconfig

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig

Error 2

[root@localhost nccl-tests]# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

./build/all_reduce_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory

Fix

[root@localhost nccl-tests]# sudo cp /root/nccl/build/lib/libnccl.so.2 /usr/lib/libnccl.so.2 && sudo ldconfig
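Both errors come from the dynamic loader not finding a shared library. Instead of hitting them one at a time, the ldd output can be filtered for every unresolved library. A minimal sketch with a hypothetical helper missing_libs, assuming the usual "name => not found" ldd line format:

```shell
# missing_libs is a hypothetical helper: read ldd output on stdin and
# print the name of each library reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Intended use, from the nccl-tests directory:
# ldd ./build/all_reduce_perf | missing_libs
#
# An alternative to copying files is to extend the loader search path, e.g.
# export LD_LIBRARY_PATH="/root/nccl/build/lib:$LD_LIBRARY_PATH"
```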

3. Problems when running the code

python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 --b 4

python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 --b 4

(1) Turn off the firewall

systemctl status firewalld.service

systemctl stop firewalld.service

(2) Set the environment variable

export NCCL_SOCKET_IFNAME=em1

Network interface configuration file: /etc/sysconfig/network-scripts/ifcfg-em1
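NCCL_SOCKET_IFNAME has to name the interface that carries the address used in --dist-url; em1 is specific to this machine. A sketch for looking it up, using a hypothetical helper iface_for_ip that filters the one-line output of ip -o -4 addr show:

```shell
# iface_for_ip is a hypothetical helper: read `ip -o -4 addr show` output
# on stdin and print the interface that holds the given IPv4 address.
iface_for_ip() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Intended use on the master node:
# export NCCL_SOCKET_IFNAME="$(ip -o -4 addr show | iface_for_ip 10.141.221.203)"
```

On the worker node, the same lookup would use that node's own address.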

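Master node

Worker node

Error

On the master node, the job runs if --world-size is changed to 1, which suggests the problem is in the worker node connecting to the master node.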
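One way to narrow this down is to test from the worker node whether the master's rendezvous port is reachable at all. A minimal sketch, assuming bash (for its /dev/tcp feature) and the coreutils timeout command are available; can_connect is a hypothetical helper:

```shell
# can_connect is a hypothetical helper: succeed if a TCP connection to
# HOST PORT can be opened within 3 seconds, using bash's /dev/tcp.
can_connect() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Intended use on the worker node, while rank 0 is running on the master:
# can_connect 10.141.221.203 203 && echo reachable || echo unreachable
#
# NCCL can also log more detail about its own connections:
# export NCCL_DEBUG=INFO
```

If the port is unreachable, the firewall and the interface setting from the steps above are the first things to recheck.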
