DistributedDataParallel

Reference: https://github.com/NVIDIA/nccl

1. Install NCCL

(1) Clone the NCCL repository

yum -y  install git

git clone https://github.com/NVIDIA/nccl.git

(2) Build

cd nccl

There are three options:

i. Build directly

make -j src.build

ii. Set the CUDA path (CUDA_HOME defaults to /usr/local/cuda)

make src.build CUDA_HOME=

iii. Build only for the GTX 1080 Ti architecture (compute capability 6.1), which keeps the build smaller

make -j src.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"
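For a GPU other than the GTX 1080 Ti, the NVCC_GENCODE value follows the same pattern with a different compute capability. The helper below is only an illustrative sketch: the function name nvcc_gencode is hypothetical, and the compute capability of your card has to be looked up separately.

```python
def nvcc_gencode(compute_capability: str) -> str:
    """Build an NVCC_GENCODE value for a single GPU architecture.

    compute_capability is written as "major.minor", e.g. "6.1" for a GTX 1080 Ti.
    """
    major, minor = compute_capability.split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    flag = nvcc_gencode("6.1")
    # Print the make command used in option iii above.
    print(f'make -j src.build NVCC_GENCODE="{flag}"')
```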

(3) Install (CentOS)

sudo yum install rpm-build rpmdevtools

make pkg.redhat.build

ls build/pkg/rpm/

2. NCCL tests

(1) Clone the nccl-tests repository and enter the directory

git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests

(2)make

make NCCL_HOME=/root/nccl/build

(General form: make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl)

First check that CUDA is usable (nvcc -V). If nvcc is not found, add the CUDA paths to ~/.bashrc:

vim ~/.bashrc

export LD_LIBRARY_PATH="/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH"

export PATH="/usr/local/cuda-9.0/bin:$PATH"

source ~/.bashrc
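After sourcing ~/.bashrc, a quick way to confirm the two exports took effect is to inspect the environment from Python. This is a minimal sketch, not part of the original notes: the function name cuda_env_report is hypothetical, and the default path assumes the same /usr/local/cuda-9.0 install used above.

```python
import os
import shutil


def cuda_env_report(cuda_home: str = "/usr/local/cuda-9.0") -> dict:
    """Report whether nvcc is on PATH and cuda_home/lib64 is in LD_LIBRARY_PATH."""
    lib_dir = os.path.join(cuda_home, "lib64")
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    ld_paths = [p for p in raw.split(os.pathsep) if p]
    return {
        "nvcc": shutil.which("nvcc"),  # None when nvcc is not on PATH
        "lib64_in_ld_library_path": lib_dir in ld_paths,
    }


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```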

(3) Try it out

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

Error 1

./build/all_reduce_perf: error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory

Solution (get nvcc working first, then copy the files)

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcublas.so.9.0 /usr/local/lib/libcublas.so.9.0 && sudo ldconfig

[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig

Error 2

[root@localhost nccl-tests]# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

./build/all_reduce_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory 

Solution

[root@localhost nccl-tests]#    sudo cp /root/nccl/build/lib/libnccl.so.2 /usr/lib/libnccl.so.2 && sudo ldconfig
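Both errors come from the dynamic loader failing to find a shared library. One way to check whether the loader can now find each library, without rerunning the benchmark, is to try opening it with ctypes. This is a rough sketch; the helper name can_load is hypothetical, and the library names are the ones from the error messages above.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can find and open lib_name."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False


if __name__ == "__main__":
    names = ("libcudart.so.9.0", "libcublas.so.9.0",
             "libcurand.so.9.0", "libnccl.so.2")
    for name in names:
        print(name, "OK" if can_load(name) else "NOT FOUND")
```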


3. Problems when running the code

python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 --b 4

python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 --b 4
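The two commands differ only in --rank. Assuming main.py follows the PyTorch ImageNet example (the notes do not say so, but the flags match), --world-size and --rank count nodes when --multiprocessing-distributed is set, and the script derives a per-GPU process rank from them. The sketch below reproduces that arithmetic only; the function name global_ranks and the figure of 2 GPUs per node are assumptions for illustration.

```python
def global_ranks(node_rank: int, nodes: int, gpus_per_node: int):
    """Return (total process count, global ranks of the processes on this node).

    Assumes one process per GPU, with ranks assigned node by node.
    """
    world_size = nodes * gpus_per_node
    ranks = [node_rank * gpus_per_node + gpu for gpu in range(gpus_per_node)]
    return world_size, ranks


if __name__ == "__main__":
    # Two nodes, assumed 2 GPUs each.
    for node_rank in (0, 1):
        print(node_rank, global_ranks(node_rank, nodes=2, gpus_per_node=2))
```

Under these assumptions, training waits until every process on both nodes has joined, so a node that cannot reach the address in --dist-url blocks the whole job.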


(1) Turn off the firewall

systemctl status firewalld.service

systemctl stop firewalld.service


(2) Set the environment variable

export NCCL_SOCKET_IFNAME=em1

Network interface config file: /etc/sysconfig/network-scripts/ifcfg-em1
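The interface name em1 is specific to this machine; on another host the name passed to NCCL_SOCKET_IFNAME may differ. A small sketch for listing the interface names the system reports, using only the Python standard library (the function name interface_names is hypothetical):

```python
import socket


def interface_names() -> list:
    """Return the names of the network interfaces known to the OS (Unix only)."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    for name in interface_names():
        print(name)
```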



[Screenshots: master node, secondary node, error output]


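If world-size is changed to 1 on the master node, the job runs, which suggests the problem is in the secondary node's connection to the master.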
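To narrow this down, one option is to test from the secondary node whether a plain TCP connection to the master's address and port succeeds at all. This is a minimal sketch and not a diagnosis of the error above: the helper name can_connect is hypothetical, the address and port are the ones from --dist-url, and the check only makes sense while the rank 0 process is already running and listening on the master.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from --dist-url in the commands above.
    print(can_connect("10.141.221.203", 203))
```

If this prints False while the master is waiting, the firewall or routing between the two machines is a likely place to look before changing any NCCL settings.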
