[Deep Learning Frameworks: Paddle] Installing PaddlePaddle Smoothly and Using Multiple GPUs Seamlessly

Contents

  • My love-hate history with Paddle
  • PaddleCloud
  • Multi-GPU

My love-hate history with Paddle

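Paddle (PaddlePaddle) is a deep learning framework developed in China by Baidu. It underpins a series of open-source domain toolkits such as PaddleOCR and PaddleNLP, and has contributed a great deal to putting deep learning into practice in China.
Installing PaddlePaddle, however, has always been a headache for me: C++ errors in one place, multi-GPU not working in another, and a different set of errors on every Linux environment. All these obstacles gradually pushed me away from it. How can installing Paddle be as smooth as installing torch, working out of the box instead of sinking into one error after another? After a lot of trial and error, I gradually found a way.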

PaddleCloud

First, the link: https://hub.docker.com/r/paddlecloud/paddlenlp
One day, while reading the PaddleNLP documentation, I saw that PaddleCloud had published Paddle-based Docker images that work out of the box.

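PaddleCloud is mainly used to host the standard images of PaddleNLP, one of the PaddlePaddle model suites, so that users of the suite can easily deploy with Docker or in the cloud.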

I tried it right away and pulled the image onto a Linux server:

docker pull paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest

Next, create a container:

docker run -itd --name container_name -v /path:/path paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest /bin/bash
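For readers who script container creation, the command above can also be assembled as an argument list. This is only a sketch: the helper name docker_run_args is hypothetical (it is not part of Docker or Paddle), and container_name and /path are the same placeholders used in the command above.

```python
import shlex

# Image tag taken from the pull command in this post.
IMAGE = "paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest"


def docker_run_args(name, image, mounts):
    """Build the argv for `docker run` (hypothetical helper for illustration).

    mounts is a list of (host_path, container_path) pairs; each pair
    becomes one `-v host_path:container_path` bind mount.
    """
    # -i keeps STDIN open, -t allocates a pseudo-TTY, -d runs detached.
    args = ["docker", "run", "-itd", "--name", name]
    for host_path, container_path in mounts:
        args += ["-v", f"{host_path}:{container_path}"]
    # The image comes last, followed by the command to run inside it.
    args += [image, "/bin/bash"]
    return args


args = docker_run_args("container_name", IMAGE, [("/path", "/path")])
# Print the command for inspection; pass `args` to subprocess.run to execute it.
print(shlex.join(args))
```

Keeping the mounts in a list makes it easy to add or remove a bind mount later without retyping the whole command.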

Enter the container:

docker exec -it container_name /bin/bash

Check that the PaddlePaddle framework works:

python
>>> import paddle
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0130 06:01:35.244894    23 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.2
W0130 06:01:35.276093    23 gpu_context.cc:306] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
W0130 06:01:44.027418    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0130 06:01:44.027439    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
W0130 06:01:44.027443    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 3
W0130 06:01:44.027446    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 4
W0130 06:01:44.027449    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 5
W0130 06:01:44.027452    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 6
W0130 06:01:44.027456    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 7
W0130 06:01:44.027458    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
W0130 06:01:44.027462    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
W0130 06:01:44.027464    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 3
W0130 06:01:44.027467    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 4
W0130 06:01:44.027469    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 5
W0130 06:01:44.027472    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 6
W0130 06:01:44.027477    23 parallel_executor.cc:642] Cannot enable P2P access from 1 to 7
W0130 06:01:44.027480    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
W0130 06:01:44.027523    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
W0130 06:01:44.027529    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 3
W0130 06:01:44.027530    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 4
W0130 06:01:44.027534    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 5
W0130 06:01:44.027536    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 6
W0130 06:01:44.027541    23 parallel_executor.cc:642] Cannot enable P2P access from 2 to 7
W0130 06:01:44.027544    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 0
W0130 06:01:44.027549    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 1
W0130 06:01:44.027554    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 2
W0130 06:01:44.027556    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 4
W0130 06:01:44.027559    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 5
W0130 06:01:44.027611    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 6
W0130 06:01:44.027614    23 parallel_executor.cc:642] Cannot enable P2P access from 3 to 7
W0130 06:01:44.027617    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 0
W0130 06:01:44.027621    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 1
W0130 06:01:44.027624    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 2
W0130 06:01:44.027627    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 3
W0130 06:01:44.027629    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 5
W0130 06:01:44.027632    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 6
W0130 06:01:44.027635    23 parallel_executor.cc:642] Cannot enable P2P access from 4 to 7
W0130 06:01:44.027638    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 0
W0130 06:01:44.027640    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 1
W0130 06:01:44.027643    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 2
W0130 06:01:44.027647    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 3
W0130 06:01:44.027649    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 4
W0130 06:01:44.027652    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 6
W0130 06:01:44.027655    23 parallel_executor.cc:642] Cannot enable P2P access from 5 to 7
W0130 06:01:44.027696    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 0
W0130 06:01:44.027699    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 1
W0130 06:01:44.027704    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 2
W0130 06:01:44.027707    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 3
W0130 06:01:44.027712    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 4
W0130 06:01:44.027717    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 5
W0130 06:01:44.027720    23 parallel_executor.cc:642] Cannot enable P2P access from 6 to 7
W0130 06:01:44.027724    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 0
W0130 06:01:44.027727    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 1
W0130 06:01:44.027730    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 2
W0130 06:01:44.027736    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 3
W0130 06:01:44.027740    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 4
W0130 06:01:44.027752    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 5
W0130 06:01:44.027757    23 parallel_executor.cc:642] Cannot enable P2P access from 7 to 6
WARNING:root:PaddlePaddle meets some problem with 8 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests 
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
WARNING:root:
 Original Error is: (External) NCCL error(2), unhandled system error. 
  [Hint: 'ncclSystemError'. A call to the system failed.] (at /paddle/paddle/fluid/platform/device/gpu/nccl_helper.h:155)

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.

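The output above shows that the installation succeeded, but only for a single GPU. Multi-GPU did not work, so I decided to make do with one GPU for the time being.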

Multi-GPU

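Deep learning training these days usually starts at two GPUs, so PaddlePaddle's inability to use multiple GPUs kept bothering me. After some searching, I found that the problem was with NCCL. I went through quite a few references on how to fix it and eventually found the cause.
Link to the fix: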
https://github.com/pytorch/pytorch/issues/73775
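So, delete the container created earlier and create it again, this time also mounting the host's /dev/shm into the container: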

docker run -itd --name container_name -v /path:/path  -v /dev/shm/:/dev/shm paddlecloud/paddlenlp:develop-gpu-cuda11.2-cudnn8-latest /bin/bash
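The post does not say why mounting /dev/shm helps. One plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that the container's default shared memory was too small for NCCL: Docker gives a container a 64 MB /dev/shm unless told otherwise, and NCCL uses shared memory when processes on the same host communicate. The sketch below, meant to be run inside the container, reports the size of /dev/shm. The function names are made up for illustration.

```python
import os
import shutil

# Docker's default size for a container's /dev/shm when nothing overrides it.
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


def is_docker_default_or_smaller(total_bytes):
    """True if the size is no larger than Docker's 64 MB default."""
    return total_bytes <= DOCKER_DEFAULT_SHM_BYTES


# Only inspect /dev/shm where it exists (it may be absent on non-Linux hosts).
if os.path.isdir("/dev/shm"):
    total = shm_total_bytes()
    print(f"/dev/shm total: {total / 2**20:.0f} MiB")
    if is_docker_default_or_smaller(total):
        print("Still at or below Docker's default; the host mount may be missing.")
```

Docker's --shm-size and --ipc=host options are other ways to give a container more shared memory. This post uses the bind mount shown above.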

After entering the container, check whether Paddle works:

>>> paddle.utils.run_check()
Running verify PaddlePaddle program ... 
W0130 06:10:52.232132    22 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.2
W0130 06:10:52.234642    22 gpu_context.cc:306] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
W0130 06:10:54.919947    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0130 06:10:54.919976    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
W0130 06:10:54.919981    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 3
W0130 06:10:54.919983    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 4
W0130 06:10:54.919986    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 5
W0130 06:10:54.919989    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 6
W0130 06:10:54.919992    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 7
W0130 06:10:54.919996    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
W0130 06:10:54.919998    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
W0130 06:10:54.920001    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 3
W0130 06:10:54.920003    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 4
W0130 06:10:54.920009    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 5
W0130 06:10:54.920012    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 6
W0130 06:10:54.920019    22 parallel_executor.cc:642] Cannot enable P2P access from 1 to 7
W0130 06:10:54.920022    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
W0130 06:10:54.920027    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
W0130 06:10:54.920029    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 3
W0130 06:10:54.920037    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 4
W0130 06:10:54.920039    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 5
W0130 06:10:54.920044    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 6
W0130 06:10:54.920084    22 parallel_executor.cc:642] Cannot enable P2P access from 2 to 7
W0130 06:10:54.920087    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 0
W0130 06:10:54.920092    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 1
W0130 06:10:54.920095    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 2
W0130 06:10:54.920099    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 4
W0130 06:10:54.920101    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 5
W0130 06:10:54.920104    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 6
W0130 06:10:54.920106    22 parallel_executor.cc:642] Cannot enable P2P access from 3 to 7
W0130 06:10:54.920110    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 0
W0130 06:10:54.920117    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 1
W0130 06:10:54.920123    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 2
W0130 06:10:54.920127    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 3
W0130 06:10:54.920132    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 5
W0130 06:10:54.920135    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 6
W0130 06:10:54.920140    22 parallel_executor.cc:642] Cannot enable P2P access from 4 to 7
W0130 06:10:54.920146    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 0
W0130 06:10:54.920152    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 1
W0130 06:10:54.920157    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 2
W0130 06:10:54.920164    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 3
W0130 06:10:54.920169    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 4
W0130 06:10:54.920176    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 6
W0130 06:10:54.920181    22 parallel_executor.cc:642] Cannot enable P2P access from 5 to 7
W0130 06:10:54.920184    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 0
W0130 06:10:54.920190    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 1
W0130 06:10:54.920194    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 2
W0130 06:10:54.920200    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 3
W0130 06:10:54.920207    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 4
W0130 06:10:54.920212    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 5
W0130 06:10:54.920217    22 parallel_executor.cc:642] Cannot enable P2P access from 6 to 7
W0130 06:10:54.920221    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 0
W0130 06:10:54.920228    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 1
W0130 06:10:54.920233    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 2
W0130 06:10:54.920238    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 3
W0130 06:10:54.920243    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 4
W0130 06:10:54.920254    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 5
W0130 06:10:54.920261    22 parallel_executor.cc:642] Cannot enable P2P access from 7 to 6
W0130 06:11:12.578923    22 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Seeing "PaddlePaddle is installed successfully!" means Paddle is fully installed, multi-GPU included, with no remaining problems.
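Comparing the two run_check() outputs by eye is tedious, so here is a small sketch that condenses such a log into a few fields. It only matches the message strings that appear in the logs in this post. The function name and the returned keys are invented for illustration, and the two sample logs are shortened excerpts of the outputs above.

```python
import re


def summarize_run_check(log_text):
    """Condense paddle.utils.run_check() output into a few fields (illustrative)."""
    # "PaddlePaddle works well on 1 GPU." / "... on 8 GPUs."
    counts = [int(n) for n in
              re.findall(r"PaddlePaddle works well on (\d+) GPUs?\.", log_text)]
    return {
        "max_gpus_ok": max(counts, default=0),
        "p2p_warnings": len(
            re.findall(r"Cannot enable P2P access from \d+ to \d+", log_text)),
        "nccl_error": "NCCL error" in log_text,
        "single_gpu_only": "ONLY for single GPU" in log_text,
    }


# Shortened excerpt of the first (failing) run.
failed_log = """PaddlePaddle works well on 1 GPU.
W0130 06:01:44.027418    23 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
 Original Error is: (External) NCCL error(2), unhandled system error.
PaddlePaddle is installed successfully ONLY for single GPU!"""

# Shortened excerpt of the second (successful) run.
ok_log = """PaddlePaddle works well on 1 GPU.
W0130 06:10:54.919947    22 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully!"""

print(summarize_run_check(failed_log))
print(summarize_run_check(ok_log))
```

In both full logs above, the "Cannot enable P2P access" warning appears 56 times (8 × 7 ordered GPU pairs), in the failing run and in the successful one. Those warnings therefore did not decide the outcome here. The NCCL error did.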

Along my journey with Paddle, I found this fairly convenient way to install it, and I am sharing it here.
