Setting up an nnUNet training environment with PyTorch, CUDA 11, cuDNN 8.0.2, and Docker

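(Updated 2020-10-30) PyTorch 1.7 is now available from the official PyTorch site and supports CUDA 11 and cuDNN 8. This article remains useful as a Docker tutorial.

The goal of this article is to speed up training. With PyTorch 1.6 and CUDA 10.2, one epoch takes close to 400 seconds. In the Docker environment with CUDA 11 and cuDNN 8.0.2, one epoch takes only about 160 seconds, roughly a 2.5x speedup.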

The question originated in this issue: https://github.com/MIC-DKFZ/nnUNet/issues/292

Main project repository: https://github.com/MIC-DKFZ/nnUNet

Test environment:

Software:

  • Linux: Ubuntu 16/18/20, or another distribution

Hardware requirements:

  • GPU: RTX 2080 Ti or better, with driver version 450.51.06 or newer
  • CPU: Intel i5-9600K or a faster processor
  • Disk: 500 GB SSD or larger
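The driver requirement above can be checked with a short script. This is a minimal sketch rather than part of the original setup: the helper name version_ge is hypothetical, it relies on sort -V from GNU coreutils, and the 450.51.06 threshold is taken from the list above.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to B.
# Hypothetical helper; relies on `sort -V` (GNU coreutils version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="450.51.06"

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "driver $installed is new enough"
    else
        echo "driver $installed is older than $required"
    fi
fi
```

On a machine without nvidia-smi the script prints nothing, which is itself a hint that the driver is not installed yet.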

References

  • Docker beginner's tutorial
  • Installation Guide
  • nnU-Net : The no-new-UNet for automatic segmentation

Installing on Ubuntu and Debian

The following steps can be used to set up the NVIDIA Container Toolkit on Ubuntu LTS (16.04, 18.04, 20.04) and Debian (Stretch, Buster) distributions.

Docker-CE on Ubuntu can be set up using Docker’s official convenience script:

curl https://get.docker.com | sh

sudo systemctl start docker && sudo systemctl enable docker

Set up the stable repository and the GPG key:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
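To see what the distribution variable expands to before adding the repository, the same lookup can be wrapped in a small function. This is only a sketch: repo_list_url is a hypothetical helper name, and it assumes the os-release file defines ID and VERSION_ID, as Ubuntu and Debian do.

```shell
#!/bin/sh
# repo_list_url FILE: print the nvidia-docker list URL for the distribution
# described by an os-release style FILE (hypothetical helper).
repo_list_url() {
    # Source the file in a subshell so ID and VERSION_ID do not leak out.
    (
        . "$1"
        echo "https://nvidia.github.io/nvidia-docker/${ID}${VERSION_ID}/nvidia-docker.list"
    )
}

# On a real system, inspect the URL that the curl command above would fetch.
if [ -r /etc/os-release ]; then
    repo_list_url /etc/os-release
fi
```

If the printed URL does not name one of the supported distributions listed above, the curl command is unlikely to return a usable list file.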

Note: To get access to experimental features such as CUDA on WSL or the new MIG capability on A100, you may want to add the experimental branch to the repository listing:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

sudo apt-get update

sudo apt-get install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Adding the NVIDIA Runtime

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html

Use dockerd to add the nvidia runtime:

$ sudo dockerd --add-runtime=nvidia=/usr/bin/nvidia-container-runtime [...]

Reload the Docker settings for the systemd daemon.

$ sudo systemctl daemon-reload

Restart the docker service.

$ sudo systemctl restart docker

After that, start a GPU-enabled CUDA container using --gpus:

$ docker run --rm --gpus all nvidia/cuda nvidia-smi

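Changing Docker's default storage path

By default, Docker stores its images under /var/lib/docker, which sits on the system partition. That partition is usually not very large, so if you use Docker for development over a long period, move the default path to a larger location or to external storage.

1. Docker's default image path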

$ docker info  
 Docker Root Dir: /var/lib/docker 

2. To solve this, move Docker's default storage path out of /var/lib/docker.

Method:

$ mkdir /home/gy501/docker
$ cd  /home/gy501/docker

3. Edit Docker's systemd unit file, docker.service.

If you do not know where the unit file lives, a systemctl command will show its path:

$ systemctl enable docker

Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.

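4. Edit docker.service. Use gedit in a graphical session or vim on the command line.

$ sudo gedit /usr/lib/systemd/system/docker.service

In the editor, change the path that follows --graph (newer Docker releases deprecate --graph in favor of --data-root):

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --graph=/home/gy501/docker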

5. Reload the systemd unit files and restart the Docker service so the change takes effect:

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker

6. Check the docker info output again

$ docker info  
Docker Root Dir: /home/gy501/docker
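An alternative to editing docker.service is the data-root key in /etc/docker/daemon.json. The sketch below only shows the shape of such a file and writes it to a scratch location. The helper name write_data_root_config is hypothetical, and it overwrites its target, so an existing daemon.json that already holds other settings should be merged by hand instead. The storage path should be set in only one place, either as a dockerd flag or in daemon.json.

```shell
#!/bin/sh
# write_data_root_config DIR FILE: write a minimal daemon.json that sets
# Docker's storage directory to DIR (hypothetical helper).
# It overwrites FILE, so merge by hand if FILE already has other keys.
write_data_root_config() {
    printf '{\n  "data-root": "%s"\n}\n' "$1" > "$2"
}

# Dry run into a scratch file so nothing under /etc is touched.
scratch="$(mktemp)"
write_data_root_config /home/gy501/docker "$scratch"
cat "$scratch"
rm -f "$scratch"
```

After placing the real file, the same daemon-reload and restart commands apply, and docker info should report the new Docker Root Dir.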

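Pulling the image

1. Pull the image directly with the command below. Follow the link to find the latest version.

Image page: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

The registry is hosted outside China, so the pull is slow. The image is around 4 GB (docker images reports 11.1 GB on disk).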

$ docker pull nvcr.io/nvidia/pytorch:20.09-py3

After the pull completes, list the local images with:

$ docker images
(base) gy501@gy501-ms-7b51f:~$ docker images
REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
nvcr.io/nvidia/pytorch   20.09-py3           86042df4bd3c        5 weeks ago         11.1GB
nvidia/cuda              11.0-base           2ec708416bb8        8 weeks ago         122MB 

Starting the container

On a local machine, run:

$ sudo docker run --gpus all -it --rm --ipc=host -v /media/gy501/SSD/nnunet:/workspace/nnunet nvcr.io/nvidia/pytorch:20.09-py3

--ipc=host prevents Docker from reporting insufficient shared memory.
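Docker also accepts --shm-size to set the shared-memory segment size explicitly instead of sharing the host IPC namespace. The sketch below prints the command line for either choice without starting anything. The helper name nnunet_run_cmd and the 8g value are illustrative assumptions, not settings from the original setup.

```shell
#!/bin/sh
# nnunet_run_cmd HOST_DIR [SHM_SIZE]: print a docker run command line.
# With SHM_SIZE (for example 8g) it uses --shm-size, otherwise --ipc=host.
# Hypothetical helper; it only prints the command and does not start Docker.
nnunet_run_cmd() {
    host_dir="$1"
    if [ -n "${2:-}" ]; then
        shm_opt="--shm-size=$2"
    else
        shm_opt="--ipc=host"
    fi
    echo "docker run --gpus all -it --rm $shm_opt -v $host_dir:/workspace/nnunet nvcr.io/nvidia/pytorch:20.09-py3"
}

nnunet_run_cmd /media/gy501/SSD/nnunet
nnunet_run_cmd /media/gy501/SSD/nnunet 8g
```

Printing the command first makes it easy to review the mount path and memory option before running it with sudo.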

For details, see the official documentation below:

  1. Run the container image.

    If you have Docker 19.03 or later, a typical command to launch the container is:

    $ docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/pytorch:xx.xx-py3
    

    Where:

    • -it means run in interactive mode

    • --rm will delete the container when finished

    • -v is the mounting directory

    • local_dir is the directory or file from your host system (absolute path) that you want to access from inside your container. For example, the local_dir in the following path is /home/jsmith/data/mnist.

      -v /home/jsmith/data/mnist:/data/mnist
      

      If you are inside the container, for example, ls /data/mnist, you will see the same files as if you issued the ls /home/jsmith/data/mnist command from outside the container.

    • container_dir is the target directory when you are inside your container. For example, /data/mnist is the target directory in the example:

      -v /home/jsmith/data/mnist:/data/mnist
      
    • xx.xx is the container version. For example, 20.01.

    • command is the command you want to run in the image.

    Note: DIGITS uses shared memory to share data between processes. For example, if you use Torch multiprocessing for multi-threaded data loaders, the default shared memory segment size that the container runs with may not be enough. Therefore, you should increase the shared memory size by issuing either:

    --ipc=host
    

    or

    --shm-size=
    

    in the command line to:

    docker run --gpus all
    

See /workspace/README.md inside the container for information on customizing your PyTorch image.

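nnUNet configuration

1. Set up the paths. Create the data-processing directories under /media/gy501/SSD/nnunet and put the nnUNet source code in the same directory. nnUNet-1.6.4 is the source folder, which is available from the official repository.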

(base) gy501@gy501-ms-7b51f:/media/gy501/SSD/nnunet$ ls
  nnUNet-1.6.4     nnUNet_raw     config.sh   nnUNet_preprocessed   nnUNet_trained_models

2. Create a config.sh file in that directory to configure the environment inside the container, with the following content:

#!/bin/bash
# Set the working directories
export nnUNet_raw_data_base="/workspace/nnunet/nnUNet_raw"
export nnUNet_preprocessed="/workspace/nnunet/nnUNet_preprocessed"
export RESULTS_FOLDER="/workspace/nnunet/nnUNet_trained_models"
echo $RESULTS_FOLDER
# Set the pip index; the Aliyun mirror is used here
pip config set global.index-url http://mirrors.aliyun.com/pypi/simple/
pip config set install.trusted-host mirrors.aliyun.com

# Adjust the path after -e to match the name and location of the source folder
pip install -e ./nnUNet-1.6.4
echo "done!"
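Before running any nnUNet command, it can help to confirm that the three variables exported by config.sh are set and point at real directories. The following is a minimal sketch. check_nnunet_env is a hypothetical helper and not part of nnUNet, and it only checks that each directory exists.

```shell
#!/bin/sh
# check_nnunet_env: report whether the three nnU-Net variables are set and
# point at existing directories (hypothetical helper, not part of nnU-Net).
check_nnunet_env() {
    status=0
    for name in nnUNet_raw_data_base nnUNet_preprocessed RESULTS_FOLDER; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        else
            echo "$name=$value"
        fi
    done
    return "$status"
}

check_nnunet_env || echo "fix the paths before running nnU-Net"
```

A path that exists on the host but not inside the container usually means the variable does not include the mount point chosen with -v.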

3. Start the container

$ sudo docker run --gpus all -it --rm --ipc=host -v /media/gy501/SSD/nnunet:/workspace/nnunet nvcr.io/nvidia/pytorch:20.09-py3

Inside the container, /workspace/nnunet has exactly the same contents as /media/gy501/SSD/nnunet.

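4. Load the configuration. After the container starts, the working directory is /workspace. Use pwd to show the current path and ls to list its contents. Source config.sh rather than running it with sh, so that the exported variables stay set in the current shell:

$ pwd
$ cd nnunet
$ source config.sh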
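Variables exported by a script reach later commands only when the script is sourced into the current shell. A script started with sh runs in a child process, and its exports disappear when that process exits. The throwaway demonstration below illustrates the difference. The variable name DEMO_RESULTS_FOLDER and the temporary script are made up for this example.

```shell
#!/bin/sh
# Demonstration with a throwaway script: exports made by a child shell
# vanish when it exits, while a sourced script changes the current shell.
demo="$(mktemp)"
echo 'export DEMO_RESULTS_FOLDER=/workspace/demo' > "$demo"

# Run as a child process: the export does not reach this shell.
sh "$demo"
after_sh="${DEMO_RESULTS_FOLDER:-unset}"

# Source into the current shell: the export is now visible here.
. "$demo"
after_source="${DEMO_RESULTS_FOLDER:-unset}"

echo "after sh:     $after_sh"
echo "after source: $after_source"
rm -f "$demo"
```

The same reasoning applies to config.sh: the nnUNet commands read their paths from the environment of the shell that launches them.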

5. Preprocessing, training, and inference commands (not covered in detail here):

# Preprocessing
nnUNet_plan_and_preprocess -t 66
# Training
nnUNet_train 3d_fullres nnUNetTrainerV2 66 4

# Inference
nnUNet_predict -i /workspace/nnunet/nnUNet_raw/nnUNet_raw_data/Task066_hepaticduct/imagesTs/ -o /workspace/nnunet/nnUNet_raw/nnUNet_raw_data/Task066_hepaticduct/inferTs -t 66 -m 3d_fullres -f 4
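The numeric task ID passed to the commands above and the Task066_hepaticduct folder in the inference path are linked by nnUNet's naming convention for this version: the folder name is the ID zero-padded to three digits, followed by the task name. A small sketch of that mapping follows. task_dir is a hypothetical helper, not an nnUNet command.

```shell
#!/bin/sh
# task_dir ID NAME: print the nnU-Net v1 style task folder name, which
# pads the numeric ID to three digits (hypothetical helper).
# Pass the ID without leading zeros; printf would read 066 as octal.
task_dir() {
    printf 'Task%03d_%s\n' "$1" "$2"
}

task_dir 66 hepaticduct   # prints Task066_hepaticduct
```

This makes it easy to double-check that the ID given on the command line matches the folder that holds the raw data.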

done!
