Building GPU-Enabled Containers with Nvidia-Docker: Python3 + Tensorflow-GPU + Opencv + Dlib

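An earlier article, "Docker Images: Centos7 + Python3.6 + Tensorflow + Opencv + Dlib", built a CPU-based Docker image containing a common image-processing development environment. GPUs have developed rapidly, however, and their high internal memory bandwidth and compute throughput let them take over part of the CPU's workload, so GPU-based development environments are now common as well. This article builds a container that can use GPUs, with centos7 + cuda 9.0 + cudnn 7.0.5 + Python 3.6 + Tensorflow-GPU 1.5.0 + Opencv-Python + Dlib, and records the problems encountered along the way.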

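Building GPU-Enabled Containers with Nvidia-Docker: cuda9.0 + cudnn7.0.5 + Tensorflow-GPU + Opencv-Python + Dlib

    • Choosing the base image
    • Base image and maintainer information
    • Changing the time zone and installing Chinese language support
    • Installing cudnn7.0.5
    • Installing python3.6
    • Installing tensorflow-gpu
    • Installing opencv-python
    • Installing dlib
    • Installing other python dependencies
    • Other settings
    • The complete dockerfile
    • Building and testing the image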

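Choosing the base image

Before building a GPU-enabled container with Nvidia-Docker, first work out which versions of the components are compatible with each other. Here that mainly means the version pairing between Tensorflow-GPU, CUDA, and cuDNN.

For example, this article uses Tensorflow-GPU 1.5.0, which requires CUDA 9.0 and cuDNN 7.0.x. A mismatched version fails as follows:

  • Tensorflow-GPU 1.5.0 requires CUDA 9.0; with any other CUDA version, importing it fails with ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
  • It also requires cuDNN 7.0.x; otherwise running Tensorflow-GPU code fails with a cuDNN version mismatch error (I first installed cuDNN 7.3.1, then downgraded to cuDNN 7.0.5, after which the tests passed).

For other version pairings, see the "Tested build configurations" table on the Tensorflow website (available on both the Chinese and the English site).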
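
The pairing above can be written down as a small lookup so that a build script rejects a bad combination early. This is only a sketch: the table holds just the one combination tested in this article, and the names TESTED_CONFIGS and is_tested_combo are made up for illustration. Other rows would have to be copied from the Tested build configurations table.

```python
# Hypothetical lookup of tested version combinations. Only the entry used in
# this article is filled in; other rows must come from the official table.
TESTED_CONFIGS = {
    "1.5.0": {"cuda": "9.0", "cudnn_prefix": "7.0."},
}


def is_tested_combo(tf_version, cuda_version, cudnn_version):
    """Return True if the three versions match an entry in TESTED_CONFIGS."""
    config = TESTED_CONFIGS.get(tf_version)
    if config is None:
        return False  # unknown Tensorflow-GPU version: not verified here
    return (cuda_version == config["cuda"]
            and cudnn_version.startswith(config["cudnn_prefix"]))
```

With this table, is_tested_combo("1.5.0", "9.0", "7.3.1") returns False, which matches the cuDNN 7.3.1 failure described above.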

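This article therefore uses nvidia/cuda:9.0-devel-centos7 as the base image. It does not include cuDNN, so the matching cuDNN version has to be installed separately. The official nvidia/cuda repository also provides a base image that includes cuDNN, nvidia/cuda:9.0-cudnn7-devel-centos7 , but at the time of writing it shipped cuDNN 7.3.1.

Base image and maintainer information

The base image is nvidia/cuda:9.0-devel-centos7 and the maintainer is ELN; an e-mail address can be added as well.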

FROM nvidia/cuda:9.0-devel-centos7
MAINTAINER ELN

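Changing the time zone and installing Chinese language support

After installing python3.6 directly on the base image, printing Chinese characters in python3.6 raised the error below. Python 3.6's default encoding turned out to be utf-8; the actual cause was that the base image itself garbled Chinese text.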

[root@5c6c19af53c2 /]# python3.6
Python 3.6.5 (default, Apr 10 2018, 17:08:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"������")  # print(u"中文")
  File "", line 0

    ^
SyntaxError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
>>> import sys
>>> print(sys.getdefaultencoding())
utf-8
>>> exit()
[root@5c6c19af53c2 /]# echo "������"
中文

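Listing /etc/localtime shows lrwxrwxrwx. 1 root root 25 Oct 6 19:15 localtime -> ../usr/share/zoneinfo/UTC , so the time zone needs to be changed, and Chinese language support needs to be installed and configured.

In the dockerfile, change the time zone, install Chinese language support, and configure the locale:

# Change the time zone, install Chinese language support, and set the locale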
RUN rm -rf /etc/localtime  && \
    ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime  && \
    yum -y install kde-l10n-Chinese  && \
    yum -y reinstall glibc-common  && \
    localedef -c -f UTF-8 -i zh_CN zh_CN.utf8  && \
    yum clean all  &&  rm -rf /var/cache/yum

ENV LC_ALL zh_CN.utf8
# Equivalent in a terminal: export LC_ALL=zh_CN.utf8
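
The SyntaxError shown earlier mentions the 'ascii' codec, which suggests the interpreter was falling back to ASCII for terminal input and output under the image's default locale. A minimal sketch for checking this from Python follows; the helper names can_encode and report_encodings are made up for illustration.

```python
import locale
import sys


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def report_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        "default": sys.getdefaultencoding(),  # utf-8 on Python 3
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
```

Calling report_encodings() before and after setting LC_ALL=zh_CN.utf8 should show whether the terminal-facing encodings changed, and can_encode("中文", "ascii") returning False shows why an ASCII locale cannot handle the text.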

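Installing cudnn7.0.5

Installing cudnn7.0.5 starts with downloading the package for that version. The package is large and the download is slow, so it can also be downloaded on the host first (268M) and copied into the docker container before installing.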

# Install cudnn7.0.5
RUN curl -fsSL https://developer.download.nvidia.com/compute/redist/cudnn/v7.0.5/cudnn-8.0-linux-x64-v7.tgz -O && \
    tar --no-same-owner -xzf cudnn-8.0-linux-x64-v7.tgz -C /usr/local  && \
    rm -rf cudnn-8.0-linux-x64-v7.tgz  && \
    ldconfig

Installing python3.6

In the dockerfile, install a clean python3.6 environment and a few basic python packages:

# Install Python 3.6
RUN yum -y install https://centos7.iuscommunity.org/ius-release.rpm && \
    yum -y install python36 && \
    yum -y install python36-pip && \
    yum -y install vim && \
    yum clean all  &&  rm -rf /var/cache/yum  && \
    # ln /usr/bin/python3.6 /usr/bin/python3  && \
    # ln /usr/bin/pip3.6 /usr/bin/pip3  && \
    mkdir ~/.pip/  && \
    echo -e "[global]\nindex-url = http://mirrors.aliyun.com/pypi/simple\n\n[install]\ntrusted-host=mirrors.aliyun.com" > ~/.pip/pip.conf

RUN pip3.6 --no-cache-dir install \
        Pillow \
        h5py \
        ipykernel \
        jupyter \
        matplotlib==2.1.1 \
        numpy==1.15.4 \
        pandas \
        scipy==1.1.0 \
        sklearn \
        && \
    python3.6 -m ipykernel.kernelspec

Installing tensorflow-gpu

A plain pip install is enough here:

# Install TensorFlow GPU version from central repo
RUN pip3.6 --no-cache-dir install tensorflow-gpu==1.5.0

Note: with numpy 1.17.0 or later, import tensorflow prints the FutureWarning messages shown below. To avoid them, pin numpy==1.15.4 (a version below 1.17.0).

  • FutureWarning: Deprecated numpy API calls in tf.python.framework.dtypes #30427
  • Fix numpy warning with numpy 1.17.0+ #30559

[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia 1e3fc1854e8c /bin/bash
[root@13472124aca1 test]# python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
>>> exit()
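
The pin can also be checked when the container starts. The sketch below compares a numpy version string against the 1.17.0 threshold mentioned above. The function name and the simple dotted-number parsing are assumptions made for illustration, and pre-release tags such as "1.17.0rc1" are not handled.

```python
def triggers_dtype_warning(numpy_version, threshold=(1, 17, 0)):
    """Return True if numpy_version is at or above the threshold."""
    parts = tuple(int(p) for p in numpy_version.split(".")[:3])
    # Pad short versions such as "1.17" to three components before comparing.
    parts = parts + (0,) * (3 - len(parts))
    return parts >= threshold
```

In the container this could be called as triggers_dtype_warning(numpy.__version__) before importing tensorflow.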

Check the CUDA and cuDNN versions:

[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia 1e3fc1854e8c /bin/bash

# The container has CUDA 9.0 installed, while the host has CUDA 10.1
[root@c4a7dc3728ff /]# nvidia-smi
Mon Jul 29 17:26:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:01:00.0  On |                  Off |
| 26%   29C    P8    11W / 250W |    472MiB / 24447MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

# Check the CUDA version
[root@c4a7dc3728ff /]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

[root@c4a7dc3728ff /]# cat /usr/local/cuda/version.txt
CUDA Version 9.0.176

# Check the cuDNN version
[root@c4a7dc3728ff /]# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
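
The same cuDNN check can be scripted by reading the three #define lines from cudnn.h. The sketch below assumes the header uses the layout shown in the grep output above; the function name parse_cudnn_version is made up for illustration.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patchlevel) from the text of cudnn.h."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7" at the start of a line.
        match = re.search(r"^#define\s+CUDNN_%s\s+(\d+)" % name,
                          header_text, re.MULTILINE)
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        fields[name] = int(match.group(1))
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"]
```

Reading /usr/local/cuda/include/cudnn.h and passing its contents to parse_cudnn_version should give (7, 0, 5) for the image built here.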

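Installing opencv-python

After working through the problems first encountered when installing opencv-python in a docker container, this section uses a simplified version of the install commands from the earlier article "Docker Images: Centos7 + Python3.6 + Tensorflow + Opencv + Dlib".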

# Install opencv-python
RUN yum -y install libSM.x86_64 \
        libXrender.x86_64 \
        libXext.x86_64  && \
    yum clean all  &&  rm -rf /var/cache/yum

RUN pip3.6 --no-cache-dir install opencv-python==3.4.1.15
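
To confirm that the three system libraries installed above are visible to the dynamic linker before importing cv2, something like the following sketch could be used. The helper name missing_libraries is made up for illustration, and the short names "SM", "Xrender", and "Xext" are assumed to correspond to the libSM, libXrender, and libXext packages.

```python
import ctypes.util


def missing_libraries(names, finder=ctypes.util.find_library):
    """Return the library names that the finder cannot locate."""
    return [name for name in names if finder(name) is None]
```

Inside the container, missing_libraries(["SM", "Xrender", "Xext"]) returning an empty list would indicate that all three were found.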

Installing dlib

Problems encountered while installing dlib, and how they were solved:

  • Installing dlib directly fails with CMake must be installed to build the following extensions: dlib
  • After installing cmake with yum -y install cmake , installing dlib fails with subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-build-g_ptsyo_/dlib/tools/python', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-build-g_ptsyo_/dlib/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE=/usr/bin/python3.6', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
  • After yum install -y python36u-devel.x86_64 , the same error persists
  • After yum -y groupinstall "Development tools" , dlib installs successfully

# Install dlib
RUN yum -y groupinstall "Development tools"  && \
    yum -y install cmake && \
    yum clean all  # &&  rm -rf /var/cache/yum

RUN yum install -y python36-devel.x86_64  && \
    yum clean all  # &&  rm -rf /var/cache/yum
# yum search python3 | grep devel

RUN pip3.6 --no-cache-dir install dlib
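
Since the failures above all came from missing build prerequisites, a quick pre-flight check can save a long failed build. The sketch below looks for build commands on PATH. The helper name missing_tools is made up, and the tool list in the usage note is an assumption about what the dlib build needs, not something the dlib documentation is quoted as requiring.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the commands from tools that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

For example, missing_tools(["cmake", "gcc", "g++", "make"]) could be run in the container before pip3.6 install dlib; an empty list means all four commands were found.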

Installing other python dependencies

# Install keras ...
RUN pip3.6 --no-cache-dir install Cython

RUN pip3.6 --no-cache-dir install \
        keras \
        flask \
        flask_cors \
        flask_socketio \
        scikit-image \
        mrcnn \
        imgaug \
        pycocotools

Other settings

RUN mkdir /test

WORKDIR /test

CMD ["/bin/bash"]

The complete dockerfile

This dockerfile simply follows the build steps in order; the image's layer structure can be adjusted as needed.

FROM nvidia/cuda:9.0-devel-centos7
MAINTAINER ELN

# Change the time zone, install Chinese language support, and set the locale
RUN rm -rf /etc/localtime  && \
    ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime  && \
    yum -y install kde-l10n-Chinese  && \
    yum -y reinstall glibc-common  && \
    localedef -c -f UTF-8 -i zh_CN zh_CN.utf8  && \
    yum clean all  &&  rm -rf /var/cache/yum

ENV LC_ALL zh_CN.utf8
# Equivalent in a terminal: export LC_ALL=zh_CN.utf8

# Install cudnn7.0.5
#RUN curl -fsSL https://developer.download.nvidia.com/compute/redist/cudnn/v7.0.5/cudnn-8.0-linux-x64-v7.tgz -O && \
#    tar --no-same-owner -xzf cudnn-8.0-linux-x64-v7.tgz -C /usr/local  && \
#    rm -rf cudnn-8.0-linux-x64-v7.tgz  && \
#    ldconfig

ADD cudnn-8.0-linux-x64-v7.tgz /usr/local/
RUN ldconfig

# Install Python 3.6
RUN yum -y install https://centos7.iuscommunity.org/ius-release.rpm && \
    yum -y install python36 && \
    yum -y install python36-pip && \
    yum -y install vim && \
    yum clean all  &&  rm -rf /var/cache/yum  && \
    # ln /usr/bin/python3.6 /usr/bin/python3  && \
    # ln /usr/bin/pip3.6 /usr/bin/pip3  && \
    mkdir ~/.pip/  && \
    echo -e "[global]\nindex-url = http://mirrors.aliyun.com/pypi/simple\n\n[install]\ntrusted-host=mirrors.aliyun.com" > ~/.pip/pip.conf

RUN pip3.6 --no-cache-dir install \
        Pillow \
        h5py \
        ipykernel \
        jupyter \
        matplotlib==2.1.1 \
        numpy==1.15.4 \
        pandas \
        scipy==1.1.0 \
        sklearn \
        && \
    python3.6 -m ipykernel.kernelspec

# Install TensorFlow GPU version from central repo
RUN pip3.6 --no-cache-dir install tensorflow-gpu==1.5.0

# Install opencv-python
RUN yum -y install libSM.x86_64 \
        libXrender.x86_64 \
        libXext.x86_64  && \
    yum clean all  &&  rm -rf /var/cache/yum

RUN pip3.6 --no-cache-dir install opencv-python==3.4.1.15

# Install dlib
RUN yum -y groupinstall "Development tools"  && \
    yum -y install cmake && \
    yum clean all  # &&  rm -rf /var/cache/yum

RUN yum install -y python36-devel.x86_64  && \
    yum clean all  # &&  rm -rf /var/cache/yum
# yum search python3 | grep devel

RUN pip3.6 --no-cache-dir install dlib

# Install keras ...
RUN pip3.6 --no-cache-dir install Cython

RUN pip3.6 --no-cache-dir install \
        keras \
        flask \
        flask_cors \
        flask_socketio \
        scikit-image \
        mrcnn \
        imgaug \
        pycocotools

RUN mkdir /test

WORKDIR /test

CMD ["/bin/bash"]

Building and testing the image

Write the content above into a dockerfile, then build the image and test it:

[eln@localhost docker]$ vim dockerfile
[eln@localhost docker]$ sudo docker build -t="test" .
[eln@localhost docker]$ sudo docker images
REPOSITORY                                     TAG                                  IMAGE ID            CREATED             SIZE
test                                           latest                               eb7674684afa        2 seconds ago       4.18GB
nvidia/cuda                                    9.0-devel-centos7                    ff358ea56625        3 months ago        1.9GB
[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia test

Add --runtime=nvidia to the docker run command. With multiple GPUs, use -e to choose which one the container sees, for example -e NVIDIA_VISIBLE_DEVICES=0
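
When containers are launched from a script, the same flags can be assembled programmatically. The sketch below builds the argument list only; the function name docker_run_command is made up for illustration, and it assumes GPU indices are passed to NVIDIA_VISIBLE_DEVICES as a comma-separated list.

```python
def docker_run_command(image, visible_devices=None):
    """Build the argument list for running image with the nvidia runtime."""
    command = ["docker", "run", "-it", "--rm", "--runtime=nvidia"]
    if visible_devices is not None:
        # Join GPU indices with commas, e.g. [0, 1] -> NVIDIA_VISIBLE_DEVICES=0,1
        devices = ",".join(str(d) for d in visible_devices)
        command += ["-e", "NVIDIA_VISIBLE_DEVICES=" + devices]
    command.append(image)
    return command
```

The returned list can be handed to subprocess.run, prefixed with "sudo" if the docker daemon requires it as in the transcript above.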

[root@7da5e0966487 test]# echo "中文"
中文
[root@7da5e0966487 test]# pip3 --version
pip 8.1.2 from /usr/lib/python3.6/site-packages (python 3.6)
[root@7da5e0966487 test]# python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> import cv2
>>> import dlib
>>> print("中文")
中文
>>> exit()

Test the GPU's compute capability and check that tensorflow-gpu is installed correctly:

Run the tensorflow-gpu test code in the container:

[root@7da5e0966487 test]# vim testgpu.py
[root@7da5e0966487 test]# cat testgpu.py 
# -*- coding: utf-8 -*-
"""
Test the GPU's compute capability and check that tensorflow-GPU is installed correctly
"""

import time

import numpy as np
import tensorflow as tf

# A 5000 x 1000 matrix of random values, stored in the graph as a constant
value = np.random.randn(5000, 1000)
a = tf.constant(value)

# Element-wise square; this is the op evaluated in the loop below
b = a * a

c = 0
tic = time.time()
with tf.Session() as sess:
    for i in range(1000):
        sess.run(b)

        c += 1
        if c % 100 == 0:
            # Percentage of the 1000 iterations completed so far
            d = c / 10
            print("Progress: %s%%" % d)

toc = time.time()
t_cost = toc - tic

print("Elapsed time: %s" % t_cost)

[root@7da5e0966487 test]# python3 testgpu.py 
2019-07-29 20:18:54.026579: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-07-29 20:18:54.290955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: Quadro P6000 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 23.87GiB freeMemory: 23.26GiB
2019-07-29 20:18:54.291030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P6000, pci bus id: 0000:01:00.0, compute capability: 6.1)
Progress: 10.0%
Progress: 20.0%
Progress: 30.0%
Progress: 40.0%
Progress: 50.0%
Progress: 60.0%
Progress: 70.0%
Progress: 80.0%
Progress: 90.0%
Progress: 100.0%
Elapsed time: 14.024679899215698
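
The timing script confirms the GPU through the log lines it prints. A more direct check is to ask Tensorflow which devices it sees. The sketch below is one way to do that under Tensorflow 1.x; the helper names are made up for illustration, and the Tensorflow import is kept inside the function so the filtering helper can be used without Tensorflow installed.

```python
def gpu_device_names(devices):
    """Return the names of the entries whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def tensorflow_gpu_names():
    """List the GPU devices visible to Tensorflow (needs tensorflow installed)."""
    # Imported lazily so gpu_device_names works without Tensorflow present.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

In a correctly configured container, tensorflow_gpu_names() should return a non-empty list; an empty list would suggest the container was started without --runtime=nvidia or that the CUDA and cuDNN versions do not match.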

While the test code is running, open two more terminals and check GPU usage from inside the container and on the host:

[eln@localhost docker]$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
7da5e0966487        test                "/bin/bash"         2 minutes ago       Up 2 minutes                            quirky_kapitsa
[eln@localhost docker]$ docker exec -it 7da5e0966487 /bin/bash
[root@7da5e0966487 test]# nvidia-smi
Mon Jul 29 20:19:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:01:00.0  On |                  Off |
| 26%   26C    P8    19W / 250W |  23260MiB / 24447MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
[root@7da5e0966487 test]# exit
exit
[eln@localhost docker]$ nvidia-smi
Mon Jul 29 20:19:05 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:01:00.0  On |                  Off |
| 26%   27C    P8    19W / 250W |  23260MiB / 24447MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1099      G   /usr/lib/xorg/Xorg                            36MiB |
|    0      1734      C   python3                                    22779MiB |
|    0      2569      G   fcitx-qimpanel                                36MiB |
|    0      3667      G   /usr/lib/xorg/Xorg                            39MiB |
|    0      4497      G   fcitx-qimpanel                                36MiB |
|    0      4521      G   unity-control-center                           4MiB |
|    0      5564      G   /usr/lib/xorg/Xorg                            51MiB |
|    0      6430      G   /usr/lib/xorg/Xorg                           107MiB |
+-----------------------------------------------------------------------------+

# Monitor GPU usage in real time
[eln@localhost docker]$ watch -n 0.1 -d nvidia-smi
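
For logging rather than watching, nvidia-smi can also emit machine-readable CSV through its --query-gpu and --format options. The sketch below parses one line of that output; the function names and dictionary keys are made up for illustration.

```python
import subprocess

# Ask for utilization and memory figures only, as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as "95, 23260, 24447" into a dict."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "memory_used_mib": used, "memory_total_mib": total}


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

Calling query_gpus() in a loop with a short sleep gives roughly the same information as the watch command, in a form that is easy to write to a log file.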
