The earlier document 《Docker Images: Centos7 + Python3.6 + Tensorflow + Opencv + Dlib》 built a Docker image with a CPU-based development environment for common image-processing work. With the rapid development of GPUs, which provide high internal memory bandwidth and compute throughput and can effectively take over part of the CPU's workload, GPU-based development environments are also frequently needed. This document describes how to build a container that can use GPUs, covering centos7 + cuda 9.0 + cudnn 7.0.5 + Python 3.6 + Tensorflow-GPU 1.5.0 + Opencv-Python + Dlib, and records the problems encountered during the build.
Before using Nvidia-Docker to build a GPU-enabled container, first confirm the version correspondence of the environment to be built, here mainly the correspondence between Tensorflow-GPU, CUDA and cuDNN.
For example, Tensorflow-GPU 1.5.0 is used here, which corresponds to CUDA 9.0 and cuDNN 7.0.x. If the versions do not match, TensorFlow fails at import time (typically with an error about a missing CUDA or cuDNN shared library).
The correspondence for other versions is listed on the Tensorflow Chinese site under “经过测试的构建配置” or on the Tensorflow English site under “Tested build configurations”.
Therefore, nvidia/cuda:9.0-devel-centos7 is chosen as the base image. This image does not include cuDNN, so the matching cuDNN version has to be installed manually. nvidia/cuda also provides base images that bundle cuDNN, such as nvidia/cuda:9.0-cudnn7-devel-centos7, but at the time of writing that image shipped cuDNN 7.3.1.
nvidia/cuda:9.0-devel-centos7 is used as the base image here, with ELN as the maintainer; an email address can also be added.
FROM nvidia/cuda:9.0-devel-centos7
MAINTAINER ELN
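As a small optional step, the base image can be pulled ahead of time so the first build does not stall on the download (a sketch, assuming access to Docker Hub):
# Pull the base image in advance and confirm it is available locally
sudo docker pull nvidia/cuda:9.0-devel-centos7
sudo docker images | grep nvidia/cuda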
After installing python3.6 directly on the base image, printing Chinese characters inside python3.6 raised the error below. python3.6's default encoding was confirmed to be utf-8; the real cause turned out to be that the base image's locale is not configured for Chinese, so Chinese characters come out garbled.
[root@5c6c19af53c2 /]# python3.6
Python 3.6.5 (default, Apr 10 2018, 17:08:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"������") # print(u"中文")
File "" , line 0
^
SyntaxError: 'ascii' codec can't decode byte 0xe4 in position 8: ordinal not in range(128)
>>> import sys
>>> print(sys.getdefaultencoding())
utf-8
>>> exit()
[root@5c6c19af53c2 /]# echo "������"
中文
Checking /etc/localtime gives lrwxrwxrwx. 1 root root 25 Oct 6 19:15 localtime -> ../usr/share/zoneinfo/UTC, so the timezone needs to be changed, and Chinese language support needs to be installed and configured so Chinese displays correctly.
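For reference, the timezone symlink and the resulting container time can be inspected inside a running container like this (a minimal sketch):
# Check the current timezone link and time inside the container
ls -l /etc/localtime    # points to ../usr/share/zoneinfo/UTC before the fix
date                    # shows UTC time until the timezone is changed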
In the dockerfile, change the timezone, install Chinese language support, and configure the locale to display Chinese:
# Change the timezone, install Chinese language support, configure the locale for Chinese
RUN rm -rf /etc/localtime && \
ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
yum -y install kde-l10n-Chinese && \
yum -y reinstall glibc-common && \
localedef -c -f UTF-8 -i zh_CN zh_CN.utf8 && \
yum clean all && rm -rf /var/cache/yum
ENV LC_ALL zh_CN.utf8
# Or run in a terminal: export LC_ALL=zh_CN.utf8
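After this layer, the locale setup can be verified inside the container with a couple of quick checks (a sketch; the expected values follow from the Dockerfile above):
echo $LC_ALL              # expected: zh_CN.utf8
locale -a | grep zh_CN    # zh_CN.utf8 should be listed
date                      # should now report Asia/Shanghai (CST) time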
Install cudnn 7.0.5. First download the matching installation package; it is fairly large (about 268 MB), so the download can be slow. Alternatively, download it on the host first and copy it into the docker image before installing. Note that the tarball used here, cudnn-8.0-linux-x64-v7.tgz, is the cuDNN 7.0.5 build published for CUDA 8.0; NVIDIA publishes cudnn-9.0-linux-x64-v7.tgz under the same v7.0.5 directory, which may be the more natural match for the CUDA 9.0 base image.
# Install cudnn7.0.5
RUN curl -fsSL https://developer.download.nvidia.com/compute/redist/cudnn/v7.0.5/cudnn-8.0-linux-x64-v7.tgz -O && \
tar --no-same-owner -xzf cudnn-8.0-linux-x64-v7.tgz -C /usr/local && \
rm -rf cudnn-8.0-linux-x64-v7.tgz && \
ldconfig
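If the in-build download is too slow, one option is to fetch the tarball on the host first and copy it into the image with ADD, as done in the full Dockerfile later (a host-side sketch, assuming wget is available on the host):
# On the host, next to the dockerfile
wget https://developer.download.nvidia.com/compute/redist/cudnn/v7.0.5/cudnn-8.0-linux-x64-v7.tgz
# Then, in the dockerfile:
# ADD cudnn-8.0-linux-x64-v7.tgz /usr/local/
# RUN ldconfig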
In the dockerfile, install a clean python3.6 environment along with some basic python packages:
# Install Python 3.6
RUN yum -y install https://centos7.iuscommunity.org/ius-release.rpm && \
yum -y install python36 && \
yum -y install python36-pip && \
yum -y install vim && \
yum clean all && rm -rf /var/cache/yum && \
# ln /usr/bin/python3.6 /usr/bin/python3 && \
# ln /usr/bin/pip3.6 /usr/bin/pip3 && \
mkdir ~/.pip/ && \
echo -e "[global]\nindex-url = http://mirrors.aliyun.com/pypi/simple\n\n[install]\ntrusted-host=mirrors.aliyun.com" > ~/.pip/pip.conf
RUN pip3.6 --no-cache-dir install \
Pillow \
h5py \
ipykernel \
jupyter \
matplotlib==2.1.1 \
numpy==1.15.4 \
pandas \
scipy==1.1.0 \
sklearn \
&& \
python3.6 -m ipykernel.kernelspec
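Once this layer is built, the Python environment can be sanity-checked with a few commands (a sketch):
python3.6 --version
pip3.6 --version
cat ~/.pip/pip.conf    # should show the Aliyun mirror configured above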
Tensorflow-GPU can be installed here directly with pip:
# Install TensorFlow GPU version from central repo
RUN pip3.6 --no-cache-dir install tensorflow-gpu==1.5.0
Note: to avoid the following FutureWarning messages on import tensorflow when numpy 1.17.0+ is installed, pin numpy==1.15.4 (any version below 1.17.0).
[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia 1e3fc1854e8c /bin/bash
[root@13472124aca1 test]# python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
>>> exit()
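With numpy pinned, a quick import-level check that the GPU build actually sees the device can be done from the shell (a minimal sketch using TF 1.x's device_lib; the container must be started with --runtime=nvidia as shown below):
python3.6 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
# The output should list a /device:GPU:0 entry in addition to the CPU.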
Check the CUDA and cuDNN version information:
[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia 1e3fc1854e8c /bin/bash
# CUDA 9.0 is installed inside the container, while the host has CUDA 10.1 installed
[root@c4a7dc3728ff /]# nvidia-smi
Mon Jul 29 17:26:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:01:00.0 On | Off |
| 26% 29C P8 11W / 250W | 472MiB / 24447MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
# Check the CUDA version
[root@c4a7dc3728ff /]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
[root@c4a7dc3728ff /]# cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
# Check the cuDNN version
[root@c4a7dc3728ff /]# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
The problems first encountered when installing opencv-python in a docker container were summarized earlier; compared with the previous document 《Docker Images: Centos7 + Python3.6 + Tensorflow + Opencv + Dlib》, the opencv-python installation commands are simplified here.
# Install opencv-python
RUN yum -y install libSM.x86_64 \
libXrender.x86_64 \
libXext.x86_64 && \
yum clean all && rm -rf /var/cache/yum
RUN pip3.6 --no-cache-dir install opencv-python==3.4.1.15
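A quick import check for opencv-python (a sketch):
python3.6 -c "import cv2; print(cv2.__version__)"    # expected: 3.4.1.15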
Problems encountered while installing dlib and their solutions:
The first attempt at pip install dlib failed with:
CMake must be installed to build the following extensions: dlib
Fix: yum -y install cmake
With cmake installed, installing dlib still failed with:
subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-build-g_ptsyo_/dlib/tools/python', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-build-g_ptsyo_/dlib/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE=/usr/bin/python3.6', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
Installing yum install -y python36u-devel.x86_64 alone still produced the same error; after also running yum -y groupinstall "Development tools", dlib installed successfully:
# Install dlib
RUN yum -y groupinstall "Development tools" && \
yum -y install cmake && \
yum clean all # && rm -rf /var/cache/yum
RUN yum install -y python36-devel.x86_64 && \
yum clean all # && rm -rf /var/cache/yum
# yum search python3 | grep devel
RUN pip3.6 --no-cache-dir install dlib
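And a quick check that the dlib build succeeded (a sketch, assuming dlib exposes the usual __version__ attribute):
python3.6 -c "import dlib; print(dlib.__version__)"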
# Install keras ...
RUN pip3.6 --no-cache-dir install Cython
RUN pip3.6 --no-cache-dir install \
keras \
flask \
flask_cors \
flask_socketio \
scikit-image \
mrcnn \
imgaug \
pycocotools
RUN mkdir /test
WORKDIR /test
CMD ["/bin/bash"]
The dockerfile below simply follows the build steps described above; the image's layer structure can be adjusted as needed.
FROM nvidia/cuda:9.0-devel-centos7
MAINTAINER ELN
# Change the timezone, install Chinese language support, configure the locale for Chinese
RUN rm -rf /etc/localtime && \
ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
yum -y install kde-l10n-Chinese && \
yum -y reinstall glibc-common && \
localedef -c -f UTF-8 -i zh_CN zh_CN.utf8 && \
yum clean all && rm -rf /var/cache/yum
ENV LC_ALL zh_CN.utf8
# Or run in a terminal: export LC_ALL=zh_CN.utf8
# Install cudnn7.0.5
#RUN curl -fsSL https://developer.download.nvidia.com/compute/redist/cudnn/v7.0.5/cudnn-8.0-linux-x64-v7.tgz -O && \
# tar --no-same-owner -xzf cudnn-8.0-linux-x64-v7.tgz -C /usr/local && \
# rm -rf cudnn-8.0-linux-x64-v7.tgz && \
# ldconfig
ADD cudnn-8.0-linux-x64-v7.tgz /usr/local/
RUN ldconfig
# Install Python 3.6
RUN yum -y install https://centos7.iuscommunity.org/ius-release.rpm && \
yum -y install python36 && \
yum -y install python36-pip && \
yum -y install vim && \
yum clean all && rm -rf /var/cache/yum && \
# ln /usr/bin/python3.6 /usr/bin/python3 && \
# ln /usr/bin/pip3.6 /usr/bin/pip3 && \
mkdir ~/.pip/ && \
echo -e "[global]\nindex-url = http://mirrors.aliyun.com/pypi/simple\n\n[install]\ntrusted-host=mirrors.aliyun.com" > ~/.pip/pip.conf
RUN pip3.6 --no-cache-dir install \
Pillow \
h5py \
ipykernel \
jupyter \
matplotlib==2.1.1 \
numpy==1.15.4 \
pandas \
scipy==1.1.0 \
sklearn \
&& \
python3.6 -m ipykernel.kernelspec
# Install TensorFlow GPU version from central repo
RUN pip3.6 --no-cache-dir install tensorflow-gpu==1.5.0
# Install opencv-python
RUN yum -y install libSM.x86_64 \
libXrender.x86_64 \
libXext.x86_64 && \
yum clean all && rm -rf /var/cache/yum
RUN pip3.6 --no-cache-dir install opencv-python==3.4.1.15
# Install dlib
RUN yum -y groupinstall "Development tools" && \
yum -y install cmake && \
yum clean all # && rm -rf /var/cache/yum
RUN yum install -y python36-devel.x86_64 && \
yum clean all # && rm -rf /var/cache/yum
# yum search python3 | grep devel
RUN pip3.6 --no-cache-dir install dlib
# Install keras ...
RUN pip3.6 --no-cache-dir install Cython
RUN pip3.6 --no-cache-dir install \
keras \
flask \
flask_cors \
flask_socketio \
scikit-image \
mrcnn \
imgaug \
pycocotools
RUN mkdir /test
WORKDIR /test
CMD ["/bin/bash"]
Write the above content into a dockerfile, then build the image and test it:
[eln@localhost docker]$ vim dockerfile
[eln@localhost docker]$ sudo docker build -t="test" .
[eln@localhost docker]$ sudo docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
test latest eb7674684afa 2 seconds ago 4.18GB
nvidia/cuda 9.0-devel-centos7 ff358ea56625 3 months ago 1.9GB
[eln@localhost docker]$ sudo docker run -it --rm --runtime=nvidia test
Add --runtime=nvidia to the docker run command. If the machine has more than one GPU, use -e to specify which card to use, e.g. -e NVIDIA_VISIBLE_DEVICES=0.
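For example, to start the image built above while exposing only the first GPU (a sketch combining the options just mentioned):
sudo docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 test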
[root@7da5e0966487 test]# echo "中文"
中文
[root@7da5e0966487 test]# pip3 --version
pip 8.1.2 from /usr/lib/python3.6/site-packages (python 3.6)
[root@7da5e0966487 test]# python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> import cv2
>>> import dlib
>>> print("中文")
中文
>>> exit()
Test the GPU's compute capability and verify that the tensorflow-gpu build is installed correctly.
Run the tensorflow-gpu test code inside the container:
[root@7da5e0966487 test]# vim testgpu.py
[root@7da5e0966487 test]# cat testgpu.py
# -*- coding: utf-8 -*-
"""
Test the GPU's compute capability and verify that the tensorflow-gpu build is installed correctly.
"""
import tensorflow as tf
import numpy as np
import time

value = np.random.randn(5000, 1000)
a = tf.constant(value)
b = a * a
c = 0

tic = time.time()
with tf.Session() as sess:
    for i in range(1000):
        sess.run(b)
        c += 1
        if c % 100 == 0:
            d = c / 10
            # print(d)
            print("计算进行%s%%" % d)    # progress in percent
toc = time.time()

t_cost = toc - tic
print("测试所用时间%s" % t_cost)          # elapsed time in seconds
[root@7da5e0966487 test]# python3 testgpu.py
2019-07-29 20:18:54.026579: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-07-29 20:18:54.290955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Quadro P6000 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 23.87GiB freeMemory: 23.26GiB
2019-07-29 20:18:54.291030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P6000, pci bus id: 0000:01:00.0, compute capability: 6.1)
计算进行10.0%
计算进行20.0%
计算进行30.0%
计算进行40.0%
计算进行50.0%
计算进行60.0%
计算进行70.0%
计算进行80.0%
计算进行90.0%
计算进行100.0%
测试所用时间14.024679899215698
While the test code is running, open two more terminals and check GPU usage inside the container and on the host respectively:
[eln@localhost docker]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7da5e0966487 test "/bin/bash" 2 minutes ago Up 2 minutes quirky_kapitsa
[eln@localhost docker]$ docker exec -it 7da5e0966487 /bin/bash
[root@7da5e0966487 test]# nvidia-smi
Mon Jul 29 20:19:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:01:00.0 On | Off |
| 26% 26C P8 19W / 250W | 23260MiB / 24447MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
[root@7da5e0966487 test]# exit
exit
[eln@localhost docker]$ nvidia-smi
Mon Jul 29 20:19:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:01:00.0 On | Off |
| 26% 27C P8 19W / 250W | 23260MiB / 24447MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1099 G /usr/lib/xorg/Xorg 36MiB |
| 0 1734 C python3 22779MiB |
| 0 2569 G fcitx-qimpanel 36MiB |
| 0 3667 G /usr/lib/xorg/Xorg 39MiB |
| 0 4497 G fcitx-qimpanel 36MiB |
| 0 4521 G unity-control-center 4MiB |
| 0 5564 G /usr/lib/xorg/Xorg 51MiB |
| 0 6430 G /usr/lib/xorg/Xorg 107MiB |
+-----------------------------------------------------------------------------+
# Monitor GPU usage in real time
[eln@localhost docker]$ watch -n 0.1 -d nvidia-smi