Table of Contents
I. Nvidia-docker
II. Nvidia-docker2
1. Install nvidia-docker2
2. Install nvidia-gpu-plugin
3. Run TensorFlow in a container
nvidia-docker makes GPUs usable from Docker containers by adding a thin wrapper layer on top of Docker.
Two major stable releases have been published so far; the original nvidia-docker (v1) has since been deprecated. This post gives a brief introduction to both.
nvidia-docker v1, as a wrapper around Docker, runs a separate daemon and is essentially a Volume Plugin. It implements a dedicated Volume Driver: when a volume is created with this driver, it collects the files that need to be passed through, matching the host's driver version, into a single directory, and that directory is then simply passed through to the container.
yum install nvidia-docker-1.0.1-1.x86_64.rpm
systemctl enable nvidia-docker
systemctl start nvidia-docker
# 384.90 is the version of the NVIDIA driver installed on the host
docker volume create --name=nvidia_driver_384.90 -d nvidia-docker
Once the volume is created, docker volume inspect nvidia_driver_384.90 shows that the files to be passed through have all been collected under /var/lib/nvidia-docker/volumes/nvidia_driver/384.90, mostly as symbolic links and hard links. Finally, when creating a container, the following docker run command-line arguments pass the driver and CUDA through to the container:
--volume=nvidia_driver_384.90:/usr/local/nvidia:ro --volume=/usr/local/cuda/lib64:/usr/local/cuda/lib64:ro -e LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
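Putting it together, a complete docker run invocation might look like the sketch below. This is only an illustration: the CUDA image tag and the --device flags for the GPU device nodes are assumptions based on a typical nvidia-docker v1 setup (the nvidia-docker run wrapper normally adds the device flags for you).

docker run --rm \
  --volume=nvidia_driver_384.90:/usr/local/nvidia:ro \
  --volume=/usr/local/cuda/lib64:/usr/local/cuda/lib64:ro \
  -e LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 \
  --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
  nvidia/cuda:9.0-base nvidia-smi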
nvidia-docker2 is implemented as a container runtime and integrates with Docker much more cleanly.
nvidia-docker2 depends on the Docker version; my environment is as follows:
[root@k8s-node1 ~]# uname -a
Linux k8s-node1 4.17.6-1.el7.elrepo.x86_64 #1 SMP Wed Jul 11 17:24:30 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@k8s-node1 ~]# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
[root@k8s-node1 ~]# docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:03 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:25:29 2018
OS/Arch: linux/amd64
Experimental: false
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
After installation, the new Docker runtime needs to be configured, and the default runtime also needs to be set to nvidia.
[root@k8s-node1 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then restart Docker:
systemctl restart docker
Check the Docker info:
[root@k8s-node1 ~]# docker info
Containers: 49
Running: 47
Paused: 0
Stopped: 2
Images: 84
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia # the default runtime is now nvidia
...
...
Test with an image:
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
To schedule GPU resources through Kubernetes, the nvidia-gpu-plugin device plugin needs to be installed; it can be deployed as a DaemonSet:
[root@k8s-node1 ~]# cat nvidia-gpu-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-plugin
  namespace: kube-system
  labels:
    k8s-app: nvidia-gpu-plugin
    version: "1.10"
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-gpu-plugin
      version: "1.10"
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: nvidia-gpu-plugin
        version: "1.10"
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia-gpu-plugin:1.10
        name: nvidia-gpu-plugin
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
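Deploy the DaemonSet by applying the manifest shown above (the file name matches the cat output):

kubectl apply -f nvidia-gpu-plugin.yaml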
Check the resources:
[root@k8s-node1 ~]# kubectl --namespace=kube-system get daemonsets nvidia-gpu-plugin
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-gpu-plugin 2 2 2 2 2 6d17h
[root@k8s-node1 ~]#
[root@k8s-node1 ~]#
[root@k8s-node1 ~]# kubectl --namespace=kube-system get pod | grep nvidia-gpu-plugin
nvidia-gpu-plugin-vv5zh 1/1 Running 0 6d17h
nvidia-gpu-plugin-z8f8c 1/1 Running 0 6d17h
Once nvidia-gpu-plugin is running, inspecting the Kubernetes node shows that GPU resources are now available:
[root@k8s-node1 ~]# kubectl get node k8s-node1 -ojson | jq '.status.allocatable'
{
  "cpu": "32",
  "ephemeral-storage": "680077411227",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "65762860Ki",
  "nvidia.com/gpu": "2",   # 2 GPU resources are now available
  "pods": "110"
}
Create a Pod that requests GPU resources:
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
Exec into the container; inside the nginx pod you can see that one GPU has been allocated:
[root@k8s-node1 example]# kubectl exec -it nginx-pod bash
root@nginx-pod:/# nvidia-smi
Tue May 28 02:56:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57 Driver Version: 410.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 25% 27C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
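The allocation can also be confirmed from the node side; a minimal check (the grep context size is arbitrary):

kubectl describe node k8s-node1 | grep -A 8 "Allocated resources"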
Building the container image
For best performance, TensorFlow with GPU support needs to be compiled for the specific GPU model; see https://blog.csdn.net/iov_aaron/article/details/90268597 for the build procedure.
My TensorFlow is version 1.12, and the image is based on NVIDIA's nvidia/cuda:10.0-cudnn7-runtime-centos7 image (https://gitlab.com/nvidia/cuda/tree/centos7/10.0).
[root@aidevops tensorflow-gpu-1.12.0]# cat Dockerfile
FROM nvidia/cuda:10.0-cudnn7-runtime-centos7
USER root
WORKDIR /root
COPY scripts /root/scripts
COPY sshconfig/* /etc/ssh/
COPY jupyterconfig /root/.jupyter
## SSH
RUN yum install -y openssh openssh-clients openssh-server && \
ssh-keygen -t rsa -P '' -f /etc/ssh/ssh_host_rsa_key && \
ssh-keygen -t ecdsa -P '' -f /etc/ssh/ssh_host_ecdsa_key && \
ssh-keygen -t ed25519 -P '' -f /etc/ssh/ssh_host_ed25519_key
# OpenCV
RUN yum install opencv-devel -y
# User Tools
RUN yum install vim python-imaging -y
# The Jupyter and TensorFlow wheels referenced below are assumed to sit in the Docker build
# context; copy them into the image so the pip installs can find them
COPY notebook-5.2.2-py2.py3-none-any.whl tensorflow-1.12.0-cp27-cp27mu-linux_x86_64.whl /root/
# Jupyter
RUN yum install -y python-setuptools gcc python-devel && \
easy_install pip==9.0.1 && \
pip install tornado==4.5.3 graphviz==0.8.1 requests==2.18.4 && \
pip install notebook-5.2.2-py2.py3-none-any.whl
# Tensorflow
RUN pip install tensorflow-1.12.0-cp27-cp27mu-linux_x86_64.whl
# Clean
RUN yum clean all && rm -rf /var/cache/yum*
CMD ["/root/scripts/start.sh"]
After the image is built, containers started from it can run TensorFlow applications normally.
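As a quick sanity check, the following sketch builds the image and verifies that TensorFlow can see the GPU (the tag tensorflow-gpu:1.12.0 is just an example name; tf.test.is_gpu_available() is available in TensorFlow 1.12):

docker build -t tensorflow-gpu:1.12.0 .
docker run --runtime=nvidia --rm tensorflow-gpu:1.12.0 \
    python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"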
Problems encountered:
When training with TensorFlow, an error was reported: kernel version 418.18 does not match DSO version 410.58.0
It turned out that the driver libraries in the nvidia/cuda:10.0-cudnn7-runtime-centos7 image were version 410.58, while the driver installed on the host was 418.18.
I uninstalled the driver on the physical machine and installed 410.57 instead (I could not find a 410.58 download, only the 410.57 driver: http://de.download.nvidia.com/XFree86/Linux-x86_64/410.57/NVIDIA-Linux-x86_64-410.57.run).
After that, training ran normally.
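To diagnose this kind of mismatch, it helps to compare the kernel module version on the host with the user-space driver version visible inside the container; for example:

# On the host: version of the loaded NVIDIA kernel module
cat /proc/driver/nvidia/version
# Inside the container: driver version reported by nvidia-smi
docker run --runtime=nvidia --rm nvidia/cuda:10.0-cudnn7-runtime-centos7 nvidia-smi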