Kubernetes: Using GPU Resources via nvidia-docker2

Contents

I. nvidia-docker

II. nvidia-docker2

1. Installing nvidia-docker2

2. Installing nvidia-gpu-plugin

3. Running TensorFlow in a container


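I. nvidia-docker

nvidia-docker is a thin wrapper around Docker that lets containers use GPUs.

Two major stable versions have been released so far. The first, nvidia-docker (1.0), is deprecated, so it is covered only briefly here.

As a wrapper around Docker, nvidia-docker has to run its own daemon, which is in fact a volume plugin. It implements a dedicated volume driver: when a volume is created with that driver, it collects the files that must be passed through to containers, based on the host's driver version, into a single directory. Mounting that one directory into the container is then all that is needed.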

yum install nvidia-docker-1.0.1-1.x86_64.rpm

systemctl enable nvidia-docker

systemctl start nvidia-docker

# 384.90 is the version of the installed driver
docker volume create --name=nvidia_driver_384.90 -d nvidia-docker

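Once the volume is created, docker volume inspect nvidia_driver_384.90 shows that the files to be passed through have been collected under /var/lib/nvidia-docker/volumes/nvidia_driver/384.90, which contains many symbolic links and hard links. Finally, when creating a container, the following docker run arguments pass the driver and CUDA through to it: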

--volume=nvidia_driver_384.90:/usr/local/nvidia:ro --volume=/usr/local/cuda/lib64:/usr/local/cuda/lib64:ro -e LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
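
Because the volume name and mount flags all depend on the driver version, they are easy to get wrong by hand. The following is a minimal Python sketch that assembles the same flags for a given version. The helper name nvidia_docker1_run_args is hypothetical (it is not part of nvidia-docker), and the CUDA path is assumed to be the one used above.

```python
def nvidia_docker1_run_args(driver_version, cuda_lib="/usr/local/cuda/lib64"):
    """Return the docker run flags that expose the driver volume and CUDA libs.

    Hypothetical helper: it only formats strings and does not call Docker.
    """
    volume = "nvidia_driver_{}".format(driver_version)
    ld_path = ":".join([
        cuda_lib + "/",
        "/usr/local/nvidia/lib",
        "/usr/local/nvidia/lib64",
    ])
    return [
        # Driver files collected by the nvidia-docker volume plugin, read-only.
        "--volume={}:/usr/local/nvidia:ro".format(volume),
        # CUDA libraries from the host, mounted at the same path, read-only.
        "--volume={0}:{0}:ro".format(cuda_lib),
        # Make the dynamic linker search the mounted directories.
        "-e", "LD_LIBRARY_PATH={}".format(ld_path),
    ]


if __name__ == "__main__":
    print(" ".join(nvidia_docker1_run_args("384.90")))
```

Joined with spaces, the result for 384.90 is the same argument string as shown above.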

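II. nvidia-docker2

nvidia-docker2 is implemented as a Docker runtime, so it integrates with Docker more cleanly.

1. Installing nvidia-docker2

nvidia-docker2 depends on the Docker version. My environment is as follows: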

[root@k8s-node1 ~]# uname -a
Linux k8s-node1 4.17.6-1.el7.elrepo.x86_64 #1 SMP Wed Jul 11 17:24:30 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@k8s-node1 ~]# cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core)

[root@k8s-node1 ~]# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Add the package repository and install nvidia-docker2:

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 (the Docker daemon is restarted further down)
sudo yum install -y nvidia-docker2

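After installation, the new Docker runtime has to be configured, and the default runtime must also be set to nvidia. The default matters because the kubelet starts containers without passing --runtime=nvidia.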

[root@k8s-node1 ~]# cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
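
If /etc/docker/daemon.json already contains other settings, overwriting it would lose them. The following is a minimal Python sketch of merging the nvidia runtime into an existing configuration. The function name with_nvidia_default_runtime and the log-driver entry in the example are assumptions made for illustration; only the runtime path and keys come from the file above.

```python
import json

# Runtime entry exactly as it appears in the daemon.json shown above.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def with_nvidia_default_runtime(config):
    """Return a copy of a daemon.json dict with nvidia registered as default.

    Hypothetical helper: existing keys and other runtimes are preserved.
    """
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = dict(NVIDIA_RUNTIME)
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


if __name__ == "__main__":
    existing = {"log-driver": "json-file"}  # hypothetical pre-existing settings
    print(json.dumps(with_nvidia_default_runtime(existing), indent=4))
```

The function returns a new dict rather than editing the input, so the result can be reviewed before it is written back to disk.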

Then restart Docker:

systemctl restart docker

Check the Docker info:

[root@k8s-node1 ~]# docker info
Containers: 49
 Running: 47
 Paused: 0
 Stopped: 2
Images: 84
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia # the default runtime is now nvidia
...
...

Test with an image:

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

2. Installing nvidia-gpu-plugin

Scheduling GPU resources through Kubernetes requires nvidia-gpu-plugin, which can be deployed as a DaemonSet:

[root@k8s-node1 ~]# cat nvidia-gpu-plugin.yaml 
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-plugin
  namespace: kube-system
  labels:
    k8s-app: nvidia-gpu-plugin
    version: "1.10"
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-gpu-plugin
      version: "1.10"
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: nvidia-gpu-plugin
        version: "1.10"
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia-gpu-plugin:1.10
        name: nvidia-gpu-plugin
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Check the resources:

[root@k8s-node1 ~]# kubectl --namespace=kube-system get daemonsets nvidia-gpu-plugin 
NAME                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-gpu-plugin   2         2         2       2            2           <none>          6d17h
[root@k8s-node1 ~]# 
[root@k8s-node1 ~]# 
[root@k8s-node1 ~]# kubectl --namespace=kube-system get pod | grep nvidia-gpu-plugin
nvidia-gpu-plugin-vv5zh                               1/1     Running   0          6d17h
nvidia-gpu-plugin-z8f8c                               1/1     Running   0          6d17h

Once nvidia-gpu-plugin is running, the Kubernetes node reports GPU resources as allocatable:

[root@k8s-node1 ~]# kubectl get node k8s-node1 -ojson | jq '.status.allocatable'
{
  "cpu": "32",
  "ephemeral-storage": "680077411227",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "65762860Ki",
  "nvidia.com/gpu": "2", # 2 GPUs are now available
  "pods": "110"
}
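
The same check can be scripted, for example to verify every node after the plugin is rolled out. The following is a minimal Python sketch that reads the node JSON printed by kubectl get node ... -o json and returns the advertised GPU count. The function name allocatable_gpus is hypothetical; the field names are the ones visible in the output above.

```python
import json
import sys


def allocatable_gpus(node):
    """Return how many nvidia.com/gpu resources a node advertises (0 if none).

    `node` is the parsed JSON of a single Node object.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports quantities as strings, e.g. "2".
    return int(allocatable.get("nvidia.com/gpu", "0"))


if __name__ == "__main__":
    # Example: kubectl get node k8s-node1 -o json | python this_script.py
    print(allocatable_gpus(json.load(sys.stdin)))
```

A node without the plugin has no nvidia.com/gpu key at all, which is why the sketch falls back to 0 instead of raising an error.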

Create a Pod that requests a GPU:

[root@k8s-node1 example]# cat nginx-pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"

[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml 
pod/nginx-pod created
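
The part of the manifest that matters for GPU scheduling is the nvidia.com/gpu entry under resources.limits. The following is a minimal Python sketch that generates an equivalent manifest as JSON, which kubectl also accepts. The helper name gpu_pod is hypothetical, and the manifest is trimmed to the fields needed for the GPU request.

```python
import json


def gpu_pod(name, image, gpus=1):
    """Build a minimal Pod manifest whose single container is limited to `gpus` GPUs.

    Hypothetical helper for illustration; ports and pull policy are omitted.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"k8s-app": name}},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested as whole devices under limits.
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
        },
    }


if __name__ == "__main__":
    # Example: python this_script.py | kubectl create -f -
    print(json.dumps(gpu_pod("nginx-pod", "nginx:latest"), indent=2))
```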

Inside the container, nvidia-smi shows that one GPU has been allocated to the nginx pod:

[root@k8s-node1 example]# kubectl exec -it nginx-pod bash
root@nginx-pod:/# nvidia-smi 
Tue May 28 02:56:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 25%   27C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

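3. Running TensorFlow in a container

Building the container image

The GPU build of TensorFlow performs best when it is compiled for the specific GPU model. For the build procedure, see: https://blog.csdn.net/iov_aaron/article/details/90268597

My TensorFlow is version 1.12, and the image is based on NVIDIA's nvidia/cuda:10.0-cudnn7-runtime-centos7 image (https://gitlab.com/nvidia/cuda/tree/centos7/10.0)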

[root@aidevops tensorflow-gpu-1.12.0]# cat Dockerfile 
FROM nvidia/cuda:10.0-cudnn7-runtime-centos7
USER root
WORKDIR /root

COPY scripts /root/scripts
COPY sshconfig/* /etc/ssh/
COPY jupyterconfig /root/.jupyter

## SSH
RUN yum install -y openssh openssh-clients openssh-server && \
    ssh-keygen -t rsa -P '' -f /etc/ssh/ssh_host_rsa_key && \
    ssh-keygen -t ecdsa -P '' -f /etc/ssh/ssh_host_ecdsa_key && \
    ssh-keygen -t ed25519 -P '' -f /etc/ssh/ssh_host_ed25519_key

# OpenCV
RUN yum install opencv-devel -y

# User Tools
RUN yum install vim python-imaging -y

# Jupyter
RUN yum install -y python-setuptools gcc python-devel && \
    easy_install pip==9.0.1 && \
    pip install tornado==4.5.3 graphviz==0.8.1 requests==2.18.4 && \
    pip install notebook-5.2.2-py2.py3-none-any.whl

# Tensorflow
RUN pip install tensorflow-1.12.0-cp27-cp27mu-linux_x86_64.whl

# Clean
RUN yum clean all && rm -rf /var/cache/yum*

CMD ["/root/scripts/start.sh"]

Once the image is built, containers started from it can run TensorFlow applications normally.

Problems encountered:

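TensorFlow training failed with the error: kernel version 418.18 does not match DSO version 410.58.0

It turned out that the driver installed in the nvidia/cuda:10.0-cudnn7-runtime-centos7 image is 410.58, while the driver installed on the host is 418.18.

I uninstalled the driver on the physical host and installed 410.57 instead (I could not find 410.58, only the 410.57 driver: http://de.download.nvidia.com/XFree86/Linux-x86_64/410.57/NVIDIA-Linux-x86_64-410.57.run)

After that, everything ran normally.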
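
When this error shows up in training logs, the two version numbers are the first thing to compare. The following is a minimal Python sketch that extracts them from the message quoted above. The function names are hypothetical, and treating a trailing ".0" as insignificant when comparing versions is an assumption of this sketch, not documented driver behavior.

```python
import re

# Matches the message format quoted above.
MISMATCH = re.compile(
    r"kernel version (\d+(?:\.\d+)*) does not match DSO version (\d+(?:\.\d+)*)"
)


def parse_driver_mismatch(message):
    """Return (kernel_version, dso_version) from the error text, or None."""
    match = MISMATCH.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


def normalize(version):
    """Turn '410.58.0' into (410, 58); trailing zeros are dropped (assumption)."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


if __name__ == "__main__":
    line = "kernel version 418.18 does not match DSO version 410.58.0"
    kernel, dso = parse_driver_mismatch(line)
    print(kernel, dso, normalize(kernel) == normalize(dso))
```

For the message above, the parsed versions differ, which points to the same host and image driver mismatch described in this section.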
