Preface
In a Kubernetes cluster, deploying a GPU node on top of an existing passthrough GPU VM requires three extra steps compared with a regular node:
- Install the NVIDIA driver
- Install NVIDIA-Docker2
- Deploy the NVIDIA device plugin
Install the NVIDIA driver
Download the NVIDIA driver
The drivers are free; choose the one matching your GPU model from the official download page (see References).
Disable the nouveau driver
Add a modprobe blacklist conf file:
vi /etc/modprobe.d/blacklist.conf
Append the following two lines at the end:
blacklist nouveau
options nouveau modeset=0
Regenerate the kernel initramfs:
sudo update-initramfs -u
Reboot the node VM:
reboot
Verify after the reboot: if the following command produces no output, nouveau is successfully disabled.
lsmod | grep nouveau
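The blacklist steps above can also be scripted so that re-running node setup does not duplicate entries. A minimal idempotent sketch — written against /tmp/blacklist.conf for illustration; on a real node the target file is /etc/modprobe.d/blacklist.conf:

```shell
# Idempotent sketch of the nouveau blacklist step.
# NOTE: /tmp/blacklist.conf is used here for illustration only;
# on the node the real path is /etc/modprobe.d/blacklist.conf.
CONF=/tmp/blacklist.conf

# Append the entries only if they are not already present.
if ! grep -q '^blacklist nouveau' "$CONF" 2>/dev/null; then
  echo 'blacklist nouveau' >> "$CONF"
  echo 'options nouveau modeset=0' >> "$CONF"
fi

# Show the resulting entries.
grep nouveau "$CONF"
```

After this, regenerate the initramfs and reboot as described above.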
Install the driver
In this example the VM runs Ubuntu 18.04 amd64, the GPU is a Tesla V100, and driver version 440.33.01 is selected.
Online installation (substitute the package series matching your chosen driver — e.g. nvidia-driver-440 for a 440.x driver; the command below installs the 430 series):
apt install nvidia-driver-430 nvidia-utils-430 nvidia-settings
Offline installation (the -s flag runs the installer silently; the downloaded .run file must be executable):
./NVIDIA-Linux-x86_64-{{ gpu_version }}.run -s
Verify the driver installation:
nvidia-smi
If the driver is installed correctly, nvidia-smi prints a table listing the driver version and each detected GPU.
This completes the NVIDIA driver installation.
Install NVIDIA-Docker2
Docker 18.06 does not support GPU containers by itself, so NVIDIA-Docker2 is installed to let containers use NVIDIA GPUs.
Note: docker and the NVIDIA driver must be installed first, but CUDA is not required.
Online installation:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y nvidia-docker2
systemctl restart docker
Offline installation:
On a machine with Internet access, run the following commands:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
Download the five packages:
apt download libnvidia-container1
apt download libnvidia-container-tools
apt download nvidia-container-toolkit
apt download nvidia-container-runtime
apt download nvidia-docker2
Copy the downloaded packages to the target node VM, then install them in dependency order:
dpkg -i libnvidia-container1_1.0.7-1_amd64.deb
dpkg -i libnvidia-container-tools_1.0.7-1_amd64.deb
dpkg -i nvidia-container-toolkit_1.0.5-1_amd64.deb
dpkg -i nvidia-container-runtime_3.1.4-1_amd64.deb
dpkg -i nvidia-docker2_2.2.2-1_all.deb
Set the docker default runtime on the GPU node to nvidia-container-runtime:
vi /etc/docker/daemon.json
Add the following content to the configuration file:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
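Before restarting docker it is worth validating the edited file, since a JSON syntax error in /etc/docker/daemon.json will stop dockerd from starting. A quick check, sketched here against a scratch copy in /tmp (on the node, point it at /etc/docker/daemon.json):

```shell
# Write the example config to a scratch copy for illustration;
# on a real node this content lives in /etc/docker/daemon.json.
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Parse the file and print the configured default runtime;
# a syntax error would make this command fail instead.
python3 -c 'import json; print(json.load(open("/tmp/daemon.json"))["default-runtime"])'
```

Output of `nvidia` confirms the file parses and names the expected default runtime.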
Restart docker:
systemctl restart docker
Verify the installation (the output should list nvidia under Runtimes and show nvidia as the Default Runtime):
docker info
This completes the NVIDIA-Docker2 installation.
Install the nvidia-device-plugin-daemonset plugin
Note: this example uses version 1.0.0-beta6; see the NVIDIA k8s-device-plugin GitHub project for all available versions.
Online installation:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
The official NVIDIA manifest is:
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Note: you can label the GPU nodes and add a matching nodeSelector or nodeAffinity to nvidia-device-plugin-daemonset.yaml so the plugin runs only on GPU nodes.
After completing the steps above, verify the GPU node:
kubectl get no {nodeName} -oyaml
The node details now report a value for nvidia.com/gpu, which means the configuration took effect; GPU resources are exposed to the Kubernetes cluster as whole cards. Alternatively, GPUs can be exposed to the cluster by GPU memory to enable shared GPU scheduling.
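As a further smoke test, you can schedule a pod that requests one GPU and runs nvidia-smi. A sketch — the pod name and CUDA image tag below are illustrative choices, not from this guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base  # any CUDA base image with nvidia-smi works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # request one whole GPU card
```

If the pod completes and kubectl logs gpu-smoke-test shows the nvidia-smi table, GPU scheduling works end to end.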
This completes the deployment and verification of a GPU node in a Kubernetes cluster.
References
Installing the NVIDIA driver: https://www.cnblogs.com/youpeng/p/10887346.html
Disabling nouveau: http://www.iewb.net/index.php/qg/3717.html
Official NVIDIA driver downloads: https://www.nvidia.cn/Download/index.aspx?lang=cn
Installing NVIDIA-Docker2: https://fanfuhan.github.io/2019/11/22/docker_based_use/
Fixing nvidia-docker2 installation failures on Ubuntu 18: https://blog.csdn.net/wuzhongli/article/details/86539433
Kubernetes documentation on scheduling GPUs: https://kubernetes.io/zh/docs/tasks/manage-gpus/scheduling-gpus/
NVIDIA device plugin documentation: https://github.com/NVIDIA/k8s-device-plugin