Configuring the GPU driver on GPU worker nodes in a Kubernetes cluster

The operating system is CentOS 7.

1. Download and install the GPU driver
Check the GPU model:

[root@VM-3-9-centos user]# lspci | grep -i nvidia
00:08.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

1.1 Download the driver from the official NVIDIA site
https://www.nvidia.cn/Download/index.aspx

Note: a driver that supports CUDA 12 is recommended.
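On a headless server it is often easier to fetch the .run package directly with wget. A minimal sketch, assuming the usual URL pattern for NVIDIA's data-center (Tesla) driver downloads; verify the exact link on the download page:

# URL pattern is an assumption; confirm it on the NVIDIA download page
wget https://us.download.nvidia.com/tesla/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
chmod +x NVIDIA-Linux-x86_64-525.105.17.run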
1.2 Install the GPU driver
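The .run installer compiles a kernel module, so on CentOS 7 it normally needs gcc, kernel headers matching the running kernel, and the in-tree nouveau driver disabled. A preparation sketch using standard CentOS 7 packages and paths (make sure the kernel-devel version matches uname -r):

# build tools and headers for the running kernel
yum install -y gcc kernel-devel-$(uname -r) kernel-headers-$(uname -r)

# blacklist nouveau, rebuild the initramfs and reboot
cat <<'EOF' > /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
dracut --force
reboot

After the reboot, run the installer: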

bash NVIDIA-Linux-x86_64-525.105.17.run

Check whether the installation succeeded:

[root@VM-3-9-centos user]# nvidia-smi
Wed May 17 13:04:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   45C    P0    26W /  70W |   3414MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19692      C   python3                          3410MiB |
+-----------------------------------------------------------------------------+
[root@VM-3-9-centos user]#

To uninstall the GPU driver (the server must be rebooted afterwards):

/usr/bin/nvidia-uninstall

1.3 Install nvidia-docker2

yum install -y nvidia-docker2
yum install -y nvidia-container-runtime
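The two packages above are only available once NVIDIA's container repository has been added to yum. A sketch of the commonly documented setup for CentOS 7 (repository URL as published in NVIDIA's nvidia-docker documentation; verify it for your distribution):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo \
    | tee /etc/yum.repos.d/nvidia-docker.repo
yum clean expire-cache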

2. Configure the environment to use the GPU

2.1 Modify /etc/docker/daemon.json to register the NVIDIA runtime and make it the default

{
  "registry-mirrors": [
      "https://tf72mndn.mirror.aliyuncs.com"
  ],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "log-opts": {
      "max-file": "3",
      "max-size": "500m"
  },
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
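Docker must be restarted for the new default runtime to take effect; a quick way to apply and check the change:

systemctl daemon-reload
systemctl restart docker
# "Default Runtime" should now report nvidia
docker info | grep -i runtime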

2.2 Deploy the NVIDIA device plugin in Kubernetes

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

Note: adjust how the plugin is deployed; in a mixed cluster you can restrict it so that it only runs on the servers that actually have GPUs (see the sketch below).
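A sketch of one way to do that: label the GPU nodes and patch the device-plugin DaemonSet with a matching nodeSelector (the label key/value is illustrative, not something the plugin requires):

# mark the GPU node (label name is arbitrary)
kubectl label node vm-3-9-centos gpu=true

# restrict the DaemonSet to labelled nodes
kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
    --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"gpu":"true"}}}}}'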

2.3 Verify the GPU inside the Kubernetes cluster

[root@VM-2-8-centos user]#  kubectl describe node vm-3-9-centos |grep nv
                    nvidia.com/gpu.present=true
 nvidia.com/gpu:     1
 nvidia.com/gpu:     1
  kube-system                nvidia-device-plugin-daemonset-4p97n      0 (0%)        0 (0%)      0 (0%)           0 (0%)         85m
  nvidia.com/gpu     1          1
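To confirm that the scheduler can actually hand the GPU to a workload, a minimal test pod can request nvidia.com/gpu and run nvidia-smi. A sketch (the CUDA image tag is an assumption; pick one supported by your driver):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # image tag is an assumption; use a CUDA version supported by the driver
    image: nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# the pod should print the same nvidia-smi table as on the host
kubectl logs gpu-test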

2.4 Set the number of GPUs a container uses through Rancher
