Prometheus Installation and Deployment: NVIDIA GPU Monitoring with DCGM

I. NVIDIA GPU Monitoring (DCGM)

The GPU driver must be installed before you begin.
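
A quick way to confirm the driver prerequisite is to ask `nvidia-smi`, which ships with the NVIDIA driver, to list the GPUs it can see. The snippet below is a minimal sketch; the helper name `check_gpu_driver` is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the NVIDIA driver tools are present.
check_gpu_driver() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L prints one line per detected GPU
        nvidia-smi -L
    else
        echo "nvidia-smi not found: install the GPU driver first" >&2
        return 1
    fi
}

check_gpu_driver || true
```

If the function prints the error message, or `nvidia-smi -L` lists no devices, fix the driver installation before continuing.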

  1. Install the Go toolchain
sudo apt install golang-go
# Verify that the installation succeeded
go version
  2. Download datacenter-gpu-manager (DCGM)
    Register at https://developer.nvidia.com/dcgm, then download DCGM.
  3. Install DCGM
sudo dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb
  4. Download gpu-monitoring-tools
git clone https://gitee.com/JackTpy/gpu-monitoring-tools.git

  5. Point Go at a module proxy hosted in China
go env -w GOPROXY=https://goproxy.cn
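
`go env -w` was introduced in Go 1.13, and the Go packaged by a distribution can be older than that. The sketch below checks the installed version before relying on the setting. The helper `version_ge` is made up for illustration and assumes GNU `sort -V` is available.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v go >/dev/null 2>&1; then
    # "go version" prints a line such as "go version go1.18.1 linux/amd64"
    current=$(go version | awk '{print $3}' | sed 's/^go//')
    if version_ge "$current" "1.13"; then
        go env GOPROXY    # show the proxy currently in effect
    else
        echo "Go $current is older than 1.13; 'go env -w' is unavailable"
    fi
fi
```

If the installed Go is too old, install a newer release from another source before building.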
  6. Build and install
cd gpu-monitoring-tools/
sudo make binary
sudo make install
  7. Test-run dcgm-exporter
dcgm-exporter
# If no errors are printed, it started successfully
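
Besides watching for errors, you can confirm from a second terminal that the exporter is actually serving metrics. This is only a sketch: it assumes the exporter listens on `localhost:9400` and exposes `/metrics`, which may not match your build, so adjust `EXPORTER_URL` if needed. The helper `count_dcgm_samples` is made up for illustration.

```shell
# Assumption: the exporter's HTTP endpoint; override via the environment if yours differs.
EXPORTER_URL="${EXPORTER_URL:-http://localhost:9400/metrics}"

# Hypothetical helper: reads Prometheus text format on stdin and prints
# how many sample lines start with "dcgm_" (case-insensitive).
count_dcgm_samples() {
    grep -ci '^dcgm_' || true
}

if body=$(curl -fsS --max-time 5 "$EXPORTER_URL" 2>/dev/null); then
    echo "DCGM samples: $(printf '%s\n' "$body" | count_dcgm_samples)"
else
    echo "could not reach $EXPORTER_URL"
fi
```

A non-zero sample count suggests the exporter is reading GPU data. A count of zero, or a connection failure, means the port, the path, or the exporter itself needs a closer look.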
  8. Create a systemd unit file so the exporter starts at boot
    sudo vim /etc/systemd/system/dcgm-exporter.service

  9. Write the following content into the file

[Unit]
Description=dcgm-exporter service

[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter

TimeoutStopSec=10
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
  10. Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter
  11. Check the service status
sudo systemctl status dcgm-exporter
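
If the status output is hard to read, a compact check like the one below may help. It is a sketch using the unit name from the steps above; the helper `service_report` is made up for illustration.

```shell
UNIT="dcgm-exporter"

# Hypothetical helper: prints a one-line summary, using "unknown"
# when systemctl is missing or gives no answer.
service_report() {
    active=$(systemctl is-active "$1" 2>/dev/null) || true
    enabled=$(systemctl is-enabled "$1" 2>/dev/null) || true
    echo "active=${active:-unknown} enabled=${enabled:-unknown}"
}

service_report "$UNIT"

# Show the most recent log lines for the unit, if journalctl is available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u "$UNIT" -n 20 --no-pager || true
fi
```

A healthy setup should report `active=active enabled=enabled`. Otherwise, the journal lines usually point to the cause, and depending on your system's permissions you may need `sudo` to read them.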
