GPU Monitoring

Overview

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring Tesla™ GPUs in cluster environments. It can be integrated into a Prometheus-based monitoring setup.

Deployment

Download the deb package from https://developer.nvidia.com/dcgm (registration required), then install it and start the service:

sudo dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb
sudo systemctl enable dcgm.service
sudo systemctl start dcgm.service
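
As a quick sanity check after installation, the sketch below asks DCGM to list the GPUs it can see. It assumes the dcgmi command-line tool was installed by the deb package and is on the PATH; verify_dcgm is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: list the GPUs known to DCGM, or explain why we cannot.
verify_dcgm() {
    # Assumption: the deb package puts the dcgmi CLI on the PATH.
    if ! command -v dcgmi >/dev/null 2>&1; then
        echo "dcgmi not found; is datacenter-gpu-manager installed?" >&2
        return 1
    fi
    # "discovery -l" prints the GPUs that DCGM has discovered.
    dcgmi discovery -l
}

# Usage (run after dcgm.service has started):
#   verify_dcgm
```

If the command reports no GPUs, it is probably worth checking the driver and dcgm.service before continuing with the exporter setup.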

Download the dcgm toolkit from https://d.pr/free/f/qcUmPG, then install the exporters and their systemd units:

tar zxvf dcgm.tar.gz
cd dcgm
sudo cp dcgm-exporter /usr/local/bin/
sudo cp node_exporter /usr/local/bin/
sudo mkdir -p /run/prometheus
sudo cp prometheus-dcgm.service /etc/systemd/system/
sudo cp prometheus-node-exporter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable prometheus-dcgm.service
sudo systemctl enable prometheus-node-exporter.service
sudo systemctl start prometheus-dcgm.service
sudo systemctl start prometheus-node-exporter.service
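
One way to check that GPU metrics are actually being exported is to count the DCGM metric lines in the exporter output. This is only a sketch, and it rests on assumptions not stated above: that node_exporter serves metrics on its default port 9100, that the DCGM metrics are published through it, and that their names start with dcgm_ (matched case-insensitively). count_dcgm_metrics is a hypothetical helper name.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print how many sample lines have a name starting with "dcgm_".
count_dcgm_metrics() {
    # -c counts matching lines, -i ignores case.
    # "# HELP" and "# TYPE" comment lines do not match the ^ anchor.
    # grep exits non-zero when the count is 0, so "|| true" keeps the
    # function's exit status at 0 in that case.
    grep -ci '^dcgm_' || true
}

# Usage (assumes node_exporter on its default port 9100):
#   curl -s http://localhost:9100/metrics | count_dcgm_metrics
```

A count of 0 would suggest that the exporter is not producing metrics yet, or that one of the assumptions above does not hold for this package.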

Confirm that all related services have started:

systemctl status dcgm.service
systemctl status prometheus-dcgm.service
systemctl status prometheus-node-exporter.service
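
For a scripted check, the three status commands can be folded into a loop that prints one line per service. The sketch below uses systemctl is-active, which exits with status 0 only when the unit is active; check_services is a hypothetical helper name.

```shell
# Hypothetical helper: print OK/FAILED for each unit given as an argument
# and return non-zero if any of them is not active.
check_services() {
    failed=0
    for svc in "$@"; do
        # --quiet suppresses output; only the exit status is used.
        if systemctl is-active --quiet "$svc"; then
            echo "OK      $svc"
        else
            echo "FAILED  $svc"
            failed=1
        fi
    done
    return "$failed"
}

# Usage:
#   check_services dcgm.service prometheus-dcgm.service prometheus-node-exporter.service
```

For any unit reported as FAILED, systemctl status on that unit, as shown above, gives the details.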

Result (Dashboard ID: 11752)
