Prometheus GPU Monitoring

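  • 1, Prometheus GPU monitoring
  • 2, Install gpu-monitoring-tools
    • 2.1, Start `dcgm-exporter` at boot
  • 3, Update the Prometheus configuration
  • 4, Grafana
  • 5, Dashboard `9957` supports switching between nodes
  • 6, Grafana settings
  • 7, Using dashboard `12027`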

1, Prometheus GPU monitoring

  • Install DCGM
  • Package used: datacenter-gpu-manager_1.7.2_amd64.deb
# dcgmi --version

dcgmi  version: 1.7.2

2, Install gpu-monitoring-tools

# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
# cd gpu-monitoring-tools/
# make binary
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
# make install
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
install -m 557 dcgm-exporter /usr/bin/dcgm-exporter
install -m 557 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv
install -m 557 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv
  • Run dcgm-exporter
# which dcgm-exporter
/usr/bin/dcgm-exporter
# dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
  • Test: the exporter should return monitoring data
# curl 192.168.1.2:9400/metrics
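
The curl check above can also be scripted. The following is a minimal sketch, assuming the exporter serves the standard Prometheus text format at the address used above; `parse_metrics`, `fetch_metrics`, and the `SAMPLE` payload are hypothetical names and data introduced here for illustration, not part of dcgm-exporter.

```python
import re
import urllib.request

# Matches lines such as: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 35
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://192.168.1.2:9400/metrics", timeout=5):
    """Download the metrics page (address taken from the curl command above)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical sample payload so the sketch runs without a GPU host.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-example"} 35
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-example"} 61.5
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

On a real host, `parse_metrics(fetch_metrics())` would replace the sample payload. The regular expressions cover only the simple lines shown here and are not a full parser for the exposition format.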

2.1, Start dcgm-exporter at boot

  • Create the service unit: vim /lib/systemd/system/dcgm-exporter.service
[Unit]
Description=dcgm-exporter service

[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter

TimeoutStopSec=10
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl enable dcgm-exporter.service
# systemctl start dcgm-exporter.service
# systemctl status dcgm-exporter.service
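
Before running `systemctl daemon-reload`, the unit file can be sanity-checked. This is a rough sketch, assuming the unit is as simple as the one above so that Python's `configparser` can read it; `check_unit` and `UNIT_TEXT` are hypothetical names, and the check covers only the two keys this setup relies on.

```python
import configparser

# Same unit text as above; in practice, read it from
# /lib/systemd/system/dcgm-exporter.service instead.
UNIT_TEXT = """\
[Unit]
Description=dcgm-exporter service

[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
"""


def check_unit(text):
    """Return a list of problems found in a simple systemd unit file."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case; systemd keys are case-sensitive
    parser.read_string(text)
    problems = []
    if not parser.has_option("Service", "ExecStart"):
        problems.append("missing ExecStart in [Service]")
    if not parser.has_option("Install", "WantedBy"):
        problems.append("missing WantedBy in [Install]")
    return problems


if __name__ == "__main__":
    print(check_unit(UNIT_TEXT) or "unit file looks complete")
```

`configparser` is not a systemd parser, so this only works for plain `Key=Value` units like this one; `systemctl status` remains the authoritative check.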

3, Update the Prometheus configuration

  • Add a scrape job for dcgm-exporter
    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']
# cat prometheus.yml
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']


    # node_exporter
  - job_name: 'node'
    static_configs:
    - targets: ['127.0.0.1:9100','192.168.1.2:9100']

    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']
  • Restart Prometheus
# systemctl restart prometheus.service
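
After the restart, the new job can be verified through the Prometheus HTTP API rather than the web UI. A minimal sketch, assuming Prometheus listens on `localhost:9090` as in the configuration above; `fetch_targets`, `job_health`, and the trimmed `EXAMPLE` response are hypothetical names and data used for illustration.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """Fetch the target list from the Prometheus HTTP API (/api/v1/targets)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def job_health(payload, job="gpu"):
    """Map each scrape URL of the given job to its reported health."""
    result = {}
    for target in payload["data"]["activeTargets"]:
        if target["labels"].get("job") == job:
            result[target["scrapeUrl"]] = target["health"]
    return result


# Hypothetical, trimmed response used to exercise job_health offline.
EXAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"instance": "192.168.1.2:9400", "job": "gpu"},
                "scrapeUrl": "http://192.168.1.2:9400/metrics",
                "health": "up",
            },
            {
                "labels": {"instance": "localhost:9090", "job": "prometheus"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
            },
        ]
    },
}

if __name__ == "__main__":
    print(job_health(EXAMPLE))
```

Against a live server, `job_health(fetch_targets())` should list the `gpu` target as `up` once the first scrape has succeeded.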

[Image 1]

4, Grafana

[Image 2]

5, Dashboard 9957 supports switching between nodes

[Image 3]
[Image 4]

6, Grafana settings

  • Power usage; instance is the IP address and port of the exporter
DCGM_FI_DEV_POWER_USAGE{instance="192.168.1.101:9400"}
  • GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="192.168.1.101:9400"}
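
The same expressions can be tried outside Grafana with an instant query against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at a base URL such as `http://localhost:9090` and that each series carries a `gpu` label; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_selector(metric, instance):
    """Return a PromQL selector such as METRIC{instance="host:port"}."""
    return f'{metric}{{instance="{instance}"}}'


def query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(payload):
    """Return {gpu label: value} from an instant-query response."""
    return {
        item["metric"].get("gpu", "?"): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def run_query(base_url, promql, timeout=5):
    """Execute the query and return the per-GPU values."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    selector = build_selector("DCGM_FI_DEV_POWER_USAGE", "192.168.1.101:9400")
    print(query_url("http://localhost:9090", selector))
```

Calling `run_query("http://localhost:9090", selector)` on a live setup would return one value per GPU on that instance; if the series lack a `gpu` label, the key falls back to `"?"`.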

7, Using dashboard 12027

[Image 5]

    # dcgm-exporter
  - job_name: 'gpu-metrics'
    static_configs:
    - targets: ['127.0.0.1:9400','192.168.1.101:9400','192.168.1.102:9400']

[Image 6]

  • Configure panels manually
    [Image 7]
  • View the GPU metrics
curl http://127.0.0.1:9400/metrics
  • Power usage
DCGM_FI_DEV_POWER_USAGE{instance="127.0.0.1:9400"}
  • Memory used (framebuffer)
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}
  • Total memory (used + free)
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}+DCGM_FI_DEV_FB_FREE{instance="127.0.0.1:9400"}
  • GPU utilization
DCGM_FI_DEV_GPU_UTIL{instance="127.0.0.1:9400"}
  • GPU memory utilization (this metric reports memory copy activity, not the share of memory in use)
DCGM_FI_DEV_MEM_COPY_UTIL{instance="192.168.0.114:9400"}
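
The total-memory expression above adds used and free framebuffer memory. The share of memory in use follows from the same two metrics. A small sketch of that arithmetic, assuming per-GPU values have already been collected (for example from the `/metrics` page); the function names and sample numbers are hypothetical.

```python
def memory_usage_percent(used, free):
    """Return used / (used + free) as a percentage; 0.0 if both are zero."""
    total = used + free
    if total == 0:
        return 0.0
    return 100.0 * used / total


def per_gpu_usage(fb_used, fb_free):
    """Combine {gpu: DCGM_FI_DEV_FB_USED} and {gpu: DCGM_FI_DEV_FB_FREE} maps."""
    return {
        gpu: memory_usage_percent(used, fb_free[gpu])
        for gpu, used in fb_used.items()
        if gpu in fb_free  # skip GPUs missing from either map
    }


if __name__ == "__main__":
    # Hypothetical readings for two GPUs, in the unit the exporter reports.
    used = {"0": 4096, "1": 0}
    free = {"0": 12288, "1": 16384}
    print(per_gpu_usage(used, free))
```

In Grafana, an equivalent expression would be `100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)` with the same `instance` selector on each metric.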

