Taking data center A as an example, the Agent is installed on server 192.168.10.103.
- Add the Zabbix repository
rpm -Uvh https://repo.zabbix.com/zabbix/4.4/rhel/7/x86_64/zabbix-release-4.4-1.el7.noarch.rpm
yum clean all
yum install munin --nogpgcheck
- Install the Agent service
yum install zabbix-agent
- Add the GPU query script
mkdir -p /etc/zabbix/scripts
Place the script get_gpus_info.sh in this directory and make it executable:
chmod +x /etc/zabbix/scripts/get_gpus_info.sh
The contents of get_gpus_info.sh are as follows:
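#!/bin/bash
# Emit Zabbix low-level discovery JSON: one {#GPUINDEX}/{#GPUUUID} object per GPU.
# sed rewrites each "GPU N: <name> (UUID: <uuid>)" line as a comma-prefixed JSON object.
result=$(/usr/bin/nvidia-smi -L | sed 's/^GPU \([0-9]*\):.*(UUID: \(.*\))$/,{"{#GPUINDEX}":"\1","{#GPUUUID}":"\2"}/g')
first=1
echo "{"
echo "\"data\":["
# The generated objects contain no whitespace, so word splitting yields one object per iteration.
for line in $result
do
    if [ "$first" == "1" ]; then
        # Drop the leading comma from the first object.
        echo "${line:1}"
        first=0
    else
        echo -n "$line"
    fi
done
echo
echo "]"
echo "}"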
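For comparison, the same discovery output can be built without the comma-stripping loop. The following is only a sketch, not a replacement the original author provided: the function name gpu_discovery_json is hypothetical, and it assumes every input line has the form "GPU <index>: <name> (UUID: <uuid>)", which is what the sed expression above expects from nvidia-smi -L. It reads from standard input, so it can be tried on a machine without a GPU.

```shell
#!/bin/bash
# Hypothetical helper: turn `nvidia-smi -L`-style lines on stdin into
# Zabbix low-level discovery JSON on a single line.
gpu_discovery_json() {
    local sep="" line
    # Assumed line format: GPU 0: <name> (UUID: GPU-xxxx)
    local re='^GPU ([0-9]+):.*\(UUID: (.*)\)$'
    printf '{"data":['
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            # Prefix every object except the first with a comma.
            printf '%s{"{#GPUINDEX}":"%s","{#GPUUUID}":"%s"}' \
                "$sep" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
            sep=","
        fi
    done
    printf ']}\n'
}

# Usage on a GPU host:
#   /usr/bin/nvidia-smi -L | gpu_discovery_json
```

Lines that do not match the assumed format are skipped, so an empty or unexpected input still yields valid JSON with an empty data array.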
- Configure the Agent
vi /etc/zabbix/zabbix_agentd.conf
# Zabbix proxy address
Server=192.168.10.101
LogFileSize=512
ServerActive=192.168.10.101
Hostname=DOMAIN_ZONEA_192.168.10.102_CPU
# Timeout in seconds; the default is 3. Adjust to network conditions; 10 is recommended.
Timeout=10
UserParameter=gpu.number,/usr/bin/nvidia-smi -L | /usr/bin/wc -l
UserParameter=gpu.discovery,/etc/zabbix/scripts/get_gpus_info.sh
UserParameter=gpu.fanspeed[*],/usr/bin/nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.power[*],/usr/bin/nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.temp[*],/usr/bin/nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.utilization[*],/usr/bin/nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.memfree[*],/usr/bin/nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.memused[*],/usr/bin/nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $1 | tr -d "\n"
UserParameter=gpu.memtotal[*],/usr/bin/nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i $1 | tr -d "\n"
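The seven per-GPU UserParameter lines differ only in the key suffix and the nvidia-smi query field, so they can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical function name gen_gpu_userparams; the key names and query fields are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: print the per-GPU UserParameter lines.
# Each pair is <zabbix key suffix>:<nvidia-smi --query-gpu field>.
gen_gpu_userparams() {
    local pair key field
    for pair in fanspeed:fan.speed power:power.draw temp:temperature.gpu \
                utilization:utilization.gpu memfree:memory.free \
                memused:memory.used memtotal:memory.total; do
        key=${pair%%:*}     # text before the first colon
        field=${pair#*:}    # text after the first colon
        # $1 stays literal here; the agent substitutes it with the GPU index.
        printf 'UserParameter=gpu.%s[*],/usr/bin/nvidia-smi --query-gpu=%s --format=csv,noheader,nounits -i $1 | tr -d "\\n"\n' \
            "$key" "$field"
    done
}

# Usage: review the output, then append it to the agent configuration file.
#   gen_gpu_userparams
```

Once the agent has been restarted with these lines in place, a single key can be checked locally with zabbix_agentd -t 'gpu.temp[0]', assuming a GPU with index 0 exists.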
- Import the GPU monitoring template on the Server
Configuration ---> Templates ---> Import, then import the file zbx_nvidia-smi-multi-gpu-active.xml. The file (export format 3.0, dated 2018-06-05T20:56:12Z) defines the following:
Template: Template Nvidia GPUs Performance active (group: Templates, application: Nvidia)
Item (type 7, Zabbix agent active):
    Number of GPUs: key gpu.number, interval 30, history 90, trends 365. Description: The number of GPUs present on this system.
Discovery rule (type 7):
    GPU discovery: key gpu.discovery, interval 600, lifetime 30. Description: Discovery of graphics cards.
Item prototypes (type 7, interval 60, history 7, trends 365, application Nvidia):
    GPU $1 Fan Speed: key gpu.fanspeed[{#GPUINDEX}], units %
    GPU $1 Memory Free: key gpu.memfree[{#GPUINDEX}], units MB
    GPU $1 Memory Total: key gpu.memtotal[{#GPUINDEX}], units MB
    GPU $1 Memory Used: key gpu.memused[{#GPUINDEX}], units MB
    GPU $1 Power in decaWatts: key gpu.power[{#GPUINDEX}], units dW, multiplier 0.1
    GPU $1 Temperature: key gpu.temp[{#GPUINDEX}], units C
    GPU $1 Utilization: key gpu.utilization[{#GPUINDEX}], units %
Trigger prototypes:
    GPU {#GPUINDEX} Temperature is extremely high: {Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>80, priority 5. Description: A GPU's temperature is getting extremely high!
    GPU {#GPUINDEX} Temperature is high: {Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>70, priority 2, depends on "GPU {#GPUINDEX} Temperature is very high". Description: A GPU's temperature is getting high!
    GPU {#GPUINDEX} Temperature is very high: {Template Nvidia GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>75, priority 4, depends on "GPU {#GPUINDEX} Temperature is extremely high". Description: A GPU's temperature is getting very high!
Graph prototypes (900 x 200, items taken from Template Nvidia GPUs Performance):
    GPU {#GPUINDEX} Memory: gpu.memfree[{#GPUINDEX}] (color 00AA00), gpu.memused[{#GPUINDEX}] (color 0000DD)
    GPU {#GPUINDEX} Temperature, Fan Speed and Power: gpu.power[{#GPUINDEX}] (color 1A7C11), gpu.fanspeed[{#GPUINDEX}] (color 2774A4), gpu.temp[{#GPUINDEX}] (color F63100)
    GPU {#GPUINDEX} Utilization: gpu.utilization[{#GPUINDEX}] (color 2774A4)
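The template's three temperature triggers use the thresholds 70, 75 and 80, and each lower trigger depends on the next higher one, so only the most severe matching trigger is reported. The following sketch mirrors that logic outside Zabbix, for example to sanity-check a reading by hand. The function name gpu_temp_severity and the label "normal" are hypothetical, and the sketch assumes an integer temperature, as nvidia-smi normally reports.

```shell
#!/bin/bash
# Hypothetical helper: map an integer GPU temperature to the label of the
# most severe template trigger whose threshold it exceeds.
gpu_temp_severity() {
    local t=$1
    if   (( t > 80 )); then echo "extremely high"   # trigger priority 5
    elif (( t > 75 )); then echo "very high"        # trigger priority 4
    elif (( t > 70 )); then echo "high"             # trigger priority 2
    else                    echo "normal"           # no trigger fires
    fi
}

# Usage on a GPU host, for GPU index 0:
#   gpu_temp_severity "$(/usr/bin/nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0)"
```

The comparisons are strict, matching the ">" in the trigger expressions, so a reading of exactly 80 is still classified as "very high".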
- Create the Host on the Server
Log in as an administrator.
Configuration ---> Hosts ---> Create host
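Set Host name to DOMAIN_ZONEA_192.168.10.102_CPU and Visible name to 机房A_192.168.10.102 (data center A). For Groups, select Linux servers, Templates, and any custom groups. For Agent interfaces, enter the firewall IP of data center A, 123.123.123.124. For Monitored by proxy, select the proxy created earlier, DOMAIN_ZONEA_192.168.10.101_PROXY. Leave everything else at its default.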
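Active checks only work when the Host name entered here matches the Hostname value in the agent configuration exactly, so it is worth comparing the two before looking for data. A minimal sketch, assuming a hypothetical helper name agent_conf_value; it prints the value of every uncommented line that sets the given parameter.

```shell
#!/bin/bash
# Hypothetical helper: print the value(s) of a parameter in a Zabbix-style
# "Name=value" configuration file, ignoring commented-out lines.
#   $1 = parameter name (for example Hostname)
#   $2 = path to the configuration file
agent_conf_value() {
    # Match only "<name>=" at line start (optional whitespace allowed),
    # so HostnameItem= is not mistaken for Hostname=.
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2"
}

# Usage: compare the agent's Hostname with the name entered in the frontend.
#   [ "$(agent_conf_value Hostname <path to agent config>)" = "DOMAIN_ZONEA_192.168.10.102_CPU" ] && echo match
```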
Add the templates:
Hosts ---> DOMAIN_ZONEA_192.168.10.102_CPU ---> Templates
Select the two templates "Template OS Linux by Zabbix agent active" and "Template Nvidia GPUs Performance active", then click Update.
- Start the Agent service
systemctl restart zabbix-agent
- Enable start at boot
systemctl enable zabbix-agent
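- Return to the Server and check the host
Monitoring ---> Graphs
For Group select Linux servers, for Host select DOMAIN_ZONEA_192.168.10.102_CPU, and for Graph select the graph you want to view. If all goes well, data should appear within a few tens of seconds; if not, wait a few minutes.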