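Our company plans to move to microservices and deploy projects on a Docker cluster, so we are switching our monitoring to Prometheus.
Installing Prometheus on Ubuntu (Docker version):
1. Install Docker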
apt-get install -y docker.io
2. Pull the Prometheus-related images
docker pull prom/node-exporter
docker pull prom/prometheus
docker pull grafana/grafana
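3. Start node-exporter
docker run -d \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  prom/node-exporter
With --net="host" the container shares the host's network stack, so node-exporter listens on host port 9100 directly. Docker ignores a -p 9100:9100 mapping in this mode, so it is left out here.
Check that the port is listening: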
netstat -tpln
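Scanning the netstat output by eye works, but the check can also be scripted. The following is a minimal sketch, assuming the usual column layout of `netstat -tln` or `ss -tln` (local address in the fourth column); the function name `port_is_listening` is made up for this example.

```shell
#!/bin/bash
# Reads a socket listing on stdin and succeeds if some LISTEN socket
# has a local address ending in ":<port>" (hypothetical helper).
port_is_listening() {
  local port="$1"
  # Field 4 is the local address in both `netstat -tln` and `ss -tln` output.
  awk -v p=":${port}\$" '/LISTEN/ && $4 ~ p { found = 1 } END { exit !found }'
}
```

Usage could look like `netstat -tln | port_is_listening 9100 && echo "node-exporter is listening"`.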
4. Start Prometheus
Create a directory for the Prometheus configuration file:
mkdir /opt/prometheus
vim /opt/prometheus/prometheus.yml
Configuration file contents:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['192.168.10.5:19090']
  - job_name: 'localhost'
    static_configs:
      - targets: ['192.168.10.5:9100']
  - job_name: 'server1'
    static_configs:
      - targets: ['192.168.10.11:9100']
  - job_name: 'server1_pushgateway'
    static_configs:
      - targets: ['192.168.10.11:9091']
  - job_name: 'server2'
    static_configs:
      - targets: ['192.168.10.110:9100']
  - job_name: 'server2_pushgateway'
    static_configs:
      - targets: ['192.168.10.110:9091']
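Sometimes Prometheus reports an error on startup. Most of the time the cause is the YAML formatting of the configuration file: check the nesting of the keys and any blank lines at the top, and indent with spaces only, never tabs.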
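Since stray tabs are hard to spot in an editor, a quick scripted check can help before starting the container. This is a minimal sketch; `find_tabs` is a hypothetical helper name, and it only detects tab characters, not other YAML mistakes.

```shell
#!/bin/bash
# Prints "lineno:content" for every line of the given file that contains
# a tab character; returns non-zero when the file has no tabs.
find_tabs() {
  grep -n "$(printf '\t')" "$1"
}
```

For example, `find_tabs /opt/prometheus/prometheus.yml` prints nothing when the file is clean. For a fuller check, the prom/prometheus image should also ship promtool, so something like `docker run --rm --entrypoint promtool -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus check config /etc/prometheus/prometheus.yml` ought to validate the whole file, assuming promtool is on the image's PATH.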
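Start Prometheus:
docker run -d \
  -p 19090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
Check that the port is listening (netstat -tpln should now show 19090).
Open the Prometheus web UI in a browser on host port 19090.
Check the state of the scrape targets (Status > Targets in the web UI).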
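The same target information is available from the Prometheus HTTP API at /api/v1/targets, where each active target carries a "health" field. The sketch below counts targets in a given state without needing a JSON parser; `count_targets` is a hypothetical helper, and the grep-based matching is an approximation that assumes the field appears as "health":"up" or "health":"down".

```shell
#!/bin/bash
# Reads the JSON from /api/v1/targets on stdin and prints how many
# targets report the given health state ("up", "down" or "unknown").
count_targets() {
  local state="$1"
  grep -oE "\"health\"[[:space:]]*:[[:space:]]*\"${state}\"" | wc -l | tr -d ' '
}
```

With the setup above this could be run as `curl -s http://192.168.10.5:19090/api/v1/targets | count_targets down`; a non-zero result means some exporter is unreachable.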
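5. Start Grafana
Create a directory to store Grafana's data:
mkdir /opt/grafana-storage
Set its permissions so the Grafana container can write to it:
chmod 777 -R /opt/grafana-storage
Start Grafana:
docker run -d \
  -p 13000:3000 \
  --name=grafana \
  -v /opt/grafana-storage:/var/lib/grafana \
  grafana/grafana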
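Check that port 13000 is listening, then open Grafana in a browser on that port.
Add Prometheus as a data source, using the Prometheus address from the previous step as the URL.
Click Save & Test; once the test passes, the data source is ready.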
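As an alternative to clicking through the UI, Grafana can load data sources from provisioning files at startup. The sketch below only generates such a file. The helper name `datasource_yaml` and the data source name "Prometheus" are assumptions for illustration, and the URL is passed in rather than hard-coded.

```shell
#!/bin/bash
# Prints a Grafana data source provisioning file for a Prometheus
# server reachable at the given URL (hypothetical helper).
datasource_yaml() {
  local url="$1"
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}
```

One way to use it would be `datasource_yaml http://192.168.10.5:19090 > /opt/grafana-provisioning/datasources/prometheus.yml` (the host directory is an assumption) and then mounting that directory into the container at /etc/grafana/provisioning/datasources with an extra -v option.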
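Import an official dashboard template:
Copy the template's ID into Grafana's import page to download it.
Adjust the settings offered on the import page (Change) and save.
Select the IP address of the server you want to view.
You can also customize it to show your own IP.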
6. Use a custom script
Install Pushgateway with Docker:
docker pull prom/pushgateway
docker run -d -p 9091:9091 prom/pushgateway
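Write a script on the monitored machine. Below is the shell script I wrote to monitor GPU temperature:
#!/bin/bash
# Host name without the domain part, used as the instance label in the push URL.
instance_name=$(hostname -f | cut -d'.' -f1)
while true; do
  # Each GPU's temperature sits on a fixed row of the nvidia-smi table. Rows 12 and 18 are from the
  # original script; rows 9 and 15 are inferred from the 3-row spacing. Field 3 looks like "45C".
  label1="server1_gpu1_temperature" # metric name (key)
  server1_gpu1_temperature=$(nvidia-smi | awk 'NR==9' | awk '{print $3}' | awk -FC '{print $1}') # GPU 1
  label2="server1_gpu2_temperature"
  server1_gpu2_temperature=$(nvidia-smi | awk 'NR==12' | awk '{print $3}' | awk -FC '{print $1}') # GPU 2
  label3="server1_gpu3_temperature"
  server1_gpu3_temperature=$(nvidia-smi | awk 'NR==15' | awk '{print $3}' | awk -FC '{print $1}') # GPU 3
  label4="server1_gpu4_temperature"
  server1_gpu4_temperature=$(nvidia-smi | awk 'NR==18' | awk '{print $3}' | awk -FC '{print $1}') # GPU 4
  echo "$label1 $server1_gpu1_temperature
$label2 $server1_gpu2_temperature
$label3 $server1_gpu3_temperature
$label4 $server1_gpu4_temperature" | curl --data-binary @- "http://192.168.10.11:9091/metrics/job/server1_pushgateway/instance/$instance_name"
  sleep 10
done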
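Picking rows out of the nvidia-smi table breaks when the table layout or the number of GPUs changes. A more robust variant, sketched here under assumptions, reads the temperatures through nvidia-smi's query mode and formats one labelled sample per GPU. The metric name `gpu_temperature_celsius` and the helper `format_gpu_metrics` are invented for this example and are not part of the original script.

```shell
#!/bin/bash
# Reads one temperature per line on stdin (GPU 0 first) and prints one
# Prometheus text-format sample per GPU, labelled with the GPU index.
format_gpu_metrics() {
  awk '{
    gsub(/[^0-9.]/, "")   # keep only the numeric part of the line
    if ($0 != "")
      printf "gpu_temperature_celsius{gpu=\"%d\"} %s\n", NR - 1, $0
  }'
}
```

A push could then look like `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | format_gpu_metrics | curl --data-binary @- http://192.168.10.11:9091/metrics/job/server1_pushgateway/instance/server1`, where the instance name server1 is a placeholder. Using a label for the GPU index keeps a single metric name, which tends to make the Grafana query simpler than one metric per GPU.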
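Restart Pushgateway and Prometheus.
The metrics pushed by the custom script can now be found in Prometheus.
Build a GPU temperature graph in Grafana.
7. Push Grafana alerts to DingTalk.
Add a DingTalk alert notification in Grafana.
Paste the DingTalk robot's webhook into the Url field of the DingDing settings.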
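Before relying on Grafana to deliver alerts, it can be useful to confirm that the webhook itself works by posting a plain text message to it. The sketch below only builds the JSON body for a DingTalk text message. `dingtalk_payload` is a hypothetical helper, and it assumes a single-line message (newlines are not escaped).

```shell
#!/bin/bash
# Prints the JSON body of a DingTalk "text" robot message.
dingtalk_payload() {
  local msg
  # Escape backslashes first, then double quotes, so the result stays valid JSON.
  msg=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$msg"
}
```

It could be sent with `curl -H 'Content-Type: application/json' -d "$(dingtalk_payload 'test alert')" "$WEBHOOK_URL"`, where WEBHOOK_URL holds the robot's webhook address.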
Add an alert rule to the graph panel.
The alert then arrives in DingTalk as expected.